Partial disconnect of mesh devices


#1

Anyone else notice their mesh devices disconnect from Particle recently (last day or two), but seem to be working otherwise?

I had one Xenon from which I could no longer check variable values or run functions, yet it was breathing cyan. I restarted it and it reconnected fine.

I have a second xenon that is acting as my gateway. From the console it shows a gray dot. If I run a diagnostic check on it I get “THERE IS A PROBLEM”. Yet the thing is also breathing cyan, and more than that it’s the way the first xenon is connected to the network, so it has to be forwarding traffic itself.

The first one has custom code on it, but the gateway is running tinker.

Anyone see anything similar? I’m wondering if it might be related to the fact that Photons and the forum just went down.


#2

Oddly, my mesh died in similar fashion to yours around this time yesterday. Resetting my Argon gateway fixed it and all devices reconnected shortly afterwards. No problems since then, other than a ton of IFTTT alerts about some of my non-mesh devices losing and then regaining their connection to the Particle servers; those stopped about 8 hrs ago.


#3

Yes - I have experienced similar connectivity issues with Argon, Boron, and Xenon. I agree that they have become “recently worse”. With guidance from the Particle.io team I removed and re-added some of the problematic devices. This seemed to restore operation for some time, but then the devices went “offline” again.

A perhaps important note: despite the lack of a connection to the particle.io cloud, the mesh radio peers are still interoperating. The lack of a gateway, however, is a show stopper for my telemetry applications.

Thanks to all at Particle.io and in Community Support that come together to make this ecosystem viable. I would provide my code here, however, the problems are present even when using a “blink” binary on any of my 0.90 firmware devices so I suspect it is reproducible.

Best Regards, _Robert


#4

Hey everyone, thanks for reaching out. Please refer to the status page banner at the top of the page.

We have no reason to believe this issue is with anything on the device’s end and are investigating the issue on our end right now.


#5

Hi. We have a similar critical connectivity problem with Argon and Xenon. For about 5 hours, our Argon gateway hasn’t connected to the particle.io cloud. There is no problem with our WiFi router. This is serious!! We are about to launch a new service based on Particle mesh networks. Other than resetting and reconnecting, is there any other way to sort this out?


#6

Hey @mstanley, can you confirm that this was the issue on the 13th as well? If so I hope that there’s a big TODO on improving monitoring so these issues are identified sooner.

Also, can you tell me if a bug has been opened for the fact that devices do not recover after an issue with the Particle backend? My gateway device still fails diagnostics even though it continues to forward messages for other devices.

I hope that Particle understands the effects issues like these can have on trust. Personally I’m using particle mesh devices because I will be able to run multiple gateways and a redundant monitoring network. On the 12th I replaced two valves and placed a xenon with some wires to alert me if any leaks developed. I guess I should have set up monitoring to watch if the device went down, but it was late and I trusted the xenon. Fortunately nothing leaked but I wouldn’t have known if it did.


#7

I’ve seen this quite a bit here in my small (5-node) mesh network. Even when transactions across the mesh seem fine and the Xenons are able to Particle.publish without issue, my devices frequently appear to be offline from particle.io (they can’t be pinged and their functions don’t appear there).

When I reset the Argon, things get back in shape immediately. So I’ve got scripts running on my server under cron that call a pong() function (particle.io ping wasn’t working) on my Xenons and Argon. When the failure level hits a threshold, the script calls a bounce() function on my Argon, which resets the device. This has taken me out of the woods so that I can get on with collecting data and demonstrating what this topology can do.

Here’s bounce():

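    // Called remotely by the server script; resets the device only when
    // the caller supplies the shared secret.
    int bounce(String extra) {
        if (extra.equals("_MY_SECRET_RESET_STRING")) {
            System.reset();  // reboots at once, so the return below is not normally reached
            return 0;
        } else {
            return -1;       // wrong secret: do nothing
        }
    }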
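
The server-side threshold check is not shown in the post. As a rough sketch only, the counting logic might look like the following. It assumes the "failure level" means consecutive failed pong() calls per device, which the post does not specify, and the class name PongMonitor and its methods are hypothetical. It is plain C++ with no Particle API calls; the actual cloud calls to pong() and bounce() are left to the surrounding script.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: counts consecutive failed pong() calls per device
// and reports when the gateway should be bounced.
class PongMonitor {
public:
    explicit PongMonitor(int threshold) : threshold_(threshold) {}

    // Record the result of one poll. Returns true when the failure count
    // for this device reaches the threshold, i.e. when bounce() is due.
    bool record(const std::string& deviceId, bool responded) {
        if (responded) {
            failures_[deviceId] = 0;  // a good reply clears the streak
            return false;
        }
        int count = ++failures_[deviceId];
        if (count >= threshold_) {
            failures_.clear();        // start counting afresh after a bounce
            return true;
        }
        return false;
    }

    // Current consecutive-failure count for a device (0 if never seen).
    int failures(const std::string& deviceId) const {
        auto it = failures_.find(deviceId);
        return it == failures_.end() ? 0 : it->second;
    }

private:
    int threshold_;
    std::unordered_map<std::string, int> failures_;
};
```

Each cron run would call record() once per device and invoke bounce() on the Argon whenever it returns true.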

Hope this is useful to some of you.

Charlie


#8

Hi @kenmacd

Great question. I was not made aware of any issues on the 13th in regards to mesh connectivity, so I cannot say for certain. With that said, a comprehensive post-incident write-up is still being worked on by our engineering team, and they will share it with the community at large as soon as they are able. This will give much more comprehensive insight into the issues that occurred.

I’m not aware of any specific GitHub issues in regards to devices not reconnecting after cloud outages. However, I do know there’s active chatter internally about this issue, so it is certainly a high-priority one.

We do recognize this can have an impact on trust to our customers and this is why we aim to learn from these types of incidents to prevent the causes that lead to them from occurring again in the future.

With that said, one consideration in regards to this is that this issue was explicitly isolated to Mesh devices, and did not impact our gen 2 line of devices.

Mesh is new and exciting. It has a lot of new tools and integrations not found previously at Particle. But like anything else, it’s bound to come with its share of bugs. As such, we have been working actively since Mesh’s release to ensure we iterate and knock out issues as quickly as they have come up.

With efforts such as the Gen 3 Community Council, I’d say we’ve done a good job on making sure this feedback loop is a fast and active one.

But with that in mind, we are still working on issues in this feedback loop, as well as developing tools to help make mesh more reliable at scale. Until these goals are met, I’d still say that mesh should be considered mainly for prototyping, and I would be hesitant to use it for high-stakes, high-volume, or remote deployments just yet. Mesh is in a great stage for prototyping right now, and we will ensure parity with our gen 2 line of devices (and then some, in terms of new features). However, until it reaches that point, I’d be cautious about the use cases in which it’s deployed.


#9

That’s a great tidbit to share.

We are actually looking into an issue where mesh devices can end up with outdated session information and I believe that performing a reset like this is actually the best way to handle this issue as of right now. I’m awaiting more information from engineering on this issue, and will be sure to provide more information as I know more.


#10

Thanks @mstanley. I think the main concern is that the issues are being taken seriously, and it sounds like they are. Although it’s still in the prototype phase I think people are much more accepting of ‘I can’t connect a node through the app’ type bugs than cases where things fail silently.

Just in case it helps here’s what I see as the issues to fix from the recent outage:

  • Improve backend monitoring to identify similar issues more quickly
  • Fix mesh devices to not show breathing cyan when connection issues exist
  • Fix mesh devices to reconnect after an outage (the other thread shows someone having to drive to a site to reset them)

Then as some nice to haves:

  • Verify that Particle.connected() would have returned false in this case so people can include their own watchdog
  • Allow backend integrations for events like a device going offline for a period, or coming online
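
For the watchdog idea in the first nice-to-have, here is a minimal sketch of the timing logic, assuming Particle.connected() does return false during such an outage (which is exactly what the bullet asks to have verified). The class name CloudWatchdog and the timeout value are hypothetical. The logic is kept free of Particle calls so it can be checked off-device.

```cpp
#include <cstdint>

// Hypothetical helper: reports when the cloud connection has been down
// continuously for longer than a configured timeout.
class CloudWatchdog {
public:
    explicit CloudWatchdog(uint32_t timeoutMs) : timeoutMs_(timeoutMs) {}

    // Call on every loop() pass with the current connection state and a
    // millisecond clock. Returns true once the outage reaches timeoutMs.
    bool shouldReset(bool connected, uint32_t nowMs) {
        if (connected) {
            down_ = false;            // connection is back: disarm
            return false;
        }
        if (!down_) {
            down_ = true;             // first pass of a new outage
            downSinceMs_ = nowMs;
        }
        // Unsigned subtraction stays correct across clock rollover.
        return (nowMs - downSinceMs_) >= timeoutMs_;
    }

private:
    uint32_t timeoutMs_;
    uint32_t downSinceMs_ = 0;
    bool down_ = false;
};
```

In firmware this would be driven from loop() with something like `if (wd.shouldReset(Particle.connected(), millis())) System.reset();`.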

#11

Is this an ongoing issue or resolved? I’m currently experiencing a similar issue.


#12

I wanted to update this topic with recent experiences. Like some others, I was seeing my mesh nodes as offline and not pingable from particle.io. I am making a function call from my database server to each of the mesh devices every 30 seconds. Here’s a dump of the most recent failures (10001 = device didn’t respond):

[image: dump of recent failures]

I do force a reset of my gateway device (Argon) if the failure counts get too large. The list of failures for last Friday (03/15) is quite long. But things have improved dramatically in the last few days. No code changes were made on any of my devices in that time period. I haven’t seen a force reset of my Argon since Friday.

These devices are gathering temperature data and publishing that info on the mesh for controller devices to pick up and use. The gateway sees the messages from the sensors and produces an ack message back to the sensor device. Once the ack is seen, or a timeout expires, the sensor publishes the information to my db server.
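
The "ack seen, or a timeout expires" step on the sensor side could be sketched roughly as below. This is only an illustration of the flow described above, not the code running on these devices; the class name AckWait and its methods are hypothetical, and the mesh publish/subscribe calls themselves are left out.

```cpp
#include <cstdint>

// Hypothetical helper: after a reading is published on the mesh, wait for
// the gateway's ack or a timeout, then signal that the reading should be
// reported to the database server.
class AckWait {
public:
    explicit AckWait(uint32_t timeoutMs) : timeoutMs_(timeoutMs) {}

    // Call right after publishing a reading on the mesh.
    void start(uint32_t nowMs) {
        waiting_ = true;
        acked_ = false;
        startedMs_ = nowMs;
    }

    // Call from the mesh subscription handler when the ack arrives.
    void onAck() {
        if (waiting_) acked_ = true;
    }

    // Call from loop(). Returns true exactly once per reading, as soon as
    // the ack has arrived or the timeout has expired.
    bool readyToReport(uint32_t nowMs) {
        if (!waiting_) return false;
        if (acked_ || (nowMs - startedMs_) >= timeoutMs_) {
            waiting_ = false;
            return true;
        }
        return false;
    }

    // Whether the most recent reading was acknowledged by the gateway.
    bool wasAcked() const { return acked_; }

private:
    uint32_t timeoutMs_;
    uint32_t startedMs_ = 0;
    bool waiting_ = false;
    bool acked_ = false;
};
```

Recording wasAcked() alongside each reading would also give the database a per-sample view of whether the gateway was answering on the mesh at the time.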

Charlie

p.s. I’ll be updating this later today. I’m beginning to believe there is another failure cause…

Shortly after reporting on the improvements I’ve seen over the weekend, I gathered all five devices into my office, plugged them in, and restarted them all at once. After the shotgun start of all 5 devices I started to see failures calling functions from the web. The failures continued until I forced a restart of the Argon gateway; then everything started behaving well again. I did this twice during the day, and both times the failures began with the shotgun start and ended with the Argon reset.

So, bounce your gateway, you’ll be glad you did. As an experiment you could have it reset itself at the top of the hour. Let us know if that helps.
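
If anyone wants to try the top-of-the-hour experiment, one thing to watch is that a device which resets at minute 0 will boot while it is still minute 0 and could reset again in a loop. A minimal sketch of a guard against that, assuming the device has a valid clock; the function name and the two-minute minimum uptime are hypothetical choices:

```cpp
#include <cstdint>

// Minimum uptime before an hourly reset is allowed. It must be longer
// than one minute so that a device which has just rebooted at minute 0
// does not reset again within the same minute.
constexpr uint32_t kMinUptimeMs = 2u * 60u * 1000u;

// Returns true when the gateway should reset itself: the clock is valid,
// it is the first minute of the hour, and the device has been up long
// enough that this is not the boot following the previous reset.
bool hourlyResetDue(bool timeValid, int minute, uint32_t uptimeMs) {
    return timeValid && minute == 0 && uptimeMs >= kMinUptimeMs;
}
```

On the gateway this could be polled from loop() as `if (hourlyResetDue(Time.isValid(), Time.minute(), millis())) System.reset();`.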


#13

It’s still ongoing for me.

I still get devices that are no longer connected to the cloud even though they show as connected. I’ve added Particle.connected() checking in the hope that this would detect the loss of connectivity, but it hasn’t. They sometimes pass online diagnostics but events stop showing up. I emailed support, but besides a “Thanks for writing to us!” reply 5 days ago, I haven’t heard back.

I guess I’m going to have to go the way @holobox has, and create an off-mesh solution. I don’t trust that I’ll always be able to even call a function on the gateway, so I might have them check in over UDP to a local server. Or maybe for now I’ll just have them reset every 10m.
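
As a rough idea of what the local-server side of such a check-in scheme could track, here is a small sketch. It only covers the bookkeeping (who checked in last, and who has gone quiet); the UDP receive loop is omitted, and the class name CheckInTable and the silence limit are hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: remembers when each device last checked in and
// lists the devices that have been silent for too long.
class CheckInTable {
public:
    explicit CheckInTable(int64_t maxSilenceSec) : maxSilenceSec_(maxSilenceSec) {}

    // Call whenever a check-in datagram arrives from a device.
    void checkIn(const std::string& deviceId, int64_t nowSec) {
        lastSeen_[deviceId] = nowSec;
    }

    // Devices whose last check-in is older than the limit, sorted by id.
    std::vector<std::string> stale(int64_t nowSec) const {
        std::vector<std::string> out;
        for (const auto& entry : lastSeen_) {
            if (nowSec - entry.second > maxSilenceSec_) {
                out.push_back(entry.first);
            }
        }
        return out;
    }

private:
    int64_t maxSilenceSec_;
    std::map<std::string, int64_t> lastSeen_;
};
```

Because the check-ins never pass through the Particle cloud, a device that shows up in stale() is known to be down on the local network, independent of what the console reports.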


#14

If your mesh devices are smart enough to know that they are having problem getting to the cloud - can you mesh.publish() something to the gateway and force a restart that way?


#15

The devices continue to breathe cyan, and I told them to do a reset if Particle.connected() fails, so it appears they do not know they’ve disconnected.

I could probably mesh.publish() something internally, but I can’t do that from outside the mesh (afaik).


#16

We are hoping that Device OS 0.9.1 will address the “breathing cyan but unresponsive” issue for Borons/Argons. That version is currently undergoing testing. I’m hoping it’ll be released in the next few days. This thread mentions the issue.


#17

FYI, seems to be a private topic, I get this error:
Sorry, you don’t have access to that topic!


#18

Sorry about that. I didn’t realize that the link I provided is to a post in a private group. Doh!


#19

I can definitely understand your concern about whether these issues are being taken seriously.

I can assure you we take any sort of downtime like this extremely seriously. The night of the outage we had much of our staff working into the evening until the issue was resolved. A post-incident writeup always occurs in such situations to extensively evaluate what went wrong and steps to be taken to more quickly identify and provide failsafes such that this does not happen again.

For this particular incident, it should also be noted that precautions were taken to separate mesh from Gen 2, so that while mesh development is actively going on, we don’t compromise the reliability of our earlier, production-ready products.

Your feedback is definitely valued, and I’ll be sure to put it into internal channels as well.

  • We were able to identify the issue on our end fairly quickly, but due to its complexity it took a little longer to get it resolved.
  • I agree, devices should not breathe cyan when they aren’t actually connected. Due to the nature of how networking is differently handled on mesh devices, sometimes it’s a little harder to kick them out of this state when oddities occur. Nonetheless, it’s a top priority for support to be accurately representing this so that the customer has a clearer understanding of their device’s state. It helps nobody to have a false-positive reading.
  • This is something I’ve already been seeing on internal chatter. There was much frustration internally that devices don’t restore connectivity on their own. We do absolutely intend to address this. In regards to the customer having to drive on site to reset the device–this is why I am hesitant to recommend mesh in remote deployment locations at this time due to incidents such as this arising.

The nice to haves are definitely ones I’ll be sure to pass along. I agree both would be valuable. :slight_smile:


#20

Particle.connected() is closely linked with the breathing cyan issue. If the false positive breathing cyan issue can be addressed, this value should be more consistent and reliable just the same.

I do apologize for the delay in getting back to your support request. At the moment, support has been working through a high volume of requests. We are working on a first-in-first-out basis to ensure everyone is able to have their questions answered. Given our current bandwidth, requests are taking up to about 13 days or so, I’m afraid. We are actively working to reduce this response time.