Over 20 Devices just went down

We have over 20 test devices running, and all of them went down at the same time today, around 3 PM. Did something happen?!

What kind of devices? How close are they to each other? What is common among the devices… what is dissimilar? It could be anything from a cellular carrier issue to someone accidentally turning off your WiFi router. Hard to say without details. Unless support (@mstanley, @ParticleD) knows something about the cloud status.

Hello All, First post here.
I too have had 4 devices (Electron) on the test bench overnight that were reporting fine (publishing data that could be seen in the console and our back end) but stopped at their first connect after 4:30am (Eastern Australia Standard Time). All 4 units continued to connect to cellular every 15 minutes per their programmed schedule, flashing green then cyan with a successful internet connection, and Particle.connected() kept returning true.
To the device, and firmware, all looks like it is successfully publishing the data, but nothing appears in the console, and the Last Handshake time & date does not get updated for any of the 4 devices.
I have seen this before but after a number of hours the devices resumed regular data flow, however, not this time (after 12+ hrs). I have power cycled 1 device of the 4 to see what would happen, and it came good & started functioning normally.
With the other 3, if I remove the antenna connection, the device reports a failure to connect, as it should. When I put the antenna back, it goes back to looking like it is connected to the cloud when it tries to connect (flashing green, then cyan), but still no data from webhooks appears in the console.
I have sent a question and got a ticket raised (and sent our device source code) but thought I would post here as well to see if others have experienced similar.
To rule out some problem with the connection to the local cell tower, I drove one of the devices that were still in the faulting state to another area to connect to another cell tower, but no difference.
Our code is built to OS 1.0.0 btw.
Regards to all.

Well, all of the devices are the same. They have been running as they are now for months with no issue. They go online, collect data and then go back to sleep. They do this about every hour. They are electron devices as I stated when making the post, so there is no WiFi connection. They all have a strong cellular signal. If I power cycle them, they all come back. But that doesn’t help for a commercial product. You can’t ask all of your customers to power cycle their devices… I want to understand what happened here that caused these devices to stop posting all of a sudden.

Sorry, missed the Electron tag in the post title.

We have 4 Electrons operating remote river gauges in Alabama. This is for a non-profit river gauging organization, StreamBeam (streambeam.net). All 4 gauges stopped communicating with the cloud around 1:30pm local time yesterday, May 2nd.

All webhook history for the last 30 days is erased! I saw a note on our integration page that a “minor cloud incident” on May 2nd may have caused this, but we are still not receiving any communication from the deployed units! It is going to be a MAJOR inconvenience to cycle power on each of these remote gauges…

I sure hope this can be resolved in the cloud and fast!

Hey everybody, thanks for sharing your experiences. We’re having a look at this problem. @ejwords your post detailing your tests and DeviceOS version was super helpful, thank you!

We’re trying to understand what might be happening so we can solve the problem long term for a device that’s in this state (breathing cyan but without an updated last handshake timestamp). Can @bkhalterman or @ejwords DM me a couple of device IDs that we can have a closer look at?

Thanks!

We lost 13 of 17 Electron-based devices in the field since yesterday 18:00 UTC (1pm pacific). I hate to read there was a MINOR issue too… apparently some devices seem to be cycling, draining their battery. I read about frequent disconnects, which might be the case, but I suspect it’s all related to failing webhooks. Our devices normally switch to SAT, but I suppose the connection keeps happening and failing, which prevents the backup option from kicking in.

I expect Particle to tackle the issue quickly, this is a MAJOR failure, and it impacts the reliability of our system and the trust in Particle.

Thanks.

PS: please don’t sweep the dust under the carpet like this… it’s so unprofessional.

Yeah, I’ve got approximately 200 Electrons that are unable to connect to the Particle cloud right now. All on v0.6.4. Some Photons appear to be affected too, though none of them are local to me, so I cannot 100% confirm. Happy to run any diagnostic tests I can if helpful for debugging @cullenshane. I have an open MQTT connection to devices and can run stats and query logs.

Edit: some devices appear to be trickling in and reconnecting. Will update if that continues to hold.

Edit2: yeah everything seems to have stabilized after the stated fix happened. My devices automatically self-reset after 7 minutes of no Particle Cloud connection, so I think once they hit that after the fix they’ve been fine for the most part.
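
For anyone wanting to try the same kind of self-reset, here is a minimal sketch of the timing logic, written as plain C++ so it can be tested off-device. The class and method names are hypothetical, and the 7-minute figure is simply the one mentioned above. It assumes the firmware can poll a connection flag and a monotonic millisecond clock.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: tracks how long the device has gone without a
// cloud connection and signals when a self-reset is due.
class CloudWatchdog {
public:
    explicit CloudWatchdog(std::uint64_t timeoutMs) : timeoutMs_(timeoutMs) {
        assert(timeoutMs > 0); // a zero timeout would reset on every poll
    }

    // Call periodically with the current connection state and a monotonic
    // millisecond clock. Returns true once the device has been
    // disconnected for at least timeoutMs_.
    bool shouldReset(bool cloudConnected, std::uint64_t nowMs) {
        if (cloudConnected || !started_) {
            // Connected, or first poll: (re)start the disconnect timer.
            lastConnectedMs_ = nowMs;
            started_ = true;
            return false;
        }
        return nowMs - lastConnectedMs_ >= timeoutMs_;
    }

private:
    std::uint64_t timeoutMs_;
    std::uint64_t lastConnectedMs_ = 0;
    bool started_ = false;
};
```

On a device, the inputs would presumably be `Particle.connected()` and `millis()`, with `System.reset()` called when `shouldReset` returns true. Note that a check like this only fires while the device reports itself as disconnected, so it would not catch the state several posters describe, where the device believes it is connected but nothing reaches the console.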

We’re seeing the same behavior on our fleet. It appears that once devices get a new handshake, they can connect again and send webhooks. That’s probably why the manual resets/powercycles mentioned above have worked.

Which fix are you referring to in Edit 2?

Thanks for reporting the issues here everybody, we’re taking steps now to try and resolve this problem. We’re trying to address the problem here: https://status.particle.io/incidents/6rk2fk95462b

Based on what I read in other threads, the new handshake can happen in 7 days… I think Particle should offer a way to force a new handshake from the console… can’t wait 7 days for the devices to come back to life… client will be pissed. And they’re remote, so manual power cycle is NOT an option (think of it, that’s why we opted for electrons in the first place… to NOT have to go on site)

We have the site visit issue as well… Not able to reach devices or gateways. There is for sure something impacting connectivity to the Particle cloud. I very much hope that there is an option to recover devices without site visits… I’m looking at the documentation now and hoping there is some way to reach gateway & edge nodes. Impacted are our Argon, Boron, and Xenon devices.

_Todd

I see that Particle has labeled the issue as “solved” but it is indeed not solved. Almost all of our units (hundreds of them) are still offline from this and they check in every hour.

Also if it has been “resolved” how about some information on what caused this? It’s the least you could do…

Engineer here that works on the original devices noted. To add a little more data:

The devices are in a couple of groups that are ~150 miles apart, one group’s devices are all within 100 feet of each other, another group is half a mile from the next device, and then there’s two more that are 50-100 miles from any other device. I think that rules out a tower problem (these are all LTE devices) at least.

The devices (at least the ones I can physically access) seem to think everything is just fine. They boot, connect, go into a “breathing” (cloud connection?) status and pretend to post data before going back to sleep. I have a couple outside of their cases that give me diagnostic data over serial, and they just go through their process with no errors. Weird, right?

This might be important too: being remote, solar-powered setups, they are asleep most of the time. If their batteries are particularly healthy they will stay awake for a few minutes to capture a GPS signal while not connected, before connecting, posting data again, and going back to sleep. Otherwise they are asleep for ~29.5 minutes out of every 30.

All of them stopped registering with Particle at about 2-3:00pm EST on 5/2. One (of ~20) came back online by itself today, and there are a few others that I have de-powered and reset or reloaded the firmware on that are back online. Most of the devices are in hermetically sealed cases and I won’t have the physical access to unplug or reload them. Hopefully the Particle team can (re-provision?) them so we don’t have to go on a road trip and break the seals on our devices!

DM sent with device IDs etc.

Thanks for looking into this.
When you deploy a product that relies on someone else’s infrastructure, it is always a bit worrying when things go wrong, so it is nice to see that the issue is being investigated and dealt with, and input is being asked for to help solve the issue. Keep up the good work.

Best regards
Ewan.

Same experience with a developer device. Looking to launch a product on this platform and trying to understand how to mitigate this in the future from device side: Is a HW watchdog used in any of the cases above, and does it use the RST pin, EN pin or power cycle the device?

As I understand it, in this case only a power cycle can solve it. I guess the data usage optimisation on cloud reconnects is involved and that survives everything but a full power cycle?

Any experience with this is very useful, Thanks.
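
One device-side mitigation that might fit the question above is to escalate recovery based on end-to-end confirmation rather than on what the publish call reports locally. The following is only a sketch under assumptions: the names are hypothetical, the backend is assumed to send some acknowledgement back to the device, and the miss counter is assumed to be kept somewhere that survives resets and sleep. The thresholds are placeholders.

```cpp
#include <cassert>

// Hypothetical recovery actions, from least to most disruptive.
enum class Recovery { None, SoftReset, PowerCycle };

// Hypothetical helper: counts consecutive publishes that the backend
// never acknowledged and recommends an escalating recovery action.
class PublishMonitor {
public:
    PublishMonitor(int softResetAfter, int powerCycleAfter)
        : softResetAfter_(softResetAfter), powerCycleAfter_(powerCycleAfter) {
        assert(softResetAfter > 0 && powerCycleAfter > softResetAfter);
    }

    // Record one publish attempt. `confirmed` means the backend
    // acknowledged the data end to end, not merely that the local
    // publish call reported success.
    Recovery record(bool confirmed) {
        if (confirmed) {
            misses_ = 0;
            return Recovery::None;
        }
        ++misses_;
        if (misses_ >= powerCycleAfter_) return Recovery::PowerCycle;
        if (misses_ == softResetAfter_) return Recovery::SoftReset;
        return Recovery::None;
    }

private:
    int softResetAfter_;
    int powerCycleAfter_;
    int misses_ = 0; // assumed to persist across resets on a real device
};
```

With `PublishMonitor m(3, 6)`, the third unconfirmed publish in a row asks for a soft reset and the sixth asks for a full power cycle, which on real hardware would need an external circuit able to cut power. Whether a soft reset is enough to clear the state described in this thread is exactly the open question, which is why the ladder continues to a power cycle.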

In my case I have one 2G Electron, and it lost connection about 5 hours before I got the monthly invoice email from Particle on 5/2 ( :) ), so at first I thought there was some kind of issue with billing and/or SIM authorization. My issue was a continuous blinking green, which means it is trying to connect to cellular. This came after about 2 months of being connected roughly 95% of the time. Right now it’s running 1.0.1 and Tinker, that’s all. It’s basically there to test just how well the Electron does by itself. Although my experience is not similar to the reported cloud issues, the interesting part, to me, is that the time frame of my issue is very close to the time frames others are discussing: sometime late afternoon Thursday 5/2 until ~1pm Friday 5/3. Otherwise I’d be thinking that there were some local cell tower problems, which might still be the case, but the overlapping time frames have got me interested in the actual issue or issues that happened. I’d also like to hear/read more and hope Particle will post some incident details at some point.

We use 2 watchdogs (Watchdog lib + ApplicationTimer) on the devices. It didn’t help. The devices are healthy and do what they have to do: wake up, sample sensors, go to sleep. They just fail to connect to Particle…

We’re studying using different hardware with no intermediary provider for the next batch of devices… Reliability is a must. 3G connection has to be either working or not working, so we can attempt other ways (Sat). It can’t be “kind of working, but not really”.
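
To illustrate the “working or not working” point, here is a rough sketch of a failover decision that ignores the reported connection state and looks only at whether delivery was confirmed within a time budget. Everything in it is hypothetical: the names, the idea of a per-wake-cycle deadline, and the assumption that the backend can confirm delivery to the device.

```cpp
#include <cassert>
#include <cstdint>

enum class Link { Cellular, Satellite };

// Hypothetical snapshot of the cellular path during one wake cycle.
struct CellularStatus {
    bool cloudConnected;    // what the device believes about its cloud session
    bool deliveryConfirmed; // backend has acknowledged this cycle's data
};

// Decide which link to use next. A link that looks connected but has not
// confirmed delivery by the deadline is treated the same as a dead link.
Link selectLink(const CellularStatus& s,
                std::uint32_t elapsedMs,
                std::uint32_t deadlineMs) {
    assert(deadlineMs > 0);
    if (s.deliveryConfirmed) return Link::Cellular;    // data provably arrived
    if (elapsedMs < deadlineMs) return Link::Cellular; // still within budget
    // Deadline passed without confirmation. cloudConnected is deliberately
    // not consulted, so "connected but silent" still triggers the backup.
    return Link::Satellite;
}
```

The design choice is that `cloudConnected` never keeps the device on cellular by itself, so a session that looks healthy but delivers nothing cannot block the satellite fallback.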