We are currently experiencing a Device Service failure on the Particle cloud, and it is causing a major issue on all our devices.
As expected, the devices try to reconnect during the outage. After several failed reconnection attempts, they lock up with a solid orange status light.
The last log activity is:
0000261800 [system] WARN: Internet available, Cloud not reachable!
0000261800 [system] WARN: Handling cloud error: 3
The only recovery is a button reset, which seems like it should not be necessary. Is this by design? Every one of our devices will need a button reset, which is going to take days!
The outage is still ongoing, so it might be premature to state that the only recovery is a button reset.
Anything is possible, and the device may still recover once the outage ends, although the solid status light would suggest the application has deadlocked with the Device OS.
However, that’s something the application needs to avoid, so I’d say this is not by design but rather an unfortunate chain of events on both sides.
While it’s true that the system should be resilient against cloud and/or network outages, there also needs to be a balance between being overly restrictive toward application code and being 100% fault tolerant.
The cause of the outage seems to have been found and remedied, but aftereffects may still take some time to resolve.
Thanks for the reply. Yes, I see that it is back online. However, I am not seeing any devices come back by themselves yet. I am pretty sure they are in some kind of dormant or locked state. I have one in front of me and every indication is that it will stay this way until a reset.
Of course, going forward we will implement an external watchdog.
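For anyone planning the same, here is a minimal sketch of the firmware-side logic that could sit in front of an external watchdog: keep petting the watchdog while the cloud link is healthy, and stop once the link has been down too long, so that the hardware performs the reset. It is plain C++ so it can be checked off-device. The names `ConnectionWatchdog`, `update` and `timeoutMs` are hypothetical, and it assumes the caller feeds in a millisecond tick and the connection state (on a Particle device these would presumably come from `millis()` and `Particle.connected()`).

```cpp
#include <cstdint>

// Hypothetical helper, not part of Device OS: tracks how long the cloud
// link has been down and reports when recovery action is due.
class ConnectionWatchdog {
public:
    explicit ConnectionWatchdog(uint32_t timeoutMs) : timeoutMs_(timeoutMs) {}

    // Call periodically with the current millisecond tick and link state.
    // Returns true once the link has been down for at least timeoutMs.
    bool update(uint32_t nowMs, bool connected) {
        if (connected) {
            down_ = false;          // link is healthy, forget any outage
            return false;
        }
        if (!down_) {
            down_ = true;           // first tick of a new outage
            downSinceMs_ = nowMs;
        }
        // Unsigned subtraction stays correct across tick-counter rollover.
        return (nowMs - downSinceMs_) >= timeoutMs_;
    }

private:
    uint32_t timeoutMs_;
    uint32_t downSinceMs_ = 0;
    bool down_ = false;
};
```

In the main loop the idea would be to toggle the watchdog's pet pin only while `update(...)` returns false. Once it returns true the firmware simply stops petting and the external watchdog cuts in, which does not depend on a software reset succeeding.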
Just out of interest:
- are you using cellular or WiFi devices?
- are you using any HW interfaces (which)?
- are you directly talking to the radio module?
Deadlocks mostly happen when the application and the system request access to a shared resource simultaneously (this is mostly true for HW interfaces).
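One common way to sidestep this kind of deadlock is to never block indefinitely on the shared resource: try to claim it, and if it is busy, back off and retry on a later loop pass. The sketch below shows that pattern in plain C++ with an atomic flag. `TryClaim` is a hypothetical name, and this does not model the locking that Device OS itself uses internally; it only illustrates the "try, don't wait" idea for application code.

```cpp
#include <atomic>

// Hypothetical claim on a shared hardware interface (for example a serial
// or I2C bus). Acquisition never blocks: the caller learns immediately
// whether it owns the resource and can retry later if it does not.
class TryClaim {
public:
    explicit TryClaim(std::atomic<bool>& busy)
        : busy_(busy),
          // exchange() returns the previous value: false means we got it.
          owned_(!busy.exchange(true, std::memory_order_acquire)) {}

    // Release only if this object actually holds the claim.
    ~TryClaim() {
        if (owned_) busy_.store(false, std::memory_order_release);
    }

    TryClaim(const TryClaim&) = delete;
    TryClaim& operator=(const TryClaim&) = delete;

    bool owned() const { return owned_; }

private:
    std::atomic<bool>& busy_;
    bool owned_;
};
```

Application code would construct a `TryClaim` before touching the interface and skip the access for this pass when `owned()` is false, rather than waiting for the other side to let go.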
We are using the radio through the API, not directly.
I have about 12 devices out in the field that haven’t reconnected after the outage. They are all out in remote solar array fields.
One of the devices is still functioning, but the rest are (according to someone in the field) just blinking green while trying to connect. A button reset works, but a software reset via a System.reset() call does not.