Over 20 Devices just went down

Particle Community,

Do we know if this event is still in progress? For 7 days we had no gateway access to a site served by an Argon gateway… I traveled there to power cycle it, which was immediately effective.

Same day - our lab setup (same configuration) has its Argon gateway reporting offline, although the local mesh network is up and processing; there is just no connectivity from the Particle API.

We will back out to “Blink” code; however, we have no reason to suspect our code… I usually put “Blink” or “Tinker” on devices that are suspect.

Anyone else still having intermittent cloud connectivity issues? Gateway elements (esp. Argon) not recovering?

Best Regards, _Todd

(1.2 b1)

Hey Todd,

I'm not certain what the cause of your Argon connectivity issues is, but based on what you are describing, they would not have occurred as a result of this incident. Devices would only have been impacted during or shortly after the caching database crash; a device going offline 7 days past this incident would not be doing so as a result of the crash.

Efforts are ongoing to improve the feature parity and reliability of our Gen 3/mesh devices. For now, we still recommend they be used for prototyping rather than any sort of production deployment, in case incidents such as this arise.

Very recently, we made deployments to handle connectivity issues resulting from the 5/3 incident. Depending on the circumstances your device is getting itself into, these deployments may assist in its recovery as well. However, I can't make any promises without the full context of your device's state and circumstances.

1 Like

Hey all.
Our devices that failed with this issue (and were left in that state) have now come back online after the fix was applied to the production servers (reposting here again for clarity).

Thanks for getting on top of this.

Regards to all.

4 Likes

Very glad to hear that!

1 Like

During the discussion of this issue, it seemed that there have been investigations into both the device side and the Particle server side.

Apparently, work is being done on Device OS 1.2.x to mitigate the device side problems that have been identified as possibly contributing to the issue.

Can we have some detailed information on what this is, please?
Perhaps we can identify a workaround to add to user code/firmware that will help prior to the fully tested release of the next Device OS version?
Or, since a fix has been applied to the production servers, does anything still need to be done on the device side?

You should not need to add anything to your user application code to prevent this issue from occurring in the future.

We will provide more context in a forthcoming post-mortem, but as a result of a caching system crash we lost relevant context for Cloud sessions that devices were attempting to restore when they woke up from deep sleep. As a result, devices were attempting, but failing, to restore their Cloud session before they were prompted to go back to sleep by application logic.
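To picture the failure mode, here is a minimal sketch, assuming a typical wake/publish/sleep application for a Gen 3 device (Wiring/C++; the event name, timeouts, sleep interval, and readSensor() helper are all invented for illustration). There is nothing wrong with this kind of application code; it just shows how a connection window spent failing to resume a stale session, followed by the application's own sleep call, produces the loop described above.

```cpp
#include "Particle.h"

// Hypothetical wake -> publish -> deep-sleep application (illustrative only).
SYSTEM_MODE(SEMI_AUTOMATIC);

// Placeholder sensor read; stands in for whatever the application measures.
String readSensor() {
    return String(analogRead(A0));
}

void setup() {
    Particle.connect();  // Device OS attempts to resume the persisted Cloud session here.

    // If the Cloud has lost the session context, this wait can expire
    // without the device ever reaching the Cloud...
    if (waitFor(Particle.connected, 60000)) {
        Particle.publish("sensor/reading", readSensor(), PRIVATE);
    }

    // ...and the application then sends the device straight back to deep sleep,
    // so the next wake-up repeats the same failed session resumption.
    System.sleep(SLEEP_MODE_DEEP, 15 * 60);
}

void loop() {
    // Never reached: the device deep-sleeps at the end of setup(),
    // and waking from deep sleep runs setup() again.
}
```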

On the Cloud side, we implemented a fix that detects devices in this endless reconnect loop and sends them a special packet in the next reconnect cycle that prompts the device to exit from the loop and create a new session.

On the Device OS side, we’re adding logic that sets a limit on the number of times that a device will attempt to restore the same session with the Cloud before it is prompted to simply create a new one, which will act as a redundant fix for this issue as well as a preventative measure for other possible issues that result in failed session resumption.
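As a rough illustration of that retry-cap idea (this is not the actual Device OS 1.2.x implementation; the struct, counter, threshold, and function names below are all invented), the logic amounts to counting consecutive failed resumptions of the same session and, once a limit is hit, discarding that session and performing a full handshake instead:

```cpp
#include <cstdio>

// Conceptual sketch of a session-restore retry cap (not actual Device OS code;
// the names and threshold below are invented for illustration).
constexpr int MAX_SESSION_RESUME_ATTEMPTS = 3;

struct CloudSession {
    bool resumable = true;          // still worth trying to resume the old session?
    int failedResumeAttempts = 0;   // consecutive failed resume attempts
};

// Stand-in for the resume handshake: fails while the Cloud has lost the session context.
bool tryResumeSession(const CloudSession&) {
    return false;
}

// Called on each (re)connect attempt, e.g. after waking from deep sleep.
bool connectToCloud(CloudSession& session) {
    if (session.resumable) {
        if (tryResumeSession(session)) {
            session.failedResumeAttempts = 0;
            return true;  // old session restored
        }
        if (++session.failedResumeAttempts < MAX_SESSION_RESUME_ATTEMPTS) {
            return false;  // keep retrying the old session for now
        }
        // Cap reached: stop retrying the stale session.
        session.resumable = false;
        session.failedResumeAttempts = 0;
        std::printf("Discarding stale session\n");
    }
    // Negotiate a brand-new session instead of resuming the old one.
    std::printf("Performing full handshake for a new session\n");
    return true;  // assume the new handshake succeeds
}

int main() {
    CloudSession session;
    for (int attempt = 1; attempt <= 5 && !connectToCloud(session); ++attempt) {
        std::printf("Attempt %d: session resume failed\n", attempt);
    }
    return 0;
}
```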

5 Likes

Hi Will.
Thanks very much for the explanation & info.

We are about to push out an update to our fleet of devices for another issue, and needed to know whether we should wait for a workaround or a new Device OS release before doing so.
The fleet is remote and costly to manually power cycle (which some devices will need due to the other issue), so we wanted to decide on a time frame for this to happen.
You’ve answered that question.

All the best.

1 Like