Updates for the Status Incident of May 3rd


#1

Hi everyone,

My name is Matthew, and I am a Customer Support Engineer here with Particle. I understand that there have been some concerns around some unexpected behavior this past weekend, and I wanted to give you all an update on this.

Late last week, one of our caching systems unexpectedly crashed. The crash caused disruptions on our backend affecting webhooks and device session information in Particle’s cloud. The status page history for this event can be found here (see May 3rd):

https://status.particle.io/

Normally, devices should be able to recover whenever a cloud connection is lost. However, our engineering team noticed shortly afterward that a small number of devices were not coming back online after this incident. Therefore, on Saturday, May 4th, our on-call team responded to the incident to assess the situation.

Our on-call team concluded that the situation required more than an on-call response. As a result, a full engineering meeting was scheduled during off-hours on Sunday, May 5th. This meeting brought together our engineering teams and team leads to assess the full scale of the issue, determine how to mitigate the impact of the current incident, and reprioritize near-term engineering work to prevent similar incidents from recurring.

We are working to determine how many devices are still attempting to connect to the cloud with outdated session information. However, our early research suggests that only a handful of customers were affected.

For any devices that have not recovered on their own, the only known way to recover them at this time is to completely remove power from the device and then repower it. Simply resetting the device is not enough to move it out of this bad state.

As a part of engineering reprioritization, our Device OS team is using the information from this incident to prioritize putting changes into the tentative v1.2.x release that will prevent a device from ending up in an unreachable state in incidents such as this. Our cloud engineering team will be working in parallel to prevent future situations like this by replacing and scaling up selected backend caching systems.

If you have any questions or concerns, I encourage you to post them here in this topic or elsewhere on our community forums. Our internal teams have been monitoring the forums and are listening to your concerns. Community feedback helps drive our priorities, and we will use it to ensure that our customers’ concerns remain our highest priority.

Matt


split this topic #2

2 posts were merged into an existing topic: Excessive number of disconnect events on multiple devices


split this topic #4

2 posts were split to a new topic: Variables / functions going missing


split this topic #6

2 posts were split to a new topic: Electron (0.5.3) in safe mode


split this topic #7

2 posts were split to a new topic: Xenon mesh connectivity issue


#10

Hey folks – because this thread is a high priority issue for our engineering team right now, we’re going to move discussions of unrelated issues into parallel threads to prevent confusion. Thanks!


#13

Update from Particle, 5/7 at 3:30pm PDT:

The engineering team at Particle has spent the majority of the day investigating this issue to identify a root cause for devices that are still unable to reconnect to the Cloud. Although the caching system that crashed has been restored, some devices are still unable to establish a new session.

We believe that devices affected by this issue are unable to restore their session with the Cloud due to lost session data from the caching system crash. The session restart logic in Device OS is not able to trigger a new handshake attempt before the device goes back to sleep, and the device may get stuck in an endless session restore attempt loop.
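The failure mode described above can be modeled abstractly. The sketch below is not Particle’s Device OS code; it is a minimal Python simulation (all names invented for illustration) showing how a resume-only retry loop can spin forever once the server has lost the session, and how capping resume attempts before falling back to a full handshake breaks the loop.

```python
# Hypothetical model of the stuck-device behavior; NOT actual Device OS code.
# All names here are invented for illustration.

def reconnect(server_sessions, session_id, max_resume_attempts=None):
    """Try to resume a cached session with the cloud.

    Returns the number of failed resume attempts before connecting, or
    None if the device would loop forever (the stuck state described above).
    """
    attempts = 0
    while True:
        if session_id in server_sessions:
            return attempts  # resume (or fresh handshake) succeeded
        attempts += 1
        if max_resume_attempts is None:
            # Buggy behavior: keep retrying a session the server no longer
            # knows about. Cap the simulation so it terminates.
            if attempts > 1000:
                return None
            continue
        if attempts >= max_resume_attempts:
            # Fixed behavior: give up on the cached session and perform a
            # full handshake, which establishes a brand-new session.
            session_id = "fresh-session"
            server_sessions.add(session_id)

# The caching system crash wiped the server's session table:
print(reconnect(set(), "stale-session"))                         # None: stuck
print(reconnect(set(), "stale-session", max_resume_attempts=3))  # 3: recovered
```

In this model, the power-cycle workaround corresponds to forcing the device out of the resume loop manually; the Device OS change described later would correspond to adding the attempt cap.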

As Matthew stated previously, the only presently confirmed fix to force a device to re-handshake is to fully power cycle the device (a soft reset is not sufficient).

Our engineering team is currently exploring options for Cloud-side resolution that will help devices successfully restore their connection without the need for physical intervention. We will continue to provide updates on successful remote resolution strategies if and when they are available to us.

In the meantime, our support agents are working to determine the scope of this issue and assist customers with devices that may be suffering from this or other unrelated connectivity issues. If you have any specific device inquiries, please feel free to reach out to our team at support.particle.io.


#14

Update from Particle, 5/8 at 12:00pm PDT

Our engineering team believes it has identified a Cloud-side resolution strategy that will successfully restore Cloud connectivity for many devices that are inaccessible for physical power reset. The team has successfully tested this solution in our staging environment, and is presently conducting limited testing in production environments with internal test devices as further validation of the fix.

We will post another update this afternoon, or as soon as we have new information on the production viability of the targeted solution, whichever comes first.


#15

Update from Particle, 5/8 at 1:30pm PDT

Our Cloud-side solution has proven effective in staging and production environments, and we have deployed the fix to our production servers. Customers with devices that are affected by this endless session restore loop should see devices come back online once they wake from their sleep state and attempt to re-instantiate their connection with the Particle Cloud.
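Particle has not described the mechanics of the Cloud-side fix here, so the following is purely an illustrative guess (all names invented): one way a server can break this kind of loop is to answer a resume request for a session it no longer knows with an explicit rejection that tells the device to perform a full handshake instead of retrying.

```python
# Illustrative sketch only -- not Particle's actual Cloud code.

def handle_resume(known_sessions, session_id):
    """Server-side handling of a session-resume request.

    "resume" keeps the cached session; "handshake" tells the device to
    discard its cached state and negotiate a brand-new session.
    """
    if session_id in known_sessions:
        return "resume"
    # The session was lost in the cache crash: reject the resume outright
    # so the device re-handshakes on its next wake instead of looping.
    return "handshake"

print(handle_resume({"abc123"}, "abc123"))  # resume
print(handle_resume(set(), "stale"))        # handshake
```

This matches the observed behavior that affected devices recover on their next wake-and-reconnect attempt, without physical intervention.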


#16

Thanks everyone for your patience as we worked through developing and deploying a fix. Our next step is to follow up with affected customers directly to make sure that the issue has been resolved and to issue refunds for any runaway data consumption that occurred as a result of this bug. We will also write and publish a complete post-mortem to explain the issue and its resolution in more depth once we have completed our customer outreach.


#17

Update from Particle, 5/8 at 3:00pm PDT

We have multiple confirmed reports from customers citing successful resolution of the reconnect loop issue as a result of the fix that was deployed. We will continue to monitor the situation and work with customers who reported connectivity issues to ensure all devices are successfully reconnected.

As Zach mentioned, we will be providing a post-mortem shortly with more details. In the meantime, if you require support or have questions, please feel free to reach out to us at http://support.particle.io.


closed #18

#19

Update from Particle, 5/10 at 6:30pm PDT

We’ve been working today on a detailed postmortem that we plan to publish to both the status page and the blog. We’re currently targeting publishing it early next week.


#20

Update from Particle, 5/14 at 11:00pm PDT

Sorry for the delay, folks. The postmortem is drafted, but an even smaller subset of devices is still having issues, and we’re continuing to debug. I don’t want to publish until we know the full story. We’re currently targeting publication by the end of the week.