Updates for the Status Incident of May 3rd

Hi everyone,

My name is Matthew, and I am a Customer Support Engineer here at Particle. I understand there have been concerns about some unexpected behavior this past weekend, and I wanted to give you all an update.

Late last week, one of our caching systems unexpectedly crashed. This incident caused problems on our backend related to webhooks and device session information in Particle’s cloud. The status page history for this event can be found here (see May 3rd):

https://status.particle.io/

Normally, devices recover on their own whenever a cloud connection is lost. Shortly after this incident, however, engineering learned that a small number of devices were not coming back online. On Saturday, May 4th, our on-call team responded to assess the situation.

Our on-call team concluded that the situation required more than an on-call response. As a result, a full engineering meeting was scheduled during off-hours on Sunday, May 5th. That meeting brought together the engineering teams and their leads to assess the full scale of the issue, decide how to mitigate the impact of the current incident, and reprioritize near-term engineering work to prevent similar incidents from occurring again.

We are still working to determine how many devices are attempting to connect to the cloud with outdated session information. However, our early research suggests that only a handful of customers were affected.

For any devices that have not recovered on their own, the only known way to recover them at this time is to completely remove power from the device and then repower it. Simply resetting the device is not enough to move it out of this bad state.

As part of this reprioritization, our Device OS team is using what we learned from this incident to prioritize changes in the tentative v1.2.x release that will keep a device from ending up in an unreachable state during incidents like this one. In parallel, our cloud engineering team will be working to prevent similar situations in the future by replacing and scaling up selected backend caching systems.
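In the meantime, some of you have asked about application-level safeguards. The firmware sketch below is only a minimal, generic example of watching the cloud connection from user code and forcing a reconnect cycle after a prolonged outage; it is not the Device OS team’s planned fix, the 10-minute threshold is an arbitrary assumption, and a soft reconnect would not have cleared the specific bad state described above.

```cpp
// Minimal example: watch the cloud connection from application firmware and
// force a reconnect cycle if the device stays offline for too long.
// NOTE: generic safeguard only, not the planned Device OS 1.2.x change;
// the 10-minute limit below is an assumed value.

SYSTEM_THREAD(ENABLED);  // keep application code running while the system reconnects

const unsigned long OFFLINE_LIMIT_MS = 10UL * 60UL * 1000UL;  // assumed 10-minute limit
unsigned long lastOnline = 0;

void setup() {
    lastOnline = millis();
}

void loop() {
    if (Particle.connected()) {
        lastOnline = millis();  // remember the last time the cloud session was healthy
    } else if (millis() - lastOnline > OFFLINE_LIMIT_MS) {
        // Tear the cloud session down and rebuild it rather than waiting indefinitely.
        Particle.disconnect();
        Particle.connect();
        lastOnline = millis();  // restart the timer so we do not retry immediately
    }
}
```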

If you have any questions or concerns, I encourage you to post them here in this topic or elsewhere on our community. Our internal teams are monitoring the forums and listening to all of your concerns. Community feedback helps drive our priorities, and we will use it to ensure that our customers’ concerns remain our highest priority.

Matt
