Device Cloud incident postmortem: sleepy devices stuck using expired sessions

Originally published at: https://blog.particle.io/2019/05/16/device-cloud-incident-postmortem-sleepy-devices-stuck-using-expired-sessions/

Between May 2 and May 8, approximately 500 devices, mostly with sleepy, low-power use cases, were unable to communicate with the Device Cloud. These devices were typically going to sleep before the Device OS could destroy their expired session keys. At Particle, we take reliability seriously, and we understand that when downtime happens, you and your team are negatively affected. We apologize for any adverse impact this incident may have caused you, and we strive to make sure the services you depend on are robust and fully operational.

As a result of our investigations and remediation, the platform is now robust to the failure mode that presented itself during this window, automatically and instantly healing any device that enters this state. Additionally, Device OS fixes are in progress to prevent devices from entering the failure mode in the first place.

Incident discovery and response timeline

On Thursday, May 2, a Redis cluster responsible for caching device session keys unexpectedly crashed, losing all data. For most devices, this triggered a handshake with the Device Cloud to establish a new session. However, unknown to us, approximately 500 devices never came back online.

On Friday, May 3, an enterprise customer reported to our customer success team that their webhooks were not firing as expected, and this issue was escalated to engineering.

On Saturday, May 4, it became clear that the enterprise support issue was not isolated to a single customer. In response, we immediately escalated the issue internally and involved significant engineering resources.

On Sunday, May 5, a large engineering meeting was held to distill what we knew and assign next steps.

  • A small number of devices, usually with sleepy, low-power use cases, were continuing to use old session keys instead of handshaking, and the cloud could not decrypt their communications.
  • The only obvious way to force a handshake in this scenario was to remove power from the device. However, that would require a person physically near the device to take action, and if at all possible, we wanted to avoid that for our customers.
  • We brainstormed several outside-the-box solutions and queued up work to evaluate them. Our best hope involved sending malformed packets to devices. Key engineers worked late into the night trying everything they could think of.

On Monday, May 6, the consensus within engineering was that no remote communications had any effect on the impacted devices, though we continued to evaluate variations of packets in different combinations of device and cloud states. The engineering team devised a multi-pronged plan to improve both short- and long-term reliability using what we had learned from the incident. People around the company began trying to assess the extent of the problem so that we could communicate effectively with impacted customers. Approximately 500 devices were affected, only a few of which had elevated data usage.

Early Tuesday morning, May 7, the team identified a malformed datagram that might trigger a device handshake. This potential solution required a great deal of testing and refinement, which was performed over the course of the day. By late afternoon we had high confidence that we could heal devices remotely, we had tested on staging and deployed behind feature flags on low-risk production infrastructure, and we had a runbook for rolling out the fix safely to everyone on Wednesday.

By the evening of Wednesday, May 8, 480 devices had been remotely brought back online, and the flow of forced handshakes had slowed to a trickle.

Finally, on the afternoon of Wednesday, May 15, we deployed one more set of changes that helped heal a further subset of devices not fixed the previous week.

Let’s talk about impact

When everything’s perfectly healthy, depending on product fleet, device architecture, and global location, between 1,000 and 2,500 devices handshake with the Device Cloud every minute. Against that volume, it’s challenging from our vantage point to notice a few hundred devices that never reconnect. While the bug impacted a tiny portion of the overall Particle ecosystem, it had a major impact on one customer who quickly informed us of the problem.

Background

The Particle platform is optimized for highly efficient data usage to lower your costs. One of the many ways we use less data is with UDP sessions that enable a sleepy device to handshake once, go to sleep, then wake up later and use the previously established encryption keys without performing a handshake again until we deem that the session is expired. (Currently, the maximum session duration is 3 days.) The problem, in this case, was that devices were continuing to use a session key that had expired.
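The mismatch described above can be sketched as two independent views of the same session. This is a minimal illustration, not the Device OS or Device Cloud implementation: the function names, the `device-a` ID, and the timeline are hypothetical, and the device-side check is an assumption about how session reuse might be decided. Only the 3-day maximum comes from the post.

```python
from datetime import datetime, timedelta, timezone

# Maximum session lifetime stated in the post.
MAX_SESSION_AGE = timedelta(days=3)


def device_reuses_session(established_at, now):
    """Device-side view (assumed): reuse keys while the session looks fresh."""
    return now - established_at <= MAX_SESSION_AGE


def cloud_can_decrypt(session_cache, device_id):
    """Cloud-side view: decryption needs the cached key for this device."""
    return device_id in session_cache


# Hypothetical timeline: handshake on May 1, cache lost on May 2.
established = datetime(2019, 5, 1, 12, 0, tzinfo=timezone.utc)
wake_time = datetime(2019, 5, 2, 18, 0, tzinfo=timezone.utc)
session_cache = {}  # empty after the cache crash

# The device believes its session is valid, but the cloud has no key for it.
stuck = device_reuses_session(established, wake_time) and not cloud_can_decrypt(
    session_cache, "device-a"
)
```

In this toy timeline `stuck` is true: the device keeps encrypting with keys that the cloud no longer holds, which is the state the affected devices were in.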

Constrained Application Protocol (CoAP) has the concept of a “confirmable” message, which means the device waits for an acknowledgment from the cloud and retries a few times if it doesn’t receive one. Not every message in our protocol is confirmable, but many are—again, these are data-saving trade-offs. After sending a confirmable message, a device may have to wait up to 90 seconds for all the retries to fail, which closes the message channel, ending the session and causing a handshake. If a device goes into deep sleep (where waking up is like a reset) before that 90-second window finishes, then it may never realize the session is expired. The Device OS should be ensuring that sleep waits for acknowledgments, but it wasn’t. This is one of the fixes we’ve already written.
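For a sense of where a figure like 90 seconds can come from, the sketch below computes the retry schedule from the default CoAP transmission parameters in RFC 7252 (section 4.8). Whether Device OS uses exactly these defaults is an assumption, and `sleep_is_safe` is a hypothetical helper rather than a Device OS function.

```python
# Default CoAP transmission parameters from RFC 7252, section 4.8.
ACK_TIMEOUT = 2.0        # seconds
ACK_RANDOM_FACTOR = 1.5
MAX_RETRANSMIT = 4


def worst_case_timeouts():
    """Per-attempt timeouts when the initial timeout is at its upper bound."""
    first = ACK_TIMEOUT * ACK_RANDOM_FACTOR
    # The timeout doubles after every unacknowledged transmission.
    return [first * 2 ** attempt for attempt in range(MAX_RETRANSMIT + 1)]


def max_transmit_wait():
    """Longest time a sender waits before giving up on a confirmable message."""
    return ACK_TIMEOUT * (2 ** (MAX_RETRANSMIT + 1) - 1) * ACK_RANDOM_FACTOR


def sleep_is_safe(seconds_since_send, acknowledged):
    """Hypothetical guard: sleep only once the message is acked or has timed out."""
    return acknowledged or seconds_since_send >= max_transmit_wait()
```

With the defaults, the worst-case timeouts are 3, 6, 12, 24, and 48 seconds, which sum to 93 seconds. A device that deep-sleeps 10 seconds after sending an unacknowledged message would fail this guard, which is the gap the paragraph above describes.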

Trying to break our own security

After an intense day and night of analyzing code paths and fuzzing our devices, we were both frustrated and relieved. We had tried everything we could think of to cause Particle devices to handshake, but just as the DTLS spec says they’re supposed to, they simply ignored all the junk data we threw at them. The Mbed TLS library and our usage of it were both locked down tight, as secure as could be, and there seemed to be nothing we could do about it.

And then on Tuesday, we found something. It was small, but exactly what we needed.

The “saving grace bug” is in the Device OS’s handling of a particular error thrown by Mbed TLS. We ought to be ignoring the error, but instead, it causes the device to perform a handshake. And oh my word are we happy about it. Now, committed into the Particle device service code is a new message with the power to remotely bring devices back online. (Don’t worry; this is only a temporary workaround. A future Device OS version will drop this message too.)

When a device shows the diagnostic symptoms we observed, we now send it this new message, and it immediately performs a handshake. Here’s a time series graph of forced handshakes for one of our services, each data point representing the number of forced handshakes performed in a two-minute window. We flipped the feature flag, activating the new code on this particular service about 10:15 am PDT on May 8. Several devices were healed immediately, and more continued to be healed as they awoke from their sleepy state over the next few days.
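The cloud-side behavior can be pictured roughly as follows. This is a sketch under stated assumptions, not the device service code: the post does not describe the diagnostic symptoms or the contents of the healing message, so the trigger used here (data that cannot be decrypted with a cached session), the `FORCE_HANDSHAKE` placeholder, and the `decrypt` and `send` callables are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Datagram:
    device_id: str
    payload: bytes


# Placeholder only; the real message contents are not given in the post.
FORCE_HANDSHAKE = b"force-handshake-placeholder"


def handle_datagram(datagram, session_cache, decrypt, send):
    """Return the plaintext, or ask the device to handshake and return None.

    decrypt(key, payload) is assumed to return None on failure, and
    send(device_id, message) is assumed to deliver a raw message.
    """
    key = session_cache.get(datagram.device_id)
    if key is not None:
        plaintext = decrypt(key, datagram.payload)
        if plaintext is not None:
            return plaintext
    # Assumed symptom: the device is using a session the cloud cannot decrypt.
    send(datagram.device_id, FORCE_HANDSHAKE)
    return None
```

Because the check runs each time a device sends data, a sleepy device would be healed the next time it wakes and transmits, which is consistent with devices continuing to recover over the following days.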

Long term view

One by-product of this incident was that it allowed us to uncover a new class of bugs in the Device OS, ultimately enabling us to deploy fixes that establish a much more reliable platform long term. These bugs were not visible prior to this incident because they would only appear in a very small number of devices at any one time. The cache failure on May 2 had the long-term benefit of triggering the bug in a much larger number of devices—close to the number that could ever experience it.

Last week we deployed multiple cloud-side fixes to force a handshake when the problem occurs in the future. Additionally, three Device OS changes are queued for release soon. They will prevent a device from entering this state and generally improve reliability for sleepy devices.

  • Wait for confirmable messages when entering the deep sleep mode.
  • Particle.connected() should return true only after receiving a message from the cloud. Right now, it becomes true as soon as the device sends a message to the cloud.
  • Devices should expire a session after a number of unsuccessful connection attempts. This is a failsafe just in case devices ever find their way into this state in the future.
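The three changes listed above can be modeled together as a small state machine. This is a toy model, not Device OS code: the class, its method names, and its fields are hypothetical. The threshold of 3 failed attempts is taken from the follow-up comment later in this thread.

```python
class CloudLink:
    """Toy model of a device's cloud connection state (hypothetical)."""

    MAX_FAILED_ATTEMPTS = 3

    def __init__(self):
        self.pending_acks = set()      # confirmable messages awaiting an ack
        self.heard_from_cloud = False  # set only when the cloud has replied
        self.failed_attempts = 0
        self.session_valid = True

    def send_confirmable(self, message_id):
        self.pending_acks.add(message_id)

    def on_ack(self, message_id):
        self.pending_acks.discard(message_id)
        self.heard_from_cloud = True
        self.failed_attempts = 0

    def on_connect_failed(self):
        # Failsafe: drop the session after repeated failures so that the
        # next connection attempt performs a full handshake.
        self.failed_attempts += 1
        if self.failed_attempts >= self.MAX_FAILED_ATTEMPTS:
            self.session_valid = False
            self.heard_from_cloud = False
            self.pending_acks.clear()

    def connected(self):
        # True only after the cloud has replied, not merely after sending.
        return self.session_valid and self.heard_from_cloud

    def may_deep_sleep(self):
        # Deep sleep waits until no confirmable message is outstanding.
        return not self.pending_acks
```

In this model, a device that has just sent a confirmable message reports `connected()` as false and `may_deep_sleep()` as false until the acknowledgment arrives, and three consecutive failed attempts invalidate the session.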

Peek behind the curtain

We care about our customers. Deeply. We have your back, and everyone at Particle takes great pride in the products we help you deliver. I want to take this opportunity to show you the way this team pulled together to understand a very complex and difficult-to-reproduce bug involving caching systems, low power states, timing constraints on device sleep and server deploy, and the low-level details of CoAP, UDP, and DTLS encryption.

I took a few screenshots of Slack in the hopes that you folks might feel some of the pride that I do as a founder of Particle. Here’s the revelatory moment on Tuesday morning PDT (almost 3:00 am Wednesday his time!), after we had mostly given up hope of a remote fix, when Eli (primary owner of the device service) first shared that he might’ve found a solution.

As you can see, we had a Slack channel and a Clubhouse milestone devoted to the incident. The folks replying on that thread span firmware experts, cloud experts, executives, and customer success engineers.

Here’s a post from Joe (engineering manager for platform reliability) summarizing a meeting that just finished after hours of testing confirmed that we could reliably and safely use what Eli found.

Also that afternoon, some of the longer term code changes to the Device OS started getting pushed, and Eli put a smile on everyone’s face.

The engineering and customer success teams were glued to the metrics as we started rolling out the fix.

Eli kept us supplied with updates throughout the day.

Thank you to everyone at Particle who worked so hard over the last couple of weeks. Take some time off. Thank you to the folks in the community who helped us understand the issue. You make us great. We’ll keep improving the platform every day. Also, we’re hiring.

Small update on Device OS follow-ups:

@BDub just informed me that another of those Device OS changes has been reviewed: PR 1776 expires a session after 3 unsuccessful attempts at connecting to the cloud.

He also said that both that one and PR 1767, “Wait for confirmable messages when entering deep sleep”, will be in 1.2.0-rc.1.

there were some comments at the start of the incident about some log data that was lost and unrecoverable. i think it was mentioned at some point that the logs are on a rolling 30 days basis and as part of the caching server crashing that some log data is no longer available for access. i’m curious as to why it seems, if i’m understanding things correctly, that the log data is not saved to a database but only kept in a cache? also, were the lost logs limited to those roughly 500 devices or did the log issue affect a larger number of devices? finally, has particle put any processes and or changes in effect to hopefully prevent the lost log data in any future server crash? thanks.

Thank you for your thoughtful questions, @dkryder.

You are correct: the same cache server crash that dropped UDP device service session keys also dropped webhook/integration log data.

This log data was not kept in a database or backed up because, when we initially built the webhooks/integration system, we didn’t know how valuable the data would be to customers. We chose a caching solution because we viewed the data as totally ephemeral.

Also, in comparison to other possible engineering investments in cloud infrastructure reliability, the risks posed by this specific caching infrastructure seemed less pressing. We now know that choice was incorrect and are making changes to ensure that a future failure of this kind will be far less disruptive.

Specifically, we will be replacing one multi-purpose, non-highly-available cache server with multiple multi-node, highly available clusters with automated failover. After this work is completed, a failure like the one that happened on May 2 would result in no more than 1 minute of downtime and zero loss of data.

Stepping back for a moment, I want to add that this incident has spurred us to transform our escalation and prioritization processes to put focus acutely on reliability.
