Originally published at: https://blog.particle.io/platform-cloud-incident-postmortem/
I’m writing to provide an update on a significant outage we at Particle managed through last week – the kind of outage that every SRE (Site Reliability Engineering) team plans for, but hopes to never see in practice.
This outage is an event that I and, indeed, all of Particle, are taking incredibly seriously, and I want to start by offering my personal apology for the disruption to our customers’ businesses. I fully appreciate the negative impact any downtime has on a business and understand it’s imperative that we at Particle are unrelenting in the pursuit of full availability and resiliency against failures in all parts of the complex ecosystem required to provide our services.
The root cause of the outage was a catastrophic failure at the vendor hosting our most critical system, the operational database central to the operations for our Device Cloud services. It contains information relating to users, devices, firmware, and webhooks and is required for our customers to manage and monitor Particle devices and their products in the field. Our customers trust Particle to have the systems in place to seamlessly navigate such major events; we were ultimately able to do so and bring our systems back online, with full data integrity, far ahead of the vendor recovery, but not without our own downtime and degraded performance as we managed the situation.
In this post I aim to provide a transparent and detailed overview of the event, including our preparedness for this situation, the steps Particle took to reduce the impact to our customers, the steps we have taken since the event to restore normal operations, and the actions we’ve taken to keep such events from impacting our customers.
How Prepared Were We?
For a system this critical to our services, Particle takes a three-layered approach:
- Vendor Selection/Management: We continually, comprehensively, vet the primary vendors we rely on and proactively manage our configuration within those vendors’ services for the highest reliability, availability, and performance. We also continually survey the provider landscape for opportunities to keep Particle (and, by extension, our customers) on the latest, most optimal, solutions and manage migration to these updated systems and vendors to meet our needs.
- Maintained Backups and Fallback Solutions: We build additional, independent backups and fallback systems into our systems to guard against vendor or regional single points of failure.
- Disaster Recovery Plan: We define, maintain, and continually test and practice our Disaster Recovery plan.
Below I’ll provide some more detail as it relates to our operational database.
Vendor Selection/Management
Particle has an internal vendor management policy that outlines the vetting process for critical vendors such as the one hosting our operational database. We took the following steps to ensure the right solution for reliably and securely hosting our database:
- Our architecture intentionally hosts our database in a separate, redundant environment from the rest of our infrastructure. This separation has many reliability and security benefits, including automated provisioning of new resources at times of high load, high-availability deployments, and managed backups.
- Our team ensured that our host’s services were configured for use redundantly by Particle so that we could tolerate, without disruption to our services, failure of several of our vendor’s nodes.
- This vendor’s SLA quotes 99.98% availability per month, which allows for approximately nine minutes of downtime in any given month.
- We continually updated our database instance to the version recommended by our vendor for the highest reliability and performance.
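As a quick check on the SLA figure quoted in the list above, the downtime budget implied by an availability percentage can be worked out directly. This is a minimal sketch, assuming a 30-day month; the function name is illustrative and not part of any vendor tooling.

```python
def allowed_downtime_minutes(availability_percent: float, days_in_month: int = 30) -> float:
    """Downtime budget, in minutes, implied by a monthly availability target."""
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in a 30-day month
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return total_minutes * unavailable_fraction


budget = allowed_downtime_minutes(99.98)
print(f"{budget:.2f} minutes")  # prints "8.64 minutes", i.e. roughly nine minutes
```

A 31-day month gives a slightly larger budget (about 8.9 minutes), which is why the figure is best read as "approximately nine minutes".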
Maintained Backups and Fallback Solutions
To avoid a single-point-of-failure situation by limiting Cloud services to a single region, we maintain our own independent backups of our operational database in an alternate region.
To protect against exposure from relying on a single vendor’s solution, we had another database hosting vendor lined up and maintained as an active secondary solution to transition to as part of our Disaster Recovery plan should our primary have a critical outage event. This second-sourced vendor is also part of our mission to continually pursue the latest, cutting-edge, solution to meet our strict availability, performance, and security requirements to best serve our customers.
Disaster Recovery Plan
The Particle SRE team maintains and exercises a Disaster Recovery (DR) plan for the loss of our core database, among other critical systems. We had used, for example, a recent database maintenance event in the last month to practice our plan. Finding frequent opportunities to exercise these plans gives the team confidence to act quickly, decisively, and efficiently in the face of system disasters while maintaining the integrity of the overall system.
These measures proved valuable in our response to the outage.
How the Events Unfolded and Particle’s Response
The outage event at our database hosting vendor started at 4:30am PDT on 7/15/2020. The outage was so significant that it impacted the vendor’s entire offering, including backup systems, and rendered Particle Cloud services inoperable and inaccessible.
Our internal, automated, monitoring system had fired an alert and within 9 minutes of the outage the Particle team on-call had updated our Status Page, identified the source of the issue to be with our database hosting vendor, reached out to the vendor to alert them and gather more information, and had established an internal virtual war room to coordinate our response. Around 15 minutes after the start of the event, our vendor acknowledged the significant outage, but did not provide an estimated time for recovery of their services. Our Customer Success team kicked into gear to proactively reach out to impacted customers and respond to queries in our customer help desk system and from the developer community. Subsequent communication from the vendor within the hour that followed suggested that the outage could potentially be extensive – projecting resolution within 1-2 hours, followed some time later by a longer projection of 4-8 hours.
The Disaster Recovery Response
As a result of the significance of the outage and the poor prognosis for timely resolution, Particle shifted from our standard incident response process to declaring a Disaster Recovery (DR) event, putting in motion the plans established to recover from a loss of our core database.
Per the practiced DR plan, the team kicked off the restoration of the database from our previous night’s independent database backup to our secondary vendor’s hosted service. The process of restoring the backup, validating the integrity of the data, and reconfiguring all Particle services to use the new database took around ninety minutes. At approximately 8:00am PDT the key components of the Particle Device Cloud were online, and the majority of Particle devices were able to connect and resume recording monitoring data and managing our customers’ products.
Managing the Recovery
Though we had our systems back online relatively quickly with the new database host, our Device Cloud was operating in a degraded state while the large wave of devices reconnected, and the database “caught up” with queued data writes.
The degraded state was aggravated by the fact that we had initially underprovisioned the secondary vendor database for the higher-than-normal load that came with the emergency transfer. Particle worked with the secondary vendor to increase these resources during the event until we were adequately provisioned to operate within normal performance expectations.
Over the course of the next five hours, the Particle team intervened to ensure that business-critical information flows for our customers were prioritized over other, less important operations. The Particle team remained in the virtual war room and triaged individual issues as they came up during our degraded state, clearing the data backlog and restoring a reasonable level of service in the Particle Device Cloud. By 2:00pm PDT, all critical Device Cloud services were fully operational.
The Final Step to Ensure Full Data Integrity
In the end, our vendor suffered 13.5 hours of downtime. Had we not executed our DR plan, our customers would have been exposed to that extraordinary 13.5 hours of downtime, followed by the additional time needed to manage the recovery and control for the flood of devices coming back online, just as we had to during the transfer to the new vendor.
Because we were operating our instance at the new vendor from a backup snapshot of the database taken the night before, we were reliant on the affected vendor to bring their system back online to catch our new database up with the data collected between our snapshot and when the system went down. Bringing the Web IDE and Cloud Compile systems back online was also dependent on the vendor’s recovery and data integrity checks.
The moment our vendor recovered at around 6:00pm PDT, the team immediately began catching the new database up with the missing data. This process needed to be managed carefully in order to preserve data integrity and avoid overwriting more recent updates made to the database since the recovery, but the team had the right tools and processes to navigate this for a full and correct cleanup. All data was confirmed restored by 12:30pm PDT on Friday, July 17th.
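To illustrate the kind of care this catch-up requires, here is a minimal sketch of one common approach: apply a recovered row only when it is newer than what the live database already holds. This is not a description of Particle's actual tooling, which this post does not detail. The row shape, the `updated_at` field, and the function name are all hypothetical.

```python
from datetime import datetime


def merge_missing_writes(live: dict, recovered: dict) -> dict:
    """Merge rows recovered from the old database into the live one.

    Each value is a hypothetical row dict carrying an 'updated_at' datetime.
    A recovered row is applied only if the live database has no row for that
    key, or if the live row is older than the recovered one.
    """
    merged = dict(live)
    for key, row in recovered.items():
        current = merged.get(key)
        if current is None or current["updated_at"] < row["updated_at"]:
            merged[key] = row
    return merged


# Hypothetical data: the live database was restored from a nightly snapshot.
live = {
    "device-1": {"name": "pump-a", "updated_at": datetime(2020, 7, 15, 9, 0)},   # edited after failover
    "device-2": {"name": "valve", "updated_at": datetime(2020, 7, 14, 20, 0)},   # unchanged since snapshot
}
recovered = {
    "device-1": {"name": "pump", "updated_at": datetime(2020, 7, 15, 2, 0)},     # older than the live edit
    "device-2": {"name": "valve-b", "updated_at": datetime(2020, 7, 15, 3, 0)},  # written before the outage
    "device-3": {"name": "sensor", "updated_at": datetime(2020, 7, 15, 4, 0)},   # created before the outage
}
result = merge_missing_writes(live, recovered)
```

In this example the post-failover edit to `device-1` survives, while `device-2` and `device-3` pick up the writes that the snapshot had missed. A real reconciliation would also need to handle deletions and clock differences between systems, which this sketch ignores.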
This chart provides a visual summary of the timing of the events.
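The timestamps quoted in this post can also be tabulated as elapsed time from the start of the vendor outage. This is a small sketch that treats the approximate times given above as exact; the event labels are paraphrased from the text.

```python
from datetime import datetime

# Times as quoted in this post (all PDT); several are stated as approximate.
events = {
    "vendor outage begins": datetime(2020, 7, 15, 4, 30),
    "key Device Cloud components online": datetime(2020, 7, 15, 8, 0),
    "critical services fully operational": datetime(2020, 7, 15, 14, 0),
    "vendor recovers": datetime(2020, 7, 15, 18, 0),
    "all data confirmed restored": datetime(2020, 7, 17, 12, 30),
}

start = events["vendor outage begins"]
hours_since_start = {
    name: (when - start).total_seconds() / 3600 for name, when in events.items()
}

for name, hours in hours_since_start.items():
    print(f"{hours:5.1f} h  {name}")
```

This yields 3.5 hours to get key components back online, 9.5 hours to full operation of critical services, 13.5 hours for the vendor's own recovery, and 56 hours until all data was confirmed restored.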
What Actions Will Particle Take Moving Forward?
The systems and processes Particle had in place were necessary in order to contain the severity of the outage and preserve the integrity of our data. Even with the containment and recovery, however, we were unable to meet our standards for the availability of our systems. Through this experience, we uncovered several opportunities for improvement, both to avoid customer downtime during such an event in the first place and to speed up our own recovery processes as part of DR failover measures. We also identified improvements to our communication flows.
In this section I’ll share some examples of short-term and medium-term actions our team has identified, some of which were already planned prior to the outage and some of which we identified during a postmortem review of the outage.
Preventing Customer Outage in the First Place:
- We designated our formerly secondary database host provider as our primary provider. This provider enables features not made available to us by our former provider, such as rolling point-in-time recovery (PITR) tools for higher availability guarantees. This company’s industry-leading reputation and better feature set had us already on a path, prior to our former vendor’s outage event, to transitioning our operational database to their services.
- We’re actively working with our new primary database host provider to ensure that we are configured to maximize the reliability features available to us (beyond the services we already had in place with our former vendor).
- As part of our existing reliability roadmap, we will accelerate plans to improve redundancy in our own applications and services. This includes consistently architecting all Particle services to operate in multiple datacenter regions.
Improved Efficiency in Recovery from Outage, Failover, and Maintenance Events:
- We are continuing to refine and test our Business Continuity Plan / Disaster Recovery procedures. Most notably, in the short term, we will establish a hot site for our operational database so that we can fail over to an alternative hosting environment significantly more quickly than during this event.
- We enabled hourly backups of our operational database to avoid a long tail of managing data integrity in the gap between the last backup and the latest state of the database.
- We’ve noted the database resource requirements needed to successfully provision a replica of our operational database in a high-load scenario and have a system in place to monitor those requirements as the system grows with our customer deployments and features, which means we should not have a repeat of the under-provisioning issues through failover, recovery, and maintenance events.
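The effect of moving to hourly backups can be illustrated by computing how much data sits between the most recent backup and a failure. This is a rough sketch under stated assumptions: the midnight backup schedule below is hypothetical (this post says only that the backup was from the previous night), and the function name is illustrative.

```python
from datetime import datetime, timedelta


def gap_since_last_backup(failure: datetime, first_backup: datetime,
                          interval: timedelta) -> timedelta:
    """Time between the most recent scheduled backup and the failure.

    Assumes backups run at first_backup, first_backup + interval, and so on,
    and that the failure happens at or after first_backup.
    """
    completed_intervals = (failure - first_backup) // interval  # whole intervals elapsed
    last_backup = first_backup + completed_intervals * interval
    return failure - last_backup


failure = datetime(2020, 7, 15, 4, 30)    # outage start quoted in this post
schedule_start = datetime(2020, 7, 14)    # hypothetical: backups aligned to midnight

nightly_gap = gap_since_last_backup(failure, schedule_start, timedelta(hours=24))
hourly_gap = gap_since_last_backup(failure, schedule_start, timedelta(hours=1))
print(nightly_gap, hourly_gap)  # 4:30:00 0:30:00
```

Under this assumed schedule, an hourly cadence shrinks the window of writes that must be reconciled later from hours to at most one hour.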
Improved Customer Communication:
- We are working to provide more useful self-service monitoring telemetry on our status page and customer consoles. This, in turn, will provide awareness and alerts that will offer close to real-time notifications for any future potential issues.
- We are tightening our internal escalation and customer communication run books to ensure that we respond and proactively reach out to our customers and community faster.
- We will continue to provide greater transparency and increased frequency of updates on mission-critical aspects of our platform on our status page.
Thank you for taking the time to read this post. Our customers entrust Particle with the management and monitoring of their products and I fully understand how critical the availability of our systems is to our customers’ products. I hope this visibility and transparency is useful information for our customers and potential customers evaluating our solution and serves to communicate how seriously we take this responsibility.