March 20 incident postmortem

Yesterday, some of you may have noticed that we had some downtime. I just posted the postmortem on our status page: https://status.particle.io/incidents/0h2xjj4sqk9t

You can subscribe to incident notifications there.

My sincerest apologies for any problem this may have caused. :pray:

Please check out details in the postmortem, and reply here with any questions.

Yours, in openness,
Zachary

15 Likes

This is one of the reasons I chose Particle for my products: the transparency. Even when problems have been handled, most companies would have fielded questions as (or most likely IF) calls came in about it. Thank you guys.

8 Likes

In reading that write-up of the incident, I was kinda surprised at how weak the system that failed was. It was explained that since this single system had worked well most of the time, no thought was given, or I guess no action was taken, to improve it. That is, until it failed, and then there was no choice but to take a serious look at it. Another thing that really impressed me was that there was no announcement of any compensation for the acknowledged hardship that the failure caused customers. Just a "geez, we are sorry, we really care about you." Time to look in the mirror, guys.

For completeness’ sake, since there are people who won’t read the postmortem but do read your post:

“Don’t fix it if it ain’t broken” springs to mind. If a service that you’ve been running since the beginning has been working reliably, then why mess with it? The hardware running this had failed (an Amazon AWS server, out of Particle’s hands), and no amount of code improvements could have prevented this.

Seeing as the only way to circumvent issues like these is to make sure you have redundant hardware servers as backup, the only improvement on that end would be to make the code horizontally scalable. Coincidentally, that’s exactly what they’ve been doing, even before there was an issue.

“This work has been actively in progress for the last couple months”

As far as compensation goes, what would you have thought to be fair? That is, with keeping the following in mind:

  • It was caused by a hardware failure out of their control.
  • They responded and fixed it as soon as they became aware of the issue.
  • A more permanent solution was being worked on prior to the issue.
  • Nobody mentioned this in the forums (as far as I’m aware). There have been other minor issues in the past, and most of these have been voiced in the community relatively quickly. The only reason I knew about this outage is because I got the email. If the ‘hardship’ was that bad, surely someone would have at least noticed the issue and said something?
  • They’ve been really open about everything, from the cause, to the actions taken, to the plans to prevent this altogether.

4 Likes

Good write up. Thanks

Q. How will we be notified when the broker becomes highly available?

I’m keen to know given we will have to spin up a private MQTT server to account for the lack of high availability until the issue is solved.

Very keen to see this happen. I must admit I’m surprised such a noticeable component wasn’t HA, region-separated, and vendor-separated.

It would be great to understand the complexities involved so we can further appreciate the time it takes to get it to HA standard :slight_smile:

A big thank you guys for letting us know about this post-mortem exercise!
Gustavo.

2 Likes

Great question Cameron.

My hope is that the transition will be seamless. I’ll make sure we announce it here in the community. :+1:

2 Likes

I promised myself I wouldn’t come down hard on Particle for this and I intend to stick with that, but what you say here is just so wrong. Using software (with redundant hardware) to survive hardware failures has been a cornerstone of computer engineering for decades. It’s always a cost/efficiency/safety tradeoff and there are no simple answers.

Horizontal scalability has nothing at all to do with failure recovery; it’s a way to scale capacity and nothing more. And yes, you can take the view that keeping part of your capacity running is a form of protection, but that is idiocy. There are many failure modes it gives you zero protection against, and for the ones where it does help, the questions remain: how much extra capacity does the company buy, and how degraded can you allow performance to become?

Hardware failures will always happen. A company like Particle cannot just up and say hardware failures are out of their control and wash their hands of responsibility, and they are not doing so now.

I am not blaming Particle here and this doesn’t change my opinion of them. I file this under “these things happen, let’s move on”. I love their ecosystem, but I’ve always had a nagging feeling about putting so much time, effort, and money into hardware projects that simply stop working when the cloud services aren’t there. An incident like this emphasizes how totally dependent we are on them.

3 Likes

I’ll be the first to admit I’m not an expert in this field, and the points you make are fair.

Obviously, Particle should make sure stuff keeps running when hardware fails, and sure, better coding could help alleviate some of the issues. The point I was trying to make was that they were already working on this before this issue popped up, contrary to the opinion voiced that they only bothered to look after it ‘broke’. Mostly so people who don’t read the postmortem don’t get the wrong idea by reading the comments.

:clap: :100: agree that “high availability” and “horizontal scalability” are very different. Part of the difficulty with the specific content of the postmortem (which several people at Particle edited over 24 hours) was to make it broadly understandable, with enough technical detail to be helpful, forthright, and interesting, but not so much that it’s unreadable for a large portion of the audience. Keep in mind that here in the community, we are the most technical portion of Particle’s many audiences. :wink:

Please rest assured that despite the imprecise language in the postmortem, internally, the :particle: engineering team pursues both goals (as well as security, maintainability, usability, and lots of other -ilities) with vigor for all our services. Failures (hardware and otherwise) in fact do happen all the time, and no one notices because of the ways the systems fail cleanly and gracefully and can easily scale to meet capacity.

6 Likes

Zachary, I thought it was great of you to post information about what happened. I was actually shocked that Particle did that, as I am so used to companies just simply saying “we were down for x hours, sorry. We are back up now.”

3 Likes

Zachary, thanks for your postings here.

I think most users understand that these sorts of failures do happen and are inevitable. What’s important is how the company reacts and responds to the community so we can feel “in the loop” and that we matter. I believe Particle has done a fantastic job dealing with this situation and to me, that matters more than the outage itself.

A different kind of company could downplay the outage pretending nothing really happened and ignore the community. It’s great that Particle is not that kind of company.

I don’t really see all the negative response that @Moors7 is trying to head off, but to those people, sure Particle could have done a lot more up front, if you’re willing to accept a much higher price tag on the hardware and cloud. It’s all tradeoffs and a balancing act. I’d rather pay $19 for a Photon and endure this outage than $99 for a Photon that makes this 1/10th as likely.

6 Likes

I think what surprised me most about this is that Particle had a failure on AWS hardware. I thought the whole premise of AWS (and other cloud vendors) was redundancy and duplication - they provide a virtual service, not a physical machine. If the hardware pops its clogs, surely they should automatically migrate the virtual server to a working physical platform. I’ve never investigated AWS in any detail, but I’ve always assumed that as a basic premise, if not promise. Interesting if it’s not true.

Anyway, kudos to Particle for their response/transparency.

And, yes, this is why the ‘essential service product’ (if a single deployment is a product) I designed using an Electron (a fire pump controller) does not use the cloud, but mobile/cellular SMS, for comms. A dozen fewer things to go wrong in the chain!

Just gonna chime in and say Particle’s transparency and strength of community is a key differentiator. Good on y’all.

3 Likes

Thanks for the transparency. What can be done from the device firmware perspective to have workarounds when this happens? I had built a few precautions into my devices for whenever the device senses there is no cloud or cellular connectivity (i.e., storing some data locally and waiting until the connection is restored before transmitting). However, these measures did not work in this event.

For context, we have a connected-machines service. Our customers rely on us sensing when their machines are operating and providing a service, and they in turn bill their customers based on this info. Hence, the data we get from the IoT devices is not real-time sensitive, but we need to ensure it gets to its destination. We could not do that during the downtime, and the local storage solution we had did not work because the Particle devices thought they had connectivity. We use the Particle cloud and webhooks to send the messages from Particle to our final dashboard/data analysis solutions. We liked this approach (as opposed to sending the data from the device directly to our cloud) because of the built-in security and data efficiency in the Particle.publish methods.
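
To make the idea concrete, a store-and-forward buffer of this kind might look roughly like the sketch below. It is plain C++ using only the standard library, not actual firmware and not a Particle API: `Record`, `StoreAndForwardQueue`, and `sendAndConfirm` are hypothetical names invented for illustration. The key assumption is that a record is removed only once the final destination confirms receipt, rather than whenever the device believes it is connected. How that confirmation reaches the device is left open here, and a real device would presumably also need to persist the queue across resets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <string>

// Hypothetical record type: one machine-usage event awaiting delivery.
struct Record {
    std::uint32_t id;     // sequence number assigned at enqueue time
    std::string payload;  // e.g. serialized usage data
};

// Hypothetical store-and-forward buffer. A record leaves the queue only
// after the send callback reports an end-to-end acknowledgment.
class StoreAndForwardQueue {
public:
    explicit StoreAndForwardQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Buffer a new payload. If the queue is full, the oldest record is
    // dropped to make room (an assumed policy; billing data might instead
    // call for refusing new records or spilling to flash).
    std::uint32_t enqueue(const std::string& payload) {
        if (pending_.size() == capacity_) {
            pending_.pop_front();
        }
        pending_.push_back(Record{nextId_, payload});
        return nextId_++;
    }

    // Try to deliver pending records oldest-first. `sendAndConfirm` should
    // return true only when the final destination has confirmed receipt,
    // not merely when the device reports that it is connected.
    // Stops at the first failure so ordering is preserved.
    // Returns the number of records delivered in this call.
    std::size_t flush(const std::function<bool(const Record&)>& sendAndConfirm) {
        std::size_t delivered = 0;
        while (!pending_.empty() && sendAndConfirm(pending_.front())) {
            pending_.pop_front();
            ++delivered;
        }
        return delivered;
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    std::size_t capacity_;
    std::uint32_t nextId_ = 0;
    std::deque<Record> pending_;
};
```

With this shape, an outage that the device cannot detect simply shows up as `sendAndConfirm` returning false, so the records stay queued and are retried on the next `flush` call. The cost is duplicate deliveries whenever a confirmation is lost, so the receiving side would need to de-duplicate on `id`.
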
Thanks in advance for the suggestions.

2 Likes

Great question. :clap: Exactly the way an IoT product ought to be built. For the general audience I’ll say there’s not one answer; it depends on each individual connected product. I’ll reply in a separate thread with some recommendations or constraints around how to think about the problem.