This is one of the reasons I chose Particle for my products: the transparency. Even after a problem has been handled, most companies would only field questions as (or, more likely, IF) calls came in about it. Thank you, guys.
In reading that write-up of the incident, I was kind of surprised at how weak the system that failed was. It was explained that since this single system had worked well most of the time, no thought was given to it, or, I guess, no action had been taken to improve it. That is, until it failed, and then there was no choice but to take a serious look at it. Another thing that really struck me was that there was no announcement of any compensation for the acknowledged hardship the failure presented to the customer. Just a "geez, we are sorry, we really care about you." Time to look in the mirror, guys.
For completeness' sake, since there are people who won't read the post-mortem but do read your post:
"Don't fix it if it ain't broken" springs to mind. If a service that you've been running since the beginning has been working reliably, then why mess with it? The hardware running this had failed (an Amazon AWS server, out of Particle's hands), and no amount of code improvements could have prevented this.
Seeing as the only way to circumvent issues like these is to make sure you have redundant hardware servers as backup, the only improvement on that end would be to make the code horizontally scalable. Coincidentally, that's exactly what they've been doing, even before there was an issue.
"This work has been actively in progress for the last couple months"
As far as compensation goes, what would you have thought to be fair? That is, keeping the following in mind:
It was caused by a hardware failure out of their control.
They responded and fixed it as soon as they became aware of the issue.
A more permanent solution was being worked on prior to the issue.
Nobody mentioned this in the forums (as far as I'm aware). There have been other minor issues in the past, and most of those were voiced in the community relatively quickly. The only reason I knew about this outage is that I got the email. If the 'hardship' was that bad, surely someone would have at least noticed the issue and said something?
They've been really open about everything: the cause, the actions taken, and the plans to prevent this altogether.
I promised myself I wouldn't come down hard on Particle for this, and I intend to stick with that, but what you say here is just so wrong. Using software together with redundant hardware to survive failures has been a cornerstone of computer engineering for decades. It's always a cost/efficiency/safety tradeoff, and there are no simple answers.
Horizontal scalability has nothing at all to do with failure recovery; it's a way to scale capacity and nothing more. And yes, you can take the view that keeping part of your capacity running is a form of protection, but that is idiocy. There are many failure modes it gives you zero protection against, and even for the ones where it does help, the questions remain: how much extra capacity does the company buy, and how degraded can you allow performance to get?
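To put rough numbers on that capacity question, here's a back-of-envelope sketch (the pool size and utilization figures are purely illustrative, not anything from the postmortem):

```cpp
// Back-of-envelope: N identical servers at some steady-state utilization.
// When one fails, the survivors must absorb its share of the traffic.
#include <cstdio>

int main() {
    const int n = 5;           // servers in the pool (illustrative)
    const double load = 0.80;  // utilization per server before the failure

    // Total work in "server units", redistributed across the n - 1 survivors.
    const double afterFailure = (n * load) / (n - 1);
    std::printf("Per-server load after one failure: %.0f%%\n",
                afterFailure * 100.0);  // prints 100%
    return 0;
}
```

Five servers at 80% become four at 100%: the pool is saturated, and the next failure or traffic spike degrades service. Spare capacity, not horizontal scaling by itself, is what buys you failure tolerance.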
Hardware failures will always happen. A company like Particle cannot just up and say hardware failures are out of their control and wash their hands of responsibility, and they are not doing that now.
I am not blaming Particle here, and this doesn't change my opinion of them. I file this under "these things happen, let's move on". I love their ecosystem, but I've always had a nagging feeling about putting so much time, effort, and money into hardware projects that simply stop working when the cloud services aren't there. An incident like this emphasizes how totally dependent we are on them.
I'll be the first to admit I'm not an expert in this field, and the points you make are fair.
Obviously, Particle should make sure stuff keeps running when hardware fails, and sure, better coding could help alleviate some of the issues. The point I was trying to make was that they were already working on this before this issue popped up, contrary to the opinion voiced that they only bothered to look at it after it "broke". Mostly so people who don't read the postmortem don't get the wrong idea by reading the comments.
I agree that "high availability" and "horizontal scalability" are very different. Part of the difficulty with the specific content of the postmortem (which several people at Particle edited over 24 hours) was making it broadly understandable, with enough technical detail to be helpful, forthright, and interesting, but not so much that it's unreadable for a large portion of the audience. Keep in mind that here in the community, we are the most technical portion of Particle's many audiences.
Please rest assured that despite the imprecise language in the postmortem, internally, the engineering team pursues both goals (as well as security, maintainability, usability, and lots of other -ilities) with vigor for all our services. Failures (hardware and otherwise) in fact do happen all the time, and no one notices because of the ways the systems fail cleanly and gracefully and can easily scale to meet capacity.
Zachary, I thought it was great of you to post information about what happened. I was actually shocked that Particle did that, as I am so used to companies simply saying "we were down for x hours, sorry. We are back up now."
I think most users understand that these sorts of failures do happen and are inevitable. What's important is how the company reacts and responds to the community so we can feel "in the loop" and that we matter. I believe Particle has done a fantastic job dealing with this situation, and to me, that matters more than the outage itself.
A different kind of company could downplay the outage, pretending nothing really happened, and ignore the community. It's great that Particle is not that kind of company.
I don't really see all the negative response that @Moors7 is trying to head off, but to those people: sure, Particle could have done a lot more up front, if you're willing to accept a much higher price tag on the hardware and cloud. It's all tradeoffs and a balancing act. I'd rather pay $19 for a Photon and endure this outage than $99 for a Photon that makes this 1/10th as likely.
I think what surprised me most about this is that Particle had a failure on AWS hardware. I thought the whole premise of AWS (and other cloud vendors) was redundancy and duplication: they provide a virtual service, not a physical machine. If the hardware pops its clogs, surely they should automatically migrate the virtual server to a working physical platform. I've never investigated AWS in any detail, but I've always assumed that as a basic premise, if not a promise. Interesting if it's not true.
Anyway, kudos to Particle for their response/transparency.
And, yes, this is why the "essential service product" (if a single deployment is a product) I designed using an Electron (a fire pump controller) does not use the cloud, but mobile/cellular SMS, for comms. A dozen fewer things to go wrong in the chain!
Thanks for the transparency. What can be done from the device firmware perspective to have workarounds when this happens? I had built into my devices a few precautions for whenever the device senses there is no cloud or cellular connectivity (e.g., storing some data locally and waiting until the connection is restored before transmitting). However, these measures did not work in this event.
For context, we run a connected-machines service. Our customers rely on us to sense when their machines are operating and to provide a service; they in turn bill their own customers based on this info. Hence, the data we get from the IoT devices is not real-time sensitive, but we do need to ensure it gets to its destination. We could not do that during the downtime, and the local storage solution we had did not work because the Particle devices thought they had connectivity. We use the Particle cloud and webhooks to send the messages from Particle to our final dashboard/data-analysis solutions. We liked this approach (as opposed to sending the data from the device directly to our cloud) because of the built-in security and data efficiency of the Particle.publish method.
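For illustration, here is a minimal sketch (Wiring/Device OS; the event name, queue size, and payload format are all made up) of the store-and-forward pattern we were aiming for, where a reading only leaves the local buffer once the cloud acknowledges the publish, rather than whenever the device merely reports being connected:

```cpp
// Minimal store-and-forward sketch. Readings go into a RAM ring buffer and
// are only dequeued after the cloud acknowledges the publish (WITH_ACK),
// so a session that looks connected but isn't delivering leaves data queued
// instead of lost. A real product would persist the queue in retained
// memory or flash so it survives resets.

SYSTEM_THREAD(ENABLED);  // keep the application loop alive while offline

const int QUEUE_SIZE = 32;
String queue[QUEUE_SIZE];  // oldest-first ring buffer of unsent payloads
int head = 0;
int count = 0;

void enqueue(const String& payload) {
    if (count == QUEUE_SIZE) {           // buffer full: drop the oldest reading
        head = (head + 1) % QUEUE_SIZE;
        count--;
    }
    queue[(head + count) % QUEUE_SIZE] = payload;
    count++;
}

void setup() {
}

void loop() {
    static unsigned long lastSample = 0;
    if (millis() - lastSample >= 60000) {  // sample once a minute (illustrative)
        lastSample = millis();
        enqueue(String::format("{\"t\":%lu}", millis()));  // placeholder payload
    }

    // Drain at most one event per pass, and only while connected. WITH_ACK
    // blocks until the cloud confirms receipt (or times out), so on failure
    // the payload simply stays at the head of the queue for the next pass.
    if (count > 0 && Particle.connected()) {
        bool acked = Particle.publish("machine-data", queue[head].c_str(),
                                      PRIVATE | WITH_ACK);
        if (acked) {
            head = (head + 1) % QUEUE_SIZE;
            count--;
        }
    }
}
```

One caveat: WITH_ACK only confirms the Particle cloud received the event, not that the downstream webhook succeeded. End-to-end assurance would need an application-level acknowledgment, e.g., having the device subscribe to the webhook's hook-response event and re-queue on silence.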
Thanks in advance for the suggestions.
Great question. Exactly the way an IoT product ought to be built. For the general audience I'll say there's not one answer; it depends on each individual connected product. I'll reply in a separate thread with some recommendations or constraints around how to think about the problem.