This is one of the reasons I chose Particle for my products: the transparency. Even when problems have been handled, most companies would only have fielded questions as (or, more likely, IF) calls came in about it. Thank you, guys.
In reading that write-up of the incident, I was kinda surprised at how weak the system that failed was. It was explained that since this single system had worked well most of the time, no thought was given, or, I guess, no action had been taken to improve it. That is, until it failed, and then there was no choice but to take a serious look at it. Another thing that really struck me was that there was no announcement of any compensation for the acknowledged hardship that the failure presented to the customer. Just a geez, we are sorry, we really care about you. Time to look in the mirror, guys.
For completeness' sake, since there are people who won't read the post-mortem, but do read your post:
"Don't fix it if it ain't broken" springs to mind. If a service that you've been running since the beginning has been working reliably, then why mess with it? The hardware running this had failed (an Amazon AWS server, out of Particle's hands), and no amount of code improvements could have prevented this.
Seeing as the only way to circumvent issues like these is to make sure you have redundant hardware servers as backup, the only improvement on that end would be to make the code horizontally scalable. Coincidentally, that's exactly what they've been doing, even before there was an issue:
"This work has been actively in progress for the last couple months"
As far as compensation goes, what would you consider fair? That is, keeping the following in mind:
It was caused by a hardware failure out of their control.
They responded and fixed it as soon as they became aware of the issue.
A more permanent solution was being worked on prior to the issue.
Nobody mentioned this in the forums (as far as I'm aware). There have been other minor issues in the past, and most of those were voiced in the community relatively quickly. The only reason I knew about this outage is that I got the email. If the "hardship" was that bad, surely someone would have at least noticed the issue and said something?
They've been really open about everything: the cause, the actions taken, and the plans to prevent this altogether.
I promised myself I wouldn't come down hard on Particle for this and I intend to stick with that, but what you say here is just so wrong. Using software (with redundant hardware) for failure recovery has been a cornerstone of computer engineering for decades. It's always a cost/efficiency/safety tradeoff, and there are no simple answers.
Horizontal scalability has nothing at all to do with failure recovery; it's a way to scale capacity and nothing more. And yes, you can take the view that keeping part of your capacity running is a form of protection, but that is idiocy. There are many failure modes it gives you zero protection against, and for the ones where it will help: how much extra capacity does the company buy, and how degraded can you allow performance to get?
Hardware failures will always happen. A company like Particle cannot just up and say hardware failures are out of their control and wash their hands of responsibility, and they are not doing that now.
I am not blaming Particle here and this doesn't change my opinion of them. I file this under "these things happen, let's move on". I love their ecosystem, but I've always had a nagging feeling about putting so much time, effort, and money into hardware projects that simply stop working when the cloud services aren't there. An incident like this emphasizes how totally dependent we are on them.
I'll be the first to admit I'm not an expert in this field, and the points you make are fair.
Obviously, Particle should make sure stuff keeps running when hardware fails, and sure, better coding could help alleviate some of the issues. The point I was trying to make was that they were already working on this before the issue popped up, contrary to the opinion voiced that they only bothered to look at it after it "broke". Mostly so people who don't read the postmortem don't get the wrong idea from the comments.
I agree that "high availability" and "horizontal scalability" are very different. Part of the difficulty with the specific content of the postmortem (which several people at Particle edited over 24 hours) was making it broadly understandable, with enough technical detail to be helpful, forthright, and interesting, but not so much that it's unreadable for a large portion of the audience. Keep in mind that here in the community, we are the most technical portion of Particle's many audiences.
Please rest assured that despite the imprecise language in the postmortem, internally, the engineering team pursues both goals (as well as security, maintainability, usability, and lots of other -ilities) with vigor for all our services. Failures (hardware and otherwise) in fact do happen all the time, and no one notices because of the ways the systems fail cleanly and gracefully and can easily scale to meet capacity.
Zachary, I thought it was great of you to post information about what happened. I was actually shocked that Particle did that, as I am so used to companies simply saying "we were down for x hours, sorry. We are back up now".
I think most users understand that these sorts of failures do happen and are inevitable. What's important is how the company reacts and responds to the community, so we can feel "in the loop" and that we matter. I believe Particle has done a fantastic job dealing with this situation, and to me, that matters more than the outage itself.
A different kind of company could downplay the outage, pretending nothing really happened, and ignore the community. It's great that Particle is not that kind of company.
I don't really see all the negative response that @Moors7 is trying to head off, but to those people: sure, Particle could have done a lot more up front, if you're willing to accept a much higher price tag on the hardware and cloud. It's all tradeoffs and a balancing act. I'd rather pay $19 for a Photon and endure this outage than $99 for a Photon that makes it 1/10th as likely.
I think what surprised me most about this is that Particle had a failure on AWS hardware. I thought the whole premise of AWS (and other cloud vendors) was redundancy and duplication: they provide a virtual service, not a physical machine. If the hardware pops its clogs, surely they should automatically migrate the virtual server to a working physical platform. I've never investigated AWS in any detail, but I've always assumed that as a basic premise, if not a promise. Interesting if it's not true.
Anyway, kudos to Particle for their response/transparency.
And, yes, this is why the "essential service product" (if a single deployment is a product) I designed using an Electron (a fire pump controller) does not use the cloud, but mobile/cellular SMS, for comms. A dozen fewer things to go wrong in the chain!
Thanks for the transparency. What can be done from the device firmware perspective to have workarounds when this happens? I had built a few precautions into my devices for whenever the device senses there is no cloud or cellular connectivity (i.e., storing some data locally and waiting until the connection restores before transmitting). However, these measures did not work in this event.
As context, we run a connected-machines service. Our customers rely on us to sense when their machines are operating and to provide a service; they in turn bill their customers based on this info. Hence, the data we get from the IoT devices is not real-time sensitive, but we need to ensure it gets to its destination. We could not do that during the downtime, and the local storage solution we had did not work because the Particle devices thought they had connectivity. We use the Particle cloud and webhooks to send the messages from Particle to our final dashboard/data-analysis solutions. We liked this approach (as opposed to sending the data from the device directly to our cloud) because of the built-in security and data efficiency of the Particle.publish methods.
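One pattern that might have helped here (a generic sketch, not Particle's recommended approach): instead of trusting a connectivity flag, delete a buffered reading only after the transport confirms delivery. On a real device, `sendWithAck` could be backed by an acknowledged publish such as `Particle.publish(..., WITH_ACK)`; the queue below is plain C++ for illustration.

```cpp
#include <deque>
#include <string>

// Store-and-forward queue: readings are dropped only after the transport
// confirms delivery, never on the basis of a "connected" flag alone.
class ReadingQueue {
public:
    // Buffer a reading locally (in RAM here; flash/EEPROM on a real device).
    void record(const std::string& reading) { pending_.push_back(reading); }

    // Try to drain the queue. sendWithAck returns true only on a confirmed
    // delivery; stop at the first failure so readings keep their order.
    template <typename SendFn>
    size_t flush(SendFn sendWithAck) {
        size_t sent = 0;
        while (!pending_.empty()) {
            if (!sendWithAck(pending_.front())) break;  // no ack: keep it
            pending_.pop_front();
            ++sent;
        }
        return sent;
    }

    size_t pendingCount() const { return pending_.size(); }

private:
    std::deque<std::string> pending_;
};
```

During an outage where the device merely *thinks* it is connected, the publishes go unacknowledged, `flush` sends nothing, and the readings stay buffered until a later flush succeeds.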
Thanks in advance for the suggestions.
Great question. Exactly the way an IoT product ought to be built. For the general audience I'll say there's not one answer; it depends on each individual connected product. I'll reply in a separate thread with some recommendations or constraints around how to think about the problem.