Just got alerted that our inbound webhook data from Particle plummeted. We lost about half our incoming traffic, even though our devices seem fine and their event streams show all the data we’re not getting. What’s going on?
Also, no integration errors are being logged, and it seems like some products are fine while others are showing delays and/or large gaps in messages?
Same. For a specific device, one webhook kept firing while the other did not. I deleted the stale webhook (got a spinning wheel) and set up a replacement, and it worked again.
One of the services responsible for webhooks was acting up, so most of the integrations worked without issue. Creating a new one had a good chance of landing on a working service.
Thanks for raising the issue here. Service was restored this morning.
I want to apologize for the slow response on this one. In general, we strive to have automated alerting set up on all of our mission-critical services so that when something goes wrong we deal with it immediately. In this case we didn’t have alerting set up on this particular failure mode (the root cause of which was a request overload on one of four servers that handle webhooks), so we became aware of the issue this morning when our customer support team logged into the forums and saw concerns being raised.
We’ll be doing a postmortem both on this specific issue and on the work that we intend to do to expand the scope of our load testing and alerting systems to ensure that we don’t encounter similar unexpected failures again. We’ll post the results of that postmortem here next week.
@zach WOW! I’ve worked as an SRE for years, led reliability revolutions at multi-billion dollar dot coms, and, honestly, if your staff isn’t alerting on something as simple as load balancer ingest vs server ingest, load balancer error rates, or server load vs past performance, then your SRE team has absolutely missed one of the MOST IMPACTFUL areas of alerting on your systems. Heck, in AWS, you could build a Lambda to do this in less than a day, even roll out CloudWatch alerts for your entire infrastructure automatically if you wanted.
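As a rough illustration of the "load balancer ingest vs server ingest" check mentioned above, here is a minimal Python sketch. It is not Particle's actual setup: the `IngestSample` type, the `find_ingest_alerts` function, the server names, and both thresholds are hypothetical, and in practice the counts would come from whatever metrics system is in use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IngestSample:
    """Request counts for one time window (hypothetical shape)."""
    lb_requests: int                 # requests the load balancer forwarded
    server_requests: Dict[str, int]  # requests each backend actually processed


def find_ingest_alerts(
    sample: IngestSample,
    max_loss_ratio: float = 0.05,    # assumed tolerance for LB vs server mismatch
    min_share_of_fair: float = 0.5,  # assumed floor relative to an even split
) -> List[str]:
    """Return alert messages for one window; an empty list means healthy."""
    alerts: List[str] = []
    processed = sum(sample.server_requests.values())

    # Check 1: did the servers process roughly what the load balancer sent?
    if sample.lb_requests > 0:
        loss = 1 - processed / sample.lb_requests
        if loss > max_loss_ratio:
            alerts.append(
                f"ingest gap: {loss:.0%} of load balancer requests were not processed"
            )

    # Check 2: is any single server far below an even share of the traffic?
    if sample.server_requests:
        fair_share = sample.lb_requests / len(sample.server_requests)
        for name, count in sorted(sample.server_requests.items()):
            if fair_share > 0 and count < min_share_of_fair * fair_share:
                alerts.append(
                    f"{name}: processed {count} requests, expected about {fair_share:.0f}"
                )
    return alerts


if __name__ == "__main__":
    # One of four servers is struggling, similar in spirit to the incident described.
    sample = IngestSample(
        lb_requests=4000,
        server_requests={
            "webhook-1": 1000,
            "webhook-2": 1000,
            "webhook-3": 1000,
            "webhook-4": 100,
        },
    )
    for message in find_ingest_alerts(sample):
        print(message)
```

Run on a schedule against real counters, a check like this would flag both the overall gap and the specific server falling behind, which is the kind of failure that otherwise only shows up as missing data on the customer side.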
Hey folks - here’s a link to the postmortem we published to the status page last night:
Thanks for raising the flag on this one and if you have any questions on the postmortem, the work that we’ve done to address the root cause of the issue, or the reliability investments we’ll be making in Q4, feel free to raise those questions here.