Webhook requests gone

polystyrene · October 2, 2020, 6:13am

Just got alerted our inbound webhook data from Particle plummeted. We lost about half our incoming traffic, even though our devices seem fine and their event streams show all the data we’re not getting. What’s going on?

Also, no integration errors being logged, and it seems like maybe some products are fine, while others are showing delays and/or large gaps in messages?

drbreakdown · October 2, 2020, 8:58am

Same here, still happening randomly.

electronweather · October 2, 2020, 9:04am

Same issue here…

drbreakdown · October 2, 2020, 9:17am

As usual, no response from Particle in 3 hours, and the problem persists.

electronweather · October 2, 2020, 11:15am

@rickkas7 Any idea? Some webhooks are still not working properly. If I edit the webhooks I get a timeout error…

marekparticle · October 2, 2020, 11:20am

Thank you for reporting this issue. If you are able, please DM me affected Product IDs.

no1089 · October 2, 2020, 11:54am

Please keep an eye on https://status.particle.io for updates as we work on the problem.

thrmttnw · October 2, 2020, 12:06pm

Same. For a specific device, one webhook kept firing while the other did not. I deleted the stale webhook (got a spinning wheel) and set up a replacement, and it worked again.

polystyrene · October 2, 2020, 1:24pm

5+ hours from incident to acknowledgement, another 2 hours to the fix. another night of lost sleep for me. awesome.

no1089 · October 2, 2020, 1:42pm

One of the services responsible for webhooks was acting up, so most of the integrations worked without issue. Creating a new one had a good chance of landing on a working service.

hwestbrook · October 2, 2020, 2:14pm

Our products were also affected.

Maybe there is a scheduled once per month bug for the device or connectivity services to go down?

zach · October 2, 2020, 5:28pm

Hey folks,

Thanks for raising the issue here. Service was restored this morning.

I want to apologize for the slow response on this one. In general we strive to have automated alerting set up on all of our mission-critical services so that when something goes wrong we deal with it immediately. In this case we didn’t have alerting set up on this particular failure mode (the root cause of which was a request overload on one of four servers that handle webhooks), so we became aware of the issue this morning when our customer support team logged into the forums and saw concerns being raised.

We’ll be doing a postmortem both on this specific issue and on the work we intend to do to expand the scope of our load testing and alerting systems, so that we don’t encounter similar unexpected failures again. We’ll post the results of that postmortem here next week.

AlbertZeroK · October 4, 2020, 11:32am

@zach WOW! I’ve worked as an SRE for years, led reliability revolutions at multi-billion dollar dot coms, and, honestly, if your staff isn’t alerting on something as simple as - load balancer ingest vs server ingest, load balancer error rates, server load vs past performance, then your SRE team has absolutely missed one of the MOST IMPACTFUL areas of alerting on your systems. Heck, in AWS, you could build a Lambda to do this in less than a day, even roll out CloudWatch alerts for your entire infrastructure automatically if you wanted.
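
For readers who want to picture the "load balancer ingest vs server ingest" check, here is a rough sketch in Python. It is only an illustration, not how Particle's systems work: the `IngestSample` shape, the `find_ingest_alerts` function, the server names, and both thresholds are hypothetical, and a real setup would pull these counts from whatever metrics system is in use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IngestSample:
    """Request counts for one monitoring window (hypothetical shape)."""
    load_balancer_requests: int          # requests the load balancer received
    server_requests: Dict[str, int]      # server name -> requests it handled


def find_ingest_alerts(
    sample: IngestSample,
    max_drop_ratio: float = 0.05,        # assumed tolerance for unhandled requests
    min_share_ratio: float = 0.5,        # assumed floor, as a fraction of fair share
) -> List[str]:
    """Return alert messages for a window; an empty list means nothing tripped."""
    alerts: List[str] = []
    received = sample.load_balancer_requests
    handled = sum(sample.server_requests.values())

    # Fleet-wide check: did the servers handle what the load balancer took in?
    if received > 0:
        dropped = (received - handled) / received
        if dropped > max_drop_ratio:
            alerts.append(
                f"fleet: {dropped:.0%} of requests were not handled by any server"
            )

    # Per-server check: is any one server far below an even split of the work?
    if sample.server_requests and handled > 0:
        fair_share = handled / len(sample.server_requests)
        for name, count in sorted(sample.server_requests.items()):
            if count < fair_share * min_share_ratio:
                alerts.append(
                    f"{name}: handled {count} requests, "
                    f"far below the fair share of {fair_share:.0f}"
                )
    return alerts


if __name__ == "__main__":
    # Made-up numbers: four servers, one of them handling almost nothing.
    window = IngestSample(
        load_balancer_requests=1000,
        server_requests={
            "webhook-1": 300,
            "webhook-2": 300,
            "webhook-3": 300,
            "webhook-4": 20,
        },
    )
    for message in find_ingest_alerts(window):
        print(message)
```

Comparing each server against an even split is the simplest possible baseline; comparing against that server's own past performance, as mentioned above, would need stored history and is left out of this sketch.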

zach · October 9, 2020, 5:29pm

Hey folks - here’s a link to the postmortem we published to the status page last night:

Thanks for raising the flag on this one and if you have any questions on the postmortem, the work that we’ve done to address the root cause of the issue, or the reliability investments we’ll be making in Q4, feel free to raise those questions here.

system · April 10, 2021, 5:29am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.
