[Solved] Flashing Red after Handshake (since this morning)

Hi All,

Just echo’ing what Joe said, sorry we were late to catch this (sleeping), normally pagerduty wakes us up within minutes when this sort of thing happens. We suspect it’s related to database maintenance done by a vendor that we weren’t warned about ahead of time.

Thanks,
David

2 Likes

Alrighty…so, @Dave and @zachary and others on the Spark Cloud team have analyzed our logs and graphs the causes behind this issue and conjured up a summary and plan. In a nutshell:

The root cause appears to be a flood of connection attempts early this morning that caused a process to fail due to another bug we’re rolling out a fix for this week. When the service attempted to recover, a race condition prevented the server from recovering normally. We’ve patched that bug to help things recover more gracefully, and we’re rolling out a patch for the original issue soon as well.

Additionally, we’re going to add an alert to our monitoring infrastructure based on connection duration to make sure a device can maintain a connection for at least a few minutes. We’re also spinning up extra, more powerful, servers this week to help manage the load on the cloud.

I hope that helps clarify what happened and what we’re doing about it. Thanks again for your patience and promptness in reporting what you were seeing here.

-joe

6 Likes