Discourse Down for about 2 hours tonight (due to AWS issues)

Hi there, Spark Community - Thanks for your patience as our crack team of engineers ( @jgoggins @Dave!! ) worked for the last two hours to bring back the Forum. Please enjoy yourselves responsibly, and thank you for being amazing.

Hey Folks,

Thanks for sticking with us. According to a very helpful and knowledgeable Amazon tech rep, community.spark.io went down because of “an underlying hardware fault which caused the instance to become unresponsive. Unfortunately, this issue affected one piece of hardware and you were unlucky to be have this occur to your instance.”

It sounded like a freak occurrence. When I asked why it didn’t appear on the data center’s status page, he said “the status page updates are for wide scale issues.” This surprised me. If the physical hardware was what failed, there were likely other virtual machines impacted too. At what scale does a system failure warrant a status update or a notification to those who manage that infrastructure? I would have expected to receive some notification. But of course, from Amazon’s perspective, the recommendation is to architect for high availability so systems self-heal under scenarios like this (as the other parts of the Spark Cloud already do). Doing this for Discourse is on the roadmap.

It is a shabby thing that the computing power behind this wonderful digital habitat got benched for a few hours due to a freak occurrence, and I’m glad we’re up again. If anyone has any follow-up questions, please post them here or reach out to me directly.

Thanks for your patience and happy hacking.

-joe
