Reliability of the Device Broker for cloud connections

Hi,

As I was preparing to order some more products, I started wondering whether they are actually what I want. I was curious about reliability.

I spotted this outage from March: https://status.particle.io/incidents/0h2xjj4sqk9t. Basically, a single server was handling connectivity for all devices in the world, and it failed.

I have a few questions.

Was the rearchitecture of the device broker mentioned in the postmortem done, and is it now replicated horizontally and geographically? Could you provide information about the worldwide distribution of the servers that act as the entrance to the cloud? How are they discovered (static list of IPs, DNS load balancing, two-level load balancing with DNS and hardware TCP load balancers)? How would one track this independently (other than dedicating one Photon to pinging, establishing connections, and doing dummy operations for testing, while sniffing traffic in the meantime)?
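A minimal sketch of such an independent check, in Python, run from an ordinary computer rather than a dedicated Photon. It only tests whether a TCP connection can be opened and how long that takes; it does not speak the device protocol. The function names and the port number in the usage note are assumptions for illustration, not something taken from Particle documentation.

```python
import socket
import time


def probe_tcp(host, port, timeout=5.0, connect=socket.create_connection):
    """Try one TCP connection and report success and elapsed seconds.

    `connect` is injectable so the function can be exercised without a network.
    """
    start = time.monotonic()
    try:
        conn = connect((host, port), timeout)
    except OSError as exc:  # covers DNS failures, refusals, and timeouts
        return {"host": host, "port": port, "ok": False,
                "seconds": time.monotonic() - start, "error": str(exc)}
    conn.close()
    return {"host": host, "port": port, "ok": True,
            "seconds": time.monotonic() - start, "error": None}


def availability(results):
    """Fraction of successful probes in a list of probe_tcp results."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["ok"]) / len(results)
```

Calling something like `probe_tcp("device.spark.io", 5683)` every minute from a few locations and feeding the results to `availability` would give a rough outside view of uptime and connect latency. The port 5683 here is my assumption and should be replaced with whatever port the device service actually listens on.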

What actions were taken to ensure that monitoring (and alerting of on-call personnel) is itself monitored and does not fail (i.e. liveness checks, meta-monitoring or cross-monitoring, fallback to other notification methods, and escalation to other people if needed, with delays of a few minutes at most)? I hope you realize that this is the core of your business, and reliable operation of monitoring is critical.
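To make the meta-monitoring idea concrete, here is a small Python sketch of a liveness check with escalation: the monitoring system sends a periodic heartbeat to an independent watchdog, and if the heartbeats stop, the watchdog pages one person, then the next, after fixed delays. The class name, field names, and thresholds are hypothetical; this is not a description of how Particle's monitoring works.

```python
from dataclasses import dataclass


@dataclass
class Watchdog:
    max_silence: float      # seconds without a heartbeat before the first page
    escalate_every: float   # seconds between further escalation steps
    last_beat: float = 0.0  # timestamp of the most recent heartbeat

    def beat(self, now):
        """Record a heartbeat from the monitoring system."""
        self.last_beat = now

    def escalation_level(self, now):
        """0 = healthy, 1 = page primary on-call, 2 = page secondary, and so on."""
        silence = now - self.last_beat
        if silence <= self.max_silence:
            return 0
        return 1 + int((silence - self.max_silence) // self.escalate_every)
```

The point of the design is that the watchdog alerts on the absence of a signal, so a monitoring system that has silently died is still noticed. It should run on separate infrastructure from the thing it watches.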

I guess this is what you list as “Device Service” on https://status.particle.io/. However, there is no access to graphs, or to daily/monthly reports on availability, latency, or error-ratio targets. And if the service is now distributed, the status is not broken down by location.

PS. Building the architecture and services to be horizontally scalable not only makes them easier to scale (to handle more traffic and devices) and more reliable, but also allows you to do frequent (i.e. daily) software updates, machine maintenance, or service moves between machines with zero external impact. It also allows testing a new version of the software on a limited number of devices (e.g. 10% of all devices using one server for a few hours, with the rest still using the previous version on other servers). So this is very important.
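One common way to pick such a limited group is to hash each device ID into a bucket from 0 to 99 and send the devices below a threshold to the new version. The Python sketch below illustrates that; the function name and the `salt` parameter are hypothetical, and nothing here is taken from Particle's actual deployment process.

```python
import hashlib


def in_canary(device_id, percent, salt="rollout-1"):
    """Return True if this device falls into the canary group.

    The same device always gets the same answer for a given salt, so it
    stays on one version for the whole test. Changing the salt reshuffles
    which devices are selected for the next rollout.
    """
    digest = hashlib.sha256((salt + ":" + device_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket 0..99
    return bucket < percent
```

With `percent=10`, roughly one device in ten would be routed to the server running the new version, and raising the number gradually widens the rollout.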

PS2. Looking at https://github.com/particle-iot/firmware/blob/37116a564253442c1e2c25e379ab16d5cef7230c/system/src/system_cloud_internal_constants.c, it appears that the firmware connects by DNS name to device.spark.io, with $id.udp.particle.io as a backup. The first one appears to be hosted on Amazon; my computer resolves it to 52.204.226.242, which is in the US. I used https://www.whatsmydns.net/#A/device.spark.io to quickly check from other locations, and it resolves to about 10 different addresses, all hosted in AWS, but all of them appear to be in the US (none in Europe). I am not sure whether they are in one or multiple data centers.
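The local part of that check can be scripted with a few lines of Python. The sketch below lists the distinct IPv4 addresses that the machine's own resolver returns for a hostname; the function name is hypothetical. It only shows one vantage point, so it does not replace a multi-location tool like the one linked above, and it says nothing about where the addresses are physically hosted.

```python
import socket


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses for a hostname.

    `resolver` is injectable so the function can be tested offline.
    """
    infos = resolver(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})
```

Running `resolve_ipv4("device.spark.io")` repeatedly over a day and logging the results would show how many addresses rotate through DNS from one location.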