Electrons Disconnected Simultaneously Across the US

Around 5:30pm CST, approximately 10 of our remote Electron devices stopped being able to connect to the Particle Cloud. They all still connect to my other network services (MQTT) fine, but the constant attempts to reconnect to the Particle Cloud block my code enough to be problematic, since they can never reconnect.

These devices all went offline at the same time across the US (Chicago, Grand Rapids, Monroe City, MO, Boston), and on both T-Mobile and AT&T Networks. A hardware watchdog is resetting the devices with the reset pin, and the firmware is also explicitly resetting the cellular modem.

Was there any possible Particle outage at that time? This behavior looks very similar to when I’ve programmed Electrons with bad keys, but it happened to devices that had been connected anywhere from hours to months prior to this event.

It seems the Particle cloud is continually rejecting the connection request as far as I can tell. Besides bad keys, are there any other possible causes I could explore for such behavior?

After 4 hours of disconnection, all but one of the devices came back online, all at the same time and with no apparent trigger.

After further investigation of my cellular provider's detailed connection logs, I have discovered that similar but far shorter events have likely occurred in the past, each affecting a different subset of devices at the same moment. So while the immediate need for resolution may be gone, I am still very concerned about what could be causing this behavior.

The weirdest part is that, given the distribution across network providers and geography, the only common variable seems to be the Particle Cloud (or, in theory, my MQTT service, though that was working fine as far as I could tell); yet the outages were synchronized in time to within tens of seconds of each other.

And for further details on my configuration:

System firmware: 0.6.4 & 0.7.0

Application firmware: a wide range of versions

System-thread: Enabled

3rd-party SIM, with keepAlive set from the Particle cloud connection event in application firmware (see the sketch below)

All networking code (Particle cloud connection maintenance and MQTT publishing) runs in the main loop thread context. A couple of other real-time data I/O tasks run in other threads, but nothing particularly CPU-intensive.

I don’t publish to Particle or interact with the Cloud in any way except for the keepAlive and reconnecting in case I get disconnected.
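For reference, the keepAlive handling looks roughly like this. The 30-second interval is just a placeholder, since the right value depends on the carrier's UDP/NAT timeout:

#include "Particle.h"

SYSTEM_THREAD(ENABLED);

// Placeholder interval for a 3rd-party SIM; the right value depends on the
// carrier's UDP/NAT timeout.
const unsigned int KEEPALIVE_SECONDS = 30;

// Called by the system firmware whenever the cloud connection state changes.
void cloudStatusHandler(system_event_t event, int param) {
    if (param == cloud_status_connected) {
        // Re-apply the keepAlive after every cloud connection, since the
        // default assumes a Particle SIM.
        Particle.keepAlive(KEEPALIVE_SECONDS);
    }
}

void setup() {
    System.on(cloud_status, cloudStatusHandler);
}

void loop() {
    // Particle cloud connection maintenance and MQTT publishing live here.
}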

The same thing happened here in NZ for about 24 hours; they started coming back online about 90 minutes ago.

I tracked it down… it was a Particle DNS resolution error on the Electrons. Our MQTT server was resolving fine, so the issue is at Particle's end.

I sent the logs to Particle via the support web page, so unfortunately I don't have a copy of the error other than the log I sent to Particle. (I had full trace enabled, so I could see what the modem was doing.)

In short, the Electron tries to resolve an address of the form <device_id>.udp.particle.io, and this lookup was failing. Yet immediately after it failed, the Electron successfully resolved our MQTT server address.

The Electron would try to resolve it 3 times, wait, try again, and then reboot. I was able to prevent the reboot using

System.disable(SYSTEM_FLAG_RESET_NETWORK_ON_CLOUD_ERRORS);

which allowed a bit more debugging, and I could see that the MQTT session would connect and run fine, even though the Electron was hammering away trying to resolve the DNS entry for Particle.
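For anyone wanting to do the same, the application-side part is roughly this; note that the AT-level trace in my log came from having full trace output enabled, which on 0.6.x/0.7.0 may additionally require a debug build of the system firmware:

#include "Particle.h"

// Route system INFO/TRACE/WARN messages to USB serial.
SerialLogHandler logHandler(LOG_LEVEL_ALL);

void setup() {
    // Stop the system firmware from resetting the modem (and rebooting)
    // on repeated cloud errors, so the MQTT session can keep running
    // while the Particle DNS lookup is failing.
    System.disable(SYSTEM_FLAG_RESET_NETWORK_ON_CLOUD_ERRORS);
}

void loop() {
}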

Interestingly enough, I could ping <device_id>.udp.particle.io from a command line on my PC and it would resolve (it did not reply, but that was expected).

I have about 50 units that were rebooting constantly. Each time they rebooted I got an email; several thousand of them in total!

Same issue here at about the same time. However, I was seeing handshake attempts (I currently have my devices perform a new handshake at each connection attempt, for another issue). The handshake would register on the console, but no other communication was happening. Mine have not successfully published yet, though.

It does look like a handshake attempt: it will flash cyan quickly, then flash orange twice, then start again.
The flashing orange is the symptom, as it never normally flashes orange on a connection after a reboot.

@rickkas7 Any idea what might be going on?

There does not appear to have been a widespread outage yesterday. However, some services were migrated to different servers.

Under normal circumstances, this just causes the devices to briefly disconnect and reconnect. However, for a very small number of people, the Electrons took a while to switch to the correct server. This appears to be DNS related, but the exact cause is not known yet.

It appears that for some users, the Electron either failed to resolve the DNS lookup for the cloud server, or got the wrong address.

Thanks everyone for those insights, this is already super helpful.

Showing Handshake Attempts
@dcliff9 - How are you monitoring the handshake attempts? Are you just using the cloud connection events with a handler that logs the event, using something related to the debug system firmware version, or something else?

Seems like monitoring the cloud connection events would be a good way to go at a minimum - I’ll put that in my backlog to do. Also - any luck with devices coming back online yet?
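For the event monitoring, something roughly like this is what I have in mind (sketch only, using the system cloud_status event):

#include "Particle.h"

SerialLogHandler logHandler(LOG_LEVEL_INFO);

// Log every cloud connection state change so handshake/reconnect attempts
// show up in the serial output.
void cloudStatusHandler(system_event_t event, int param) {
    switch (param) {
        case cloud_status_connecting:    Log.info("cloud: connecting");    break;
        case cloud_status_connected:     Log.info("cloud: connected");     break;
        case cloud_status_disconnecting: Log.info("cloud: disconnecting"); break;
        case cloud_status_disconnected:  Log.info("cloud: disconnected");  break;
        default: break;
    }
}

void setup() {
    System.on(cloud_status, cloudStatusHandler);
}

void loop() {
}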

Electron Network Reset Behavior With This Issue
@marshall - What SYSTEM_THREAD setting are you running? I’m having trouble determining whether the network was getting reset or whether I was just timing out the MQTT connection (based on logs from mosquitto) because the Particle connection attempt was blocking my application code for 20-30 seconds. I’m comfortable just turning off that reset regardless, but I would be curious to learn more about how you identified that specific state as the thing you were hitting.

I had threading enabled, but only for non-consequential stuff; the MQTT is in the main loop.

I had semi-automatic mode enabled.
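So the skeleton of my setup is roughly:

#include "Particle.h"

// Semi-automatic mode: the application decides when to connect to the cloud;
// once connected, the system firmware maintains the connection.
SYSTEM_MODE(SEMI_AUTOMATIC);
SYSTEM_THREAD(ENABLED);

void setup() {
    Particle.connect();   // explicitly start the cloud connection
}

void loop() {
    // MQTT runs here in the main loop; other threads only do
    // non-consequential work.
}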

The network was definitely getting reset by the system firmware.

Here is a log of the issue.

[ Modem::join ] = = = = = = = = = = = = = = = =
180.780 AT send 12 "AT+CGATT=1\r\n"
180.792 AT read OK 6 "\r\nOK\r\n"
180.792 AT send 14 "AT+UPSND=0,8\r\n"
180.802 AT read + 17 "\r\n+UPSND: 0,8,0\r\n"
180.812 AT read OK 6 "\r\nOK\r\n"
180.812 AT send 23 "AT+UPSD=0,7,"0.0.0.0"\r\n"
180.824 AT read OK 6 "\r\nOK\r\n"
180.824 AT send 22 "AT+UPSD=0,1,"direct"\r\n"
180.836 AT read OK 6 "\r\nOK\r\n"
180.836 AT send 15 "AT+UPSD=0,6,0\r\n"
180.846 AT read OK 6 "\r\nOK\r\n"
180.846 AT send 14 "AT+UPSDA=0,3\r\n"
183.056 AT read OK 6 "\r\nOK\r\n"
183.056 AT send 14 "AT+UPSND=0,0\r\n"
183.066 AT read + 31 "\r\n+UPSND: 0,0,"IP Address of Electron"\r\n"
183.078 AT read OK 6 "\r\nOK\r\n"
0000183079 [system] INFO: ARM_WLAN_WD 2
0000183079 [system] INFO: CLR_WLAN_WD 1, DHCP success
0000183081 [system] INFO: Cloud: connecting
0000183081 [system] TRACE: sparkSocket Now =-1
0000183083 [system] INFO: Read Server Address = type:1,domain:$id.udp.particle.io
183.084 AT send 56 "AT+UDNSRN=0,"device_id.udp.particle.io"\r\n"
184.006 AT read ERR 9 "\r\nERROR\r\n"
184.007 AT send 56 "AT+UDNSRN=0,"device_id.udp.particle.io"\r\n"
184.919 AT read ERR 9 "\r\nERROR\r\n"
184.920 AT send 56 "AT+UDNSRN=0,"device_id.udp.particle.io"\r\n"
185.772 AT read ERR 9 "\r\nERROR\r\n"
0000185774 [system] ERROR: Cloud: unable to resolve IP for device_id.udp.particle.io
0000185774 [system] WARN: Cloud socket connection failed: -1
0000185776 [system] TRACE: Internet test socket
socketSocket(TCP)
185.775 AT send 12 "AT+USOCR=6\r\n"
185.785 AT read + 13 "\r\n+USOCR: 0\r\n"
185.795 AT read OK 6 "\r\nOK\r\n"
Socket 0: handle 0 was created
0000185796 [system] TRACE: socketed testSocket=0
0000185796 [system] TRACE: Connect Attempt
socketConnect(0,port:53)
185.797 AT send 25 "AT+USOCO=0,"8.8.8.8",53\r\n"
185.889 AT read OK 6 "\r\nOK\r\n"
0000185890 [system] TRACE: socket_connect()=success
0000185890 [system] TRACE: Close
socketClose(0)
185.891 AT send 12 "AT+USOCL=0\r\n"
186.161 AT read OK 6 "\r\nOK\r\n"
socketFree(0)
0000186162 [system] WARN: Internet available, Cloud not reachable!
0000186162 [system] WARN: Handling cloud error: 3
0000189864 [system] INFO: Cloud: connecting
0000189864 [system] TRACE: sparkSocket Now =-1
0000189866 [system] INFO: Read Server Address = type:1,domain:$id.udp.particle.io
189.867 AT send 56 "AT+UDNSRN=0,"device_id.udp.particle.io"\r\n"
190.739 AT read ERR 9 "\r\nERROR\r\n"
190.739 AT send 29 "AT+UDNSRN=0,"my_mqtt_server.com"\r\n"
191.070 AT read + 27 "\r\n+UDNSRN: "11.22.33.44"\r\n"
191.080 AT read OK 6 "\r\nOK\r\n"
socketSocket(TCP)
191.080 AT send 12 "AT+USOCR=6\r\n"
191.090 AT read + 13 "\r\n+USOCR: 0\r\n"
191.100 AT read OK 6 "\r\nOK\r\n"
Socket 0: handle 0 was created

Sorry for the delay. I am traveling at the moment, so I don’t have access to the field devices. Basically, they are showing handshakes in the Particle console: the "Last Handshake" date and time has been updating as I would expect. I have a bit of code that gives an SOS flash on an external LED when there is an error connecting to the Particle cloud, and one of my customers has indicated his device is showing this error. So from this I can surmise that I am getting cellular connections (according to my cell provider) and am able to successfully handshake (according to the Particle console), but am not actually getting a successful Particle.connected(). So of course, a Particle.publish() will not work.
But if there is no known outage, I suppose there must be something else happening. I will troubleshoot more in depth once I have access to one of these devices. It just seems strange that the OP and I had devices in different geographical locations hit such an issue around the same time.
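The SOS indication itself is nothing fancy; roughly this, with a placeholder pin and timeout:

#include "Particle.h"

// Placeholder pin and timeout; the real values depend on the hardware.
const int ERROR_LED_PIN = D7;
const unsigned long CLOUD_TIMEOUT_MS = 5UL * 60UL * 1000UL;

unsigned long lastConnectedMs = 0;

// Blink one SOS pattern (... --- ...) on the external LED.
void blinkSOS() {
    const int onTimes[] = {200, 200, 200, 600, 600, 600, 200, 200, 200};
    for (int i = 0; i < 9; i++) {
        digitalWrite(ERROR_LED_PIN, HIGH);
        delay(onTimes[i]);
        digitalWrite(ERROR_LED_PIN, LOW);
        delay(200);
    }
}

void setup() {
    pinMode(ERROR_LED_PIN, OUTPUT);
}

void loop() {
    if (Particle.connected()) {
        lastConnectedMs = millis();
    } else if (millis() - lastConnectedMs > CLOUD_TIMEOUT_MS) {
        blinkSOS();   // cloud has been unreachable for too long
    }
}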

Some of the device-facing servers that the Electron sends packets to moved to different IP addresses. Normally, when this happens, the Electron/E-Series will do a DNS lookup for its cloud server, discover the new IP address and be on its way.

The DNS entries had a time-to-live (TTL) of 10 seconds, so this process should have been quick.

For a number of devices, the Electron failed to retrieve the new IP address of the cloud server, and this is what caused the connections to fail, as they were connecting to the old, no-longer-in-use IP address.

We’ve found some evidence that some intermediate DNS servers incorrectly handle TTLs of less than 60 seconds, disregarding the TTL and instead using a much longer TTL. We’ve increased the TTL to 60 seconds to prevent that problem.

Also, we’ll try to avoid moving servers that way in the future.
