Multiple devices still offline in the fleet after yesterday's EtherSIM incident

I understand Particle experienced a significant outage yesterday affecting EtherSIM devices, as reported here: Particle Status - EtherSIM connectivity incident

Although rare, I understand that these events can happen, and I appreciate how the Particle team communicates status and resolves these issues quickly.

As of 10:30 AM CST on 3/31, 98% of my fleet is back online. However, I still have a handful of devices in the fleet that remain offline. Here are 3 examples that I'm aware of; I suspect there are a few more, though likely fewer than 10. I emailed Particle support this morning, but their response was to post here in the community and tag Particle Support, so I am making this post. @Support @no1089

Device IDs:
e00fce683179339af3d78071 | Last heard: Mar 30th 2025, 6:33 pm | Powered at 100% battery and charging
e00fce688a1b66d1bc02b5b4 | Last heard: Mar 30th 2025, 6:41 pm | Powered at 98% battery and charging
e00fce68af5c5fa6c0cf928d | Last heard: Mar 30th 2025, 6:41 pm | Powered at 95% battery and charging

Here is an image from the Particle console indicating one of these devices is still offline:

Furthermore, I am using the AB1805 watchdog per the application note (roughly as sketched below), so these devices are likely power cycling themselves every 11 minutes, and I don't think walking up to the device to turn the power off and on will correct the issue.
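For reference, the watchdog wiring is essentially the pattern from the AB1805_RK application note. This is a simplified sketch rather than my exact firmware, and it assumes the library's setup()/loop()/setWDT() calls:

#include "AB1805_RK.h"

AB1805 ab1805(Wire);   // RTC/watchdog on the primary I2C bus

void setup() {
    ab1805.setup();                                // detect and configure the AB1805
    ab1805.setWDT(AB1805::WATCHDOG_MAX_SECONDS);   // enable the hardware watchdog
}

void loop() {
    ab1805.loop();   // pets the watchdog; if this stops running, the AB1805 resets the device
}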

Since all 3 of these devices went offline at the time of this EtherSIM incident, within 8 minutes of each other, and have all remained offline, it seems this incident is not fully resolved yet. I'd appreciate any attention the fine team at Particle can give to resolving it.

2 Likes

I too have a site (two devices) that is still offline as of 1135 PDT 3/31/25 (both are Boron 404X devices; one is on 6.2.0, the other on 6.2.1). All my other devices came back last evening (even those with an identical user firmware build but a different geographical location).

My devices operate with external power (remote sensing/control in an industrial setting), and while I make judicious use of both the Application Watchdog and the Hardware Watchdog, I only have a simple

if (!Particle.connected()) { Particle.connect(); } 

in my main().
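For context, a simplified sketch of how that sits in my loop; petWatchdogs() and runControlTasks() are placeholders, not the actual firmware:

#include "Particle.h"

void petWatchdogs()    { /* service the application and hardware watchdogs */ }
void runControlTasks() { /* normal sensing/control work */ }

void loop() {
    petWatchdogs();

    if (!Particle.connected()) {
        Particle.connect();   // fire-and-forget; no waitFor(), no back-off
    }

    runControlTasks();
}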

I too would appreciate any additional insight/updates - thank you for all your efforts thus far!

1 Like

@laughlin - thanks for posting your details here as well. Sounds like I'm not the only one. Hopefully @Support recognizes this and is in the process of addressing it. The more people who can point this out, the more info they will have on hand to resolve it, so thank you!

1 Like

Hi Jeff,

e00fce683179339af3d78071 - We heard from it about an hour ago. It was attempting to connect to Verizon, which is not an allowed carrier for your devices.
e00fce688a1b66d1bc02b5b4 - Heard from within the last hour. Multiple attempts on AT&T.
e00fce68af5c5fa6c0cf928d - Was online 45 minutes ago; had multiple connection attempts on AT&T.

One crucial question is how much time you allow your devices to connect.
Devices need 3-5 minutes to change carrier profiles, and about 10 minutes for a modem reset. Without allowing a profile change, the device cannot attempt other towers in the area.
A modem power cycle is sometimes required if a device is really stuck.
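Purely as an illustration (not code from Jeff's firmware), a last-resort power cycle with those timings in mind could look like this; the helper name and thresholds are assumptions:

#include "Particle.h"

// Call only after a long (10+ minute) failed connection window.
void powerCycleModem() {
    Log.info("connection stuck, power cycling the modem");
    Cellular.off();                      // fully power down the modem
    waitFor(Cellular.isOff, 60000);      // give it time to shut down cleanly
    delay(10000);
    Cellular.on();
    Particle.connect();                  // then allow another long connection window
}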

I suspect a short connection time, coupled with carrier issues during the outage, caused the devices to simply struggle to find a workable carrier in the area.

Hi there,

As I stated for Jeff's devices, connection time is crucial. If your watchdog fires because the device is stuck trying to connect (likely due to the logic you are using), it will not have time to switch carriers because it is constantly resetting. I don't have device IDs to review, but I'm pretty sure this is the case.

I would suggest using waitFor() instead and setting a timeout that's shorter than your WD timer.
During adverse connection events, devices should extend their connection time, with at least one last-ditch effort of more than 10 minutes, as well as backing off between connection attempts to preserve battery life.
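A minimal sketch of that pattern; the timing constants and the petWatchdogs() stub are illustrative assumptions, not code from this thread:

#include "Particle.h"

SYSTEM_MODE(SEMI_AUTOMATIC);
SYSTEM_THREAD(ENABLED);

void petWatchdogs() { /* service the hardware and application watchdogs here */ }

unsigned long nextAttemptMs = 0;   // earliest time we may start another wait slice
unsigned long backoffMs = 0;       // grows after each failed slice

void loop() {
    petWatchdogs();

    if (!Particle.connected() && (long)(millis() - nextAttemptMs) >= 0) {
        Particle.connect();        // non-blocking with threading enabled

        // Wait in a slice shorter than the shortest watchdog period, then
        // return to loop() so the watchdogs keep getting serviced.
        if (waitFor(Particle.connected, 15000)) {
            backoffMs = 0;         // connected: reset the back-off
        } else {
            // Failed slice: back off (capped) while the modem keeps working in
            // the background, so it still gets a long window to change carrier profiles.
            backoffMs += 30000UL;
            if (backoffMs > 300000UL) backoffMs = 300000UL;
            nextAttemptMs = millis() + backoffMs;
        }
    }
}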

1 Like

I am allowing up to 11 consecutive minutes of connection time before issuing an ab1805.deepPowerDown(). This is the code base I am using:

        // Failure to connect exceeds connectMaxTime (typically 11 minutes).
        else {
            if (cloudConnected) {
                cloudConnected = false;
                cloudConnectStarted = System.millis();
                Log.info("lost cloud connection");
            }
            uint32_t elapsed = (uint32_t)(System.millis() - cloudConnectStarted);
            if (elapsed > connectMaxTime.count()) {
                Log.info("failed to connect to cloud, doing deep reset");
                delay(100);

                // If the watchdog is enabled (a power relay is NOT in the ON position),
                // do a deep power down reset.
                if (watchdogEnable == 1) {
                    ab1805.deepPowerDown();
                }
            }
        }

In this case, I am pretty sure the watchdog is enabled on these devices, as it is only disabled when the device is actively controlling something.

I appreciate the response. I'll continue to work with the end user and monitor a bit more on this end. Maybe something else is at play in this handful of devices. I'll post here once I determine more. Thank you for your support!

1 Like

Hi Chris,

My two device IDs in question are below; if you have any additional insight, I would appreciate a look from your end.
e00fce68b6a12b2cb3dcbdac
e00fce6845c037b44b292d94

To be more verbose: I have the hardware WDT set for 30 s and the app WDT set for 20 s; failure to pet the app WDT immediately sets a persistent variable and resets the device.

Both the app and HW WDTs are activated promptly upon entry into setup(). At the end of setup() (no heavy lifting going on there, just GPIO/Serial/I2C/SPI setup and config), I attempt to wait for a connection before entering my main loop by calling await_cloud_connection(30) (see below).

Unless Particle.connect() is blocking for >20s, I don't think my WDTs should be firing due to lack of connection.

void await_cloud_connection(const unsigned int timeout_sec) {
  const unsigned long start = millis();
  while (!Particle.connected()) {
    if (millis() - start > timeout_sec * 1000UL) {
      Log.error("Timed out waiting for cloud connection");
      break;
    }
    Particle.connect();    // request/continue the cloud connection
    delay(100);
    Particle.process();
    pet_watchdogs();       // keep both WDTs serviced while waiting
  }
}

Edit: further reading of the docs indicates Particle.connect() is potentially a blocking call, and perhaps the connection is not completing within the 20 s window of my app WDT?

Can @Support please provide some clarity around the implications of calling Particle.connect()?

From the docs as of 3/31/25:

...
After you call Particle.connect(), your loop will not be called again until the device finishes connecting to the Cloud. Typically, you can expect a delay of approximately one second.
...

Is Particle.connect() a BLOCKING call, or does it enqueue something in a system thread that will in turn BLOCK execution of the user loop on the next iteration, until success/failure/timeout?

Thus, am I to understand (based on @no1089's comment above) that I need to plan for edge cases where my thread can be blocked for 3-5, or even 10, minutes if I call Particle.connect()?

That sentence only applies when the system thread is disabled. It does not block when threading is enabled, which we recommend always using (and which is the default in 6.2.0 and later). I removed the sentence to avoid confusion.
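For example, with threading enabled, something like this does not stall the application loop (a minimal sketch):

#include "Particle.h"

SYSTEM_MODE(SEMI_AUTOMATIC);
SYSTEM_THREAD(ENABLED);   // explicit here; already the default in 6.2.0 and later

void setup() {
    Particle.connect();   // returns immediately; the connection proceeds on the system thread
}

void loop() {
    // loop() keeps running while the connection is negotiated in the background,
    // so watchdogs petted here are not starved by the connect attempt.
}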

Thanks for the update and the additional detail in the docs.

Are you able to see anything from your end for the two device IDs I mentioned above, as Chris did for Jeff's devices?

Based on your description, and the fact that I operate in SEMI_AUTOMATIC mode with SYSTEM_THREAD enabled and never sleep the devices, Particle.connect() should not block or delay execution of my thread at any point, and thus should not trigger my WDTs (Chris's hypothesis as to why my two devices are still offline).

The two devices in question are running completely different firmware, but both have the above-mentioned WDT structure and have been operating in the wild for some time. Both have been reset remotely via the WDTs and locally via power cycling during this time, with no issue.

Any additional insight into device status would be helpful, as I'm now looking at potential cross-country flights to bring these two devices back up at a customer's site.

Do we have any additional insight into what caused the cellular outage for the EtherSIM devices? Was there a system "upgrade" that was supposed to be seamless and invisible to end users but didn't go to plan? I'm wondering if a system-wide change was made that negatively impacted a small percentage of devices: certain towers or carriers cut over to new hardware, old cell towers were taken offline, a previously supported cellular band was changed or eliminated, or some other cellular network backhaul change was deployed. Perhaps deploying that change unintentionally caused the outage, and while they recovered from the outage, whatever change was made is still active?

I still have 3-4 end users and the conversation goes like this:

  • The device worked flawlessly in this exact location for the last 2+ years.
  • The device went offline as part of the major cellular outage described above.
  • Since then, it won't connect at all, even many hours after the outage ended.
  • I take it home or to a different location and it connects.
  • I take it next to another device that is connected, and it connects there.
  • I take it back to the original location, where it worked fine for the last 2 years, and it doesn't connect.

I completely understand most of this is outside of my control, Particle's control, and even the EtherSIM provider's control, as the black magic of cellular connectivity is hard and has many layers of abstraction. I am also not naive enough to think cellular connectivity won't change in certain areas over time as technology changes. But having some insight into what took place during this cellular outage could help users like myself understand why this is happening and/or what we can do about it for these few edge-case end users.

I still have 3-4 devices from 3 different end users that all exhibit this same behavior. The only logical conclusion I can draw is that something within the cellular network changed. Is there any other logical explanation, or anything else I can try/do/test? The only other course of action I can think of is having these end users attempt a cell phone booster (We Boost or equivalent) to strengthen the cellular signal in that area, and/or find a different location on the property where the device is able to connect. It's just very hard to explain that to the end user, since from their perspective the device has worked fine in this exact location for 2+ years. @no1089 - When you indicated those device IDs connected, you are correct, they did, but per my end users they connected when taken home to try. They connect at home but won't connect back in the original location.

The only thing I can think of to look at is to interrogate the Device Vitals data published from these few devices over the last few months, identify the cell tower each was connected to, and see whether it has ever connected to that same cell tower since the outage, just to help validate this hypothesis so we can figure out how best to get these few devices operational again.
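One thing I could also add to the firmware going forward is logging the serving cell at connect time, so tower changes become visible. A rough sketch, assuming the CellularGlobalIdentity struct and the cellular_global_identity() HAL call in recent Device OS releases (worth verifying against the docs; this is not in my current firmware):

#include "Particle.h"

// Log the serving cell's identity; CGI_VERSION_LATEST and the field names
// below follow the cellular HAL header and should be verified for your Device OS version.
void logServingCell() {
    CellularGlobalIdentity cgi = {};
    cgi.size = sizeof(CellularGlobalIdentity);
    cgi.version = CGI_VERSION_LATEST;

    if (cellular_global_identity(&cgi, nullptr) == SYSTEM_ERROR_NONE) {
        Log.info("serving cell: mcc=%u mnc=%u lac=%lu ci=%lu",
                 cgi.mobile_country_code, cgi.mobile_network_code,
                 (unsigned long)cgi.location_area_code, (unsigned long)cgi.cell_id);
    } else {
        Log.warn("could not read cell identity");
    }
}

Comparing that output against the towers recorded in the historical Device Vitals for these devices would help confirm or rule out a tower change.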

Any other words of advice?

1 Like

I'm happy to report that the 4 devices that had been offline since the outage, and would only connect in a different location, are as of this morning connected again in the original locations they occupied for the last 2 years. Oddly enough, all 4 devices started behaving correctly earlier this morning, across 3 different locations. No idea if this was something Particle identified, the EtherSIM provider identified, or maybe the specific carrier identified. Whatever it was, I'm just super excited that it's fixed. :slight_smile:

My only guess is that something was decommissioned, upgraded, or newly brought online by the cellular carriers or cell towers at the same time as this outage, and whatever it was has now been corrected. I'll chalk this up to the "black magic" of cellular connectivity.

1 Like