Device Recovery after 03-14-19 Gen3 Cloud Outage

Prior to the Gen3 Outage yesterday, I’d been testing two Boron LTEs with similar user firmware on 0.9.0 for Solar Powered River (Water Level) Monitors.

Both use SYSTEM_MODE(MANUAL); & SYSTEM_THREAD(ENABLED);, perform a NO_ACK Publish, and sleep for 1 hour with System.sleep( {}, {}, sleepTime);.
I call System.reset(); after the Sleep Event in the hopes that the Boron can handle future problems without a physical reset, since these will be installed on the sides of bridges at remote locations.

Boron1 uses the internal Particle SIM
Boron2 uses a 3rd-party SIM

I was still testing these in my office, with good results, until the Outage yesterday.
However, 8 Hours after the Outage was reported “resolved”, neither Boron has successfully Published.
Both Borons still connect to Cellular and the Particle Cloud quickly (every hour), breathing cyan, and go back to sleep like normal, but neither Boron’s Publish has made it to the Cloud since the Outage.

I pressed the RESET button on both Borons, and there is still no sign of them in each Device’s Console & no publish in the Event Stream, even though they both breathe cyan and go back to sleep as normal.

The Strange Part:
I briefly unplugged the Li-Po on Boron2, and it can now successfully publish again.
I haven’t unplugged Boron1 yet, in an effort to determine if it will ever recover on its own without physical intervention.

What happens when removing power that doesn’t happen with a call to System.reset() after the Boron wakes up? Note: I “believe” the Modem is successfully powered-off prior to Sleeping as the Sleep Current is 1.2 mA.

Does anyone have thoughts or suggestions on how I can mitigate this problem in the future after a similar Cloud Outage ?

The two primary goals are:

  • not stay awake and blow through the battery if something goes wrong (working fine so far)
  • not require physical intervention at a remote site (didn’t recover after yesterday’s outage).
Basic Code, reduced
SYSTEM_MODE(MANUAL);
SYSTEM_THREAD(ENABLED);

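// eventName, msg, and read_sensor() are omitted from this reduced listing.
int sleepTime      = 60 * 60;     // sleep duration in seconds (1 hour)
int connectionFail = 3 * 60000;   // cloud connection timeout in ms (3 minutes)

// Wait t milliseconds while still servicing the cloud connection.
inline void softDelay(uint32_t t) {
  for (uint32_t ms = millis(); millis() - ms < t; Particle.process());
}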
void setup()  {}

void loop()   {
  read_sensor();
  Cellular.on();
  softDelay(2000);
  Particle.connect();
  softDelay(2000);
 
  if (waitFor(Particle.connected, connectionFail)) {  // Limit the connection attempt
    Particle.publish(eventName, msg, PRIVATE, NO_ACK);
    softDelay(2000);
  }
  Particle.disconnect();
  softDelay(2000);
  Cellular.off();
  softDelay(3000);
  System.sleep( {}, {}, sleepTime);
  System.reset();  // Boron Wakes up here.... force a reset after wake.
}

Thanks in Advance.

Pressing the reset button will not force the modem to undergo a full re-boot. Thus, any problems that have a localized effect on the modem cannot be fixed with a reset only. That said, calling Cellular.off() would normally have a similar effect, which you have in your code. It’s possible something has changed from the Electron, though.

You can try calling a modem reset via Cellular.command before your Cellular.off:

#if Wiring_Cellular
SINGLE_THREADED_BLOCK() {
  // 16: MT silent reset (with detach from network and saving of NVM parameters), with reset of the SIM card
  Cellular.command(30000, "AT+CFUN=16\r\n");
}
delay(2000);
Cellular.off();
#endif

However, if your Boron is connecting to Cellular and to the Particle Cloud, the issue may not lie with your modem.

Nonetheless, there is a simple solution for your stated goals. Use a TPL5110 timer to control the period of your wake time. It will turn on power for your device, and then your device can send a “done” signal to tell the timer it can turn off power for the remainder of the cycle. Then, at the beginning of the next cycle, it turns on your power again. If no done signal is received, it will act as a standard watchdog timer that performs a power reset. This is a fairly traditional method for low-power IoT work that happens periodically.

My Boron reconnected and published just fine, but the 8 Xenons required a System.reset() before they responded to the Mesh Publishes sent to them from the Boron. Particle is working on a fix for that now.

Just wanted to say my Boron LTE recovered and reconnected to Particle Cloud and was pingable without needing any manual intervention. I was not sleeping in my code, just running constant Marco Polo code while connected to a USB plug and 2 amp power adapter.

@justicefreed_amper, thank you for the reply. I've used the TPL5111 successfully with the Boron on several projects. I should have mentioned a 3rd goal for this project:

The three primary goals are:

  1. not stay awake and blow through the battery if something goes wrong (working fine so far)

  2. not require physical intervention at a remote site (didn’t recover after yesterday’s outage).

  3. Use Sleep Modes for future secondary goals. The Boron LTE will eventually vary the awake run-time, sleep schedule, etc. based on environmental conditions and by user request via Particle functions.

For instance: during those times when the Solar Panel can easily produce the 20 mA average demand, the Boron LTE will remain awake so that it does not require a handshake every hour (per sleep event) and is available for OTA, etc.
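
The stay-awake decision in goal 3 could be sketched as a small policy function with hysteresis, so the Boron does not flip between modes on every reading. This is plain C++ with no Particle calls, under stated assumptions: battery state of charge stands in for "the solar panel is keeping up", the thresholds are made-up numbers, and the names (PowerMode, ModePolicy, nextMode) are hypothetical. On a Boron the state of charge might come from the fuel gauge, but that wiring is not shown here.

```cpp
#include <cassert>

// Hypothetical operating modes for the monitor.
enum class PowerMode { SleepHourly, StayAwake };

// Hypothetical thresholds, in percent state of charge.
struct ModePolicy {
    float stayAwakeAbove;  // at or above this, stay connected
    float sleepBelow;      // at or below this, go back to hourly sleep
};

// Pick the next mode. Between the two thresholds the current mode is
// kept, which avoids toggling when the reading hovers near one limit.
inline PowerMode nextMode(PowerMode current, float socPercent,
                          const ModePolicy& policy) {
    assert(policy.sleepBelow < policy.stayAwakeAbove);
    if (socPercent >= policy.stayAwakeAbove) return PowerMode::StayAwake;
    if (socPercent <= policy.sleepBelow)     return PowerMode::SleepHourly;
    return current;
}
```

With a policy of, say, 90% and 70%, a device that is awake at 80% stays awake, and a device that is sleeping at 80% keeps sleeping. A Particle function could overwrite the policy values to cover the "by user request" part.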

UPDATE: I still haven’t unplugged the Li-Po on Boron1 yet, as a test.
It continues to sleep for 1 hour, start up and breathe cyan, and return to sleep.
But there is no evidence of its existence in the Console.

I just booted into Safe Mode, and it's now working properly.
The Handshake Time was updated, it's ping-able, etc.
A subsequent RESET and Boron1 is once again publishing to the Cloud.

The fact that both Borons "thought" they successfully performed Particle.connect (breathing cyan) after the Cloud Outage of several hours means it's going to be hard to identify this case in the future.

I could reconsider the NO_ACK Publish and take action if the publish isn't acknowledged... but that would usually be a System.reset(), which didn't help in this case.
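
One way to act on unacknowledged publishes without relying on System.reset() alone is to count consecutive failures and escalate. The sketch below is plain C++ with no Particle calls, so the names (Recovery, Thresholds, updateFailures, chooseRecovery) are hypothetical. It assumes the publish is switched from NO_ACK to WITH_ACK, that its result is checked, and that the failure counter is kept somewhere that survives System.reset().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical recovery steps, ordered from least to most disruptive.
enum class Recovery : uint8_t {
    None,        // publish was acknowledged, nothing to do
    SoftReset,   // e.g. System.reset()
    ModemReset,  // e.g. an AT+CFUN modem reset before Cellular.off()
    PowerCycle   // e.g. an external timer or the EN pin removes power
};

// Number of consecutive failed publishes that triggers each step.
struct Thresholds {
    uint8_t softReset;
    uint8_t modemReset;
    uint8_t powerCycle;
};

// Update the consecutive-failure counter after one wake cycle.
// The counter saturates at 255 instead of wrapping back to zero.
inline uint8_t updateFailures(uint8_t failures, bool acked) {
    if (acked) return 0;
    if (failures == UINT8_MAX) return failures;
    return static_cast<uint8_t>(failures + 1);
}

// Map the failure count to the strongest step it has reached.
inline Recovery chooseRecovery(uint8_t failures, const Thresholds& t) {
    assert(t.softReset <= t.modemReset && t.modemReset <= t.powerCycle);
    if (failures >= t.powerCycle) return Recovery::PowerCycle;
    if (failures >= t.modemReset) return Recovery::ModemReset;
    if (failures >= t.softReset)  return Recovery::SoftReset;
    return Recovery::None;
}
```

With thresholds of 1, 3 and 6 and hourly wakes, a device that never gets an acknowledgement would ask for a power cycle after about six hours. The last step still needs hardware that can actually remove power, since neither the RESET button nor System.reset() was enough after this outage.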

Did anyone have sleeping Boron LTEs recover after the Outage last night without having to remove power or use Safe Mode? If I'm the only one, then it's not that big of a deal for Particle.

@peekay123 has a Boron 3G in Canada, and it also came back online successfully and was sending data to the cloud again automatically. It was the Xenons that were not pingable.

He was not using Sleep Modes on the Boron that I know of.

None of our network “gateway” nodes have yet recovered (Argon / Boron). Our current plan is to go to site as soon as we can to reset the gateway devices. This behavior is also seen in our lab, even with the Tinker & Blink applications, so I do not suspect code.

If anyone finds a way to recover gateways (and their Networks) which have been isolated since ~5:43 PM ET 3/14/19, please share. We continue to hope our far Boron still has power and may recover.

I’d like to pull the EN pin LOW once a week for remotely installed Borons, since Cellular.off() or System.reset() didn’t allow my Borons to recover from the 03-14-19 Cloud Outage.

It’s easy to connect an I/O pin to the EN pin. Let’s say I use D6 to pull EN low once a week with a millis() check. Since the EN pin is pulled high via a 100k resistor on the Boron, should I wait to declare D6 as an output until just before I pull the plug, or do it in setup() and keep D6 HIGH the entire week until I want to shut down/restart?
The Boron LTE is Solar & Li-po, and sleeps for 1 hour periods using System.sleep( {}, {}, sleepTime);
Where is the best place for pinMode(D6, OUTPUT); and digitalWrite(D6, HIGH/LOW); to not introduce a power sink while sleeping ?
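
A note on the millis() check: millis() restarts from zero after System.reset(), so with the reset-after-every-wake pattern used in this thread it would never count up to a week. A minimal sketch of one alternative is to count wake cycles instead. It is plain C++ with no Particle calls, the names (CycleState, cyclesPerInterval, powerCycleDue) are hypothetical, and it assumes the counter is stored somewhere that survives System.reset(). It only covers when to pull EN low, not where pinMode() and digitalWrite() should go.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kSleepSeconds = 60u * 60u;            // 1-hour sleep, as in the posted sketch
constexpr uint32_t kWeekSeconds  = 7u * 24u * 60u * 60u; // target interval between power cycles

// Number of wake cycles that covers the interval, rounded up.
inline uint32_t cyclesPerInterval(uint32_t intervalSeconds, uint32_t sleepSeconds) {
    assert(sleepSeconds > 0);
    return (intervalSeconds + sleepSeconds - 1) / sleepSeconds;
}

// Counter that must persist across System.reset() (storage not shown).
struct CycleState {
    uint32_t wakesSincePowerCycle;
};

// Call once per wake. Returns true when it is time to pull EN low,
// and clears the counter so the next interval starts from zero.
inline bool powerCycleDue(CycleState& state, uint32_t cyclesNeeded) {
    ++state.wakesSincePowerCycle;
    if (state.wakesSincePowerCycle >= cyclesNeeded) {
        state.wakesSincePowerCycle = 0;
        return true;
    }
    return false;
}
```

With 1-hour sleeps this works out to 168 wakes per week. The real interval will be slightly longer than a week because the awake time of each cycle is not counted.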

Thanks in Advance.

After the interruption, site visits were required for us. We used an external switch to interrupt power to our project gateways (Argon / Boron). Power was more accessible than the module buttons. In one case we tried a reset of an Argon and Safe Mode, but this did not seem to help, so again we went to power.

We have implemented a software watchdog, which calls System.reset(). This runs in our software and is also a published function (just to try!). For the devices involved in the interruption, none recovered via System.reset() AFAIK.

In related “good” news - the behavior of devices becoming inaccessible from the WUI (Console) - seems to be on the decline.

I agree with others’ comments that we should not rely upon the console nor the module lamp to signal/prosecute isolated Gen3 devices. Now our watchdog periodically burns some data to let us know it’s alive.

Thanks to all contributing to this thread. We appreciate the coaching, good ideas, and feedback.

Best, _Todd