deviceOS@2.2.0 - Timeout on Particle Cloud reconnection?

I happened to notice this event when testing an update. In this case, the device had connected to Particle Cloud as evidenced by this log entry:

0000046111 [system] INFO: Cloud connected
0000046112 [app.ab1805] INFO: setRtcAsTm 2021-09-29 12:34:35
0000046117 [app.ab1805] INFO: getRtcAsTm 2021-09-29 12:34:35
0000046118 [app.ab1805] INFO: set RTC from cloud Wed Sep 29 07:34:35 2021
0000046120 [app] INFO: Connected in 44 secs
0000046120 [app] INFO: Cloud connection successful
0000046131 [app] INFO: From Connecting to Idle

But, I suspect there was an issue with the connection because a short while later this happened:

0000046805 [comm.protocol] TRACE: rcv'd message type=8
0000062000 [app.ab1805] INFO: setWDT -1
0000062000 [app.ab1805] TRACE: watchdog set fourSecs=31 bResult=1
0000124000 [app.ab1805] INFO: setWDT -1
0000124000 [app.ab1805] TRACE: watchdog set fourSecs=31 bResult=1
0000140064 [gsm0710muxer] ERROR: The other end has not replied to keep alives (TESTs) 5 times, considering muxed connection dead
0000140065 [ncp.client] TRACE: NCP connection state changed: 1
0000140065 [net.pppncp] TRACE: NCP event 2
0000140066 [net.pppncp] TRACE: State changed event: 1
0000140067 [gsm0710muxer] INFO: GSM07.10 muxer thread exiting
0000140067 [net.ppp.client] TRACE: PPP thread event LOWER_DOWN
0000140068 [net.ppp.client] TRACE: State CONNECTED -> DISCONNECT
0000140068 [net.ppp.client] TRACE: State DISCONNECT -> DISCONNECTING

The device disconnected and did a hard reset on the cellular modem (BRN402) and then, since it runs in SYSTEM_MODE(SEMI_AUTOMATIC), the device immediately attempted to reconnect. This is as expected.

Here is the problem: these are solar-powered devices with low energy budgets. As such, I limit how long I let them attempt to connect (11 minutes) and how often (once an hour) using a specific finite state for connecting. But this reconnect did not go through my connecting state, and therefore it broke both of these rules (18 minutes to connect and more than once in an hour).

So, here is the question: is there any limit on how long a device will try to connect when doing a reconnection? Worst case, a device could attempt to connect indefinitely and, unless it was successful, you would be hard pressed to know what happened remotely.

Thanks, Chip

The short answer is there is no limit on 2.2 and it could go on forever.

The long answer, courtesy of @rickkas7 is:
This is likely due to a firmware update.

With 2.2.0, the device resets after the user firmware is loaded into the OTA slot. From that point, the user has no control because safe mode healer will take over to upgrade Device OS. It will keep trying forever, as it should because the device is not running user firmware yet. And there will be at least one more reboot and reconnect required after Device OS is updated.

Colleen,

OK, that is a bit scary. But I understand that in the middle of a firmware update there is a point where a connection is needed to move forward and, it seems, you cannot go back. Still, that means that it is possible to lose a remote device during a firmware update. My sense is that this must be a rare event, but it does change the risk / reward calculus for an update.

However, I want to clarify something: I was not performing a firmware update when this happened. I don’t have great network coverage at my home which is actually a blessing for my development work.
This device was connected and, I believe, the connection was lost which triggered the reconnection.

I have seen some of my devices exhibit a significant “battery charge loss event” so this could be a rare but important possibility. Might it be possible, outside of the firmware update case discussed above, to add some guard rails to this reconnection attempt? For example, might it be possible to prevent the battery from being drained or to enforce a “back off” after repeated attempts, so that a network outage does not drain the batteries of a fleet of devices?
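
For illustration only, a “back off” of this kind can be sketched at the application level. The function name `backoffDelaySec` and its parameters are hypothetical and not part of Device OS; the sketch assumes the firmware counts consecutive failed connection attempts itself and decides when to try again.

```cpp
#include <cstdint>

// Hypothetical helper (not a Device OS API): returns how many seconds to wait
// before the next connection attempt, doubling the wait after each
// consecutive failure and never exceeding maxSec.
uint32_t backoffDelaySec(uint32_t consecutiveFailures, uint32_t baseSec, uint32_t maxSec) {
    uint32_t delay = baseSec;
    for (uint32_t i = 0; i < consecutiveFailures; ++i) {
        if (delay > maxSec / 2) {
            return maxSec;            // Doubling would pass the cap (and could overflow)
        }
        delay *= 2;
    }
    return (delay < maxSec) ? delay : maxSec;   // Also covers baseSec > maxSec
}
```

With an assumed base of one hour and a cap of 24 hours, the wait after 0, 1 and 2 consecutive failures would be 1, 2 and 4 hours, and it would stay at 24 hours once the cap is reached.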

All this is in the spirit of making Particle devices more resilient and better able to recover from an unplanned outage.

Thanks,

Chip

There are two parts to this:

How are you handling timeouts to reconnect? The best way is to monitor Particle.connected() from loop and keep track of when it disconnects and establish a limit for the time to reconnect. That will work for spontaneous disconnections, which can happen at any time and should be interruptible.

Using Device OS 3.0 or later on Gen 3 solves the issue on firmware update in two ways. First, the modem will not (usually) be reset after the reboot, so it should connect very quickly (essentially no blinking green phase). Second, if the update can be done as a combined update, then user firmware and Device OS can both be updated in a single reboot with no safe mode healer (breathing magenta) phase, which also reduces the chance of a long time to reconnect.
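
The first part, tracking a disconnection from loop and limiting the time to reconnect, could look roughly like the sketch below. `ReconnectGuard` is a hypothetical helper, not a Device OS class; it takes the connection flag and the clock as plain arguments, on the assumption that the application passes in `Particle.connected()` and `millis()` on every pass through loop.

```cpp
#include <cstdint>

// Hypothetical helper (not part of Device OS): remembers when the cloud
// connection was first seen down and reports when it has stayed down longer
// than the configured limit.
class ReconnectGuard {
public:
    explicit ReconnectGuard(uint32_t limitMs) : limitMs_(limitMs) {}

    // Call once per loop() pass. Returns true once the connection has been
    // down for more than limitMs.
    bool update(bool connected, uint32_t nowMs) {
        if (connected) {
            down_ = false;                       // Connection is back, clear the timer
            return false;
        }
        if (!down_) {
            down_ = true;                        // First pass where we see the disconnect
            downSinceMs_ = nowMs;
        }
        // Unsigned subtraction stays correct across a millis() rollover
        return (nowMs - downSinceMs_) > limitMs_;
    }

    // Forget any disconnect in progress, e.g. after deliberately going offline
    void reset() { down_ = false; }

    bool isDown() const { return down_; }

private:
    uint32_t limitMs_;
    uint32_t downSinceMs_ = 0;
    bool down_ = false;
};
```

In loop, the application would call `guard.update(Particle.connected(), millis())` only while it expects to be connected, and treat a `true` result as the signal to stop the attempt and power down the modem. After a deliberate disconnect it would call `reset()` so the next session starts with a fresh timer.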

@rickkas7 ,

Thank you for taking a look. To answer your first question, I handle timeouts to reconnect with a non-blocking state called CONNECTING_STATE in my state machine. In this state, I decide whether a connection is needed and, if so, whether it is needed at this time. It also allows for a user override using a button on the carrier so that a device can always be brought online manually. Finally, it records the details of a successful or unsuccessful connection.

// Before Setup()
unsigned long connectMaxTimeSec = 11 * 60;   // Timeout for trying to connect to Particle cloud in seconds

// Main loop Finite State machine
case CONNECTING_STATE:{                                              // Will connect - or not and head back to the Idle state
    static State retainedOldState;                                     // Keep track for where to go next (depends on whether we were called from Reporting)
    static unsigned long connectionStartTimeStamp;                     // Time in Millis that helps us know how long it took to connect

    if (state != oldState) {                                           // Non-blocking function - these are first time items
      retainedOldState = oldState;                                     // Keep track for where to go next
      sysStatus.lastConnectionDuration = 0;                            // Will exit with 0 if we do not connect or are connected or the connection time if we do
      publishStateTransition();

      // Let's make sure we need to connect
      if (sysStatus.connectedStatus && Particle.connected()) {
        Log.info("Connecting state but already connected");
        state = (retainedOldState == REPORTING_STATE) ? RESP_WAIT_STATE : IDLE_STATE;   // Compare with ==, a single = would overwrite retainedOldState
        break;
      }
      // If we are in a low battery state - we are not going to connect unless we are over-riding with user switch (active low)
      if (sysStatus.lowBatteryMode && digitalRead(userSwitch)) {
        Log.info("Connecting state but low battery mode");
        state = IDLE_STATE;
        break;
      }
      // If we are in low power mode, we may bail if battery is too low and we need to reduce reporting frequency
      if (sysStatus.lowPowerMode && digitalRead(userSwitch)) {         // Low power mode and user switch not pressed
        if (sysStatus.stateOfCharge <= 50 && (Time.hour() % 4)) {      // If the battery level is 50% or less, only connect every fourth hour
          Log.info("Connecting but <50%% charge - four hour schedule");
          state = IDLE_STATE;                                          // Not a scheduled hour - skip this attempt and go back to Idle
          break;
        }
        else if (sysStatus.stateOfCharge <= 65 && (Time.hour() % 2)) { // If the battery level is 50% - 65%, only connect every other hour
          Log.info("Connecting but 50-65%% charge - two hour schedule");
          state = IDLE_STATE;                                          // Not a scheduled hour - skip this attempt and go back to Idle
          break;
        }
      }

      // OK, let's do this thing!
      connectionStartTimeStamp = millis();                             // Have to use millis as the clock will get reset on connect
      Cellular.on();                                                   // Needed until they fix this: https://github.com/particle-iot/device-os/issues/1631
      Particle.connect();                                              // Told the Particle to connect, now we need to wait
    }

    sysStatus.lastConnectionDuration = int((millis() - connectionStartTimeStamp)/1000);

    if (Particle.connected()) {
      sysStatus.connectedStatus = true;
      sysStatus.lastConnection = Time.now();                           // This is the last time we connected successfully
      recordConnectionDetails();                                       // Record outcome of connection attempt
      Log.info("Cloud connection successful");
      attachInterrupt(userSwitch, userSwitchISR,FALLING);              // Attach interrupt for the user switch to enable verbose counts
      if (retainedOldState == REPORTING_STATE) state = RESP_WAIT_STATE;
      else state = IDLE_STATE;
    }
    else if (sysStatus.lastConnectionDuration > connectMaxTimeSec) {
      current.alerts = 2;                                              // Connection timed out alert
      sysStatus.connectedStatus = false;
      recordConnectionDetails();                                       // Record outcome of connection attempt
      Log.info("cloud connection unsuccessful");
      disconnectFromParticle();                                        // Make sure the modem is turned off
      if (sysStatus.solarPowerMode) setLowPowerMode("1");              // If we cannot connect, there is no point in staying out of low power mode
      if ((Time.now() - sysStatus.lastConnection) > 3 * 3600L) {       // Only sends to ERROR_STATE if it has been over three hours - this ties to reporting and low battery state
        state = ERROR_STATE;     
        resetTimeStamp = millis();
        break;
      }
      else state = IDLE_STATE;
    } 
  } break;

What I am unclear about is how my code would know a reconnection attempt is underway as a reconnection does not invoke my CONNECTING_STATE.

I would assume that, if I could detect a reconnection attempt, I could call my disconnectFromParticle function (below) after it went for more than 11 minutes. Would this work?

bool disconnectFromParticle()                                          // Ensures we disconnect cleanly from Particle
                                                                       // Updated based on this thread: https://community.particle.io/t/waitfor-particle-connected-timeout-does-not-time-out/59181
{
  Log.info("In the disconnect from Particle function");
  Particle.disconnect();
  waitForNot(Particle.connected, 15000);                               // make sure before turning off the cellular modem
  Cellular.disconnect();                                               // Disconnect from the cellular network
  Cellular.off();                                                      // Turn off the cellular modem
  waitFor(Cellular.isOff, 30000);                                      // As per TAN004: https://support.particle.io/hc/en-us/articles/1260802113569-TAN004-Power-off-Recommendations-for-SARA-R410M-Equipped-Devices
  sysStatus.connectedStatus = false;
  systemStatusWriteNeeded = true;
  detachInterrupt(userSwitch);                                         // Stop watching the userSwitch as we will no longer be connected
  return true;
}

That said, would changing the System Mode to “Manual” solve any of this? In manual mode, will the system initiate a reconnect attempt if the current session stops working?

As for the firmware update case, I have done a large number of these using the Intelligent OTA process. From my experience, this process works very reliably and, when it fails, the devices continue to operate so I believe this is less of an issue.

As always, any comments or suggestions are welcome.

Thanks, Chip

Manual mode will not help. The problem is that any time you are connected, it is possible that the cloud could disconnect and reconnect. You need to detect this in parallel with your state machine, which is much easier to do when you are using a state machine.

@rickkas7 or @Colleen ,

OK, makes sense. Would something like this work in the main loop, outside the FSM?

  if (sysStatus.connectedStatus && !Particle.connected()) {            // If the system thinks we are connected, let's make sure that we are
    state = CONNECTING_STATE;                                          // Go to the connecting state - that way we will have limits on connection attempt duration
    sysStatus.connectedStatus = false;                                 // At least for now, this is the correct state value
    Log.info("Particle connection failed, reverting to the connecting state");
  }

The question I have is what happens if, by the time I catch this and get to the CONNECTING_STATE, the reconnection is already underway? Should I also include a disconnectFromParticle() call to make sure I am starting from a clean attempt?

Thanks,

Chip