Intermittent Connection Timeouts - Looking for Advice

All,

I have been working with support for over a month to figure out why, at infrequent intervals, dozens of my devices time out while attempting to connect to the Particle cloud. This is not the everyday case where a device here and there can’t connect - this is a “mass failed connection event” where many geographically separated devices experience the issue simultaneously. Since this happens infrequently and on remote devices, I have not been able to capture logs to help the support team, though this effort is ongoing.

I am posting this data / issue for two reasons:

  1. To see if anyone else is seeing issues like this
  2. To solicit any suggestions on how I might address the issue

Here is what I know:

  • In my code, I keep track of how many seconds each connection attempt takes. Based on the example in AN023, I set an 11 minute timeout for a connection attempt.
  • On any given day, some devices will reach this timeout which will be reported on the next successful connection and which, in turn, triggers an alert from Ubidots to a custom Slack channel I monitor. With this, I can see all missed connection events and review their history.
  • Occasionally, 3-4 times a month, I will see a “mass missed connection event” where dozens of devices will all miss connections at the same time. These devices are spread across North Carolina and California. These events typically last two hours and then things go back to normal.

No data is lost since I use PublishQueuePOSIX to queue webhooks, so why is this a big deal? First, these events seem to be occurring slightly more frequently (see below) and second, attempting to connect for 11 minutes consumes significant battery on these solar devices.

Here is what these events look like in my Ubidots console:

This is for a single device but, as you can see in the Slack logs from yesterday’s event, many devices are “participating” in these events.

My code is open source, so please feel free to have a look here. I believe that if my code is at fault, the problem would be in the “connecting state”, which looks like this:

 case CONNECTING_STATE:{                                              // Will connect - or not and head back to the Idle state
    static State retainedOldState;                                     // Keep track for where to go next (depends on whether we were called from Reporting)
    static unsigned long connectionStartTimeStamp;                     // Time in Millis that helps us know how long it took to connect

    if (state != oldState) {                                           // Non-blocking function - these are first time items
      retainedOldState = oldState;                                     // Keep track for where to go next
      sysStatus.lastConnectionDuration = 0;                            // Will exit with 0 if we do not connect or are connected or the connection time if we do
      publishStateTransition();

      // Let's make sure we need to connect
      if (sysStatus.connectedStatus && Particle.connected()) {
        Log.info("Connecting state but already connected");
        stayAwake = stayAwakeLong;                                       // Keeps device awake after reboot - helps with recovery
        stayAwakeTimeStamp = millis();
        state = (retainedOldState == REPORTING_STATE) ? RESP_WAIT_STATE : IDLE_STATE;
        break;
      }
      // If we are in a low battery state - we are not going to connect unless we are over-riding with user switch (active low)
      if (sysStatus.lowBatteryMode && digitalRead(userSwitch)) {
        Log.info("Connecting state but low battery mode");
        state = IDLE_STATE;
        break;
      }
      // If we are in low power mode, we may bail if battery is too low and we need to reduce reporting frequency
      if (sysStatus.lowPowerMode && digitalRead(userSwitch)) {         // Low power mode and user switch not pressed
        if (sysStatus.stateOfCharge <= 50 && (Time.hour() % 4)) {      // If the battery level is <50%, only connect every fourth hour
          Log.info("Connecting but <50%% charge - four hour schedule");
          state = IDLE_STATE;                                          // Will send us to connecting state - and it will send us back here
          break;
        }                                                              // Leave this state and go connect - will return only if we are successful in connecting
        else if (sysStatus.stateOfCharge <= 65 && (Time.hour() % 2)) { // If the battery level is 50% -  65%, only connect every other hour
          Log.info("Connecting but 50-65%% charge - two hour schedule");
          state = IDLE_STATE;                                          // Will send us to connecting state - and it will send us back here
          break;                                                       // Leave this state and go connect - will return only if we are successful in connecting
        }
      }

      // OK, let's do this thing!
      connectionStartTimeStamp = millis();                             // Have to use millis as the clock will get reset on connect
      Cellular.on();                                                   // Needed until they fix this: https://github.com/particle-iot/device-os/issues/1631
      Particle.connect();                                              // Told the Particle to connect, now we need to wait
    }

    sysStatus.lastConnectionDuration = int((millis() - connectionStartTimeStamp)/1000);

    if (Particle.connected()) {
      sysStatus.connectedStatus = true;
      sysStatus.lastConnection = Time.now();                           // This is the last time we successfully connected
      stayAwake = stayAwakeLong;                                       // Keeps device awake after reboot - helps with recovery
      stayAwakeTimeStamp = millis();
      recordConnectionDetails();                                       // Record outcome of connection attempt
      Log.info("Cloud connection successful");
      attachInterrupt(userSwitch, userSwitchISR,FALLING);              // Attach interrupt for the user switch to enable verbose counts
      if (retainedOldState == REPORTING_STATE) state = RESP_WAIT_STATE;
      else state = IDLE_STATE;
    }
    else if (sysStatus.lastConnectionDuration > connectMaxTimeSec) {
      current.alerts = 2;                                              // Connection timed out alert
      sysStatus.connectedStatus = false;
      recordConnectionDetails();                                       // Record outcome of connection attempt
      Log.info("cloud connection unsuccessful");
      disconnectFromParticle();                                        // Make sure the modem is turned off
      if (sysStatus.solarPowerMode) setLowPowerMode("1");              // If we cannot connect, there is no point in staying out of low power mode
      if ((Time.now() - sysStatus.lastConnection) > 3 * 3600L) {       // Only sends to ERROR_STATE if it has been over three hours - this ties to reporting and low battery state
        state = ERROR_STATE;
        resetTimeStamp = millis();
        break;
      }
      else state = IDLE_STATE;
    }
  } break;

However, it is unclear to me how an error in my code could cause intermittent, coordinated events across devices that are hundreds or thousands of miles distant.

I will continue to work with support but any advice, suggestions or similar experiences would be helpful.

Thank you,

Chip

@rickkas7 ,

I had received some advice from @rftop separately and will be updating this thread as I think this work could help others too. However, before I do, I was wondering if you could comment on the basis for this statement from your Watchdog Application Note:

You can configure the amount of time to fail to connect to the cloud before doing a deep power off for 30 seconds. The default is 11 minutes, and you should not set it less than 10. You can set it higher if you want.

const std::chrono::milliseconds connectMaxTime = 11min;

A temporary solution suggested by Particle support is to shorten the maximum connection time so events like this will not have as significant an impact on battery life. However, I seem to remember that allowing the full connection time was important as the cellular modem might go through some process of resetting that would be pre-empted if this time was too short.

Could you please weigh in on this?

Thank you, Chip

The 11 minute timeout only applies if you are not powering off the modem.

If you are powering off the modem and sleeping for more than 10 minutes, you can stop attempting to connect at a shorter interval, though I wouldn’t set it shorter than 2 or 3 minutes.

The reason is that Device OS will completely power down the modem after 10 minutes of attempting to connect. This will sometimes clear a problem connecting, and if you go into a cellular standby sleep after a few minutes, you’ll end up never fully powering down the modem.

However, if you power down the modem anyway, then it doesn’t matter as much.
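The rule of thumb in this reply can be sketched as a small helper. This is only an illustration in plain C++, not Device OS code: `chooseConnectTimeout` and its parameters are hypothetical names, and the 3-minute value is simply the upper end of the "2 or 3 minutes" lower bound mentioned above.

```cpp
#include <cassert>
#include <chrono>

using namespace std::chrono_literals;

// Hypothetical helper (not part of Device OS): picks how long to keep
// trying to connect before giving up.
std::chrono::seconds chooseConnectTimeout(bool modemPoweredOffInSleep,
                                          std::chrono::seconds sleepDuration) {
    const std::chrono::seconds fullTimeout = 11min;   // default from the application note
    const std::chrono::seconds shortTimeout = 3min;   // assumed value, per "2 or 3 minutes" above

    // A shorter timeout is only reasonable when the modem is powered off
    // anyway and the device sleeps for more than 10 minutes.
    if (modemPoweredOffInSleep && sleepDuration > 10min) {
        return shortTimeout;
    }
    return fullTimeout;
}
```

With these assumptions, a device that powers off the modem and sleeps for an hour would get the 3-minute timeout, while a device that keeps the modem in standby would keep the full 11 minutes so that Device OS can still do its 10-minute modem power-down.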

Thanks to some suggestions from @Rftop , I made some changes to my code to differentiate a failure to connect to cellular from a successful cellular connection and a failure to connect to Particle. I deployed this code to my devices and waited for the next “mass missed connection event” - which happened this morning.

I can now state that these events are a failure to connect to the cellular network. Today, at 7am EST, 38 devices from across North Carolina failed to connect and then connected successfully at 8am.

Here is my updated code:

case CONNECTING_STATE:{                                              // Will connect - or not and head back to the Idle state
    static State retainedOldState;                                     // Keep track for where to go next (depends on whether we were called from Reporting)
    static unsigned long connectionStartTimeStamp;                     // Time in Millis that helps us know how long it took to connect

    if (state != oldState) {                                           // Non-blocking function - these are first time items
      retainedOldState = oldState;                                     // Keep track for where to go next
      sysStatus.lastConnectionDuration = 0;                            // Will exit with 0 if we do not connect or are connected or the connection time if we do
      publishStateTransition();

      // Let's make sure we need to connect
      if (Particle.connected()) {
        Log.info("Connecting state but already connected");
        stayAwake = stayAwakeLong;                                     // Keeps device awake after reboot - helps with recovery
        stayAwakeTimeStamp = millis();
        state = (retainedOldState == REPORTING_STATE) ? RESP_WAIT_STATE : IDLE_STATE;
        break;
      }

      // If we are in a low battery state - we are not going to connect unless we are over-riding with user switch (active low)
      if (sysStatus.lowBatteryMode && digitalRead(userSwitch)) {
        Log.info("Connecting state but low battery mode");
        state = IDLE_STATE;
        break;
      }
      // If we are in low power mode, we may bail if battery is too low and we need to reduce reporting frequency
      if (sysStatus.lowPowerMode && digitalRead(userSwitch)) {         // Low power mode and user switch not pressed
        if (sysStatus.stateOfCharge <= 50 && (Time.hour() % 4)) {      // If the battery level is <50%, only connect every fourth hour
          Log.info("Connecting but <50%% charge - four hour schedule");
          state = IDLE_STATE;                                          // Will send us to connecting state - and it will send us back here
          break;
        }                                                              // Leave this state and go connect - will return only if we are successful in connecting
        else if (sysStatus.stateOfCharge <= 65 && (Time.hour() % 2)) { // If the battery level is 50% -  65%, only connect every other hour
          Log.info("Connecting but 50-65%% charge - two hour schedule");
          state = IDLE_STATE;                                          // Will send us to connecting state - and it will send us back here
          break;                                                       // Leave this state and go connect - will return only if we are successful in connecting
        }
      }
      // OK, let's do this thing!
      connectionStartTimeStamp = millis();                             // Have to use millis as the clock will get reset on connect
      Cellular.on();                                                   // Needed until they fix this: https://github.com/particle-iot/device-os/issues/1631
      Particle.connect();                                              // Told the Particle to connect, now we need to wait
    }

    sysStatus.lastConnectionDuration = int((millis() - connectionStartTimeStamp)/1000);

    if (Particle.connected()) {
      sysStatus.lastConnection = Time.now();                           // This is the last time we successfully connected
      stayAwake = stayAwakeLong;                                       // Keeps device awake after reboot - helps with recovery
      stayAwakeTimeStamp = millis();
      recordConnectionDetails();                                       // Record outcome of connection attempt
      attachInterrupt(userSwitch, userSwitchISR,FALLING);              // Attach interrupt for the user switch to enable verbose counts
      state = (retainedOldState == REPORTING_STATE) ? RESP_WAIT_STATE : IDLE_STATE;
    }
    else if (sysStatus.lastConnectionDuration > connectMaxTimeSec) {
      recordConnectionDetails();                                       // Record outcome of connection attempt
      Log.info("cloud connection unsuccessful");
      disconnectFromParticle();                                        // Make sure the modem is turned off
      if (sysStatus.solarPowerMode) setLowPowerMode("1");              // If we cannot connect, there is no point in staying out of low power mode
      if ((Time.now() - sysStatus.lastConnection) > 3 * 3600L) {       // Only sends to ERROR_STATE if it has been over three hours - this ties to reporting and low battery state
        state = ERROR_STATE;
        resetTimeStamp = millis();
        break;
      }
      else state = IDLE_STATE;
    }
  } break;

and

void  recordConnectionDetails()  {                                     // Whether the connection was successful or not, we will collect and publish metrics
  char data[64];

  if (sysStatus.lastConnectionDuration > connectMaxTimeSec+1) sysStatus.lastConnectionDuration = 0;
  else if (sysStatus.lastConnectionDuration > current.maxConnectTime) current.maxConnectTime = sysStatus.lastConnectionDuration; // Keep track of longest each day

  if (Cellular.ready()) getSignalStrength();                           // Test signal strength if the cellular modem is on and ready

  snprintf(data, sizeof(data),"Connected in %i secs",sysStatus.lastConnectionDuration);                   // Make up connection string and publish
  Log.info(data);

  if (Particle.connected()) {
    Log.info("Cloud connection successful");
    if (sysStatus.verboseMode) Particle.publish("Cellular",data,PRIVATE);
  }
  else if (Cellular.ready()) {                                        // We want to take note of this as it implies an issue with the Particle back-end
    Log.info("Connected to cellular but not Particle");
    current.alerts = 30;                                              // Record alert for timeout on Particle but connected to cellular
    snprintf(data, sizeof(data), "{\"alerts\":%i,\"timestamp\":%lu000 }",current.alerts, Time.now());
    PublishQueuePosix::instance().publish("Ubidots_Alert_Hook", data, PRIVATE);
  }
  else {
    Log.info("Failed to connect");
    current.alerts = 31;                                              // Record alert for timeout on cellular
    snprintf(data, sizeof(data), "{\"alerts\":%i,\"timestamp\":%lu000 }",current.alerts, Time.now());
    PublishQueuePosix::instance().publish("Ubidots_Alert_Hook", data, PRIVATE);
  }

  systemStatusWriteNeeded = true;
  currentCountsWriteNeeded = true;
}
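The three-way decision in recordConnectionDetails() can also be pulled out into a pure function, which makes it easy to check off-device. This is a minimal sketch in plain C++: the enum and function names are hypothetical, the two booleans stand in for Particle.connected() and Cellular.ready(), and returning 0 on success is an assumption (the code above leaves current.alerts unchanged in that case). The alert codes 30 and 31 are the ones used above.

```cpp
#include <cassert>

// Hypothetical names for illustration only.
enum class ConnectOutcome { CloudConnected, CellularOnly, NoCellular };

// cloudConnected stands in for Particle.connected(),
// cellularReady stands in for Cellular.ready().
ConnectOutcome classifyConnection(bool cloudConnected, bool cellularReady) {
    if (cloudConnected) return ConnectOutcome::CloudConnected;
    if (cellularReady)  return ConnectOutcome::CellularOnly;   // cellular is up, cloud was not reached
    return ConnectOutcome::NoCellular;                         // cellular connection itself failed
}

// Maps an outcome to the alert codes used in the code above.
int alertCodeFor(ConnectOutcome outcome) {
    switch (outcome) {
        case ConnectOutcome::CellularOnly: return 30;  // timeout on Particle, connected to cellular
        case ConnectOutcome::NoCellular:   return 31;  // timeout on cellular
        default:                           return 0;   // assumption: no alert on success
    }
}
```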

Again, looking for any advice or anyone else experiencing these events.

Thanks,

Chip

I see mass disconnection events with our always online B5 devices on 2.0.1 and 2.2.0. Two patterns I regard as “normal” due to rarity/duration.

On average once per month, multiple devices in different locations disconnect for seconds within a span of maybe half an hour, so not at exactly the same time, but clearly related.

The other pattern is where all devices disconnect at exactly the same time for minutes. Happens maybe 1-4 times in a year, but as a first, the other night it happened twice 7 minutes apart.

Our B5s connect to an Azure service bus, shared with other non-Particle Wi-Fi and cellular devices on the same network operator, and these devices do not follow the same pattern. So in our case, we can rule out the service bus and the network operator on their own. Us not seeing your patterns likely rules out the Particle cloud.

Left, it seems, is the firmware(s) on the device, or how the service provider is being serviced by the network operator.

With a quick look at the code, one thing I never do again, after previous experiences of random meaningless errors, is use LOG_LEVEL_ALL.

From experience with products on other platforms running on battery with periodic updates, I would consider making the shorter connection timeout a permanent solution. With a buffering scheme, it is best to determine the typical response time for uploads on wake, add a small margin, and stick with it, so as not to deplete the battery.

Periodically, a longer “phone home” is scheduled, like every [24] hours, to ensure status info and firmware updates.
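As a rough sketch of that scheme in plain C++: the function names, the margin, and the fallback value below are all hypothetical, and "typical" is taken here to mean the slowest of the recent successful connections, which is only one possible reading.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: connection budget = slowest recent successful
// connection time plus a small margin. Falls back to a fixed value
// when there is no history yet.
int connectBudgetSec(const std::vector<int>& recentConnectSecs,
                     int marginSec, int fallbackSec) {
    if (recentConnectSecs.empty()) return fallbackSec;
    int slowest = *std::max_element(recentConnectSecs.begin(),
                                    recentConnectSecs.end());
    return slowest + marginSec;
}

// Hypothetical helper: true when the longer "phone home" session is due.
// The 24-hour default mirrors the interval suggested above.
bool phoneHomeDue(long nowSec, long lastPhoneHomeSec,
                  long intervalSec = 24L * 3600L) {
    return (nowSec - lastPhoneHomeSec) >= intervalSec;
}
```

For example, with recent connection times of 20, 35 and 28 seconds and a 10 second margin, the routine budget would be 45 seconds, and the longer budget would only be used when phoneHomeDue() returns true.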

Depending on the application, a question is how quickly to trigger a modem reset/power cycle. Here it looks like the modem is powered down every sleep?

If you have retained memory left for it, I would consider using a lightweight-wireless-device-log at strategic points in the code, to see what actually happens “out there”.
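One way such a log could look is a small fixed-size ring buffer of timestamped event codes. The sketch below is plain C++ with hypothetical names and sizes. On a device the instance would need to be placed in retained memory and validated after a cold boot, neither of which is shown here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical lightweight event log: keeps the most recent kCapacity
// entries and overwrites the oldest ones when full.
struct EventLog {
    static constexpr std::uint32_t kCapacity = 16;

    struct Entry {
        std::uint32_t timestamp;  // e.g. seconds since epoch
        std::uint8_t code;        // application-defined event code
    };

    Entry entries[kCapacity];
    std::uint32_t head;   // index of the next slot to write
    std::uint32_t count;  // number of valid entries, at most kCapacity

    // Must be called once before first use (fields are not self-initialising).
    void clear() { head = 0; count = 0; }

    void record(std::uint32_t timestamp, std::uint8_t code) {
        entries[head] = Entry{timestamp, code};
        head = (head + 1) % kCapacity;
        if (count < kCapacity) ++count;
    }

    // Returns the i-th retained entry, where i = 0 is the oldest.
    Entry at(std::uint32_t i) const {
        std::uint32_t start = (head + kCapacity - count) % kCapacity;
        return entries[(start + i) % kCapacity];
    }
};
```

Calling record() at each state transition and publishing the buffer contents on the next successful connection would give a compact picture of what happened while the device was offline.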

For this, the Particle platform is still missing the option for user firmware to get connection status during cellular connection attempts (blocked by operator, no net, etc.). With that status, it could be clear to you, Particle and the service provider what action needs to be taken. When you need it, you really need it.

thank you for that reference, those lines of code look great.

@thrmttnw ,

Great idea on the remote logging approach. This is the path I am going down with support and I hope to have more to report back on once it is in place and providing data.

One update of note on these events. As noted above, I can now differentiate between a failure to connect to cellular and failure to connect to Particle. I had not seen many of the latter events until this week’s Particle API outage. Almost all of my devices reported (once the outage was over of course) that they connected to cellular but not Particle. This gives me confidence that I will be able to detect and differentiate between these events going forward.

Thanks,

Chip

To eliminate issues related to sessions, I added this to my setup before connecting (would be great to know if there is a better way to do this):

	static bool dummyer = true;
	Particle.variable("_" + String(random(10, 1000)), dummyer); // Ensure init of cloud API

BUT I currently only work with always on devices on this platform, where a restart should be a rare thing. For devices restarting every hour, the above will use up somewhere around 4-6K extra each time (if I remember correctly). So better to only do it when an issue is detected, but I am not sure how to do that.

@thrmttnw ,

Thank you for sharing. I guess in the brave new world of data operations the 4-6k would be less of a concern.

I am curious however, how does calling a Particle.variable function reset the API?

I used to use /particle/device/session/end (note: deprecated - don’t use this) when I suspected a lost connection to the Particle API. This worked well until the Particle backend was changed, with very serious consequences. Now, there is a clear session flag in the cloud disconnect options:

auto opts = CloudDisconnectOptions().clearSession(true);
Particle.disconnect(opts);

Perhaps I could use something like this to resolve suspected issues with API response.

Does anyone else do this?

Thanks,

Chip

It is actually a Particle variable registration in setup(). By making a change to the cloud function and variable registrations, the previous session cannot be resumed, forcing a full handshake.

I do it to reduce the space of possible errors during abnormal situations, also in recognition of my limited understanding of the overall system.

Yes, I see now the documentation is updated with a dedicated function :+1: also explaining this in more detail:

https://docs.particle.io/reference/device-os/firmware/#clear-session
