Intelligent OTA Updates - Best Practices for Devices with Poor Connectivity

@All,

With the release of deviceOS@2.2.0, I am making an effort to bring all of my devices onto the same firmware and deviceOS releases. Previously, I had allowed devices deemed “low bandwidth” to stay on deviceOS@2.0.1, as I did not see the benefit in 2.1.0, but there are some indications (such as resetting DTLS sessions) that 2.2.0 will be worth the effort.

So, here is the issue: too many of the Intelligent OTA updates fail for low-bandwidth devices, and I am wondering if there are some steps I could take to address this. This issue is impacting about 15% of my deployed devices (mainly in rural state parks). Importantly, these devices function acceptably in daily operation; they just don’t seem to have the bandwidth to update.

Here is what a typical failure looks like:

These devices use the AB1805 Watchdog timer as outlined in this application note:

https://docs.particle.io/datasheets/app-notes/an023-watchdog-timers/

and I am wondering if this issue could be related to the watchdog timer resetting the device during the update process. I have struggled to come up with a concrete method to prove this given the remote and intermittent nature of the update process.

I use a finite state machine approach in my code. Should I add a “FIRMWARE_UPDATE” state to my state machine? In this state, we would:

  • Disable the watchdog timer
  • Set an application timer to set an upper limit on the update
  • Keep track of failed attempts, stop trying at some point, and send an alert to my back-end. This is important for folks like me who are still on the old data plan, as failed updates consume data. My understanding is that in the new data operations plans, failed updates don’t have a financial repercussion, but they do consume battery. Does this make sense?

@rickkas7 has such a state in this application note but it does not account for the watchdog or the “stop trying after x failed attempts” counter.

https://docs.particle.io/datasheets/app-notes/an029-wake-publish-sleep-cellular/

If anyone has thoughts or experience on this, I would love to hear from you. If not, here is my plan:

  • develop a new release for my low bandwidth devices with a “FIRMWARE_UPDATE” state
  • deploy this update to the low-bandwidth devices on deviceOS@2.0.1 so they will only have to update my sketch.
  • Bring them into the main line production release including deviceOS@2.2.0 and the latest firmware

Make sense? Thoughts?

Thanks, Chip

There are a lot of variables here, but if you use the AB1805 and have it configured as in the app note, you should not run into a problem with the watchdog. The update process consists of several parts:

  • If SYSTEM_THREAD(ENABLED) then your application firmware is downloaded while your code is still running, though a bit more slowly.
  • After successfully downloading the binary, the device resets. The AB1805 library example disables the watchdog before this reset.
  • The device starts up in safe mode, because the user firmware binary is targeting a Device OS version later than what is installed on the device. (Breathing magenta.)
  • Safe mode healer in the cloud starts downloading and installing the necessary dependencies, including the system parts and boot loader. In some cases, also soft device. This is why the watchdog is disabled, because this step can take a very long time with weak cellular, and your user firmware binary is not running yet, so it can’t service the watchdog.
  • Once all of the dependencies are met, your user firmware runs.

One thing that can cause problems, and that you have control over, is any special sleep logic. Once the download starts, if you don’t wait long enough before sleep, the download will start over from the beginning with 2.x and may never complete. With 3.x on Gen 3, it will resume from where it left off before sleep.
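
A minimal sketch of a sleep guard along these lines, written as plain C++ so the decision can be checked off-device. The helper name okToSleep, its parameters, and the 10-minute cap are assumptions for illustration; on a device the flag would be driven by the firmware_update system event and the timestamps would come from millis().

```cpp
#include <cstdint>

// Hypothetical cap on how long sleep is deferred for a download (assumption).
constexpr uint32_t kMaxUpdateWaitMs = 10UL * 60UL * 1000UL;

// Hypothetical helper: may the state machine go to sleep right now?
// updateInProgress - true between "update began" and "update ended"
// nowMs / updateStartMs - millisecond tick values (32-bit, as on the device)
bool okToSleep(bool updateInProgress, uint32_t nowMs, uint32_t updateStartMs,
               uint32_t maxWaitMs = kMaxUpdateWaitMs) {
    if (!updateInProgress) {
        return true;                              // nothing downloading, sleep as usual
    }
    // A download is running: keep the device awake until the cap is reached,
    // so a stalled transfer cannot hold it awake indefinitely.
    return static_cast<uint32_t>(nowMs - updateStartMs) >= maxWaitMs;
}
```

The cap is the trade-off: too short and a 2.x download restarts from zero on every wake, too long and a stalled transfer drains the battery.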


@rickkas7 ,

Thank you very much for your explanation of the update process as it relates to the watchdog timer. This is one of the reasons I moved to the implementation as laid out in AN023, to benefit from the thoughtful design you have implemented there.

Now that you mention it, it is possible that sleep could be interfering. I could explicitly defer sleep through a test in my code or by adding a new state in the state machine. Since I am not running 3.x code on these devices, I don’t get the benefit of resuming the download. So, I need to make sure the download has the time to complete.

I will take a stab at this and update the post as I make progress.

Thanks,

Chip


I have made some progress on this and it seems to be improving the odds of a successful update. Here is what I have learned so far:

  1. Firmware updates have much higher odds of success than deviceOS updates. So, I am pursuing a two-step update process for low-bandwidth devices: update the firmware (shown below) in the first step, which in turn improves the odds of a deviceOS update.
  2. While I don’t know what the “sweet spot” is, allowing up to 10 minutes for an update seems to improve success. Beyond a certain point, I think waiting longer brings diminishing returns.
  3. I wanted to provide some visibility into the update process by reporting “alerts” to Ubidots where I could then trigger alerts and reports.
  4. I needed to limit the number of update attempts; otherwise, a device can just sit and repeatedly attempt updates and churn through data and battery. Like the example below.

I based my first pass on AN029, posted above. Here are some excerpts from my code to give you an idea what this looks like:

First, I defined some specific alert codes for updates:

/* Alert Count Definitions
* 0 = Normal Operations - No Alert
* 1 = Battery temp too high / low to charge
* 2 = Failed to connect to Particle 
* 3 = Failed to get Webhook response when connected
* 4 = Firmware update completed
* 5 = Firmware update timed out
* 6 = Firmware update failed
* 7 = Update attempt limit reached - done for the day
*/

In setup(), I now test for daily updates and will reset the updateAttempts count each day

System.on(firmware_update, firmwareUpdateHandler);// Registers a handler that will track if we are getting an update

if (current.updateAttempts >= 3) {
    System.disableUpdates();                                          // We will only try to update three times in a day 
    current.alerts = 7;                                                // Set an alert that we have maxed out our updates for the day
}
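
The excerpt above shows the attempt limit but not the daily reset. A minimal sketch of how the two could fit together, in plain C++ so it can be run off-device. The struct, the lastUpdateDay field, and the function name are assumptions for illustration; on a device the day value could come from Time.day() and a false result would be followed by System.disableUpdates().

```cpp
#include <cstdint>

// Hypothetical persistent record. updateAttempts and alerts mirror the post;
// lastUpdateDay is an assumption added for this sketch.
struct UpdateRecord {
    uint8_t updateAttempts;   // failed or timed-out attempts so far today
    uint8_t alerts;           // last alert code
    int     lastUpdateDay;    // day of month when the counter was last reset
};

constexpr uint8_t kMaxAttemptsPerDay = 3;   // limit used in the post
constexpr uint8_t kAlertAttemptLimit = 7;   // "done for the day" alert code

// Returns true if OTA updates should stay enabled for this wake cycle.
bool updatesAllowedToday(UpdateRecord &rec, int today) {
    if (today != rec.lastUpdateDay) {        // new day: start with a clean slate
        rec.updateAttempts = 0;
        rec.lastUpdateDay = today;
    }
    if (rec.updateAttempts >= kMaxAttemptsPerDay) {
        rec.alerts = kAlertAttemptLimit;     // report that the limit was hit
        return false;                        // caller disables updates
    }
    return true;
}
```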

I created a webhook that just reports the Alerts and a timestamp as the device may not be on-line when the alert handler fires. Here is the handler:

/**
 * @brief The Firmware update handler tracks changes in the firmware update status
 * 
 * @details This handler is subscribed to in setup with System.on and sets the firmwareUpdateInProgress flag that 
 * will trigger a state transition to the firmware update state.  As some events are only seen in this handler, the failure
 * and success codes are assigned here, and the timeout code is assigned in the main loop state.
 * 
 * @param event  - Firmware update 
 * @param param - Specific firmware update state
 */

void firmwareUpdateHandler(system_event_t event, int param) {
  char data[64];                                                       // Scratch buffer for the alert payload - local, not global

  switch(param) {
    case firmware_update_begin:
      firmwareUpdateInProgress = true;
      break;
    case firmware_update_complete:
      firmwareUpdateInProgress = true;
      current.alerts = 4;                                              // Record a successful attempt
      snprintf(data, sizeof(data), "{\"alerts\":%i,\"timestamp\":%lu000 }",current.alerts, (unsigned long)Time.now());
      PublishQueuePosix::instance().publish("Ubidots_Alert_Hook", data, PRIVATE); // Put in publish queue
      current.updateAttempts = 0;                                      // Zero the update attempts counter
      break;
    case firmware_update_failed:
      firmwareUpdateInProgress = false;
      current.alerts = 6;                                              // Record a failed attempt - set before building the payload so it is reported
      snprintf(data, sizeof(data), "{\"alerts\":%i,\"timestamp\":%lu000 }",current.alerts, (unsigned long)Time.now());
      PublishQueuePosix::instance().publish("Ubidots_Alert_Hook", data, PRIVATE); // Put in publish queue
      current.updateAttempts++;                                        // Increment the update attempts counter
      break;
  }
  currentCountsWriteNeeded = true;
}
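
The payload formatting repeats in several places, so it could be pulled into one helper. A minimal sketch in plain C++, assuming a hypothetical function name formatAlert. It also checks for truncation, which the excerpts do not do. The "%lu000" pattern prints the epoch seconds followed by three literal zeros, which presumably gives the receiving service a millisecond timestamp.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: build the alert JSON into buf.
// Returns the number of characters written, or 0 if the text did not fit.
std::size_t formatAlert(char *buf, std::size_t len, int alertCode,
                        unsigned long epochSeconds) {
    // Seconds followed by "000" is the same instant expressed in milliseconds.
    int n = std::snprintf(buf, len, "{\"alerts\":%i,\"timestamp\":%lu000}",
                          alertCode, epochSeconds);
    if (n <= 0 || static_cast<std::size_t>(n) >= len) {
        return 0;                            // encoding error or truncated output
    }
    return static_cast<std::size_t>(n);
}
```

With a 64-byte buffer, alert code 6 and epoch 1620000000 produce {"alerts":6,"timestamp":1620000000000}.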

Finally, here is the state for firmware updates:

case FIRMWARE_UPDATE: {
      static unsigned long stateTime;
      char data[64];

      if (state != oldState) {
        stateTime = millis();                                          // When did we start the firmware update?
        Log.info("In the firmware update state");
        publishStateTransition();
      }
      if (!firmwareUpdateInProgress) {                                 // Done with the update 
        Log.info("firmware update completed");
        state = IDLE_STATE;
      }
      else if (millis() - stateTime >= firmwareUpdateMaxTime.count()) { // Ran out of time
        Log.info("firmware update timed out");
        current.alerts = 5;                                            // Record alert for timeout
        snprintf(data, sizeof(data), "{\"alerts\":%i,\"timestamp\":%lu000 }",current.alerts, Time.now());
        PublishQueuePosix::instance().publish("Ubidots_Alert_Hook", data, PRIVATE);
        current.updateAttempts++;                                      // Increment the update attempt counter
        state = IDLE_STATE;
      }
    } break;
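
A note on the timeout test: writing it as a subtraction (millis() - stateTime) rather than comparing against a precomputed deadline keeps it correct when the millisecond counter wraps. A minimal sketch in plain C++, assuming a 32-bit counter and a hypothetical helper name timedOut:

```cpp
#include <cstdint>

// Hypothetical helper mirroring `millis() - stateTime >= limit`.
// Unsigned 32-bit subtraction wraps modulo 2^32, so the difference is still
// the true elapsed time even if the counter rolled over once in between.
bool timedOut(uint32_t nowMs, uint32_t startMs, uint32_t limitMs) {
    return static_cast<uint32_t>(nowMs - startMs) >= limitMs;
}
```

For example, a start value 1000 ms before the wrap and a current value of 599000 ms after it gives an elapsed time of 600000 ms, which is exactly the 10-minute limit.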

I hope this is helpful and please feel free to chime in with questions or comments.

Thanks, Chip


Nice, Chip! Thanks for sharing the details. I especially like reporting the status to Ubidots to gain quick visibility into it. I’m likely in a similar scenario and will have several devices in areas with low signal strength. Previously, I just deployed user firmware and only did deviceOS updates on a select few. I’ll add this item to the development list, as it seems very handy! It’s bookmarked for me. Thanks for sharing.

One comment…

if (current.updateAttempts >= 3) {
    System.disableUpdates();                                          // We will only try to update three times in a day 
    current.updateAttempts = 7;                                       // Set an alert that we have maxed out our updates for the day
}

Is this correct? Should this be current.updateAttempts = 7, or should it be current.alerts = 7?

@jgskarda ,

I am happy you are finding this useful. I plan to tweak some of the values (like is 3 update attempts a day the right answer or how long should I let an update download before timing out) and will update the thread.

Thank you for catching my mistake. As you had guessed, it should be current.alerts = 7. I will correct my original post.

Thanks,

Chip


Thank you for sharing. I am not crazy about disabling the watchdog during updates. If anyone has experience with what a safe timeout could be to let an update go through on a slow 2G connection, I would love to hear about it.

@thrmttnw ,

I agree about disabling the watchdog, and I had thought that would be required.

However, @rickkas7 pointed out that the watchdog will still be serviced during the download and will be disabled automagically when the device resets to safe mode for the update itself. This state is non-blocking, so the program continues to cycle through the main loop() while it is downloading, servicing the watchdog. So, there is no need to disable the watchdog, and it is not disabled in this approach.

Here is the net of what is done:

  • It delays sleep for a specified time (I am using 10 minutes) to give the download time
  • It triggers an alert that records the outcome of the update attempt (failed, timed out, succeeded)
  • It keeps track of the number of times an update is attempted in a day (I am limiting to three)

Chip


Makes sense. I am still on the TPL5010 with a 6-minute timeout, as in the application note. As I understand this, that watchdog gets serviced as well during the firmware download with SYSTEM_THREAD(ENABLED), but not during the OS download. That is where I am wondering if the 6 minutes is enough on slow 2G.
