Robustness of DeviceOS 4.x.x to RESET mid-OTA DeviceOS update?

I'm hoping somebody can comment on or provide insight into the possible/likely outcomes if a device were reset while performing a Device OS OTA update.

I have some Boron devices that are currently in the field, and I would like to upgrade them to 4.2.0 (mostly for this bugfix).

The (potentially) spicy bit: The devices have hardware watchdogs which will toggle the RESET line if loop() doesn't complete at least once every 30s AND... the devices are expensive to access (i.e. cross country flights, lodging, engineer's time, etc.). Thus, bricking isn't a good option. :slight_smile:

My application is compiled with SYSTEM_THREAD(ENABLED), but my understanding is that some/all of the Device OS update occurs in Safe Mode.

My assumption is that Safe Mode doesn't run my loop(), but I haven't found much documentation on Safe Mode to confirm this. Is this an accurate assumption?

So, if my 30s countdown timer starts ticking as soon as loop() is no longer being called, what's the (rough/handwavy, I understand this could be difficult to put a figure on) probability of the device becoming bricked if RESET is toggled? I hope, for example, that system and user updates have a "double-buffer-like" arrangement where the critical operation(s) are actually quite small and fast, but I haven't looked under the hood to see how it's done. But I would also hazard a guess that each device doesn't have two full sets of Device OS + user application, and that the bootloader isn't smart enough to fall back to a known good combination after failing N times if one set were to be corrupted... but gosh, would I love to be wrong!

Thanks.

There are a few different parts to this.

Downloading the user firmware when system thread is enabled occurs simultaneously with your code, so your watchdog will still be serviced. This occurs into the OTA sectors, which are separate from the running firmware.

After boot, if the CRC of the firmware in the OTA sectors is valid, it will be swapped into the active sectors. This happens quickly, within a few seconds, so it should be fine. If you just update the firmware and not Device OS, your firmware will boot here.
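To make the "validate, then swap" idea concrete, here is a minimal host-side sketch in C++. It is not Device OS bootloader code: the `StagedImage` struct, the `applyIfValid` function, and the choice of the standard CRC-32 (IEEE) polynomial are all assumptions made for illustration, and the real flash layout and checksum details are not described in this thread. It also does not model a power loss or RESET during the copy itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
// Assumption: used here only as a stand-in for whatever check the bootloader performs.
std::uint32_t crc32Ieee(const std::uint8_t* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            const std::uint32_t mask = (crc & 1u) ? 0xEDB88320u : 0u;
            crc = (crc >> 1) ^ mask;
        }
    }
    return ~crc;
}

// Hypothetical model of an image sitting in the OTA (staging) sectors.
struct StagedImage {
    std::vector<std::uint8_t> payload;  // downloaded firmware bytes
    std::uint32_t expectedCrc;          // CRC shipped alongside the image
};

// Copy the staged image over the active one only if its CRC matches.
// Returns true if the swap happened; otherwise the active image is left untouched.
bool applyIfValid(const StagedImage& staged, std::vector<std::uint8_t>& active) {
    const std::uint32_t actual = crc32Ieee(staged.payload.data(), staged.payload.size());
    if (actual != staged.expectedCrc) {
        return false;  // partial or corrupted download: keep running the old firmware
    }
    active = staged.payload;
    return true;
}
```

The point of the sketch is that a download interrupted by a reset leaves a staged image whose CRC does not match, so the active firmware is never overwritten with a partial image.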

If the user firmware targets a version of Device OS newer than what is on the device, the device will go into safe mode, and the cloud will download any missing dependencies to the device, including Device OS, bootloader, and soft device.

Safe mode does not run application code, so your loop will not run. Actually, user firmware could not run at this point because it depends on a version of Device OS that is not yet available on the device.

With a watchdog, this is where you can run into problems because the device is in safe mode and attempting to both download and apply updates. The Tracker turns off its watchdog during updates for this reason. Setting a long timeout (6-10 minutes) is also a good choice if possible.

The device won't brick. It will eventually get the update, but it may require multiple attempts depending on the speed of your network and the length of the watchdog timer. This is better if the device previously had Device OS 3.x or later installed, because it supports resumable updates and will continue from where it rebooted. If it had an older version, it will start from the beginning again, which can cause the device to constantly consume cellular data but never complete the upgrade if it takes longer than the watchdog period.
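The difference between resumable and non-resumable updates under a repeating watchdog reset can be illustrated with a toy model. The C++ sketch below is a simplification, not a measurement: `attemptsToComplete` and its parameters are hypothetical, and it assumes each watchdog window allows a fixed number of bytes to be downloaded (in practice, boot time and cellular reconnection eat into each window, and throughput varies).

```cpp
#include <cstdint>

// Toy model: how many boot cycles until a download of imageBytes finishes,
// if the watchdog resets the device after bytesPerWindow bytes each cycle.
// Returns the attempt number on which the download completes, or -1 if it
// does not complete within maxAttempts.
int attemptsToComplete(std::uint32_t imageBytes,
                       std::uint32_t bytesPerWindow,
                       bool resumable,
                       int maxAttempts) {
    std::uint32_t received = 0;
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (!resumable) {
            received = 0;  // older behavior: every reboot restarts from the beginning
        }
        const std::uint32_t remaining = imageBytes - received;
        if (bytesPerWindow >= remaining) {
            return attempt;  // finished before the watchdog fired
        }
        received += bytesPerWindow;  // progress survives only if resumable
    }
    return -1;
}
```

With illustrative numbers, a 1,000,000-byte image at 150,000 bytes per window completes on the 7th attempt when resumable, but never completes when each attempt restarts from zero, which matches the "constantly consume cellular data but never complete" case described above.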

Thank you for the concise and helpful response, @rickkas7.

Your description of the update process makes sense. It's reassuring to learn that system updates have the ability to resume their progress, given my devices are on v4.0.x. Sadly, I can't adjust my WDT timeout remotely, so if I update I'll just have to live with the fact that they'll likely cycle until completion.

Cheers.
