Device OS versions and Firmware OTA of pre-compiled binaries

We have a fleet of Electrons across two products. Most are running Device OS 0.7.0, but some are now running 1.0.1.

Using the CLI, we compile our source code into a firmware binary against a specific Device OS version (0.7.0 or 1.0.1) and use the firmware manager to release it.

Yes, you guessed it: this morning I bricked a unit around 1000 miles away from me. I flashed a binary compiled against 0.7.0 OTA to a unit running Device OS 1.0.1. As the unit uses a third-party SIM, I can no longer connect to it to downgrade the Device OS or flash a new binary.

The device has since power-cycled and is now completely offline. We have dispatched a replacement unit, but now I am worried we will accidentally do this again or, worse, deploy a binary compiled for a different version to the whole fleet. I could see this going horribly pear-shaped if we blasted out a new firmware across the product fleet.

Does anyone have any recommendations for managing a mixed fleet of 0.7.0 and 1.0.1 Device OS devices with regular firmware updates?

That shouldn’t usually pose a problem, since the version you target with an application build is only the minimum Device OS version required.
0.7.0 applications can normally run on 1.0.1 (unless your application is too big to fit into the space left by the newer system).
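
For anyone scripting a pre-release check, the minimum-version rule above could be sketched roughly as follows. This is only an illustration: OsVersion, parseOsVersion, olderThan, Outcome and classifyTarget are hypothetical names, not part of Device OS or the CLI, and the check says nothing about application size or application-specific incompatibilities.

```cpp
#include <climits>
#include <cstdio>
#include <tuple>

// Parsed Device OS version. rc stays at INT_MAX for a final release so that
// "0.8.0-rc5" sorts before "0.8.0". (Hypothetical helper type.)
struct OsVersion {
    int majorNum = 0;
    int minorNum = 0;
    int patchNum = 0;
    int rc = INT_MAX;
};

// Parses strings such as "1.0.1" or "0.8.0-rc4".
// Fields that cannot be parsed keep their defaults.
OsVersion parseOsVersion(const char* text) {
    OsVersion v;
    int rc = 0;
    int matched = std::sscanf(text, "%d.%d.%d-rc%d",
                              &v.majorNum, &v.minorNum, &v.patchNum, &rc);
    if (matched == 4) {
        v.rc = rc;
    }
    return v;
}

// True when x is an older version than y.
bool olderThan(const OsVersion& x, const OsVersion& y) {
    return std::tie(x.majorNum, x.minorNum, x.patchNum, x.rc) <
           std::tie(y.majorNum, y.minorNum, y.patchNum, y.rc);
}

enum class Outcome { RunsAsIs, NeedsDeviceOsUpdate };

// The build target is treated as the minimum Device OS version: a device at or
// above the target can run the binary as-is, an older device cannot.
Outcome classifyTarget(const char* target, const char* deviceOs) {
    return olderThan(parseOsVersion(deviceOs), parseOsVersion(target))
               ? Outcome::NeedsDeviceOsUpdate
               : Outcome::RunsAsIs;
}
```

With this sketch, classifyTarget("0.7.0", "1.0.1") yields Outcome::RunsAsIs, while classifyTarget("1.0.1", "0.7.0") yields Outcome::NeedsDeviceOsUpdate.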

Your absolute minimum amount of release work must be to test your new application firmware locally on every system firmware version it’s being released on. Otherwise something will eventually go wrong.

v0.7.0 is a fairly far cry from v1.0.1 as far as under-the-hood changes go (as ScruffR mentioned, the RAM usage of the system firmware generally goes up and up). You really shouldn’t be upgrading Device OS unless you have a specific reason to (just like a computer running essential software). If you only have a few devices on a different firmware version, I seriously recommend finding a way to consolidate and downgrade safely so that you have a consistent fleet. The more branches your product has, the harder it is to manage safely.

That said, the fact that your unit uses a 3rd-party SIM shouldn’t make a difference with what you are talking about. It sounds like you’ve just got a bug or two in your application firmware that you haven’t properly tested yet on that combination.

Now, what has changed with Device OS 1.0.1 is the way keepAlive is handled (in 1.0.1 it is sticky; before, it was not). This means the device may connect for 30 seconds and then disconnect from the cloud for 27 minutes. If this is the root of your problem, simply setting a corrected firmware version in a Particle product will ensure that the device receives new firmware in that 30-second window (and the transfer itself will keep the UDP hole alive).

Are you using Particle Products? If so, you should be using groups to distinguish between OS versions, and use groups to do controlled firmware releases to specific customers or device groups (delineated by device OS).

If you are truly compiling parallel firmware paths for different Device OS versions, you should have two separate products so that everything is safe. However, you should not do this unless strictly necessary. Instead, you should compile to the lowest firmware version (0.7.0) and simply test properly against all relevant release OS candidates.

After some more debugging, yes it appears to be code related. Here is what I have found…

The offending portion is where I am querying the cellular device (SARA-U260). The code block in question dynamically sets the third-party APN by querying the ICCID and looking at the first 6 digits to determine which SIM is inserted (Telstra, Particle, Vodafone or Optus). The code relevant to this post is…

void setAPNFromSIMCard() {
	CellularDevice dev;
	cellular_device_info(&dev, NULL);
	...
}
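
The prefix lookup described above (first 6 digits of the ICCID mapped to an APN) might look roughly like the sketch below. It is not the poster's code: ApnRule, kApnRules and apnForIccid are hypothetical names, and the prefixes and APN strings are placeholders that would need to be replaced with the real values for each carrier.

```cpp
#include <string>

// One lookup rule: an ICCID prefix and the APN to use for it.
struct ApnRule {
    const char* iccidPrefix;
    const char* apn;
};

// PLACEHOLDER values for illustration only. The real ICCID prefixes and
// APN names must come from the carriers or from SIMs you have on hand.
static const ApnRule kApnRules[] = {
    {"896101", "telstra.apn.example"},
    {"896102", "optus.apn.example"},
    {"896103", "vodafone.apn.example"},
};

// Returns the APN for the given ICCID, or an empty string when no rule
// matches, meaning "leave the default APN alone" (e.g. a Particle SIM).
std::string apnForIccid(const std::string& iccid) {
    for (const ApnRule& rule : kApnRules) {
        const std::string prefix(rule.iccidPrefix);
        if (iccid.compare(0, prefix.size(), prefix) == 0) {
            return rule.apn;
        }
    }
    return "";
}
```

Keeping the lookup a pure function of the ICCID string means it can be tested on a desktop without a modem attached; only the step that reads the ICCID needs the device.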

There appears to be a breaking change between 0.8.0-rc4 and rc5 with cellular_device_info.

Where Device OS = 0.7.0 and target compile 0.7.0 = works (our happy path)
Where Device OS = 0.8.0-rc4 and target compile 0.7.0 = works
Where Device OS = 0.8.0-rc5 (and beyond), target compile 0.7.0 = fails
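
One way to keep results like these from being forgotten is to record them in a table that a release script consults before flashing. The sketch below is only an illustration: TestedCombo, kTested and isReleaseAllowed are hypothetical names, and any combination that has not been tested is deliberately treated as blocked.

```cpp
#include <cstring>

// One locally tested combination of Device OS version and compile target.
struct TestedCombo {
    const char* deviceOs;
    const char* target;
    bool works;
};

// The three results listed above.
static const TestedCombo kTested[] = {
    {"0.7.0",     "0.7.0", true},
    {"0.8.0-rc4", "0.7.0", true},
    {"0.8.0-rc5", "0.7.0", false},
};

// Allows a release only for a combination that was tested and passed.
// Anything not in the table is blocked until someone tests it.
bool isReleaseAllowed(const char* deviceOs, const char* target) {
    for (const TestedCombo& combo : kTested) {
        if (std::strcmp(combo.deviceOs, deviceOs) == 0 &&
            std::strcmp(combo.target, target) == 0) {
            return combo.works;
        }
    }
    return false;
}
```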

The failure appears to be memory corruption. I have a uint8_t variable that tracks connection state, per the FSM. When running 0.8.0-rc4 the variable behaves as expected, but on 0.8.0-rc5 it becomes corrupted. As the FSM moves through the connection states, the variable changes from 0 > 1 > 2 > 3 etc. The corruption I am seeing in rc5+ is that the variable shows 217, which is clearly not right.

Is anyone aware of a breaking change in 0.8.0-rc5 that affects cellular_device_info?

I’d take a peek around mdm_hal.cpp and compare the before and after, particularly with respect to usage of the _dev variable which holds that device info.

See here for a difference between 0.8.0-rc4 and 0.8.0-rc5 (on the right)

Looks like an int dev; member was added to CellularDevice on 5/10/2018, and it is updated by cellular_device_info as of that same commit. This is likely the source of the incompatibility.
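
To illustrate how an added struct member can corrupt a neighbouring variable, here is a desktop simulation. It is a sketch under assumptions, not Device OS source: OldLayout and NewLayout are hypothetical stand-ins for the struct before and after the change, a byte array stands in for the application's RAM, and the marker value written is arbitrary. Whether the real function checks the caller's size field is not something this sketch claims to know.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical layouts: what the application was compiled against (old)
// and what the newer system fills in (new, with one extra member).
struct OldLayout { uint16_t size; char iccid[21]; char imei[16]; };
struct NewLayout { uint16_t size; char iccid[21]; char imei[16]; int dev; };

const int kDevValue = 0x5A5A5A5A;  // arbitrary marker, every byte is 0x5A

// Writes the new member without looking at the caller's declared size.
void fillUnchecked(unsigned char* info) {
    std::memcpy(info + offsetof(NewLayout, dev), &kDevValue, sizeof(kDevValue));
}

// Writes the new member only if the caller's declared size covers it.
void fillChecked(unsigned char* info) {
    uint16_t declared = 0;
    std::memcpy(&declared, info + offsetof(NewLayout, size), sizeof(declared));
    if (declared >= offsetof(NewLayout, dev) + sizeof(int)) {
        std::memcpy(info + offsetof(NewLayout, dev), &kDevValue, sizeof(kDevValue));
    }
}

// Simulates an application that allocated only OldLayout, with its state
// byte stored right behind it, and returns that byte after the call.
uint8_t stateAfterCall(bool checked) {
    unsigned char ram[sizeof(NewLayout) + 4] = {0};
    const std::size_t stateOffset = offsetof(NewLayout, dev);
    assert(stateOffset >= sizeof(OldLayout));  // state byte lies outside the old struct

    const uint16_t declared = sizeof(OldLayout);
    std::memcpy(ram + offsetof(OldLayout, size), &declared, sizeof(declared));
    ram[stateOffset] = 3;  // FSM state before the call

    if (checked) {
        fillChecked(ram);
    } else {
        fillUnchecked(ram);
    }
    return ram[stateOffset];
}
```

In this simulation stateAfterCall(false) no longer returns 3, because the write of the extra member lands on the state byte, while stateAfterCall(true) leaves the state intact.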

That’s the conclusion I came to as well. Looks like I will have to carefully manage a fleet-wide upgrade of the Device OS using an intermediary firmware that excludes the function call that invokes cellular_device_info (i.e. uses a hard-coded APN). Then, once every device’s OS has been upgraded, I can recompile against 1.x and release that firmware as the new production version. Thanks all for your help.

Loading firmware targeted at 1.0.1 onto a 0.7.0 device will invoke a Device OS update before the app firmware runs, so the intermediary firmware isn’t necessary. The larger risk is using a third-party APN if the cell modem is reset for any reason. During a safe-mode Device OS update, your app firmware won’t be able to restore the APN (as it can’t run).

But this is the case for any automatic update of Device-OS using 3rd party APN at this time, not because of the cellular_device_info incompatibility.