Corrupt OTA Update With Third-Party SIM Bricks Device - Any Mitigation?

I’ve avoided updating the System Firmware on my devices for stability and consistency reasons in a production environment, but a recent need to deal with Particle device private key issues has forced me to consider moving to v0.7.0 on the Electron.

When testing the OTA update process on the bench, I flashed my firmware OTA and got a “success” fw-update event. The device reset, but it is no longer running my user firmware! This is obvious because one of the first lines of my code turns an LED blue.

Further, I use a third-party SIM. For some reason, that reset seems to have also wiped the SIM credentials from the modem flash, so now the device is unable to connect to the cell network.

My assumption is that the device is running whatever is in the factory backup section of memory, which obviously is very problematic in my use case as is.

How can I best plan for and mitigate this? Should I flash a bare minimum firmware that sets my APN into the backup flash when initially programming devices? With the hundreds of remote devices I have, how can I ensure my OTA update doesn’t brick my devices?

I feel caught between a rock and a hard place: if I don’t update the system firmware, my devices get bricked by bad Particle keys (this happened to a customer recently, within an hour of replacing a device that had a bad key), and if I do update, there seems to be a case where the devices get bricked anyway. Obviously sh*t happens sometimes, but is there any way I can mitigate this?

I am guessing that if you have hundreds of devices, you have one or more products and can use the console to manage firmware OTA updates? If you already do, then apologies, as this will no doubt be too simplistic an answer. I have been using groups to very carefully manage upgrading a fleet of about 500 Photons: I force updates to small groups, then monitor their restarts to check that Device OS has updated correctly and that they appear normal. To avoid impacting users of the devices, this is done out of hours, so it has been a slow process given that some sites switch off WiFi at night!

On Gen 3 devices, the APN information is saved in configuration flash. On the Electron and E Series, it’s stored in the user firmware and in the RAM in the modem.

In the normal upgrade sequence you upgrade the user firmware. If a Device OS upgrade is required, the device enters safe mode, then the safe mode healer flashes the necessary system parts and sometimes the bootloader.

The problem with 3rd-party SIM cards is that if this process gets interrupted such that the modem is completely powered off, then the APN settings will be lost. However, since you’re in safe mode, there’s no way to get the APN set back, because it can only be set from user firmware, which doesn’t run in safe mode.

There is no good solution that will work all of the time for this, unfortunately.
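
As a rough illustration of the sequence described above, here is a toy model in plain C++. It is only a sketch of the failure mode: `Modem`, `BootMode`, and `boot` are hypothetical names made up for this example and are not Device OS APIs.

```cpp
#include <string>

// Toy model only: these types are hypothetical and are not Device OS APIs.
struct Modem {
    std::string apn;  // held in modem RAM, so it does not survive a power loss

    void powerCycle() { apn.clear(); }

    bool canAttach(const std::string& requiredApn) const {
        return apn == requiredApn;
    }
};

enum class BootMode { UserFirmware, SafeMode };

// On a normal boot the user firmware re-applies the APN.
// In safe mode the user firmware never runs, so nothing sets it.
void boot(Modem& modem, BootMode mode, const std::string& userApn) {
    if (mode == BootMode::UserFirmware) {
        modem.apn = userApn;
    }
}
```

In this model, a safe mode boot with the modem still powered keeps the APN that user firmware set earlier, while a modem power cycle followed by a safe mode boot leaves the APN empty, so the device cannot attach and cannot receive the system parts it needs to leave safe mode.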

How does Particle.keepAlive() play into this?
Since Gen 3 stores the APN in configuration flash, is the keepAlive also stored there and reapplied automatically, or do we still need to set it in code?

The keepAlive is also an issue. However, in theory during a safe mode OTA flash it should all happen quickly, so even a 30 second keep alive shouldn’t be a problem.

However if you were to sit in safe mode and didn’t get a safe mode healer update, that could cause problems. But that should not normally happen.

I’ve dealt with this issue in the past, and I’ve used a custom compiled device-os binary to get around this. The short version of how to do this is:

  • Clone device-os from Particle’s GH repo
  • Find the relevant parts regarding default APN and change them to your APN or add them to the list the device-os checks for. I think current device-os checks for IMSI and defaults to either Twilio or Telefonica.
  • Compile and flash onto device

This has worked fine for me with 0.6.1 device-os, but it’s not ideal. My devices with this device-os are basically going to stay on 0.6.1 forever, as I cannot foresee a reason to update them.

Also, I stopped doing this last year for new devices, since I made a deal to use Particle SIMs. I came to the conclusion that using Particle SIMs is much safer, as 3rd party SIMs will always be second-class citizens on Particle’s device-os.

One more thought on this – I’d be interested to know if Particle would accept pull requests for IMSI-based APN settings? This would be one step in the direction of making 3rd-party SIMs easier and more reliable.
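
To make the IMSI-based idea concrete, here is a minimal sketch of a prefix lookup in plain C++. It is not the code in device-os: `ApnEntry`, `kApnTable`, and `selectApn` are hypothetical names, and the prefixes and APN strings are placeholders rather than real carrier values.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical table entry: an IMSI prefix (MCC + MNC digits) and the APN to use.
struct ApnEntry {
    const char* imsiPrefix;
    const char* apn;
};

// Placeholder values for illustration; not the table Device OS ships with.
static const ApnEntry kApnTable[] = {
    {"99901",  "apn.example-carrier-a.com"},
    {"999020", "apn.example-carrier-b.com"},
};

// Returns the APN for the longest matching prefix, or fallbackApn if none match.
const char* selectApn(const char* imsi, const char* fallbackApn) {
    if (imsi == nullptr) {
        return fallbackApn;
    }
    const char* best = fallbackApn;
    std::size_t bestLen = 0;
    for (const ApnEntry& entry : kApnTable) {
        const std::size_t len = std::strlen(entry.imsiPrefix);
        if (len > bestLen && std::strncmp(imsi, entry.imsiPrefix, len) == 0) {
            best = entry.apn;
            bestLen = len;
        }
    }
    return best;
}
```

Because the lookup would live in system firmware rather than user firmware, a device in safe mode could still pick the right APN after a modem power cycle, which is the gap described earlier in the thread.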

I am guessing that if you have hundreds of devices you have a product(s) and can use the console to manage firmware OTA updates?

@armor Yes, I do use that. In this case I am testing pre-production firmware that I have not yet formally released to my product, but the flashing process would otherwise be identical to changing firmware versions via the product. Unfortunately this isn't the kind of issue that can be fixed by proper monitoring: it will either succeed (99.9% of the time) or fail catastrophically (in this particular scenario).

The keepAlive is also an issue. However, in theory during a safe mode OTA flash it should all happen quickly, so even a 30 second keep alive shouldn’t be a problem.

Also, in theory, the keepAlive isn't required during OTA: the constant flow of data back and forth already keeps the UDP connection alive, I believe.

The problem with 3rd-party SIM cards is that if this process gets interrupted such that the modem is completely powered off, then the APN settings will be lost.

Gotcha! This at least makes sense to me. I use a hardware watchdog that actually triggers a power reset of my entire hardware (I have some other devices powered through custom PoE by my board). When I watched this happen, I think the watchdog may for some reason have been triggered during the update.

I may be able to find a safe way to kick my watchdog (it's a 10-minute timer, so it normally shouldn't be an issue for OTA) upon successful completion of an OTA update, but before the reset. This should give me a statistically reasonable chance of avoiding a power reset during the update.
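
A minimal sketch of that idea in plain C++, so it can be run off-device. Everything here is hypothetical: `OtaPhase`, `OtaWatchdogGuard`, and `kickHardwareWatchdog` are names invented for this example, and the counter stands in for whatever GPIO toggle the real watchdog needs. On the device, my assumption is that `onOtaEvent` would be called from a system event handler for firmware update events, if the Device OS version in use exposes one.

```cpp
// Hypothetical names throughout; this is not a Device OS API.
enum class OtaPhase { Begin, Progress, Complete, Failed };

// Stand-in for the real GPIO toggle; counts kicks so the sketch can run off-device.
static int g_kickCount = 0;
void kickHardwareWatchdog() { ++g_kickCount; }

struct OtaWatchdogGuard {
    void (*kick)();  // board-specific routine that restarts the external watchdog timer

    // Kick when the transfer starts and again when it completes, so the
    // reset that follows begins with a full watchdog timeout window.
    void onOtaEvent(OtaPhase phase) const {
        if (kick != nullptr &&
            (phase == OtaPhase::Begin || phase == OtaPhase::Complete)) {
            kick();
        }
    }
};
```

Since user firmware does not run in safe mode, the kick on completion is the last chance to restart the timer before any system parts are flashed; the 10-minute window then has to cover the whole safe mode update.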

I think that properly addresses the meat of my question (the mitigation is to avoid power resets during System Firmware OTA like the devil), so I'll mark that as the solution, at least for now. Thanks as always, @rickkas7!

I’ve dealt with this issue in the past, and I’ve used a custom compiled device-os binary to get around this.

Thanks @hwestbrook! Appreciate hearing your experience. This was something I also considered doing. I would be comfortable doing that in theory, but I feel like it may be overkill if I can mostly mitigate the issue by better managing my watchdog. I am generally very conservative when it comes to Device OS updates, so I could probably stomach it, but it sounds difficult to scale.

I use a lot of data per month (250-750 MB), so I've been told so far that a similar deal will not be possible for me in the foreseeable future.

One more thought on this – I’d be interested to know if Particle would accept pull requests for IMSI-based APN settings? This would be one step in the direction of making 3rd-party SIMs easier and more reliable.

Instead of a PR, wouldn't it be easier to provide a way to manage APN settings like Photons do with WiFi credentials? Have some permanent defaults in System Firmware Program Flash, but also have a special spot in flash where an APN setting can be stored that can be loaded in a way where it is simply added to the available options in the System Firmware, if such a setting is set and readable from flash. The normal APN setting in user firmware would then still work as normal.
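
Roughly what I have in mind, as a sketch in plain C++ rather than anything that exists in System Firmware today. The record layout, the magic value, and the function names are all hypothetical; the point is only that an erased or corrupt slot falls back to the built-in default.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical record layout for an APN kept in a reserved flash region.
struct StoredApn {
    std::uint32_t magic;     // marks the slot as written
    char apn[64];            // NUL-terminated APN string
    std::uint32_t checksum;  // simple sum over apn[], to catch erased or corrupt data
};

constexpr std::uint32_t kApnMagic = 0x41504E31;  // "APN1", an arbitrary marker

std::uint32_t apnChecksum(const char (&apn)[64]) {
    std::uint32_t sum = 0;
    for (char c : apn) {
        sum += static_cast<unsigned char>(c);
    }
    return sum;
}

// Returns true and copies the APN out only when the record looks valid.
bool readStoredApn(const StoredApn& rec, char (&out)[64]) {
    if (rec.magic != kApnMagic) return false;
    if (rec.apn[0] == '\0' || rec.apn[63] != '\0') return false;
    if (rec.checksum != apnChecksum(rec.apn)) return false;
    std::memcpy(out, rec.apn, sizeof(out));
    return true;
}

// Precedence sketch: the stored APN if valid, else the built-in default.
const char* effectiveApn(const StoredApn& rec, const char* builtInDefault,
                         char (&scratch)[64]) {
    return readStoredApn(rec, scratch) ? scratch : builtInDefault;
}
```

An APN set from user firmware would still override whatever this returns, so existing applications would behave exactly as they do now.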

Unfortunately, after some more testing, I’ve had more devices go dead while trying to do this, even in very ideal situations. I guess this means I cannot upgrade System Firmware safely and will have to make v0.6.4 work for me for this hardware iteration. It's really a big bummer, since this was something I was counting on being able to do OTA. It doesn’t sound like I have much of an option, given that this has happened on 3 of my last 3 attempts.