Product OTA handler

Hi there fellow Particle geeks!

I am having some trouble with auto-updates (OTA) with Particle Electrons that belong to a product group. Wondering if someone can point me in the right direction.

First, the devices are currently on 0.6.0 with a 3rd party SIM.
Keepalive is set at 30 currently.
SYSTEM_MODE(SEMI_AUTOMATIC);
Typical setup where the device comes out of sleep, collects some sensor data, publishes it, then goes back to sleep.

I have the following in my setup function.
System.on(firmware_update_pending, otaHandler);

For the otaHandler, I have the following.
void otaHandler () {
    Serial.println("Update is pending. Will reboot after update.");
    if (Particle.connected() == false) {
        Particle.connect();
    }
    Particle.publish("Updating pending.", "True", 60, PRIVATE); //for troubleshooting
    loop_count = 0;
    while (loop_count < 6) {
        Particle.process();
        delay(10000);
        loop_count++;
    }
}

This might be overkill, but I’m just trying to keep it alive long enough to get the update.
Currently in the event logs on the console, I will see
spark/status - auto-update

I will get my “Update pending” publish indicating that the otaHandler is running.

The device will continue breathing cyan. Then after about a minute I will see
spark/flash/status - failed.

The device will continue breathing cyan for a while and then will go back to sleep.

I have let this run through its normal course several times and it just fails to update the device with the new firmware.

I have also re-compiled the firmware a few times to make sure I was targeting the correct system firmware, and I made sure a device marked as DEV was running this firmware successfully before releasing it to the product.

I have also attempted to update the DEV device to 0.7.0, compiling the firmware targeting 0.7.0 and releasing. This gives some interesting results.

When updating a device OTA, the same thing happens as above, but the device goes into safe mode to update the system firmware from 0.6.0 to 0.7.0. This indicates to me that the process is happening as I would expect. The device then goes to sleep and wakes up as expected for the next check-in, but the process starts over, indicating that it failed to update the system firmware as well.

There are potentially a few things going on, but there are two major things to fix first:

  • You probably should not block for 60 seconds in your otaHandler. It’s not intended to work that way. You should set a flag and return from it immediately, and do the waiting in loop().
  • Don’t delay for 10000 milliseconds 6 times. Instead, delay for 100 ms 600 times. The reason is that only one chunk of the OTA upgrade is processed on each call to delay() and Particle.process(), so you’re not calling them often enough to finish the update properly.

Hey Rick,
Sorry for being dumb but when you say “should not block for 60 seconds” are you referring to the 10 second delay in the loop?

You should always return from system event handlers as quickly as possible. Set a flag variable and return.

Check the flag in loop(), and if an update is in progress, make sure you don’t sleep for 60 seconds. Actually, I’d use a millis()-based test and return from loop() as well.
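A minimal sketch of what a millis()-based test could look like. OtaWaitWindow is a hypothetical helper name, not a Device OS type, and the 60-second window is only an assumption carried over from the original handler. It is written as plain C++ so the timing logic can be checked off-device; on the device you would pass millis() as nowMs.

```cpp
#include <cstdint>

// Hypothetical helper: tracks a "stay awake for OTA" window without blocking.
struct OtaWaitWindow {
    bool     active  = false;  // true once the pending-update flag has been seen
    uint32_t startMs = 0;      // millis() value when the window was opened

    // Call once from loop() when the flag set by the event handler is first observed.
    void begin(uint32_t nowMs) {
        active  = true;
        startMs = nowMs;
    }

    // True while the device should stay awake and keep returning from loop().
    // Unsigned subtraction keeps the comparison correct across millis() rollover.
    bool shouldStayAwake(uint32_t nowMs, uint32_t windowMs) const {
        return active && (nowMs - startMs) < windowMs;
    }
};
```

In loop() this would be used roughly as: if the flag is set and the window is not yet active, call begin(millis()); then, while shouldStayAwake(millis(), 60000) is true, simply return from loop() instead of sleeping, so the system keeps getting time to process update chunks.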

Gotcha. So is this kinda what you are thinking?

bool updateOccuring = false;

//AFTER WE GRAB SENSOR DATA, CONNECT, AND PUBLISH
   if (updateOccuring){
        downloadUpdate();
   }

void otaHandler () {
    Serial.println("Update is pending. Will reboot after update.");
    updateOccuring = true;
}

void downloadUpdate(){
    Particle.publish("Updating pending.", "True", 60, PRIVATE); //for troubleshooting
    loop_count = 0;
    while (loop_count < 600) {
      Particle.process();
      delay(100);
      loop_count++;
    }
}


Yes, that should work better, I think.

Oh man. That worked like a charm!
How do I leave a tip and mark this issue as closed?! :)


It looks like firmware_update_pending is no longer working, at least as of Device OS 1.2.1. Is this because it was moved into the System.updatesPending() function that is now only available to enterprise customers?

OTA has become a bit of a nightmare for sleepy devices.

It seems that neither firmware_update_pending nor System.updatesPending() is working (yes, we are an Enterprise customer).

The only thing I have been able to make work is to periodically kill the current session using Particle.publish("spark/device/session/end", "", PRIVATE); and then wait for a couple of minutes to see if an OTA might be needed. This is BAD. It wastes a ton of battery and it is a brute-force kludge, IMHO.

I would welcome suggestions on another approach!

Depending on system firmware version, I think you may be misunderstanding how the OTA process works. Historically, OTA only happens on a handshake with the cloud. Hence why it works when you kill the session and reconnect (which btw shouldn’t take “minutes”, it usually takes 1-5 seconds). The firmware level update pending things don’t know about OTA updates at the cloud level until they hit the device after the handshake.

The newest Device OS has some changes, but you haven’t said which version you are running. If you are on it, have you read through all of the relevant documentation?

Can you be more clear about your System Firmware version and what you are actually trying to do? Saying that it is “not working” unfortunately is not specific enough to provide a recommendation on addressing your issue.


Currently running 1.4.0 on most of the fleet. Just tried 1.4.1-rc1 as well. Neither firmware_update_pending nor System.updatesPending() had any effect on 1.3.1 or 1.3.0 either.

I have found that if I only give OTA a few seconds after forcing a new session that the update will fail to complete.

Sure, I should clarify.

Regarding "not working" : I was unable to get a status from this handler. But since I posted that reply, I have come across why that is.

Currently on 1.0.1 to 1.2.1, my issue is not with detecting when an OTA update has started, which is what this handler does, but with knowing when to check so that I can hang around longer to let the update process play out. This is particularly important for the ever-increasing number of devices that spend most of their time asleep and are purposefully programmed to wake, connect, and sleep as quickly as possible to save battery. Staying connected for "10 seconds or so" with fingers crossed just isn't a good solution.

In my experience, a handler can be set up to detect when an update has started so that you can keep the device connected/on long enough for the update process to do its thing. That's not a problem. The problem is that there is no apparent documented "step" or "time" when this happens, aside from the "give it 10 seconds or so" posts I've seen in the community forums. In my experience it can happen after 3 seconds and sometimes takes 20+. So if your code runs through, and the device connects, publishes, etc. before your handler can see that there is an update coming through, the update either doesn't happen or fails spectacularly. This is particularly troublesome when you have SYSTEM_THREAD(ENABLED) because you are likely to send your device to sleep during an update.

Two big selling points for the Electron/Boron are remote placement (i.e. battery/solar installations) and OTA firmware update deployments. So I am hopeful that there is an elegant solution that I am overlooking.

Here is another thread with some logs and some better descriptions of what is happening.

With SYSTEM_THREAD the issue is partly that you need to make sure you are calling Particle.process() appropriately (since this is what checks to see if the handler needs to be called).

From what I remember of investigating OTA thoroughly in the past, there is a period of 10-30 seconds during which the OTA update is downloaded BEFORE any of the firmware flags are set to notify you. Thus the time is dependent on your connection.

From the docs for System.updatesPending():

System.updatesPending() indicates if there is a firmware update pending that was not delivered to the device while updates were disabled.

This means that this is only applicable for the use case where you leave updates disabled normally, and need a very controlled device update process in your user firmware.

Intelligent OTA simply means that the update is attempted immediately instead of waiting for the next handshake.

I don't believe there is any flag accessible to the user set when the cloud intends to update but before the update gets downloaded. If you need that you may need to edit the system firmware yourself.

HOWEVER, I think you can easily address your core issue here. Do you need to check for OTA every single wake? Is it acceptable for a check to happen once a day? If so, do the following in order to perform a longer "wait to check for OTA" once a day.

Assumptions:
Device comes online periodically, say once an hour.
Updates happening within a day is acceptable.
If you need a different time scale and only want to check once, you can add a minute window as well to the time check.

Solution:

const uint32_t delay_for_OTA = 30000;
bool ota_ongoing = false;
bool normal_stuff_done_ready_to_sleep = false;

void firmware_update_handler(system_event_t event, int param)
{
    if (param == firmware_update_begin) {
        ota_ongoing = true;
    }
    else if (param == firmware_update_progress) {
        // do nothing, this is kinda spammy
    }
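    else if (param == firmware_update_complete) {
        ota_ongoing = true;  // keep the device awake until the reset below
        delay(200);
        // The original post called Resets.reset_now(RESET_REASON_FIRMWARE_UPDATE),
        // a project-specific helper that is not shown; System.reset() is the stock call.
        System.reset();
    }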
    else if (param == firmware_update_failed) {
        ota_ongoing = false;
        // retry
        if (Particle.connected()) Particle.publish("spark/device/session/end", "", PRIVATE);
    }
}

void firmware_update_pending_handler()
{
    // System.enableUpdates();  // if desired and previously disabled
    ota_ongoing = System.updatesEnabled();
}

void setup() {
    System.on(firmware_update, firmware_update_handler);
    System.on(firmware_update_pending, firmware_update_pending_handler);
    // your setup stuff here
}

void loop() {
  // all your other stuff, sets 'normal_stuff_done_ready_to_sleep = true;' when done

  if (Particle.connected()) Particle.process();
  if (normal_stuff_done_ready_to_sleep) {
    // wait until particle connects to ensure valid timestamp
    if (Time.isValid()) {
      if (Time.hour() == 0) {
        // daily check: stay awake long enough for an OTA to start
        if (millis() > delay_for_OTA && !ota_ongoing) {
          System.sleep(...);
        }
      }
      else {
        System.sleep(...);
      }
    }
  }
}

This will blow a bunch of data as several (up to several dozen) update packets will come in every time an update is aborted by an early shutdown.

Not a great solution.

As I said, there are levers you can pull to adjust based on your needs. You didn’t say that data use was a huge constraint here. Instead of saying “Not a great solution.” say “this may not work as well if you are very data use constrained”. No need to be so brusque.

First of all, you only use extra data when you perform an OTA. This shouldn’t be very frequent (how often are you updating production device firmware?) in most cases so for most users this is an acceptable tradeoff. I also don’t even explicitly know that multiple packets are coming in before this process. That’s an assumption that should be verified. Might not be the case.

Nonetheless, if you are concerned about the data use, check for an update more frequently, or simply reset your device earlier and more aggressively to cut it off. Or, use intelligent OTA to explicitly DISABLE updates normally and re-enable them during your daily check. In theory this should prevent the update from even attempting unless force updates is on for your device. However, this approach doesn’t work unless you are running an enterprise product with intelligent OTA, so it’s not as universal of a solution.
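A rough sketch of the "disabled normally, enabled during the daily check" idea. The function names and the 10-minute window starting at midnight are assumptions for illustration only. It is plain C++ so the window logic can be checked off-device; on the device, hour and minute would come from Time.hour() and Time.minute().

```cpp
// True when the given local time falls inside the daily OTA check window.
// The window starts at windowHour:00 and lasts windowMinutes minutes.
bool inOtaWindow(int hour, int minute, int windowHour, int windowMinutes) {
    return hour == windowHour && minute >= 0 && minute < windowMinutes;
}

// What the firmware should do with the update setting right now.
enum class UpdatePolicy { Enable, Disable };

// Assumed policy: allow updates only in the first 10 minutes after midnight.
UpdatePolicy policyFor(int hour, int minute) {
    return inOtaWindow(hour, minute, 0, 10) ? UpdatePolicy::Enable
                                            : UpdatePolicy::Disable;
}
```

On the device, Enable would map to System.enableUpdates() and Disable to System.disableUpdates(), called after connecting and before deciding whether to sleep.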

You may have to make tradeoffs and sacrifices to get the features you want. In this case, that may be battery life vs data use. That’s just the reality of engineering. The fact that you have to make a tradeoff doesn’t make something “not a great solution” across the board.

I don't mean to be short with you. I usually assume people have better things to do than read my prose, so I try to keep it as short and direct as possible!

Unfortunately the replies so far have been limited to things we've already tried or considered, and have dismissed for one reason or another.

What I require out of a "good solution" is something that I can check against that reveals the state of the system. Currently that state is hidden and the only alternative is to wait for an undefined period of time that may change based on unknowable variables.

Fortunately these forums are monitored by devs and community outreach people who can pick up on issues if they feel they are important enough to address. In that we're lucky, this is a far cry from trying to convince an FAE from XXXX large IC company to maybe look into an issue that is preventing your project from working correctly.

I just want to be clear that the possible ways to mitigate this issue do not really "solve" the problem but we will probably end up doing something like that to get by.

I’m not sure I understand. The state of which system, and from where are you checking it? I’m not immediately sure how this relates to the above.

But to your other points, I’ve found that there usually is something that can be done with a few workarounds that gets a full-coverage solution. Just requires some legwork and digging around in the less-documented parts of the system firmware.

Not to oversimplify, but I think “ideally” what I would look for is some way for our code to see that there is an update available/pending. If yes, then we can wait around for the update to start. If no, we shut down/go to sleep/whatever.

Currently, as far as I can tell, there is no sure-fire way to do this. It seems the current answer is to just stay awake and online for an undetermined amount of time with fingers crossed that the update starts. Regardless if this is on each connection, once a day, once a week, etc, it still seems shaky.

For instance, say we decide to check once a day and we let our device stay connected and ready with Particle.process() firing as needed for 10 seconds. Are we going to have some devices in rural areas with less than perfect connections that are effectively going to be unable to update? So is it 20 seconds, before it would “know” that an update has started? Maybe 30?

You can see how this just seems unreliable for production equipment.

> Are we going to have some devices in rural areas with less than perfect connections that are effectively going to be unable to update?

Sure, that's why you might make it 30 seconds after the device has connected to the Particle cloud, to account for differences in connection timeframes.
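A small sketch of timing the wait from the moment the cloud connection comes up rather than from boot. CloudGraceTimer and okToSleep are hypothetical names, and the 30-second grace period is just the figure mentioned above. It is plain C++ so it can be checked off-device; on the device you would feed it Particle.connected() and millis() from loop().

```cpp
#include <cstdint>

// Hypothetical helper: remembers when the cloud connection was established.
struct CloudGraceTimer {
    bool     wasConnected  = false;
    uint32_t connectedAtMs = 0;

    // Call on every pass through loop() with the current connection state.
    void update(bool connected, uint32_t nowMs) {
        if (connected && !wasConnected) {
            connectedAtMs = nowMs;  // rising edge: connection just came up
        }
        wasConnected = connected;
    }

    // True once the device has been connected for at least graceMs.
    bool graceElapsed(uint32_t nowMs, uint32_t graceMs) const {
        return wasConnected && (nowMs - connectedAtMs) >= graceMs;
    }
};

// Sleep only after the grace period has passed and no OTA is in progress.
bool okToSleep(const CloudGraceTimer& t, uint32_t nowMs, uint32_t graceMs,
               bool otaOngoing) {
    return !otaOngoing && t.graceElapsed(nowMs, graceMs);
}
```

Note that this sketch never allows sleep if the device never connects, so a real sketch would also want an overall connection timeout to protect the battery.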

> You can see how this just seems unreliable for production equipment.

Most importantly, and I can't stress this enough, you can't assume any third party hardware or software is 100% production reliable. You must do your own testing and validation and tweak things to your individual project's standards.

Particle isn't perfect. It's a fantastic platform, but you still need to do some testing and validation. There will always be bugs you can't predict. Thus it's critical that you test these in your edge case environments, run pilots, etc.

The approaches I've laid out are based on approaches I've taken with nearly 1000 photons + electrons that I have across North America in both very rural and urban environments. I may not have had the precise same constraints you have, but I think they are good enough to be worth trying on your devices and validating how reliable they are.

> So is it 20 seconds, before it would “know” that an update has started? Maybe 30?

This is pretty straightforward to check empirically. Get a sense of the limits you are seeing in adverse conditions and add an appropriate buffer for safety. Run tests over and over with a script and a test product by programmatically creating new firmware versions, and measure success. You could get 1000s of data points in a week per device under test. It would be a great way to measure the reliability.
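As a rough illustration of turning those measurements into a wait window, here is a sketch that takes the worst observed delay and pads it with a safety margin. The function names, the sample values, and the 50% margin are all made up for the example; real numbers would come from your own test runs.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Suggested wait window: worst observed time-to-update-start plus a margin.
// observedMs holds measured delays in milliseconds; marginPercent is the
// safety buffer, e.g. 50 for +50%. Returns 0 if there are no measurements.
uint32_t suggestWindowMs(const std::vector<uint32_t>& observedMs,
                         uint32_t marginPercent) {
    if (observedMs.empty()) return 0;
    const uint64_t worst = *std::max_element(observedMs.begin(), observedMs.end());
    const uint64_t padded = worst + (worst * marginPercent) / 100;
    return padded > UINT32_MAX ? UINT32_MAX : static_cast<uint32_t>(padded);
}

// Fraction of test updates that completed, for tracking reliability over runs.
double successRate(uint32_t succeeded, uint32_t attempted) {
    return attempted == 0 ? 0.0 : static_cast<double>(succeeded) / attempted;
}
```

For example, with measured delays of 3, 12, and 21 seconds and a 50% margin, suggestWindowMs suggests a window of 31.5 seconds.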

If the download time frame feels too long, then maybe you don't do it once a day but maybe you do it once a week or once a month. There are some very achievable options that give you reliability. They just may require tradeoffs with other needs.

I agree in some ways, completely disagree in others. Particle provides hardware specifically for commercial customers as an OEM component (and supporting service). For instance, if they ship you a wifi chip, you expect that it works on common customer wifi networks and won’t refuse to connect to network SSIDs that contain a space or start with a number. There is a base level of expectation because, as small operators, we don’t have a laboratory and 5 engineers running tests. That’s Particle’s responsibility, and the example I provided is just a sample of the things that have gone wrong this year.

It sounds like a bug if OTA doesn’t work for battery constrained appliances considering that would be a large chunk of the market.

Personally I’d roll my own OTA mechanism. Go online, post a webhook to receive a JSON response on whether you’ve flagged that device for update. If there is one, stay online and post another to say you’re ready. Your server then triggers a particle flash to that device. Each time you restart, publish your firmware version, which removes (or adds) the update flag on the server based on the desired fleet state.
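A minimal sketch of the server-side bookkeeping this implies, assuming firmware versions are plain integers and devices are keyed by ID. UpdateFlags and its methods are hypothetical names; the webhook transport, the JSON encoding, and the actual flash trigger are left out.

```cpp
#include <map>
#include <string>

// Hypothetical server-side record of which devices still need the update.
class UpdateFlags {
public:
    explicit UpdateFlags(int desiredVersion) : desired_(desiredVersion) {}

    // Called when a device publishes its firmware version after a restart.
    // The flag is set if the device is not on the desired version, else cleared.
    void reportVersion(const std::string& deviceId, int version) {
        flagged_[deviceId] = (version != desired_);
    }

    // Answer to the device's "is there an update for me?" webhook.
    // Devices that have never reported a version are treated as not flagged.
    bool updateFlagged(const std::string& deviceId) const {
        auto it = flagged_.find(deviceId);
        return it != flagged_.end() && it->second;
    }

private:
    int desired_;
    std::map<std::string, bool> flagged_;
};
```

The device would stay online only when updateFlagged() comes back true, and the flag clears itself the next time the device reports the desired version.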

Bit like your point, the less you have to trust another company the better.