Electron OTA Hard Fault Requiring Power Reboot


#21

If you are using Particle pub/sub, then I would expect you shouldn’t have any UDP-related issues (and if it’s an Electron, TCP isn’t even remotely a factor). I’m guessing your issues stem from something else.

I use a TPL5010 timer as an external watchdog. I don’t believe there is an accessible way to use the IWDG. The software watchdog can be helpful depending on your application, though! Much better than nothing.
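To make the check-in idea concrete, here is a minimal host-side model of a software watchdog. It is only a sketch: the class and method names (`SoftWatchdog`, `checkin`, `poll`) are hypothetical, and the clock is passed in by hand so it can run on a desktop. On the device itself, my understanding is that you would use Device OS's `ApplicationWatchdog` rather than rolling your own.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Host-side model of a check-in watchdog. All names here are hypothetical.
class SoftWatchdog {
public:
    SoftWatchdog(uint32_t timeoutMs, std::function<void()> onExpire)
        : timeoutMs_(timeoutMs), onExpire_(std::move(onExpire)) {}

    // Called by the application whenever it makes progress.
    void checkin(uint32_t nowMs) {
        lastCheckinMs_ = nowMs;
        fired_ = false;
    }

    // Called periodically from an independent context (a timer or a second thread).
    // Fires the handler once if the application has been silent for too long.
    void poll(uint32_t nowMs) {
        // Unsigned subtraction stays correct across a millisecond-counter rollover.
        if (!fired_ && nowMs - lastCheckinMs_ >= timeoutMs_) {
            fired_ = true;
            onExpire_();  // on a real device this would be a reset
        }
    }

private:
    uint32_t timeoutMs_;
    std::function<void()> onExpire_;
    uint32_t lastCheckinMs_ = 0;
    bool fired_ = false;
};
```

The catch is that the polling context has to keep running for this to help. If the whole CPU hangs, only an external or hardware watchdog will recover the device.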

I assume you are using Particle SIMs?

Can you at least share some pseudo code for the steps you are taking? When you say “lock up our app” and “shutdown/stop operations” those details could be important.

Regardless, my recommendation stands to try and make sure your code can handle OTA without safe mode (but also handle it with safe mode if something goes wrong). Why did you make the move to using safe mode? Essentially the device already puts itself into safe mode once you flash user firmware with a higher system firmware target.

Also, just so you know, the OTA process should be completely safe as far as upgrading through the different version thresholds. You can just flash user firmware with 1.4.1 and it will take care of everything for you. It’s just different for flashing locally / explicitly.


#22

Our firmware management flow is a relic of a time before the improved firmware flow and products.

Essentially, when our cloud (not Particle’s, we do our own firmware management) sends a message to the device, the device enters safe mode as soon as possible. Our cloud then reads the safe-mode state and sends firmware to the device.

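// Dispatches control messages from our cloud.
void handle_message(Message msg)
{
  switch (msg.event)
  {
    case FirmwareUpdateIncoming:
      internal_message_bus->prepareForUpgrade(); // this will send ReadyForUpgrade when done
      break;
    case ReadyForUpgrade:
      System.enterSafeMode(); // resets the device and reboots it into safe mode
      break;
    default:
      break; // ignore unrelated events
  }
}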

Right now we only have one Electron that has encountered this state (Processor in stop mode? Status LED un-powered, pressing reset button/power cycle fixed). We can’t send out people to fix/replace all our devices, and we can’t reproduce this condition in a lab.
I’m looking for a software solution for this problem. We’re looking to safely upgrade our existing fleet without needing physical intervention.

Essentially:
I’m using SYSTEM_THREAD(ENABLED) and enter safe mode pre-OTA.
Assuming Safe-Mode is entered, is there anything else I should be doing to prevent my app from interfering with the OTA process (for versions 0.5.3 -> 0.6.4 -> 0.7.0 -> 1.2.1)?

Trying to avoid a state where OTA occurs, but the device doesn’t reset automatically.


#23

So, you are sending the firmware incoming message separately from the beginning of the firmware update itself?

Entering safe mode will reset the device and interrupt any existing update if applicable.

If you are in safe mode itself there is nothing you can do in your app because your user code isn’t running. But here is a hypothetical:

You send your “prep for fw update” message
Device boots into safe mode
You receive an “in safe mode” message and start the fw update
The device has timed out of safe mode (or for any other reason exited it)
The device receives an OTA update while user firmware is running, not in safe mode.

What have you done to prevent this? Are updates enabled? Are these events coming from the system event bus or a user firmware event bus? When you say your own cloud, are you running a customized version of the Particle Cloud? How confident are you that the issue isn’t in your implementation of the cloud code?
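One way to narrow the window in that hypothetical is to treat each safe-mode event as a one-shot, time-limited permission to flash. The sketch below models that bookkeeping for a single device. Everything in it is an assumption for illustration: the `SafeModeGate` name, the idea that your server can tell user-firmware events apart from safe-mode events, and the freshness window, which you would have to choose yourself.

```cpp
#include <cstdint>

// Hypothetical server-side bookkeeping for one device.
class SafeModeGate {
public:
    explicit SafeModeGate(uint32_t windowMs) : windowMs_(windowMs) {}

    // Record a safe-mode event published by the device.
    void onSafeModeEvent(uint32_t nowMs) {
        inSafeMode_ = true;
        seenAtMs_ = nowMs;
    }

    // Record any event that only running user firmware could have published.
    void onUserFirmwareEvent() { inSafeMode_ = false; }

    // Returns true at most once per safe-mode event, and only while it is fresh.
    bool tryBeginFlash(uint32_t nowMs) {
        const bool fresh = inSafeMode_ && (nowMs - seenAtMs_ <= windowMs_);
        inSafeMode_ = false;  // a retry must wait for a new safe-mode event
        return fresh;
    }

private:
    uint32_t windowMs_;
    uint32_t seenAtMs_ = 0;
    bool inSafeMode_ = false;
};
```

With a gate like this, a cloud-side retry cannot reuse a stale confirmation. It either sees a new safe-mode event first or does not flash at all. It does not remove the race entirely, since the device can still leave safe mode between the check and the start of the transfer.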


#24

We disable updates from Particle and, I believe, use the API to send our own. Essentially, we monitor events from Particle and flash via the API once we know a device has entered safe mode (AFAIK there is no way to check that it is still in safe mode other than watching for a reboot event).

The timeout might be possible.
Our control events come from the Particle pub/sub system, and we don’t send any update until we see the spark/safe_mode event.

There might be an edge case where we retry sending from the cloud when the device isn’t in safe mode. We can patch that behavior.
Ultimately I think we’d like to move to Particle’s modern update infrastructure, but we’re trying to avoid bricking any devices while getting there. We do think this is an edge case, but even on a failed OTA, it’s worrying that a device would be unable to reset itself (maybe a flash failure? It looks like the processor isn’t running anything: no status LED at all).

It’s pretty expensive to visit a site to repair this, and we have little insight into how it could happen. 99%+ of upgrades go well, but if others have had this issue we want to try to control for it.


#25

If the CPU hangs for whatever reason, you’re going to get that behavior. If you’re not using the software watchdog, that’s definitely step number one (rescues you from many thread lock-ups for example). If you somehow mess up the OTA the device is supposed to boot into safe mode if the checksum doesn’t match.
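For anyone unfamiliar with the checksum idea, here is a generic illustration of validating a firmware image against a trailing checksum. It is not the actual Device OS check: the layout (payload followed by a 4-byte big-endian CRC-32) and the names `crc32Ieee` and `imageIsValid` are assumptions made for the sketch.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 using the reflected polynomial 0xEDB88320 (the zlib/Ethernet variant).
uint32_t crc32Ieee(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Apply the polynomial only when the low bit is set.
            const uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}

// Assumed layout: payload bytes followed by a 4-byte big-endian CRC-32 of the payload.
bool imageIsValid(const uint8_t* image, size_t len) {
    if (len < 4) {
        return false;  // too short to even hold a checksum
    }
    const size_t payloadLen = len - 4;
    const uint32_t stored = (static_cast<uint32_t>(image[payloadLen]) << 24) |
                            (static_cast<uint32_t>(image[payloadLen + 1]) << 16) |
                            (static_cast<uint32_t>(image[payloadLen + 2]) << 8) |
                            static_cast<uint32_t>(image[payloadLen + 3]);
    return crc32Ieee(image, payloadLen) == stored;
}
```

A check like this only catches a corrupted image. It will not help if the processor hangs before the check ever runs, which is why the watchdog comes first.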

I think unfortunately there is a limit to how much we can help if you are using a custom version of the Particle Cloud, since it’s a big unknown, and it seems you are using it in a slightly different manner than the current way it’s done.

My suggestion for next steps is to try to reproduce the issue in a test fixture. Have your lab device running with verbose serial output attached to a computer running particle serial monitor --follow and pipe that to a file, ideally with an on screen copy as well. Write a script that basically flashes different user firmware over and over again. Run that for a week.

If it locks up, stop the serial logging and inspect where it hung up. If it doesn’t lock up, maybe the issue is somehow isolated to safe mode / increasing system firmware version. That’s a more involved thing to script out, but you would need to flash the appropriate system firmware stuff manually to downgrade OTA (follow the instructions in the forum) and then upgrade using a scripted version of your OTA process. Maintain server logs and timestamps for that process as well as serial logs on the device. You may be able to get a serial logging verbose build of the system firmware on there in order to get logs to correlate what is happening during safe mode as well.

In either case, you should really put a cheap watchdog timer chip on your pcb if that kind of reliability is critical for you. I ended up retrofitting one for existing pcbs that hadn’t hit final assembly yet. If you don’t need it now you’ll need it for the next bug.