Electron OTA Hard Fault Requiring Power Reboot

Summary of Problem:

When flashing firmware OTA to my Electron devices, they hard fault and do not reset, requiring a power cycle. I have a mitigation strategy that has so far avoided permanent hard faults, but successful flashes are still very unreliable.

I’m looking for help in getting my OTA updates to A) flash reliably in general and B) never cause a device to go into a permanent hard fault condition.

Relevant Configuration Summary:

  • Electron System Firmware 0.6.4
  • Some affected Production Devices are remote and inaccessible
  • SYSTEM_THREAD(ENABLED)
  • SYSTEM_MODE(AUTOMATIC)
  • 3rd Party SIM with 20 second KeepAlive set
  • Device sends a packet of data every 5 seconds via MQTT

Other Configuration Details

I interface with a uSD card via SPI, an RS-482 transceiver via Rx/Tx and GPIO, and two RGB LEDs with GPIO, and otherwise have no functional interfaces with the Electron. My device is powered by a beefy regulated 5V, 3A power supply, and the LiPo battery is not installed.

What Happened:

After the outage Tuesday night I had 3 devices (using user firmware A) permanently lose connection to the cellular network. There was a firmware dependence on the Particle cloud connection to process some incoming message responses quickly enough, and I think I hit a strange edge case that I’ve since been able to fix (using user firmware B). When testing an OTA flash to a development device on my bench, the device hard faulted and could not recover until the reset button was pressed or power was cycled. This happened consistently across several devices in my tests.

What I’ve Tried So Far:

My gut thought, based on reading tangentially related community posts, was that my MQTT packets were somehow disrupting the firmware update, causing data corruption and thus the hard fault. My firmware has a cloud function that disconnects from MQTT and ceases publishing data. When I trigger that function, wait a minute, and then update, the device usually updates correctly, albeit after a while. Even then, however, I may have to try 2 or 3 times, as it still hard faults often, though in those cases it successfully resets itself back into the previous firmware.
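For illustration, that cloud function is essentially the following (a simplified sketch, not my exact firmware; the function name, the publishingPaused flag, and the broker address are placeholders, and I’m using the common community MQTT library’s API):

#include <MQTT.h>   // community MQTT library for Particle

// Incoming-message handler required by the library (contents omitted here)
void mqttCallback(char* topic, byte* payload, unsigned int length) {
}

MQTT client("broker.example.com", 1883, mqttCallback);   // placeholder broker
bool publishingPaused = false;

int pausePublishing(String arg) {
    publishingPaused = true;          // main loop checks this before any MQTT call
    if (client.isConnected()) {
        client.disconnect();          // cleanly closes the MQTT session and its TCP socket
    }
    return 0;
}

void setup() {
    Particle.function("pausePub", pausePublishing);
}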

To mitigate that at the firmware level, I created passthrough functions for all MQTT calls in my code and predicated the final library call on if (!System.updatesPending()). I know that that function only returns true if an update is already in progress, but I thought it might be helpful, since in theory I only need to stop sending data once an update has started. This has seemed more consistent at avoiding permanent hard faults on OTA updates, but the device still hard faults regularly during OTA updates and occasionally permanently. Again, if I explicitly close the MQTT connection ahead of time and wait a bit, I’m more successful on average, with no permanent hard faults yet.
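The passthrough guard is essentially just this pattern (again a simplified sketch; mqttPublish is my wrapper name, not the library call, and client is the MQTT object shown above):

// Simplified sketch of the passthrough guard -- wrapper name is a placeholder
bool mqttPublish(const char* topic, const char* payload) {
    // Skip all MQTT traffic once the system reports an update in progress
    if (System.updatesPending()) {
        return false;
    }
    return client.publish(topic, payload);
}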

I call Particle.process() in a thread that runs once a second (since my main loop can block for a while due to MQTT network-disconnection handling timeouts). In previous firmware with this same issue I called Particle.process() once in the main loop only, with the same result but slightly less responsiveness to cloud requests. All MQTT interactions are handled in the context of the main loop thread in all versions of my firmware.

My Suspicion as to the Root Cause

The problem is most likely to be that there is an assumption or dependency in the OTA process that I don’t fully understand, or some best practice that I’m missing. I would love to hear any suggestions for achieving more reliable OTA updates in general. However, it’s also possible that something else is going on with SYSTEM_THREAD(ENABLED).

I suspect that some part of the OTA update process might not be fully thread safe. Specifically, I have a hunch that the OTA update process is controlling the modem and expecting to be the only process communicating with it, so when my code calls a function that ultimately results in a command being sent to the modem, either the modem or the OTA code sees something it wasn’t expecting or has an operation fail. I can’t use a SINGLE_THREADED_BLOCK(){} around all my MQTT code because it blocks for too long (I have to be immediately responsive to incoming data via RS-482), and even if I did, I don’t think it would help if that were the case.

I suspect you are running into this bug:

Publishing in a multi-thread application running on UDP device during OTA results in hard fault


Fixed in 0.8.0-rc.2


Thanks for the reply. My MQTT library uses the TCPClient for all the network activity, so I don’t think I can put a lock on my entire process, since that could lock all Cellular Modem comm for up to the MQTT timeout period of 20 seconds, which feels dangerous to me for many other reasons. None of my code or libraries directly command the Cellular Modem.

It wasn’t clear to me precisely what the resolution to that issue was, besides the global lock.

I’m guessing part of the issue may be related to the details of TCP compared to UDP in the GitHub issue. Turning off MQTT explicitly closes the TCP socket, whereas any operation that occurs once an update has already started will either A) leave the TCP socket open or B) close the TCP socket, which (I believe) itself issues cellular modem commands that would also interfere.

Should the TCP vs UDP difference have that kind of effect, or am I barking up the wrong tree? I need TCP for MQTT and UDP isn’t currently an option.

I can put the device into an “update only” state via a cloud or MQTT command, but to me that defeats the purpose of automated OTA updating since I have to lock each device and explicitly toggle any new firmware.

Edit: for clarity, with MQTT specifically, the TCP socket is kept continually open.

Sorry, the bug title is misleading, and now that I think about it, that might not be what you’re seeing.

Are you using Cellular.command anywhere? The problem in that bug can cause a fault during OTA updates because it didn’t guard against the OTA updates using the modem at the same time.

This might not be what you’re seeing. You should be able to use TCP at the same time as an OTA update, because those accesses should be guarded correctly, however if there’s another location that’s missed you could get a fault for the same reason.

Not on the old firmware. On the firmware I’m trying to flash I use it in one place only: when resetting the modem after a failure condition that triggers my reset logic. That error information is retained and transmitted after reset, and the call sits in a single-threaded block that ends in System.reset(), so that shouldn’t be the case (code below just in case):

if (resetAfterLoop)
{
    SINGLE_THREADED_BLOCK()
    {
        // A reset was triggered somewhere, handling after operations complete
        if (alsoResetModem)
        {
            // The reset is related to cellular connectivity, so also resetting modem first
            debugPrint(MSG_TYPE_DEBUG, "Resetting Modem");

            Particle.disconnect();

            // 16: MT silent reset (with detach from network and saving of NVM parameters), with reset of the SIM card
            Cellular.command(30000, "AT+CFUN=16\r\n");

            Cellular.off();

            delay(1000);
        }

        // Now let's reset the electron

        debugPrint(MSG_TYPE_DEBUG, "Resetting Device");

        reset_handler();
        System.reset(reset_reason_code);
    }
}

Other than that, I never use Cellular.command.

However, I call Cellular.ready() once per main loop, and I also call Cellular.RSSI(), as in the code below. My understanding was that Cellular.RSSI() just uses the information from the last transmission, but could that still be crossing threads somehow? My code for that part (all called from my main loop) is below:

void updateCellularLED() {
#if Wiring_Cellular
    sig = Cellular.RSSI();
    rssi = sig.rssi;
    cell_sig = "RSSI: "+ String(sig.rssi) + "dB; Qual: " + String(sig.qual) + "/49";
    if (rssi < 0) {
        if (rssi > -80) cell_color = green;
        else if (rssi > -92) cell_color = yellow;
        else if (rssi > -104) cell_color = red;
        else cell_color = red;
    }
    led_cell.setColor(cell_color.withBrightness(default_brightness));
#endif
}


void checkConnection() {
    // Update connectivity LED and reconnect to MQTT if necessary
#if Wiring_WiFi
    if (WiFi.ready() == false)
#elif Wiring_Cellular
    if (Cellular.ready() == false)
#endif
    {
        client.isConnected();
        #if Wiring_Cellular
        led_cell.setColor(black.withBrightness(default_brightness));
        #endif
    }
    else
    {
        updateCellularLED();
        // ... (rest of function omitted)

The Cellular.RSSI() function does call the modem, sending an AT+CSQ command. It looks like it does guard access correctly with a LOCK() in 0.6.3, however.

I’m not sure it’s safe to call Particle.process() from another thread. It’s intended to be called only from the main loop thread, and it takes care of dispatching event subscription handlers, which should only be done from the main loop thread. With SYSTEM_THREAD(ENABLED), most of the cloud processing is done from the system thread.

Turns out I had already moved Particle.process back onto the main thread, so that is a non-issue.

I removed the calls to Cellular.RSSI() and Cellular.ready() and the problem still manifested itself, as expected. I’m continuing to step through things and remove them.

In the meantime, however, there hasn’t been any progress on a system event for the start of a firmware update, has there? I wish I could just listen for that event and then immediately put the device into a halt state for all user firmware.

If I had a thread called pretty often that checked for System.updatesPending() and then called a SINGLE_THREADED_BLOCK with a large delay inside it, would that allow the system firmware to still run for the OTA update, effectively pausing all user threads?

Edit: My gut says that it wouldn’t, since a SINGLE_THREADED_BLOCK is different from SYSTEM_THREAD(DISABLED), but I’m curious.
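For clarity, the idea I’m asking about would look roughly like this (a sketch only, not something I’ve verified; as noted above, since SINGLE_THREADED_BLOCK() disables task switching entirely, including to the system thread, the long delay would presumably stall the OTA download as well rather than protect it):

// Sketch of the idea only (not verified to work). Because SINGLE_THREADED_BLOCK()
// suspends task switching -- including to the system thread -- the long delay below
// would presumably stall the OTA download as well rather than letting it run undisturbed.
Thread* guardThread;

os_thread_return_t updateGuard(void* param) {
    while (true) {
        if (System.updatesPending()) {
            SINGLE_THREADED_BLOCK() {
                delay(30000);   // try to pause all user activity for 30 s
            }
        }
        delay(1000);            // check roughly once a second
    }
}

void setup() {
    guardThread = new Thread("updateGuard", updateGuard);
}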

I’ve narrowed down at least one cause. I created a new build that does the following (code below):

  1. Device configured as SYSTEM_THREAD(ENABLED)
  2. Wait for Particle Cloud connection, publish a test message, and set a keepAlive for my 3rd party SIM
  3. Connect to a TCP server (test.mosquitto.org as a test)
  4. Reconnect if TCP connection lost (don’t get here with this server in my testing)

This firmware will cause an OTA firmware update to hard fault the device every single time. Thus it appears that an open TCP connection blocks OTA updates in SYSTEM_THREAD(ENABLED) operation. Can someone from Particle confirm if that appears to be the case and provide any suggestions for how I should move forward in the medium term? For the short term I will be manually closing my TCP connection via a cloud function and then manually flashing my devices OTA, as necessary. That’s workable, but definitely not scalable for the medium/long term.

Code that causes the error:

// WARNING - this firmware will prevent subsequent OTA updates, do not flash
//           on inaccessible devices.

SYSTEM_THREAD(ENABLED);   // Enable system threading
SYSTEM_MODE(AUTOMATIC);      // Allow Particle to manage its own connectivity

STARTUP(cellular_credentials_set("hologram", "", "", NULL));

const int keepAliveInterval = 20;      // keepAlive interval in seconds

bool particleNotYetConnected = true;

byte server[] = {37,187,106,16}; // test.mosquitto.org  unencrypted MQTT server

TCPClient client;

void setup() {
}

void loop() {

    // Perform first time initialization of networking features
    if (particleNotYetConnected && Particle.connected())
    {
        particleNotYetConnected = false;
        Particle.publish("firmware test", "test1");
        Particle.keepAlive(keepAliveInterval);
    }

    // If TCP client is disconnected, reconnect
    if (Particle.connected() && !client.connected())
    {
        if (!client.connect(server, 1883))
        {
            Particle.publish("connection result", "unsuccessful");
            delay(3000);
        }
    }

    delay(100);
}

Were you able to establish when during the OTA update the hard fault happens?
If you get a few seconds into the firmware download phase before the fault, you could use System.updatesPending() or the respective system event to trigger a TCPClient.stop() automatically.
You could also use System.disableUpdates()/System.enableUpdates() to only allow updates when it's safe to do so.
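For instance, something along these lines (an untested sketch; client stands for your TCPClient, and isSafeForUpdate() is a placeholder for whatever "safe to update" check makes sense in your firmware):

TCPClient client;

// Placeholder check -- here simply "the TCP socket is closed"
bool isSafeForUpdate() {
    return !client.connected();
}

void fwUpdateHandler(system_event_t event, int param) {
    // Close the open TCP socket as soon as an update is signalled
    if (event == firmware_update_pending || param == firmware_update_begin) {
        client.stop();
    }
}

void setup() {
    System.on(firmware_update_pending, fwUpdateHandler);
    System.on(firmware_update, fwUpdateHandler);
    System.disableUpdates();            // block OTA until the firmware says it's safe
}

void loop() {
    if (isSafeForUpdate()) {
        System.enableUpdates();
    } else {
        System.disableUpdates();
    }
}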
Could you also check how many network sockets are available/free before the crash happens?


Yeah, I previously tried calling System.updatesPending() in all my MQTT function calls (which use the TCP client). One of those gets called 4–5 times per main loop iteration. If updatesPending() returned true, I called client.stop() immediately. It did not make any significant change to the issues I was having. I can try the system event, but in what thread context is that event handler called? I’m concerned that if I stop the TCP client between when my MQTT library checks for an open connection and when it performs a TCP operation, I might get unpredictable results.

I would have expected calling client.stop() before the hard fault condition manifested itself to have fixed the problem, but given that it didn’t appear to, I wonder if the TCP connection, while open, somehow alters or adds to the incoming firmware data without the device knowing until it checks at the end.

The timing of the condition was that the device would go into update mode for at least a couple seconds, occasionally more, before hard faulting.

How can I check the number of available/free network sockets? And I’m not sure how to be any more specific about the timing of the hard fault without setting up a more complex hardware rig I can scope over a few seconds.

Any further thoughts on this? Since such a large part of the Particle Device Cloud’s value is in OTA updates and the ability to manage devices in a product, I’m concerned about the stability of that feature. I’d really like to avoid having to do a bunch of tests to gain confidence in a particular OTA workaround. I can absolutely continue to help figure out the root cause, but I’m still not fully acquainted with the system firmware, so I’ll probably need some additional pointers on what to dig around for.

Maybe @rickkas7 can jump in here.

There may in fact be a bug in system firmware, but I’m not sure.

I set up a minimal test case for this. I have a server that echoes data and an Electron app that connects to it, sends 1 byte per second, and receives whatever is sent back.

I tested with both 0.6.4 and 0.7.0, and I had a hard time getting the device to receive an OTA update. Most of the time it would time out. It also was unable to receive a ping from the console.

I tested a version that waited 15 seconds before connecting by TCP to the server and was able to OTA flash in that interval, but once after blinking magenta, it seemed to lock up, with the LED off. Resetting solved the problem.

Then I modified the code to not use SYSTEM_THREAD(ENABLED). It worked flawlessly in non-threaded mode. I was able to receive OTA updates and the reboot afterwards worked perfectly even while connected by TCP.

Here’s the code I used. It includes both the Electron and server (node.js) code.

Though the interesting thing is that with packet level debugging turned on

SerialDebugOutput debugOutput(9600, ALL_LEVEL);

I’m not actually seeing packets from the cloud trying to start the OTA flash. That kind of surprised me. That may or may not be significant.

The next step would probably be to compare the ALL_LEVEL logs from threaded and non-threaded mode to look for differences, but I ran out of time today. I figured you’d want the intermediate results.


Yeah, that last point about not seeing the packets for starting the OTA flash feels very relevant to me. In all my testing, if I flashed OTA from the Desktop IDE, the IDE would try and then report a timeout, and then, remarkably consistently, the flash-start event would appear in the Particle console for that device immediately after the timeout. That said, even once I had explicitly closed the TCP connection, I often saw that timeout behavior even for successful flashes. So it could be a separate issue entirely, or theoretically a trigger for the panic behavior.

Thanks for looking into it / the update!

Hey!

I might be hitting this problem trying to upgrade from 0.5.3 to 0.6.4 (the ultimate target is 1.4.1).
We use SYSTEM_THREAD(ENABLED) but we also use custom upgrade logic.
When our cloud tells us a firmware update is coming we swap into Safe Mode.

My understanding was that Safe Mode doesn’t run the user app, but a field unit has locked up in this way (dark status LED after OTA, physical reset fixes the issue). We can’t get to many of our sites.

Should we be flashing Tinker before upgrading?

The safest approach would be particle update -v via the CLI (in DFU Mode).

Assuming I have no physical access to the device, is Safe Mode an acceptable remote equivalent?
Essentially: how do I prevent the user app from running, or UDP/TCP messages from being sent by the Device OS or user app during an upgrade, so I avoid remotely bricking my devices?

We’re in a bit of a catch-22: we can’t upgrade the device without some guarantee of safety, and we can’t guarantee safety without upgrading the device. What is the recommended process for OTAing a device?

Right now, our cloud waits for a spark/safe_mode message before flashing each part of the firmware (4 for 0.6.4+).

So I think this particular issue ended up being a thread lockup with the modem during TCP calls, combined with the fact that the system firmware flags for pending updates don’t get set until the modem is already in use for the update.

I currently use SYSTEM_THREAD(ENABLED) on hundreds of devices and now have zero permanent issues with OTA (over the past ~2–3k OTA updates I’ve run). However, I also use a hardware watchdog on the reset pin. There are always going to be some things that fail, so a watchdog is pretty much the ONLY way to guarantee reliability. That said, I think my code has been strongly reliable, and I’m happy to give advice on improving your code.
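As a rough illustration of the watchdog side (the pin and the pulse scheme are just placeholders; the details depend on the watchdog IC you choose):

// Rough example of petting an external hardware watchdog wired to the Electron's RST pin.
// D6 and the pulse-on-every-loop scheme are placeholders -- use whatever your watchdog IC expects.
const int WDT_PET_PIN = D6;

void setup() {
    pinMode(WDT_PET_PIN, OUTPUT);
}

void loop() {
    // Pet from the main loop so a hung loop (or a hard fault) stops the petting
    // and the watchdog pulls RST after its timeout expires.
    digitalWrite(WDT_PET_PIN, HIGH);
    digitalWrite(WDT_PET_PIN, LOW);

    // ... rest of the application ...
}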

The problem with swapping into safe mode is that if something gets interrupted you might boot out of safe mode and be in a problem state. Generally better to shut things off and manage priorities in your firmware itself.

Can you share the parts of your code that pertain to TCP/UDP and OTA updates? We won’t be able to really give suggestions without that. The issue is almost always addressable within user firmware.

I’m unable to share the code I’m afraid.

We use particle pub/sub rather than opening raw TCP/UDP connections.
We encountered this state a few weeks ago during upgrade and had to send someone to a remote location in a different state.

Is the watchdog you use the IWDG from the STM32? It might be possible for us to set that up, performing a system reset if the watchdog isn’t petted in time.

Right now we do lock down our app when we know firmware is coming. We perform any shutdown/stop operations we require and then enter Safe Mode.