Summary of Problem:
When I flash firmware OTA to my Electron devices, they hard fault and do not reset, so they need a power cycle. My mitigation strategy has so far avoided permanent hard faults, but flashes still succeed only unreliably.
I’m looking for help in getting my OTA updates to A) flash reliably in general and B) never cause a device to go into a permanent hard fault condition.
Relevant Configuration Summary:
- Electron System Firmware 0.6.4
- Some affected Production Devices are remote and inaccessible
- SYSTEM_THREAD(ENABLED)
- SYSTEM_MODE(AUTOMATIC)
- 3rd Party SIM with 20 second KeepAlive set
- Device sends a packet of data every 5 seconds via MQTT
Other Configuration Details:
I interface with a uSD card via SPI, an RS-482 transceiver via Rx/Tx and GPIO, and two RGB LEDs with GPIO, and otherwise have no functional interfaces with the Electron. My device is powered by a beefy regulated 5V, 3A power supply, and the LiPo battery is not installed.
What Happened:
After the outage Tuesday night, I had 3 devices (using user firmware A) permanently lose connection to the cellular network. The firmware depended on the Particle cloud connection to process some incoming message responses quickly enough, and I think I hit a strange edge case that I’ve since been able to fix (using user firmware B). When I tested flashing OTA to a development device on my bench, the device hard faulted and could not recover until I pressed the reset button or cycled power. This happened consistently across several devices in my tests.
What I’ve Tried So Far:
My gut feeling, based on reading tangentially related community posts, was that my MQTT packets were somehow disrupting the firmware update, causing data corruption and thus the hard fault. My firmware has a cloud function that disconnects from MQTT and stops publishing data. When I trigger that function, wait a minute, and then update, the device usually updates correctly, albeit after a while. Even then, I may have to try 2 or 3 times, because it still hard faults often, though in those cases it successfully resets itself back into the previous firmware.
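For reference, here is a minimal sketch of how a pause function like that can be structured. It is not my exact code: the class name MqttPauseFlag and its methods are hypothetical, and it assumes the cloud function handler may run on a different thread than the main loop, which owns the MQTT client. The handler only records the request, and the main loop checks the flag before touching MQTT.

```cpp
#include <atomic>
#include <string>

// Hypothetical helper, for illustration only.
// The cloud function handler just sets a flag; the main loop (the only
// code that talks to the MQTT client) reads it and disconnects.
class MqttPauseFlag {
public:
    // Same shape as a cloud function handler: string argument, int result.
    // On the device the argument type would be String instead of std::string.
    int request(const std::string& /*arg*/) {
        paused_.store(true);
        return 0;  // 0 reported back to the caller as "accepted"
    }

    // Called when normal publishing should start again.
    void resume() { paused_.store(false); }

    // Polled from the main loop before any MQTT call.
    bool paused() const { return paused_.load(); }

private:
    std::atomic<bool> paused_{false};
};
```

In the main loop, the idea is to check paused() at the top of each pass, disconnect once if it is set, and skip all publishing until resume() is called.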
To mitigate that at the firmware level, I created passthrough functions for all MQTT calls in my code, and predicated the final function call on if (!System.updatesPending()). I know that that function only returns true if an update is already in progress, but I thought it might be helpful, since in theory I should only need to stop sending data once the update has started. This seems to avoid permanent hard faults more consistently, but the device still hard faults regularly during OTA updates, and occasionally permanently. Again, if I explicitly close the MQTT connection ahead of time and wait a bit, I’m more successful on average, with no permanent hard faults yet.
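To make the passthrough idea concrete, here is a rough sketch of the pattern, not my actual firmware. GatedPublisher and its callback types are hypothetical names. The update check is injected as a callback so the sketch compiles off-device; on the Electron it would simply return System.updatesPending(), and the publish callback would wrap whatever MQTT client is in use.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical passthrough wrapper, for illustration only.
// Every MQTT publish goes through publish(), which forwards the call
// only while the injected check reports that no update is pending.
class GatedPublisher {
public:
    using UpdateCheck = std::function<bool()>;
    using PublishFn =
        std::function<bool(const std::string& topic, const std::string& payload)>;

    GatedPublisher(UpdateCheck updatePending, PublishFn publish)
        : updatePending_(std::move(updatePending)), publish_(std::move(publish)) {
        assert(updatePending_ && publish_);  // both callbacks are required
    }

    // Returns true only if the message was actually handed to the client.
    bool publish(const std::string& topic, const std::string& payload) {
        if (updatePending_()) {
            ++skipped_;  // dropped because an update is flagged
            return false;
        }
        return publish_(topic, payload);
    }

    // Number of publishes dropped because an update was flagged.
    unsigned skipped() const { return skipped_; }

private:
    UpdateCheck updatePending_;
    PublishFn publish_;
    unsigned skipped_ = 0;
};
```

On the device, the first constructor argument would be something like a lambda returning System.updatesPending(). Note that this only stops new publishes; it does not close an MQTT connection that is already open.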
I call Particle.process() in a thread that runs once a second (since my main loop can block for a while due to network disconnection handling timeouts with MQTT). In previous firmware with this same issue, I called Particle.process() once in the main loop only, with the same result but slightly less responsiveness to cloud requests. All MQTT interactions are handled in the context of the main loop thread in all versions of my firmware.
My Suspicion as to the Root Cause:
Most likely there is an assumption or dependency in the OTA process that I don’t fully understand, or some best practice that I’m missing. I would love to hear any suggestions for achieving more reliable OTA updates in general. However, it’s also possible that something else is going on with SYSTEM_THREAD(ENABLED).
I suspect that some part of the OTA update process might not be fully thread safe. Specifically, I have a hunch that the OTA update process controls the modem and expects to be the only process communicating with it, so when my code calls a function that ultimately sends a command to the modem, either the modem or the OTA code sees something it wasn’t expecting or has an operation fail. I can’t use a SINGLE_THREADED_BLOCK() {} block for all my MQTT code because it would block for too long (I have to be immediately responsive to incoming data via RS-482), and even if I did, I don’t think it would help if that is the cause.