Electron OTA Flash Times Out, Even in Safe Mode

Hi all,
We have been working through a persistent OTA flash issue with some Electrons over the past couple of days. We've looked through a lot of the topics here, and most suggest that issues with user code can be solved by putting the device into Safe Mode and then flashing OTA. In our case, we are still not able to flash OTA successfully in Safe Mode, or when running Tinker, unless the app we are flashing is Tinker itself, which suggests some kind of problem with the incoming firmware binary rather than the one already running on the Electron.

Failure looks like this: we flash from the CLI and get back "Flash Device OK: Update Started", the device blinks magenta for a few moments, then hangs with the LED either solid magenta or off for about 2 minutes, then resets and runs the old user firmware. After the reset it publishes a "spark/device/last_reset" event with "update_timeout".

More details, same for both devices we have tried:
Electron 2G - running 0.6.4 - using third party SIM - part of a Product on the console
SYSTEM_MODE(MANUAL) - Located in Nigeria (though we have another product that flashes OTA just fine here, different code but same Electron models, same system fw, same 3rd party SIM)

Flashing orders we have tried:
Running User FW -> Flash User FW -> Fails
Running User FW -> Enter Safe Mode -> Flash User FW -> Fails
Running User FW -> Enter Safe Mode -> Flash Tinker -> Succeeds
Running Tinker -> Flash User FW -> Fails
Running Tinker -> Flash Tinker -> Succeeds

During a couple of our attempts when flashing directly to our user firmware, we caught this warning in the Serial log outputs (repeated 7 times):
0000574030 [comm] WARN: handle received message failed - aborting transfer

We've tried all of these with SYSTEM_THREAD(ENABLED) and without.
We've tried catching the device in a tight loop with Particle.process() as suggested in this post: Particle-CLI OTA Electron flashing timed out - #6 by ScruffR
And we've also tried using the firmware_update system event callback to set a flag fwUpdating = true and then placed this at the start of loop():
if (fwUpdating) { Particle.process(); return; }
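
Concretely, the pattern we tried looks roughly like this (fwUpdateEvent is just our name for the handler; firmware_update and its begin/complete/failed parameters come from the system events API):

volatile bool fwUpdating = false;

void fwUpdateEvent(system_event_t event, int param) {
    // Track whether an OTA transfer is currently in progress
    if (param == firmware_update_begin) fwUpdating = true;
    if (param == firmware_update_complete || param == firmware_update_failed) fwUpdating = false;
}

void setup() {
    System.on(firmware_update, fwUpdateEvent);
    // ... rest of our setup
}

void loop() {
    if (fwUpdating) { Particle.process(); return; }
    // ... normal application code
}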

We also ran 'particle serial inspect' (suggested here: Flash via Serial or OTA fails but OK via DFU - #20 by ScruffR) and received the output posted below. It looks like our dependencies are OK; does anyone have insight into whether the specific versions of either the bootloader or our user module could be problematic?

Any advice/ideas are extremely appreciated, we're running out of ideas for how to fix this!

particle serial inspect output:

Platform: 10 - Electron
Modules
Bootloader module #0 - version 11, main location, 16384 bytes max size
Integrity: PASS
Address Range: PASS
Platform: PASS
Dependencies: PASS
System module #1 - version 110, main location, 131072 bytes max size
Integrity: PASS
Address Range: PASS
Platform: PASS
Dependencies: PASS
System module #3 - version 110
System module #2 - version 110, main location, 131072 bytes max size
Integrity: PASS
Address Range: PASS
Platform: PASS
Dependencies: PASS
System module #1 - version 110
System module #3 - version 110, main location, 131072 bytes max size
Integrity: PASS
Address Range: PASS
Platform: PASS
Dependencies: PASS
User module #1 - version 4, main location, 131072 bytes max size
UUID: E09E5577FBCF6BA881E85FD5660CB694A8F8C35121016E00A024027957201CF5
Integrity: PASS
Address Range: PASS
Platform: PASS
Dependencies: PASS
System module #2 - version 110
empty - factory location, 131072 bytes max size

If you are running a 3rd party SIM you need to have the correct Particle.keepAlive() set in your code, and due to this bug it has to happen after the connection gets established and each time after the connection is lost.

So Tinker and Safe Mode won’t help in that case.
You need to have a firmware running that does not interfere with the cloud but also sets the keep alive whenever needed.

Thanks for the quick reply. We are setting the keepAlive once after our first connection in setup, and then we set it again every time we reconnect to the cloud, when the cloud-connected system callback fires.

This works well on our other product we have with the same SIM and keepAlive settings in Nigeria, and since we are able to flash Tinker once the device is in Safe Mode, it does seem like we are still maintaining a good cloud connection on the device.

Have you ever seen cases where the size or other characteristics of the incoming binary affected whether the flash succeeds? For example, our binary is about 45 kB, compared to about 9 kB for Tinker.

When the connection is brittle or extremely slow OTA updates tend to fail.
With a tiny FW like Tinker the OTA is done and dusted a lot quicker than with bigger binaries and hence the stability/speed of the connection plays less of a role.

But 45KB doesn’t seem huge.

Ok, that’s good to know. We are using 2G Electrons so I could see how that could be a concern, but I just checked our other product binary and it is actually larger (60kB), and flashes fine with both sitting next to each other, so I think the connection should be pretty stable.
When I press MODE I get back 4 green flashes, which I believe should also indicate a pretty good cellular signal, though I recognize that’s a crude indicator.
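
(As a side note, a slightly less crude check we could script is Cellular.RSSI(); here's a minimal sketch of that, where the 10 second log interval is just an arbitrary choice of ours:)

SerialLogHandler logHandler;

void setup() {
}

void loop() {
    // rssi is in dBm (negative); qual is the raw quality figure reported by the modem
    CellularSignal sig = Cellular.RSSI();
    Log.info("RSSI: %d dBm, qual: %d", sig.rssi, sig.qual);
    delay(10000);
}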

Given that we always see the device stop blinking magenta, and either freeze on solid magenta or with the LED off, it seems like there is some kind of hang happening, rather than the update just taking too long.
And would a slow connection be expected to result in the serial warning we saw?
0000574030 [comm] WARN: handle received message failed - aborting transfer

Have you got the option to flash some diagnostic code?
When you can flash a simple sketch (SYSTEM_MODE(AUTOMATIC) and without SYSTEM_THREAD(ENABLED)) with SerialLogHandler traceLogger(LOG_LEVEL_TRACE); you should be able to see the output of the OTA download and the first few flashing tasks.
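For instance, something as bare as this should do (nothing but the log handler, default AUTOMATIC mode, no SYSTEM_THREAD):

SerialLogHandler traceLogger(LOG_LEVEL_TRACE);

void setup() {
}

void loop() {
}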
If this doesn’t reveal anything you could use this

If you are running a 3rd party SIM you need to have the correct Particle.keepAlive() set in your code, and due to this bug it has to happen after the connection gets established and each time after the connection is lost.

@ScruffR I hadn't realized that keepAlive would need to be set again after reconnection events beyond the first time. How do you normally handle that? I have been using AUTOMATIC mode, so I might not always catch a brief disconnection explicitly (the main loop can take 60 seconds to run during a disconnection due to TCPClient connect). I can do my best to track the last connection state, but is there a best practice? Or am I going to want to pursue MANUAL mode?

You can subscribe to the cloud connection status event

But for that SYSTEM_THREAD(ENABLED) should be chosen.

And to get the TCP issue sorted, you should definitely test 0.8.0-rc.3 and report back whether all “fixes” incorporated in there change a thing for you or not.

@justicefreed_amper I use that cloud connection status event to set the keepAlive each time we reconnect to the cloud. I just check in the callback whether it is the cloud_status_connected event and reset the value if so. It has been working well for me. This is what my handler function looks like; I don't actually always use SYSTEM_THREAD(ENABLED) with it, but it still seems to work for me:

void cloudConnectionEvent(system_event_t event, int param) {
   if (param == cloud_status_connected) {
     Log.trace("Cloud Just Connected, setting keepAlive.");
     Particle.keepAlive(PARTICLE_KEEP_ALIVE);
   }
} 
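
For completeness, the handler is registered once in setup(); roughly what that looks like on our end (PARTICLE_KEEP_ALIVE is just our own constant for the keep-alive interval in seconds):

// Keep-alive interval for our 3rd party SIM, in seconds - tune this for your APN
#define PARTICLE_KEEP_ALIVE 30

void setup() {
    // Fire cloudConnectionEvent() on every cloud status change, including reconnects
    System.on(cloud_status, cloudConnectionEvent);
    Particle.connect();
}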

@ScruffR I was able to try a flash with the LOG_LEVEL_TRACE set to capture the system logs. Here’s the tail end of that output:

...
0000729540 [system] TRACE: received 36
0000729540 [comm] TRACE: update done received
0000729541 [comm.coap] TRACE: sending message id=88e6
0000729541 [system] TRACE: send 34
socketSendTo(0,54.158.29.30,5684,,34)
   729.541 AT send      35 "AT+USOST=0,\"54.158.29.30\",5684,34\r\n"
   729.592 AT read  >    3 "\r\n@"
   729.642 AT send      34 "\x17\xfe\xfd\x00\x01\x00\x00\x00\x00\x00\x1a\x00\x15\x00\x01\x00\x00\x00\x00\x00\x1a\xc1Wb\xc9p\xbaZ\x9e\x96\x18fJ\xdf"
   729.683 AT read  +   16 "\r\n+USOST: 0,34\r\n"
   729.693 AT read OK    6 "\r\nOK\r\n"
0000729694 [comm] TRACE: update done - missing chunks starting at 75
0000729694 [comm] TRACE: Sent 2 missing chunks
0000729695 [comm.coap] TRACE: sending message id=b1 synchronously
0000729695 [comm.coap] TRACE: sending message id=b1
0000729696 [system] TRACE: send 40
socketSendTo(0,54.158.29.30,5684,,40)
   729.695 AT send      35 "AT+USOST=0,\"54.158.29.30\",5684,40\r\n"
   729.746 AT read  >    3 "\r\n@"
   729.796 AT send      40 "\x17\xfe\xfd\x00\x01\x00\x00\x00\x00\x00\x1b\x00\x1b\x00\x01\x00\x00\x00\x00\x00\x1b\xc0n\xb2\x88\xd5\xe1N\xce\xf4\xf4\xf6A\x96\xc1oO\xeasX"
   729.837 AT read  +   16 "\r\n+USOST: 0,40\r\n"
   729.847 AT read OK    6 "\r\nOK\r\n"
0000735008 [system] TRACE: send 40
socketSendTo(0,54.158.29.30,5684,,40)
   735.007 AT send      35 "AT+USOST=0,\"54.158.29.30\",5684,40\r\n"
   735.058 AT read  >    3 "\r\n@"
   735.108 AT send      40 "\x17\xfe\xfd\x00\x01\x00\x00\x00\x00\x00\x1c\x00\x1b\x00\x01\x00\x00\x00\x00\x00\x1c\x14\x9f\xe0\xc3\xf7p^P\xb7\xe1.\xffJ\x82R@\x88in"
   735.149 AT read  +   16 "\r\n+USOST: 0,40\r\n"
   735.159 AT read OK    6 "\r\nOK\r\n"
0000744230 [system] TRACE: send 40
socketSendTo(0,54.158.29.30,5684,,40)
   744.229 AT send      35 "AT+USOST=0,\"54.158.29.30\",5684,40\r\n"
   744.280 AT read  >    3 "\r\n@"
   744.330 AT send      40 "\x17\xfe\xfd\x00\x01\x00\x00\x00\x00\x00\x1d\x00\x1b\x00\x01\x00\x00\x00\x00\x00\x1d\x0e\x88\xb2\xd6l\xbc\x8d%\xe3{E\xf9\xb6CCN\xa1/@"
   744.371 AT read  +   16 "\r\n+USOST: 0,40\r\n"
   744.381 AT read OK    6 "\r\nOK\r\n"
0000765422 [system] TRACE: send 40
socketSendTo(0,54.158.29.30,5684,,40)
   765.421 AT send      35 "AT+USOST=0,\"54.158.29.30\",5684,40\r\n"
   765.772 AT read  >    3 "\r\n@"
   765.822 AT send      40 "\x17\xfe\xfd\x00\x01\x00\x00\x00\x00\x00\x1e\x00\x1b\x00\x01\x00\x00\x00\x00\x00\x1eB\xfd\xe7\xee\x94\xad\xf8\xb8\x04x\x8e\xcbmU*\x12\n\xaaA"
   765.863 AT read  +   16 "\r\n+USOST: 0,40\r\n"
   765.873 AT read OK    6 "\r\nOK\r\n"
0000798674 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 0
0000798674 [comm.protocol] INFO: rcv'd message type=6
0000798675 [comm] WARN: handle received message failed - aborting transfer
0000798675 [system] INFO: Send spark/device/ota_result event
{"r":"error"}
0000799676 [comm] WARN: handle received message failed - aborting transfer
0000799676 [system] INFO: Send spark/device/ota_result event
{"r":"error"}
0000800677 [comm] WARN: handle received message failed - aborting transfer
0000800677 [system] INFO: Send spark/device/ota_result event
{"r":"error"}
0000801678 [comm] WARN: handle received message failed - aborting transfer
0000801678 [system] INFO: Send spark/device/ota_result event
{"r":"error"}
0000802678 [comm] WARN: handle received message failed - aborting transfer
0000802678 [system] INFO: Send spark/device/ota_result event
{"r":"error"}
0000803679 [comm] WARN: handle received message failed - aborting transfer
0000803679 [system] INFO: Send spark/device/ota_result event
{"r":"error"}
0000804679 [comm] WARN: handle received message failed - aborting transfer
0000804679 [system] INFO: Send spark/device/ota_result event
{"r":"error"}

I can post the full flash sequence if that would be helpful, but it's quite a lot of lines. I took this section from the end because, aside from all the warnings at the very end, one line that stood out to me was:
0000729694 [comm] TRACE: update done - missing chunks starting at 75

I should also note that this time the device hung for a bit with the LED off, but then hard faulted before resetting.

Could it be that some sections of the binary are not getting written, causing the device to crash when it tries to load in the new firmware? Do you get any other hints from this?
