OTA updates failing on 2G Electron with 3rd Party SIM

Hello!

I have a 2G Electron (SARA G350) that has stopped being able to complete OTA updates.

We used to be able to flash code OTA with this same 3rd party SIM (and same application code) as of about 2 months ago. Since trying it at the beginning of this week, it has not worked once.

Here are some specifics

Device system firmware: 0.6.4 # Although we have also tested with 0.8.0-rc.4
Electron: 2G Sara-G350
SYSTEM_THREAD: not enabled # although we have tried both flavors
SYSTEM_MODE: MANUAL

Other maybe relevant info

  • We are mainly doing work in Africa, but are in the United States (Bay Area CA) currently, where we are experiencing this issue.
  • We’ve been able to flash the exact same application code in the past; this has stopped working recently
    • Unfortunately, the exact time this stopped working is hard to say. All I can say for sure is that OTA flash updating was working before April 10th, and is not working as of June 25th
  • Tiny code-bases (for example, tinker but changed for our 3rd party APN – as found in the example here: 3rd party SIM tips – successfully update sometimes, but fail other times (I didn’t test this extremely systematically, i.e. I don’t remember the device system FW version, but we saw both successes and failures)

Configurations

We have 2 different 3rd party SIMs on hand, from GeoSIM and eseye. We tested each of these, as well as a Particle SIM card.

We have tried this with all combinations of:

  • SYSTEM_THREAD(ENABLED) and not
  • 0.6.4 and 0.8.0-rc.4 (some testing with 0.7.0 but not systematically so I won’t report on that)
  • Particle, geosim, and eseye SIM cards.

Results

  • Every iteration with the Particle SIM flashed new code successfully (tested by watching output, and also by using a changed static Particle.variable to ensure new code was running). So, with all threading and FW version combos, particle SIM worked every time.
  • Every iteration with both the GeoSIM and the eseye SIM failed to flash.

Failed to flash

When I say failed to flash, it always starts successfully, but then does not complete.

output from device subscribe either looks like this:

{"name":"spark/flash/status","data":"started ","ttl":60,"published_at":"2018-06-28T22:46:56.830Z","coreid":"123"}
{"name":"spark/device/last_reset","data":"update_timeout","ttl":60,"published_at":"2018-06-28T22:48:44.513Z","coreid":"123"}

…or, it simply prints the first “started” event, but then never prints another update-related event.

Serial output

I logged output from every single iteration of this, but will just post 2 failure cases and 1 success case. The output (as far as I can tell, and I spent a lot of time looking through these logs) basically falls into one of 2 camps. I am only printing what I think is relevant (but let me know if more info would be useful vs just noisy).

Failure 1 - Single thread

One note: When using SYSTEM_THREAD(DISABLED); (not including anything related to system threading), it always stops downloading at the chunk with idx=15.

... [ update starts, and starts receiving chunks ]
...
0000139911 [comm] TRACE: chunk
0000139912 [comm] TRACE: chunk idx=14 crc=1 fast=1 updating=1
0000139914 [comm.protocol] INFO: rcv'd message type=7
   139.913 AT read  +   18 "\r\n+UUSORF: 0,557\r\n"
Socket 0: handle 0 has 557 bytes pending
   139.924 AT send      16 "AT+USORF=0,813\r\n"
   140.024 AT read  +  597 "\r\n+USORF: 0,\"54.152.148.189\",5684,557,\"\x17\xfe\xfd\x00\x01\x00\x00\x00\x00\x00w\x02 \x00\x01\x00\x00\x00\x00\x00w\x10\"H\xf1\x97\xad\xbc\xe3\xa8\x95 \xd2\x9d#4\x87\x95V{Y\xee)[.\xdd|1%\xfd\xf5~\xc9=rP\xf3AH=\xb8]M\xa2f\xbb'\xd2UX\xe3\x92\xcd\xe5\xb4\x014\v*$:\x11E\xdf\xd1\xf1\x13\xc3\xab\xd2\xbcF\xb4L\xd1\xf7\x12\xc8\xa4^Qv6\x1eZ\x1a\xa7\xfdl3\x93\xa0^\xf4\xa6\x8d\x958\xa2d\x8e\x9aI\x81\x82\xd9(\x90\xc5P\xb0l \xb6\xbe\xc1\"\xcb\xad\x04\x81\xf2\xa8g\r\xcb\xc7\xd1\xf1\xfa\xe3\xd0\xed\xa4-\x8c\xd0\xcbQ\x03\xba\x05\xd6\xa9\x98\x83f*\xa5\xde\xbb\xb8\x16\xc1e\x04s\xaa\xde\x80\x90\x7fo6\xe9\xdf\x16\x95\xa8>\x05W9\xdb\x13\xc9\xa7\xbd\xa3\xed\xffaQ\xb6\x1bb\x8ey\x83\x91{\xc0\xc3\xb3\x7f\x00(X\x99\x9a)4:\b^\x909\xe9\xd2\"\xe6'\xc2!\xc1Nc\x06\xa4\x87\xc3I\x87\xe0@I\xa5\xa6\xe6\x02\xec\x83\xaf\xa5v\x1a\xee\x8d\xdbM\x91\x153g\xd7\x8c\xeer\xb9\xf0\xca1u&\xa4\xe3\xe5\xb9o\xcd\xd4B\x94\x91\x11\xf4\x94\xff@\xcb@XRQ\xb3\x8f*\x05vL\xc8z\xd9\xa2X~z\x8d\x87\xe96\xee(2\xe8ny\x0e}yd|\xe1V\xe7G\x1d\v\x0eU,\x86,f\xbf|\xbcu\x90S\x03\xed\x86\x12&\xc1\x1f\x00\x00v\x83\xbd\xcf\x97v\x1eG\x02Y\"\xc5\xee\x93\xc3:|\x96\x1e|\xd1\xc7\xd2z\xf1\xa2\xe2R=6\xfa\xcbD\xef\x98\x97Y\x02\xc9<\xc2J\\<\x80\x0eE\xda\x9d)\xe4Z]J(\x8fkf\xb1CJ\r\x15'\xee\xf9\x111-8\x7fH\xd3J}YH\xbe\x00\xd7$\xb1\xc4F\xdfj\xd8\xdag\x050\b\xa1nQ\x0eh\x11\x9cz3,I8\x02o1\xe4\xeb\xba\xd5\xa7\xf4h\xbd\xa6\x1f\xe1?#\x924\x19\xa5\b\xa4K6\xda\x8e\xcf\x9c-gB\x05z\a\xd2\xe9\x92\x9e\xa1\nU\x94\t\x01\x8e\x8d\xc3.\xe6\xcb\x18Lw\xea\x9f\x8fe\xc8Tj\xcf[\xd9Q\xd6#OQ\x80\xe8\xe6\v\xbb\xcb\x93!\xb6\xfd\x1c\x15\x04Z\x18_\xd05Q\xf4\x13\xe1\x8c6W\x04\xbf\xcc\x10\xe3\xabG\xf4\xcc\xbc\v\""
   140.054 AT read UNK   2 "\r\n"
   140.064 AT read OK    6 "\r\nOK\r\n"
0000140065 [system] TRACE: received 557
0000140066 [comm] TRACE: chunk
0000140067 [comm] TRACE: chunk idx=15 crc=1 fast=1 updating=1
0000140069 [comm.protocol] INFO: rcv'd message type=7

[ There is no more output. It simply hangs after idx=15 ]
Failure 2: System Threading ENABLED

This failure is similar, with 2 differences:

  • when threaded, the application output continues to print out after the chunk receiving stops
  • where the idx stops:
    • when threaded in 0.6.4, it always stops at chunk idx=15
    • when threaded in 0.8.0-rc.4, it always gets to idx=15, then jumps to some non-sequential index, at which point it stops (such as chunk idx=31 in the log output below)
... [ update starts, we download some chunks ]
...
0000709960 [system] TRACE: received 557
0000709962 [comm] TRACE: chunk
0000709964 [comm] TRACE: chunk idx=15 crc=1 fast=1 updating=1
0000709968 [comm.protocol] INFO: rcv'd message type=7
   710.066 AT read  +   18 "\r\n+UUSORF: 0,557\r\n"
Socket 0: handle 0 has 557 bytes pending
   710.078 AT send      17 "AT+USORF=0,1024\r\n"
   710.178 AT read  +  596 "\r\n+USORF: 0,\"54.205.74.202\",5684,557,\"\x17\xfe\xfd\x00\x01\x00\x00\x00\x00\x003\x02 \x00\x01\x00\x00\x00\x00\x003\xdcw\xa8)\x1b\xde\xff\xa81\xb1b\xf6\xb6?\xe4U\xf2^\xcd\xa6g2qA\nB6u\xben\xc9\x9e\xc5/K\xd2\xcc\xddA\x8d\xaeb\xfd\x80\xad\xe2h|\xdb)\xaa\xa2f\xae\x10\xd9H9\xd5vh\xe5\xfa\xb2m\xedE\v\xaec+w\xf47\xbf\x93\x16\\&\a1\xc4\xbb\xecv\xe9\xd1\x93\x82M\xf2Z\xe2\x86\xb0m|>O\xf3(\xdc:\x1db8S\xbe\x98([0\xa3\xe6\x91\xae\xb8\x1d%\xca\xe0\x93\xf47\x99Wt\x8e\xb1\x8f\x7fR\b\xe4\xb31\xbbDF\x9e\x1d#\x19\x9a\x9a\x1e\xd9\x004E\x89\xfe\x84#\xbd\xe4\x16\xccw\r\x04\x86\xc3Z\x85R\x14/\xd4\xbe|\xac=\xf0\xe4Z\x91e\xe5+\x13\xe0\xd6\xe9\xc3\xbe\xad\xc7\xe3\"\xcd\x95b\v\x7f-w-\x84\xa33P(\x96Eo:,\x94\xdeD[g#\xb1\xc0\xc2}a\xe8\xb6=\xb9\xd3\xa6\xc5\xc5\x89\a)\x87\x85\xe9\xa4h\xc4v\x81q'\xd2\xcf\xd1tF`NS\x9aEMV\xd0\x9a\x11;\xa7\x92L8 \x82\xb7\x157\n8\xcdtS\x16\xe5\xf7\xad\xda\xfd\x17\xe5\xa8G\xf3:\xed\tU<\xff\x1b\xa7\xb6k]\ft\xe7k\xaff}\xd8|\x03'\"\x93\x11\xe4\xd2\xc4\xf4\xf9V\x10vl1p\x8e\x0f\x00\xf4\x80H\xa3\xec\x8f\x8by\x13~\xd4\x8d\xaa\xf7:\x005\x8e\xd3\xbf\xcfIz\xd8\xc5\x13{\xae\xdf\xd8\x00\xa3\xc1\x05/k\xf46\xcd\xcf\x81\xbf\xf3\x15\xfc\x93\x7f5\x1c5$\x11\x1a\xdcZW\xe6L\xe0\x11\xbf\xcf`\x1f)R\xc0\xdd\x9f!\x92\x81\xffa\xa4e\xf9!\xe3\xae\x95x)<\\r3$\xeea\xf2X\xf2\x18J\xa8\xf8#BdZ\xcegW\xa3\xd1\xfbb\xe7\x13\x97]a$\xd2v\xb07y\xdfz\xd2\xc4\xf4\xffo/\xb4\xc8\xcflGGs)\x84\x8c\xc2<\x9d\xe8\x1d\xb0\xa69\xf1O\xb9\xb5\x86\x06d\x84Z\xfaR\x8f\x84[\xe7\xb8\x18\xf7KQ\xbb\xac\xae\xceL0\xec1\x9f\x1e\xda,\xbeB\xbd<\xc0wS\x92\n\x15\xd6\x15\xa5\xe5\x1av\xff\x020\xa3Z\x15BakM\x06\x00F\f\xd7\x9d ;1\x8e*}:\""
   710.226 AT read UNK   2 "\r\n"
   710.236 AT read OK    6 "\r\nOK\r\n"
   710.236 AT send      14 "AT+USORF=0,0\r\n"
   710.278 AT read  +   15 "\r\n+USORF: 0,0\r\n"
Socket 0: handle 0 has 0 bytes pending
   710.290 AT read OK    6 "\r\nOK\r\n"
0000710292 [system] TRACE: received 557
0000710294 [comm] TRACE: chunk
0000710296 [comm] TRACE: chunk idx=31 crc=1 fast=1 updating=1

[ NOTE: output from application code continues to run here, which is different than single threaded case, but nothing related to cloud or cellular ]
Success

Come to think of it, I’m not going to print the success logs, unless someone wants to see them specifically. It doesn’t seem like it would be super useful. But, it succeeds in downloading all the chunks in order (I think there are 94 in our code total), starts to say downloaded successfully, and then closes serial output.

Last couple notes

I’ve looked at 2 similar posts, and tried the suggested things in each:

To answer a couple commonly asked questions:

  • yes we do have software timers (I have tried disabling on firmware_update_pending, but to no avail)
  • yes we do particle publish. We have one specifically which in my view could be problematic, in a “cloud_status_connected” handler. I have commented this out at some points, but I don’t remember what other changes were in effect when I was trying that (and I never got OTA with a 3rd party SIM to work with different mutant versions of our code)

I have also run @rickkas7 's test code here. Notably, the modem output (with default times) would usually come up like this:

0000009370 [app] INFO: manufacturer=u-blox
0000009410 [app] INFO: model=
0000009450 [app] INFO: firmware version=
0000009530 [app] INFO: ordering code=SARA-G350-01S-00
0000009580 [app] INFO: IMEI=[redacted, but present and correct]
0000009620 [app] INFO: IMSI=
0000009670 [app] INFO: ICCID=[present and correct]
0000009670 [app] INFO: attempting to connect to the cellular network...
0000019501 [system] INFO: Sim not inserted? Detecting...
0000020831 [system] INFO: Sim Ready
0000020831 [system] INFO: ARM_WLAN_WD 1
0000021671 [system] INFO: ARM_WLAN_WD 2
0000021671 [system] INFO: CLR_WLAN_WD 1, DHCP success
0000021671 [app] INFO: connected to the cellular network in 12001 milliseconds
0000021672 [app] INFO: running cellular tests

However, if I extended the MODEM_ON_WAIT_TIME_MS to 8000 instead of 4000, I get correct output:

0000065227 [app] INFO: manufacturer=u-blox
0000065277 [app] INFO: model=SARA-G350
0000065327 [app] INFO: firmware version=08.70
0000065377 [app] INFO: ordering code=SARA-G350-01S-00
0000065427 [app] INFO: IMEI=[correct]
0000065477 [app] INFO: IMSI=u-blox
0000065527 [app] INFO: ICCID=[correct]

Thanks in advance for any help or insights!

Which system firmware are you on?

Place it in safe mode and try flashing again. If it goes well then there’s something in your code blocking the OTA from completing successfully

Hi @kennethlimcp, thanks for the quick reply.

We are using 0.6.4, as mentioned above (but we have also tried this with 0.8.0-rc.4 with same results, also mentioned above).

I should have included above, but we have tried flashing OTA from safe mode, and that has the same results (no serial output, but the same events published: ["started", "update_timeout"]). I just tried it again in case of voodoo but same deal.

Not sure if it’s me but I don’t see a 0.6.4 in build.particle.io.

Can you select 0.6.3 for the particular electron and flash again?

Hey @kennethlimcp, I think 0.6.3 is a firmware version for the Photon, and 0.6.4 for the Electron.

When I select my device there (on the web IDE, which as a side-note I’m not usually using to compile, if you think that makes a difference), there is only 0.6.4 and 0.6.2.

Github releases indicate the same: https://github.com/particle-iot/firmware/releases?after=v0.8.0-rc.1

Not sure if this is relevant for this case, but there was an issue about Particle.keepAlive() that required it to be set again and again after any (re)connect to the cloud, which probably not be happening in the course of multiple resets of a system upgrade.

This should have been solved with 0.8.0-rc.4 (currently latest version is 0.8.0-rc.8), but since users have adopted workarounds to the issue before that reports whether or not the fix actually resolved the issue are missing.

Running an explicit test might be useful.

Hey @ScruffR, we do re-set the keepAlive after a cloud reconnect, although I do agree that it could still be a related to the problem. Our relevant code looks like this:

// in setup(), called before Cellular/Particle connection:
System.on(cloud_status, cloudConnectionHandler);

// in setup(), after Particle.connected()
Particle.keepAlive(KEEP_ALIVE_SECONDS);

// ... and then later:
void cloudConnectionHandler(system_event_t event, int param) {
   if (param == cloud_status_connected) {
     Log.trace("Cloud Just Connected, setting keepAlive.");
     Particle.keepAlive(KEEP_ALIVE_SECONDS);
   }
}

But I definitely think some way to verify or test this would be good.

Running an explicit test might be useful.

I think you are right; what kind of test did you have in mind here?

What I meant was that the issue might hit when an OTA update also causes a Device OS update which will done in stages with multiple reset cycles after each of which your code is not running anymore and hence the Particle.keepAlive() logic won't kick in.

Some tests that might involve Device OS updates vs. pure application updates and see if there are any differences.
Unfortunately (IIRC) Safe Mode does not create any serial log output which would possibly help investigation - but that might make a useful request for future Device OS updates.


Done
https://github.com/particle-iot/firmware/issues/1554

@ScruffR thanks for the thought, and the ask on the logging of safe mode.

This issue could be happening when it tries to update device OS, but I don’t think it’s only happening in that case (unless I’m very much misunderstanding when this Device OS updating happens).

I created a very simple repo of an OTA failure example here to demonstrate the issue (you would need a 3rd party SIM – and maybe a 2G Electron – to test).

I think I mentioned this above, but my initial post was very long, so to reiterate: a pretty strange thing is that it almost always seems to hang at chunk idx=15, like this:

0000037059 [system] TRACE: received 557
0000037060 [comm] TRACE: chunk
0000037061 [comm] TRACE: chunk idx=15 crc=1 fast=1 updating=1
0000037063 [comm.protocol] INFO: rcv'd message type=7

[ at this point it hangs and there is no more output until ]
[ the 90 second update timeout ]

@BDub I just wanted to tag someone on the Particle team to see if you have any input, or if this sounds like any known issue (or known user error), for which there might be a workaround (or fix). Thanks!

@TyGuy thanks for the detailed report and logs. This is very interesting indeed and we will look into this further. Did you happen to mention which 3rd party SIM you are having the trouble with? I didn’t catch it if you did.

Regarding the keep alive in safe mode for third party SIMs, that is a problem in general that needs to be addressed probably by saving the SIM settings to DCD memory. However I don’t think that is the problem here since the re-handshake consumes data and keeps the connection alive just like the keep alive ping does. Shortly after that the OTA update would start, so the shorter keep alive ping won’t be needed.

Hey @BDub thanks for the response. We are having this problem with SIM cards from eseye, which is the SIM we mainly use, but I also tested with GeoSIM (with the same results), because we had a couple on hand.

Makes sense about the keepAlive.

Thank you for taking a look!

Hi @BDub and Particle team, I just wanted to follow up on this and see what the status of things or timeline for looking into this might look like. My team is leaving to deploy our next batch of devices very soon; the next deployment will be in about 9 days. So at least knowing if we should hope for a response soon, vs. needing to put a backup operational plan in action (which we'll begin planning for anyway, of course), would be very helpful. Thanks!

Having a very similar problem, only with 3G Electron and AT&T SIM cards. Fortunately I’m able to physically visit the units in question, but this is worrisome for our units farther afield. OTA worked a few weeks ago and now they don’t anymore. Definitely giving us a heartache right now.

How can we go about getting this issue prioritized?

P.S. @TyGuy idx=15 is on a bit boundary. Could be coincidence, but I wonder if this could be related to some kind of low-level bit manipulation or register mask.

This problem still exists, but a workaround seems to be to flash an empty project such as the below, and then flash the updated codebase.

#include "Particle.h"

// Set your 3rd-party SIM APN here
// https://docs.particle.io/reference/firmware/electron/#setcredentials-
STARTUP(cellular_credentials_set("m2m.com.attz", "", "", NULL));

SYSTEM_MODE(AUTOMATIC);

/* This function is called once at start up ----------------------------------*/
void setup()
{
	// Set the keep-alive value for 3rd party SIM card here
	// https://docs.particle.io/reference/firmware/electron/#particle-keepalive-
	Particle.keepAlive(60);
}

/* This function loops forever --------------------------------------------*/
void loop()
{
	//This will run in a loop
}

While these failures we’re seeing are almost certainly linked to the final binary size, it could be that the 3G issues are caused by a different mechanism from the 2G failures @TyGuy has observed.