OTA Issue - Endless updates consuming data at an alarming rate

I have been using Particle’s product over-the-air (OTA) updates for over a month and I have to say I love them. However, in this latest update:

  • deviceOS from 1.3.1 to 1.4.2
  • firmware from v6 to v9

One in five Electrons has a catastrophic failure:

  • the deviceOS updates to 1.4.2
  • the firmware updates to v9 again and again and again in rapid succession

Here is what I see on the console when a device goes into this mode:

And unfortunately, the data consumption goes through the roof. In fact, on a 3 MB plan, this device went to 11.63 MB before Particle could pause the SIM.

I had this happen to one other device. Placing it into “Dev” mode stops the endless updates, but to fix the device I had to physically go to it and perform the following steps:

  • Run Device doctor - no obvious issues
  • Run Particle update
  • Reflash the device with the v9 firmware using DFU

Obviously, I cannot do this for my more remote devices.

All this raises some questions:

  1. Why did Particle let the device go over its monthly limit by 3x before pausing the SIM?
  2. Why does this update work for 80% of the devices and fail spectacularly on the other 20%?
  3. Can I remotely fix this issue and bring these devices back into the normal population?
  4. What tests can I do to avoid this in the future?

I will also open a help desk ticket, but any advice would be appreciated.

Thanks, Chip

Hi Chip,

Thanks for asking these important questions! As you know, we’re working internally to resolve this for you - and due to the nature of the diagnostic process for this issue (poring through account-specific logs in our backend), I strongly recommend that anyone experiencing a similar issue open a support ticket.

  1. Why did Particle let the device go over its monthly limit by 3x before pausing the SIM?

Particle works with several telephony partners to manage its Particle SIMs. Particle has realtime access to your data consumption metrics, hitting a partner-specific endpoint when a threshold is crossed to say “shut that SIM down!” Overwhelmingly, the appropriate gate closes immediately. Sometimes, however, this communication can incur some latency - this is typically not an issue given the rate of most normal, day-to-day data consumption. However, sometimes data can be consumed with such vigor that it makes this latency apparent.

  2. Why does this update work for 80% of the devices and fail spectacularly on the other 20%?

After taking a look in our backend, it’s clear that some of these devices are resetting in the middle of OTAs, either a) rejecting their incoming firmware or b) disconnecting due to some instruction in their firmware. I’d love to take a more concentrated look at this in the context of a support ticket.

  3. Can I remotely fix this issue and bring these devices back into the normal population?

When you mark the device as a development device, what happens? It stops receiving updates but does it go offline? Is it impossible to flash OTA to the device?

  4. What tests can I do to avoid this in the future?

A great question. I will post the results, as appropriate, of our work together here! In general:

  • be careful about the placement and duration of a device’s sleep cycle so as not to interfere with the OTA process
  • be vigilant against calling Particle.function()s and especially .variable()s with such rapidity that it exceeds a given device’s timeout window
  • be aware of blocking and how to avoid it

but we’ll develop the above as we speak further!
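
As one way to picture the first point, here is a minimal sketch of a "stay awake long enough" guard, so that a pending OTA has a chance to start before the application asks for sleep. It is plain C++ rather than tested device firmware: `mayEnterSleep` and the window lengths are hypothetical, and the timestamps are assumed to come from a millisecond counter such as `millis()`.

```cpp
#include <cstdint>

// Hypothetical helper, not part of Device OS: returns true once the device has
// been connected for at least minAwakeMs milliseconds. Timestamps are assumed
// to come from a free-running 32-bit millisecond counter.
constexpr bool mayEnterSleep(uint32_t nowMs, uint32_t connectedAtMs, uint32_t minAwakeMs) {
    // Unsigned subtraction stays correct when the counter wraps around.
    return static_cast<uint32_t>(nowMs - connectedAtMs) >= minAwakeMs;
}

// Compile-time checks of the behaviour described above.
static_assert(!mayEnterSleep(10000, 0, 60000), "too early to sleep");
static_assert(mayEnterSleep(60000, 0, 60000), "awake window has elapsed");
static_assert(mayEnterSleep(5000, 4294967000u, 5000), "still correct across counter wrap");
```

In an application loop, the idea would be to record the counter value when the cloud connection comes up and to skip the sleep call until `mayEnterSleep` returns true. How long the window needs to be depends on the connection and is not something this sketch can answer.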

@marekparticle,

Thank you for the quick response and for the insights into data caps and how to avoid this in the future. I hope that figuring this out here can help others.

For anyone reading along, here is the v9 code: https://github.com/chipmc/Cellular-Pressure-Next

Thanks,

Chip

I had a similar problem (but with a Xenon) that always fell asleep when it tried to update its firmware. It is supposed to sleep, but I didn’t think about keeping it awake in case of updates; I somehow thought the subsystem would manage these things. Another one seemed to be “fast enough” to complete before sleeping. Is there a simple way to prevent sleeping? Something like an update-pending flag that can be checked before sleeping?
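
To make the "update-pending flag" idea concrete, here is a rough sketch of what such a flag could look like if the application tracked it itself. It assumes the firmware can observe when an update is announced, starts, and ends; how those notifications are obtained on a real device is left open. `OtaEvent` and `OtaTracker` are hypothetical names, not Device OS types.

```cpp
#include <cstdint>

// Hypothetical event codes. On a real device these would be mapped from
// whatever update notifications the firmware can observe (assumption).
enum class OtaEvent : uint8_t { Pending, Begin, Complete, Failed };

// Hypothetical tracker, not a Device OS class: remembers whether an update is
// announced or in flight so the application can check before sleeping.
class OtaTracker {
public:
    void onEvent(OtaEvent e) {
        switch (e) {
            case OtaEvent::Pending:
            case OtaEvent::Begin:
                busy_ = true;   // an update is waiting or running: stay awake
                break;
            case OtaEvent::Complete:
            case OtaEvent::Failed:
                busy_ = false;  // finished either way: sleeping is allowed again
                break;
        }
    }

    bool safeToSleep() const { return !busy_; }

private:
    bool busy_ = false;
};
```

The application would call `safeToSleep()` right before its sleep call and simply skip sleeping for that cycle when it returns false. Clearing the flag on failure is a deliberate choice here so that a failed update cannot keep the device awake indefinitely.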

@glx,

In my case, the program code needs to instruct the device to go to sleep. I could be wrong, but my assumption, like yours, is that once the update process begins, my code is no longer running and the system would manage wake / sleep.

The only wrinkle in this is that my carrier board has an external watchdog timer. However, I pet that watchdog as part of the wake up / connect process so I don’t think it is a factor in my case.

Thanks,

Chip

Update

So, this does not get to root cause but I have this device back in service. Here is what I did:

  • Turned on the “development device” flag in the console to stop the update do-loop cycle
  • At this point, the device was at deviceOS@1.4.2 but the firmware was v6 and the device went to sleep
  • When it woke, at @marekparticle’s suggestion, I flashed Tinker onto the device
  • Then I flashed v9 onto the device using “cloud flash” from the workbench.

At this point, the device had v9 and was working.

For fun, I unmarked the device for development (v9-v6) and then locked it to v9 and it updated as expected.

I am a bit nervous about updating the rest of my devices - especially the ones I cannot get to but at least I got this one recovered. I am hoping there is more to this story so this kind of issue can be avoided in the future.

Chip

With SYSTEM_THREAD(ENABLED) your code will still be running while downloading the new binaries.

Yup, that would be nice and I have proposed such behaviour years ago


But as it seems, there wasn’t enough community support for the issue to add some weight to the proposal.

@ScruffR,

When you say “wasn’t enough community support”, how does community support for an enhancement get measured?

BTW, if possible I would be happy to give blanket support for issues you raise as they do seem to be well aligned to my needs.

Thanks, Chip

When a particular issue comes up in the community, people sometimes search the GitHub repo to see whether this or a similar issue was already reported and then comment on that issue there.
An issue with multiple individual commenters gradually gets more visibility on GitHub and also here in this forum when a link (as above) gets posted and attracts more people with similar needs.
So the more supportive comments on GitHub the better :sunglasses:

Thanks for that - much appreciated :blush: :+1:

All,

Update here. I tried to simplify the problem as the update process I was performing updated both the deviceOS and the firmware. I simplified the process by only updating the firmware to v10 but staying with deviceOS@1.3.1. After testing this successfully with a number of devices, I started updating again.

Unfortunately, the 15th device went into the endless update loop again. The good news is that this is not a $5 mistake like before (12 MB in data charges before the system suspended the SIM) as it was only my firmware.

Again, looking for any help in identifying when a select few devices fail - and fail in this expensive way.

Thanks,

Chip

@chipmc,

Just wanted to add to this thread that I have the same problem on a regular basis in my products. I manage it the same way you did above, which is to switch the OTA looping devices to development, then wait a day or so, then try again. Waiting a day or so seems to help. I’ve had to do this for between 1 and 5 devices out of hundreds getting updates on most recent firmware deployments.

From my observations and the device logs I receive, the devices actually get the new firmware but the system publish function that publishes the application version to Particle does not get updated. An example is:

  • Device X on App v100 gets OTA to v101
  • Device X successfully completes OTA giving success message, then reset message
  • Next system publish that comes from device indicates device is still on App v100
  • My App, however, has a Particle.publish(APP_VERSION) event that says it is on v101
  • OTA starts looping
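
Since the looping shows up as updates starting again and again in a short time, one possible mitigation is to watch for that pattern and react early (for example by marking the device as a development device, as described earlier in the thread). The sketch below is only an illustration of the counting logic: `OtaLoopDetector`, its thresholds, and the idea of feeding it "update started" timestamps from some event source are all assumptions, not a Particle API.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical monitor, not a Particle API: flags a device whose OTA updates
// start more than maxStarts times within a sliding window of windowSeconds.
class OtaLoopDetector {
public:
    OtaLoopDetector(std::size_t maxStarts, uint64_t windowSeconds)
        : maxStarts_(maxStarts), windowSeconds_(windowSeconds) {}

    // Record one "update started" observation and return true when the device
    // looks stuck in a loop. Timestamps must be non-decreasing.
    bool recordStart(uint64_t timestampSeconds) {
        starts_.push_back(timestampSeconds);
        // Drop observations that have fallen out of the window. The deque is
        // never empty here because the current timestamp was just pushed.
        while (timestampSeconds - starts_.front() > windowSeconds_) {
            starts_.pop_front();
        }
        return starts_.size() > maxStarts_;
    }

private:
    std::size_t maxStarts_;
    uint64_t windowSeconds_;
    std::deque<uint64_t> starts_;
};
```

With `OtaLoopDetector d(3, 3600);`, a fourth update start within the same hour makes `recordStart` return true, while starts spaced more than an hour apart never trip it. The right thresholds would depend on how often legitimate updates are released.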

Not sure if sleep caused your issue or not, but it was not the cause of my issues.

As for your other question:

  1. Why did Particle let the device go over its monthly limit by 3x before pausing the SIM?

Are these LTE devices? I’ve also had this problem on my LTE devices. To my knowledge, Particle changed cellular partners for LTE devices from Telefonica to KORE. KORE has a number of issues I’ve experienced, including:

  1. KORE does not send updated data usage logs until a device runs out of data
  2. It is not easy to activate / deactivate / reactivate SIMs; about 20% of the time I need to contact Particle support to get LTE devices reactivated.

I’ve asked for Telefonica LTE SIMs, but I don’t think they are available. My mitigation on this is to make sure devices don’t get stuck in OTA loops.

@hwestbrook,

Thank you for chiming in. Glad it might not just be my bad luck.

I have also found that after a few days in “developer mode”, I can remove the label and the device does not go back into the endless loop.

I am seeing the same symptoms on the endless loop.

I have had both the Electron 3G and the Electron LTE exceed their data usage caps. As I scale, this issue could get expensive.

Thanks again, I will see if I can get an update on my ticket and will post whatever I learn here.

Chip