And unfortunately, the data consumption goes through the roof. In fact, on a 3 MB plan, this device reached 11.63 MB before Particle could pause the SIM.
I had this happen to one other device. Placing it into “Dev” mode stops the endless updates, but to fix the device I had to physically go to it and perform the following steps:
Run Device doctor - no obvious issues
Run Particle update
Reflash the device with the v9 firmware using DFU
Obviously, I cannot do this for my more remote devices.
All this raises some questions:
Why did Particle let the device go over its monthly limit by 3x before pausing the SIM?
Why does this update work for 80% of the devices and fail spectacularly on the other 20%?
Can I remotely fix this issue and bring these devices back into the normal population?
What tests can I do to avoid this in the future?
I will also open a help desk ticket, but any advice would be appreciated.
Thanks for asking these important questions! As you know, we’re working internally to resolve this for you. Given the nature of the diagnostic process for this issue (poring through account-specific logs in our backend), I strongly recommend that anyone experiencing a similar issue open a support ticket.
Why did Particle let the device go over its monthly limit by 3x before pausing the SIM?
Particle works with several telephony partners to manage its Particle SIMs. Particle has real-time access to your data consumption metrics and hits a partner-specific endpoint when a threshold is crossed to say “shut that SIM down!” Overwhelmingly, the appropriate gate closes immediately. Sometimes, however, this communication can incur some latency. That is typically not an issue at the rate of most normal, day-to-day data consumption, but occasionally data is consumed with such vigor that the latency becomes apparent.
Why does this update work for 80% of the devices and fail spectacularly on the other 20%?
After taking a look in our backend, it’s clear that some of these devices are resetting in the middle of OTAs, either a) rejecting their incoming firmware or b) disconnecting due to some instruction in their firmware. I’d love to take a more concentrated look at this in the context of a support ticket.
Can I remotely fix this issue and bring these devices back into the normal population?
When you mark the device as a development device, what happens? It stops receiving updates but does it go offline? Is it impossible to flash OTA to the device?
What tests can I do to avoid this in the future?
A great question. I will post the results, as appropriate, of our work together here! In general:
be careful about the placement and duration of a device’s sleep cycle so as not to interfere with the OTA process
be vigilant against calling Particle.function()s and especially .variable()s with such rapidity that it exceeds a given device’s timeout window
Thank you for the quick response and for the insights into data caps and how to avoid them in the future. I hope that figuring this out here can help others.
I had a similar problem (but with a Xenon) that always fell asleep when it tried to update its firmware (it should sleep, but I didn’t think about keeping it awake in case of updates). I somehow thought the subsystem would manage these things. Another one seemed to be "fast enough" to complete before sleeping. Is there a simple way to prevent sleeping? Something like an update-pending flag that can be checked before sleeping?
In my case, the program code needs to instruct the device to go to sleep. I could be wrong, but my assumption, like yours, is that once the update process begins, my code is no longer running and the system would manage wake/sleep.
The only wrinkle in this is that my carrier board has an external watchdog timer. However, I pet that watchdog as part of the wake up / connect process so I don’t think it is a factor in my case.
So, this does not get to root cause but I have this device back in service. Here is what I did:
Turned on the “development device” flag in the console to stop the update do-loop cycle
At this point, the device was at deviceOS@1.4.2 but the firmware was v6 and the device went to sleep
When it woke, at @marekparticle’s suggestion, I flashed Tinker onto the device
Then I flashed v9 onto the device using “cloud flash” from the workbench.
At this point, the device had v9 and was working.
For fun, I unmarked the device for development (v9-v6) and then locked it to v9 and it updated as expected.
I am a bit nervous about updating the rest of my devices - especially the ones I cannot get to but at least I got this one recovered. I am hoping there is more to this story so this kind of issue can be avoided in the future.
With SYSTEM_THREAD(ENABLED) your code will still be running while downloading the new binaries.
Yup, that would be nice, and I proposed such behaviour years ago: https://github.com/particle-iot/device-os/issues/1166
But it seems there wasn't enough community support for the issue to add weight to the proposal.
When a particular issue comes up in the community, people sometimes search the GitHub repo to see whether this or a similar issue was already reported, and then comment on that issue there.
An issue with multiple individual commenters gradually gains more visibility on GitHub, and also here in this forum when a link (like the one above) gets posted and attracts more people with similar needs.
So the more supportive comments on GitHub, the better.
Update here. I tried to simplify the problem, since the update process I was performing updated both the deviceOS and the firmware. This time I updated only the firmware to v10 while staying on deviceOS@1.3.1. After testing this successfully on a number of devices, I started updating again.
Unfortunately, the 15th device went into the endless update loop again. The good news is that this is not a $5 mistake like before (12 MB in data charges before the system suspended the SIM), since this time it was only my firmware.
Again, looking for any help in identifying when a select few devices fail - and fail in this expensive way.
Just wanted to add to this thread that I have the same problem on a regular basis in my products. I manage it the same way you did above: switch the OTA-looping devices to development, wait a day or so, then try again. Waiting a day or so seems to help. I've had to do this for between 1 and 5 devices out of hundreds getting updates on most recent firmware deployments.
From my observations and the device logs I receive, the devices actually get the new firmware but the system publish function that publishes the application version to Particle does not get updated. An example is:
Device X on App v100 gets OTA to v101
Device X successfully completes OTA giving success message, then reset message
Next system publish that comes from device indicates device is still on App v100
My App, however, has a Particle.publish(APP_VERSION) event that says it is on v101
OTA starts looping
Not sure if sleep caused your issue or not, but it was not the cause of my issues.
As for your other question:
Why did Particle let the device go over its monthly limit by 3x before pausing the SIM?
Are these LTE devices? I've also had this problem on my LTE devices. To my knowledge, Particle changed cellular partners for LTE devices from Telefonica to KORE. KORE has a number of issues I've experienced, including:
KORE does not send updated data usage logs until a device runs out of data
It is not easy to activate / deactivate / reactivate SIMs; about 20% of the time I need to contact Particle support to get LTE devices reactivated.
I've asked for Telefonica LTE SIMs, but I don't think they are available. My mitigation on this is to make sure devices don't get stuck in OTA loops.