I'd like to get more clarity on how automatic rollbacks of OTA updates work. The documentation says this:
"If for some reason an OTA is interrupted (like a disruption in connectivity or a device losing power), the device will fail gracefully by automatically reverting to the previous version of working firmware."
And this is the only reference to "rollbacks" that I found in the docs.
What about the case where the OTA technically completes (no disruption in connectivity) but the device ends up in an unusable state, such as not responding to commands? Ideally each device would have a customizable health check that it must pass within N seconds after the OTA. Is this supported by the firmware?
Maybe "rollback" isn't the correct term. When an OTA update fails the new firmware will never get activated.
The process is:
1. The new firmware gets downloaded into a dedicated OTA flash area.
2. The downloaded firmware binary gets checked for integrity.
3. The bootloader tries to transfer the new firmware from the OTA flash area to the execution space.
If any of the above fails the original firmware will stay untouched.
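The three steps above can be sketched as a small host-side model. This is only an illustration, not Device OS code: the `Flash` struct, `applyOta`, and `crc32Of` are hypothetical names, and CRC-32 is a stand-in because the thread does not say which integrity check the device actually uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of the two flash regions mentioned above.
struct Flash {
    std::vector<std::uint8_t> active;   // execution space (firmware that runs)
    std::vector<std::uint8_t> staging;  // dedicated OTA flash area
};

// Bitwise CRC-32 (reflected polynomial 0xEDB88320), used as a stand-in check.
std::uint32_t crc32Of(const std::vector<std::uint8_t>& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::uint8_t byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial when the dropped bit was 1.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Returns true only if the image was fully received, verified, and activated.
// On any failure the active firmware is left untouched.
bool applyOta(Flash& flash, const std::vector<std::uint8_t>& received,
              std::size_t expectedSize, std::uint32_t expectedCrc) {
    flash.staging = received;                       // step 1: download to staging
    if (flash.staging.size() != expectedSize) {
        return false;                               // interrupted transfer
    }
    if (crc32Of(flash.staging) != expectedCrc) {
        return false;                               // step 2: integrity check failed
    }
    flash.active = flash.staging;                   // step 3: copy to execution space
    return true;
}
```

Note that nothing in this flow looks at whether the new firmware behaves correctly once it runs; it only guards against an incomplete or corrupted image.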
But on Core, Electron and mesh devices there is a factory image section where you can deposit a fallback firmware which can be copied over to the execution section via factory reset.
What reasons would you be thinking of?
How would you define "customizable"?
What would you tell the health check to check for?
If it's customizable it could also be customized badly which wouldn't help either.
On the other hand, if you take the care to customize it thoroughly, you could just as well run your own health check entirely in your firmware and put the device into Safe Mode when it detects issues.
This has the additional benefit of being independent of OTA updates: it can run at any time your device misbehaves.
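A minimal sketch of such a firmware-side health check, written as plain C++ so it can be tried off-device. `HealthMonitor`, `poll`, and the failure threshold are hypothetical names and choices for illustration. On a real device the `onUnhealthy` callback would presumably wrap `System.enterSafeMode()`, and `poll()` would be called from `loop()`; treat that wiring as an assumption.

```cpp
#include <functional>
#include <utility>

// Trips a callback once after a number of consecutive failed health checks.
class HealthMonitor {
public:
    HealthMonitor(std::function<bool()> check,
                  std::function<void()> onUnhealthy,
                  int maxFailures)
        : check_(std::move(check)),
          onUnhealthy_(std::move(onUnhealthy)),
          maxFailures_(maxFailures) {}

    // Call periodically. Returns true if the latest check passed.
    bool poll() {
        if (check_()) {
            failures_ = 0;          // a passing check resets the streak
            return true;
        }
        ++failures_;
        if (!tripped_ && failures_ >= maxFailures_) {
            tripped_ = true;        // fire the callback only once
            onUnhealthy_();
        }
        return false;
    }

private:
    std::function<bool()> check_;
    std::function<void()> onUnhealthy_;
    int maxFailures_;
    int failures_ = 0;
    bool tripped_ = false;
};
```

Requiring several consecutive failures is a design choice to avoid entering Safe Mode on a single transient glitch. This sketch still depends on the application loop running, so it does not cover the case where the firmware itself hangs.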
A bug or programmer error: some kind of deadlock, a failed state migration, or any other mistake a programmer can make.
Written in code by a programmer of the device.
That is specific to the use case. Maybe try to communicate with the cloud and initiate a sequence of two-way pings. If it's mechanical, maybe try to do something and measure a reaction. Again, it depends on the device and on what the programmer considers to be a "healthy" state of this device.
Of course, and that's why this needs to be integrated into DeviceOS directly. Something like a dead man's switch that will activate if it doesn't get a confirmation from both its own health check (cloud handshake) and the programmer's custom health check.
Yes, but if my health check hangs for any reason, something else needs to put the device into Safe Mode.
I see where you're coming from and the wish for some insurance policy is a fair one.
However, these two statements of yours illustrate the problem well
vs.
That is a contradiction in itself, isn't it?
Who wouldn't like to eat the cake and still have it?
The individual use case may - by design - not want a cloud handshake and/or the custom health check may be flawed. Hence what would the device OS be acting on?
Yup, something else always helps, but the problem is what that would be.
The difficult thing is not to see what would be nice to have but how to achieve it.
"Something" is a dangerous word. It's used so quickly and easily but doesn't really provide any useful information and may even suggest the availability of a solution to an unsolvable problem.
I don't quite see why it would be a contradiction. DeviceOS might be able to invoke user-written functions and at the same time (in a different thread?) do its own health checks, and then act on its own if the user-written callback doesn't respond successfully within a timeout period.
If we look at it from a layered perspective, DeviceOS is expected to be super-stable and to survive and recover regardless of what user code is running on top of it. Compare it with Linux/Windows: if you use documented APIs then you won't be able to crash it (please don't pick on my words here, I'm just using it as an example).
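To make the idea concrete, here is a minimal sketch of running a user-supplied check on a separate thread with a deadline, using only the C++ standard library. `runWithDeadline` and `Verdict` are hypothetical names; this is not an existing DeviceOS API, and a real implementation on a microcontroller would likely use RTOS primitives instead of `std::thread`.

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

enum class Verdict { Healthy, Unhealthy, TimedOut };

// Runs userCheck on its own thread and waits at most `deadline` for an answer.
// A check that hangs yields TimedOut instead of blocking the caller.
Verdict runWithDeadline(std::function<bool()> userCheck,
                        std::chrono::milliseconds deadline) {
    // The promise lives on the heap so the worker can outlive this call.
    auto result = std::make_shared<std::promise<bool>>();
    std::future<bool> outcome = result->get_future();

    std::thread([result, check = std::move(userCheck)] {
        result->set_value(check());   // never reached if the check hangs
    }).detach();

    if (outcome.wait_for(deadline) != std::future_status::ready) {
        return Verdict::TimedOut;     // supervisor decides: roll back, Safe Mode, ...
    }
    return outcome.get() ? Verdict::Healthy : Verdict::Unhealthy;
}
```

The caller would treat both `Unhealthy` and `TimedOut` as a failed health check. The sketch assumes the check does not throw, and it leaves a hung worker thread running, which an actual supervisor would have to clean up, for instance by resetting the device.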
Another comparison: Kubernetes. If you try deploying a bad service that doesn't pass its health check in time, the rollout stalls instead of replacing the healthy version, and it can then be rolled back. Rollout tooling can also act on user-specified metrics (error rate, exceptions, etc).
With Particle, I don't think there's much choice. What else other than DeviceOS is available on the device?