We’ve recently migrated our fleet (~800 Photons) from Device OS 1.4.4 to 2.0.0. Occasionally, a Photon freezes and needs to be power cycled: it is totally unresponsive, goes offline, and stops executing the code.
It never happened (not a single time) on 1.4.4, so we concluded there must be an issue with 2.0.0. Before preparing a minimal example to reproduce this issue, I’m trying to find out whether Particle has already noticed it.
Some more info:
The watchdog is set up, but it doesn’t trigger.
It happens really rarely (hard to reproduce). I have heard about it from only a few customers. Personally, I saw it 3 times in half a year on our internal devices.
We are sure it is caused by Device OS, not our code.
Particle has broken the hardfault handler. It doesn’t SOS and reboot anymore, but locks up.
There is also a bug in the wifi code that causes a hardfault occasionally when wifi is lost, but before the hardfault bug was introduced this would only cause a reboot.
I have opened a PR with a fix.
It seems that to fix a compiler warning, someone added an = character without ever checking that this didn’t change the meaning of the code.
I think the hardfault handler now causes a new hardfault, resulting in an endless loop.
I’m now trying to convince them that this is a bug severe enough to do a hotfix release immediately…
Perfect, this is a super valuable finding. I could not reproduce it on our side but we have several devices deployed to customers who report that problem.
I can confirm the problem persists on all current 2.X.X and 3.X.X Device OS versions.
Yeah breaking the hardfault handler because you didn’t bother to test that your change didn’t change the syntax of the assembly? Wow.
For reference, the test isn’t hard to write:
I’m surprised it went unnoticed for such a long time. I guess it is because it is a bug in the bug catcher. Everyone focuses on the first bug and misses that the freeze is not part of the first bug.
For my customers, the device freezing when wifi is lost can result in ruining large batches of beer. The device suddenly freezes with the heating or cooling in the last state!
The wifi bug is an old bug, but at least up until the hardfault bug the device would just reboot. Now the watchdog doesn’t even work anymore.
I have given the line number of the wifi bug, so hopefully that gets some attention too. This bug has cost me a lot of time and frustration over the years.
Particle seems to focus more on the newer generation of devices now. I think we deserve more attention as long-time customers.
We’re not asking for much now, just merge the fix into the latest rc and release it as an extra rc.
Fix the wifi bug later. Just fix the hardfault asap so we can reboot instead of freeze.
GCC 10 shows a warning about alignment without that = character.
By adding the = character, the warning is gone, but I think this is because a different address is used (the = causes the expression to be evaluated and loaded as a value, instead of loading what is stored at that label). So the warning is gone because the wrong address is loaded now, not because the alignment is fixed for the correct address!