We have a few hundred Electrons in the field. A few of these devices get into a solid red state. From our understanding, the device may have been outputting an SOS pattern when something (a stack overflow or the like) locked it up. For now the fix is a rather expensive push of the reset button.
Does the locked up state mean that no code is running and that the RGB is red because that was the last command it received?
Presently, we have to send someone to reset each device to get it working again.
I understand that an external watchdog timer circuit could get the Electrons out of this state, but as the circuit boards have already been manufactured and are on site, I wondered if anyone could suggest another clever way to reset the device.
The firmware is largely stable and we are only seeing this on a few devices over many months (years).
@Moors7 @rickkas7 @ScruffR as you have been helpful before do you have any views?
It’s almost always a software issue in user firmware. The docs include this:
In most cases, solid colors are the side effect of a bug. If code crashes or infinitely loops with interrupts disabled, it’s possible that the LED animation will stop. The color of the LED is the color it last was before failure. So for example, it could be solid cyan if it was previously breathing cyan, or solid red if it was trying to output an SOS pattern.
Another reason is deadlock. For example, if you try to obtain a mutex from a SINGLE_THREADED_BLOCK while another thread has obtained the mutex, the system will halt because the mutex cannot be obtained (it’s in use), but because it was done from a single threaded block, the other thread can never be swapped in to release it, so the device just halts whatever it was doing. However this usually results in LED off or LED cyan, since the device was probably in normal operating mode when this happened.
Solid red is almost always because the device locked up while outputting an SOS code. Tracking when you get an SOS but the device does not lock up may yield clues to where the underlying bug is.
Typically it’s something like using freed memory, freeing memory twice, using an uninitialized pointer, overwriting or underwriting an allocated memory block, or overflowing the stack.
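One way to do the tracking suggested above is to record the reset reason on every boot and report it when the last reset was a panic. The following is only a sketch under some assumptions: the code-to-name table reflects the SOS blink counts as I understand them from the docs (check them against your Device OS version), the "sos" event name is made up for illustration, and the `PARTICLE` guard assumes the Particle build defines that macro. It also assumes the panic code is what `System.resetReasonData()` returns after a `RESET_REASON_PANIC`, which is worth confirming on a test device.

```cpp
#include <cstdint>

// Map an SOS/panic code to a short name. The values are an assumption based on
// the documented SOS blink counts; verify them for your Device OS version.
const char *panicCodeName(uint32_t code) {
    switch (code) {
        case 1:  return "HardFault";
        case 2:  return "NMIFault";
        case 3:  return "MemManage";
        case 4:  return "BusFault";
        case 5:  return "UsageFault";
        case 6:  return "InvalidLength";
        case 7:  return "Exit";
        case 8:  return "OutOfHeap";
        case 9:  return "SPIOverRun";
        case 10: return "AssertionFailure";
        case 11: return "InvalidCase";
        case 12: return "PureVirtualCall";
        case 13: return "StackOverflow";
        case 14: return "HeapError";
        default: return "Unknown";
    }
}

#ifdef PARTICLE  // assumed to be defined only when building for a Particle device
#include "Particle.h"

// Reset information is only retained if this feature is enabled at startup.
STARTUP(System.enableFeature(FEATURE_RESET_INFO));

static bool sosPending = false;  // true if the previous reset was a panic (SOS)
static uint32_t sosCode = 0;

void setup() {
    if (System.resetReason() == RESET_REASON_PANIC) {
        sosCode = System.resetReasonData();
        sosPending = true;
    }
}

void loop() {
    // Report once, and only after the cloud connection is up.
    if (sosPending && Particle.connected()) {
        // "sos" is a hypothetical event name chosen for this example.
        Particle.publish("sos",
                         String::format("%lu %s", (unsigned long)sosCode,
                                        panicCodeName(sosCode)),
                         PRIVATE);
        sosPending = false;
    }
}
#endif
```

If these events show up mostly on the devices that later lock up, the reported code (for example a stack overflow versus a hard fault) narrows down which class of bug to look for.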
Thanks for the quick response. We appreciate that it may be an issue in user firmware, but with thousands of lines of code it is difficult to work out where it is. The firmware runs successfully but every now and again hits this problem, so at the moment we cannot pin down where the issue is.
We do not use any SINGLE_THREADED_BLOCK.
Interrupts are not disabled in the firmware.
So it has likely crashed somehow.
The commonality amongst the affected devices is low cellular signal; they drop and re-establish their connection often. But that may be a red herring.
Assuming we can’t find the error, is the only way to get out of the issue to “press” the reset button or to cycle power?
Essentially that’s what a hardware watchdog does: it presses the reset button when the MCU stops running. There’s no way to get out of this state on the device itself because it’s basically stopped running.
However, one workaround that may band-aid the problem is to periodically reboot the devices using System.reset(). Maybe once a week, or every few days. If the problem is memory corruption, you may successfully reboot the device before the corruption causes it to lock up. Especially on the Electron/E Series, the cellular modem stays connected during a reset, so it’s fast and does not use large amounts of cellular data.
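A minimal sketch of that periodic reboot might look like the following. The three-day interval, the `rebootDue` helper, and the `bootMs` variable are illustrative choices rather than anything prescribed; the `PARTICLE` guard again assumes the Particle build defines that macro. The elapsed-time check uses unsigned subtraction so it keeps working when `millis()` wraps (roughly every 49.7 days), which means the interval must stay below that.

```cpp
#include <cstdint>

// Illustrative interval: reboot every 3 days (must be well under ~49.7 days).
const uint32_t REBOOT_INTERVAL_MS =
    (uint32_t)3 * 24 * 60 * 60 * 1000;

// True once at least intervalMs has elapsed since bootMs.
// Unsigned subtraction gives the right elapsed time across a millis() wrap.
bool rebootDue(uint32_t nowMs, uint32_t bootMs, uint32_t intervalMs) {
    return (uint32_t)(nowMs - bootMs) >= intervalMs;
}

#ifdef PARTICLE  // assumed to be defined only when building for a Particle device
#include "Particle.h"

static uint32_t bootMs = 0;  // millis() value captured at startup

void setup() {
    bootMs = millis();
}

void loop() {
    // ... normal application work goes here ...

    if (rebootDue(millis(), bootMs, REBOOT_INTERVAL_MS)) {
        System.reset();  // clean restart before any corruption can build up
    }
}
#endif
```

In real firmware you would probably also hold off the reset while a measurement or publish is in progress, so the reboot lands at a quiet moment.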