Issue: after 11-12 hours of running with no resets and no sleep, my Electron freezes/crashes with a solid LED that is either green or cyan. This seems to be related to detachInterrupt(), but that seems unlikely, so I am posting here.
I run a bunch (200+) of Electrons for ag-tech use, and I am experiencing the issue on a new version of our product. We run Particle firmware 0.6.1 in MANUAL mode with the system thread enabled, with a relatively complex user app that I can't post here because it is too long. Both the old and new versions of the product count RPM as part of their sensor arrays, but the new product counts pulses at a much higher rate (1500-2500 pulses per second) than the old one (150-250 pulses per second).
The freeze is unpredictable, so it is hard to reproduce, but I have been able to reproduce it on my test bench with another microcontroller providing the RPM source. I've reproduced it twice, and only after the device has run for 11+ hours. On the latest run, I put logging into the app to see if I could trace it back to a particular function. The function I traced it back to is here:
When it crashed, nothing was logged to serial after TACH2. The only operations between TACH2 and TACH3 are:
set timerStop to millis()
calculation of RPM based on rpmCount and timers
set rpmCount to 0
I've tried to create some simple firmware builds that would make this crash repeat, but no luck. I've attempted:
getting stuck in that while loop forever — LED continues to pulse
forcing a millis() rollover, i.e. timerStart > timerStop — sets rpm to 0
making rpmCount much higher than a uint16_t can hold when divided by time — sets rpm to a lower value
attaching and detaching the interrupt repeatedly
sending data to the device during the while loop where Particle.process() is being called
The questions I am hoping someone here can help with:
Does the serial logging lag behind code execution? i.e. Could TACH3 in that code run without getting logged to Serial?
What exactly does a solid LED mean? I see in another post that it could mean the RTOS has crashed. Is this accurate?
Elsewhere on this forum, solid-LED freezes seem to be linked to connection issues. Is that the only reason I might get this type of freeze, or are there other causes?
One issue I see is calling Particle.process() in that while loop. Could that be a source of this crash?
I've considered upgrading to 0.6.4, but I don't see anything in the GitHub repository between 0.6.1 and 0.6.4 that would affect this issue. We do use SPI on our device, but I don't think the changes related to SPI in 0.6.2 affect this. Is there a reason I should bump to v0.6.4?
Is there a way for the internal watchdog to reset device on system crash, or do I need an external watchdog for that?
This is not my full code; it is too complicated to post in its entirety here, so I posted a snippet of where my error logging says the issue is.
The Log.trace calls are only there in order to figure out where the issue is coming from. They were added after the issue started happening and don’t seem to affect whether the Electron freezes or not.
Have you also checked out 0.7.0 change log?
As mentioned in other threads, any solid colour indicates a deadlock in the system, most likely due to contention over some shared resource, and Serial is one of these.
But without knowing more about the rest of your code, that's all we can see as a possible contributor to the issue.
So you are leaving to others the heavy lifting of testing and reporting possible issues so that they can be found and addressed? But what if no one used the system in just the way you need to use it?
How would potential bugs for that use-case ever be found and eliminated for 0.7.x?
How would 0.7.1 ever come about if everybody just stuck with the legacy version and left the development of new features, new hardware, and new backends, plus the testing of all thinkable edge cases and the identifying, logging, tracking, and squashing of bugs, to a team of half a dozen engineers?
I’d guess we would not have seen 0.5.4 yet.
@ScruffR, on your BTW: I think there is a place for deploying 0.7.0, but that place is not on devices in the field. The issue I am trying to fix is on a few devices situated a few hours from where I work. If it makes you feel better, I have 0.7.0 on a test rig right now. So far so good, but I was burned badly by 0.6.0, so I won't be taking chances with it.
When Serial.print lags, can Serial.flush be used to counteract that? For example, if I place a Serial.flush after Log.trace("TACH3"), will that ensure TACH3 is sent to Serial before moving to the next piece of code to execute?
Update on this issue: last night I ran two tests, one on a device with 0.6.1 and one on a device with 0.7.0, with this modified code snippet, where I am using Serial.flush to hopefully pin down the specific location of the crash:
Both 0.6.1 and 0.7.0 devices froze at the same place, right after they logged “TACH3” to serial. The 0.6.1 device froze after 4 hours 12 minutes of running and the 0.7.0 device froze after 7 hours 9 minutes.
It looks to me like they froze on detachInterrupt. Is this a shared system resource?
How can I guard against a freeze on detachInterrupt? Or at least eliminate it from possibilities?
I'm also logging free memory, and this doesn't seem to be an issue. The 0.6.1 device is running with 93296 bytes free and the 0.7.0 device with 83096 bytes free, neither of which changes over the run time of the device.
@hwestbrook, can you provide context/code as to where and how getRpm() is called? If you can post more of your code, especially your global code (vars, object instantiations, timers, etc.) and setup() code, it might be helpful.
A couple of notes: return; is not required in your ISR. Also, you may want to disable interrupts prior to detaching the interrupt if that interrupt can possibly occur during the call. You can re-enable interrupts immediately after the detach.
@peekay123, thanks for the suggestion – I will try disabling and then re-enabling interrupts using noInterrupts() and interrupts(). I've just got that test rig going, so now I just wait 12 hours or so…
I know this is annoying from your perspective, but our app is just too complex to post here. The general place this function runs is:
5 or so sensors are read and saved into variables (including RPM from this function and this IC: LM1815)
Depending on sensor output and preset settings, our device actuates farm equipment
Sensor data and actuation data are uploaded over CoAP to our own backend server
Our backend server responds with any new settings
This all happens in the loop function, which is not an FSM (something I need to move to when I have time). We have thousands of combined hours of testing on the application, so we know it is pretty stable.
I will follow up with something I can post on here that will crash the Electron that is just the code above and maybe some cloud activity.
This crashed my Electron after 29 minutes. The electron running this code is on 0.7.0. The RPM pulse signal is about 3700 pulses per second. A link to the log file is here: https://pastebin.com/DFVM65CL
In parallel I ran a version of the code turning interrupts on and off before and after detachInterrupt, but in my main application. That too crashed that Electron. I will try this on the simpler code above and report back.
Both of these happened quicker than I would have expected, but I don't think this bug is tied to run time, so it might be more random.
One obvious item to remove would be Particle.process(), although I'm not sure why it would cause this crash.
@hwestbrook, my big concern was how you were calling the function, but it sounds like you are calling it from loop(), so that's fine. One thing to remember is that the Log capability was never designed to be synchronous with events, only to log them. For farm equipment, I would have to say that an FSM and exception handling are a must!
At 3700 pulses per second, the Electron will be interrupted roughly every 270 µs, so disabling interrupts is important IMO. Given that you are using system threading, you should try without the Particle.process() calls. I'm not quite certain why you have so many Log.trace() calls in getRpm(). If you only have one when you enter and another when you exit, does the Electron crash? What if you remove all of them from getRpm()?
@peekay123, the only reason I have the logging calls in this code is in order to trace the bug back to its source – all of these Log.trace calls have been added since the bug showed up. My production firmware does not log anything to Serial (it logs to a 200 byte buffer, but does not log anything related to getRpm).
My first version of this tracing logged only on function enter and exit, so I think I've already covered that one. And since my main firmware does not log anything in this function, I think I've already covered that as well.
For faster output to Serial, should I be calling Serial.print as opposed to Log.trace? Is there some risk that the program is crashing somewhere other than detachInterrupt?
Here is an updated version with noInterrupts and a delay before detachInterrupt. This is on my test rig now:
Also, I agree on FSM and exception handling, but I don’t want to derail this discussion, as I believe the code above should operate for at least 24 hours. (and should indefinitely, but I don’t want to set my expectations too high!)
@hwestbrook, I said my bit on FSMs, so let's leave it at that.
I would remove all logging for now and replace the calls with Serial.print() in getRpm(). I am assuming that you came to the conclusion that things crash in getRpm() by using the Log.trace() calls, correct? From what I see, you are not using any software timers or other asynchronous bits, correct? How are you reading the "5 or so" sensors? Do they each have an interrupt?
@peekay123, no, I didn't come to the conclusion that the Log.trace calls are causing the crash. They were added after the fact, and I think they only point to where the problem is. I'm not sure what the evidence is that the Log calls cause the crash. I do not have these calls in my normal firmware.
The code above and below is the exact code I am running on my test devices, so there are no other sensors or anything else being read. In my normal firmware there are other sensors, like a DS18B20, but nothing that requires pulse counting like this.
@hwestbrook, the question was whether you had ascertained that getRpm() is where your crash was occurring. I am assuming that your test units are crashing consistently with this code. For the sake of non-blocking, can you change loop() to this:
I would also get rid of the delay(1); after the noInterrupts(); call. Since you have retained memory enabled, you could create a counter and set it to zero. Increment the counter whenever you want to “count” a step in your code. Reset the counter before calling getRPM. When you reboot, if the counter is non-zero in setup() then print the value. From this you will be able to figure out which step the code had completed before it crashed.
The DS18B20 uses OneWire which also makes heavy use of interrupts - including noInterrupts()/interrupts(), hence we were asking for more info about your surrounding code.
We do have a feeling about what may be interfering with your code, but can only be sure once we know about these apparently unrelated but actually very relevant facts.
However, if you really see the same issue without any extras, there is something rather fishy.
Yes, I am pretty sure getRpm is where the crash is happening. I slowly put more and more Log.trace statements into my full application code, and every time it crashed, it crashed right in the middle of getRpm. There was never a time when it crashed in any other function or sensor read. If it were a crash from too many Logs or something, I wouldn't expect it to always crash in the same spot at different times.
From the traces showing it was in getRpm, I simplified the code down to what is in post #12 and then post #17. I think I should be able to iterate on this code (removing serial prints, etc.) until I can get it to run for 24 hours without a crash.
Do you agree that I should be able to run at least a version of this code for 24 hours without a crash?
The next time I get a crash, I will remove all serial prints in getRpm and go with your idea of a counter to find the position.
Do you think it's OK to have a non-blocking serial print in my loop, like in your example above? It seems like it should be possible to print data to Serial indefinitely on an Electron. Is there some limit to how many times you can call Serial.print?