I am occasionally (1-2 times per day) experiencing a dropped connection. After further investigation through the serial monitor and by casually observing the RGB LED on the Boron, I FINALLY caught at least one and likely two events that resulted in the drop/reconnection. I'm quite certain it is something within my specific code on the Boron rather than an issue with the device itself.
When this occurs, the Boron RGB LED does 10 blinks between two SOS patterns, which per https://docs.particle.io/tutorials/device-os/led/photon/#red-flash-sos means an assertion failure. It only happens 4-5 times a day when I'm publishing data every minute (more frequently than needed, but I'm trying to get to the bottom of this issue). So far, I can't figure out what's causing it. As soon as this happens, my serial logger disconnects from the PC, and it is mid-stream of spitting out data to the serial logger. I'm afraid I am not getting the full picture of exactly what line in the code is causing this, and that the actual issue may originate several steps away in the code from where it crashes. Even if I put a Serial.println() after every line of code, I'm afraid the serial monitor still wouldn't provide the details.
Can someone provide better clarification on what an assertion failure really means, or what the leading causes of it are? Secondly, is there any guidance on ways to pinpoint what piece of code is causing it? For now, my strategy is to add a Serial.println() on each line of code along with a short delay(100) to give it some time to print each step to serial, and then try to pinpoint the exact line of code.
For reference, I'm on a Particle Boron running Device OS 2.0 LTS. I flashed Tinker and it stayed connected for the entire day with no issue, so for now I'm thinking it's something I'm trying to do in my specific code.
This is what I'd have suggested too.
In addition, you can instruct your SerialLogHandler to use LOG_LEVEL_ALL so that some of the Device OS states prior to the incident get logged too.
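For instance, a minimal sketch of that (the handler is declared once at global scope; how much you filter is up to you):

#include "Particle.h"

// Route application and Device OS log messages to USB serial.
// LOG_LEVEL_ALL is very chatty, but it captures the system states
// leading up to an incident.
SerialLogHandler logHandler(LOG_LEVEL_ALL);

void setup() {
    Log.info("startup, Device OS %s", System.version().c_str());
}

void loop() {
}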
AFAICT, SOS+10 happens when your code causes the Device OS to run into a state that is not meant to happen (i.e. should be impossible to reach).
These cases are very difficult to pinpoint as they usually are not a direct effect of your code but rather an accumulation of issues that eventually lead to such a crash.
Memory leaks, heap fragmentation and obviously the inevitable bug may be contributing factors.
To properly debug this you'd probably need a proper debugger that halts on the assertion fault so you can have a look at the stack trace.
However a keen eye on your code might be able to spot potential contributing factors.
I do have one of the Gen 3 debuggers lying around here somewhere. I connected it once right when I received it but haven't had an issue to chase since then. Maybe I need to drag that back out. My struggle is that it's so intermittent that I'm not sure how to "capture" the event using it. It'll run for hours and process many events without issue; if it were repeatable, it would be much easier to pinpoint. Is there any guidance on how I configure it to halt on a specific assertion fault? Are there any good documents or other open source tutorials you can point me to on how to configure that? I am using VS Code / Workbench as my IDE.
Haha. Yikes. Looks like I'm on a wild goose chase.
Yup, it's possible. Leave it to me.
If you or anyone else have any other words of advice let me know. I'll try and report back what I find if I'm able to find it.
@ScruffR - So do you think adding delays would correct, or just band-aid, an issue so that it no longer occurs? Is it by chance I had a race condition of some sort? I added this: Serial.println("Reached debug step: x"); delay(100);
Where x is just an incrementing number at different points. Prior to adding this, an event was occurring about every 2-4 hours. Since I added this to about 10 lines, it has now been 6 hours or so and it hasn't occurred yet.
Would adding a few delays here and there be a reasonable way to explain it? Not sure if I should let it go like this for a day in hopes I eventually catch something, or if I should start taking the delay(100) calls away one at a time and see which delay had an impact. Maybe it's just the old phrase "a watched pot never boils" and I need to be more patient. The pieces of logic right before/after/around it involve:
Basically, taking an array of bytes, converting it to Base64, packaging it in a JSON object, and then queuing that data to be published. Somewhere in the middle of this I do a Serial.println(), and it is mid-stream of printing that out when PuTTY stops updating and the SOS occurs.
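Here's a simplified sketch of that flow (not my exact code; the event/key names, the queueSample() helper, and the use of Base64RK's encodeToString() are stand-ins for what I'm actually doing):

#include "Particle.h"
#include "Base64RK.h"
#include "JsonParserGeneratorRK.h"
#include "PublishQueueAsyncRK.h"

SYSTEM_THREAD(ENABLED);

uint8_t publishQueueBuffer[2048];
PublishQueueAsync publishQueue(publishQueueBuffer, sizeof(publishQueueBuffer));

// Raw bytes -> Base64 -> JSON -> publish queue
void queueSample(const uint8_t *data, size_t dataLen) {
    String encoded = Base64::encodeToString(data, dataLen);

    JsonWriterStatic<256> jw;
    {
        JsonWriterAutoObject obj(&jw);   // opens the {...} object, closes it when the scope ends
        jw.insertKeyValue("utc", Time.now());
        jw.insertKeyValue("payload", encoded.c_str());
    }
    Serial.println(jw.getBuffer());      // this is the print that freezes mid-stream
    publishQueue.publish("Data", jw.getBuffer(), 60, PRIVATE, WITH_ACK);
}

void setup() {
    publishQueue.setup();
}

void loop() {
}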
The unusual, or maybe surprising, part to me is that once the device reboots itself after the SOS+10 flashes and reconnects to the cloud, one of the first messages that hits the Particle cloud, even before the device vitals are reported, is the queued-up event from before the SOS. I would have thought an SOS-type error was a hard reset with all memory cleared.
For now, I just sprinkled in 10 different print lines with incrementing numbers and the delay() to try to figure out what the last line executed before the SOS is. Since the message was actually queued up to be published, I wonder if something is occurring after this. It just happens to be right in the middle of writing this to serial when the serial port freezes (I'm using PuTTY for the serial monitor). Just thought I'd share a little more detail as I keep working through it, in case you or someone else has ideas to try.
That depends on how you've set up the publish queue. When it's storing the data in retained memory, it will start publishing previously stored data as soon as it's back up and running.
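For reference, the retained variant looks roughly like this (buffer size and names are just examples; with a regular, non-retained buffer the queued data would not survive the reset):

#include "Particle.h"
#include "PublishQueueAsyncRK.h"

SYSTEM_THREAD(ENABLED);

// Declaring the buffer 'retained' keeps queued events across a reset
// (including a panic or watchdog reset), so they are published once
// the device reconnects.
retained uint8_t publishQueueRetainedBuffer[2048];
PublishQueueAsync publishQueue(publishQueueRetainedBuffer, sizeof(publishQueueRetainedBuffer));

void setup() {
    publishQueue.setup();   // required once from setup()
}

void loop() {
}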
Yeah ok, I'll have to dig into the documentation/examples of that further. Thanks!
So in the last 15 hours or so, I only had 1 event vs. having 4-6 per 24 hours previously. Secondly, this single event occurred at a different area of the code/state. However, it also stopped mid-stream of a publish queue. The only commonality between the two is that both use JsonParserGeneratorRK.h and PublishQueueAsyncRK.h. The relevant code:
// This creates a buffer to hold up to 256 bytes of JSON data (good for Particle.publish)
JsonWriterStatic<256> jw;
{
JsonWriterAutoObject obj(&jw);
jw.insertKeyValue(String(dataPtID_DevReadUTC), Time.now());
jw.insertKeyValue(String(dataPtID_Sensor1), ReadSensor1());
jw.insertKeyValue(String(dataPtID_Sensor2), ReadSensor2());
jw.insertKeyValue(String(dataPtID_Sensor3), ReadSensor3());
jw.insertKeyValue(String(dataPtID_Pwr), isPowered());
jw.insertKeyValue(String(dataPtID_msgCnt), msgCnt++);
}
Serial.println(jw.getBuffer());
publishQueue.publish("Stat", jw.getBuffer(), 60, PRIVATE, WITH_ACK);
In this example, the output viewed via PuTTY stops mid-stream of printing out the contents of Serial.println(jw.getBuffer()). I wonder if something is happening with publishQueue.publish(). I can try to add delays before/after this as well and see if it goes away, but it would be nice to identify what is actually happening. Any other guidance? Or maybe, like you said earlier, I should just get the debugger out and try to catch the stack trace when this occurs. I am not familiar with it, and my concern is it'll take a lot of time to figure out how to set up, and I'm not sure I'll be able to interpret the stack trace to find anything meaningful, since I haven't done C++ or this type of debugging before.
@rickkas7 - You do such an amazing job providing these libraries for Particle users like myself, as well as answering questions like this on these forums. I thought I'd tag you here to get any insight. Is there any explanation for why adding a delay(100) immediately before or after a publishQueue.publish() would prevent a seemingly random race condition that ends up in an SOS assertion failure? This is using your PublishQueueAsyncRK.h library.
It might be a little too early to say with any certainty, but from my perspective it feels like there is something with the publishQueue.publish("Stat", jw.getBuffer(), 60, PRIVATE, WITH_ACK) line within the sketch that causes the SOS assertion failure under certain circumstances. Whatever those circumstances are, adding a delay(100) immediately before and after seems to prevent it completely, or at the very least greatly reduces its frequency.
I have two instances of publishQueue.publish() in my sketch. Judging by the last text that made it out of Serial.print(), the events always occurred immediately before/during/after one specific publishQueue.publish() call. I then added a few delay(100) calls before/after it. Since then, that specific publishQueue.publish() has never had an issue and the frequency of occurrence dropped significantly. All new events started occurring on the other instance of publishQueue.publish() within the sketch (this instance publishes slightly different data). As of about 9 hours ago, I added a delay(100) immediately before and after the second instance, and so far I haven't had an issue. I am very thankful for this but it certainly would be nice to understand it further. Is there anything explainable in this? Would the delay prevent a race condition of some kind within publishQueue.publish()?
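For clarity, the workaround currently in place is essentially this (a hypothetical helper; the queuePadded() name, the 100 ms value, and the buffer declaration are just illustrative, not my exact code):

#include "Particle.h"
#include "PublishQueueAsyncRK.h"

SYSTEM_THREAD(ENABLED);

retained uint8_t publishQueueRetainedBuffer[2048];
PublishQueueAsync publishQueue(publishQueueRetainedBuffer, sizeof(publishQueueRetainedBuffer));

// Pad each queued publish with short delays. I have no explanation yet
// for why this reduces the SOS frequency; it may just shift thread timing.
void queuePadded(const char *eventName, const char *json) {
    delay(100);
    publishQueue.publish(eventName, json, 60, PRIVATE, WITH_ACK);
    delay(100);
}

void setup() {
    publishQueue.setup();
}

void loop() {
}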
At this point, I'd like to see a full 24 hours of 100% cloud connectivity with no SOS events. I set it to crank out data at 1-minute intervals to try to get it to repeat quicker; so far, so good. If I get to 24 hours with no events, my next course of action would be:
1. Take away all the delays I sprinkled in except the ones immediately before and after both instances of publishQueue.publish().
2. Go another 24 hours with no SOS events.
3. Remove the delay() immediately before each instance of publishQueue.publish(). I doubt that's doing anything anyhow.
4. Go another 24 hours with no SOS events.
5. Keep cutting the delay in half until it's equal to 1 and see at what threshold the issue comes back. With every reduction in delay, go 24 hours with no SOS events. If I make it to 0, then remove the delay.
6. Add a delay to loop() and see if it just needs a delay somewhere vs. specifically around publishQueue.publish().
7. Maybe at a step or two in this process, go back to my original sketch without the delay() and just make sure the issue is still occurring vs. something else having changed.
I'll report back here with what I learn. If there is anything I should be doing differently or should also try, just let me know.
Alright... quick update. I am happy to report I just went 24 hours publishing 2 events/minute with zero disconnects or SOS events, and the device is still writing data out the serial port (it never hung up). Now moving on to figuring out which specific delay() matters and what value of delay(x) is needed to prevent this, i.e. starting with the steps listed above.
Alright… that didn't take too long. I removed all the delay(100) calls I sprinkled in except the ones immediately after publishQueue.publish(), i.e. Step 3 in my original plan (I skipped step 1 thinking that wasn't it). Within 3 hours of this change the event happened again. This time I was able to see it first hand: the RGB LED stayed a solid magenta. I think this is described in the docs at https://docs.particle.io/tutorials/device-os/led/boron/#solid-colors - “In most cases, solid colors are the side effect of a bug. If code crashes or infinitely loops with interrupts disabled, it's possible that the LED animation will stop.”
I even managed to quickly catch the solid magenta color on video with my phone for 10-20 seconds before my external watchdog circuit reset the device.
Now going back to Step 1 in my original plan: all delays removed except the delay(100) immediately before and after publishQueue.publish(). Let's see how long it lasts this time and where it hangs. Looks like I'll just keep adding delay(100) back in one at a time until I get another 24 hours with no issues.
Just figured I’d provide an update if anyone was curious on my progress.
So after adding the delay() back in before the publishQueue.publish(), I still had one reset/disconnect/serial-print freeze in the last 12 hours or so. It seems to have hung up right in the middle of the publishQueue.publish(), as you can see in the screenshot. It successfully printed out what was in the JSON buffer string, “New Message received: 6”, but did not make it to “New Message received: 7”. The only lines of code between the two are the delay(100) and publishQueue.publish(). However, as you can see, it did successfully queue the publish before the device was reset by the external hardware watchdog, as the event data does in fact make it to the cloud after the device resets and reconnects.
I am not sure where to go next besides keeping the same test running and seeing how frequent or repeatable this is. If anyone has any additional thoughts or guidance, let me know. My only other thought is that the delay() does in fact seem to reduce the frequency of events, so maybe instead of delay(100) I make it delay(250)? I really can't understand why adding delays would help at all. Is it simply bad timing between the thread that does the work for PublishQueue and some other thread, and by adding a delay the PublishQueue thread doesn't get interrupted?
I’d sure appreciate any additional thoughts, comments or guidance on what to look at next. @rickkas7, @ScruffR.
And a few hours later I had another event at the same spot in the program execution. This time it printed out a few more characters before locking up.
For reference, the “normal” printout when it doesn't have an error is “app.pubq() INFO: queing eventName=…”, as shown here. This is what the next line should look like when it doesn't error out and restart itself:
"You should also avoid mixing the use of Serial.printlnf and Log.info (and similar calls). Since Serial.print is not thread safe, it can interfere with the use of Log calls."\
I wonder if the Serial.print I was using was conflicting with the thread, or conflicting with the serial port that the LogHandler inside of publishQueue.publish() was also writing to. I'm not exactly sure of the details of this or if it makes logical sense, but I thought I'd give it a try. So I removed all Serial.print statements within my sketch and switched them all to Log.info or Log.warn, and then removed all the delay()s I had sprinkled in as part of my testing. Let's see how long it lasts this time.
I am very happy to report that it has been about 14 hours with 100% connectivity and no device resets, with the device publishing 2X per minute, and I haven't had any issues. At this point, I suspect the root cause was using Serial.println() and Log.info() within the same project. I think the race condition or timing just happened to line up between the Log.info() inside publishQueue.publish() and the Serial.println() I used just before and after it. Adding the delays just happened to change that timing enough.
This has been yet another reminder to always read the docs!
Solution: Change all Serial.print(), Serial.println(), and similar calls to instead use Log.info() (or a similar Log call based on the desired logging level).
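For anyone landing here later, the change was essentially this (a before/after sketch; msgCnt and the message text are just examples from my logs):

#include "Particle.h"

// One log handler at global scope. Log calls go through the thread-safe
// logging system, so they coexist with the logging done inside
// PublishQueueAsyncRK, unlike direct Serial.print calls.
SerialLogHandler logHandler(LOG_LEVEL_INFO);

int msgCnt = 0;

void setup() {
}

void loop() {
    // Before (can interfere with Log output from other threads):
    //   Serial.printlnf("New Message received: %d", msgCnt);
    // After:
    Log.info("New Message received: %d", msgCnt);
    msgCnt++;
    delay(60000);
}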