Boron - Seemingly random device reset - How to best capture the reason

jgskarda · December 30, 2020, 4:55pm

I am experiencing what seems to be a sporadic device reset during I believe the PublishQueue.Publish() call although not sure. It seems similar to assertion-failure SOS error I had prior but it does not seem to go into an SOS assertion failure anymore but rather just does a device reset. I believe it is doing a full device reset and not just a disconnect event as the setup() is called again which resets a message counter to 0. The only way this is set to 0 is during setup(). This image is why I believe it’s around or related to PublishQueue.Publish().

As you can see, from the console, the message “Stat” is published to the cloud then immediately the device resets. Once the device is connected to the cloud again, it will then Publish the same message again. I’m guessing the device resets immediately after publishing the message before or as it is receiving and processing the ACK message and so the device sends it again after re-connecting.

I’ve yet to deduce exactly what is causing the device reset event but with it being so sporadic it’s been difficult for me to pin point. Possible reasons:

External Watchdog implemented per application note - AB1805_RK. Yes, I call ab1805.loop() every loop().
Out of memory (I have an out of memory handler that does a System.Reset()
Some sort of hardware issue pulling 3.3V too low (at the time this happens, I just finish taking readings from all the sensors, power is turned off to all sensors, I softdelay 500 ms and then do the publishqueue.publish(). Seems unlikely but not sure what else would cause a reset.

It seems to occur quite randomly on 1 out of the 10 devices ever 12-24 hours. For now during this testing, these devices are awake and connected 100% of the time and do the same thing every 5 minutes (take readings, send them to the cloud, stay awake, repeat). I normally can debug these things using logging to serial but I haven’t been able to be connected to the right device when this happens.

Any guidance on what can be done to try and “catch” the reset reason?

A few things I’m considering trying:

Change all devices from PublishQueue.Publish --> Particle.Publish() - determine if it still occurs
Do not initialize the SerialLogHandler on all devices except the one I’m able to monitor serial on. Does it still occur? The phenomenon I see is if a device had say 1-2 events within an hour or two, as soon as I then hook up to that device to monitor it’s serial it seemingly never happens again for well over 24 hours.
Try writing a value to EEPROM just before System.Rest() of out of memory handler and then read that upon setup() to determine if that is what the reset was for.

Are there any other suggestions or things to try to determine what is causing the seemingly random device resets on devices that are not connected to USB.

chipmc · December 30, 2020, 5:18pm

@jgskarda,

I will be very interested in what you learn here.

You may have tried this already but you can setup a macro to help capture the reset reason.

Initialize this:

STARTUP(System.enableFeature(FEATURE_RESET_INFO));

Then you can call this function in your Setup()

if (System.resetReason() == RESET_REASON_PIN_RESET || System.resetReason() == RESET_REASON_USER) { // Check to see if we are starting from a pin reset or a reset in the sketch
    sysStatus.resetCount++;
    systemStatusWriteNeeded = true;                                    // If so, store incremented number - watchdog must have done This
  }

I use this to track how many times the device is reset by the watchdog (I use the AB1805 as well).

However, you can get more information out of this API:

Hope this helps and good luck.

Chip

jgskarda · December 30, 2020, 5:28pm

Thanks @chipmc I’ll give this a try. And again a reminder to check the docs. One of these days I should just read through the entire docs file from start to finish. A lot of good stuff there. Much appreciated. I’ll let you know what I find out.

rickkas7 · December 30, 2020, 5:37pm

Also be sure note the AB1805 wake reason using getWakeReason() from setup(). If it’s caused by the watchdog, that will differentiate between normal system reset and watchdog system reset.

Also set a retained variable in your memory allocation failure system event handler. Test and clear it in setup(), so you can tell if that was the cause of a System.reset().

jgskarda · December 30, 2020, 7:12pm

Thanks @rickkas7 and @chipmc

I added the following:

STARTUP(System.enableFeature(FEATURE_RESET_INFO));
STARTUP(System.enableFeature(FEATURE_RETAINED_MEMORY));
retained int outOfMemorySts;

Located within setup():

//Make sure the Wake-up reason is correct before reading it and inialize to -1
    ab1805.updateWakeReason();
    int AB1805_Reason = -1;
    AB1805::WakeReason wakeReason = ab1805.getWakeReason();
    if (wakeReason == AB1805::WakeReason::UNKNOWN) AB1805_Reason = 0;
    if (wakeReason == AB1805::WakeReason::WATCHDOG) AB1805_Reason = 1;
    if (wakeReason == AB1805::WakeReason::DEEP_POWER_DOWN) AB1805_Reason = 2;
    if (wakeReason == AB1805::WakeReason::COUNTDOWN_TIMER) AB1805_Reason = 3;
    if (wakeReason == AB1805::WakeReason::ALARM) AB1805_Reason = 4;

    //Create JSON of System Reaset Reason, AB1805 reason and out of Memory Status
    JsonWriterStatic<256> jw;
    {
      JsonWriterAutoObject obj(&jw);
      jw.insertKeyValue("SyRst_Reason", System.resetReason());
      jw.insertKeyValue("AB1805_Reason", AB1805_Reason);
      jw.insertKeyValue("outOfMemorySts", outOfMemorySts);
    }

    //Log the info to Serial
    Log.info("Reset message: %s", jw.getBuffer());

    //Publish the info to the Particle Cloud. This will publish once connected. 
    publishQueue.publish("RstRsn", jw.getBuffer(), 60, PRIVATE, WITH_ACK);

    //Reset out of Memory Status to 0 to capture again. This is set = 1 in the out of memory handler just before resetting. 
    outOfMemorySts = 0;

within Loop:

if (outOfMemory >= 0) {
      // An out of memory condition occurred - reset device.
      Log.error("out of memory occurred size=%d", outOfMemory);
      outOfMemory = 1;
      outOfMemorySts = 1;
      delay(100);

      System.reset();
  }

Which publishes this to the cloud each time the device powers up:

Now to deploy this to the other 7-8 devices, make a quick webhook to send the data to my backend so I can log it to a SQL table. Seems like good info to have recorded even if I didn’t have this specific issue. Is there anything you think I’m missing in the meantime? The only thing I was considering was to first connect to particle and then use the classic Particle.Publish() instead of publishQueue.Publish()

jgskarda · December 30, 2020, 9:00pm

I deployed the update to the 8 devices I have active testing right now, created a new web hook to my backend and the data is being pushed to a SQL table. Seems to be working well as all 8 devices reported back SyRst_Reason: 70 which translates to: “Successful firmware update”. So now I wait for the elusive sporadic reset event to happen and see what gets reported.

In any case, this seems like nice data to have for any future troubleshooting of this or other issues. For anyone who is looking at doing something similar… here is the decoder ring:

SyRst_Reason:
Unspecified reset reason:10
Reset button or reset pin:20
Low-power management reset:30
Power-down reset:40
Brownout reset:50
Hardware watchdog reset:60
Successful firmware update:70
Firmware update timeout:90
Factory reset requested:100
Safe mode requested:110
DFU mode requested:120
System panic:130
User-requested reset:140

AB1805:
UNKNOWN:0
WATCHDOG:1
DEEP_POWER_DOWN:2
COUNTDOWN_TIME:3
ALARM:4

outOfMemorySts:
Reset not due to out of memory: 0
Reset due to Out of Memory: 1

jgskarda · December 31, 2020, 3:03am

Well that didn’t work… I had the same issue but still didn’t capture a reset reason unless it’s the AB1805 with a reason of “Unknown”. Here is what I saw. Same behavior where the event “Stat” is published immediately before the event and then again once it connects to the cloud with the same data but reset reason capture was 0 for all 3 reasons. Not quite what I was expecting. I think I did everything right for writing the value to retained memory. Not quite sure. What is a bit unusual is SyRst_Reason was a value of 0 which isn’t even an option supposedly.

Also, I noticed most times when I do an actual device reset of either software update, reset pushbutton or even powering it off/back on, I get an actual reset event i.e. “spark/device/last_reset” which I suspect should match the SyRst_reason that I am trying to capture as well. However, for some reason I do not get the “spark/device/last_reset” event so it appears the device isn’t actually resetting??? However, it is re-running setup() so a bit unclear on what would cause this.

Does anyone have any thoughts… This isn’t a big deal for me as it only happens once or twice a day across the 10 devices I have I just would prefer to avoid it if possible or at least understand what could be causing it.

jgskarda · December 31, 2020, 3:43pm

Captured another event last night. This time it occurred on the device I was also doing the serial monitor on (FINALLY). Two interesting observations:

Serial monitor indicates it made it to my state machine's "stayAwake" state which does nothing. However, it never got to [comm.protocol] INFO: Received TIME response: 1609416012 did something happen in the comm.protocol background thread or the piece that handles the publish acknowledgement? The data makes it to the cloud before the reset so it doesn't seem to be on that side of the equation but rather the response back to acknowledge.

Here is what a normal sequence looks like:

0031306588 [app.pubq] INFO: queueing eventName=Stat data={"254":1609416008,"4":1170,"3":685,"1":0,"5":1,"252":50,"255":105} ttl=60 flags1=1 flags2=8 size=80
0031306589 [app] INFO: PublishQueue.Publish():After
0031306590 [app.pubq] INFO: publishing Stat {"254":1609416008,"4":1170,"3":685,"1":0,"5":1,"252":50,"255":105} ttl=60 flags=9
0031306592 [app] INFO: {"prevState": st_PubStat, "newState": st_Connect, "millis": 31306592, "elpsdms": 10121, "elpsdTimeInState_ms": 1618}
0031306608 [app] INFO: {"prevState": st_Connect, "newState": st_StayAwake, "millis": 31306608, "elpsdms": 10137, "elpsdTimeInState_ms": 16}
> 0031308435 [comm.protocol] INFO: Received TIME response: 1609416012
0031308437 [app.pubq] INFO: published successfully
0031308863 [app] INFO: New Subscription Event: e00fce683945d780412a3ad6/hook-response/Stat/0
0031308863 [app] INFO: With Data: {"0": 1}
0031312695 [app.ab1805] INFO: setWDT -1

Here is what the sequence looked like when it stopped/reset:

0031606553 [app.pubq] INFO: queueing eventName=Stat data={"254":1609416308,"4":1174,"3":686,"1":0,"5":1,"252":50,"255":106} ttl=60 flags1=1 flags2=8 size=80
0031606554 [app] INFO: PublishQueue.Publish():After
0031606554 [app.pubq] INFO: publishing Stat {"254":1609416308,"4":1174,"3":686,"1":0,"5":1,"252":50,"255":106} ttl=60 flags=9
0031606558 [app] INFO: {"prevState": st_PubStat, "newState": st_Connect, "millis": 31606558, "elpsdms": 10121, "elpsdTimeInState_ms": 1618}
0031606574 [app] INFO: {"prevState": st_Connect, "newState": st_StayAwake, "millis": 31606574, "elpsdms": 10137, "elpsdTimeInState_ms": 16}
... (USB serial monitor stopped here since the device reset itself, it should of gotten to: INFO: Received TIME response: xxx).

Since it was in my "stayAwake" state, I suspect it was the background thread of receiving the response of my publish that caused this event. StayAwake state doesn't do anything besides a few calculations each time to determine how long until the next time it needs to report back, a case statement that acts as my state machine and then a basic IF statement. This is all it does in this st_StayAwake state:

Second odd thing is the System.resetReason() was 1082396676 how is this even possible?

 //Create JSON of System Reaset Reason, AB1805 reason and out of Memory Status
    JsonWriterStatic<256> jw;
    {
      JsonWriterAutoObject obj(&jw);
      jw.insertKeyValue("SyRst_Reason", System.resetReason());
      jw.insertKeyValue("AB1805_Reason", AB1805_Reason);
      jw.insertKeyValue("outOfMemorySts", outOfMemorySts);
    }

    //Log the info to Serial
    Log.info("Reset message: %s", jw.getBuffer());

    //Publish the info to the Particle Cloud. This will publish once connected. 
    publishQueue.publish("RstRsn", jw.getBuffer(), 60, PRIVATE, WITH_ACK);

This is likely not a outOfmemory issue as it should of included that in the log just before it did the System.Reset() if that was the culprit.

It's probably too early to make to many assumptions or changes, I'd prefer to have a few repeat events captured but here are a few things on my mind. Not sure what this changes but a test to understand impact and learn more.

Selectively use publishQueue.Publish if not Particle.Connected() otherwise use Particle.Publish().
Do not use the WITH_ACK flag and just assume it makes it?
I'm currently using System_Thread(Enabled), System_Mode(Manual) and therefore do not use Particle.Process(). I am tempted to not enable System Thread and then add Particle.Process() if Particle.Connected().

jgskarda · January 2, 2021, 2:18pm

I’ve had several events in the last few days and occasionally get a “burst of even 5-10 events” from the same device every few minutes but eventually it recovers on its own. In almost all of these cases the devices all three reasons (System reset, ab1805, and out of memory) all report a value of “0”. So it’s unclear what is actually causing the reset or pseudo reset that causes the setup function to be called again. I am in manual mode so it shouldn’t run setup even on a cloud disconnect.

However, after seeing this Application note:https://support.particle.io/hc/en-us/articles/360044776653-AN005-Threading-Explainer I have a feeling this might be due to having threading enabled since the issue seems to be related to handling of cloud events right after a publish event. After reading this AN I clearly have more to learn about threading before I should use it and sounds like I’m best to not use threading enabled.

@ScruffR - you referenced this application note in a different Discussion . Do you think it could pertain to the issue I am seeing as well? Or maybe it could be accomplished with threading enabled but my code isn’t “clean” enough to do it correctly? This is what really caught my attention “ Finally, it’s really a pain to debug unsafe thread code. It’s unpredictable and timing-sensitive. When you look for the bug it can stop happening.”

In any case, I think I’ll not use threading enabled make sure I have particle.process() where appropriate and then deploy it to the 8 devices and see what happens. I’ll report back in a few days after testing.

Gildons · January 2, 2021, 7:39pm

Hi jgskarda

I read your post and forgot to ask: Are you also using Serial1 to interface with anything?

I’m still trying to narrow down my problem, but following ScruffR request to make a non-thread version of my test, I found that the problem persists even without an additional “user thread”.

It might be interesting if you can remove/disable parts of your code to pinpoint where the problem might be. I ended up removing all my code before arriving at the Serial1/Serial issue

jgskarda · January 2, 2021, 7:59pm

@Gildons - No I am not using serial for anything else. I only use it for logging messages out to serial when connected to the PC for debug, troubleshooting, and generating log traces on the PC to refine the logic, I typically do not have a serial device communicating to the Boron directly. I just got back home from a little vacation so hope to begin testing the single thread approach here shortly.

Although your issue seems to be slightly different than mine I was hoping you were going to report and say it is due to multi threading. I still have my fingers crossed single threading will help alleviate my scenario. Time will tell.

jgskarda · January 3, 2021, 5:12pm

So after disabling threading on the 8 devices last night the behavior has changed quite a bit. A very good thing… as it certainly seems related to threading. However, I still have more to learn… Here is what I now see:

A device no longer does a reset however the cloud connection does drop and reconnect at the same time it used to reset. So what was a reset (or at least re-calling of setup() even though a reset flag wasn’t set) has now turned into a cloud disconnect while staying in Loop().
The frequency of the cloud connection dropping/reconnecting is significantly more (about once per hour per device most devices when publishing every 5 minutes). I’d say atleast X10 more.
Each disconnect event happens immediately after a particle.PublishQueue() based on the timestamps of events that reach the console (devices publish at the top of the hour and then at 5 minute intervals. Always happens right at this interval and immediatly after publishing an event.
I still receive duplicate publish events within the particle cloud during each event indicating the PublishQueue is never acknowledged assuming because the cloud disconnect event.

image703×265 27.6 KB

Now here is the unusual behavior I wasn’t expecting:
I figured now that it’s happening more frequently it will make it much easier to troubleshoot given log files. Especially if it’s not resetting the serial port should stay open so I get the full log file before/after after each event. So I plugged a device that was experiencing this disconnect every hour or more and started serial monitoring the log via Putty. The disconnect events have seemingly stopped on that specific device for the last 5 hours but continue on the other 7 other devices at least once per hour. This one device doesn’t disconnect at all and is working perfectly!

After 5+ hours of running perfectly with zero disconnects when connected via USB cable to my PC, @ ~1:18 PM I disconnected the USB cable from the PC. At the immediate next publish event that occurring at 1:20 PM it disconnect from the cloud, reconnects and then publishes duplicate events just like it did before.

What does this mean? I’m not sure.
Top of mind next steps:

Plug the device into a USB wall charge rather than the PC (maybe it’s a power thing, doubtful but trying to think what can be different with a USB cable.
Disable Serial monitoring in the firmware (I wouldn’t think using logging when a USB cable isn’t hooked up would be a problem but maybe I don’t understand something. Am I suppose to not use Logging when not plugged in?, I’m going to feel silly if this is the issue.)
Try Particle.Publish() instead of PublishQueue.Publish() and repeat (maybe PublishQueue.Publish() is doing something that causes the cloud disconnect events)

jgskarda · January 4, 2021, 1:16am

In my further testing this afternoon, there is definitely some cross dependency between some or all of the following:

The library PublishQueueAsyncRK
Calling PublishQueue when Connected vs when not connected and then connecting later for events to be published.
Threading being enabled or disabled
The use of a log handler within the main sketch (i.e. SerialLogHandler logHandler;)
The Boron serial port is being actively monitored by a PC to accept the messages or by itself.

Here was my results from a few initial tests in hopes to correct the problem:
Test 1: Plug the device into a USB wall charge rather than the PC (maybe it’s a power thing, doubtful but trying to think what can be different with a USB cable.
Result from test 1: As expected, no change in behavior, still would disconnect frequently (at least once or twice per hour)

Test 2: Disable Serial monitoring in the firmware by commenting out this line in the main sketch:

//SerialLogHandler logHandler;

Results from test 2: Interestingly the sketch would compile however, this caused almost immediate and constant disconnects as soon as PublishQueue.Publish() was called and by constant I mean it was every 30 seconds - 2 minutes. Since each disconnect/reconnect can burn up data, I didn’t let this one run very long.

Test 3: Try Particle.Publish() instead of PublishQueue.Publish() and repeat (maybe PublishQueue.Publish() is doing something that causes the cloud disconnect events).
Result from test 3: Significantly improved connectivity with zero disconnects in 3 hours (so far).

My immediate next step:

Update all 8 devices to conditional use Particle.Publish ONLY when Particle.Connected() with all devices disconnect from a PC. Set 4 devices to stayAwake and 4 devices to sleep, take readings, call publishQueue.Publish and then connect. Identify if this corrects the random reset/cloud disconnect issue. Using this looks very promising and I’m hopeful it will correct the issue for me! I will use something like this:

    if (Particle.connected()){
      Particle.publish("Stat", jw.getBuffer(), 60, PRIVATE, WITH_ACK);
    }
    else {
      publishQueue.publish("Stat", jw.getBuffer(), 60, PRIVATE, WITH_ACK);
    }

Results: Although I thought this looked promising, the disconnect event still happens when the device connects to the cloud and starts unloading the queued events:

If step 1 is successful, repeat the same test but with threading enabled - CANCELED due to results above.

3a) Test for tonight. Let’s validate there is not an issue when I use Particle.Publish() on all 8 devices and keep the devices “awake”.

Results: Out of the 8 devices publishing every 5 minutes the last 8 hours using only Particle.Publish() I had zero disconnects at the moment of the publish event with duplicate publish events showing up. So this definitely seems related to publishQueue.publish(). 6 devices out of the 8 had 0 total disconnects. 1 had 4 disconnects (but not around a publish event) and the last one had 5 disconnects (also not around a publish event). What I mean by not around a publish event, is I publish at the top of the hour and at 5 minute intervals. I.e. 1:00, 1:05, 1:10, 1:15… The disconnects in this test on the two devices were always in between those publish events. I.e. @ 1:03 for example. I suspect this is unrelated and more cell signal quality/strength related.

3b) Modify Logging Level to determine impact. At one point, I tried LOG_LEVEL_NONE and it would not maintain connection. Went back to LOG_LEVEL_INFO, flashed and connection was maintained. I need to test this more yet. Given this 5 second test and no issues when connected to the PC, it seems like there is something related to logging level and PublishQueue().

Create a bare bones sketch, test each possible combination of options between Logging, connected to PC or not, Publish Queue used or not, and threading enabled or not. Document the sketch used for each test as well as the results and provide them to someone way smarter than me. Include a test on other/earlier versions of firmware as I do not recall having this issue in 1.5.x. I hope to be able to share this tomorrow 1/4/2021.

By the way… this testing has been on LTS release 2.0.1. Not sure if the issue is with that version of the LTS release, the PublishQueueAsyncRK or maybe just my lack of understanding of all of this stuff. In any case, feels like I’m getting close on a resolution (I Hope!)

ScruffR · January 4, 2021, 8:23am

As it seems that ParticlePublishQueue is somewhat in the center of the issue it would be good if @rickkas7 could chime in again.
Also the connection between LOG_LEVEL_NONE (or removal of the handler entirely) and the worsening of the issue seems disconcerting and may ask for Rick’s attention too.

jgskarda · January 5, 2021, 4:47am

@ScruffR - Yes… definitely something with PublishQueue. I think it’s time for someone smarter than me (i.e. @rickkas7) to see what might be happening. Do I submit a support ticket to Particle somehow or what is the next steps to engage the Particle Engineering/Support team for this. I have full log files to share as well just let me know what I can do to help! Hopefully Rick can replicate the issue and see what is happening given the info below:

My only course of action right now is to go back to SYSTEM_THREAD(ENABLED) and put up with what then seems like a pseudo reset event that occurs what seems to be at the same time. I will go back to Systeup_threaded(enabled) with logging set to ALL and see if I can then capture one of the pseudo reset events while logging.

I can now replicated the issue quite frequently in the bare bones sketch shared below. The cloud disconnects are quite persistent when using PublishQueue() and non-existent when using Particle.Publish()
My setup:

Particle Boron running LTS 2.0.1
Particle Boron in free space (not physically connected to hardware
-Standard USB cable connected from usb port on boron to USB port on PC
Serial Monitoring being done via PuTTY
6600 mAh Lipo battery connected directly to JST connector
Reported cellular signal strength when conducting test
-Publish Event is being kicked off by manually calling the function “Stat” from the Console.Particle.IO

"signal":{
"at":"LTE Cat-M1"
"strength":32.99
"strength_units":"%"
"strengthv":-109
"strengthv_units":"dBm"
"strengthv_type":"RSRP"
"quality":38.23
"quality_units":"%"
"qualityv":-13.5
"qualityv_units":"dB"
"qualityv_type":"RSRQ"
}

I am also happy to say the issue is repeatable while serial monitoring when log level is set to “LOG_LEVEL_ALL”. So I was able to capture some logs. I am happy to provide the complete log file but not sure easiest way to do that. Do I email it to somewhere? Here is some quick snippets of the log file around the issue. Top line log file indicates Event Loop Error 3 typically when it’s trying to immediately publish an event from publishQueue.Publish() or Event loop error 18 when it’s publishing an event from the queue.

0000063727 [app] INFO: stat message: {"1":1609818841,"2":22,"3":33,"4":44,"5":55,"6":66,"7":1}
0000063728 [app.pubq] INFO: queueing eventName=Stat data={"1":1609818841,"2":22,"3":33,"4":44,"5":55,"6":66,"7":1} ttl=60 flags1=1 flags2=8 size=72
0000063729 [app.pubq] INFO: publishing Stat {"1":1609818841,"2":22,"3":33,"4":44,"5":55,"6":66,"7":1} ttl=60 flags=9
0000063730 [comm.protocol] TRACE: rcv'd message type=2
0000063733 [comm.protocol] ERROR: Event loop error 3
0000063736 [system] WARN: Communication loop error, closing cloud socket
0000063740 [system] INFO: Cloud: disconnecting
0000063743 [system] INFO: Cloud: disconnected
0000063747 [system] INFO: Cloud: connecting

I also receive this trace:

0000114930 [app.pubq] INFO: publishing Stat {"1":1609818893,"2":22,"3":33,"4":44,"5":55,"6":66,"7":11} ttl=60 flags=9
0000114941 [comm.protocol] ERROR: Event loop error 18
0000114941 [system] WARN: Communication loop error, closing cloud socket
0000114944 [system] INFO: Cloud: disconnecting
0000114948 [system] INFO: Cloud: disconnected
0000114948 [system] INFO: Cloud: connecting

Here is the exact sketch I was using to test this:

/*
 * 
 * Description: Bare bones sketch to replicate issue with using PublishQueue in LTS 2.0.1 
 * Author: jgskarda
 * Date: 1/4/2021
 */

//Include necassary libraries
#include "Particle.h"
#include "PublishQueueAsyncRK.h"
#include <JsonParserGeneratorRK.h>

SerialLogHandler logHandler(LOG_LEVEL_ALL);

STARTUP(System.enableFeature(FEATURE_RESET_INFO));
STARTUP(System.enableFeature(FEATURE_RETAINED_MEMORY));


//define the system modes to use
//System threading is disabled. When system threading is enabled, the frequency is much less however, instead of a cloud disconnect, it results in a pseudo reset event where setup() is called agian. 
// SYSTEM_THREAD(ENABLED);
SYSTEM_MODE(MANUAL);

//Define variables needed for Publish Queue library. 
retained uint8_t publishQueueRetainedBuffer[2048];
PublishQueueAsync publishQueue(publishQueueRetainedBuffer, sizeof(publishQueueRetainedBuffer));


int8_t msgCnt = 0;
/*************************************************************************************/
//                        SETUP
/*************************************************************************************/
void setup()
{
  //Define all particle functions.
  Particle.function("Stat", Stat);

  //Delay 10 seconds to give time for serial debugging
  delay(10s);

  publishQueue.setup(); // Needed for queueing of Publish Events 

}


/*************************************************************************************/
//                            LOOP
/*************************************************************************************/
void loop()
{
  Particle.process();

  static uint32_t msLastConTime = 0;
  if (!Particle.connected() && millis() - msLastConTime < 60000) {
    Particle.connect();
    msLastConTime = millis();
  }

  static uint32_t msLastPublish = 0;
  if (!Particle.connected() && millis() - msLastPublish < 10000) {
    Stat("args");
    msLastPublish = millis();
  }
}

 /*******************************************************************************
 * Function Name  : stat()
 * Description    :  
 * Return         : Publishes event to the cloud
 *******************************************************************************/
int Stat(String arg){
	
  // This creates a buffer to hold up to 256 bytes of JSON data (good for Particle.publish)
	JsonWriterStatic<256> jw;
  {
		JsonWriterAutoObject obj(&jw);
		jw.insertKeyValue(String(1), Time.now());
		jw.insertKeyValue(String(2), 22);
    jw.insertKeyValue(String(3), 33);
    jw.insertKeyValue(String(4), 44);
    jw.insertKeyValue(String(5), 55);
    jw.insertKeyValue(String(6), 66);
    jw.insertKeyValue(String(7), msgCnt++);
	}
    Log.info("stat message: %s", jw.getBuffer());
    
    //Using publishQueue.Publish will result in frequent cloud disconnects
    publishQueue.publish("Stat", jw.getBuffer(), 60, PRIVATE, WITH_ACK);

    //Using Particle.Publish() will run without issue even hammering on the request
    // Particle.publish("Stat", jw.getBuffer(), 60, PRIVATE, WITH_ACK);

  return 1;
 }

A few other noteworthy observations:

Does using NO_ACK vs WITH_ACK flag have an impact? No - Slightly different behavior but disconnects still occur. As you’d expect, the events are no longer duplicated to the cloud over and over again however, the device still disconnects on nearly every call to PublishQueue
Does different logging levels have an impact?: YES - it appears to. Issue impacted by different logging levels but unable to correct issue. Just occurs at different frequencies and level of impact. This test was done with my full original program, Threading: NOT Enabled, , Only using publishQueue. Observations:
LOG_LEVEL_ALL : - Disconnect constantly and continues to try and publish the same event to the cloud.
LOG_LEVEL_TRACE : Happens but not very frequent. One event in a 3 hour period.
LOG_LEVEL_INFO : Happens but not as frequent about the same as Log_Level_Trace
LOG_LEVEL_WARN : Disconnect constantly and continues to try and publish the same event to the cloud.
LOG_LEVEL_ERROR : Disconnect constantly and continues to try and publish the same event to the cloud.
LOG_LEVEL_NONE : Disconnect constantly and continues to try and publish the same event to the cloud.

rickkas7 · January 5, 2021, 12:23pm

I would not attempt to use PublishQueueAsyncRK without using SYSTEM_THREAD(ENABLED). It’s almost guaranteed to fail in unpredictable ways. I added a warning to the README that it’s required.

jgskarda · January 5, 2021, 3:04pm

Thanks @rickkas7 I had a feeling that was the case and that it was nearly required. It’s been a bit conflicting information on if/when Threading should be used based on reading that application note. Since it wasn’t originally not in the documentation I figured it wasn’t required but glad you add that to the README for clarity.

Any explanation as to why the behavior seems to change weather the Boron is physically connected to a PC via USB cable and actively using serial monitor as well as what level of logging is being used? Does this just affect timing enough and results in some race conditions and precise timing/race condition of the cloud handshake?

Finally, what is Event Loop Error 3 and Event loop error 18 mean? Is this a particle cloud handshake thing or something else?

I did re-enable threading on one device last night and monitoring the logs via PuTTY while using the following:
SerialLogHandler logHandler(LOG_LEVEL_ALL);
SYSTEM_THREAD(ENABLED);
SYSTEM_MODE(MANUAL);
…
publishQueue.publish(“Stat”, jw.getBuffer(), 60, PRIVATE, WITH_ACK);

So far no disconnect events and no pseudo reset events in the last 9 hours on that one device. I will deploy this same thing to the other 7 devices and see if I still get the occasional reset events that started me down this road of not enabling threading. IF I still get those, I will then implement the selective use of publishQueue.Publish based on !Particle.Connected() and repeat. Thanks again for the reply and guidance! Always very much appreciated!

rickkas7 · January 5, 2021, 3:19pm

The errors of the form “Event loop error 3” are ProtocolError codes.

3 is INVALID_STATE which appears to occur when an unexpected CoAP or DTLS message arrives that would not be expected at that point in the transaction.
18 is IO_ERROR_GENERIC_RECEIVE is an error reading DTLS data.

github.com

particle-iot/device-os/blob/develop/communication/inc/protocol_defs.h#L25


#ifndef PRODUCT_ID
#define PRODUCT_ID (0xffff)
#endif
#ifndef PRODUCT_FIRMWARE_VERSION
#define PRODUCT_FIRMWARE_VERSION (0xffff)
#endif
#define MAX_SUBSCRIPTIONS (6)       // 2 system and 4 application
enum ProtocolError
{
    /* 00 */ NO_ERROR,
    /* 01 */ PING_TIMEOUT,
    /* 02 */ IO_ERROR,  // too generic, discontinue using this.  Perfer/add a specific one below
    /* 03 */ INVALID_STATE,
    /* 04 */ INSUFFICIENT_STORAGE,
    /* 05 */ MALFORMED_MESSAGE,
    /* 06 */ DECRYPTION_ERROR,
    /* 07 */ ENCRYPTION_ERROR,
    /* 08 */ AUTHENTICATION_ERROR,

jgskarda · January 6, 2021, 5:07pm

Just a quick update… I’ve been successfully running with this the last 24 hours with zero pseudo resets as I did earlier. This is across 8 devices publishing every 5 minutes. I tested this both with devices sleeping/waking as well as devices that stay awake.

The pseudo reset conditions and duplicate publishing of events is what started this whole discussion and then going down the rabbit hole of not using threading. At this time, I no longer am seeing that behavior thankfully. Just not 100% sure why. I may need to go back to gitHub and look at the specific change of what I have now vs what I was running several days ago. Only thing that jumps out at me is changing log level from LOG_LEVEL_INFO --> LOG_LEVEL_ALL but I’d be surprised if that was it.

SerialLogHandler logHandler(LOG_LEVEL_ALL);
SYSTEM_THREAD(ENABLED);
SYSTEM_MODE(MANUAL);
…
publishQueue.publish(“Stat”, jw.getBuffer(), 60, PRIVATE, WITH_ACK);

Since I am no longer seeing the issue, I think I’m moving on to the next area of focus, leveraging different sleep modes for different scenarios. If/when this behavior comes back I’ll let you know. Thanks everyone for the guidance!

system · March 7, 2021, 5:07pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Particle Boron - Occasionally goes into Flashing Red SOS + 10 blinks - Assertion failure - How to troubleshoot? Device OS	12	1154	December 19, 2020
PublishQueueAsyncRK Hanging Libraries	4	383	June 3, 2022
Boron 2g/3g doesn't reconnect to cloud after a disconnect event Troubleshooting boron , cellular	32	3671	September 25, 2020
Electron flashing cyan after particle.publish Troubleshooting	20	1967	February 20, 2018
Code Not Working after 15 Days in the Field General	49	6173	September 28, 2018

Boron - Seemingly random device reset - How to best capture the reason

Related topics