Had an unusual one this morning: a device stopped reporting data to the cloud (normally it reports data every 5 minutes via Particle.publish()). I checked it from the console, which indicated the device was still online and connected. I could refresh vitals and ping it, but it wouldn’t execute any Particle function.
The only course of action I could think of was to flash firmware to it to “reset” it. I did that and now it’s working properly again: connected, reporting data every 5 minutes, and responding to function calls.
Any idea what would cause something like this? Is it in Safe Mode or something where it’s not executing the application firmware? That is what it behaved like, but I have no idea how it would have entered safe mode on its own. Any guidance or thoughts would be appreciated!
Thanks for the ideas, chip. In this case we are not using a 3rd party SIM. I’ll check over the application firmware and see what scenarios there are. I’ve always used soft delays that include calls to Particle.process(), and I also have a watchdog. I think I pet the watchdog every time I call Particle.process() as well, but I will double check.
Does this kick the device into safe mode or just cause the application firmware to hang? I was hoping my watchdog would have been the catch-all for scenarios like this, but there must be something…
For a while, I had an ISR that would pet the watchdog, but it did not catch issues like this because the interrupt would still fire and the ISR would pet the watchdog even when the main loop was stuck. Now I pet the watchdog in the main loop to avoid issues like this. When this used to happen to me, I would see the status LED happily breathing cyan, but the device was unresponsive to Particle variable or function calls. Not sure if any of this is relevant to your issue, but this is how it manifested for me.
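To illustrate why the ISR approach hides a hang, here is a minimal host-side sketch, not device firmware. WatchdogModel and hangIsCaught are hypothetical names invented for this example, and the timing values are arbitrary assumptions; a real hardware watchdog such as the AB1805 is only being modeled loosely here.

```cpp
#include <cstdint>

// Hypothetical software model of a hardware watchdog, for illustration only.
class WatchdogModel {
public:
    explicit WatchdogModel(uint32_t timeoutMs) : timeoutMs_(timeoutMs) {}

    // Record the time of the most recent pet.
    void pet(uint32_t nowMs) { lastPetMs_ = nowMs; }

    // True once the timeout has elapsed since the last pet.
    bool wouldReset(uint32_t nowMs) const {
        return static_cast<uint32_t>(nowMs - lastPetMs_) >= timeoutMs_;
    }

private:
    uint32_t timeoutMs_;
    uint32_t lastPetMs_ = 0;
};

// Simulates a main loop that hangs at hangAtMs while a periodic interrupt
// keeps firing every tickMs (tickMs must be greater than zero).
//   petFromIsr == true  : the ISR pets the watchdog, whether or not the loop runs
//   petFromIsr == false : only the main loop pets the watchdog
// Returns true if the watchdog would have reset the device by endMs.
bool hangIsCaught(bool petFromIsr, uint32_t hangAtMs, uint32_t endMs,
                  uint32_t tickMs, uint32_t timeoutMs) {
    WatchdogModel wdt(timeoutMs);
    for (uint32_t t = 0; t <= endMs; t += tickMs) {
        const bool loopAlive = t < hangAtMs;
        if (petFromIsr || loopAlive) {
            wdt.pet(t);
        }
        if (wdt.wouldReset(t)) {
            return true;
        }
    }
    return false;
}
```

With a loop that hangs after 1 second and a 2 second timeout, the ISR-petted case is never caught, while the loop-petted case is caught once the timeout elapses after the last pet.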
I just looked at my user application further, and the only place I call ab1805.loop() to pet the watchdog is in loop(). The only time I turn the watchdog off, via
//Stop the Watchdog
ab1805.stopWDT();
is immediately before sleeping, and then I call
//Resume the Watchdog once awake
ab1805.resumeWDT();
immediately upon waking back up. In this case, the device is always powered on, so we never even enter sleep mode. Given this, I would have thought the WDT would have handled this scenario if the application code was not actually executing, and the device should have self-healed. It seems more like the application code was running and the device was still connected to the cloud, but Particle functions were not working properly. I’m glad flashing new firmware to it (which basically does a reset) restored it, but I’m still curious about what happened. I’ll keep digging, and if I find anything I’ll post it here as well. If you have any other thoughts or considerations, let me know.
So, every once in a while, the Particle can be connected but the session will fail. Eventually, the session can be lost and it takes a while for the Boron to realize this and take steps to re-establish the connection.
One way to spot this quickly is to create a boolean that gets set to true when you first connect. Then, in your main loop, you can have something like this:
if (particleConnectedFlag && !Particle.connected()) {
    Particle.disconnect();
    particleConnectedFlag = false;
}
Then you can have a catcher in your state machine to reconnect.
You could also add some logging / reporting to this to see how often it happens. My devices are typically connected for only 90 seconds but for devices that stay on-line, I have seen this catch some issues.
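As a rough sketch of the logging idea, the flag check can be wrapped in a small helper that also counts drops. SessionMonitor and its methods are hypothetical names made up for this example; it is written as plain C++ so it can be tried off-device, on the assumption that firmware would pass in Particle.connected() and millis().

```cpp
#include <cstdint>

// Hypothetical helper: tracks cloud-connection drops so they can be
// counted and reported. Not part of any Particle library.
class SessionMonitor {
public:
    // Call once per loop() pass with the current connection state.
    // Returns true only on the pass where a drop is first noticed.
    bool update(bool connectedNow, uint32_t nowMs) {
        if (connectedNow) {
            wasConnected_ = true;   // same role as particleConnectedFlag
            return false;
        }
        if (wasConnected_) {
            wasConnected_ = false;  // clear so the drop is reported once
            ++dropCount_;
            lastDropMs_ = nowMs;
            return true;
        }
        return false;
    }

    uint32_t dropCount() const { return dropCount_; }
    uint32_t lastDropMs() const { return lastDropMs_; }

private:
    bool wasConnected_ = false;
    uint32_t dropCount_ = 0;
    uint32_t lastDropMs_ = 0;
};
```

In loop(), a true return from update() would be the point to call Particle.disconnect() and hand off to the reconnect state; dropCount() could be included in the next published report.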
Chip’s suggestions are good. The thing about functions is that even in system threaded mode, they are dispatched only from the main application thread:
Between calls to loop()
During calls to Particle.process()
During calls to delay()
The reason for this is that calling functions from a different thread (system or otherwise) could cause thread safety issues, and since the user provided the code for the function handler, it could end up blocking the system thread, making it ineffective for its purpose.
However, console ping and device vitals return their response entirely from the system thread.
I’d guess that something is blocking the main loop thread. Since console ping works, the session is fine.
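Since function handlers only run at those three points, any long wait in application code needs to keep giving the system a chance to dispatch them. Below is a minimal sketch of a "soft delay" in plain C++; softDelay is a hypothetical helper, and the two callbacks are assumptions standing in for millis() and a call to Particle.process() (plus a watchdog pet, if desired) on a real device.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical helper: wait for durationMs without starving cloud handlers.
//   now     : returns a millisecond tick (millis() on a device)
//   service : called on every pass (e.g. a lambda that calls Particle.process())
// Returns the number of times service() was called.
uint32_t softDelay(uint32_t durationMs,
                   const std::function<uint32_t()>& now,
                   const std::function<void()>& service) {
    const uint32_t start = now();
    uint32_t passes = 0;
    // Unsigned subtraction stays correct when the tick counter wraps around.
    while (static_cast<uint32_t>(now() - start) < durationMs) {
        service();
        ++passes;
    }
    return passes;
}
```

A wait that spins without calling service() would leave the device looking exactly like the symptom described: ping and vitals still answer from the system thread, but function calls time out.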
@chipmc, @rickkas7 - Thanks for the ideas, insights, and things to look at. They help me understand what could be causing this behavior. So far, it was a one-off event that I was able to recover from by flashing firmware OTA. So unless it happens again, I’m hoping it was a one-time scenario. I was just a little uneasy not knowing what it was.
What is most bothersome is that I would have expected the watchdog to catch this scenario and reset the device so it could self-heal. I confirmed the only time I pet the watchdog is in loop(). So if it truly was blocking indefinitely, I would have thought the watchdog would have eventually reset the device after not being pet. On the PCB, I have the AB1805 implemented per AN023.
In any case, just hoping it doesn’t happen again or across multiple devices. So far so good and this was a one time occurrence. If it happens again and I learn something new I’ll post again here.