I have a fleet of Boron 404Xs at a remote site with a lower-than-optimal cellular signal range of [33,45]%, which I understand can cause disconnects and difficulties with cloud communication. One device in the fleet has never posted an event, while two have recently stopped posting. All three can still be pinged and their vitals can be refreshed as needed. The LED is also breathing cyan.
Similarly, I have two 404Xs at a new site with a slightly better signal range of [45,50]%, but with abysmal network quality that fluctuates across [17,70]%. As at the previous site, the devices can still be pinged and the LED breathes cyan, but they are unable to post events to the cloud or to GCP integrations.
As I have the devices set up to save to an SD card before a guard like this:

if (Particle.connected())
{
    // "sensorData" is a placeholder for my actual event name
    Particle.publish("sensorData");
}

data is still being logged and I simply lose real-time acquisition.
I am wondering if anybody has experienced similar issues and can recommend a possible solution or post their two cents, as cell data is still consumed by the reconnection attempts. Site relocation is not feasible, and neither is installing a booster. All devices are running Device OS 4.0.2, with the out-of-the-box factory antennas and the eSIM.
I personally have experienced this exact issue maybe 5-6 times. I can ping the device, and the customer tells me the status LED was breathing as if it's connected, but the device stops sending data and I'm not able to call a Particle function.
Oddly enough, it does “accept” an over-the-air (OTA) update. This naturally resets/reboots the device, and from that point forward it works just fine.
I personally think it’s something in my firmware or Device OS that causes the connectivity to hang. No idea what it could be, but so far I just do the OTA update to get out of it the few times it has occurred. I tend to believe it’s some edge scenario in Device OS, simply because I already have a hardware watchdog that should reset the device if the application firmware stops calling loop(). Secondly, that application firmware also does a deep power down using the AB1805 if it has been trying to connect for more than 11 minutes. That does not reset it either. So my hypothesis is some strange edge condition/bug in Device OS where it thinks it’s connected and user application code is still functioning properly, but no message (Particle.publish() or Particle.function()) gets through.
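For context, the watchdog/power-down arrangement I’m describing looks roughly like the following. This is only a minimal sketch assuming the AB1805_RK library; the 11-minute limit, the 30-second power-down, and the variable names are illustrative, not my exact firmware.

#include "AB1805_RK.h"

SYSTEM_THREAD(ENABLED);

AB1805 ab1805(Wire);

unsigned long connectStartMs = 0;
const unsigned long MAX_CONNECT_MS = 11UL * 60UL * 1000UL; // give up after ~11 minutes

void setup() {
    ab1805.setup();
    ab1805.setWDT(AB1805::WATCHDOG_MAX_SECONDS); // hardware watchdog, ~124 s
    connectStartMs = millis();
}

void loop() {
    ab1805.loop(); // pets the watchdog as long as loop() keeps getting called

    if (Particle.connected()) {
        connectStartMs = millis(); // connected, so restart the "stuck connecting" timer
    }
    else if (millis() - connectStartMs > MAX_CONNECT_MS) {
        ab1805.deepPowerDown(30); // cut power for ~30 s, then cold boot
    }
}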
Basically, pet the watchdog any time the device receives an ACK back from a Particle.publish(), or possibly use a Particle.function() to pet the watchdog from my back end, maybe every 2 hours. If the watchdog doesn’t get pet by the cloud, the device resets itself. A rough sketch of the idea is below.
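Something along these lines is what I have in mind. It is only a sketch: the event name, function name, publish interval, and 2-hour timeout are placeholders.

const unsigned long CLOUD_WDT_TIMEOUT_MS = 2UL * 60UL * 60UL * 1000UL; // reset if no cloud contact for 2 h
const unsigned long PUBLISH_INTERVAL_MS  = 10UL * 60UL * 1000UL;       // publish every 10 minutes

unsigned long lastCloudPetMs = 0;
unsigned long lastPublishMs  = 0;

int cloudPet(String arg) {                 // the back end can call this periodically
    lastCloudPetMs = millis();
    return 0;
}

void setup() {
    Particle.function("cloudPet", cloudPet);
    lastCloudPetMs = millis();
}

void loop() {
    if (Particle.connected() && millis() - lastPublishMs > PUBLISH_INTERVAL_MS) {
        lastPublishMs = millis();
        // WITH_ACK waits for the cloud to acknowledge the publish
        if (Particle.publish("heartbeat", "ok", WITH_ACK)) {
            lastCloudPetMs = millis();     // pet only on a confirmed ACK
        }
    }
    if (millis() - lastCloudPetMs > CLOUD_WDT_TIMEOUT_MS) {
        System.reset();                    // nothing got through; reboot the device
    }
}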
Would be great to know if an OTA update would work for you. Give it a try and let us know.
Thanks for the suggestion! Interestingly enough, an OTA update gets the affected devices to post to the cloud again, but with a low success rate. I have multiple devices that reverted to the non-posting state after a few hours. The LED is still breathing cyan, the devices can be pinged, and vitals can be sent over.
What I did find weird were the logs:
cloud.connection.status shows connected, with only 1 disconnect/connect attempt.
cloud.uptime seems to accumulate to values comparable to device.system.uptime, instead of the usual value of 2.
A cloud-side watchdog is definitely of interest to me, but similarly to what you posted, I am on the fence about upgrading my fleet to 5.x. I might have to consider installing a single dev 5.x device when I am on site to play around with this, as the missed cloud postings are not ideal in the long run.
@dzmn - Very interesting… and thanks for sharing! I guess I’m glad I’m not the only one. For me it’s been hit or miss as well whether it’ll accept the OTA update successfully. I’ve only faced this issue 3-4 times, and maybe 3 out of 4 times the OTA update worked to rescue the device remotely, while 1 out of 4 it did not.
I’m definitely going to implement the cloud-side watchdog; I just don’t think I can throw it together quickly enough, with enough confidence, to deploy in the next couple of weeks. I will likely develop and add this in the summer timeframe.
@Colleen - Is this a known issue of some sort? I.e., the device can be pinged and accepts OTA updates but is otherwise unresponsive to all other cloud functions? The behavior makes it seem like some edge-case bug in Device OS.
@dzmn totally unsure about this, but I just came across the following:
Should a device become unreachable from the cloud (such as a timed-out
function call or variable get), one possible cause of this is that
the keep-alives have not been sent often enough.
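If that were the case, the keep-alive interval can be shortened from the application firmware. A minimal sketch of the kind of test I mean (the 5-minute value is just something to experiment with, not a recommendation):

void setup() {
}

void loop() {
    // Default for the Particle SIM is 23 minutes; shorten it once the cloud
    // connection is up to see whether more frequent keep-alives help here.
    static bool keepAliveSet = false;
    if (Particle.connected() && !keepAliveSet) {
        keepAliveSet = true;
        Particle.keepAlive(5 * 60); // 5 minutes, as a test value only
    }
}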
The most common cause of not responding to function or variable requests, but responding to OTA or ping, is blocking the main application thread (loop thread). Since functions and variables are dispatched from this thread, between calls to loop(), they won’t be dispatched when the thread is blocked.
Since OTA and ping are handled from the system thread, those still work.
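As an illustration (slowSensorRead() below is a made-up stand-in for any long blocking operation, not anyone’s actual firmware), the commented-out while loop is the kind of pattern that reproduces the symptom, and the non-blocking check underneath is the usual fix:

// slowSensorRead() is a hypothetical placeholder for a slow sensor or SD operation
bool slowSensorRead() {
    return false;
}

void setup() {
}

void loop() {
    // Blocking pattern: function/variable requests are dispatched between loop()
    // iterations, so nothing is dispatched while this spins for minutes.
    // while (!slowSensorRead()) { }

    // Non-blocking alternative: check once and return, so loop() keeps cycling
    // and requests get dispatched between iterations.
    if (slowSensorRead()) {
        // handle the reading here
    }
}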
@rickkas7 This all makes sense. What I’m still baffled about is that I am using the application watchdog with the AB1805 per your application note; if I’m truly blocking the main loop() thread, then I would expect the device to reset itself after that watchdog’s 124-second timeout, but that doesn’t seem to happen. Either:
The issue is something else, loop() is being called (and thus the watchdog is being pet), but function or variable requests still get no response; or
I’m not properly re-enabling the watchdog upon waking from a sleep state.
I’ve been struggling to get the issue to repeat locally, so it’s been a challenging one to solve. The cloud-side watchdog seems like a reasonable catch-all for this edge case and other similar scenarios, as I just don’t know how to get to the root cause without being able to repeat it locally.
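On the second possibility above, the thing I need to double-check is the stop/resume around sleep. A minimal sketch of what I believe that should look like, assuming AB1805_RK’s stopWDT()/resumeWDT() calls (the 15-minute sleep duration is arbitrary):

#include "AB1805_RK.h"

SYSTEM_THREAD(ENABLED);

AB1805 ab1805(Wire);

void setup() {
    ab1805.setup();
    ab1805.setWDT(AB1805::WATCHDOG_MAX_SECONDS);
}

void loop() {
    ab1805.loop();

    // ... normal work here ...

    ab1805.stopWDT();                          // stop the watchdog before sleeping
    SystemSleepConfiguration config;
    config.mode(SystemSleepMode::ULTRA_LOW_POWER)
          .duration(15min);
    System.sleep(config);
    ab1805.resumeWDT();                        // re-arm it immediately after waking
}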
Just thought I’d give everybody an update if they are interested. It has become routine that every 1-2 weeks, one or two of the same three devices (out of a fleet of 10) stop posting to the cloud and I have to perform an OTA update of the same sketch, essentially soft-resetting them. Playing around with the keepalive duration did not seem to have an impact, so I reverted to the default 23 minutes for Particle SIMs.
Based on the diagnostic logs, I do have 2 situations that lead to the stops:
(1) I can see an aggregation of cloud connection attempts, which coincides with an increase in service.cloud.uptime(); I assume the poor signal quality at this site contributed to the stops in posting.
(2) A single cloud connection attempt per log, but service.cloud.uptime() accumulates to 2+ days, sometimes even as long as system.uptime() after a reset. It would also make sense that this is purely because the log intervals aren’t catching the true number of cloud connection attempts, in which case it may just be situation (1).
Could somebody clarify service.cloud.uptime() and its expected output? I do see a value of 2 during nominal operation. Would there be instances where it reflects the situations described above?
Much appreciated for the prompt replies! @ScruffR, scheduling a weekly system reset seems like the right way to go. I’ll let you know if this fixes the issue. Not sure why I did not think of this instead of manually sending an update.
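For anyone following along, the scheduled reset I’m planning is about as simple as the sketch below; the one-week interval is a placeholder, and in practice you would also check that no SD write or publish is in flight before resetting.

const uint64_t RESET_AFTER_MS = 7ULL * 24ULL * 60ULL * 60ULL * 1000ULL; // one week

void setup() {
}

void loop() {
    // System.millis() is 64-bit, so it does not wrap like millis() does at ~49 days
    if (System.millis() > RESET_AFTER_MS) {
        System.reset();
    }
}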
@gusgonnet I explored this with a mock-up unit that saves those logs to an SD card, and I see a constant “FreeHeapMem: 93680 and LargestBlockHeapMem: 92448”. I’ll report back if this changes.
Interestingly enough, running the same code on a similarly set up 4-layer PCB causes it to stop posting after a few days (same scenario: online, breathing cyan LED, data still logged). This time, FreeHeapMem seems to drop to a new value every 4+ hours. I’m taking this as a sign to rework the .ino file, or if needed, just set a shorter System.reset() interval.
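In case it helps anyone comparing notes, this is roughly how I’m sampling the heap on the mock-up unit. It is only a sketch: the SD logging is swapped for Serial here, the 5-minute interval is arbitrary, and I am assuming System.freeMemory() tracks the FreeHeapMem figure from the vitals.

unsigned long lastHeapLogMs = 0;

void setup() {
    Serial.begin(9600);
}

void loop() {
    if (millis() - lastHeapLogMs > 5UL * 60UL * 1000UL) {
        lastHeapLogMs = millis();
        // A steady downward trend here points at a leak in the application code
        Serial.printlnf("free heap: %lu bytes", (unsigned long)System.freeMemory());
    }
}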