This started several days ago, but it has been especially bad yesterday and today:
Devices have started dropping offline for hours at a time.
I have a watchdog timer set up so that each device reboots after 20 minutes of failing to reach our servers.
Often, they can reconnect, but not always.
The only solution seems to be to send a tech out to power down the units, which are spread out over a whole county.
A complete power cycle seems to fix the problem for a while, but they may be back in the same state a few hours later.
The problems are spread out over many cell towers. AT&T is not reporting any outages, but my wife’s AT&T cell phone acted similarly yesterday.
These are all Boron devices, 94 in total, with no recent software updates.
Looking at the spark/device/diagnostics/update events in the dashboard, the devices show an unusually high disconnect rate.
Update: Years ago, I built in the ability to automatically switch to a failover server after three failed attempts to communicate with the primary. This feature quietly took care of many network issues. The failover server was removed from service on August 1st, so the devices lost that fallback. I have temporarily changed my code to simply wait an additional 200 ms after each failed attempt beyond the third, and this seems to have solved my problem. I'll follow up if I learn more.
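The temporary workaround could look roughly like the sketch below. This is a minimal illustration, not the actual firmware. The names `sendWithRetry`, `extraDelayAfterFailure`, `kFreeAttempts`, and `kExtraDelayMs` are hypothetical. It also assumes the extra wait is a fixed 200 ms after every failure past the third, not a growing delay. The send and wait operations are passed in as callbacks so the logic can be tried off-device.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical constants: the post only states "three attempts" and "200ms".
constexpr int kFreeAttempts = 3;
constexpr uint32_t kExtraDelayMs = 200;

// Extra wait (in ms) to apply after the Nth consecutive failed attempt (1-based).
// Assumption: a fixed delay, not one that grows with each failure.
uint32_t extraDelayAfterFailure(int failedAttempts) {
    return failedAttempts > kFreeAttempts ? kExtraDelayMs : 0;
}

// Calls `send` up to `maxAttempts` times.
// Returns the 1-based number of the attempt that succeeded, or 0 if all failed.
int sendWithRetry(const std::function<bool()>& send,
                  const std::function<void(uint32_t)>& waitMs,
                  int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (send()) {
            return attempt;
        }
        uint32_t extra = extraDelayAfterFailure(attempt);
        // No point in waiting after the final attempt.
        if (extra > 0 && attempt < maxAttempts) {
            waitMs(extra);
        }
    }
    return 0;
}
```

On a device, `waitMs` could simply wrap Device OS's `delay()`, and `send` would be whatever call talks to the primary server. If every attempt fails, the function returns 0 and the existing 20-minute watchdog remains the last line of defense.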