All,
I have been working with support for over a month to figure out why, at infrequent intervals, dozens of my devices time out while attempting to connect to the Particle cloud. This is not the everyday case where a device here and there can’t connect; this is a “mass failed connection event” where many geographically separated devices experience the issue simultaneously. Since this happens infrequently and with remote devices, I have not been able to capture logs to help the support team, though this effort is ongoing.
I am posting this data / issue for two reasons:
- To see if anyone else is seeing issues like this
- To solicit any suggestions on how I might address the issue
Here is what I know:
- In my code, I keep track of how many seconds each connection attempt takes. Based on the example in AN023, I set an 11 minute timeout for a connection attempt.
- On any given day, some devices will reach this timeout. The timeout is reported on the next successful connection, which in turn triggers an alert from Ubidots to a custom Slack channel I monitor. With this, I can see all missed connection events and review their history.
- Occasionally, 3-4 times a month, I will see a “mass missed connection event” where dozens of devices will all miss connections at the same time. These devices are spread across North Carolina and California. These events typically last two hours and then things go back to normal.
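The timeout bookkeeping described above can be isolated from the firmware as a small function. This is only a minimal sketch, not the code running on the devices: the names `kConnectMaxTimeSec`, `elapsedSeconds`, `checkConnection` and `ConnectOutcome` are hypothetical, and the only value taken from this post is the 11 minute limit. It assumes a 32-bit millisecond counter like the one `millis()` returns.

```cpp
#include <cstdint>

// Hypothetical constant: the 11 minute connection limit, in seconds.
const std::uint32_t kConnectMaxTimeSec = 11 * 60;

enum class ConnectOutcome { Waiting, Connected, TimedOut };

// Whole seconds between two millisecond readings.
// Unsigned subtraction stays correct if the 32-bit counter rolls over
// between the two readings.
inline std::uint32_t elapsedSeconds(std::uint32_t startMs, std::uint32_t nowMs) {
    return (nowMs - startMs) / 1000u;
}

// Decide what a single pass through a connecting state should do.
// cloudConnected stands in for the result of Particle.connected().
inline ConnectOutcome checkConnection(bool cloudConnected,
                                      std::uint32_t startMs,
                                      std::uint32_t nowMs) {
    if (cloudConnected) return ConnectOutcome::Connected;
    if (elapsedSeconds(startMs, nowMs) > kConnectMaxTimeSec) return ConnectOutcome::TimedOut;
    return ConnectOutcome::Waiting;
}
```

Keeping the decision in a function with no hardware dependencies means the timeout boundary can be checked on a desktop compiler, without waiting for a real device to fail to connect.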
No data is lost since I use PublishQueuePOSIX to queue webhooks, so why is this a big deal? First, these events seem to be occurring slightly more frequently (see below) and second, attempting to connect for 11 minutes consumes significant battery on these solar-powered devices.
Here is what these events look like in my Ubidots console:
This is for a single device but, as you can see in the Slack logs from yesterday’s event, many devices are “participating” in these events.
My code is open source, so please feel free to have a look here. I believe that, if my code is at fault, the problem would be in the “connecting state”, which looks like this:
case CONNECTING_STATE:{                                              // Will connect - or not and head back to the Idle state
  static State retainedOldState;                                     // Keep track for where to go next (depends on whether we were called from Reporting)
  static unsigned long connectionStartTimeStamp;                     // Time in millis that helps us know how long it took to connect

  if (state != oldState) {                                           // Non-blocking function - these are first time items
    retainedOldState = oldState;                                     // Keep track for where to go next
    sysStatus.lastConnectionDuration = 0;                            // Will exit with 0 if we do not connect or are connected or the connection time if we do
    publishStateTransition();

    // Let's make sure we need to connect
    if (sysStatus.connectedStatus && Particle.connected()) {
      Log.info("Connecting state but already connected");
      stayAwake = stayAwakeLong;                                     // Keeps device awake after reboot - helps with recovery
      stayAwakeTimeStamp = millis();
      state = (retainedOldState == REPORTING_STATE) ? RESP_WAIT_STATE : IDLE_STATE;
      break;
    }

    // If we are in a low battery state - we are not going to connect unless we are over-riding with user switch (active low)
    if (sysStatus.lowBatteryMode && digitalRead(userSwitch)) {
      Log.info("Connecting state but low battery mode");
      state = IDLE_STATE;
      break;
    }

    // If we are in low power mode, we may bail if battery is too low and we need to reduce reporting frequency
    if (sysStatus.lowPowerMode && digitalRead(userSwitch)) {         // Low power mode and user switch not pressed
      if (sysStatus.stateOfCharge <= 50 && (Time.hour() % 4)) {      // If the battery level is <= 50%, only connect every fourth hour
        Log.info("Connecting but <50%% charge - four hour schedule");
        state = IDLE_STATE;                                          // Skip this attempt and go back to the Idle state
        break;
      }
      else if (sysStatus.stateOfCharge <= 65 && (Time.hour() % 2)) { // If the battery level is 50% - 65%, only connect every other hour
        Log.info("Connecting but 50-65%% charge - two hour schedule");
        state = IDLE_STATE;                                          // Skip this attempt and go back to the Idle state
        break;
      }
    }

    // OK, let's do this thing!
    connectionStartTimeStamp = millis();                             // Have to use millis as the clock will get reset on connect
    Cellular.on();                                                   // Needed until they fix this: https://github.com/particle-iot/device-os/issues/1631
    Particle.connect();                                              // Told the Particle to connect, now we need to wait
  }

  sysStatus.lastConnectionDuration = int((millis() - connectionStartTimeStamp)/1000);

  if (Particle.connected()) {
    sysStatus.connectedStatus = true;
    sysStatus.lastConnection = Time.now();                           // Time of the last successful connection
    stayAwake = stayAwakeLong;                                       // Keeps device awake after reboot - helps with recovery
    stayAwakeTimeStamp = millis();
    recordConnectionDetails();                                       // Record outcome of connection attempt
    Log.info("Cloud connection successful");
    attachInterrupt(userSwitch, userSwitchISR, FALLING);             // Attach interrupt for the user switch to enable verbose counts
    if (retainedOldState == REPORTING_STATE) state = RESP_WAIT_STATE;
    else state = IDLE_STATE;
  }
  else if (sysStatus.lastConnectionDuration > connectMaxTimeSec) {
    current.alerts = 2;                                              // Connection timed out alert
    sysStatus.connectedStatus = false;
    recordConnectionDetails();                                       // Record outcome of connection attempt
    Log.info("cloud connection unsuccessful");
    disconnectFromParticle();                                        // Make sure the modem is turned off
    if (sysStatus.solarPowerMode) setLowPowerMode("1");              // If we cannot connect, there is no point to staying out of low power mode
    if ((Time.now() - sysStatus.lastConnection) > 3 * 3600L) {       // Only sends to ERROR_STATE if it has been over three hours - this ties to reporting and low battery state
      state = ERROR_STATE;
      resetTimeStamp = millis();
      break;
    }
    else state = IDLE_STATE;
  }
} break;
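For anyone reviewing the battery gating in that state, the schedule reduces to a small pure function. Again, this is a sketch under assumptions rather than the firmware itself: `skipConnectThisHour` is a hypothetical name, the state of charge is assumed to be an integer percentage, and the function only covers the case where low power mode is active and the user switch is not pressed.

```cpp
// Returns true when the connection attempt should be skipped for this hour.
// stateOfCharge: battery level in percent (assumed integer).
// hour: hour of day, 0-23, as Time.hour() would report it.
inline bool skipConnectThisHour(int stateOfCharge, int hour) {
    if (stateOfCharge <= 50) return (hour % 4) != 0;  // connect only at hours 0, 4, 8, ...
    if (stateOfCharge <= 65) return (hour % 2) != 0;  // connect only on even hours
    return false;                                     // above 65%: connect every hour
}
```

With this rule, a device at 40% charge attempts to connect at 08:00 but skips 09:00 through 11:00, so a low battery stretches the gap between attempts without any change to the 11 minute timeout.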
However, it is unclear to me how an error in my code could cause intermittent, coordinated events across devices that are hundreds or thousands of miles apart.
I will continue to work with support but any advice, suggestions or similar experiences would be helpful.
Thank you,
Chip