Borons gone offline

Hi I’m hoping someone on here can advise.

We have two Borons in the field running apps on firmware 1.5.2. Both normally operate while connected to the cloud. On 01 Aug both Borons went offline; their system uptimes as per the last vitals were 526420 and 596673.

This morning 03 Aug, one of the devices connected long enough to send one set of vitals, uptime of 616708.

Both devices are currently not connected to the cloud, and at the moment I have no way of remotely seeing what state the Borons are in. Does anyone have any advice on how I can diagnose this further? I can post vitals from the devices leading up to the disconnects. Is there any guidance on how to interrogate vitals data and what warning signs to look out for? It’s strange that they both went offline on the same day, and stranger again that one managed to send one set of vitals before disconnecting early this morning. Geographically, the devices are separated by a few hundred kilometres, but both are running the same app.

Edit: I only mention uptimes in case both apps are somehow leaking memory and the critical point is somewhere near the uptime marker of 600000, where they both went offline. Both apps were consistently using ~80% memory as per vitals prior to disconnect.
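As a quick check on the uptime idea, the reported values can be broken down into days, hours and minutes. This is a minimal C++ sketch, assuming the vitals uptime is reported in seconds (the thread does not state the unit); `splitUptime` and `Uptime` are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Uptime broken down into whole days, hours and minutes.
struct Uptime {
    uint32_t days;
    uint32_t hours;
    uint32_t minutes;
};

// Split an uptime given in seconds (assumed unit) into days/hours/minutes.
Uptime splitUptime(uint32_t seconds) {
    Uptime u;
    u.days = seconds / 86400;             // 86400 seconds per day
    u.hours = (seconds % 86400) / 3600;   // leftover whole hours
    u.minutes = (seconds % 3600) / 60;    // leftover whole minutes
    return u;
}
```

Under that assumption, 526420 works out to 6 d 2 h 13 min and 596673 to 6 d 21 h 44 min, so the two devices dropped off roughly 19.5 hours of uptime apart rather than at one shared uptime.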

What is your keepAlive() setting?

// SYNTAX
Particle.keepAlive(23 * 60); // send a ping every 23 minutes

At first glance, it may be that the cloud did not acknowledge two default (23 minute) keepAlive() pings. So, the cloud may have timed your devices out and assigned them offline status.

https://docs.particle.io/reference/device-os/firmware/boron/#particle-keepalive-

If the above is what is happening to you, you may want to set your keepAlive() manually with a more frequent ping. Or (what I do) occasionally send a Particle.publish() with the NO_ACK flag. This last method reduces the data overhead a bit.
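The "occasional publish" idea needs some way to decide when a ping is due. Below is a minimal sketch of a rollover-safe timer that could drive it; `PeriodicTimer` is a hypothetical helper, and the 10 minute period and the "heartbeat" event name in the comments are arbitrary assumptions, not values from this thread.

```cpp
#include <cstdint>

// Rollover-safe periodic timer; on the device, feed it millis().
class PeriodicTimer {
public:
    explicit PeriodicTimer(uint32_t periodMs) : periodMs_(periodMs), lastMs_(0) {}

    // Returns true at most once per period. Unsigned subtraction keeps
    // giving the right elapsed time when the millisecond counter wraps.
    bool due(uint32_t nowMs) {
        if (nowMs - lastMs_ < periodMs_) {
            return false;
        }
        lastMs_ = nowMs;
        return true;
    }

private:
    uint32_t periodMs_;
    uint32_t lastMs_;
};

// On the device this might be used from loop(), roughly:
//   static PeriodicTimer heartbeat(10UL * 60UL * 1000UL);     // assumed period
//   if (heartbeat.due(millis()) && Particle.connected()) {
//       Particle.publish("heartbeat", "", PRIVATE | NO_ACK);  // hypothetical event name
//   }
```

Keeping the timing logic separate from the publish call means it can be checked on a desktop compiler before flashing.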

Hopefully, you might monitor the console and find your Borons go back online (for X reason). Then you can access them OTA without a physical visit to reset.

Took me a while to find this:

Thanks for the reply,

I don’t set the Particle.keepAlive, so I assume it is set to the default 23 mins?

At first glance, it may be that the cloud did not acknowledge two default (23 minute) keepAlive() pings. So, the cloud may have timed your devices out and assigned them offline status.

So if my devices have timed out, does the device attempt to reconnect at certain intervals? If so, could that explain why one of the devices briefly came back online? Could it be getting stuck somewhere, and if so, would a simple application watchdog that calls System.reset() guard against this (by guard I mean reset the device)?

I am experiencing similar issues on 1.5.2 and the new 2.0.0 beta. Whereas, 1.3.1-rc1 seems to work perfectly. I won’t go into detail right here, but the totality of my experiences suggests that Particle has worsened the Boron connection stability through software updates. It has cost me a lot of time and money.

Just this morning, I power cycled a Boron with 2.0.0 beta on it which had totally lost connection for no reason at 1am and still not reconnected 10 hours later.
One power cycle and, what do you know, 30 seconds later and instant internet/cloud connection. This proves it is a software issue. Highly frustrating and forcing me to consider alternatives to Particle for cellular connectivity.

I am downgrading my Borons to 1.3.1-rc1 to further test the theory. On a positive note, the two remote stations that have been running and reconnecting perfectly for multiple months, without me having to drive hours to the site only to scratch my head at Particle’s unstable LTE connection code, are both on 1.3.1-rc1.

Yes. I assume so as well after reading the docs and forum! :wink: So I recommend you shorten the time between keepAlive() pings or send a periodic ping. I do both.

It varies. Sometimes I have seen the devices go online within an hour to 3. Other times - never.

Yes. As Paul_M was referring to - things seem to have changed. I use a watchdog to look at diagnostics and force a System.reset() before my device goes offline. You might already be using the library:

Some example statements I use from the DiagnosticsHelperRK library are:

g_cloudConnectionStatus = DiagnosticsHelper::getValue(DIAG_ID_CLOUD_CONNECTION_STATUS);
// connected = 2 connecting = ???

and

g_cloudConnectionErrorCode = DiagnosticsHelper::getValue(DIAG_ID_CLOUD_CONNECTION_ERROR_CODE);
// error code = 0 (no error) 10(???) 17(???)

One last option you might consider: How to force a handshake for OTA updates

Last, is an example I use while connected to cloud to force a new session (force a handshake):

Particle.publish("spark/device/session/end", "", PRIVATE);
// will FORCE NEW SESSION AFTER DISCONNECT!!!
softDelay(1000);

I am facing the same problem. Two of our LTE Boron devices in the field, running firmware version 1.5.1, lost connection to the cloud on August 1 and August 2.

Both Borons normally operate while connected to the cloud. I don’t know how to recover them now. The devices were publishing at an interval of 2 minutes. Nothing strange in the vitals.

App memory usage was ~74% as per vitals prior to disconnect. Is there a way to recover them? Neither device went into safe mode due to an app crash. What else could cause this problem?

If a device is presently publishing every 2 minutes and you can view this data in the console, when you ping the device is there a response of connected? Just curious. Sounds like it is connected, at least.

Try forcing a new session.

I am a Mac user. You can try to force a new session by using Terminal.

At console, for device in question:

Click on the “View events from a terminal” icon. Using a scratchpad app, copy and paste the two values into the following curl statement (the URL is quoted so the shell does not try to interpret the ? character):

curl -X PUT "https://api.particle.io/v1/devices/device_id/disconnect?access_token=latest_access_token"

device_id & latest_access_token (make sure there are no extra characters present after your paste)

Launch Terminal.

Copy entire modified curl statement and paste into Terminal command line. Hit return.

Hopefully, a new handshake will occur (Force New Session).

Note: access_token changes periodically so don’t try to use same token for more than 15 minutes

Thanks robc. The two devices have been offline since August 1st 9:00 pm and August 2nd 7:45 am, and they are not connected to the Particle cloud.

Will a new session help to recover units which are offline?

Well, I don’t think forcing a new session will do any good if they are truly offline.

I am sorry. I misread your post. I don’t have any other suggestions.

My brain said “are” instead of “were”.

Hey rob,

Thanks for your replies. Can I ask what you use those statements from the DiagnosticsHelperRK lib for? Do you then post those variables using a publish statement?

Also

I use a watchdog to look at diagnostics and force a System.reset() before my device goes offline.

What are you watching with your watchdog and what thresholds are you using to decide to reset? Just the above two? Could you show me how you structured your watchdog, no worries if you can’t.

EDIT: One other thing I have noticed while looking into the diagnostics data is that the device.cloud.coap.transmit value had been steady at around 30, but leading up to the disconnection on 01 Aug it increased to ~3500.

Sure. I posted something similar this morning so will link it here:

I have posted the variables, in the past, using Particle.publish() so I could see events evolve in the Events console, for example. Now, I just use the two above statements in a small routine called every 3 minutes from loop(). If abnormal conditions persist, as in the above post link, for 3 minutes I typically call System.reset(). At present, my devices reset every 2 or 3 days on average.
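The "reset only if the abnormal condition persists" rule can be sketched as a small counter of consecutive bad checks. This is a hedged illustration, not the code used in the posts above: `ConnectionWatchdog` is a hypothetical class, and it treats a check as unhealthy if either the status is not 2 (connected) or the error code is not 0, which is one possible choice.

```cpp
#include <cstdint>

// Decides when to reset, based on how many checks in a row looked unhealthy.
class ConnectionWatchdog {
public:
    explicit ConnectionWatchdog(uint8_t badChecksBeforeReset)
        : limit_(badChecksBeforeReset), badChecks_(0) {}

    // status: 2 means connected; errorCode: 0 means no error (values from the thread).
    // Returns true when the caller should reset the device.
    bool update(int status, int errorCode) {
        const bool healthy = (status == 2) && (errorCode == 0);
        if (healthy) {
            badChecks_ = 0;   // any healthy check clears the streak
            return false;
        }
        if (badChecks_ < limit_) {
            ++badChecks_;
        }
        return badChecks_ >= limit_;
    }

private:
    uint8_t limit_;
    uint8_t badChecks_;
};

// On the device, called every few minutes from loop(), roughly:
//   static ConnectionWatchdog wd(2);
//   if (wd.update(DiagnosticsHelper::getValue(DIAG_ID_CLOUD_CONNECTION_STATUS),
//                 DiagnosticsHelper::getValue(DIAG_ID_CLOUD_CONNECTION_ERROR_CODE))) {
//       System.reset();
//   }
```

With a limit of 2 and a 3 minute check interval, a single bad reading is tolerated and a reset only happens when the next check is still bad.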

Interesting note you made about the device.cloud.coap.transmit value. I will take a closer look at that, too. I might write a set of “conditions” to EEPROM so I can see what triggered a device’s last restart.
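One way the EEPROM idea could look is a small record written just before the reset and read back at startup. This is only a sketch under assumptions: the `ResetRecord` layout, the marker value and the EEPROM address 0 in the comments are all made up for illustration.

```cpp
#include <cstdint>

// Snapshot of what the watchdog saw just before it asked for a reset.
struct ResetRecord {
    uint32_t magic;      // marks the record as written by this sketch
    int32_t  status;     // cloud connection status at reset time
    int32_t  errorCode;  // cloud connection error code at reset time
    uint32_t uptimeSec;  // uptime in seconds when the reset was requested
};

const uint32_t kResetRecordMagic = 0x52535431u;  // arbitrary marker value

// Build a record carrying the marker plus the observed conditions.
ResetRecord makeResetRecord(int32_t status, int32_t errorCode, uint32_t uptimeSec) {
    ResetRecord r;
    r.magic = kResetRecordMagic;
    r.status = status;
    r.errorCode = errorCode;
    r.uptimeSec = uptimeSec;
    return r;
}

// Erased or never-written storage will not carry the marker.
bool isValidResetRecord(const ResetRecord& r) {
    return r.magic == kResetRecordMagic;
}

// On the device, roughly (address 0 is an assumption; use one your app does not):
//   EEPROM.put(0, makeResetRecord(status, errorCode, millis() / 1000));
//   System.reset();
// and at startup:
//   ResetRecord last;
//   EEPROM.get(0, last);
//   if (isValidResetRecord(last)) { /* publish or log the record */ }
```

The marker field is there so that a device that has never written a record does not report garbage as its last reset cause.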

Thanks for being so helpful robc, I will let you know how the above treats me. I will be visiting the devices on site, probably on Friday, so I will report back and let you know what state the machines seem to be in when I arrive.

Hi robc

I was thinking of using something like this and just calling it every x mins to check for an error. Do you think it might work well? Do you usually check for both conditions, ie. && vs || ?

Thanks again for your help.

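// Assumes DiagnosticsHelperRK is included. The globals below were not shown
// in the original post; their initial values are assumptions.
unsigned long lastChecked = 0;
const unsigned long checkPeriod = 3UL * 60UL * 1000UL;  // ms between checks (assumed 3 minutes)
int g_status = 2;      // last observed cloud connection status
int g_errorCode = 0;   // last observed cloud connection error code

void cloudWatchDog(){
    if( millis()-lastChecked > checkPeriod){
        // values of diag variables from last time checked
        int prev_g_status = g_status;
        int prev_g_errorCode = g_errorCode;

        g_status = DiagnosticsHelper::getValue(DIAG_ID_CLOUD_CONNECTION_STATUS);
        //  **connected** = 2       connecting = ??????
        g_errorCode = DiagnosticsHelper::getValue(DIAG_ID_CLOUD_CONNECTION_ERROR_CODE);
        //  error code = 0 (**no error**)   10(?????)   17(?????)

        // TRUE if cloud is not connected and there is an error code
        bool currentConnErrExists = (g_status!=2 && g_errorCode!=0);
        // TRUE if status and error code are unchanged since the previous check,
        // i.e. the same condition has persisted for a full check period
        bool errorExistedLastCheck = ((prev_g_status==g_status) && (prev_g_errorCode==g_errorCode));

        if (currentConnErrExists && errorExistedLastCheck){
            System.reset();
        } else {
            lastChecked=millis();
        }
    }
}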

EDIT: Fixed copy and paste errors as identified below

Glad you asked. I was thinking about what I posted the other day and checked my actual code. I am presently checking only DIAG_ID_CLOUD_CONNECTION_ERROR_CODE. I will probably change the code similar to what you are proposing when I get a chance. I am presently hunting down the GitHub codes for the mysterious DIAG_ID_CLOUD_CONNECTION_ERROR_CODE just for the heck of it. So, checking for

g_status!=2 && g_errorCode!=0

is a good idea.

Two evenings ago, there was reported system degradation which caused all my Argon devices to rapid flash green. Once the system recovered I noticed DIAG_ID_CLOUD_CONNECTION_STATUS was a value of “1” for connecting

// DISCONNECTED = 0 CONNECTING = 1 connected = 2 DISCONNECTING = 3

even though I was having no trouble interacting with my devices. I tried to force a new session (handshake) but no change. I issued a Particle.function() command for a device to System.reset() which cleared up codes to what we would expect to receive after restart: connected = 2 && error code = 0. Your method would have cleaned up this issue for me right away. So, thank you!

You probably have already found it but I think there is an error near the top of cloudWatchDog().

int prev_g_errorCode = g_status;

should be

int prev_g_errorCode = g_errorCode;

then, further down:

errorExistedLastCheck = ((prev_g_status==g_status) && (prev_g_errorCode==g_status));

should be

errorExistedLastCheck = ((prev_g_status==g_status) && (prev_g_errorCode==g_errorCode));

Haha, yeah, I caught one of these copy and paste errors but missed the other one. I think I need to get away from the screen for a while. Thanks for your help, well spotted. I’ll do some testing and let you know.