Electron Product connects to the Cloud but can't get queued firmware updates

Cliffnotes version: My Particle Product (Product created in the Console) can get into a state where it can connect and publish successfully, receive a subscribed event successfully, but won’t update “Last Connection Time” in the Console, or receive queued firmware updates (I’m running Particle Firmware 0.6.1-rc2). I currently have a device in this state and am hoping Particle can troubleshoot this while it is occurring.

I’m designing an industrial product using an Electron that needs to be left unattended and run for over a year with an occasional (min daily) cloud connection. The reliability of the cloud connection is critical so I’ve been testing the connection with a couple devices set to connect to the Particle Cloud once every 10 minutes, publish and check for firmware updates, then disconnect and go back to Deep Sleep. The connection and publish are reliable, but I noticed that the device sometimes won’t receive a queued firmware update, and I’ve been troubleshooting. I notice three symptoms when the device (or the cloud) is in this state: the firmware_update_pending event isn’t received (and the Log doesn’t show a spark/status auto-update event), the timestamp of “Last Connection” in the Particle Console isn’t updated, and the Log doesn’t show the spark/device/last_reset event for the device.

I currently have a device that has been in this state for over 17 hours, connecting and publishing successfully >100 times while in this state, but “Last Connection” hasn’t been updated, and the device hasn’t seen any firmware update requests though it is queued to receive an update. (I’m disabling firmware updates just for testing using System.disableUpdates() so I can look for the firmware update event to come in each time I connect, but not actually update the firmware).

Two devices, both have connected and published within the last 10 minutes (as of 2017-03-21 10:50AM EST) but one hasn’t updated “Last Connection” for over 17 hours:

Some details on how I connect, disconnect, sleep. Unfortunately I can’t share the code running on these devices as it was written for another company:

  • I use SYSTEM_MODE(MANUAL) as sometimes I need to wake without connecting, and need to control when I make a (potentially long) call to Particle.process()
  • I connect in stages: Cellular.on(), Cellular.connect(), Particle.connect()
  • I publish the data I need collected via Particle.publish(), then publish a second “sleep” event which I subscribed to back in setup(). I don’t start disconnecting until the “sleep” event (containing my device ID) makes a roundtrip from Electron to Cloud and back. I call Particle.Process() every half second while waiting for the response.
  • Next, to make sure I’ve waited long enough for “server sent confirmable messages to be acknowledged”, I make a call to Spark_Sleep() indirectly by calling System.sleep(D0,RISING,1,SLEEP_NETWORK_STANDBY) to sleep for 1 second with the modem connected. (I need this workaround as Spark_Sleep() isn’t available directly)
  • At this point I disconnect with Particle.disconnect(), wait a half second, then enter Deep Sleep

Given the efforts I’ve made to make sure there is a solid connection with the Particle Cloud - I’ve seen the publish I made, and the device has seen a publish go to the cloud and come back as a subscription event, and I made a call to Spark_Sleep() to make sure all confirmable messages have been sent - I believe the device has stayed connected long enough to get a firmware update event, let alone update “Last Connection” in the Particle Console. I think there’s something wrong here out of my control.

A manual reset, manual firmware update through USB, or manual entry to safe mode seem to fix this (at least sometimes, I haven’t seen this often enough to reproduce a reliable way of getting the device back in a working state), as well as just waiting. I can’t do any of those things besides wait if the device is installed in a unattended environment.

I’ll try to make time to isolate my code down to just the code involved in connecting and share it, but I wanted to post this issue now as I have a device actively showing the problem. This is relatively rare, I’ve had these devices set up for 12 days before one showed the problem.

@will @Charlie_K

Can you check wheather the Last Heard field gets updated correctly?
It’s a commonly known problem that on Electrons the Last Connected field is no reliable indicator due to the connectionless nature of UDP.

When you say

How long would that be?
Could you try to call Particle.process() more often than twice a second - e.g. every 10ms?

I think to remember cloud process (without SYSTEM_THREAD(ENABLED)) does only treat one queued action (e.g. variable request, function call, subscription callback, …) per visit to Particle.process(). So it’s not just the time that’s relevant but also the count of Particle.process() calls.

And just for curiosity, what exactly is the benefit of Spark_Sleep() over any of the other sleep options in your case?

last_heard = “2017-03-20T21:14:27.050Z”

also

last_handshake_at = “2017-03-20T21:14:27.050Z”

On average, 13 seconds. I’ve seen as few as 7 seconds, sometimes much longer than 13 seconds.

Good idea, I didn’t realize that Particle.process() only handled one queued action at a time per call. I should be adding more calls. I usually get the firmware update event immediately after the call to Spark_Sleep() and not before. The previous call to Particle.process() would have been used to handle receiving the message with the “sleep” event. I’ll add this and start a new test, but only after we’ve determined there’s no more debugging we can do with the device currently showing this problem.

Spark_Sleep() waits for all confirmable messages to be sent, and as of 0.6.1-rc2, waits for all confirmable messages to be received before sleeping. Only certain sleep modes include a call to Spark_Sleep(), in fact, just the pin change sleep mode includes the call (why? I have no idea). I’m using a 1 second pin change sleep as a hacky way to call Spark_Sleep() in lieu of a better way. I found out about Spark_Sleep() after looking into this PR to “Wait for server sent confirmable messages to be acknowledged before sleeping”, which seemed applicable for me not receiving firmware update events from the server. It did improve the success rate of of receiving firmware update requests before sleep after adding in that code, though, it could be from staying awake longer, or from more chances to process queued events inside the Spark_Sleep() call.

Looking into the DTLSProtocol::sleep() details now, the fix includes waiting for a single second to “give a chance for some requests to arrive”, and also says “FIXME” which doesn’t inspire confidence in the fix. What if the firmware update event takes longer than a second to arrive?

1 Like

Well, it’s back to normal as of 6:43 pm EST. I’ll start a new test with firmware that also calls Particle.process() more frequently and get it running as soon as I’m able.

I updated the firmware with frequent calls to Particle.process() (every 10ms while waiting for the “sleep” event) and added a few more groups of Particle.process() calls at each stage before I disconnect. So far I’m noticing the firmware update event is seen by the device earlier (before the Spark_Sleep() call), which was to be expected. I’ll keep an eye on this and post back with any significant results.

Well that didn’t take long, the same device has been unable to get firmware update events and hasn’t update “Last Connection” since 05:18 pm EST.

last_heard = “2017-03-22T21:18:28.619Z”

I have what I think is a stable system now thanks to help from @rickkas7. Over the past two weeks I’ve had two Electrons constantly connecting, checking for firmware updates, and sleeping for 10 minutes (same test as before), with some more calls to Particle.process() added. Across 4000 connection attempts, there were only 170 times when the Electron disconnected before getting the firmware update queued event (4.2% failure rate), and these seemed like isolated incidents, not multi-hour outages like before.

Some things that @rickkas7 pointed out that aren’t documented:

You need to call Particle.process() after Particle.disconnect() for the cloud to detect the Electron has disconnected. Otherwise, if you connect back up within some nearby timeframe - probably the keepalive timeout of 23 minutes - the cloud thinks it’s the same session and doesn’t go through sending events like it would with a new session. I added this code after Particle.disconnect():

    // wait for disconnect to complete
    while(!Particle.disconnected()) {
        Serial.println("...");
        for(i=0; i<100; i++) {
            Particle.process();
            delay(10);
        }
    }

BTW, Particle.disconnected() is a real function and seems to work though it’s not documented.

The Particle.process() documentation doesn’t cover all the cases where you need to call Particle.process(), so I just added in very frequent calls to Particle.process() starting from Cellular.connect() and ending with Particle.disconnected() returning true.

I noticed another improvement with the firmware in this test: I was getting occasional duplicate webhook events from a single publish, and these went away after calling Particle.process() more often throughout my code.

Hope this helps anyone else using SYSTEM_MODE_MANUAL and trying to optimize battery life by using deep sleep as much as possible.

3 Likes

This is a really good thing to know. Thank you.