Improvements to the online indicator for Gen 2 and Gen 3 hardware

Knowing whether your device is online is a big deal. That’s why we’re so excited to share that the Electron and all other Gen 2 and Gen 3 devices now accurately report their status in the Console, Web IDE, CLI, and Cloud API.

While this feature is not new, we made important improvements to increase the accuracy of the online indicator. One of the challenges we overcame, and the reason this is not a simple problem to solve, is that the low-level protocol used by these Particle devices is UDP. UDP is a connectionless, stateless protocol, unlike the more familiar TCP used by the Photon.

TCP gets its stateful nature partly by sending “extra” data over the wire, such as connection handshakes and acknowledgments. UDP does not need to send this extra data. This means UDP devices can have lower data costs on metered connections.

Reduced data consumption is an awesome feature of UDP, but being stateless makes it more difficult to know if a device is online or not. While we could lean on the networking stack to keep an eye on TCP devices, UDP requires us to track the state of devices on our own.

Our Solution

To solve this problem we changed how the online state of a device is determined. Instead of checking on the existence of a live connection in the network stack, we leverage a normal behavior of the Particle Device OS: keep-alive messages.

By default, keep-alive messages are sent at regular intervals by the Device OS to the cloud. These messages tell routers between the device and the Particle Device Cloud to keep the UDP connection open. Routers can only hold so many connections in memory at a time and will actively close connections if they think the connection is no longer required. For TCP connections, routers just watch the inherent state of the connection. Because UDP connections are stateless, routers have to depend on the existence of such keep-alive messages in order to know if the connection is still required.

Our online indicator uses a similar approach: if we receive data from a device, we mark it online. If no data has been received from that device within some period of time, we can reliably estimate that the device has gone offline.

The default timing of the Device OS keep-alive messages is well defined but varies somewhat between platforms. Given these values, we can accurately predict when a device will send its next keep-alive message.

An Electron, for example, sends a keep-alive message every 23 minutes by default.

Once we start receiving data from an Electron, we can expect at least one keep-alive message every 23 minutes. While other data may, of course, be received from the device during that time, as long as one message is received, we can reasonably assume the device is still online.

However, if we do not receive any data within that 23-minute window, we can assume the Electron has gone offline.
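
To make the idea concrete, here is a minimal sketch of a last-heard tracker in Python. It is not the Device Cloud's implementation; the `OnlineTracker` class and its method names are hypothetical, and the only value taken from this post is the Electron's 23-minute default keep-alive.

```python
from dataclasses import dataclass, field
from typing import Dict

# Electron default keep-alive from this post, in seconds.
KEEP_ALIVE_SECONDS = 23 * 60


@dataclass
class OnlineTracker:
    """Hypothetical tracker: a device is online if heard from recently."""

    keep_alive: float = KEEP_ALIVE_SECONDS
    last_heard: Dict[str, float] = field(default_factory=dict)

    def record_message(self, device_id: str, now: float) -> None:
        # Any data from the device (keep-alive or otherwise) counts.
        self.last_heard[device_id] = now

    def is_online(self, device_id: str, now: float) -> bool:
        heard = self.last_heard.get(device_id)
        if heard is None:
            return False  # never heard from this device
        # Online only while the keep-alive window has not elapsed.
        return (now - heard) <= self.keep_alive
```

With this sketch, a device last heard at time 0 counts as online at 22 minutes and offline at 24 minutes. A real system would also need to handle the timing complications described later in this post.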

Changes to Event Behavior

Robots and computers may not be great at seeing the breathing cyan online indicators on the Console, so we’ve also made changes to the behavior of the spark/status events.

These events are published after a device switches between the online and offline states. The timing of these events is also based on the default keep-alive. In some situations, these events may be published immediately after the device changes state, but on average, these events will be published after about 1.5x the default keep-alive timing for the device.

For Electrons, with their default keep-alive of 23 minutes, these events should be published between 23 and 46 minutes after the device changes state.
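
As a quick piece of arithmetic on those figures, the sketch below computes the expected publish window from a keep-alive interval. The function name `status_event_window` is hypothetical, and treating the window as 1x to 2x the keep-alive for every platform is an assumption generalized from the Electron example; only the 23 and 46 minute bounds and the 1.5x average come from this post.

```python
from typing import Tuple


def status_event_window(keep_alive_minutes: float) -> Tuple[float, float, float]:
    """Return (earliest, average, latest) expected delay in minutes.

    Assumption: the window spans 1x to 2x the keep-alive interval,
    which matches the Electron numbers (23 to 46 minutes, about 1.5x
    on average). Events may still arrive sooner in some situations.
    """
    earliest = keep_alive_minutes
    average = 1.5 * keep_alive_minutes
    latest = 2 * keep_alive_minutes
    return earliest, average, latest


# Electron default keep-alive of 23 minutes.
electron_window = status_event_window(23)
```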

Timing is Everything

Those are the basics for tracking the online state of UDP-based Particle devices!

However, the implementation details are a little more complex. Clock drift between devices and the Cloud, sleepy devices, devices that often disconnect and reconnect their Cloud connection, certain types of network issues, mesh device topologies, and user applications that call the Particle.keepAlive() API all have to be taken into account.

All of these situations impact the timing of a device’s keep-alive and make it more difficult to predict when a device is truly offline.

We have tackled some of these challenges so far, and we have further improvements in the pipeline to address the rest and improve the reliability of the online indicator.

As we tackle these challenges, the timing of the spark/status events will also improve. Our goal is to publish these events as soon as the device actually changes state. However, we must be careful not to lose the lower data usage advantages we gain from the stateless nature of UDP connections.

For now, the online indicators and the updated spark/status events should be reliable for most applications!

If you have questions, please reply in this thread, and we’ll do our best to provide clarity.

Out of curiosity, did this also change the meaning of “Last Handshake”? It now seems to be the last time the device sent data rather than the actual “Last Handshake”. It surprised me a bit when I noticed, as I misinterpreted it as a possible bug in my user code causing frequent handshakes.

As far as I am aware or could find, there is no way to push the device status out via a webhook.
Polling the API on a timer from our web GUI server for the status of every device does not scale efficiently to thousands of devices.
Considering these great new improvements make the status reporting accurate, are there any plans to add a status publish option to the device Console, which would then trigger webhooks?
So, for example, the device Console triggers an event for ONLINE/OFFLINE and the webhooks push a JSON status event out to an endpoint?

If not, may I please request or suggest it?

@JamesHagerman Hi James, any feedback on above? thanks

Thank you both for the questions!

@jdash These changes did not impact “Last Handshake”. However, there is a known issue where that field and “Last Heard” overlap. We have captured some information as to why this is happening, but need to dig a little deeper to fully understand the impact of fixing this.

For now, the “Last Handshake” on the Devices list in the Console (and /v1/devices through the API) should return the correct Last Handshake date time stamp for devices.

@Paul You should be able to set up integrations listening for events named spark/status. Will that work for what you need?

EDIT: I just tried this with an integration with its Request Type set to POST and its Request Format set to JSON. This is the JSON body that was sent to my webhook endpoint:

{"event":"spark/status","data":"offline","published_at":"2019-06-14T17:25:19.734Z","coreid":"e00fce68a6d1fea930d543a1"}
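
For anyone handling this on the receiving end, here is a minimal sketch in Python of parsing that body. The function name `parse_status_event` and the shape of the returned dictionary are hypothetical; the field names come from the JSON above. The sample only shows `"offline"`, so treating `"online"` as the other possible value of `data` is an assumption.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def parse_status_event(body: str) -> Optional[dict]:
    """Parse a webhook body; return None if it is not a spark/status event."""
    payload = json.loads(body)
    if payload.get("event") != "spark/status":
        return None
    # published_at is an ISO 8601 UTC timestamp with milliseconds.
    published_at = datetime.strptime(
        payload["published_at"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "device_id": payload["coreid"],
        # Assumption: "online" is the counterpart to the "offline" value shown.
        "online": payload["data"] == "online",
        "published_at": published_at,
    }


sample = (
    '{"event":"spark/status","data":"offline",'
    '"published_at":"2019-06-14T17:25:19.734Z",'
    '"coreid":"e00fce68a6d1fea930d543a1"}'
)
status = parse_status_event(sample)
```

A web server that receives the webhook could pass the raw request body to this function and update its own device table, which avoids polling the API for every device.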

Hi James, thanks for the info. In case it helps, the device list in the Console does show the correct time for Last Handshake; however, when you click on the device, it then seems to update to Last Heard. Hope that clue helps.

James, thanks a lot for the info.

No, I did not realise you could use the spark/status event in webhooks. That is great!
Thank you for your very helpful reply.

Cheers Paul.
