Delay in 'online' status and failed particle.function

My mesh devices are only Mesh connected - so there is no Ping coming from them via the gateway to the Cloud. In my case only the gateway will be pinging.

Could there be an application-level reason why the Ethernet-based gateway occasionally fails to send the ping to the cloud, causing the connection dropouts? Could it be a short loss of internet connection, where one ping doesn’t get through? They are set up with SYSTEM_THREAD(ENABLED) and SYSTEM_MODE(AUTOMATIC). Is there something in the Device OS API that can be used to “help” ensure there is a regular and successful ping?

One last thing - over the last 48 hours the stability of the 2 gateways I have has improved remarkably - I haven’t had to reset them once. They are both still running on 1.4.2. Is this due to something that has been done in the Cloud?

One of the big underlying issues IMHO is that UDP is used for DeviceOS/Mesh messaging, and while it has much less overhead than TCP, it is entirely unmanaged, so unlike TCP there are ZERO guarantees that any data is received by the destination. It was designed to be used on high-reliability networks (cabled or optical), typically well-managed corporate networks. It is totally unsuitable on a “best effort” network such as the internet. But, as in life, it is all about tradeoffs, and it’s not a bad way to send small amounts of data if you can live with the occasional dropped packet/message. Because of this I use a separate TCP-based connection to send the mission-critical data from my devices and only use the Particle cloud for management of the device. This, while being code heavy on the device, is MY trade-off for reliability. YMMV, since there are many other aspects of Particle cloud that are very attractive to use.

In my case - between the Cloud and ethernet connected device - if I am using Particle publish then this is using TCP? I thought it was only the Mesh network that used UDP. I am OK with dropped packets on the Mesh network as I have implemented an ACK mechanism from node to gateway for critical data.

Gen3 devices are using UDP for cloud communication just like the Electron.
Hence Particle.publish() defaults to up to three retries which - some time ago - caused multiple events hitting console for only one publish.

And if you are using AsyncQ, AFAIK, it returns true as soon as the message has been placed in the queue, not necessarily once it has been received by the Particle cloud.

I had a conversation with one of the Particle presales folks recently and they made a strong case for using Particle cloud end to end. However, based on my knowledge and experience with UDP, it is not going to work in my use case. Besides, if a UDP keep-alive has to be sent every 30 s across a cellular link, you will probably consume 5 MB a month before any user data is sent, so it still needs planning as to what method is used vs the gain/pain experienced.
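To put a number on the keep-alive cost mentioned above, here is a rough back-of-the-envelope sketch. The function name and the per-ping byte count are assumptions for illustration only; the real size of a keep-alive on the wire depends on the protocol overhead and on how your carrier meters traffic, so plug in your own measured value.

```cpp
#include <cassert>
#include <cstdint>

// Estimate the bytes consumed per month by periodic keep-alive pings.
// bytesPerPing is an ASSUMPTION supplied by the caller (payload plus
// protocol overhead, in both directions if the reply is billed too).
// daysPerMonth defaults to 30 as a simplification.
std::uint64_t keepAliveBytesPerMonth(std::uint32_t intervalSeconds,
                                     std::uint32_t bytesPerPing,
                                     std::uint32_t daysPerMonth = 30) {
    assert(intervalSeconds > 0);  // a zero interval would divide by zero
    const std::uint64_t secondsPerMonth = 86400ULL * daysPerMonth;
    const std::uint64_t pingsPerMonth = secondsPerMonth / intervalSeconds;
    return pingsPerMonth * bytesPerPing;
}
```

A 30 s interval gives 86,400 pings in a 30-day month. With a hypothetical 60 bytes per ping, `keepAliveBytesPerMonth(30, 60)` comes to 5,184,000 bytes, which is in the same ballpark as the monthly figure quoted above. A longer interval shrinks the total proportionally.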

Thanks @ScruffR and @shanevanj for clarifying that for me. I kind of thought that sticking with TCP for WiFi and Ethernet cloud connected devices would be the logical thing to do but I guess then the gateway would have to translate between UDP and TCP and there would need to be a different mechanism just for the Boron.

I guess with using the Async library I can also use WITH_ACK - or is that also not going to work with the Gen3 and UDP either?

defaults to up to three retries which - some time ago - caused multiple events hitting console for only one publish

thanks for the reminder about that little imperfection!

Oof. I just realized I totally tagged the wrong person in my replies. I apologize for that confusion @armor and @ian.c

@armor:

Is there something in the Device OS API that can be used to “help” ensure there is a regular and successful ping?

Yes, as you stated in a later post, WITH_ACK should generally increase the reliability of publishing messages. Depending on the situation, there can be other concerns to keep in mind. For example, WITH_ACK will consume more data.

There is always a chance that the network dropped the packet. As others have stated, we use UDP and dropped packets are a reality with any UDP connection.

While the UDP protocol itself does not offer transmission guarantees, CoAP does. CoAP is the application layer protocol that sits above UDP in the network stack. The WITH_ACK flag on Particle.publish() makes use of a message type defined in the CoAP spec, which is intended to regain some of the transmission guarantees provided by TCP while also maintaining the benefits of UDP. The term used in the spec for these types of messages is "confirmable".
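For anyone curious what "confirmable" means in practice, here is a minimal host-side sketch of the idea: send, wait for an ACK, and retransmit with a doubled timeout if none arrives. This is not Device OS code. The names `sendConfirmable`, `ConfirmableResult` and `trySend` are made up for illustration. The defaults (2 s initial timeout, 4 retransmissions) are the RFC 7252 defaults and may not match what Device OS uses (the thread above mentions up to three retries). The random factor the RFC applies to the initial timeout is left out for simplicity.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

struct ConfirmableResult {
    bool acknowledged;   // true if an ACK arrived for any transmission
    int transmissions;   // initial send plus retransmissions actually made
};

// trySend(timeoutMs) is a hypothetical transport hook: it transmits the
// message once and returns true if an ACK arrived within timeoutMs.
ConfirmableResult sendConfirmable(
        const std::function<bool(std::uint32_t)>& trySend,
        std::uint32_t initialTimeoutMs = 2000,  // RFC 7252 ACK_TIMEOUT default
        int maxRetransmit = 4) {                // RFC 7252 MAX_RETRANSMIT default
    assert(maxRetransmit >= 0);
    std::uint32_t timeoutMs = initialTimeoutMs;
    for (int attempt = 0; attempt <= maxRetransmit; ++attempt) {
        if (trySend(timeoutMs)) {
            return {true, attempt + 1};
        }
        timeoutMs *= 2;  // exponential back-off before the next retransmission
    }
    return {false, maxRetransmit + 1};  // gave up: no ACK for any attempt
}
```

Note that a retransmission can also happen when the original message arrived but the ACK itself was lost. The receiver then sees the same message more than once unless it de-duplicates, which fits the duplicate-event behaviour mentioned earlier in the thread.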

[...] Is this due to something that has been done in the Cloud?

I am not aware of any specific change that could impact that particular situation but I'm glad to hear reliability seems to have improved!

[...] or is that also not going to work with the Gen3 and UDP either?

Because WITH_ACK is part of the CoAP spec, it will work with Particle.publish() on Gen3 and all UDP devices (Electron and E Series).

Sorry again for the tagging confusion!

For transparency:

The fix for that spark/status delay issue is complete. I’ll be testing it today.

Thank you for the replies and the transparency!

OMG. I’m so glad to have found this thread. I’ve been going mad trying to figure this problem out.

We use E-series devices and have them connect on the hour, for 2 minutes only, before disconnecting and going into STOP mode. When the device connects it publishes a connected message, which triggers our cloud to call device functions if needed. Recently these function calls have been rejected immediately. I’ve finally noticed that if the last spark/status event was offline then Particle doesn’t even try to send the message; it just immediately returns ‘timeout’.

Even though we publish a message (WITH_ACK) as soon as we connect, the spark/status = online event is always delayed. We disconnect cleanly too:

  1. Particle.disconnect()
  2. Wait until !Particle.connected()
  3. Cellular.off()
  4. Wait 2 seconds
  5. Go into STOP mode
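The five steps above can be sketched as follows. To keep it compilable off-device, the hardware calls are hidden behind a hypothetical `DeviceHooks` struct; on real firmware the hooks would wrap the calls named in the comments. The bounded wait in step 2 (`maxWaitMs`) is my addition, not part of the sequence as posted, so that a stuck connection cannot keep the device awake forever.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical hooks so the sequence can be exercised on a host machine.
struct DeviceHooks {
    std::function<void()> cloudDisconnect;       // e.g. Particle.disconnect()
    std::function<bool()> cloudConnected;        // e.g. Particle.connected()
    std::function<void()> modemOff;              // e.g. Cellular.off()
    std::function<void(std::uint32_t)> delayMs;  // e.g. delay()
    std::function<void()> enterStopMode;         // the STOP-mode sleep call
};

// Runs the shutdown sequence; returns true if the cloud connection was
// confirmed closed before the modem was switched off.
bool disconnectAndSleep(const DeviceHooks& hw,
                        std::uint32_t pollMs = 100,
                        std::uint32_t maxWaitMs = 10000) {
    assert(pollMs > 0);
    hw.cloudDisconnect();                              // step 1
    std::uint32_t waitedMs = 0;
    bool stillConnected = hw.cloudConnected();
    while (stillConnected && waitedMs < maxWaitMs) {   // step 2 (bounded)
        hw.delayMs(pollMs);
        waitedMs += pollMs;
        stillConnected = hw.cloudConnected();
    }
    hw.modemOff();                                     // step 3
    hw.delayMs(2000);                                  // step 4: wait 2 seconds
    hw.enterStopMode();                                // step 5
    return !stillConnected;
}
```

The return value lets the caller log whether the disconnect was clean, which may help when comparing device behaviour with the spark/status events seen in the console.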

As a workaround could Particle simply always try and send function calls? I assume not sending when the device is offline is an attempt to save data charges but at the moment this is screwing us over. It’s not saving us anything anyway since we only try and send when the device has just connected. If that’s not possible then surely setting the status of a device to ‘online’ as soon as any message is received from the device can’t be too hard.

We’ve got a client under the gun here with devices in the field and at the moment no workaround.

That would be a difficult task when the cloud doesn't know how to find the device.
As said, the device needs to open the UDP path between itself and the cloud for the cloud to use that path to send anything back to the device.

Sure, your Particle.publish() call should open that path for the cloud to use immediately, and it's probably a bug when it currently doesn't. However, @JamesHagerman has been aware of the issue, and going by the post from a week ago a fix should have been rolled out already.

Ah yes. That makes perfect sense.

Hi @JamesHagerman. We've currently got a battery-sucking workaround in place for this issue. Has the fix been put into place? I notice that the spark/status is still coming in a few minutes after the device re-connects.