Delay in 'online' status and failed Particle.function calls

We’ve been using Particle.function calls, triggered from the cloud console, to configure field devices for a long time. These devices typically run Device OS 0.8.0. The process has been very reliable until today.

This afternoon I needed to change some configurations and attempted to run the functions as I have in the past, and was greeted with the “Bummer! Failed to call” timeout response. This is odd, as we have a specific stage in the firmware where the application listens for function calls and has no competing processes, etc. After repeated attempts to fire the function, I finally received a ‘spark/status online’ event, and immediately afterwards the next function calls all worked flawlessly. One unit updated… Now I’ve moved on to a second, and so far (after 5+ attempts as the E-Series wakes from deep sleep) I haven’t been able to get a single function call through.

In this instance, the second unit has shifted into a faster wake/sleep cycle, and I’m not receiving ‘spark/status online’ until after the E-Series has woken, published, and returned to sleep. I can see all of this on the console, but I have no ability to interact through functions. I believe I’ll get my config update in when the longer wake stage occurs tomorrow, but this is a new and frustrating situation.

Two questions:

  1. Has something in the cloud function delivery changed lately that could explain the odd delay?
  2. Is there a better way (existing or planned) to deliver function calls to a sleeping cellular node?
    …I know we could write an elaborate method ourselves for handshaking and then calling the function, etc. But if we don’t receive the event fast enough (within 1 minute or less of connection) it won’t help anyway, or it will force us to bleed battery unnecessarily while we wait for the cloud to acknowledge the connection event.

The most reliable way would be having the device request the data via a pub/sub scheme instead of having to attempt to catch a wake cycle from outside.
You could use the Device Notes feature to store updated config data for the device to request on next wake.

To preserve energy you could limit the requesting frequency to once per day or so instead of doing it on every wake.
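A rough device-side sketch of that idea (the webhook name "config-request" and the handler are hypothetical; it assumes a webhook set up to return the stored config, e.g. from Device Notes via the Particle API):

// Hypothetical sketch only, not tested code: request the config via
// pub/sub on wake instead of waiting for an inbound function call.
SYSTEM_MODE(SEMI_AUTOMATIC);

volatile bool configReceived = false;

void configHandler(const char *event, const char *data) {
    // Parse and apply the new configuration here.
    configReceived = true;
}

void setup() {
    // Webhook responses come back on "hook-response/<hook name>".
    Particle.subscribe("hook-response/config-request", configHandler, MY_DEVICES);
    Particle.connect();
    waitFor(Particle.connected, 60000);

    if (Particle.connected()) {
        Particle.publish("config-request", PRIVATE);
        // Give the response a short window before going back to sleep.
        unsigned long start = millis();
        while (!configReceived && millis() - start < 30000) {
            Particle.process();
        }
    }
}

To limit this to once per day, you could store the time of the last successful check in EEPROM (or a retained variable, if your sleep mode preserves it) and skip the publish when less than 24 hours have passed.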

Thanks, ScruffR. I guess that will be the solution then. I’m disappointed that the ‘online’ event is taking so long (10 min lag) since the console was working so well for us up until now, but writing the pub/sub approach would definitely be more robust.

That shouldn't take that long, and I haven't heard other reports along those lines as of yet.
Maybe @marekparticle can chime in here.

Yeah, it never took very long before, and this suddenly began on Nov 13. The delay also makes the earlier suggestion of a pub/sub invocation of Particle.function a dead end: the delay seems to cause any external Particle.function call to be rejected, even one triggered directly from a Particle.publish! Unfortunately, this means that in almost all cases I can’t alter the configuration, OTA any device, or trigger functions from our external application (which I believed to be the root cause of the problem I was attempting to fix with my OTA config). With a 10 minute delay we essentially have no method of OTA command and control. Uplinked data is coming through fine, though, so I can’t understand why the cloud thinks the devices are offline while they are clearly transmitting.

This is definitely not what I am seeing.
My latencies are still within a few seconds at most. Do you happen to know which cell provider your devices use?
Have you got a device at hand to run a minimal test application you could also post here?
@peekay123 would also be using a Canadian provider - maybe he can run a latency check with one of his Electrons.

I’m quite positive it’s nothing on Particle’s side. It might be either the cell provider or the AWS servers for your region.

I’m not sure which one they are connecting through at the moment. In Canada it will be Rogers, Bell, or Telus in this location. I also have a test unit in BC (also Rogers, Bell, or Telus) that is not experiencing the issue. That might be an important point: in that case, the online event is received immediately, even before the first publish event.

I’m having a really difficult time seeing how it could be a cell provider issue, since the publish data is being delivered to my console correctly… My logic is: if I can get Particle.publish data through to the Particle console, then the device state = ‘online’. So why doesn’t the Particle cloud know the device is obviously online? Also, if this were a carrier issue and the whole channel were delayed, then after 10 minutes, when the ‘online’ event finally arrives and I fire a function, wouldn’t the response time out? Instead, the device executes the function call correctly and responds within a second. I just can’t send it anything until 10 minutes after it wakes up. Also, Particle’s ‘health check’ itself shows 1–3 s latency at most (when it doesn’t time out).

Does anyone know what conditions trigger the ‘online’ spark/status event? That might shed light.

How long are your sleeping periods?
Is it deep sleep or stop mode sleep?
With or without SLEEP_NETWORK_STANDBY?
With or without disconnecting before entering sleep?

The reason for asking is that the cloud may not detect a new connection if the device never got disconnected or a still-valid session was reused.
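For reference, the variants being asked about look roughly like this (0.8.x-era syntax; the pin and timings are placeholders):

System.sleep(SLEEP_MODE_DEEP, 300);                   // deep sleep: full reconnect/handshake on wake
System.sleep(D1, RISING, 300, SLEEP_NETWORK_STANDBY); // stop mode: modem kept up, session can be reused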

Note for other readers who stumble onto this thread: we’re now working with support on this in parallel. I’ll try to share anything that might benefit others, but most of my investigation is happening through our support ticket now.

ScruffR,
Minimum 60 s and max 1 hr. Usually around 3000–3400 s, and sometimes 5 min if specific errors trigger a shorter sleep cycle.
SLEEP_MODE_SOFTPOWEROFF

Basically this form:
// SYNTAX
System.sleep(SLEEP_MODE_SOFTPOWEROFF, long seconds);

Never SLEEP_NETWORK_STANDBY
Without disconnecting before entering sleep (in the affected code)

However, we’ve changed this to a ‘disconnect before sleep’ within the last 48 hours as we work through this, and we’ve also updated from 0.8.0-rc.10 to v1.4.2 now that we have been able to deliver firmware. It’s very important to note that the ‘online’ event delay was never an issue in the previous year-plus of operation, and external Particle.function calls suddenly went from near-100% reliability to 0% at a single point in time.

Additional notes regarding the disconnect and firmware version:
We use SYSTEM_MODE(SEMI_AUTOMATIC), specifically run Particle.connect(), and check from the app for Particle.connected() before running certain code. This returns true in the affected code, while an external check for ‘online’ is evidently failing at the same time. We were only running Particle.disconnect() for certain error conditions where refreshing the connection to Particle might help recover the network, and we’ve never seen issues with connection or reconnection in the past. We added the disconnect just prior to the System.sleep() call in the code change within the last 48 hours, and I have no clear evidence of positive or negative impact given the other changes that have been made.

As for firmware: we flagged one of the earlier 1.?.? OS versions for some complications that we didn’t fully understand. This led us to stick with a known-good version that was reliably operating in the field rather than risk an unnecessary change. We’ve operated on that principle ever since, and saw very near 100% reliability in function response until the Particle.function calls began failing on Nov 13. It’s very possible that either the disconnect or the firmware version plays a role in the cause, but another factor must also be at play given the long history of reliable operation.
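For clarity, the overall connect/sleep flow now looks roughly like this (a simplified sketch of what I described above, not our production code):

SYSTEM_MODE(SEMI_AUTOMATIC);

void setup() {
    Particle.connect();
    waitFor(Particle.connected, 60000);
}

void loop() {
    if (Particle.connected()) {
        // ... publish data, listen for function calls, apply config ...
    }

    // Added within the last 48 hours: disconnect cleanly so the cloud sees
    // the session close instead of silently reusing it after wake.
    Particle.disconnect();
    waitFor(Particle.disconnected, 10000);

    System.sleep(SLEEP_MODE_SOFTPOWEROFF, 3300); // typical cycle, ~55 minutes
}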

One other note, the cell provider referenced in the console is Rogers. This had been unclear to me earlier but has been confirmed since.

@ian.c,

I have recently started to see this same behavior. Something has changed in November. The specific behavior is:

  • Device wakes from sleep
  • Device sends a webhook
  • Device waits for webhook response — Up to this point this is perfectly normal
  • I attempt to send an API call and get an immediate “Bummer! Failed to call”
  • Then, after three minutes, the device status shows “online”
  • An API request at this point does not fail immediately as above, but times out as normal because the device has already gone back to sleep.

Here is the log from another device. Notice the time it takes for the “online” message on each waking - 17 minutes in one case.

Thanks, Chip

Hello @ian.c, @chipmc, and @ScruffR (and anyone else that happens to land in this thread)!

Thank you for bringing this up and for providing so much detail!

We are aware of a bug that is causing those spark/status online messages to be delayed by up to 23 minutes for cellular and some Gen 3 devices. This bug may also negatively impact the ability to call functions for the same amount of time.

I am working on a fix for this issue today with the goal of having it deployed before the end of next week.

That said, our documentation regarding these spark/status events is somewhat out of date. Changes earlier this year did indeed add those events to cellular and Gen 3 devices. However, we still do not provide strong guarantees on the exact timing of when those events will be published.

The fix I’m working on will remove the bug causing the long delays for spark/status online events, but depending on your use case, you may still run into other issues with these events.

Please note that spark/status offline events will continue to be delayed 23-46 minutes for the following device platforms. This should not impact the ability to call functions.

  • Electron and E Series
  • Boron and B Series
  • Argon and Xenon that are child nodes in a mesh
  • Argon and Xenon that were set up as stand-alone devices and not in a mesh network (Full disclosure, this is another bug I only just found yesterday. I’ll try to get a fix out for this too.)

We are working to improve the timing of these events as well, but those changes will have to wait until 2020.

I understand this is a frustrating issue. Please reach out if you have other questions about this.

Thank you!
James


@JamesHagerman,

Thanks for the update and thank you very much for working to get this issue resolved.

I have one question: do you think this issue could be impacting the ability of products that sleep a lot to get updates via the Particle OTA process?

Thanks,

Chip


@all, I thought this would help highlight the issue with a larger sample size. All of the devices in this list came online at 10:00 and reported successfully. As you can see, the “online” message (and the ability to respond to an API call) took from 14 seconds to 21 minutes to report in.

Thanks,

Chip


@JamesHagerman Thank you for the candour about the occurrence of these spark/status events.

I have 2 Xenons Ethernet-connected in a Particle Ethernet FeatherWing. They both have mesh end nodes connected. I have been seeing repeated spark/status offline followed by spark/status online exactly 30 seconds later; in between, Particle publishes can be seen. Occasionally, the gateway device will start to flash green, and then the only way for it to reconnect to the cloud is a hard reset. Whilst it is flashing green, the mesh network comms appear to be unaffected (I log publishes to a log file on the gateway). Is this behaviour part of what you are describing here, or something else?

Example of console event log below:

[console event log screenshot]

And this:

[second console event log screenshot]

@chipmc,

At this time, I do not believe this spark/status delay bug has any impact on Particle’s OTA update process.

That said, sleepy behavior does introduce a number of additional variables that can impact a device’s ability to receive OTA updates. Some of these variables we can (and do) control for. Some, sadly, we cannot.

For example, if a user application puts a device to sleep before the first step of the OTA process has had enough time to complete, the device may never receive an update.

Poor cellular reception can slow that first step of the OTA process enough to cause issues.

While there are solutions to this issue, they all have trade offs and depend heavily on the exact use case.
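One hedged example of such a trade-off (a sketch only, assuming Device OS 1.2.0 or later and a product set up so the cloud marks pending updates for System.updatesPending() to see):

// Sketch only: hold off sleep briefly when the cloud has an update queued.
int sleepSeconds = 3300; // placeholder for the application's normal sleep time
if (System.updatesPending()) {
    unsigned long start = millis();
    while (millis() - start < 60000) {
        Particle.process(); // stay online long enough for the OTA transfer to begin
    }
}
System.sleep(SLEEP_MODE_SOFTPOWEROFF, sleepSeconds);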

Thank you for providing the events log for all those devices! It helps me know I’m not on a wild goose chase with this fix!

@ian.c,

That second event log does appear to be showing this spark/status event delay issue.

Generally, devices are marked offline if they fail to send data to the cloud within some amount of time. This time depends on the KeepAlive value set by the Device OS.

If the user application does not send data during that time, the Device OS will send a Ping to the cloud on its own. This is part of the normal maintenance required for a UDP connection.

Unfortunately, this method is not foolproof for a number of reasons.

For example, if the UDP packet carrying that Ping message is dropped somewhere on the network path between the device and the Particle Cloud, it may be marked offline.
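For reference, the keep-alive interval can also be raised from the user application, which is mostly relevant when using third-party SIM cards (the argument is in seconds):

// Example only: ask the Device OS to ping every 23 minutes instead of the default.
Particle.keepAlive(23 * 60);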

A Xenon on an Ethernet shield with a mesh should ping the cloud every 30 seconds. Therefore, in this case, the delay bug means spark/status events for that device are only published every 30 seconds.

After this fix is in place, I believe you should see less time between spark/status offline and spark/status online events. In that last example you provided, I am aiming to get that online message to have a timestamp of 6:38:47 instead of 6:38:51.

That said, the exact order of events published for UDP devices unfortunately cannot be guaranteed at this time.

I hope that helps,
Please let me know if you have additional questions!

Could I clarify what you are suggesting should be changed in the Xenon Ethernet-connected mesh gateway? Are you saying the gateway should be cloud-publishing at least every 30 seconds, or something else? The end nodes are only locally mesh-connected to the gateway, which then decides what to publish to the cloud.

Assuming the user application makes no calls to Particle.publish(), all connected Particle devices will send Ping messages to the cloud at regular intervals. This is documented behavior.

The timing of those keep alive messages depends on the device platform. For mesh devices, the device’s role in the mesh also comes into play.

Generally, these keep alive messages are transparently sent and should not be a concern.

In your situation, yes, your Xenon gateways should be quietly sending those keep alive messages in the background every 30 seconds.

Assuming that your child nodes are connected to the Cloud, they should also be sending keep alive messages to the Cloud in the background, through the gateway, but likely at a different interval. As far as I understand, the Device OS determines the interval of keep alive messages for child nodes based on the current network topology. This means the interval may change.

Does that answer your question, @ian.c?

Edit:
Oh, and to clarify: You should not need to publish to maintain connectivity.

The default keep alive messages are being sent in the background to keep the connection alive. If other data goes over that connection, that data acts to keep the connection open instead and the keep alive messages do not need to be sent.

I agree. Despite this issue, the OTA process did update my devices.

After this fix is out the door, it would be good to develop some documentation on how best to update deployed sleepy devices.

Thanks, Chip
