OTA update failing due to network latency timeouts

Hey all,

Trying to release a new firmware to the devices in my product.

Haven’t had a problem with this in the past.

Have successfully OTA flashed the same firmware to a single development device using Particle-CLI.

All updates are generating spark/flash/status events with a failed payload.

The spark/device/last_reset events all have either update_timeout or panic, stack_overflow payloads.

Screenshot:

What gives?

I am concerned because this appears to be burning through the data on all my devices with the half-finished OTA update, reset, download OTA again loop.

For now I will “un-release” the firmware to break the cycle, but this is worrisome to say the least.

I have brought it up to the team and will see if action needs to be taken.

Was that device running the exact firmware the other devices are running currently?

Can you single out one or two product devices to apply the new firmware without rolling out to the whole fleet and check the trace logs?
With trace logging activated and SYSTEM_THREAD(ENABLED) you should be able to get some OTA messages via USB that may provide extra clues.
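
A minimal firmware sketch of that setup, assuming a Device OS version that provides the logging API (SerialLogHandler). Nothing here comes from the original poster's code, and the empty setup() and loop() are placeholders for the real application:

```cpp
#include "Particle.h"

// Run Device OS on its own thread so application code is not blocked
// while the system manages the cloud connection.
SYSTEM_THREAD(ENABLED);

// Send system and application log messages, down to TRACE level,
// to the USB serial port.
SerialLogHandler logHandler(LOG_LEVEL_TRACE);

void setup() {
    // Application setup goes here.
}

void loop() {
    // Application work goes here.
}
```

The log output can then be read over USB, for example with `particle serial monitor` from the Particle CLI.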

Problem seems to have solved itself:

These devices are all operating on an island off the coast of Tanzania where the cell phone networks are quite over-burdened.

I think this is probably why both the stack overflow and update timeout errors were occurring.

I have also seen other strange behaviour that I believe is related to the network latency issues.

For example, I routinely call a Particle.function() on a device from the Particle console and receive an error saying that the device couldn’t be reached, only to see a few moments later (4-5 seconds at most) that the function call did go through (as evidenced by a Particle.publish that the function call triggers).

I believe latency is also why requests for the values of the Particle.variable()s I have exposed to the cloud fail.

Is it possible to increase the timeout length on Particle.function() and Particle.variable() calls from the Particle servers to devices, as well as the OTA timeout?

Hi @jaza_tom,

If you’re frequently seeing timeouts when calling a Particle.function(), but you can see that the function call did kick off, try not to block in your function callback. Instead, have the callback set a flag or other variable, and use that variable to trigger the work later in your loop() function. This way the function call can return immediately, which cuts down the latency of the device’s reply to the server and might help you avoid those timeouts.

Thanks!
David
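
A minimal sketch of the flag pattern described above, written as plain C++ so it can be compiled off-device. The names requestWork, workRequested, publishEvent and publishedEvents are hypothetical. On a real device the handler would take a Particle String and be registered with Particle.function(), and publishEvent would be replaced by Particle.publish():

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for Particle.publish(): records event names
// so the sketch can run without a device.
std::vector<std::string> publishedEvents;
void publishEvent(const std::string& name) {
    publishedEvents.push_back(name);
}

// Set by the cloud function handler, consumed by loop().
volatile bool workRequested = false;

// Cloud function handler: only records the request and returns at once,
// so the reply to the server is not delayed by the actual work.
int requestWork(const std::string& arg) {
    (void)arg;            // the argument is unused in this sketch
    workRequested = true;
    return 1;             // immediate reply to the caller
}

// Main loop body: does the slow work and the publish outside the handler.
void loop() {
    if (workRequested) {
        workRequested = false;
        // ... the slow or blocking work would go here ...
        publishEvent("work_done");
    }
}
```

The handler does nothing but set the flag, so how long the work takes no longer affects how quickly the device answers the function call.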

@Dave thanks for the tip! I think these functions tend to block for a fairly short amount of time (maybe 1 second at the most).

Would this be enough to cause a timeout on a function call?

If so, can we increase the timeout length?

Can I publish things from within a cloud function callback, or will that automatically cause a cloud function timeout?

What is the timeout length?

Hi @jaza_tom

Just my unofficial opinions, but perhaps you should try to arrange for a paid consulting code review. One second is an eternity for a cloud function to block, in my experience. The best practice is to treat cloud functions like you would interrupt service routines and set flags for the main loop to handle any required actions. Publishing from within a cloud function can also be a problem due to the limited resources on devices; it is best to set a flag and publish from loop to avoid buffer problems.

I am not sure if extending the timeout period results in lower overall cloud performance but my feeling is that it probably does. There is certainly an engineering trade-off to be made in the timeout period versus scalability. I personally do not feel that changing the timeout is the answer to your issues.

@bko thank you for your thoughts, I will take that under advisement.

I have programmed a flag-based response architecture in an otherwise empty test cloud function that I hope to call next time there are network latency issues in Tanzania, to test my hypothesis that it is the network latency that is the dominant factor as opposed to my coding style.

I will report back with the results when they are available.

If indeed the strange Particle.function() behaviour is due to my coding style w.r.t. flag vs non-flag based callbacks, then that still doesn’t explain why OTA updates and Particle.variable() get calls fail at times with higher network latency.

How many milliseconds (or seconds) is the cloud timeout set to for:

  • Particle.function() call —> in device firmware —> in Particle Server Code
  • Particle.variable() call —> in device firmware —> in Particle Server Code
  • OTA update operation—> in device firmware —> in Particle Server Code

Hi @jaza_tom,

Typically the timeouts are pretty long, I think on the order of 15-45 seconds, but I don’t have that value in front of me at the moment.

You can also disable the low power mode on the Electron if you want to reduce cellular latency. I’ve found that disabling this low power mode can make the Electron snappier, at the cost of higher power consumption.

Cellular.command("AT+UPSV=0\r\n");

Thanks,
David

@jaza_tom,

I’m wondering if you ever resolved your latency issues. I’m having issues with Electrons going offline, as detailed here: Electrons Lose Connectivity Intermittently

I’m experiencing the same OTA update issues as you’ve posted about here with the devices with the worst problems. I’m wondering if the source of my issues may be latency as opposed to issues with tower connections. Any tips you have would be welcome.