Udp.sendPacket(...) occasionally blocks for 10 seconds

Greetings!


  • Device: Electron
  • System Firmware: 0.6.4

Many months of successful development using Electron (very satisfied). System test is underway with dozens of devices in the field. Initial deployment of several hundred Electrons is imminent, then expanding into the thousands based on the success of the initial deployment.

As we expand system test, we are seeing Udp.sendPacket() occasionally block for 10 seconds before returning an error code of -1. Digging through the Particle forums, I found the following list of UDP error codes:

 [Particle UDP Error Codes](https://community.particle.io/t/udp-error-codes-listed-anywhere/18775/3)

According to this list, an error code of -1 means “Pending”. I can imagine many possible meanings for “Pending”; however, my bigger concern is that Udp.sendPacket() blocks for 10 seconds before returning it. According to the Particle Firmware documentation, Udp.sendPacket() is supposed to be “unbuffered”, meaning that the packet is sent directly from user memory; when the call returns, either the packet has been sent or the call has failed with a negative return code.

True to the spirit of UDP, delivery of packets is unreliable, meaning that software in the sender and receiver must be designed to handle lost, duplicate, and out-of-order packets. I’m pleased to say that our software architecture and implementation for the Electron and our “Cloud” successfully recovers from the lost “Pending” packet. Unfortunately, the problem still exists: why is sendPacket() blocking for 10 seconds? It shouldn’t take 10 seconds to send a 512 byte UDP packet, or to detect an error.

We always call Cellular.ready() before attempting to send. It would appear that there might be a firmware error in Cellular.ready() or Udp.sendPacket(). It is also possible that the cellular connection is lost in the small window of time between Cellular.ready() and Udp.sendPacket(). In any case, our code can recover from the situation, but only if the System Firmware doesn’t block for such a long time.

This occasional blocking is invalidating a very long design/implementation effort in which we meticulously designed the loop() { } construct to follow a single-threaded state machine where nothing blocks or takes longer than 250 ms. Our implementation sports 7 sensors and 2 servo-based actuators. As far as we can tell, we are successfully satisfying Particle’s design principles. That is, until we started detecting the occasional long-term blocking of sendPacket().
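For readers unfamiliar with the pattern, here is a minimal sketch of a single-threaded state machine in which each pass through loop() does one short step and returns. The state names, the `Machine` struct, and the `step` function are all hypothetical illustrations, not the actual product code; the time values would come from millis() on the device.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical states; a real device would have one per sensor/actuator task.
enum class State { ReadSensors, UpdateUi, SendReport };

struct Machine {
    State state = State::ReadSensors;
    uint32_t lastReportMs = 0;  // time of the most recent report
    int reportsQueued = 0;      // stands in for "hand a packet to the network code"
};

// Performs exactly one short step and returns, so loop() never waits.
// nowMs is passed in (millis() on the device) so the logic is testable.
inline void step(Machine& m, uint32_t nowMs, uint32_t reportIntervalMs) {
    assert(reportIntervalMs > 0);
    switch (m.state) {
    case State::ReadSensors:
        m.state = State::UpdateUi;
        break;
    case State::UpdateUi:
        // Unsigned subtraction keeps the comparison valid across timer wraparound.
        m.state = (nowMs - m.lastReportMs >= reportIntervalMs)
                      ? State::SendReport
                      : State::ReadSensors;
        break;
    case State::SendReport:
        ++m.reportsQueued;
        m.lastReportMs = nowMs;
        m.state = State::ReadSensors;
        break;
    }
}
```

The design only holds its 250 ms budget if every call made inside a step returns promptly, which is exactly what the 10 second block breaks.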

For many other kinds of projects, if the processor freezes occasionally for 10 seconds, that is no big deal. While I can’t get into details, it violates our design principles because our “user interface” is frozen for those 10 seconds. For us, this is a big deal. No, we are not monitoring mission-critical information. But the success of our product centers on both functionality and a solid user interface experience.

I am posting this issue here because we’ve made great progress learning about how to solve various problems on these forums. I think that the answers posted here will be of value to the Particle community. Specifically,

  • Is the single-threaded state-machine loop() { } valid in all circumstances?
  • Is there some kind of undocumented “timeout” parameter I can add to Particle Firmware calls to ensure they never block?
  • Does the Firmware API expose FreeRTOS so that I can create my own threads (and synchronization primitives) to make calls to the Particle Firmware? Is the System Firmware “thread-safe”? Or is attempting to solve this problem with this approach a really bad idea?

Thank you all for your invaluable input.

My guess is that it’s not actually the UDP send that is taking that long. Most modem operations on the u-blox modem take over the modem for the duration of the operation, and the next operation won’t proceed until the previous one is complete. My guess is that the UDP is blocking because another operation is using the modem at the time.

Of course this raises the question of which operation, and why. That’s harder to answer. A debugging version of system firmware can output all modem operations by USB serial. That’s probably the best way to troubleshoot this, if you can reproduce it with any frequency.

There is multi-threaded support, however not all calls are thread safe, so you have to be pretty careful when using it. And because the modem is single-threaded, I’m not positive this will fix your blocking problem.

However, you may be able to move your user interface and/or measuring into a thread, but leave networking in the main loop. You need to be careful because things like I2C and SPI can be called from another thread, but there’s no thread safety built in, so you’ll need to wrap those to prevent problems. But that would assure that your user interface remains responsive.
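To illustrate the kind of wrapping meant here, below is a portable C++ sketch using std::thread and std::mutex. It is not Particle firmware code: on the device, the firmware's own thread and mutex primitives would take their place. `LockedBus`, `writePair`, and `runTwoWriters` are hypothetical names, and the vector stands in for a shared peripheral such as an I2C bus.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a shared bus. Each "transaction" is two writes
// that must not be interleaved with writes from another thread.
class LockedBus {
public:
    void writePair(int first, int second) {
        std::lock_guard<std::mutex> guard(lock_);  // held for the whole transaction
        log_.push_back(first);
        log_.push_back(second);
    }

    std::vector<int> snapshot() {
        std::lock_guard<std::mutex> guard(lock_);
        return log_;
    }

private:
    std::mutex lock_;
    std::vector<int> log_;
};

// Simulates a UI thread and a sensor thread sharing the bus.
inline std::vector<int> runTwoWriters(int perThread) {
    assert(perThread >= 0);
    LockedBus bus;
    std::thread ui([&bus, perThread] {
        for (int i = 0; i < perThread; ++i) bus.writePair(1, 1);
    });
    std::thread sensors([&bus, perThread] {
        for (int i = 0; i < perThread; ++i) bus.writePair(2, 2);
    });
    ui.join();
    sensors.join();
    return bus.snapshot();
}
```

Because the lock covers the whole transaction rather than each individual write, the two values of a pair always land next to each other in the log, whichever thread runs first.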

rickkas7, thank you for your quick reply.

I’m going to look into using the debugging system firmware. But first, I need to understand and characterize the frequency of this error. Perhaps there is a way to do something to increase the frequency of failures so we can catch it using the debugging version of system firmware. The device application is mobile and truly doesn’t make it easy to hang a PC off the USB port to capture the output of all modem operations.
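One possible way to characterize the frequency without a PC attached is to time each send on the device and keep running counters. The sketch below assumes that approach; `BlockStats` and `recordCall` are hypothetical names, and the 250 ms budget is taken from the design described earlier in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Running counters for one instrumented call site.
struct BlockStats {
    uint32_t calls = 0;       // total calls observed
    uint32_t failures = 0;    // calls that returned a negative code
    uint32_t overBudget = 0;  // calls that took longer than the budget
    uint32_t worstMs = 0;     // longest single call seen so far
};

// Records one call. startMs and endMs are millisecond timestamps taken
// immediately before and after the call being measured.
inline void recordCall(BlockStats& s, uint32_t startMs, uint32_t endMs,
                       int returnCode, uint32_t budgetMs) {
    assert(budgetMs > 0);
    // Unsigned subtraction gives the right answer even if the timer wrapped.
    const uint32_t elapsed = endMs - startMs;
    ++s.calls;
    if (returnCode < 0) ++s.failures;
    if (elapsed > budgetMs) ++s.overBudget;
    if (elapsed > s.worstMs) s.worstMs = elapsed;
}
```

On the device, the timestamps would come from millis() read just before and just after Udp.sendPacket(...), and the counters could be included in a later packet so the data reaches the cloud from units in the field.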

Once I learn something interesting or useful, I’ll post it here.

Thank you and best regards,
Steve Scott