Unreliable UDP: crashes/freezes when sending at high frequency

Hi all,

I’ve been playing around with the spark’s UDP capabilities. I’m working on streaming accelerometer and gyro data from a number of sparks simultaneously over UDP on my LAN. I’ve found the UDP send to be unreliable in undocumented ways.

I’ve updated the CC3000 firmware to see whether that helped, but it didn’t change anything. I’ve also tested on multiple sparks (and with multiple routers), so it’s probably not isolated to one spark/router.

In particular, one behaviour that I can reproduce consistently is that there seems to be a minimum delay required between Udp packets; without it the spark will crash/freeze.

If I run this:

UDP Udp;

byte remoteIP[] = {192, 168, 1, 100};
int remotePort = 10000;

int localPort = 5000;

void setup(void) {
	Udp.begin(localPort);
}

void loop(void) {
	Udp.beginPacket(remoteIP, remotePort);
	Udp.write('a');
	Udp.endPacket();
 	delay(1); // Remove this and everyone loses their mind
}

I get a solid stream of ‘a’ to my console (running for example nc -l -u 10000), but without the delay(1) the spark will either freeze or crash all the time. It seems odd since the drivers claim that the underlying sendto call is blocking which I would assume means that the Udp.write() call would block until the send was finished and it was safe to send again.

(Also, on a related note, why is there a Udp.endPacket() if it literally does nothing?)

As far as I can tell there seems to be timing limitations surrounding Udp.parsePacket() as well, but I don’t have a minimal test case/example for that yet.

Does anyone have more information about this issue and/or tips on making the Udp behave reliably? The goal for me is to implement something that’s streaming Udp at a fairly high frequency but will also be reliable and able to run for many hours non-stop. (I’ve yet to try TCP for my application since UDP is the more natural choice but it is an option to switch over if it will be more reliable. )

I don’t know the technical ins and outs of this, but it’s possible that the cloud management could be interfering. Could you try disabling the cloud connectivity using Spark.disconnect()? Docs for that are here.

I wonder if the Udp.endPacket() is in there for compatibility with the Arduino Ethernet Class? Just a hunch as to why that may be in there and not actually doing anything. Maybe @zachary can expand a little more on that.

I’m able to reproduce this also. I haven’t looked for a fix yet.

Hi wgbartley,

I’ve actually been mostly testing with the cloud disconnected. Usually I run with #include "spark_disable_cloud.h" and there is no clear difference. It seems to happen with or without the cloud.

Blake, glad to hear it’s not just me :smile:

Do you have a sense of what the root issue might be or where I can look to workaround/understand this? For instance if I know that allowing for a certain interval between calls guarantees them to be safe that would be decent for now.
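For what it's worth, here is a minimal sketch of pacing sends by elapsed time rather than a blocking delay(). SendGate is a hypothetical helper, not part of the Spark firmware, and nothing in this thread establishes which interval (if any) is actually safe on the CC3000; the value you pass in is a guess to tune. On the Core, nowMs would come from millis().

```cpp
#include <cstdint>

// Hypothetical pacing helper (not a Spark API): lets one send through
// per `intervalMs` without blocking the rest of loop().
class SendGate {
public:
    explicit SendGate(std::uint32_t intervalMs) : intervalMs_(intervalMs) {}

    // Returns true if at least intervalMs has passed since the last send
    // that was allowed. Unsigned subtraction keeps the comparison correct
    // when the millisecond counter wraps around.
    bool ready(std::uint32_t nowMs) {
        if (!first_ &&
            static_cast<std::uint32_t>(nowMs - lastMs_) < intervalMs_) {
            return false;
        }
        first_ = false;
        lastMs_ = nowMs;
        return true;
    }

private:
    std::uint32_t intervalMs_;
    std::uint32_t lastMs_ = 0;
    bool first_ = true;
};
```

In loop() you would only run the beginPacket/write/endPacket sequence when ready(millis()) returns true, so sensor reads and other work can continue in between sends.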

I’m not sure what it could be. If I get some time I’ll have a play around with it, but I don’t have a lot of experience with the firmware code base.

You need to look at the return value from udp.write. With your code, it returns -1 with the cloud off from the first call to udp.write. I am not sure why, but there is definitely an error being reported. I will try to investigate further tomorrow.

I like to add something like the code below at the end of loop with the cloud off so I can tie a pin high and do over-the-air flashing. You need to tie A0 low to run this.

 	if (digitalRead(A0)==HIGH) {
 	    Spark.connect();
 	    while (!Spark.connected()) {}  // connected() is a function and must be called
 	    for(;;) { SPARK_WLAN_Loop(); }
 	}

OK so it turns out you have to wait for UDP.begin() to return non-zero with the cloud off. With the cloud on, you know the WiFi connection is fully up before you get to setup() because it has already done the entire cloud start up negotiation. But with the cloud off, I don’t think there is a blocking call in setting up the WiFi connection, so you can loop until begin from the UDP connection returns non-zero, but I believe you could also loop looking at WiFi.status().


#include "spark_disable_cloud.h"

UDP Udp;

byte remoteIP[] = {192, 168, 1, 100};
int remotePort = 10000;

int localPort = 5000;

void setup(void) {
	int beginRet = Udp.begin(localPort);
	while (beginRet == 0 ) {  // could add a timeout here if you need it
	    beginRet = Udp.begin(localPort);
	}
}

void loop(void) {
	Udp.beginPacket(remoteIP, remotePort);
	int retval = Udp.write('a');
	if (retval!=1) { 
	    //do something here in your application   
	} 
	Udp.endPacket();
 	//delay(1); // Remove this and everyone loses their mind
}
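The "could add a timeout here" comment in setup() can be sketched roughly as below. This is plain desktop C++ for illustration only: retryUntilNonZero is a hypothetical helper, and on the Core you would swap the std::chrono and sleep calls for millis() and delay(). The attempt callable stands in for Udp.begin(localPort).

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not a Spark API): calls `attempt` until it returns
// non-zero or `timeout` has elapsed, pausing between tries.
// Returns the last value produced by `attempt` (0 means it never succeeded).
int retryUntilNonZero(
    const std::function<int()>& attempt,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds pause = std::chrono::milliseconds(10)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    int result = attempt();
    while (result == 0 && std::chrono::steady_clock::now() < deadline) {
        std::this_thread::sleep_for(pause);  // do not spin at full speed
        result = attempt();
    }
    return result;
}
```

If it returns 0 the application can decide what to do (blink an LED, reset, keep waiting) instead of hanging in setup() forever.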

Not a stab at anyone, but it would be great if the documentation at http://docs.spark.io/firmware/#udp-begin reflected that there is a return value. The docs could also carry a note somewhere with the great info provided here by @bko.

Hi @SomeFixItDude

I was thinking the same thing, so I will try to write something up. My non-Spark life has been busy lately!

Thanks @bko, I was actually aware of that issue. When the cloud is off via #include "spark_disable_cloud.h" you have to wait for the wifi to connect and acquire an IP before you call Udp.begin. You can do it the way you showed, but I’m doing it like this instead (flashing via spark cloud compile and dfu-util):

#include "application.h"
#include "spark_disable_cloud.h"

// UDP

UDP Udp;

byte remoteIP[] = {192, 168, 1, 100};
int remotePort = 10000;

int localPort = 5000;

// Vars

bool online = false;

void setup(void) {

}

void loop(void) {
	if (!online) {
		IPAddress addr = Network.localIP();

		if (addr[0] != 0 || addr[1] != 0 || addr[2] != 0 || addr[3] != 0) {
			Udp.begin(localPort);
			online = true;
		}

		delay(100);
	} else {
		Udp.beginPacket(remoteIP, remotePort);
		Udp.write('a');
		Udp.endPacket();
		delay(1);
	}
}

But the main point is that having cloud on/off doesn’t affect the above mentioned issue.

@SomeFixItDude there are also some other little errors in the documentation. They regularly refer to all-uppercase UDP, such as UDP.available(), which doesn’t work; it should be Udp.available(), with lowercase letters.

@sparks I did see on the TI forum that, depending on your CC3000 firmware version, you may need to add a delay after your first UDP packet. The suggested delay was one second. Later, when I have time, I’ll try to find the link, or you could search for it on the TI forum. What firmware version of the CC3000 are you running?

EDIT: Found the link… here

http://e2e.ti.com/support/wireless_connectivity/f/851/t/342177.aspx

Scroll to the TI employee response.

Here is the interesting part from the TI forum:

Below are some of the updates/answers regarding the release package 1.13.

  1. We will update the wiki soon with the correct download packages.

  2. With respect to the UDP issues, we also see these issues. In order to fix some of the TCP issues, we had to change the ARP behavior in the Network stack. And because of this we are seeing some issues with respect to the UDP.

Below are a couple of work arounds for the issues that we are facing:

i) For the UDP send, we suggest to add considerable delay only after the first udp sendto command.
This should fix the UDP sendto issue.

ii) For the mDns issues, after the WLAN connection is completed, please make a call to ‘gethostbyname’ with host_name set to “localhost”.
After this, make calls to “mdnsAdvertiser”.

Thanks for the detailed info @SomeFixItDude. Playing around with a longer delay after the first packet hasn’t seemed to improve the issue. I’m running the latest CC3000 firmware on the master branch here

I’m also finding that even with a delay (say as much as 10ms) I have trouble keeping a stream of (bidirectional) data going. The behaviour I get right now is that the spark falls into a weird cycle where, from the perspective of my laptop, it becomes unreachable (unpingable/host is down when I send a UDP message), is briefly back online, then becomes unreachable again, etc. This occurs after a few minutes when I’m streaming from 4 sparks simultaneously.

Could someone please post a simple program to talk with a core doing UDP? UDP seems like the path for my current problem.

Obviously the Spark should not crash, but other than that the OP does not describe any flaw in Spark UDP, but he does identify one by pointing out that UDP.endPacket() does nothing.

I see from the link you provide that it does seem that UDP.endPacket() does nothing. This is most perturbing. UDP is a datagram protocol. The UDP protocol demands each packet=datagram must be sent in its entirety or not at all and that the ENTIRE packet=datagram is received OR NOT AT ALL. I had always thought (and this can be the only UDP-compliant intention of those who specified the interface to the UDP class) that the purpose of the endPacket() call was to allow a packet to be assembled by multiple calls to UDP.write() and that nothing would be sent until endPacket() was called. Any other behaviour means that Spark UDP is fundamentally broken. This is a requirement of the entire packet=datagram being received in its entirety OR NOT AT ALL. [Once the entire packet=datagram is sent by UDP this may be split into several IP-packets in later versions of the IP layer (layer 2) of the network stack but this break-up and reassembly is invisible at the UDP layer (layer 3) - I am talking about the spec of the UDP protocol, not how it is implemented.] If Spark UDP does not implement this essential part of the UDP protocol then Spark UDP is fundamentally broken in yet another way than previously identified. Either that or we have found (one of the) cause(s) of https://community.spark.io/t/udp-received-dgram-boundaries-lost-read-parsepacket-available-all-broken/3800 and https://github.com/spark/core-firmware/issues/170
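To make the expected behaviour concrete, here is a minimal sketch of the buffering described above: write() only accumulates bytes, and nothing reaches the network until endPacket() hands the whole buffer over in one call. DatagramBuilder and its send callback are hypothetical and are not how the Spark firmware is implemented; the callback stands in for a single sendto() on the real socket.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper (not the Spark UDP class): collects bytes from
// several write() calls and passes them on as exactly one datagram.
class DatagramBuilder {
public:
    // `send` stands in for one sendto() call on the underlying socket.
    using SendFn = std::function<bool(const std::uint8_t*, std::size_t)>;

    DatagramBuilder(std::size_t maxSize, SendFn send)
        : maxSize_(maxSize), send_(std::move(send)) {}

    // Start a new datagram, discarding anything left unsent.
    void beginPacket() { buffer_.clear(); }

    // Append bytes. Returns the number accepted, or 0 if they would not
    // fit, so a datagram is never silently truncated.
    std::size_t write(const std::uint8_t* data, std::size_t len) {
        if (buffer_.size() + len > maxSize_) return 0;
        buffer_.insert(buffer_.end(), data, data + len);
        return len;
    }

    std::size_t write(std::uint8_t byte) { return write(&byte, 1); }

    // Hand the complete buffer to the transport in a single call.
    bool endPacket() {
        const bool ok = send_(buffer_.data(), buffer_.size());
        buffer_.clear();
        return ok;
    }

private:
    std::size_t maxSize_;
    SendFn send_;
    std::vector<std::uint8_t> buffer_;
};
```

With this shape, two write() calls followed by one endPacket() produce one datagram on the wire, which is what preserves the datagram boundary for the receiver.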

That a UDP datagram is received at all, or that UDP datagrams are not duplicated, or that UDP datagrams arrive in order is not guaranteed in any way by the UDP protocol. That you are missing packets, or the initial packet, is something that your app must just cope with. If you can improve things by introducing delays then that is good, but having to do so does NOT mean UDP is broken (other than that the Spark should not crash). I am not at all surprised that if you flood the network with UDP packets many of them get lost. There is no flow control in the UDP protocol. [The underlying sendto() may be blocking for TCP but ought not be for UDP, for which it needs to return immediately.] You must implement your own flow control. You can do this with application-level ACKs & NAKs or with delays, but even then you must expect some packets to go missing, to be duplicated, and to be received out of order.
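As a rough illustration of the bookkeeping an app has to do for itself, here is a sketch of a receiver that classifies each datagram as in order, arriving after a gap, or duplicate/late. It assumes the sender prepends a 16-bit sequence number that increments by one per datagram; that header and the SequenceTracker class are my own invention, not anything Spark or UDP provides.

```cpp
#include <cstdint>

// Hypothetical receiver-side bookkeeping. Assumes each datagram carries a
// 16-bit sequence number that the sender increments by one per datagram.
enum class Arrival { InOrder, Gap, DuplicateOrOld };

class SequenceTracker {
public:
    Arrival classify(std::uint16_t seq) {
        if (!started_) {
            // First datagram seen: accept whatever number it carries.
            started_ = true;
            expected_ = static_cast<std::uint16_t>(seq + 1);
            return Arrival::InOrder;
        }
        // Interpreting the 16-bit difference as signed handles the
        // wraparound from 65535 back to 0.
        const std::int16_t diff = static_cast<std::int16_t>(
            static_cast<std::uint16_t>(seq - expected_));
        if (diff < 0) return Arrival::DuplicateOrOld;
        // Datagrams skipped so far; a late arrival is not subtracted again.
        lost_ += static_cast<std::uint32_t>(diff);
        expected_ = static_cast<std::uint16_t>(seq + 1);
        return diff == 0 ? Arrival::InOrder : Arrival::Gap;
    }

    std::uint32_t lost() const { return lost_; }

private:
    bool started_ = false;
    std::uint16_t expected_ = 0;
    std::uint32_t lost_ = 0;
};
```

What the app does on Gap (request a resend, interpolate, ignore) is application policy; for sensor data it is often fine just to count the loss and carry on.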

All the UDP protocol guarantees is that the datagram arrives INTACT, OR NOT AT ALL. There is NO reliability of delivery, there is NO order guarantee. There is NO flow control. Some respondents to the OP do not seem to be bearing this in mind.

“Unreliable UDP/Understanding UDP limitations”? Yup, the UDP protocol is unreliable by design. (It’s not supposed to be so unreliable that it crashes the Spark!) The limitations are as I describe, and you must expect this. In return you are able to implement higher speed comms than is possible with the bytestream protocol TCP, packet boundaries are preserved, and you also get broadcasting.

So I have some UDP test code that works OK, but it sometimes gets watchdog timeouts as it accesses the cloud. I have noticed some difference with cores that have the latest TI CFOD patch and those that don’t, but the results are not crystal clear.

My test code uses the cloud to pass the IP address between two cores, a TX and an RX core, every 10 seconds. The TX starts when it gets an address. If you run this, you should pick a different event name to publish under, since there is a bug in which subscribe does not respect the MY_DEVICES flag. This bug is fixed in the staging server but has not been fully tested and has not made it into the production server yet.

These ran for over 52000 32-byte packets before the watchdog timed out.

Here is the TX code:

#define LOCALPORT 9000
#define REMOTEPORT 9001

#define BUFSIZE 32

UDP udp;
uint8_t packetBuffer[BUFSIZE];
uint8_t lastStart = 0;
bool once = true;
IPAddress remoteIP;

void setup() {
    Spark.subscribe("UDPIPAddr", setIP, MY_DEVICES);
    pinMode(D7,OUTPUT);
    digitalWrite(D7,LOW);
}

void loop() {
    if (remoteIP[0] != 0 && once) {
       once = false;
       udp.begin(LOCALPORT);
    }
    if (!once) {
        for(int i=0;i<BUFSIZE;i++) {
            packetBuffer[i] = ((i+lastStart)%96) + 32;
        }
        lastStart++;
        udp.beginPacket(remoteIP, REMOTEPORT);
        int nbytes = udp.write(packetBuffer, BUFSIZE);
        if (nbytes<0) {
            udp.stop();
            once = true;
            digitalWrite(D7,HIGH);
            delay(200);
            digitalWrite(D7,LOW);
        }
    }
}

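void setIP(const char * event, const char * data) {
    char dataStr[64];
    strncpy(dataStr, data, sizeof(dataStr) - 1);  // need a non-const copy for strtok
    dataStr[sizeof(dataStr) - 1] = '\0';          // strncpy does not always terminate
    int i = 0;
    char * p = strtok(dataStr, ".");
    while (p != NULL && i < 4) {  // an IPv4 address has exactly four octets
        remoteIP[i++] = atoi(p);
        p = strtok(NULL,".");
    }
}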

and here is the RX code:

#define LOCALPORT 9001
#define REMOTEPORT 9000

#define BUFSIZE 32

UDP udp;
uint8_t packetBuffer[BUFSIZE+1];
uint8_t lastStart = 32;
bool once = true;
IPAddress remoteIP;
int rxcount = 0;

unsigned long lastTime=0;

void setup() {
    Serial.begin(9600);
}

void loop() {
    if (once) {
        once = false;
        udp.begin(LOCALPORT);
    }
    if (!once) {
        int nbytes = udp.parsePacket();
        if (nbytes>=BUFSIZE) {
            udp.read(packetBuffer, BUFSIZE);
            packetBuffer[BUFSIZE] = '\0';
            Serial.println(rxcount++);
        }
    }
    
    unsigned long currTime = millis();
    if ( currTime-lastTime>10000) {
        lastTime = currTime;
        publishIP();
    }

}

void publishIP() {
    char printStr[64];
    IPAddress localAddr = Network.localIP();
    sprintf(printStr,"%d.%d.%d.%d", localAddr[0],localAddr[1],localAddr[2],localAddr[3]);
    Spark.publish("UDPIPAddr", printStr);
}

I know I am being picky, but using the word “streaming” while talking about UDP can contribute confusion. UDP is not a streaming protocol, it is a datagram protocol. TCP is the streaming protocol. I know that data streaming is often implemented using UDP but please note this streaming is not provided by the UDP protocol but by the application programmer doing packet sequence checking, and ACK & NAK retransmission requests and flow control, on top of and in addition to UDP.

Ordinarily I would allow this to pass, but we are talking about flaws in Spark UDP and we need to be clear about what the UDP protocol provides.

@psb777 Couldn’t agree with you more.

I think the frustrating part is that I don’t perceive any movement. The core supports tcp and udp, and udp does not work as intended. That is a huge amount of functionality down the drain. I am also opposed to resetting the core when the core loses connection with the cloud. There must be a more graceful way than stopping execution and starting over. Yes, I understand it is only a temporary solution to keep code executing, but temporary becomes permanent all too quickly. In many of my projects I am trying to use broadcast and discovery, and it is very hard to implement when the boundaries are lost. This means I have to include the source IP in the packet itself, and I can’t tell how many bytes each datagram is, so I have to send exactly the same number of bytes in every datagram.

All I see on waffle is that these UDP problems are way on the back burner, filed under ideas. CFOD, when you can’t run for more than 15 minutes without a reset, is a serious problem and makes the spark core unusable.

Spark UDP is fundamentally broken. As is becoming apparent there are major bugs in the code of the Spark UDP class but I think @bko also found a reference to one in the TI firmware.

I am not using fixed-length packets; instead I add a packet delimiter. Then, as if I were using TCP, I have to search for the delimiter and break up the incorrectly aggregated packets read by UDP.read(). But neither the packet delimiter nor fixed length fixes anything because, with UDP, you may not have control, and may not want control, over who talks to you on your UDP socket: you have no control over the length or the structure of the received packets. That either “solution” is taken here, on the forum, as an acceptable workaround, allowing the UDP issues to be given low priority, is a mistake. And the longer the situation continues the worse it becomes, as when Spark UDP is fixed it will likely break some existing code and users will squeal. Some UDP code I have seen quoted never even calls UDP.endPacket()!
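For anyone stuck with the delimiter workaround in the meantime, this is roughly what the splitting looks like. It is only a sketch: splitMessages is a hypothetical helper, and it assumes every sender terminates each message with the delimiter and that the delimiter never appears inside a payload, which, as said above, you cannot rely on when you do not control the senders.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: splits the bytes returned by one read into
// messages, assuming each message ends with `delim`.
// Bytes after the last delimiter stay in `pending` for the next call,
// so a message split across two reads is reassembled.
std::vector<std::string> splitMessages(std::string& pending,
                                       const std::uint8_t* data,
                                       std::size_t len,
                                       char delim = '\n') {
    pending.append(reinterpret_cast<const char*>(data), len);
    std::vector<std::string> messages;
    std::size_t start = 0;
    for (;;) {
        const std::size_t end = pending.find(delim, start);
        if (end == std::string::npos) break;
        messages.push_back(pending.substr(start, end - start));
        start = end + 1;  // skip past the delimiter itself
    }
    pending.erase(0, start);  // keep only the unfinished tail
    return messages;
}
```

This is exactly the kind of stream reassembly you would normally only need over TCP, which is the point: with datagram boundaries preserved, none of it would be necessary.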