Simple UDP program breaks the core

Here is the modified version, maybe I missed something? This fails rapidly.

UDP Udp;

unsigned char TxMsg[12] = { 1, 254, 1, 254, 1, 254, 43, 212, 71, 184, 3, 252 };
unsigned char recbuf[12];


void setup() {
    Udp.begin(9000);
    pinMode(D7, OUTPUT);
    digitalWrite(D7, LOW);
}

void loop() {
    int32_t packetLen = Udp.parsePacket();
    
    //Dump any packets RX
    while (packetLen > 0) {
        Udp.read(recbuf, packetLen >= 12 ? 12 : packetLen);
        packetLen = Udp.parsePacket();
    }
    
    Udp.beginPacket(IPAddress(10,0,0,2), 9000);
    
    int wrBytes = 0;
    wrBytes = Udp.write(TxMsg, 12);
    Udp.endPacket();

    if (wrBytes==-1) {
        Udp.stop();
        Udp.begin(9000);
    }

    digitalWrite(D7, HIGH);
    delay(200);
    digitalWrite(D7, LOW);
    delay(200);
}

Thanks again.

Looks like this thread is headed to the depths of the void of unresolved problems. Is there any movement on the non-cloud network functionality becoming stable? Have people been hired at Spark to fix the many instabilities? I haven’t seen much movement on issues and I am starting to lose faith in the product. That simple test program should last more than 5 minutes.

Thanks

I can appreciate the program shouldn’t crash, but does this represent a realistic use case?

It looks like the program produces packets, which it receives, and then discards them (assuming 10.0.0.2 is the core’s IP address.) Is that correct?

If so, is there a test UDP client that we can use so the test is more typical of a real world scenario with separate UDP peers?

Thanks for replying @mdma

Yes, it is a realistic use case. For example, someone may want to send sensor data for every reading to a central server. This data is being read from the sensor pretty rapidly and I do not want to send it through the cloud. I don’t want the latency, the restrictions, or the dependency on Spark’s service being up and running. The parsing of received packets does not need to be done; that was added as voodoo in some magic hope it would stay alive for more than 5 minutes. In fact, no one is sending any network data to the core, so it is not needed. We shall remove it and just stick with sending. The 200 ms delay was also added to simulate the time until the next sensor reading; this delay is actually longer than the interval between actual readings. The array in the example is actually one sample reading that I hard-coded to debug the problem.

The core’s IP address is 10.0.0.13, assigned by DHCP. I am sending packets to IP address 10.0.0.2. This is UDP: no one needs to be at the destination IP address, and, just to be clear, I am not sending to the originating IP. In my case it fails whether the destination IP is active or not. Like I stated, we are talking UDP; there is no handshake or session and no one needs to be on the other end. So whether someone is active on that IP is really a moot point.

Updated test program:

UDP Udp;

unsigned char TxMsg[12] = { 1, 254, 1, 254, 1, 254, 43, 212, 71, 184, 3, 252 };
unsigned char recbuf[12];


void setup() {
    Udp.begin(9000);
    pinMode(D7, OUTPUT);
    digitalWrite(D7, LOW);
}

void loop() {
    int32_t packetLen = Udp.parsePacket();

    Udp.beginPacket(IPAddress(10,0,0,2), 9000);

    int wrBytes = 0;
    wrBytes = Udp.write(TxMsg, 12);
    Udp.endPacket();

    if (wrBytes==-1) {
        Udp.stop();
        Udp.begin(9000);
    }

    digitalWrite(D7, HIGH);
    delay(200);
    digitalWrite(D7, LOW);
    delay(200);
}

Thanks again @mdma, glad to see another set of eyes looking at it.

Edit: The end goal is to have many cores sending UDP packets to one central server. I can’t get that to work, so I am just trying to get one to work.

So I am responsible for the broadcast “voodoo” because with the original code, it fails for me after 6 tries but with the “voodoo” it typically runs more than 50 tries. I believe that is a clue to debugging it. My theory is that something in the driver needs to receive data in order to transmit data.

Also, I have NTP code that uses UDP and I have run it in a similar loop, and it does not fail running overnight. So I have been trying to isolate what is unique about this failing case, and I think it is the transmit-only nature of this code. With NTP you send one packet and receive one packet. I also tried increasing the packet size in the test code, since NTP uses 48-byte packets. The packet size above is smaller than the minimum Ethernet packet and so would require padding after the checksum.
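
To make the padding remark concrete, here is a rough calculation, assuming plain IPv4 with no options (20-byte header), an 8-byte UDP header, and the 46-byte minimum Ethernet payload. The function name `ethernetPadBytes` and the constants are made up for illustration, and this only describes a wired Ethernet segment; nothing in this thread shows that padding is involved in the failure.

```cpp
#include <cassert>

// Assumed header sizes: IPv4 without options, and UDP.
constexpr int kIpv4HeaderBytes = 20;
constexpr int kUdpHeaderBytes = 8;
// A 64-byte minimum Ethernet frame minus the 14-byte header and 4-byte FCS.
constexpr int kMinEthernetPayloadBytes = 46;

// Number of pad bytes an Ethernet sender would add for a given UDP payload.
constexpr int ethernetPadBytes(int udpPayloadBytes) {
    return (udpPayloadBytes + kUdpHeaderBytes + kIpv4HeaderBytes) >= kMinEthernetPayloadBytes
               ? 0
               : kMinEthernetPayloadBytes -
                     (udpPayloadBytes + kUdpHeaderBytes + kIpv4HeaderBytes);
}
```

Under these assumptions the 12-byte test payload would need 6 bytes of padding, while a 48-byte NTP packet would need none.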

Six is a magic number for the TI CC3000: there are 7 sockets in total and one is used for the Spark cloud connection, leaving 6. Hard to know if that is related or not, but it is curious. I spent some time trying to see if the code was using a new socket for every try, but it did not seem to be. Still worth investigating.

When the code above fails, Udp.write() returns -1 indicating it failed, so I had @SomeFixItDude read the return values and take action in that case. You should also check the return value of Udp.beginPacket() which can return zero if the WLAN is not available or if gethostbyname fails, but I don’t think that is happening here. Udp.begin() also returns a value indicating if the bind failed. It is just good practice to check all return values to program defensively.
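
A minimal sketch of that defensive pattern, assuming only what is described above: `beginPacket()` returns zero on failure and `write()` returns -1 on failure. `sendChecked` and `SendResult` are hypothetical names, and `FakeUdp` is a stand-in so the sketch compiles off-device; on a Core you would pass the real `Udp` object and an `IPAddress` instead. It is not a fix for the underlying problem, only a way to see which call fails.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the Core's UDP object (off-device testing only).
struct FakeUdp {
    int  failWrites = 0;    // how many upcoming write() calls should fail
    int  rebinds = 0;       // how many times stop() was called
    bool bound = false;

    int  begin(uint16_t) { bound = true; return 1; }
    void stop() { bound = false; ++rebinds; }
    int  beginPacket(const char*, uint16_t) { return bound ? 1 : 0; }
    int  write(const unsigned char*, int len) {
        if (failWrites > 0) { --failWrites; return -1; }
        return len;
    }
    int  endPacket() { return 1; }
};

enum class SendResult { Ok, BeginFailed, WriteFailed };

// Sends one datagram and re-binds the socket whenever a step reports failure.
template <typename UdpT, typename AddrT>
SendResult sendChecked(UdpT& udp, const AddrT& dest, uint16_t destPort,
                       uint16_t localPort, const unsigned char* data, int len) {
    if (udp.beginPacket(dest, destPort) == 0) {   // zero means failure
        udp.stop();
        udp.begin(localPort);
        return SendResult::BeginFailed;
    }
    int written = static_cast<int>(udp.write(data, len));
    udp.endPacket();
    if (written != len) {                         // -1 (or a short write)
        udp.stop();
        udp.begin(localPort);
        return SendResult::WriteFailed;
    }
    return SendResult::Ok;
}
```

In `loop()` this would be called as something like `sendChecked(Udp, IPAddress(10,0,0,2), 9000, 9000, TxMsg, 12)`, with the result printed over Serial so the success/failure pattern is visible.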

When you check the return values, it fails in a pattern of success followed by failure and a long timeout, then success again.

I think if you had the time to set up a fully instrumented local debug build, you could try to debug it effectively. Lots of people on the TI forum are saying that UDP on the TI part is far from perfect, but I have hopes that it can be improved in this case. Certainly the NTP results are encouraging.

Thank you both for the information and clarity in your writing - I understand basically that UDP is connectionless, but I’m not as familiar with UDP as I am with other protocols, so the details were helpful. I will try running this test case and see if I can probe about and cause any change.

I see TCP is not totally reliable either. Fortunately, due to timeouts and such, TCP can recover in most cases after a delay. But my gut says there could well be a common cause for both protocol instabilities, such as buffer starvation.

Great to hear that the NTP results are more encouraging @bko, perhaps this could hint at a timing issue?

I don't think it is a timing or race condition per se. You can slow down the UDP transmit to one packet every 2 seconds and it still fails in the same way.

Glad to have a fresh set of eyes looking at it! I am a little busy with other things right now.

Thanks guys for taking another crack at it! Side note: people’s tone in forums is often misconstrued. I am a little frustrated, but I am more sitting here with my hands in the air, crap what do I do now, as opposed to being mad.

@bko I think you are on to something with having to read packets. What makes that cause the program to run longer? How do I properly check the return code of the begin packet call and recover from a bad value there? I would like to add that, or if you could add it to the last version of the code that would be awesome. (Either way it would be helpful.) Also, in your NTP program do you send to an IP address or does the core resolve the host name? Maybe that is another difference? I don’t know what to say about adding the voodoo discard-packet logic. I don’t know if keeping it in hides a problem or directs us to one. I am at a loss there. I do appreciate the thought though.

@mdma I don’t know how to check whether buffers are the problem or not. Any help here would be great. Just a small tidbit on UDP that you probably already know: with no handshake or session, the core should basically transmit the packet and move on its merry way. And when I test the code above, the packets are transmitted. Maybe after a packet is transmitted it is somehow still in the buffer? This is where maybe you are on to something.

At one point I believe I posted some debug output, I’ll try to hunt it down. I believe Brian pointed out there was an unsolicited response from the cc3k. Maybe some better recovery needs to be done here? I’ll hunt it down.

Anyway thanks guys for helping.

@SomeFixItDude - I took your code, and read through the UDP class in the firmware. It all checks out.

I added some serial commands to your code in a test program here: https://gist.github.com/m-mcgowan/3195ac229d672593e0eb

It uses Serial to dump any critical info (like Udp.write() failing) and allows the user control over wifi, the cloud, binding to the UDP socket, and reading from the socket.

Here’s the test:

  1. start the code
  2. open a serial monitor
  3. hit ‘w’ to connect to wifi
  4. hit ‘b’ to bind to the socket
  5. observe the D7 led flash 12 times, then a slightly longer pause before the last breath of a 13th flash, then nothing, and the main loop is unresponsive.

When I get more time, I’ll pull out the JTAG debugger and see exactly where the code is hanging up.

As another test, I disabled the wifi before the loop locked up. So I wonder if some of these SOSs are caused by intermittently losing the wifi connection?

@SomeFixItDude, I guess I have good news and bad news - I’m running your test case here and it is not crashing - just keeps cranking out the packets - I’m watching them show up on another box using tcpdump.

I made two changes:

  • My network is not 10.0.0.x, mine is 192.168.121.x - is there any practical way for you to get a core on a class C RFC 1918 network (like 192.168.x.x/24) ?

I know it sounds stupid, but there have been cases where the CC3000 behaved differently on 10.x.y.z networks (don’t ask me why.)

If you can’t do that, I can try and put this core on a test network configured to be 10.x.y.z/24 (or /16 or /8, whatever).

  • I also added the “if D0 == High spin calling SPARK_WLAN_loop()” code to ensure I could OTA flash it.

I believe I’m running the latest, post CFOD-patch CC3000 firmware - I’ll leave it running and see how it fares for a few hours, then I’ll try loading it up on another core and testing there too.

Here’s my datapoint - I have a patched core, and a class C 192.168.x.x network, and for me the core reliably hangs after 12 LED flashes.

Why the extra code for OTA flash? As you know the wlan loop is called so long as the user loop is running.

It definitely clobbers OTA flash, so I added that code. This fits my theory about running out of sockets, but I have not been able to prove that. There are some related things, I think, like Spark.publish() that can clobber OTA updates. Yet another mystery to figure out!

Interesting about the Class C network! Yet another variable to test. I have been testing on a very lightly loaded 10.0.0.x network and it consistently fails.

Yes, the plot thickens!

Obviously if everything works, you don’t need the D0 code, but if my test had hung - then I wanted a way to easily recover, and @BDub’s code snippet seemed useful.

I’ll perform more testing on other cores and try and set up a different network too.

OK - I believe the problem is ARP (an area that has proven to be the CC3000’s Achilles Heel).

If you send packets to a host that exists, or to something that is reachable via the default gateway off the local subnet, my testing suggests that you can send packets forever without a problem.

However, if the designated host is on the local subnet and not available, or the default gateway is unavailable (i.e. the ARP requests go unanswered), the app will die after ~12 iterations, as reported by @mdma and others.
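
As a rough illustration of the distinction drawn above (not the CC3000’s actual logic, which runs inside the module and is not visible from user code), a host normally decides which address to ARP for by comparing the destination with its own subnet. The sketch assumes a /24 mask and a gateway at 10.0.0.1, neither of which is stated in this thread; `ipv4`, `onLocalSubnet`, and `arpTarget` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Packs four octets into a host-order 32-bit address.
constexpr uint32_t ipv4(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    return (a << 24) | (b << 16) | (c << 8) | d;
}

// True when dest shares the network prefix of self under the given mask.
constexpr bool onLocalSubnet(uint32_t self, uint32_t dest, uint32_t mask) {
    return (self & mask) == (dest & mask);
}

// The address whose MAC must be resolved before the packet can leave:
// the destination itself if it is local, otherwise the default gateway.
constexpr uint32_t arpTarget(uint32_t self, uint32_t dest,
                             uint32_t mask, uint32_t gateway) {
    return onLocalSubnet(self, dest, mask) ? dest : gateway;
}
```

With the core at 10.0.0.13 and these assumed settings, a packet to 10.0.0.2 needs an ARP reply from 10.0.0.2 itself, so an absent host at that address leaves the request unanswered; a packet to an off-subnet address only needs the gateway to answer.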

I have this in captivity on a core with the SPI bus pinned out for debug, so I can see if the CC3000 is trying to say it has run out of buffers while it is waiting for the ARP response (that is never destined to arrive).

Can others please verify this behaviour (I’m thinking specifically of @bko & @mdma, but all datapoints are welcome.)

In related news, even @bko’s D0 clause doesn’t seem to work correctly, because the core is now effectively unflashable OTA.

Hi @AndyW

To reflash your core, tie D0 high and hit reset. That is why the first thing in loop() should be the infinite loop.

I can easily believe this is ARP related and that also explains why my NTP tests never fail–there is always a live host to talk to.

Yeah - I could see what the code was supposed to do, I’m just saying it didn’t work for me in this case.

Maybe it was a bad jumper wire, who knows.

It would be great to know if it is calling back with an "out of buffers" message. If so, the next step is to see if there is anything we can do about that in the host driver, which, sadly, seems unlikely.

But if an "out of buffers" message is sent, then we could have that propagate to the upper layers of firmware code to say "disconnect all your stuff, I need the resources", which could temporarily disconnect from wifi/cloud to free up buffers.

I’ll try a couple of tests tonight as well. I have a host sitting on 10.0.0.3; I’ll point it over there and see if it fails. Also, perhaps a test to 10.0.0.255 is in order as well? I wouldn’t expect an ARP request for a broadcast address. I know we will have to dump incoming traffic, since we will pick up our own broadcast then. @mdma if you send to an active host, does your core stay up as well?
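
Whether 10.0.0.255 is really the broadcast address depends on the netmask, which is not given in this thread. Here is a small sketch, assuming a /24 mask (255.255.255.0), of how the directed broadcast address is derived; `packIp`, `directedBroadcast`, and `isBroadcast` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Packs four octets into a host-order 32-bit address.
constexpr uint32_t packIp(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    return (a << 24) | (b << 16) | (c << 8) | d;
}

// Network prefix with every host bit set to 1.
constexpr uint32_t directedBroadcast(uint32_t addr, uint32_t mask) {
    return (addr & mask) | ~mask;
}

// True for the limited broadcast (255.255.255.255) or the subnet's
// directed broadcast address.
constexpr bool isBroadcast(uint32_t dest, uint32_t self, uint32_t mask) {
    return dest == 0xFFFFFFFFu || dest == directedBroadcast(self, mask);
}
```

A conventional IP stack maps a broadcast destination straight to the Ethernet broadcast MAC without an ARP lookup, which is why this test could help separate ARP from other causes. Whether the CC3000 behaves that way is what the test would have to show.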

Thanks Guys!

I think it has already been established that as long as you consume the packets echoed to the core, broadcast UDP will work. Which matches the rest of the evidence. (Re)validating that would be good.
