Simple UDP program breaks the core

zachary · August 28, 2014, 10:24pm

Can I please just this entire thread?!? Thank you all so much. Looking forward to the outcome of this discussion!

gruvin · August 28, 2014, 11:00pm

THEORY VOID: The theory I had relating to this post has been voided by other activity. So the rest of this most likely no longer applies.

@SomeFixItDude … got a bit of a curve ball test for you.

Without going into detail just yet, I wonder if you could test your simple UDP packet application using tag:spark_6 firmware (all three core-* repositories – simply downloading the ZIPs from github seems the easiest way to get them) and the following precise version of the ARM-GCC compiler … gcc-arm-none-eabi-4_7-2013q3, which you can find here.

We can discuss the details more if my suspicion is validated by your code suddenly working as expected using this earlier platform. Otherwise we can just move on. Thanks.

Alternatively, I would be happy to run the test for you. I’ll just need your application.cpp and your node.js app, please. Thanks.

AndyW · August 28, 2014, 11:26pm

@gruvin, run your test with this code from a little back up this thread.

Be sure to point it at a non-existent host on the local subnet, so that ARPs will go unanswered & report back. That code is enough to demonstrate the problem.

gruvin · August 29, 2014, 1:32am

THEORY VOID: The theory I had relating to this post has been voided by other activity. So the rest of this most likely no longer applies.

Yes, saw that code. Thanks. But I'm more about testing the original problem, which so far as I can see did not involve a non-existent host and still CFOD'd anyway. I.e. I know about the ARP issue and I'm trying to (dis)prove a completely different theory. So, I'll need the node.js code, as well.

This (for me, right now) is to do with another CFOD fault, under completely different circumstances, but which I believe is linked to the upgrading to the arm-gcc v2014q2 compiler, circa tag:spark_7 -- which I'm still guessing was about utilising certain atomic RAM handling features. If both scenarios are fixed by regressing to compiler v2013q3 (and necessarily tag:spark_6) then I believe both issues will be much closer to finding the real cause. If not, then at least we get to cross something off the list.

Oh and for the record -- yes, I know that my theory sounds insane. But there's been a bunch of testing and proofs done around this, on the side. It's not worth delving into the details here unless this specific test proves something.

EDIT ...

For what it's worth, the fact that certain tests fail for some users while bringing out faults for others is also a characteristic of what I'm trying to nail down with this compiler thing. I believe we're seeing that with the other fault, also. I don't yet know if it's the compiler itself or code using a new feature therein or both -- or even differences in the Mac OS X versus Windows compile tool chain binaries, though that's next on my list of tests.

Meanwhile, the ARP issue is a definite candidate for the other fault, plus or minus the specific networks being used to test it ... except that I commented out code relating to that and saw no difference in that specific fault's behaviour (which was not this UDP thing.)

The length of this post is evidence enough that I shouldn't even try to explain all the background here, yet. Still too many cross-referencing loose ends -- and I could yet be proven completely wrong. {sigh}

AndyW · August 29, 2014, 2:06am

Fair dues.
Keep us posted, I had to bump my local compiler when the atomics were added. I’d be mildly surprised (not for the first or last time) if gcc-arm was badly borked, given the overwhelming number of arm devices out there and how many of them rely on GCC.

gruvin · August 29, 2014, 2:15am

Agreed.

Meanwhile, the very latest commit to master:core-firmware by someone (new since my last post here) has cured that other fault. So it's STOP PRESS and off to look at what was changed! I'll be back.

EDIT: Oh dear. The latest commit was 9 days ago! I don't understand what just happened. :-/ I've done several pulls today (just to be sure I was on the latest version) and yet some changes just got pulled in on that last one, only. Git can be so mysterious at times. I MUST have done something stupid and wrong.

This also voids my above theory. I'll edit those posts to save others going down a dead end, then.

AndyW · August 29, 2014, 7:39pm

@SomeFixItDude, were you able to validate the hypothesis ?

SomeFixItDude · August 29, 2014, 8:05pm

Fortunately and unfortunately I am on vacation and didn’t bring my toys. I will test it out on return! I hope the problem is only occurring when the destination host is unavailable. In normal working conditions the destination will be available. At the same time I don’t want the core to stop functioning if the other end goes down.

Fun side note: I absolutely must construct a remote control or beach cart that can auto-follow me with the spark core. LOL!

AndyW · August 29, 2014, 8:14pm

No problem.

Failing if a destination is unavailable is clearly unacceptable behaviour.

I have the test case running on a test vehicle with the SPI bus hooked up to an analyser/decoder, hope to have more info this weekend.

bko · August 30, 2014, 2:19am

Hi @SomeFixItDude and @AndyW

I noticed that the core multicast presence announcement does not bind the socket for listening–why should it, it is never going to read the results, so just don’t listen for them. I thought I would try this test case in that mode. So I wrote a version of the UDP object that does not bind in Udp.begin() and I tried your test case with that.

First I tried sending to a non-existent host (10.0.0.200) and that fails in the exact same way as the real UDP object. Six packets sent, long pause, write failure, etc.

So then I tried broadcasting to the local subnet on 10.0.0.255, but not reading my own packets back since I don’t bind the socket. This is working fine and has sent over 700 packets without failing.

I think this puts to bed my wrong idea that reading and writing data were somehow coupled–they clearly are not. But it strengthens the evidence for ARP, since the only difference between working and failing now is the address: if I use the broadcast address for the subnet, it works, but if I use an address that would require the core to use ARP to acquire a MAC address to send to, and there is no host there, then it fails.

I know folks tend to think of UDP sending as an “unconditional send”, but if it is not to a broadcast address, then it is really still point-to-point since the sender must have the MAC address of the receiver. We should see if the TI CC3000 is telling us it has no MAC address to send-to when it fails. Then we could propagate that message back to the user so the failure is less cryptic, but it is always going to fail a non-broadcast send to an unavailable host.
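To make the broadcast-versus-unicast distinction concrete, here is a minimal host-side sketch in Python (not Spark firmware). It only classifies a destination address the way an IP stack roughly would before sending; the /24 prefix and the 10.0.0.1 gateway are assumptions for illustration, since the thread only gives 10.0.0.200 and 10.0.0.255, and multicast is ignored.

```python
import ipaddress

LIMITED_BROADCAST = ipaddress.ip_address("255.255.255.255")

def link_layer_target(dest, network, gateway):
    """Classify a UDP destination and say which IP address needs a MAC.

    Returns (kind, address_to_resolve). For broadcasts nothing has to be
    resolved, because the frame goes to the broadcast MAC address.
    """
    net = ipaddress.ip_network(network)
    dst = ipaddress.ip_address(dest)
    if dst == net.broadcast_address or dst == LIMITED_BROADCAST:
        return ("broadcast", None)
    if dst in net:
        # Same subnet: the sender must ARP for the destination itself.
        return ("unicast-on-link", dst)
    # Different subnet: the sender ARPs for the gateway instead.
    return ("unicast-via-gateway", ipaddress.ip_address(gateway))

# Assumed network layout, for illustration only.
for address in ("10.0.0.255", "10.0.0.200", "192.168.5.9"):
    print(address, link_layer_target(address, "10.0.0.0/24", "10.0.0.1"))
```

Only the "unicast-on-link" case depends on the destination host answering an ARP request, which matches the one case that fails in the tests above.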

AndyW · August 30, 2014, 4:22am

Makes perfect sense. Maybe we can make UDP.begin() take a default arg that makes the code skip the bind(), if you are not going to listen. That would help simplify the broadcast send use case.
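For comparison, this is roughly what a send-only broadcast socket looks like on an ordinary host, sketched in Python rather than in the core firmware. The function names and port 9000 are hypothetical; the point is only that no bind() call is made, so the operating system picks the source port when the first packet is sent.

```python
import socket

def make_send_only_socket():
    """Create a UDP socket that may broadcast but is never bound for listening."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Broadcasting has to be enabled explicitly on most stacks.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    return sock

def announce(sock, payload, broadcast_addr="10.0.0.255", port=9000):
    """Send one datagram to the subnet broadcast address (port is hypothetical)."""
    return sock.sendto(payload, (broadcast_addr, port))
```

Usage would be something like `announce(make_send_only_socket(), b"hello")` on a host attached to the 10.0.0.x subnet.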

AndyW · September 1, 2014, 3:25am

I’m testing code that skips the bind() if you call udp.begin() [without a port #].

I don’t see any regressions and the code appears to do what you’d expect (e.g. core’s source port varies.)

However, if I send UDP packets to the core, with a destination port that matches the source port, it makes the core reboot in what appears to be exactly the same way as if I did the bind and sent the packets to the port number used in the bind. Note it doesn’t happen immediately, but seconds later; the delay I’ve observed ranges between 8 and 40 seconds.

So I’m not sure the inbound packet handling actually changes that much if I do not call bind().

However, spraying the core with UDP packets for ports that are unequivocally closed seems to cause no problem.
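A test like the one described here can be driven from a desktop machine with a few lines of Python. This is only a sketch of that kind of sender, not the tool actually used; the function name, payload size, and pacing are assumptions.

```python
import socket
import time

def spray(host, port, count=100, interval=0.01, payload=b"x" * 32):
    """Send `count` UDP datagrams to (host, port) and return how many were sent.

    Point `port` at the core's source port to reproduce the reboot case,
    or at a port that is known to be closed for the control case.
    """
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(count):
            sock.sendto(payload, (host, port))
            sent += 1
            time.sleep(interval)  # crude pacing between datagrams
    return sent
```

For example, `spray("10.0.0.50", 9000)` would send 100 small datagrams, where both the address and the port are placeholders for the core under test.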

Am decoding SPI traffic now to see if that sheds any light on the matter.

bko · September 1, 2014, 3:40am

The only thing not bind’ing buys you is that you don’t have to listen to your own broadcast packets and flush them. If someone else is also broadcasting, that could be a problem anyway since you would want to listen.

AndyW · September 1, 2014, 4:06am

There is nothing stopping a user calling begin() with a different number than the destination port #, which will mean that you would not receive a copy of the broadcasts.

Maybe if we just document that, it is effectively the same as omitting the bind(), because we still have a source port and it looks like the CC3000 wants to handle any inbound packets for that port - and we can quickly get into buffer hell. We can just train people to pass a different value to begin(), instead of using the same as the destination port.
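The same idea on an ordinary host, as a hedged Python sketch: bind to one local port and send to a different destination port, so that the sender never sees a copy of its own broadcasts. The ports 9001 and 9000 and the helper name are hypothetical, and this is host code rather than the core's UDP object.

```python
import socket

LOCAL_PORT = 9001  # hypothetical: the value you would pass to begin()
DEST_PORT = 9000   # hypothetical: the port the listeners are bound to

def open_udp(local_port):
    """Open a broadcast-capable UDP socket bound to `local_port` on all interfaces."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.bind(("", local_port))
    return sock

def broadcast(sock, payload, broadcast_addr="10.0.0.255", dest_port=DEST_PORT):
    """Send to the subnet broadcast address on a port other than our own."""
    return sock.sendto(payload, (broadcast_addr, dest_port))
```

Because the socket listens on `LOCAL_PORT` while the broadcast goes to `DEST_PORT`, the sender's own packets are not delivered back to it, but anything another host sends to `LOCAL_PORT` still would be.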

Now back to work on the behaviour for non-responsive/non-existent hosts.

SomeFixItDude · September 4, 2014, 2:06am

@AndyW Sending to an active ip does not seem to hang the core anymore. I am going to let it run for some time to verify this. It has been running for less than 10 minutes, but hey, it usually dies very fast. Maybe some firmware improvement fixed the problem where it was freezing on this same scenario. This is hopeful. Now we just need to figure out if we can make the core send to a non-existent ip without freezing up. As we all know UDP does not require an endpoint or an active ip. I would hate to have all my sensors freeze anytime I had to take down the destination server.

Thanks all for continuing to look at this problem.

SomeFixItDude · September 4, 2014, 2:30am

@bko as you know the failing code here is when no ip is active on the other end. Although I am broadcasting occasionally in my actual project, I am still unicasting for the majority of the time and the failing example is not broadcasting. So I feel like that does help somewhat but doesn’t really get anywhere with the current issue. How does the Arduino behave in this situation? Can anyone test that? I will be surprised if it doesn’t work just fine (as in continues to operate). I don’t want the core to freeze if the end server goes down. In TCP I would program around this condition. I am guessing the ARP is cached but for how long?

@AndyW @bko I think you both are dead on that this is an ARP issue. If this is the case where do we go from here? Can this somehow be handled in the host driver? Can this not be fixed without TI? If not, how does Spark get TI involved?

Thanks again everyone for checking this out.

BTW - Still running great against an active ip!

AndyW · September 4, 2014, 2:32am

Totally agree that the behaviour is unacceptable as is, and am actively working to find a fix/work around.

Thanks for the validation.

bko · September 4, 2014, 2:44am

UDP, like all IP protocols, does require the MAC address of the destination to be known before a packet can be sent to a non-broadcast address. This is what ARP does. But if the host has never existed or is not in your ARP cache, then there is no MAC address and the core (or any other host) will not be able to send anything to the missing computer. The router should be sending ICMP destination host unreachable messages back to the core. I am not sure the core is doing anything with these.

If the TI CC3000 was a full host, you could add a static ARP entry in the ARP cache for the sometimes-missing host, but I don't know of a way to do that here.

@AndyW would know best, but I believe that ARP is handled completely internally to the TI CC3000. Maybe he can find a clever way or at least find a way to propagate the ICMP destination unreachable messages back to user code.

AndyW · September 4, 2014, 3:28am

ICMP unreachable messages are a distraction, if only for the reason that they don’t apply for anything on the local subnet.

bko · September 4, 2014, 4:28am

You are right, of course–I had forgotten that! I just did a simple test with wireshark and netcat and the ARP goes unanswered and no UDP packets are ever sent.
