Simple UDP program breaks the core

@bko @AndyW Thanks again for checking this out. Perhaps there needs to be a timeout on the ARP request. Also remember that the ARP request only occurs for addresses on the local subnet. When unicasting to an address on, say, the internet, the core should not be making an ARP request. I don’t know what the current behavior is.

Thanks

When the core is trying to talk to anything off the local subnet, it will go through the default gateway.

For which it will also need to ARP.

If that default gateway is not present and unable to respond to ARP requests for any reason, we would see the same (suboptimal) behaviour.

Therefore both local and remote use cases are susceptible to this problem.
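To make the point above concrete, here is a minimal sketch of the usual next-hop rule: compare the destination with the local address under the netmask, and ARP for either the destination or the default gateway. The helper names `ip4` and `arp_target` are hypothetical and are not part of the Spark firmware or the CC3000 host driver. How the CC3000 implements this internally is not visible to us.

```cpp
#include <cstdint>

// Pack four octets into a host-order 32-bit IPv4 address.
// (Hypothetical helper, for illustration only.)
constexpr uint32_t ip4(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return (uint32_t(a) << 24) | (uint32_t(b) << 16) |
           (uint32_t(c) << 8)  |  uint32_t(d);
}

// Return the address that has to be resolved with ARP before a packet
// to `dest` can leave the interface: the destination itself when it is
// on the local subnet, otherwise the default gateway.
constexpr uint32_t arp_target(uint32_t local, uint32_t netmask,
                              uint32_t gateway, uint32_t dest) {
    return ((local & netmask) == (dest & netmask)) ? dest : gateway;
}
```

With an assumed local address of 192.168.121.10, a netmask of 255.255.255.0 and a gateway of 192.168.121.1, a packet to 192.168.121.11 needs an ARP reply from 192.168.121.11 itself, while a packet to 8.8.8.8 needs one from the gateway. Either way, an ARP request that never gets answered blocks the send.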

Hey @AndyW, any luck?

Thanks,
Brian

Sorry for going dark - been busy.

No obvious silver bullet yet. Here is a partial decode of the SPI transaction when the target exists:

+40.684445   984 send data fd 1, flags 0, length 12: 01 fe 01 fe 01 fe 2b d4 47 b8 03 fc
                                            to: 02 00 23 28 c0 a8 79 33
+40.684630   985 sendto status fd 1, length 12
+40.884214   986 select command 1 0 0001 0000 0000 0.005000
+40.889480   987 select status 0000, rd 0000, wr 0000, ex 0000
+40.889565   988 send data fd 1, flags 0, length 12: 01 fe 01 fe 01 fe 2b d4 47 b8 03 fc
                                            to: 02 00 23 28 c0 a8 79 33
+40.889744   989 sendto status fd 1, length 12
+40.890600   990 free buffers event 0:2
+41.089337   991 select command 1 0 0001 0000 0000 0.005000
+41.094598   992 select status 0000, rd 0000, wr 0000, ex 0000
+41.094683   993 send data fd 1, flags 0, length 12: 01 fe 01 fe 01 fe 2b d4 47 b8 03 fc
                                            to: 02 00 23 28 c0 a8 79 33
+41.094869   994 sendto status fd 1, length 12

It will happily continue sending these for days. You can see the sendto() messages (packet numbers 984, 988 and 993) and the responses from the CC3000 acknowledging receipt (packet numbers 985, 989 and 994).
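For anyone reading along, the 8 bytes after "to:" look like a socket address. The sketch below decodes them under that assumption: a 2-byte address family stored little-endian, a 2-byte port in network byte order, then the 4 IPv4 octets. The layout is inferred from the trace, not taken from the CC3000 documentation, and `Endpoint` and `decode_to_field` are hypothetical names.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <string>

// Decoded view of the 8-byte "to:" field (assumed layout, see above).
struct Endpoint {
    uint16_t family;  // assumed little-endian; 2 would be AF_INET
    uint16_t port;    // assumed network byte order (big-endian)
    std::string ip;   // dotted-quad form of the last four bytes
};

Endpoint decode_to_field(const std::array<uint8_t, 8>& b) {
    Endpoint e;
    e.family = static_cast<uint16_t>(b[0] | (b[1] << 8));
    e.port   = static_cast<uint16_t>((b[2] << 8) | b[3]);
    char buf[16];
    std::snprintf(buf, sizeof buf, "%u.%u.%u.%u",
                  unsigned(b[4]), unsigned(b[5]),
                  unsigned(b[6]), unsigned(b[7]));
    e.ip = buf;
    return e;
}
```

Fed the bytes 02 00 23 28 c0 a8 79 33 from the trace, this gives family 2, port 9000 and address 192.168.121.51.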

The failing case (no ARP) looks like this:

 +6.807132   218 send data fd 1, flags 0, length 12: 01 fe 01 fe 01 fe 2b d4 47 b8 03 fc
                                            to: 02 00 23 28 c0 a8 79 0b
 +6.807317   219 sendto status fd 1, length 12
 +6.807645   220 sendto status fd 1, length 4294967238
 +7.006903   221 select command 1 0 0001 0000 0000 0.005000
 +7.012174   222 select status 0000, rd 0000, wr 0000, ex 0000
 +7.012258   223 send data fd 1, flags 0, length 12: 01 fe 01 fe 01 fe 2b d4 47 b8 03 fc
                                            to: 02 00 23 28 c0 a8 79 0b
 +7.012444   224 sendto status fd 1, length 12
 +7.012780   225 sendto status fd 1, length 4294967238

Here you can see a similar pattern, but there is an additional reply in each case (packet numbers 220 and 225), where the CC3000 is sending -58 as an errno (the length of 4294967238 shown in the trace is -58 read as an unsigned 32-bit value). That’s a bit of an odd errno, but it crops up in some source code as -EDEADLK (would deadlock).
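The arithmetic behind that -58 is just a two's-complement reinterpretation of the 32-bit field. A small sketch, assuming the status field is 32 bits wide and carries either a byte count or a negative error code (the helper name `as_signed` is hypothetical):

```cpp
#include <cstdint>

// Reinterpret a raw 32-bit status/length field as a signed value.
// Written out by hand so the result does not depend on how the
// compiler converts out-of-range unsigned values to signed.
constexpr int32_t as_signed(uint32_t raw) {
    return raw <= 0x7FFFFFFFu
        ? static_cast<int32_t>(raw)
        : -static_cast<int32_t>(~raw) - 1;
}

// 4294967238 = 2^32 - 58, so the signed reading is -58.
static_assert(as_signed(4294967238u) == -58, "trace value decodes to -58");
// An ordinary reply (12 bytes sent) is unchanged.
static_assert(as_signed(12u) == 12, "positive lengths pass through");
```

This only shows how the number is read. What -58 actually means to the CC3000 is still a guess.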

Now, since it follows the standard replies (packet numbers 219 and 224), I think the host driver will just ignore them, which is fine - this is UDP and there are no guarantees, remember. So I have a hunch the CC3000 is trying to do the right thing.

I’m still working it, but since the CC3000 seems to stay responsive, I’m still optimistic that we can find a way to handle this.

So I am not sure if this hurts the ARP theory or not, but when I modify the test code to:

Udp.begin(9000);
Udp.beginPacket(IPAddress(10,0,0,21), 9000);
Udp.write(TxMsg, 12);
Udp.endPacket();
Udp.stop();

The core doesn’t die with this modification. I have also changed the IP address to a non-existent host, and the core continues to function. Mind you, I am not trying to work this way normally. I had a different project where I only wanted to send UDP data out, and I pulled it off using this method with great stability. I tested:

  1. target host never on an ip.
  2. target host active on ip.
  3. target host active and then disconnected from network.

Does this change any of our theories?

I don’t think it invalidates the theory, because you’re closing the socket.

But hey, if it works - it’s slightly higher overhead, but there’s no setup/teardown protocol for UDP, so the costs are purely local and pretty minimal.

In my other project where I am sending rapidly, I was avoiding the open and closing of the socket as much as possible.

Has anyone on the Spark Team made any progress on correcting this behavior?

GitHub issue link: https://github.com/spark/firmware/issues/240 . The issue is still marked as new, with no visible movement.

I thought I saw that people were hired and were looking for use cases with problems.

I just can't tell if this issue is falling through the cracks. I hope Spark Core v2 is not using the CC3K.

Hey @SomeFixItDude, thanks for following up.

If you check out GitHub’s network view, you can see just how much amazing work is getting done on the firmware since we hired @BDub & @mdma.

UDP and TCP issues are wrapped up in a lot of the refactoring and fixes we’re doing lately. We’ll be releasing new firmware with a new architecture in the next week or so. After that release, please test again and ping that issue with your results. With 76 open issues in the firmware repo, we need people advocating for the issues that are most important to them.

Also, of course, just as a reminder, it’s open source, so anyone can work on a fix, not just internal Spark folks. :smile:

Cheers!

@zachary, great, thanks for the update. I look forward to the new architecture coming soon. Does this new architecture include a non-blocking CC3K driver?

I would love to lend a hand finding fixes and would be more than happy to dig in. However, the host driver code is beyond my capabilities. At work I program at a layer much more abstracted from the hardware. Coding with microcontrollers has brought me much closer to low-level interactions with hardware, and it has been fascinating and frustrating. Hey, maybe with enough Spark Core tinkering I’ll get there. So for now I am at the mercy of other users and Spark team members to look into this issue. I’ll retry the test program after the firmware update comes out.

Thanks Again.

Nope.

Check out the feature/hal branch. You'll see a reorganized file structure, better testing, and a much cleaner separation of hardware-dependent and hardware-independent code.