Looks like this thread is headed for the void of unresolved problems. Is there any movement on the non-cloud network functionality becoming stable? Have people been hired at Spark to fix the many instabilities? I haven't seen much movement on issues and I am starting to lose faith in the product. That simple test program should last more than 5 minutes.
Yes, it is a realistic use case. For example, someone may want to send sensor data for every reading to a central server. This data is being read from the sensor pretty rapidly and I do not want to send the data through the cloud. I don't want the latency, the restrictions, or the dependency on Spark's service being up and running. The parsing of received packets does not need to be done; that was added as voodoo in some magic hope it would stay alive for more than 5 minutes. In fact no one is sending any network data to the core, so it is not needed. We shall remove it and just stick with sending. The 200 ms delay was added to simulate the time until the next sensor reading; this delay is actually longer than the interval between actual readings. The array in the example is one sample reading that I hard-coded to debug the problem.
The core's IP address is 10.0.0.13, assigned by DHCP. I am sending packets to IP address 10.0.0.2. This is UDP, so no one needs to be at the destination IP address, and just to be clear, I am not sending to the originating IP. In my case it fails either way, whether the IP is active or not. Like I stated, we are talking UDP: there is no handshake or session and no one needs to be on the other end. So really, whether someone is active on that IP is a moot point.
Thanks again @mdma, glad to see another set of eyes looking at it.
Edit: The end goal is to have many cores sending UDP packets to one central server. I can't get that to work, so I am just trying to get one to work.
So I am responsible for the broadcast "voodoo", because with the original code it fails for me after 6 tries, but with the "voodoo" it typically runs more than 50 tries. I believe that is a clue to debugging it. My theory is that something in the driver needs to receive data in order to transmit data.
Also, I have NTP code that uses UDP and I have run it in a similar loop, and it does not fail running overnight, so I have been trying to isolate what is unique about this failing case, and I think it is the transmit-only nature of this code. With NTP you send one packet and receive one packet. I also tried increasing the packet size in the test code, since NTP uses 48-byte packets. The packet size above is smaller than the minimum Ethernet packet and so would require padding after the checksum.
Six is a magic number for the TI CC3000: there are 7 sockets in total and one is used for the Spark cloud connection, leaving 6. Hard to know if that is related or not, but it is curious. I spent some time trying to see if the code was using a new socket for every try, but it did not seem to be. Still worth investigating.
When the code above fails, Udp.write() returns -1 indicating it failed, so I had @SomeFixItDude read the return values and take action in that case. You should also check the return value of Udp.beginPacket(), which can return zero if the WLAN is not available or if gethostbyname fails, but I don't think that is happening here. Udp.begin() also returns a value indicating if the bind failed. It is just good practice to check all return values to program defensively.
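For anyone wanting a starting point, here is a minimal sketch of that kind of defensive checking. It only encodes the two failure values mentioned above (beginPacket() returning zero, write() returning -1). `FakeUdp`, `SendResult` and `sendReading` are hypothetical names made up for illustration, and the fake exists only so the logic can be compiled and exercised off-device; on a core you would pass the real UDP object and an address type it accepts, and the exact return types there may differ.

```cpp
#include <cstddef>
#include <cstdint>

// Outcome of one send attempt, so the caller can decide how to recover.
enum class SendResult { Ok, BeginPacketFailed, WriteFailed };

// Hypothetical stand-in for the core's UDP object so this compiles off-device.
// It only mimics the failure values mentioned in this thread:
// beginPacket() returning 0 and write() returning -1.
struct FakeUdp {
    bool failBegin = false;
    bool failWrite = false;
    int packetsSent = 0;

    int beginPacket(const uint8_t (&)[4], uint16_t) { return failBegin ? 0 : 1; }
    int write(const uint8_t*, std::size_t len) {
        return failWrite ? -1 : static_cast<int>(len);
    }
    int endPacket() { ++packetsSent; return 1; }
};

// Send one reading and report which step failed, if any.
// Works with any object exposing beginPacket/write/endPacket.
template <typename Udp, typename Addr>
SendResult sendReading(Udp& udp, const Addr& ip, uint16_t port,
                       const uint8_t* data, std::size_t len) {
    if (udp.beginPacket(ip, port) == 0) {
        return SendResult::BeginPacketFailed;  // WLAN down or name lookup failed
    }
    if (static_cast<int>(udp.write(data, len)) < 0) {
        return SendResult::WriteFailed;        // the -1 case seen in this thread
    }
    udp.endPacket();
    return SendResult::Ok;
}
```

The caller can then log the result over Serial and decide whether to back off, retry, or rebind, rather than blindly carrying on after a failed write. Udp.begin() is left out because the thread does not say which value signals a failed bind.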
When you check the return values, it fails in a pattern of success followed by failure and a long timeout, then success again.
I think if you had the time to set up a fully instrumented local debug build, you could try to debug it effectively. Lots of people on the TI forum say that UDP on the TI part is far from perfect, but I have hopes that it can be improved in this case. Certainly the NTP results are encouraging.
Thank you both for the information and clarity in your writing - I understand basically that UDP is connectionless, but I'm not as familiar with UDP as I am with other protocols, so the details were helpful. I will try running this test case and see if I can probe about and cause any change.
I see TCP is not totally reliable either. Fortunately, due to timeouts and such, TCP can recover in most cases after a delay. But my gut says there could well be a common cause for both protocol instabilities, such as buffer starvation.
Great to hear that the NTP results are more encouraging @bko, perhaps this could hint at a timing issue?
I don't think it is a timing or race condition per se. You can slow down the UDP transmit to one packet every 2 seconds and it still fails in the same way.
Glad to have a fresh set of eyes looking at it! I am a little busy with other things right now.
Thanks guys for taking another crack at it! Side note - people's tone in forums is often misconstrued. I am a little frustrated, but I am more sitting here with my hands in the air thinking "crap, what do I do now" than being mad.
@bko I think you are on to something with having to read packets. What makes that cause the program to run longer? How do I properly check the return code from begin packet and recover from a bad value there? I would like to add that, or if you could add it to the last version of the code, that would be awesome. (Either way it would be helpful.) Also, in your NTP program, do you send to an IP address or does the core resolve the host name? Maybe that is another difference? I don't know what to say about adding the voodoo discard-packet logic. I don't know if keeping it in hides a problem or directs us to one. I am at a loss there. I do appreciate the thought though.
@mdma I don't know how to check whether buffers are the problem or not. Any help here would be great. Just a small tidbit on UDP that you probably already know: with no handshake or session, the core should basically transmit the packet and move on its merry way. And when I test the code above, the packets are transmitted. Maybe after a packet is transmitted it is somehow still in the buffer? This is where maybe you are on to something.
At one point I believe I posted some debug output; I'll try to hunt it down. I believe Brian pointed out there was an unsolicited response from the CC3000. Maybe some better recovery needs to be done here?
It uses Serial to dump any critical info (like UDP.write() failing) and allows the user control over wifi, the cloud, binding to the UDP socket, and reading from the socket.
Here's the test:
start the code
open a serial monitor
hit "w" to connect to wifi
hit "b" to bind to the socket
observe the D7 LED flash 12 times, then a slightly longer pause before the last breath of a 13th flash, then nothing, and the main loop is unresponsive.
When I get more time, I'll pull out the JTAG debugger and see exactly where the code is hanging up.
As another test, I disabled the wifi before the loop locked up. So I wonder if some of these SOSs are caused by intermittently losing the wifi connection?
@SomeFixItDude, I guess I have good news and bad news - I'm running your test case here and it is not crashing - it just keeps cranking out the packets - I'm watching them show up on another box using tcpdump.
I made two changes:
My network is not 10.0.0.x, mine is 192.168.121.x - is there any practical way for you to get a core on a class C RFC 1918 network (like 192.168.x.x/24)?
I know it sounds stupid, but there have been cases where the CC3000 behaved differently on 10.x.y.z networks (don't ask me why).
If you can't do that, I can try to put this core on a test network configured to be 10.x.y.z/24 (or /16 or /8, whatever).
I also added the "if D0 == High, spin calling SPARK_WLAN_loop()" code to ensure I could OTA flash it.
I believe I'm running the latest, post-CFOD-patch CC3000 firmware - I'll leave it running and see how it fares for a few hours, then I'll try loading it up on another core and testing there too.
It definitely clobbers OTA flash, so I added that code. This fits my theory about running out of sockets, but I have not been able to prove that. There are some related things, I think, like Spark.publish() that can clobber OTA updates. Yet another mystery to figure out!
Interesting about the Class C network! Yet another variable to test. I have been testing on a very lightly loaded 10.0.0.x network and it consistently fails.
Obviously if everything works, you don't need the D0 code, but if my test had hung, then I wanted a way to easily recover, and @BDub's code snippet seemed useful.
I'll perform more testing on other cores and try to set up a different network too.
OK - I believe the problem is ARP (an area that has proven to be the CC3000's Achilles heel).
If you send packets to a host that exists, or to something that is reachable via the default gateway off the local subnet, my testing suggests that you can send packets forever without a problem.
However, if the designated host is on the local subnet and not available, or the default gateway is unavailable (i.e. the ARP requests go unanswered), the app will die after ~12 iterations, as reported by @mdma and others.
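To make the two cases concrete, here is a rough sketch of the standard IPv4 next-hop rule that decides which address has to answer ARP before a unicast packet can leave. It is plain address arithmetic, not CC3000 code. The /24 netmask and the 10.0.0.1 gateway used in the usage note are assumptions for illustration only (neither is stated in this thread), and the function names are made up.

```cpp
#include <cstdint>

// Pack four octets into a host-order 32-bit address.
constexpr uint32_t ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return (static_cast<uint32_t>(a) << 24) | (static_cast<uint32_t>(b) << 16) |
           (static_cast<uint32_t>(c) << 8) | static_cast<uint32_t>(d);
}

// True for the limited broadcast address and the subnet's directed broadcast.
// These map straight to the Ethernet broadcast MAC, so no ARP lookup is needed.
constexpr bool isBroadcast(uint32_t local, uint32_t mask, uint32_t dest) {
    return dest == 0xFFFFFFFFu || dest == ((local & mask) | ~mask);
}

// True when dest shares the sender's subnet.
constexpr bool isOnLink(uint32_t local, uint32_t mask, uint32_t dest) {
    return (local & mask) == (dest & mask);
}

// The address that must answer ARP before a unicast packet to dest can be sent:
// the destination itself when it is on the local subnet, otherwise the gateway.
constexpr uint32_t arpTarget(uint32_t local, uint32_t mask,
                             uint32_t gateway, uint32_t dest) {
    return isOnLink(local, mask, dest) ? dest : gateway;
}
```

With the core at 10.0.0.13 and an assumed /24 mask, a packet to 10.0.0.2 needs 10.0.0.2 itself to answer ARP, which would explain a failure when nothing lives at that address. A packet to 192.168.121.5 only needs the gateway to answer, and 10.0.0.255 is a broadcast that should need no ARP at all.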
I have this in captivity on a core with the SPI bus pinned out for debug, so I can see if the CC3000 is trying to say it has run out of buffers while it is waiting for the ARP response (that is never destined to arrive).
Can others please verify this behaviour? (I'm thinking specifically of @bko & @mdma, but all datapoints are welcome.)
In related news, even @bko's D0 clause doesn't seem to work correctly, because the core is now effectively unflashable OTA.
It would be great to know if it is calling back with an "out of buffers" message. If so, the next step is to see if there is anything we can do about that in the host driver, which, sadly, seems unlikely.
But if an "out of buffers" message is sent, then we could have that propagate to the upper layers of firmware code to say "disconnect all your stuff, I need the resources", which could temporarily disconnect from wifi/cloud to free up buffers.
I'll try a couple of tests tonight as well. I have a host sitting on 10.0.0.3; I'll point it over there and see if it fails. Also, perhaps a test to 10.0.0.255 is in order as well? I wouldn't expect an ARP request for a broadcast address. I know we will have to dump incoming traffic, since we will pick up our own broadcast then. @mdma, if you send to an active host, does your core stay up as well?
I think it has already been established that as long as you consume the packets echoed to the core, broadcast UDP will work. Which matches the rest of the evidence. (Re)validating that would be good.
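For the revalidation, the "consume what comes back" step could look something like the sketch below. It assumes the UDP object offers Arduino-style parsePacket() and read() calls and that reading a datagram releases it; `FakeRxUdp` and `drainIncoming` are hypothetical names, and the fake is there only so the loop can be run off-device. Whether draining actually keeps the CC3000 alive is still the working theory here, not something this code proves.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a UDP object with some datagrams already queued.
// Every queued datagram is pretended to be 8 bytes long.
struct FakeRxUdp {
    int queued = 0;

    // Size of the next pending datagram, or 0 when nothing is waiting.
    int parsePacket() { return queued > 0 ? 8 : 0; }

    // Copy the pending datagram out and release it; -1 when nothing is waiting.
    int read(uint8_t* buf, std::size_t len) {
        if (queued == 0) return -1;
        --queued;
        const std::size_t n = len < 8 ? len : 8;
        for (std::size_t i = 0; i < n; ++i) buf[i] = 0;
        return static_cast<int>(n);
    }
};

// Read and discard pending datagrams, at most maxPackets per call so the
// main loop is never stalled by a flood. Returns how many were discarded.
template <typename Udp>
int drainIncoming(Udp& udp, int maxPackets) {
    uint8_t scratch[64];
    int drained = 0;
    while (drained < maxPackets && udp.parsePacket() > 0) {
        udp.read(scratch, sizeof(scratch));
        ++drained;
    }
    return drained;
}
```

Calling this once per loop iteration, right after the send, keeps the test comparable between the broadcast and unicast cases. The cap per call is a design choice so that a busy network cannot starve the sending side.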