Bug bounty: Kill the 'Cyan flash of death'

david_s5 · February 6, 2014, 12:39am

All,

http://nscdg.com/spark/good_32_falt_tol_core-firmware.zip

This adds some additional checks and should prove to be more reliable.

It sends a UDP packet to 10.10.0.1 every 60 seconds and does the same reading from the HTTP GET.

All the log data is critical, because the faults are logged after being detected and mitigated, so they do not send the CPU into a hard fault. Please send logs and LED observations.

mohit · February 6, 2014, 1:45am

Thanks @david_s5
I ran the new binary on the Core and, unlike last time, it recovered smoothly from most CFODs, but in the end it failed and got stuck in one.

Here is the log: http://pastebin.com/eg4ACegr

The RGB LED now blinks twice after a timeout.

david_s5 · February 6, 2014, 1:53am

Is it 2 red blinks prefixed by an SOS (...---...), repeating forever, or is it the connect loop?

mohit · February 6, 2014, 2:03am

I’m not sure if I understood the question correctly. The RGB LED flashed RED twice in-between CFODs. The Core never connected back to the cloud. Does that make sense? I’m capturing the log on Serial1, do I need to do anything further?

AndyW · February 6, 2014, 2:11am

I have written a decoder that (partially) groks the SPI traffic when captured by a Saleae logic probe. Matching this against tcpdumps of the wireless traffic shows me pretty conclusively that the CFOD that we have in captivity is not caused by a race condition/mutex snafu in the driver. This was the first place we looked.

To be sure, there is plenty of opportunity to remove knuckleheadedness in the TI driver, but we’re pretty confident at this point that the CFOD is caused by one or more CC3000-internal issue(s).

david_s5 · February 6, 2014, 2:15am

Any luck with the wireless sniffer?

david_s5 · February 6, 2014, 2:21am

Understood. It is the compound issues and recovery I am looking at. The code I have does nothing to detect the connect failures and reset anything. It is left this way so as not to mask any underlying issues. I have fixed 8 issues that caused the code to hard fault or hang. After it is stable, I can add the recovery, and it will take less than 60 seconds to recover.

May I please have a copy of your decoder?

david_s5 · February 6, 2014, 2:33am

No. The code is not faulting or the core would be stuck in a red blink pattern of ...---... N Blinks ...---... repeat.

Where N is one of the panic codes below:

From the logs you uploaded I see one issue that still needs to be resolved.
That is socket management, so that the SparkProtocol does not read from the user's socket as a result of the inactivity timer closing the sockets asynchronously to the user/Spark code opening them.

I have something I will try in the AM that may resolve this as well.

AndyW · February 6, 2014, 2:39am

Sure - how do I get the decoder to you? This forum software only attaches pretty pictures. It's a bit long to cut and paste here.

The decoder reads the data files created by the Logic app when it is configured to decode the SPI bus into hex. You dump that to a file and feed it to my spectacularly ugly Python hack, which decodes the packets I care about to hunt down CFOD.

The output looks like this:

$ tail -12 spi10.log
-25.016520 40078 send data fd 0, flags 0, length 18: 00 10 9a b3 c9 55 89 13 7a 17 2a f0 d1 85 36 51 29 a3
-25.016081 40079 send status fd 0, length 18
-22.857388 40942 free buffers event 0:1
-10.007862 46079 send data fd 0, flags 0, length 18: 00 10 9e 25 18 1a 4d 64 75 63 ca 23 a3 c5 64 df e4 39
-10.007428 46080 send status fd 0, length 18
-8.149413 46823 free buffers event 0:1
+0.100034 50082 ipconfig command
+0.100367 50083 ipconfig status: 192.168.121.179 255.255.255.0 192.168.121.1 192.168.121.1 192.168.121.1 00:12:55:55:55:55 Home1
+0.100917 50084 close command fd 0
+3.703081 50085 close status 0
+3.703224 50086 socket command domain 2, type 1, protocol 6
+3.703610 50087 socket status 0
$

That shows the last two packet writes (2 byte header, indicating 16 byte payload following) - these packets never make it out over the radio. Then the main core firmware gives up and starts its error handling - ipconfig(), close(), select() - but the CC3000 is borked at that point. The SPI interface functions in zombie mode, but the radio side of the house is dead as a doornail. Packet traces show the cloud retrying until it quits in disgust.
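The header observation above can be checked mechanically. Here is a minimal Python sketch (not AndyW's decoder) that parses one `send data` line from the output shown and confirms that the first two bytes, read as a big-endian integer, equal the number of payload bytes that follow. The function name `parse_send_data`, the regular expression, and the big-endian reading are assumptions inferred from the sample lines.

```python
import re

# Matches lines such as:
# -25.016520 40078 send data fd 0, flags 0, length 18: 00 10 9a ...
# The field layout is inferred from the sample output, not from the decoder's source.
SEND_DATA = re.compile(
    r"^(?P<time>[+-]\d+\.\d+)\s+(?P<seq>\d+)\s+send data fd (?P<fd>\d+), "
    r"flags (?P<flags>\d+), length (?P<length>\d+):\s*(?P<hex>[0-9a-f ]+)$"
)


def parse_send_data(line):
    """Return a dict describing a 'send data' line, or None for any other line."""
    match = SEND_DATA.match(line.strip())
    if match is None:
        return None
    data = bytes.fromhex(match.group("hex"))
    length = int(match.group("length"))
    # Assumption: the first two bytes are a big-endian payload length.
    declared = int.from_bytes(data[:2], "big")
    return {
        "time": float(match.group("time")),
        "seq": int(match.group("seq")),
        "fd": int(match.group("fd")),
        "length": length,
        "declared_payload": declared,
        "payload": data[2:],
        # True when the logged length and the header both agree with the bytes seen.
        "consistent": len(data) == length and declared == len(data) - 2,
    }
```

Feeding it the first `send data` line from the log gives a declared payload of 16 bytes for a logged length of 18, matching the description above; lines of any other type return `None`.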

And I want to be clear, I'm not knocking your work; just saying that I don't think quashing those bugs will stop CFOD.

david_s5 · February 6, 2014, 3:20am

No worries. I did not take it as knocking. I will PM you with my contact info and you can drop it in Skype or email it to me.

AndyW · February 6, 2014, 4:06am

On its way.

[filler here to make the forum happy.]

david_s5 · February 6, 2014, 4:07am

Thank you!

[filler here to make the forum happy.]

RWB · February 6, 2014, 8:05am

Just adding my experience with the “Flash of Death”

I’m using my Cellphone as my WiFi Hotspot and have been for years now.

The Spark Core stays connected just fine to it.

So I’m pushing data to Xively and that’s working just fine.

I start downloading movie torrents and stream a video on YouTube, pulling about 1.5MB per second over the hotspot data stream. Verizon kills the data connection due to the high bandwidth, cutting all connections to the internet. The torrent data transfer rate drops to 0. The WiFi signal is fine and all connected devices stay connected; there is just no data to and from the net.

Because of this the Spark Core starts to flash blue, because the cloud is not accessible anymore. I reset the hotspot on my cell phone and everything reconnects to the phone and data is flowing again. But the Spark Core never tries to reconnect to the WiFi network; it remains flashing blue and no further attempts are made to reconnect to my phone’s WiFi hotspot.

The only solution for my application is to program the Spark Core to power off and then on every 5 to 10 mins to guarantee it reconnects to the network, like it does successfully after every reset. But if the blue flashing freezes up the code, then there is no way a reset would actually run, because that part of the code would never execute while the Spark Core is frozen.
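The workaround above could be narrowed from an unconditional power cycle to a reset only when the cloud has been silent for too long. A minimal sketch in Python (a model of the logic, not Spark firmware): `CloudWatchdog`, `feed`, `should_reset` and the 300 second default are hypothetical, with 300 s taken from the low end of the 5 to 10 minute interval mentioned above.

```python
import time


class CloudWatchdog:
    """Tracks how long it has been since the last successful cloud contact.

    Hypothetical helper for illustration; not part of the Spark firmware.
    """

    def __init__(self, timeout_s=300.0, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_ok = clock()

    def feed(self):
        """Call after every successful exchange with the cloud."""
        self._last_ok = self._clock()

    def should_reset(self):
        """True once no successful contact has been seen for timeout_s seconds."""
        return self._clock() - self._last_ok > self._timeout_s
```

As the post points out, a check like this only helps if it runs from something that keeps going when the main code stalls, such as a hardware watchdog timer.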

Looking forward to the fix!

I wonder if Adafruit has this same issue with any of their CC3000 breakouts? If not, then maybe this is an issue they already solved or could help out with? They have sold a shit ton of them, so I figure they should have had some of the same issues if they are using the same chip.

david_s5 · February 6, 2014, 12:27pm

@RWB Would you please run this http://nscdg.com/spark/good_32_falt_tol_core-firmware.zip and report back what the LEDs do? If you can grab the logs too, that would be great (3.3V RX/TX on Serial1, 115200 baud).

Iv4n · February 6, 2014, 4:37pm

I guess this is not really a contribution but a comment about this issue in different environments.
I’m very impressed about something: I moved a Core from my house, where I can easily get days without any CFOD, to my work, and it doesn’t last more than 5 minutes. Here at work not only are there many WiFi-connected clients, but there are also many other networks from other companies (22 networks, each with its own clients). I’m still able to Skype and use the internet from my phone without trouble, but the Core becomes very unstable, rebooting every time in less than 5 minutes.

RWB · February 6, 2014, 5:25pm

I can do this but I need some clear step by step instructions on how to properly load this firmware.

Another thing I noticed is that if my Wifi connection dropped out completely that the Spark Core would go to flashing green and as soon as the WiFi network was available again it would successfully connect to the wifi network and the Spark cloud.

It’s only when the Spark Core remains connected to the WiFi network and the connection to the Spark Cloud is lost that the blue LED flash begins and never recovers.

I’ve been running a data stream to Xively successfully, but about 2 hours ago it stopped sending data, even though the Spark Core is successfully connected to my WiFi network and the network is connected to the internet (all my other devices work). I have the on-board blue LED flash 3 times after a successful transmission to Xively, but it has just stopped for some unknown reason, even though the Spark Core says it’s successfully connected to the internet and the Spark Cloud.

ryotsuke · February 6, 2014, 7:22pm

Speaking of Adafruit CC3000 WiFi module, they have issues too: https://github.com/adafruit/Adafruit_CC3000_Library/issues

RWB · February 6, 2014, 7:37pm

That’s a great resource ryotsuke.

It has to be helpful to the guys trying to figure out exactly what’s going on.

wtfuzz · February 6, 2014, 8:02pm

Hey, I’ve had some success in my environment by turning on fast reconnect mode. Usually I would have the CC3000 act strange in that it would not respond to ICMP ping from my LAN like it normally does, but serial debugging output on the spark would show that netapp_ipconfig() was still returning the DHCP configuration. It would normally hang at this point on a connect() call with CFOD.

This would happen between 1 minute and 1 hour, and I’ve yet to have a CFOD in 8 hours with fast mode. I’m still running under heavily modified driver code (currently have DMA disabled on SPI).

In spark_wlan.cpp, the second argument to wlan_ioctl_set_connection_policy() will enable fast mode.

	if(!WLAN_MANUAL_CONNECT)
	{
		if(NVMEM_Spark_File_Data[WLAN_PROFILE_FILE_OFFSET] == 0)
		{
			// Profile flag not set: start Smart Config
			WLAN_SMART_CONFIG_START = 1;
		}
		else if(NVMEM_Spark_File_Data[WLAN_POLICY_FILE_OFFSET] == 0)
		{
			// MT - Enable fast connect
			// Arguments: connect to open AP, fast connect, use stored profiles
			wlan_ioctl_set_connection_policy(DISABLE, ENABLE, ENABLE);

			// Set the policy flag and persist it, so this branch runs only while the flag is 0
			NVMEM_Spark_File_Data[WLAN_POLICY_FILE_OFFSET] = 1;
			nvmem_write(NVMEM_SPARK_FILE_ID, 1, WLAN_POLICY_FILE_OFFSET, &NVMEM_Spark_File_Data[WLAN_POLICY_FILE_OFFSET]);
		}
	}

tdicola · February 6, 2014, 8:18pm

I’ve done some hacking on the Adafruit CC3000 library and am definitely watching this thread for any insight into TI CC3k driver issues. Unfortunately I’m not familiar with the spark core and don’t have a ton of input, other than to share your pain of working with the CC3k driver code.

Something that has been really helpful to me to debug lockup issues though is toggling output pins at the start and end of certain functions and using a multimeter to check if those outputs are high or low so I know where execution is getting stuck. In particular any functions in the driver that have loops are worth investigating deeply–it looks like there are quite a few places where some condition might not happen and the driver will happily bury itself in a loop forever (am looking at a lockup issue in the HostFlowControlConsumeBuff() function inside socket.cpp now actually).
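The two ideas in this paragraph, marking function entry and exit on an output pin and refusing to loop forever, can be modelled in a few lines. This is a Python sketch of the logic only, not driver code: the `pins` dictionary stands in for real GPIO outputs, and `trace_pin`, `wait_until`, `PollLimitExceeded` and `consume_buffer` are hypothetical names.

```python
import functools

# Stand-in for GPIO outputs. On hardware these would be real pins probed with
# a multimeter; the dictionary and pin names here are hypothetical.
pins = {}


class PollLimitExceeded(Exception):
    """Raised when a bounded wait gives up instead of spinning forever."""


def trace_pin(pin):
    """Drive `pin` high on entry to the function and low on normal exit."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            pins[pin] = True             # high: execution is inside fn
            result = fn(*args, **kwargs)
            pins[pin] = False            # low: reached only if fn returned
            return result
        return wrapper
    return decorator


def wait_until(condition, max_polls):
    """Poll `condition` at most `max_polls` times, then raise."""
    for _ in range(max_polls):
        if condition():
            return
    raise PollLimitExceeded("condition not met after %d polls" % max_polls)


@trace_pin("D0")
def consume_buffer(free_buffers):
    """Hypothetical stand-in for a routine that waits for a free buffer."""
    wait_until(lambda: free_buffers() > 0, max_polls=1000)
    return free_buffers() - 1
```

If `consume_buffer` never sees a free buffer, the bounded wait raises and `pins["D0"]` is left high, which is the same clue a multimeter gives on hardware: the pin that stays high names the function where execution stopped.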

Good luck tracking down the issue here!
