Bug bounty: Kill the 'Cyan flash of death'

zach · January 15, 2014, 4:53am

Of course as soon as I get 3 Cores up simultaneously to test - running slightly different variations of the code - they stop crashing entirely. I’ve been running for 30+ hours without a crash. Which would be great, except it sounds like the issue remains, I just can’t replicate it any longer…

sjunnesson · January 15, 2014, 5:16am

I have no long delays in my code, but I use a couple of delayMicroseconds() calls of up to 480 microseconds to time a OneWire read each loop. I guess this will be too short to affect it. I can slow down the reading to happen only every 500 ms or so to see if that has any effect. I do use millis() a lot to time things, though, if that has anything to do with it.

Otherwise I’m polling the functions and variables a couple of times every 5 seconds through the API, so maybe the issue lies there?

sjunnesson · January 15, 2014, 5:19am

One random idea is to put the Cores on a higher-latency network. Try sharing a connection through your phone and see if that affects it.

zach · January 15, 2014, 5:33am

@sjunnesson good suggestion, I’ll try that.

Btw, there’s a new commit to the core-firmware repositories that speeds up SPI, which may help as well. Pull the latest master of all 3 repositories to test.

crstffr · January 15, 2014, 6:02am

This might be totally obvious, but I just wanted to point out that if you are pulling this from Git to build on your local machine, make sure you actually check out the feature/debug-cfod branch before building. I made the mistake of cloning the repo and adding my application before realizing that I hadn’t checked out the correct debug branch.

I now have two cores (that disconnect quite frequently) with latest firmware and SPI updates running my own custom applications. I’ll report back if I notice any change in behavior.

zach · January 15, 2014, 6:14am

Thanks @crstffr for pointing that out, although the SPI improvements specifically were made to the master branch (since speeding up SPI is something that everyone can get behind).

jgoggins · January 15, 2014, 4:56pm

Though this is most likely a firmware issue, I don’t want to completely rule out the Spark Cloud. If anyone is willing to share detailed timing information (don’t forget to specify your timezone) associated with their debugging efforts, that could be handy (email me: joe at spark dot io). I’m seeing a clear periodic pattern in the total number of open files across all of our cloud nodes, with a period of about 8-15 hours. See graph:

Though this periodicity is probably a natural cleanup process conducted by ZeroMQ or the OS itself, it could be contributing to this issue. If it were contributing to the problem, I’d hypothesize that you’d see disconnects at the inflection points, in particular (Central Standard Time):

8am (Jan 15)
10pm (Jan 14)
8am (Jan 14)
11pm (Jan 13)
5am (Jan 13)
4pm (Jan 12)

Again, I’m going out on a limb here with this hypothesis and I think it’s unlikely to be part of the problem, however, given the intense efforts people are making to solve this, I figured I’d share this data in case it helps others debug, correlate timing events, etc.

sjunnesson · January 15, 2014, 5:01pm

I had my code freeze up at about 9:05 PM CST (14 Jan), so that doesn’t seem to correlate with these time windows.
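
One way to check this kind of correlation is to compute the gap between an observed freeze and the nearest hypothesized inflection point. The sketch below is only illustrative: it assumes the year is 2014, treats all times as naive Central Standard Time values, and uses a hypothetical helper named `nearest_inflection`.

```python
from datetime import datetime

# Hypothesized inflection points (Central Standard Time), from the list above.
# The year 2014 is taken from the post dates.
INFLECTION_POINTS = [
    datetime(2014, 1, 15, 8, 0),
    datetime(2014, 1, 14, 22, 0),
    datetime(2014, 1, 14, 8, 0),
    datetime(2014, 1, 13, 23, 0),
    datetime(2014, 1, 13, 5, 0),
    datetime(2014, 1, 12, 16, 0),
]


def nearest_inflection(event, points=INFLECTION_POINTS):
    """Return (closest point, absolute gap in minutes) for an observed event."""
    closest = min(points, key=lambda p: abs(p - event))
    gap_minutes = abs(closest - event).total_seconds() / 60
    return closest, gap_minutes


freeze = datetime(2014, 1, 14, 21, 5)  # the 9:05 PM CST freeze reported above
point, gap = nearest_inflection(freeze)
print(f"nearest inflection: {point:%Y-%m-%d %H:%M}, {gap:.0f} minutes away")
```

Whether a given gap counts as a match depends on how wide a window around each inflection point is considered plausible, which the thread does not specify.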

Running the same code today, and so far I have had a couple of restarts of the Core but no complete freeze. Hoping it will not freeze too much, since some steaks are cooking using this code.

AndyW · January 15, 2014, 6:14pm

Interesting.

Are those file descriptors web-side, core-side, internal, or all of the above?

Can you break it down that way at all?

All my tcpdumps of CFOD events show that the cloud continues trying to communicate with a mute Core. I see the retries (at the TCP protocol level) stretch out until it finally just abandons the connection. If it were file descriptor exhaustion, I’d expect to see different behaviour.
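
Retries that stretch out like this are consistent with TCP's exponential retransmission backoff, where the sender roughly doubles its wait after each unanswered segment until it gives up. A rough sketch of such a schedule follows; the 1-second initial timeout, 60-second cap, and retry count are illustrative assumptions, not values measured from these tcpdumps or taken from the cloud's configuration.

```python
def backoff_schedule(initial_rto=1.0, max_rto=60.0, retries=8):
    """Return the wait before each retransmission, doubling up to a cap.

    All three parameters are illustrative assumptions.
    """
    waits = []
    rto = initial_rto
    for _ in range(retries):
        waits.append(rto)
        rto = min(rto * 2, max_rto)  # double the timeout, but never exceed the cap
    return waits


waits = backoff_schedule()
print(waits)
print(f"sender gives up after about {sum(waits):.0f} s without a reply")
```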

Of course there could be multiple effects at play here, but I do not believe I’m seeing the CFOD due to cloud resource issues.

jgoggins · January 15, 2014, 7:02pm

@sjunnesson thanks for your feedback, hope your steaks turn out delicious.
@AndyW thanks for sharing the TCPDump info you are seeing.
Both of your accounts help to rule out this as being a causal factor.

@AndyW, responses to your questions:

Are those file descriptors web-side, core-side, internal, or all of the above?

Core-side (device and API) and internal (not the Web IDE).

Can you break it down that way at all?

Unfortunately, not easily/quickly at this time. I hope to be able to within the next couple weeks.

Your TCPDump information strengthens a hypothesis someone has posed: that one cause could be the CC3000 not correctly closing the TCP connection with the cloud under CFOD conditions, which then prevents it from trying again, perhaps because some internals think it's still connected...

Frido · January 15, 2014, 9:04pm

So if I do:

cd core-firmware
git checkout feature/debug-cfod

Then I’m in the correct branch?

crstffr · January 15, 2014, 10:18pm

Yes, that is correct.

timb · January 15, 2014, 10:21pm

19 Hours, 46 Minutes, 28 Seconds

Interestingly, I'm now getting the "CFoD" for a couple of minutes, then it switches to red and blinks slowly a couple of times before going back to cyan and repeating this cycle.

To put that time in perspective, it's on the short end of what I was getting pre-this-fix.

zach · January 15, 2014, 11:17pm

Thanks everyone for the continued work on this; we’re starting to home in on the issue.

We now have two different networks that we can connect to in the office; on one, we never encounter CFOD (48 hours and going strong); on the other, the Core hits CFOD after 30 seconds. Now that we’ve got the bug in captivity and can replicate it quickly, debugging should move much faster.

In the short term if you’re working on a project that requires reliability (@sjunnesson I’m looking at your steaks), and you’re encountering CFOD, it may be worth trying a different network or fiddling with your router settings.

Also if you have encountered the CFOD (or even if not), can you please tell us your symptoms and share your router make and model and its settings? I.e. 802.11b/g/n, what security protocol, what the IP address range is, where in the world you’re located (country), whatever else seems pertinent.

sjunnesson · January 15, 2014, 11:44pm

@zach the steaks turned out good, so no worries about them. The problem is that the whole office smelled them, so I think I need to repeat this soon… http://www.flickr.com/photos/sjunnesson/11970603084/

My router in this case is a Nexus 4 sharing its mobile connection as a local WiFi hotspot, secured with WPA2. I don’t know any more details about this very limited setup.

I’m also experiencing this over a Cradlepoint COR IBR600 4G router, which shares its connection over WiFi (802.11 b/g/n) to the Cores, secured with WPA2. I will see if I can get more detailed specs on this setup.

@zach When you say “fiddling with your router settings”, is there anything specific you recommend looking at?

dorth · January 16, 2014, 12:20am

Shall we send you emails, or do you want us to post here? I know there was some initial question about having a space/blank in the SSID. I don't know if that was ruled out or not. Mine has a space in it.

Dave O

zach · January 16, 2014, 12:55am

OMNOMNOM

For router settings, I don't have any particular ideas in mind, unfortunately.

zach · January 16, 2014, 12:55am

Whatever you’re comfortable posting would be great, so we can get more eyes on it. If there’s info you don’t want to share, send me a PM.

Carsten4207 · January 16, 2014, 1:21am

I am not getting a CFOD but my Spark Core randomly reboots.

My program on the Spark Core stores some data in a global variable. I have an Android application that reads the status of the variable through the cloud, and sometimes I notice that the value gets reset to 0.

At first I thought it was because I was powering my Spark Core through a NAS at home, but now I have connected it to a USB phone charger and the same thing happens.

The good thing is that it doesn’t freeze. It resets and reconnects to the cloud.

Again, not sure if it’s related, but I figured I’d share it.

Carsten

dorth · January 16, 2014, 5:23am

I flashed https://github.com/spark/core-firmware/blob/feature/debug-cfod/build/core-firmware.bin to my Spark via dfu-util, which also enabled Tinker on the Spark (so I didn’t push a new sketch via the cloud). I then turned a Python script loose on it via cron, which queried for the A0 value every minute via the Cloud API (REST) and logged the results (see table below).
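
A minimal sketch of what such a polling script might look like is below. It is not the script used here: the endpoint URL, the `analogread` Tinker function name, the `return_value` field, and the `DEVICE_ID` / `ACCESS_TOKEN` placeholders are all assumptions, and `read_a0` and `poll_once` are hypothetical helpers. The fetch function is passed in so the logging logic can be exercised without a Core.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed values; substitute your own device ID and access token.
API_URL = "https://api.spark.io/v1/devices/{device_id}/analogread"  # assumed endpoint
DEVICE_ID = "YOUR_DEVICE_ID"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"


def read_a0(timeout=10):
    """Ask Tinker for the A0 value through the cloud (assumed request shape)."""
    data = urllib.parse.urlencode(
        {"access_token": ACCESS_TOKEN, "params": "A0"}
    ).encode()
    url = API_URL.format(device_id=DEVICE_ID)
    with urllib.request.urlopen(url, data=data, timeout=timeout) as resp:
        return json.loads(resp.read().decode())


def poll_once(fetch=read_a0, now=time.time):
    """Return one log line: epoch seconds, a tab, then the JSON result or the error."""
    stamp = int(now())
    try:
        result = json.dumps(fetch())
    except Exception as exc:  # timeouts and HTTP errors are logged, not raised
        result = f"error: {exc}"
    return f"{stamp}\t{result}"


# Offline demonstration with stand-in fetchers and made-up readings.
# A cron job would instead call poll_once() with the defaults once a minute.
def fake_timeout():
    raise TimeoutError("Timed out.")


ok_line = poll_once(fetch=lambda: {"return_value": 1234}, now=lambda: 1389798541)
err_line = poll_once(fetch=fake_timeout, now=lambda: 1389815832)
print(ok_line)
print(err_line)
```

Logging the error text instead of raising keeps the cron job producing one line per minute, so the first failed line marks when the Core stopped responding.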

After the CFOD occurred, the Spark would not reset or reconnect on its own, only responding to a RESET or a power cycle. Once reset, it would immediately reconnect.

I’m using an Asus RT-N16 and running “EasyTomato Firmware Version 0.8” (http://www.easytomato.org). I’m using WPA2 with AES on Channel 6 (2.437 GHz). My SSID contains a “-” (dash) and a " " (space) character. I’m running in “Auto” mode and have B & G clients - I don’t believe I have any N clients. I’m using 192.168.1.0/32 for an IP range and am located in the US.

Event  Seconds since Epoch  JSON result        Time (UTC)     Uptime (minutes)
Start  1389798541                              1/15/14 15:09
End    1389815832           error: Timed out.  1/15/14 19:57  288.18

Start  1389816842                              1/15/14 20:14
End    1389818951           error: Timed out.  1/15/14 20:49  35.15

Start  1389819002                              1/15/14 20:50
End    1389819431           error: Timed out.  1/15/14 20:57  7.15

Start  1389819782                              1/15/14 21:03
End    1389828071           error: Timed out.  1/15/14 23:21  138.15

Start  1389828122                              1/15/14 23:22
End    1389829631           error: Timed out.  1/15/14 23:47  25.15

Start  1389831482                              1/16/14 0:18
End    1389831731           error: Timed out.  1/16/14 0:22   4.15

Start  1389831782                              1/16/14 0:23
End    1389834132           error: Timed out.  1/16/14 1:02   39.17

Start  1389839222                              1/16/14 2:27
End    1389843851           error: Timed out.  1/16/14 3:44   77.15
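
The uptime column can be reproduced from the epoch timestamps. The sketch below copies the start and end values from the table and converts each difference to minutes; the `uptime_minutes` helper name is just for illustration.

```python
from datetime import datetime, timezone

# (start, end) epoch seconds for each run, copied from the table above.
RUNS = [
    (1389798541, 1389815832),
    (1389816842, 1389818951),
    (1389819002, 1389819431),
    (1389819782, 1389828071),
    (1389828122, 1389829631),
    (1389831482, 1389831731),
    (1389831782, 1389834132),
    (1389839222, 1389843851),
]


def uptime_minutes(start, end):
    """Minutes between the first good reading and the first timeout."""
    return round((end - start) / 60, 2)


uptimes = [uptime_minutes(start, end) for start, end in RUNS]

for (start, end), minutes in zip(RUNS, uptimes):
    ended = datetime.fromtimestamp(end, tz=timezone.utc)
    print(f"timed out {ended:%Y-%m-%d %H:%M} UTC after {minutes:.2f} min")

print(f"shortest run: {min(uptimes):.2f} min, longest run: {max(uptimes):.2f} min")
```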

Dave O
