Bug bounty: Kill the 'Cyan flash of death'

Of course as soon as I get 3 Cores up simultaneously to test - running slightly different variations of the code - they stop crashing entirely. I’ve been running for 30+ hours without a crash. Which would be great, except it sounds like the issue remains, I just can’t replicate it any longer…

I have no long delays in my code, but I use a couple of delayMicroseconds() calls of up to 480 microseconds to time a OneWire read each loop. I guess this is too short to affect it. I can slow the reading down to happen only every 500 ms or so to see if that has any effect. I do use millis() a lot to time things, though, if that has anything to do with it.

Otherwise I’m polling the functions and variables a couple of times every 5 seconds through the API, so maybe the issue lies there?

One random idea is to put the Cores on a higher-latency network. Try sharing a connection through your phone and see if that affects it.

@sjunnesson good suggestion, I’ll try that.

Btw, there is a new commit to the core-firmware repositories that speeds up SPI, which may help as well. Pull the latest master of all 3 repositories to test.

This might be totally obvious, but I just wanted to point out that if you are pulling this from Git to build on your local machine, make sure you actually check out the feature/debug-cfod branch before building. I made the mistake of cloning the repo and adding my application before realizing that I hadn’t checked out the correct debug branch.

I now have two cores (that disconnect quite frequently) with latest firmware and SPI updates running my own custom applications. I’ll report back if I notice any change in behavior.

Thanks @crstffr for pointing that out. The SPI improvements specifically, though, were made to the master branch (since speeding up SPI is something that everyone can get behind).

Though this is most likely a firmware issue, I don’t want to completely rule out the Spark Cloud. If anyone is willing to share detailed timing information (don’t forget to specify your timezone) from their debugging efforts, that could be handy (email me: joe at spark dot io). I’m seeing a clear periodicity, with a period of about 8-15 hours, in the total number of open files across all of our cloud nodes. See graph:

Though this periodicity is probably a natural cleanup process conducted by ZeroMQ or the OS itself, it could be contributing to this issue. If it were contributing to the problem, I’d hypothesize that you’d see disconnects at the inflection points, in particular (Central Standard Time):

8am (Jan 15)
10pm (Jan 14)
8am (Jan 14)
11pm (Jan 13)
5am (Jan 13)
4pm (Jan 12)

Again, I’m going out on a limb here with this hypothesis and I think it’s unlikely to be part of the problem, however, given the intense efforts people are making to solve this, I figured I’d share this data in case it helps others debug, correlate timing events, etc.
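
For anyone who wants to check their own disconnect times against the inflection points listed above, here is a minimal Python sketch. It assumes the year is 2014, treats CST as a fixed UTC-6 offset, and uses an arbitrary 15-minute tolerance; the function names are made up for illustration and none of this is part of any Spark tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CST is a fixed UTC-6 offset and the year is 2014.
CST = timezone(timedelta(hours=-6), "CST")

# Inflection points from the list above, in CST.
INFLECTION_POINTS = [
    datetime(2014, 1, 15, 8, 0, tzinfo=CST),
    datetime(2014, 1, 14, 22, 0, tzinfo=CST),
    datetime(2014, 1, 14, 8, 0, tzinfo=CST),
    datetime(2014, 1, 13, 23, 0, tzinfo=CST),
    datetime(2014, 1, 13, 5, 0, tzinfo=CST),
    datetime(2014, 1, 12, 16, 0, tzinfo=CST),
]


def nearest_inflection(event):
    """Return (closest inflection point, absolute gap) for a tz-aware event time."""
    closest = min(INFLECTION_POINTS, key=lambda p: abs(event - p))
    return closest, abs(event - closest)


def correlates(event, tolerance=timedelta(minutes=15)):
    """True if the event falls within `tolerance` (an arbitrary choice) of any inflection point."""
    return nearest_inflection(event)[1] <= tolerance


# Hypothetical disconnect logged in UTC; it is converted implicitly when compared.
event = datetime(2014, 1, 15, 13, 50, tzinfo=timezone.utc)
point, gap = nearest_inflection(event)
print(point.isoformat(), gap, correlates(event))
```

Because the datetimes are timezone-aware, events logged in UTC or any other zone can be compared directly, which is why specifying your timezone matters when reporting.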

I had my code freeze up at about 9:05 PM CST (14 Jan), so that doesn’t seem to correlate with these time windows.

I’m running the same code today and so far I have had a couple of restarts of the Core but no complete freeze. Hoping it will not freeze too much, since some steaks are cooking using this code. :slight_smile:

Interesting.

Are those file descriptors web-side, core-side, internal, all the above ?

Can you break it down that way at all ?

All my tcpdumps of CFOD events show that the cloud continues trying to communicate with a mute core. I see the retries (at the TCP protocol level) stretch out until it finally just abandons the connection. If it were file descriptor exhaustion, I’d expect to see different behaviour.

Of course there could be multiple effects at play here, but I do not believe I’m seeing the CFOD due to cloud resource issues.

@sjunnesson thanks for your feedback, hope your steaks turn out delicious, :smile:
@AndyW thanks for sharing the TCPDump info you are seeing.
Both of your accounts help to rule out this as being a causal factor.

@AndyW, responses to your questions:

Are those file descriptors web-side, core-side, internal, all the above ?

core-side (device and api) internal (not web ide)

Can you break it down that way at all ?

Unfortunately, not easily/quickly at this time. I hope to be able to within the next couple weeks.

Your TCPDump information strengthens a hypothesis someone has posed: that a cause could be the CC3000 not correctly closing the TCP connection with the cloud under CFOD conditions, which then prevents it from trying again, perhaps because some internals think it's still connected...

So if I do:

cd core-firmware
git checkout feature/debug-cfod

Then I’m in the correct branch?

Yes, that is correct.

19 Hours, 46 Minutes, 28 Seconds

Of note, I'm now getting the "CFoD" for a couple of minutes, then the LED switches to red and blinks slowly a couple of times before going back to cyan and repeating this cycle.

To put that time in perspective, it's on the short end of what I was getting pre-this-fix.

Thanks everyone for the continued work on this; we’re starting to home in on the issue.

We now have two different networks that we can connect to in the office; on one, we never encounter CFOD (48 hours and going strong); on the other, the Core hits CFOD after 30 seconds. Now that we’ve got the bug in captivity and can replicate it quickly, debugging should move much faster.

In the short term if you’re working on a project that requires reliability (@sjunnesson I’m looking at your steaks), and you’re encountering CFOD, it may be worth trying a different network or fiddling with your router settings.

Also if you have encountered the CFOD (or even if not), can you please tell us your symptoms and share your router make and model and its settings? I.e. 802.11b/g/n, what security protocol, what the IP address range is, where in the world you’re located (country), whatever else seems pertinent.

@zach the steak turned out good so no worries about them. The problem is that the whole office smelled them so I think I need to repeat this soon… http://www.flickr.com/photos/sjunnesson/11970603084/

My router in this case is a Nexus 4 sharing its mobile network as a local WiFi hotspot, secured with WPA2. I don’t know any more details about this very limited setup.

I’m also experiencing this over a Cradlepoint COR IBR600 4G router, which shares its network over WiFi (802.11 b/g/n) to the Cores, secured with WPA2. I will see if I can get more detailed specs on this setup.

@zach When you say “fiddling with your router settings” is there anything specific you recommend to look at?

Shall we send you emails, or do you want us to post here? I know there was some initial question about having a space/blank in the SSID. I don't know if that was ruled out or not. Mine has a space in it.

Dave O

OMNOMNOM

For router settings, I don't have any particular ideas in mind, unfortunately.

Whatever you’re comfortable posting would be great, so we can get more eyes on it. If there’s info you don’t want to share publicly, send me a PM.

I am not getting a CFOD but my Spark Core randomly reboots.

My program on the SparkCore stores some data into a global variable. I have an Android application that reads the status of the variable through the cloud and sometimes I noticed that the value gets reset to 0.

At first I thought it was because I was powering my SparkCore through a NAS at home, but now I connected it to a USB Phone Charger and the same thing happens.

The good thing is that it doesn’t freeze. It resets and reconnects to the cloud.

Again, I’m not sure if it’s related, but I figured I’d share it.

Carsten

I flashed https://github.com/spark/core-firmware/blob/feature/debug-cfod/build/core-firmware.bin to my Spark via dfu-util, which also enabled Tinker on the Spark (so I didn’t push a new sketch via the cloud). I then turned a Python script loose on it via cron, which queried the A0 value every minute via the Cloud API/REST and logged the results (see table below).

After the CFOD occurred, the Spark would not reset or reconnect on its own; it responded only to a RESET or a power cycle. Once reset, it would immediately reconnect.

I’m using an Asus RT-N16 and running “EasyTomato Firmware Version 0.8” (http://www.easytomato.org). I’m using WPA2 with AES on Channel 6 (2.437 GHz). My SSID contains a “-” (dash) and a " " (space) character. I’m running in “Auto” mode and have B & G clients - I don’t believe I have any N clients. I’m using 192.168.1.0/32 for an IP range and am located in the US.

Event  Seconds since Epoch  JSON result        Time (UTC)     Uptime (minutes)
Start  1389798541                              1/15/14 15:09
End    1389815832           error: Timed out.  1/15/14 19:57  288.18

Start  1389816842                              1/15/14 20:14
End    1389818951           error: Timed out.  1/15/14 20:49  35.15

Start  1389819002                              1/15/14 20:50
End    1389819431           error: Timed out.  1/15/14 20:57  7.15

Start  1389819782                              1/15/14 21:03
End    1389828071           error: Timed out.  1/15/14 23:21  138.15

Start  1389828122                              1/15/14 23:22
End    1389829631           error: Timed out.  1/15/14 23:47  25.15

Start  1389831482                              1/16/14 0:18
End    1389831731           error: Timed out.  1/16/14 0:22   4.15

Start  1389831782                              1/16/14 0:23
End    1389834132           error: Timed out.  1/16/14 1:02   39.17

Start  1389839222                              1/16/14 2:27
End    1389843851           error: Timed out.  1/16/14 3:44   77.15
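
In case it helps anyone reproduce the numbers, the uptime column above is just the End epoch minus the Start epoch, divided by 60. Below is a rough Python sketch of that arithmetic, using the first two runs from the table. The helper names are invented for this example, and this is not the actual logging script.

```python
from datetime import datetime, timezone


def uptime_minutes(start_epoch, end_epoch):
    """Minutes between the Start and End rows of one run, rounded to 2 places."""
    return round((end_epoch - start_epoch) / 60, 2)


def utc_label(epoch):
    """Format an epoch timestamp as a UTC 'YYYY-MM-DD HH:MM' string."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M")


# (start, end) epoch pairs copied from the first two runs in the table.
runs = [(1389798541, 1389815832), (1389816842, 1389818951)]

for start, end in runs:
    print(utc_label(start), "->", utc_label(end), uptime_minutes(start, end), "min")
```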

Dave O