Bug bounty: Kill the 'Cyan flash of death'

@RWB

git can feel very intimidating

Simple version.

Save your folders.

Go to https://github.com/davids5/core-common-lib and https://github.com/davids5/core-firmware.

Grab the zips (see below) and replace the code you have with them

Yeah, that's really easy to understand. Thank you!

So I just click the Branch button, select the spark_master_new option from the drop-down, and it will pull up the correct files. Then I download the zip package.

Sounds easy enough.

Thanks, this is great. Is there any specific test you want me to run, or should I just load it with one of my own applications? A variable in my standard app is accessed through the cloud API every 5 seconds and that's about it. So far I have had a CFOD every ~24 hours or so with that code on my network at work.
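For anyone who wants to run a similar test, here is a rough sketch of that kind of poller: it reads a variable through the cloud API every 5 seconds and logs a timestamp with the result, so you can see when the timeouts start. The URL layout is assumed from the Spark Cloud REST API, and the device ID, access token and variable name are hypothetical placeholders, not values from this thread.

```shell
#!/bin/sh
# Hypothetical placeholders: set these to your own values.
DEVICE_ID="${DEVICE_ID:-your-device-id}"
ACCESS_TOKEN="${ACCESS_TOKEN:-your-access-token}"
VARIABLE="${VARIABLE:-uptime}"

# Build the request URL (layout assumed from the Spark Cloud REST API).
variable_url() {
    printf 'https://api.spark.io/v1/devices/%s/%s?access_token=%s' "$1" "$2" "$3"
}

# Poll every 5 seconds; log curl's exit status and the HTTP status code.
poll_forever() {
    url="$(variable_url "$DEVICE_ID" "$VARIABLE" "$ACCESS_TOKEN")"
    while :; do
        # -m 10: give up after 10 seconds; -w prints only the HTTP status code
        code="$(curl -s -m 10 -o /dev/null -w '%{http_code}' "$url")"
        status=$?
        echo "$(date '+%Y-%m-%d %H:%M:%S') curl_exit=$status http=$code"
        sleep 5
    done
}
```

Call poll_forever and redirect the output to a file. A curl exit status of 28 means curl itself gave up after 10 seconds; a non-200 HTTP code means the cloud answered but the request did not succeed.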

Either way will work, my test app or your real app both would be good tests

Hey guys,

Firstly, thank you all for your continued effort with this! I was just wondering: can anyone give me a quick summary of this ~350-post topic and tell me where you're at with this horrible bug? I was following the Texas Instruments post but it went silent; I assume it's because Spark is working directly with TI.

Hope I'm not coming across as impatient; there's just a lot of info here and a summary would be great.

Thanks!

I believe I have the same problem. The LED is slowly breathing cyan, but in reality, when I try to call a function via REST, I get a timeout.

Now I'm on @david_s5's branch.

This is my shell script (on OSX) to clone into master_new

rm -rf git_master_new
mkdir git_master_new
cd git_master_new
git clone https://github.com/davids5/core-firmware.git
git clone https://github.com/davids5/core-common-lib.git
git clone https://github.com/spark/core-communication-lib.git
cd core-firmware
git checkout spark_master_new
cd ../core-common-lib
git checkout spark_master_new

My application.cpp shows the uptime on a 16x2 LCD. Let's see what happens... :wink:

Yes - Spark are working directly with TI.

Spark have supplied them with reworked hardware, and they are actively working to reproduce the problem in the simplest possible test case.

The current working hypothesis is that there is an issue related to the ARP cache. Forcing ARP cache spills (e.g. with the python script posted to this thread) seems to generate problems immediately; however the CC3000 is capable of having ARP cache problems without external intervention (this is the mode that most of us experience as CFOD.)

Does it behave this way immediately after a core reboot? Or does it work for some seemingly random period, then start returning a timeout to the web-side interface?

Does your REST call result in any externally observable event (e.g. the D7 LED changing state)?
If so, does the call seem to take effect, even though the web-side interface returns a timeout?

Hi @AndyW, all, after a core reboot I think it works for some time. At least a call to get info about the exposed functions or variables will work for some time. The core is halted and stops working for me - while still breathing cyan - once I call an exposed function.

In my specific case, I call a function that is supposed to update a uint16_t that contains information about the on/off state of a connected NeoPixel board - I have an RGB123 board with 64 RGB LEDs that I am trying to control via 16 channels (4 LEDs per channel). I can call that function once, and I see that it has an effect as the 4 LEDs that I toggle are correctly turned on, but then any subsequent call will time out. Also, the LED on D7 that I toggle in the loop() stops flashing after the first call. All this is while the cyan RGB LED of the core itself is still breathing...

So - to be even more specific - after the first call, all subsequent calls report a timeout. Also, these subsequent calls do not seem to reach the device; nothing happens in terms of the LEDs etc. It seems the core is disconnected and stalled.

The Core's timeout period to hold the connection to the cloud is set to 10 seconds. If it loses the connection, it might take the same time period to realize it and make an attempt to connect again. Does your Core end up going through the sequence of making a connection after the timeout?

@hansamann - there is a failure mode where the web-side interface times out, but simple requests (e.g. a D7,(HIGH,LOW) talking to the tinker app) make it through to the core. It is possible that more complex operations do not complete in the same scenario, so that might explain what you see.

Are you able to gather tcpdumps of the entire life of a core in this scenario - from core reset past several timed-out responses to the web-side interface? If so, I would like to inspect them to see if you are experiencing the same problem as the failure mode I am already working on.
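In case it helps, a minimal sketch of such a capture is below. It assumes the capturing machine can actually see the Core's traffic (for example, it runs on the router or on a mirrored port), and the interface name, IP address and output file are hypothetical placeholders. ARP is included in the filter because the current hypothesis involves the ARP cache.

```shell
#!/bin/sh
# Hypothetical values: replace with your capture interface and the Core's IP.
IFACE="${IFACE:-en0}"
CORE_IP="${CORE_IP:-192.168.1.50}"
OUT="${OUT:-core-cfod.pcap}"

# Build the capture filter: all traffic to or from the Core, plus any ARP.
capture_filter() {
    printf 'host %s or arp' "$1"
}

# Write full packets to a pcap file (-s 0), without name resolution (-n).
capture_core() {
    sudo tcpdump -i "$IFACE" -n -s 0 -w "$OUT" "$(capture_filter "$CORE_IP")"
}
```

Start capture_core, reset the Core, let it run past several timed-out requests, then stop the capture with Ctrl-C and share the resulting pcap file.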

@sjunnesson How is the new code running for you?

Well, not too well. I checked out the firmware, switched branch to "spark_master_new", compiled and loaded it on 3 Cores, and all show the same strange behavior, looping over and over through the following states, sometimes with some variation in the order. But they never breathe cyan as they normally do when connected to the cloud.

Blinking green
Blinking Red
Quick breathing Cyan

Is this what you expect, @david_s5? It is with your default application.cpp. Just for reference, with the default Web IDE firmware all of them connect to the network, so the SSID should be correct.

@david_s5 I just cloned, built and installed your spark_master_new branch and, unfortunately, it's a blinking rainbow. By that I mean, it only breathes cyan for a few seconds here and there. Most of the time, it's blinking green and red at various frequencies, as well as fast cyan.

I see @sjunnesson posted similar results while I was writing this.

Are there some build options for debugging?

I remember that it sometimes just began blinking green again and then connected again. But that was sometimes, not always. For most of my tests, I was able to call the function once, then the core would look (RGB LED) completely normal but no function call would work any more - just timeouts. Also, judging by the D7 LED that I toggle, I believe it really stopped.

My latest code is here: https://community.spark.io/t/adafruit-neopixel-library/1143/104?u=hansamann

What is a bit strange is that for @Bdub it seems to work. The only notable difference here seems to be that I have a Kickstarter RGB123 board, 64 x WS2812B, while I believe his is an Adafruit 8x8 board. But if I understand the NeoPixels correctly, the Spark just sends a signal - e.g. if that is wrong, the worst would be that the RGB LEDs do not light up. But it should never halt the Spark.

@sjunnesson Oh, that doesn't sound good. Let's see if we can troubleshoot it.

How about adding DEBUG_BUILD=y to the make command line? After a make clean, do a make DEBUG_BUILD=y.

If you are running my version of the application.cpp, the debug output will be on the TX/RX 3.3 V pins.

Does it act the same?

@bkize sorry, I forgot to include you in the last post; see above.

World Record: over 100.000 secs and still pulsing on "master_new". Saw it losing the connection and recovering...

Hey all,

Wanted to share a report from Texas Instruments; thanks to @AndyW and @mohit for staying on them like white on rice.

Texas Instruments has realized that this issue is serious, and has sent our support person, Raghavendra, from India (where TI's support staff is based) to Israel (where TI's firmware team is based) to work on this issue with the firmware team. The root cause of CFOD is now on the top of TI's priority list.

Quote from @AndyW:

The bad news: the fix isn't immediately clear, so it's going to take some time for TI to turn this debugging work into a firmware patch. Hopefully the work that many on this thread are doing to protect against failure and ensure the Core and CC3000 recover will act as a stopgap while we wait for the official fix from TI.

Thanks everyone for the great work and support on this issue! We will continue to keep you updated.
