Bug bounty: Kill the 'Cyan flash of death'

After 11 minutes my Core got stuck in a cyan blinking loop, and it took about 2.5 minutes for it to restart, reconnect to the network, and start running the code again.

So I’ve got the fix running on my Core, displaying a countdown on an I2C OLED display.
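For anyone curious, the countdown part is simple; here is a minimal sketch of the idea, assuming a hypothetical drawOled() helper that wraps whatever I2C OLED library you are using (the display driver itself isn't shown):

// Minimal uptime-display sketch (a sketch of the idea, not the actual fix).
// drawOled() is a hypothetical placeholder for your I2C OLED library calls.

unsigned long startMillis;

void drawOled(unsigned long seconds) {
    // Replace with your OLED library calls; Serial is used here as a stand-in.
    Serial.print("Up for ");
    Serial.print(seconds);
    Serial.println(" s");
}

void setup() {
    Serial.begin(9600);
    startMillis = millis();
}

void loop() {
    static unsigned long lastDraw = 0;
    // Redraw once per second without blocking loop().
    if (millis() - lastDraw >= 1000) {
        lastDraw = millis();
        drawOled((millis() - startMillis) / 1000);
    }
}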

So my Core froze up again, this time breathing the cyan LED nicely, but no calls through the Spark Cloud API go through. They all return a 408.

After 10 minutes of no response I pressed reset on the Core, and it connected straight back up and became responsive through the cloud API.

@zach I just downloaded the changed main.h file and rebuilt my firmware. I get further, but my Core still goes into CFoD. This time it took 67 minutes; before it was ~30. Sorry :frowning:

I was able to upload the core-firmware.bin by downloading it from git and then just using dfu-util to flash it over. However, that gives me “Tinker” installed on my Core and not my other simple app (which I know will CFoD within 120 minutes).

I assume that pushing code via a REST call exercises the same pipeline and will revert my Core to the compile-server2 branch as well. So it sounds like if I want to test this out, I need to get the dev environment set up?

Trying to help out here, not add to the burden! :slight_smile:

Dave O

@sjunnesson that actually sounds like a different issue that has to do with blocking code and long delays in the loop().

Is that true of your code?
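For anyone following along, the pattern in question looks something like this; a generic sketch (not anyone's actual app) of the difference between a blocking delay() and a millis()-based check:

// Generic illustration of blocking vs. non-blocking timing in loop().

const unsigned long INTERVAL_MS = 5000;
unsigned long lastRun = 0;

void doWork() {
    // placeholder for whatever the application does each cycle
}

void setup() {
}

void loop() {
    // Blocking style: delay(5000) here would keep loop() from returning for
    // 5 seconds at a time, starving the cloud handling that runs between
    // loop() iterations.

    // Non-blocking style: check elapsed time and fall through otherwise.
    if (millis() - lastRun >= INTERVAL_MS) {
        lastRun = millis();
        doWork();
    }
}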

Yes, that’s correct - to test out firmware-in-progress, you’ll have to build on your own machine.

Of course, as soon as I get 3 Cores up simultaneously to test - running slightly different variations of the code - they stop crashing entirely. I’ve been running for 30+ hours without a crash. Which would be great, except it sounds like the issue remains; I just can’t replicate it any longer…

I no longer have long delays in my code, but I do use a couple of delayMicroseconds() calls of up to 480 microseconds to time a OneWire read each loop. I guess that is too short to affect it. I can slow the reading down to happen only every 500 ms or so to see if that has any effect. I do use a lot of millis() to time things, though, if that has anything to do with it.

Otherwise I’m polling the functions and variables a couple of times every 5 seconds through the API, so maybe the issue lies in there?
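For reference, a rough sketch of what that throttled setup could look like, assuming a hypothetical readOneWireTemp() stand-in for the actual sensor routine and using Spark.variable() to expose the latest value to the API polling:

// Read the OneWire sensor only every 500 ms while exposing the latest value
// over the cloud API. readOneWireTemp() is a hypothetical stand-in for the
// real sensor code (which uses short delayMicroseconds() pulses of up to
// ~480 us internally).

double lastTemp = 0.0;
unsigned long lastRead = 0;

double readOneWireTemp() {
    // ... OneWire bus timing goes here ...
    return 0.0;  // placeholder value
}

void setup() {
    // Registered once; the cloud polls this a few times every 5 seconds.
    Spark.variable("temperature", &lastTemp, DOUBLE);
}

void loop() {
    // Touch the bus every 500 ms instead of on every pass through loop().
    if (millis() - lastRead >= 500) {
        lastRead = millis();
        lastTemp = readOneWireTemp();
    }
}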

One random idea is to put the Cores on a higher-latency network. Try tethering through your phone and see if that affects it.

@sjunnesson good suggestion, I’ll try that.

BTW, there’s a new commit to the core-firmware repositories that speeds up SPI, which may help as well. Pull the latest master of all 3 repositories to test.

This might be totally obvious, but I just wanted to point out that if you are pulling this from Git to build on your local machine, make sure you actually check out the feature/debug-cfod branch before building. I made the mistake of cloning the repo and adding my application before realizing that I hadn’t checked out the correct debug branch.

I now have two cores (that disconnect quite frequently) with latest firmware and SPI updates running my own custom applications. I’ll report back if I notice any change in behavior.

Thanks @crstffr for pointing that out, although the SPI improvements specifically were made to the master branch (since speeding up SPI is something that everyone can get behind).

Though this issue is most likely a firmware issue, I don’t want to completely rule out the Spark Cloud. If anyone is willing to share detailed timing information (don’t forget to specify your timezone) associated with their debugging efforts, that could be handy (email me: joe at spark dot io). I’m seeing a clear periodic pattern in the total number of open files on all of our cloud nodes, with a period of about 8–15 hours. See graph:

Though this periodicity is probably a natural cleanup process conducted by ZeroMQ or the OS itself, it could be contributing to this issue. If it were contributing to the problem, I’d hypothesize that you’d see disconnects at the inflection points, in particular (Central Standard Time):

8am (Jan 15)
10pm (Jan 14)
8am (Jan 14)
11pm (Jan 13)
5am (Jan 13)
4pm (Jan 12)

Again, I’m going out on a limb here with this hypothesis and I think it’s unlikely to be part of the problem, however, given the intense efforts people are making to solve this, I figured I’d share this data in case it helps others debug, correlate timing events, etc.

I had my code freeze up at about 9:05 PM CST (14 Jan), so that doesn’t seem to correlate with these time windows.

Running the same code today, and so far I have had a couple of restarts of the Core but no complete freeze. Hoping it will not freeze too much, since some steaks are being cooked using this code. :slight_smile:

Interesting.

Are those file descriptors web-side, core-side, internal, all the above?

Can you break it down that way at all?

All my tcpdumps of CFOD events show that the cloud continues trying to communicate with a mute core; I see the retries (at the TCP protocol level) stretch out until it finally just abandons the connection. If it were file descriptor exhaustion, I’d expect to see different behaviour.

Of course there could be multiple effects at play here, but I do not believe I’m seeing the CFOD due to cloud resource issues.

@sjunnesson thanks for your feedback, hope your steaks turn out delicious :smile:
@AndyW thanks for sharing the tcpdump info you are seeing.
Both of your accounts help rule this out as a causal factor.

@AndyW, responses to your questions:

Are those file descriptors web-side, core-side, internal, all the above?

Core-side (device and API) and internal (not the web IDE).

Can you break it down that way at all?

Unfortunately, not easily/quickly at this time. I hope to be able to within the next couple weeks.

Your tcpdump information strengthens the hypothesis someone has posed that a cause could be due to the CC3000 not correctly closing the TCP connection with the cloud under CFOD conditions, which then prevents it from retrying, perhaps because some internals think it’s still connected…
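If that hypothesis holds, one band-aid worth trying is to force a reconnect from the application side when the cloud has been unreachable for too long. This is only a rough sketch, assuming Spark.connected(), Spark.disconnect(), and Spark.connect() are available in the firmware build being tested, and it only helps if loop() is still being serviced while the core is flashing cyan:

// Rough reconnect-watchdog sketch (assumes Spark.connected(), Spark.disconnect()
// and Spark.connect() exist in the firmware build under test).

const unsigned long CLOUD_TIMEOUT_MS = 60000;  // arbitrary threshold
unsigned long lastSeenConnected = 0;

void setup() {
}

void loop() {
    if (Spark.connected()) {
        lastSeenConnected = millis();
    } else if (millis() - lastSeenConnected > CLOUD_TIMEOUT_MS) {
        // Tear the connection down explicitly instead of relying on the
        // CC3000 to notice, then ask for a fresh one.
        Spark.disconnect();
        Spark.connect();
        lastSeenConnected = millis();
    }
}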


So if I do:

cd core-firmware
git checkout feature/debug-cfod

Then I’m in the correct branch?

Yes, that is correct.

19 Hours, 46 Minutes, 28 Seconds

Interestingly, now I’m getting the “CFoD” for a couple of minutes, then it switches to red and blinks slowly a couple of times before going back to cyan and repeating this cycle.

To put that time in perspective, it’s on the short end of what I was getting before this fix.