Bug bounty: Kill the 'Cyan flash of death'

Hi all,

A number of you have encountered a bug where the Spark Core will not hold a persistent connection to the Cloud, and does not successfully attempt to re-connect to the Cloud. Here are a few threads on the topic:


NOTE: This is a different issue from an unresponsive Core that still claims to be connected (https://community.spark.io/t/known-issue-long-delays-or-blocking-code-kills-the-connection-to-the-cloud/950). We are separately working on that issue, as noted in the linked thread.

This is a top-priority issue, and our lead embedded developer is still on vacation, so we're offering it up to the community: who wants to take a stab?

There seem to be two separate issues at hand:

  1. A Core with a solid, reliable internet connection will at some point disconnect from the Cloud, often after hours of being successfully connected.
  2. While the Core should automatically re-connect, it does not do so, and instead stays in a persistent state of "flashing cyan" until it is manually reset. (NOTE: This is in contrast with disconnects that happen for other reasons; for instance, when we re-boot the Cloud, Cores automatically re-connect without issue.)

These issues probably stem from our implementation of the CC3000 host driver.

We are issuing the following bug bounty:

  • If you (1) make an improvement to our firmware to fix the above issues and kill the 'cyan flash of death', (2) send your improvements in as a pull request on GitHub, and (3) your pull request is accepted by the team, we will send you a $100 credit for Spark stuff (Cores, accessories, or kits), plus a Spark T-shirt to share your love!
  • If multiple pull requests are sent, the first that we accept will get the bounty. If we accept other pull requests for incremental improvements, they will receive a free Spark Core, up to a limit of 5 pull requests. Only one bounty per person.
  • If you make any other unrelated improvements to our firmware that specifically relate to improving reliability of the connection to the Cloud that we accept as a pull request (same 3-step process as above), we will also send a free Spark Core, up to a limit of 5. Only one bounty per person.

NOTE: You will need to sign our Contributor License Agreement for us to accept your pull requests.

Repositories below! Thanks for your help, and if you have any questions, please let me know.




The latest commit on core-firmware's master branch should fix the above-mentioned issue:


Very very promising!

My Core has now been running the default application for over 1.5 hours (still running and responding perfectly to the iOS app) instead of about 10 minutes!

Quite happy,

Frido.


@Frido Did you flash your core manually or through the web IDE? I was under the impression that the code pushed out over the cloud comes from the compile-server2 branch, which does not yet have this fix.

Pulled from git and compiled manually.


When can we expect this fix to reach Cores that are flashed from the Cloud?

I'm just checking for a "go ahead" to make sure the latest master is good on the three firmware repos, and then we can fold those updates into the compile server branch. :)


When this happens, should we see the compile-server2 branch merge with another branch? (https://github.com/spark/core-firmware/network).

I'm wondering because I uploaded a sketch and it took a long time to update, with a different sequence of lights than I was accustomed to (I was assuming it was because there was a firmware update). And now my Core has been running the unchanged sketch for over 6 hours with no issues, all the while polling for values every minute. This is by far a new record for me, never having eclipsed the 2-hour mark while polling for values.

I was attributing this to the fixes mentioned above, but they appear to not be deployed into the cloud version yet. Having a firmware version reported via the "list devices" REST call or from a dedicated call would seem to be helpful in this regard.

Thanks for any information you can provide. (But whatever the case, I'm digging the fact that my Spark has been running for so long!)

Dave O

Even with the CC3000 patch applied and self-compiled firmware containing your fix, I still get the 'Cyan flash of death' after some time.

1st run up to approx 2400 sec
2nd run up to 5619 sec

After doing a reset it connects right away!

Cheers,

Frido.

I can confirm that these changes do not fix CFOD.

Any updates on this issue? I have 130 dollars' worth of steaks planned to go into my Spark-controlled sous vide on Tuesday and would hate to have my Core die halfway through the cook session.


My last updates went in the "Dropping the connection to spark cloud" thread.

I have an SPI analyser hooked up, and am now working to get it to trigger on CFOD.


Something I've noticed is that the Core seems to go into the CFoD when I'm actively using the internet. Once I go to sleep, my Core will stay connected for 5 to 8 hours. If I'm actively using the internet, I normally don't make an hour. I've got a little script set up to run an uptime clock.

int sec = 0;
int min = 0;
int hour = 0;

void setup() {
    
    Wire.begin();

}

void loop() {

    Wire.beginTransmission(0x4E);
    Wire.print("CL");
    Wire.print("SF");
    Wire.write(18);
    Wire.print("TT");
    Wire.print("Cloud Uptime");
    Wire.write(0x00);
    Wire.endTransmission();
    Wire.beginTransmission(0x4E);
    Wire.print("SF");
    Wire.write(51);
    Wire.print("TRT");
    Wire.print("TT");
    Wire.print(hour);
    Wire.print(":");
    Wire.print(min);
    Wire.print(":");
    Wire.print(sec);
    Wire.write(0x00);
    Wire.endTransmission();
    
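    // Advance the clock by one second per pass through loop().
    sec = sec + 1;

    // Carry seconds into minutes.
    if (sec == 60) {
        sec = 0;
        min = min + 1;
    }

    // Carry minutes into hours.
    if (min == 60) {
        min = 0;
        hour = hour + 1;
    }

    // Wait just under a second; the I2C writes above use up part of the rest,
    // so the displayed time is only approximate.
    delay(950);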

}

I am also experiencing the CFoD. I typically get 20-30 mins of activity before my Core goes into CFoD. I was having very limited success when flashing over the cloud, so I have pulled the sources locally and now flash over USB. As of my build @9:33PM MT today the CFoD still exists.

Thanks guys for all the feedback; it does seem like the issue remains, and we're continuing to work on it. Anyone in the community working on this issue to earn the bug bounty?

Well, from my perspective, I am looking at the sources; however, I am clueless about where to start. You mention the CC3000 driver, but I have not yet located where the state is tested and ends up with the cyan flashing. Maybe if you could provide some more critical transition points (source files, function calls), we could try to isolate where this is happening.

I can tell you that I work from a DSL connection (1.5 Mbps). Trying to flash from the cloud was successful only 1 out of every 5 times, and almost never worked unless I completely reset the Core with the mode / reset buttons, holding until the white flashing light. If I tried to reflash after I had already flashed my custom code, it would typically fail. I have seen all of the cases: flashing red, no light, solid red, or the Core simply not updating with the new firmware.

To answer your question: I think there may be interest in this community, and the bounty is attractive; however, limited knowledge of the code may restrict what people can do. Outlining the functions that you suspect would be a start, at least for learning the code.

I wonder if anyone has experienced the CFoD using a local server? If this problem is related to network latency and / or internet connectivity, then having a local server should help narrow down the problem space.

This thread is awesome. :)

Based on the descriptions, and on reading over the code, it sounds like the problem could be in the wlan state machine. Some flag gets set at the wrong time and it can't free itself after that. Maybe we could add a sanity timeout, e.g. if the Core thinks it started reconnecting and hasn't succeeded after ~10-30 seconds, it should bail and retry? Thoughts?
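
To make the sanity-timeout idea concrete, here is a rough sketch under stated assumptions. `LinkState` and `ReconnectWatchdog` are hypothetical names invented for illustration; the real wlan state machine in the firmware is not shown in this thread and may look quite different. The watchdog only decides when a reconnect attempt has run too long; what "bail and retry" actually does is left to the caller.

```cpp
#include <cstdint>

// Hypothetical states; the firmware's actual wlan states are an assumption.
enum class LinkState { Connected, Reconnecting };

// Hypothetical watchdog that flags a reconnect attempt that runs too long.
class ReconnectWatchdog {
public:
    explicit ReconnectWatchdog(std::uint32_t timeoutMs)
        : timeoutMs_(timeoutMs) {}

    // Call once per main-loop pass with the current state and a millisecond
    // tick. Returns true when the current reconnect attempt has lasted at
    // least timeoutMs and the caller should abort it and start over.
    bool update(LinkState state, std::uint32_t nowMs) {
        if (state != LinkState::Reconnecting) {
            active_ = false;  // not reconnecting: nothing to time
            return false;
        }
        if (!active_) {
            active_ = true;   // reconnect just started: remember when
            startMs_ = nowMs;
            return false;
        }
        // Unsigned subtraction stays correct across tick-counter wraparound.
        if (static_cast<std::uint32_t>(nowMs - startMs_) >= timeoutMs_) {
            startMs_ = nowMs; // restart the timer for the next attempt
            return true;
        }
        return false;
    }

private:
    std::uint32_t timeoutMs_;
    std::uint32_t startMs_ = 0;
    bool active_ = false;
};
```

With a 20-second timeout, for instance, `update` would return true once per 20 seconds for as long as the state stays `Reconnecting`, which gives the caller a regular chance to tear down and retry instead of flashing cyan indefinitely. The timeout value and the polling approach are design assumptions, not something confirmed by the firmware.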

Please correct me if Iā€™m wrong!

Thanks,
David

That's a good idea. When developing locally I haven't noticed the CFoD, but I also don't typically leave it up for hours at a time, since we test changes like that on staging. I generally leave my dev Core pointed at either our staging Cloud or production Cloud for most of the day while I'm developing.

@zach Is it possible to add a disable cloud feature (not Wifi) in the interim so we can have usable devices until this issue is resolved?