Dropping the connection to the Spark Cloud

It definitely should be rock solid. Thanks for the feedback, @masto!

I reflashed the f/w last night with unchanged code (see below) but told my Linux box to stop polling for information every minute. I also did NOT have a serial console open. My core has been running for almost 12 hours without issue. I just opened up a serial console and will see if that lasts for over 2 hours (my previous record). If so, then I'll re-enable my cron job and start polling for values and see how that goes.

It may be that requesting info from the core every minute increases the likelihood of a network issue. However, I thought I read somewhere that the Spark firmware was already pinging the cloud every 15 seconds or so. I'll report back with what I find. I'm hoping this is helpful.

@Dave - You mentioned a "bug where the core doesn't always reconnect on its own". Just to be clear, once I get the blue-flash-of-death, my core never reconnects without a physical reset.

Dave O

 int uptime, data;

const int LED=D7;
const int ANALOG=D1;

const unsigned long DELAY_TIME = 10000;   // number of milliseconds 

unsigned long delayTimer = 0;

void setup() {

    Spark.variable("uptime", &uptime, INT);
    Spark.variable("data", &data, INT);
    
    pinMode(LED, OUTPUT);
    Serial.begin(9600);

    delayTimer = millis ();
}

void loop(void) {
    
    if (millis () - delayTimer > DELAY_TIME)    // Run this function every DELAY_TIME milliseconds
    {
        // Turn on the LED when reading temperature - should be every DELAY_TIME
        digitalWrite(LED, HIGH);
        
        delay (2000);                           // Tossing this in as a "reasonable" delay that a temperature reading might encounter
        
        digitalWrite(LED, LOW);
  
        delayTimer = millis ();                 // update delayTime to new value of millis ()
        uptime     = delayTimer / 1000;         // keep track of how many *seconds* the spark has been running.
        
        Serial.print   ("uptime is ");
        Serial.print   (uptime);
        Serial.print   (" seconds, ~");
        Serial.print   (uptime / 60);
        Serial.print   (" minutes   ");
        
        
        data = analogRead(ANALOG);
        
        Serial.print   ("Data from ANALOG = ");
        Serial.println (data);
    }
    
}

My core starts up fine and gets to breathing cyan. I'm able to use Tinker via Android to switch or check any IO a few times (between two and 20 times), then I get "we have encountered an error. Please check that your core is online", but the core is still breathing. Less than 30 seconds later I get flashing. Sometimes it restarts itself, sometimes I have to physically reset it. IO pins retain their last state until a reset or until comms with the Android app are regained.

Connected via a single rock-solid ASUS RTN66U router with WPA2.

Really interesting results from my tests above. Opening the serial console didn't appear to have any effect, and the core continued running just fine for 4 more hours. I re-enabled my cron job, which polls the core API once a minute for the "uptime" and "data" variables from the above firmware/sketch.

The script successfully polled values 3 times from the API (i.e., 3 consecutive minutes) and then my core went to the blue-flash-of-death. Based on the above, it seems clear that something is going on between the core and the cloud API when values are being polled.
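
In case it helps anyone reproduce this, the polling amounts to a couple of variable GETs against the cloud API once a minute, something along these lines (DEVICE_ID and ACCESS_TOKEN are placeholders, and the exact URL form is only a sketch, so check it against the docs):

    # rough sketch of the once-a-minute poll; DEVICE_ID and ACCESS_TOKEN are placeholders
    curl "https://api.spark.io/v1/devices/DEVICE_ID/uptime?access_token=ACCESS_TOKEN"
    curl "https://api.spark.io/v1/devices/DEVICE_ID/data?access_token=ACCESS_TOKEN"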

I've rebooted the core and will let it chug for at least 2 hours with polling disabled. I'll then re-enable the API polling and see if I can reproduce the results. I'm going to yank the 2000 ms delay from the sketch as well.

If there are other things you'd like me to try, please let me know. Otherwise I'll blunder along here. :slight_smile:

Dave O

I'm not sure throwing anything else into the mix will help; what's worse, I'm not sure I have any new info other than that the drop of the connection to the cloud only appears to happen after regular polling (go figure :slight_smile: ). I'm not seeing any correlation with the poll interval either: a 5-second poll or a 1-minute poll results in failures in roughly the same time span.

Yanked the delay from the code and let it run with no polling for 131 minutes. I then enabled polling and, surprisingly, it stayed alive for 157 minutes before the BFOD (blue-flash-of-death). So perhaps removing the 2000 ms delay has reduced the window of opportunity for some sort of race condition, but it definitely still exists.

Dave O

@Dom @dorth thank you for clarifying this issue. It sounds like there could be some kind of memory leak in the firmware, associated with the read-variable functionality, that causes it to crash after a particular number of attempts. I'll use your input here, see if I can replicate the issue, and generate some debugging output that can help us resolve it. Will post what I find here.

I didn't want to add to this already huge thread, but I really think this polling is a red herring. My core goes into the blue flash of death running nothing but a simple blink sketch. I haven't gotten to the point of trying any more complicated programs until this gets resolved, but if there's anything I can do to supply additional information/debugging, I'd be glad to. So many projects await!

I have a 100% correlation between the BFOD and polling. My core will run the sketch for at least 15 hours (so far) with no issues. Once I start polling, it will BFOD in less than 120 minutes, and sometimes much less.

So we're apparently suffering from different issues but with the same result. Just wanted to clarify that. :slight_smile:

Dave O

Hopefully just another data point here, and to support masto's comment... I just have a simple 'blink the user LED' sketch running once per second; no polling or writing data out, just blink. The best I can get is maybe 5 minutes, then it will do a retry and usually recover; however, I have been seeing the BFOD more now and have to reset to get another 5 minutes.

Just to throw another irrelevance in, I have a core which connects and then starts blinking blue after 6 to 30 hours. The core has never been programmed and is not physically connected to anything; it is new out of the box and was set up with the Android app.

I understand interrupts will increase the likelihood of the Blue Blink of Death, but there is obviously a fundamental issue. I'm sure the Spark Core folks will nail this. Just my $0.022 (at current exchange rates).

Any ETA on a fix for these issues? They render the core pretty useless, and I'm desperate to deploy some projects :wink:

I've tested mine with a stock firmware app, a custom app with variables, and a custom app with functions only, and all 3 variants go to Blue Blink after seemingly random durations.

It's a pretty complex fix; I would throw out an estimate of two weeks, but don't quote me on it, because it depends on whether we run into any roadblocks.

That may be one failure mode, but my BFOD scenario is just:

while true
do
    curl <blah blah blah> D7,HIGH
    sleep 5
    curl <blah blah blah> D7,LOW
    sleep 5
done
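
(The <blah blah blah> is just the usual cloud function-call boilerplate; a hypothetical expansion, assuming Tinker's digitalwrite and with DEVICE_ID and ACCESS_TOKEN as placeholders, would look roughly like this:)

    # hypothetical expansion of one of the curl calls above;
    # DEVICE_ID and ACCESS_TOKEN are placeholders
    curl https://api.spark.io/v1/devices/DEVICE_ID/digitalwrite \
         -d access_token=ACCESS_TOKEN \
         -d params=D7,HIGH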

If it helps, I'd just like to add my experience here (I've mentioned it elsewhere too).

I have two cores running 24/7. One does not use any network connectivity and its loop runs continuously, checking button states. The other uses the cloud service to receive messages (about 3 or 4 every 24 hours). Both go down daily and require a power cycle to come back up. I've noticed that the one that does no network activity, but whose loop iterates more often, goes down quicker. Not sure if that is related. Hope this helps.

I think any code that uses any sort of blocking process will drop the connection quicker. I've noticed that I2C use absolutely kills the connection.

When will the fix on GitHub be picked up by the web firmware loader?

BTW, my core has been up now for over 2 days solid; sod's law, isn't it :slight_smile:

I'm testing firmware with the Set_NetApp_Timeout() fix right now.

The longest I've ever run without failure before is ~18 hours; I'll report back in 24.

For those who are trying to manually fix this issue, get the latest master branch of both "core-common-lib" and "core-firmware" before building and loading the firmware onto the Core. The fix should also be made available soon via the cloud server.
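
Roughly, that means something like the following. This is only a sketch that assumes the repos are already cloned side by side and that the ARM GCC toolchain and dfu-util are installed, so adjust paths and options for your own setup:

    # pull the latest master of both repos (assumes they are cloned side by side)
    cd core-common-lib && git pull origin master && cd ..
    cd core-firmware && git pull origin master
    # rebuild and flash over USB with the Core in DFU mode (flashing yellow)
    cd build && make
    dfu-util -d 1d50:607f -a 0 -s 0x08005000:leave -D core-firmware.bin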

Sadly, the single Set_NetApp_Timeout() fix does not fix the Cyan Flash Of Death.

I now plan to try and probe the SPI bus between the STM32 and the CC3000 and see if that yields any clues.