Definitely should be rock solid. Thanks for the feedback, @masto!
I reflashed the f/w last night with unchanged code (see below) but told my Linux box to stop polling for information every minute. I also did NOT have a serial console open. My core has been running for almost 12 hours without issue. I just opened up a serial console and will see if that lasts for over 2 hours (my previous record). If so, then I'll re-enable my cron job and start polling for values and see how that goes.
It may be that requesting info from the core every minute increases the likelihood of a network issue. However, I thought I read somewhere that Spark was pinging something every 15 seconds or so. I'll report back with what I find. I'm hoping this is helpful.
@Dave - You mentioned a "bug where the core doesn't always reconnect on its own". Just to be clear, once I get the blue-flash-of-death, my core never reconnects without a physical reset.
Dave O
int uptime, data;
const int LED = D7;
const int ANALOG = D1;
const unsigned long DELAY_TIME = 10000; // number of milliseconds
unsigned long delayTimer = 0;

void setup() {
    Spark.variable("uptime", &uptime, INT);
    Spark.variable("data", &data, INT);
    pinMode(LED, OUTPUT);
    Serial.begin(9600);
    delayTimer = millis();
}

void loop(void) {
    if (millis() - delayTimer > DELAY_TIME) // Run this block every DELAY_TIME milliseconds
    {
        // Turn on the LED when reading temperature - should be every DELAY_TIME
        digitalWrite(LED, HIGH);
        delay(2000); // Tossing this in as a "reasonable" delay that a temperature reading might encounter
        digitalWrite(LED, LOW);
        delayTimer = millis(); // update delayTimer to the new value of millis()
        uptime = delayTimer / 1000; // keep track of how many *seconds* the spark has been running.
        Serial.print("uptime is ");
        Serial.print(uptime);
        Serial.print(" seconds, ~");
        Serial.print(uptime / 60);
        Serial.print(" minutes ");
        data = analogRead(ANALOG);
        Serial.print("Data from ANALOG = ");
        Serial.println(data);
    }
}
My core starts up fine and gets to breathing cyan. I'm able to use Tinker via Android to switch or check any IO a few times (between 2 and 20 times), then I get: "we have encountered an error. Please check that your core is online", but the core is still breathing. Less than 30 seconds later I get flashing. Sometimes it restarts itself, sometimes I have to physically reset it. IO pins retain their last state until a reset or until comms with the Android app are regained.
Connected via a rock-solid single ASUS RTN66U router with WPA2.
Really interesting results from my tests above. Opening the serial console didn't appear to have any effect and the core continued running just fine for 4 more hours. I re-enabled my cron job, which polls the core API once a minute for the "uptime" and "data" variables from the above firmware/sketch.
The script successfully polled values 3 times from the API (i.e. 3 consecutive minutes) and then my core went to the blue-flash-of-death. Based on the above, it seems clear that something is going on with the interface to the cloud API and the core when values are being polled.
I've rebooted the core and will let it chug for at least 2 hours with polling disabled. I'll then re-enable the API polling and see if I can reproduce the results. I'm going to yank the 2000 ms delay from the sketch as well.
If there are other things you'd like me to try, please let me know. Otherwise I'll blunder along here.
Dave O
I'm not sure throwing anything else into the mix will help. What's worse, I'm not sure I have any new info, other than that the drop of the connection to the cloud only appears to happen after regular polling (go figure). I'm not seeing any correlation with the polling interval either. A 5 second poll or a 1 minute poll results in failures in roughly the same time span.
Yanked the delay from the code and let it run with no polling for 131 minutes. I then enabled polling and surprisingly it stayed alive for 157 minutes before BFOD (blue-flash-of-death). So perhaps the removal of the 2000 ms delay has reduced the window of opportunity for some sort of race condition, but it definitely still exists.
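For anyone who wants to try the same change without losing the LED pulse, here is a minimal sketch of one way the 2000 ms delay() could be replaced with a non-blocking timer. It is not the code I ran. LedPulse and its members are made-up names for illustration, and it assumes a 32-bit millisecond counter like the one millis() returns. One difference from the sketch above: the period here is measured from the start of each pulse, not from the end of it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the Spark firmware): decides when the LED
// should be lit, using only the timestamps the caller passes in.
class LedPulse {
public:
    LedPulse(uint32_t periodMs, uint32_t onTimeMs, uint32_t startMs)
        : periodMs_(periodMs), onTimeMs_(onTimeMs), lastStartMs_(startMs) {
        assert(onTimeMs < periodMs); // the pulse must fit inside one period
    }

    // Call on every pass through loop() with the current millis() value.
    // Returns true while the LED should be on. Never blocks.
    bool update(uint32_t nowMs) {
        // Unsigned subtraction stays correct when the counter wraps around.
        if (!active_ && nowMs - lastStartMs_ > periodMs_) {
            active_ = true;          // start a new pulse
            lastStartMs_ = nowMs;
            ++pulses_;
        } else if (active_ && nowMs - lastStartMs_ >= onTimeMs_) {
            active_ = false;         // pulse has lasted long enough
        }
        return active_;
    }

    // Number of pulses started so far.
    uint32_t pulses() const { return pulses_; }

private:
    uint32_t periodMs_;
    uint32_t onTimeMs_;
    uint32_t lastStartMs_;
    bool active_ = false;
    uint32_t pulses_ = 0;
};
```

In a sketch this would be driven from loop() with something like digitalWrite(LED, pulse.update(millis()) ? HIGH : LOW), so loop() returns right away instead of sitting in delay().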
Dave O
@Dom @dorth thank you for clarifying this issue. It sounds like there could be some kind of memory leak in the firmware associated with the read-variable functionality that causes it to crash after a particular number of attempts. I'll use your input here and see if I can replicate the issue and generate some debugging output that can help us resolve it. Will post what I find here.
I didn't want to add to this already huge thread, but I really think this polling is a red herring. My core goes into the blue flash of death running nothing but a simple blink sketch. I haven't gotten to the point of trying any more complicated programs until this gets resolved, but if there's anything I can do to supply additional information/debugging, I'd be glad to. So many projects await!
I have a 100% correlation between the BFOD and polling. My core will run the sketch for at least 15 hours (so far) with no issues. Once I start polling, it will BFOD in less than 120 minutes, and sometimes much less.
So we're apparently suffering from different issues but with the same result. Just wanted to clarify that.
Dave O
Hopefully just another data point here, and to support masto's comment... I just have a simple "blink the user LED" once per sec. No polling or writing data out, just blink. The best I can get is maybe 5 minutes, then it will do a retry and usually recovers. However, I have been seeing the BFOD more now and have to reset to get another 5 minutes.
Just to throw another irrelevance in, I have a core which connects then will blink blue after 6 to 30 hours. The core has never been programmed and is not physically connected to anything. New out of the box and set up with the Android app.
I understand interrupts will increase the likelihood of the Blue Blink of Death, but there is obviously a fundamental issue. I'm sure the SparkCore folks will nail this. Just my $0.022 (at current exchange rates).
Any ETA on a fix for these issues? They render the core pretty useless and I'm desperate to deploy some projects.
I've tested mine with a stock firmware app, a custom app with variables, and a custom app with functions only, and all 3 variants go to Blue Blink after seemingly random durations.
It's a pretty complex fix; I would throw out an estimate of two weeks, but don't quote me on it, because it depends on whether we run into any roadblocks.
be one failure mode, but my BFOD scenario is just:
while true
do
    curl <blah blah blah> D7,HIGH
    sleep 5
    curl <blah blah blah> D7,LOW
    sleep 5
done
If it helps, I'd just like to add my experience here (I've mentioned it elsewhere too).
I have two cores running 24/7. One does not use any network connectivity and its loop runs continuously, checking button states. The other uses the cloud service to receive messages (about 3 or 4 every 24 hours). Both go down daily and require a power cycle to come back up. I've noticed that the one that does no network activity, but whose loop iterates more often, goes down quicker. Not sure if it is related. Hope this helps.
I think any code that uses any sort of blocking process will drop the connection quicker. I've noticed I2C use absolutely kills the connection.
When will the fix on GitHub be picked up by the web firmware loader?
BTW - my core has been up now for over 2 days solid. Sod's law, isn't it?
I'm testing firmware with the Set_NetApp_Timeout() fix right now.
The longest I've ever run without failure before is ~18 hours. I'll report back in 24.
For those who are trying to manually fix this issue, get the latest master branch of both "core-common-lib" and "core-firmware" before building and loading the firmware onto the Core. The fix should also be made available soon via the cloud server.
Sadly, the single Set_NetApp_Timeout() fix does not fix the Cyan Flash Of Death.
I now plan to try and probe the SPI bus between the STM32 and the CC3000 and see if that yields any clues.