Cyan breathing as a watchdog timer to do a software reset

My Core has been dropping off my local net once or twice a day. I cannot get a response from it. It’s still “breathing” cyan. I need to push the reset button, then the Core reconnects and it’s fine for a while.

What is the underlying mechanism of the breathing? I suspect it’s an interrupt. Can I duplicate the process to create a watchdog that tests for a network connection, or periodically reset the Core via software?

I think a software reset is possible. Is it?

I have a similar problem with the beehive monitor. Once a week or so it fails to make a good WiFi connection when it makes its hourly wake from deep sleep and sits there inaccessible and doing very little. If I don’t notice, the Spark runs its batteries flat in a few hours. If I notice I can power cycle and everything springs back to life. Since both alternatives involve putting on a bee suit I’d rather have a watchdog recognize I’m stuck and try a reboot. Is there any news on the watchdog timer front?

Phec,
try this:

unsigned int testConnectCount = 0;

void loop() {
    if (testConnectCount == 0) {        // Every couple thousand or so loops, check to see if the Spark's connected
        if (!Spark.connected()) {       // If the Spark's not connected, turn off wifi, sleep 10 seconds and turn it on again
            Spark.sleep(10);
            while (WiFi.status() != WIFI_ON) delay(1000000); // Wait a second and check status
        }
        testConnectCount++;
    } else {
        testConnectCount = testConnectCount == 2000 ? 0 : testConnectCount + 1;
    }
    ...
}

I put it in last night and haven’t had to physically reset the Core since. In my case, I wait in a “while()” loop until the Spark reconnects. This is fine for my application. You might want to do something different.

Sorry about the code legibility, I don’t know how to make it look nice.

I just read this..

Looks like delay is blocking, plus delay is milliseconds, not microseconds, so, I don't think delay should be in there.

Still, I wonder why it keeps my Core running?

@ronm, delay() does NOT block the background task anymore (for quite a while actually). It will “appear” to block the user code but not the background task. :smile:

Thanks for the clarification. It must have been an old post, I didn’t look at the date. Then the code is working as I expected, except it’s sitting around about 1000x longer than I thought.

@ronm, that may be because you use delay(1000000), which is 1 million milliseconds!! :stuck_out_tongue: It should be delay(1000) :smile:
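
Even with delay(1000), the while() loop in the snippet above waits forever if WiFi never comes back. A minimal sketch of a bounded wait is shown here. The helper name wait_until and its parameters are hypothetical and not part of the Spark firmware. The clock and sleep functions are passed in so the logic can be checked off-device; on a Core they would presumably wrap millis() and delay().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: polls ready() every poll_ms until it returns true
// or timeout_ms has elapsed. Returns true on success, false on timeout.
template <typename Ready, typename Now, typename Sleep>
bool wait_until(Ready ready, Now now_ms, Sleep sleep_ms,
                uint32_t timeout_ms, uint32_t poll_ms) {
    const uint32_t start = now_ms();
    while (!ready()) {
        // Unsigned subtraction stays correct across a millis() rollover.
        if (now_ms() - start >= timeout_ms) return false;
        sleep_ms(poll_ms);
    }
    return true;
}

// Usage with a fake clock standing in for millis() and delay().
void demo_wait_until() {
    uint32_t t = 0;
    auto now = [&] { return t; };
    auto sleep = [&](uint32_t ms) { t += ms; };

    // Condition becomes true after 3 seconds: the wait succeeds.
    assert(wait_until([&] { return t >= 3000; }, now, sleep, 30000, 1000));

    // Condition never becomes true: the wait gives up after 5 seconds.
    t = 0;
    assert(!wait_until([] { return false; }, now, sleep, 5000, 1000));
}
```

When the wait returns false, the caller can decide what to do next (sleep again, or reset) instead of sitting in the loop with the radio on.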

@ronm Thanks - I had overlooked the Spark.connected() test. That is just what I want to trap with my watchdog.
I gave your code a try. I modified it slightly as follows:

void loop() {
    //check that we are connected before proceeding - suggested by ronm :)
    if (!Spark.connected()) {
        digitalWrite(led, HIGH);
        Spark.sleep(10);                               //wifi off for a bit
        while (WiFi.status() != WIFI_ON) delay(1000);  //wait for wifi to resume
        delay(5000);                                   //allow time for re-connection
        digitalWrite(led, LOW);
    }
    else {
        if (udp.parsePacket() > 1) {

I put in a 5 sec delay and test for a connection on every pass through the loop rather than every 2000 passes. If I do have a connection I only go 4 times round the loop before going back into deep sleep, and if I don’t have a connection I’m round the loop in microseconds. Maybe I need to give the Spark a bit longer than 5 secs to connect here - we’ll see.
I’ll let you know how this version goes. One plus is that if I don’t have a connection at least the Wifi is off for 2/3 of the time so the battery won’t go flat so fast.
Your code toggles the WiFi if the connection is lost. If this doesn’t improve the connection reliability I wonder whether I could connect one of the digital pins to the reset pin with a 10k resistor and pull it down if I have a long interval with no connection.

I’ve been having trouble with Spark.connected(). It seems my Spark is hanging on a TCP read or write. Spark.connected() is still true, but TCP is hung. I can’t get status on the client or server, so I’m a little stuck. Since the Spark is still connected I can reload the firmware, effectively rebooting it. I’m currently trying to get the CLI on my Pi to reboot the Spark when the TCP comms go dead.

@ronm did you get a lasting solution to this? Unfortunately, after a week or so it turned out that my variant of your code didn’t cure my hanging problem. Toggling WiFi doesn’t seem to be a “deep enough” reset and of course the trouble may lie at the router end. I have just tried another variant:

void loop() {
    if (millis()-startT > 30*60*1000){  //if Spark has been awake for more than 1/2 hr (i.e. stuck) send to sleep for 30 secs to reset
        Spark.sleep(SLEEP_MODE_DEEP, 30);
    } 
    if (udp.parsePacket()>1){
        ...
        ...
        startT = millis();
        ...

startT is set to millis() in setup() and on a successful UDP communication.
Correct operation is for the Spark to wake up, listen for UDP instructions and, after transmitting some measurements, be sent to sleep by the final instruction. If it is still awake after 1/2 hour it has got stuck somehow and needs to be reset. Since toggling WiFi didn’t help, I’m trying 30 secs of deep sleep, which seems to be the closest thing to a software reset.
So far so good, but I need a couple of weeks’ running to check.

I’m having a similar problem with TCP hanging up periodically.

It’s a bit hacky but if we can’t find a way to test for the problem in code perhaps just doing a Spark.reset() every 12 hours or so would be a workaround for now.

@mrOmatic - is that new? I couldn’t find a software reset command before. How do you call Spark.reset()? When I tried from the IDE I get:
error: 'class SparkClass' has no member named 'reset'
and I can’t find a software reset in the documentation. On the Arduino I’m told the hack is to define a function pointer with the address 0 and call that.

It’s discussed here https://community.spark.io/t/how-to-perform-software-reboot-solved/6332/2

Might need the latest github code and build locally, I’ll investigate over the weekend.

Thanks - I can only program the bee Spark remotely so that would be useful.
Edited: Yes, I’ve checked. Spark.reset() is in spark_utilities.cpp but not in the IDE yet. However, NVIC_SystemReset(); does exactly the same job and can be called from the IDE.
Edit 2: System.reset() is in the IDE and also does the same thing.

Haven’t had a chance to try it on a core yet but from my up to date build environment I can confirm that System.reset(); compiles ok and Spark.reset(); fails.

I’ll do an actual test on a Core when I get a moment.

I am using System.reset() using the IDE and can confirm that it works.

I’ve made a small webserver, but the Spark Core is far from stable. Most often the Core stops functioning properly within ten or twenty minutes, and it takes a long time to reset by itself.

To get more insight into its behavior I have added a watchdog function to my loop() which checks every second to see when it was last called. If the function was called more than three seconds ago, it prints a warning message to the serial connection and to a logfile on SD. If it was called more than ten seconds ago, it also prints the message (which mentions the actual duration of the outage) and subsequently calls the System.reset function.
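
The check described above might look roughly like the following. This is only a sketch of the idea, not the poster's actual code: the names LoopWatchdog and WatchdogAction are hypothetical, and the 3 s and 10 s thresholds are taken from the description. The current time is passed in so the logic can be exercised off-device; on a Core the caller would presumably pass millis(), log the gap, and call System.reset() when Reset is returned.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names, used for illustration only.
enum class WatchdogAction { None, Warn, Reset };

struct LoopWatchdog {
    uint32_t last_seen_ms;    // time of the previous check
    uint32_t warn_after_ms;   // gap that triggers a warning (e.g. 3000)
    uint32_t reset_after_ms;  // gap that triggers a reset (e.g. 10000)

    // Call from loop() with the current time. Stores the measured gap in
    // *gap_ms_out (if non-null) so it can be logged.
    WatchdogAction check(uint32_t now_ms, uint32_t* gap_ms_out) {
        // Unsigned subtraction stays correct across a millis() rollover.
        const uint32_t gap = now_ms - last_seen_ms;
        last_seen_ms = now_ms;
        if (gap_ms_out) *gap_ms_out = gap;
        if (gap > reset_after_ms) return WatchdogAction::Reset;
        if (gap > warn_after_ms) return WatchdogAction::Warn;
        return WatchdogAction::None;
    }
};

// Usage with hand-picked timestamps in place of millis().
void demo_watchdog() {
    LoopWatchdog wd{0, 3000, 10000};
    uint32_t gap = 0;
    assert(wd.check(1000, &gap) == WatchdogAction::None);    // 1 s gap
    assert(wd.check(5000, &gap) == WatchdogAction::Warn);    // 4 s gap
    assert(wd.check(65880, &gap) == WatchdogAction::Reset);  // 60.880 s gap
    assert(gap == 60880);
}
```

Note that a check like this runs inside loop(), so it can only report an outage after control returns to user code. It cannot interrupt a blocked call.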

In the past days my logfile has logged many of these watchdog resets. I’ve noticed that most of these watchdog resets are preceded by a 60 second outage. Not 57 seconds, not 62 seconds, always 60 seconds. To be more precise, it also prints the milliseconds, and then I get these kinds of outage durations: 60.0, 60.880, 60.135, 60.679, etc. In other words: these outages always took 60 seconds (rounded to the second) and then control was returned to my loop. On other occasions I sometimes also get 20 second outages, but those are far less frequent.

I can only conclude that there is a 60 second blocking time-out somewhere in the Spark Core firmware code or perhaps the WiFi driver, which after time-out returns control to the user code. In earlier versions my server would just not respond for long times and sometimes reset automatically (I guess as instigated by the Spark firmware).
Unfortunately this instability makes the Spark Core unusable for my production environment. Hopefully the firmware will be improved soon.

To investigate a bit more, I went back to the beginning today. I took the sample blink program, added a tiny beep (using an active piezo buzzer) and adjusted it to blink three times, followed by a beep every second. The loop contains nothing but these blinks and the beep. All delays within the loop total one second and all other code is only the digitalWrite of either the blue LED or the buzzer. In summary: this code should give a regular beep every second, accompanied by three blinks.
I found that when running this code (not containing any TCP calls) the stability of the Core was much better, but the beep was still not as regular as it should be. It sometimes skipped a second or more, I guess when executing Spark code.
I didn’t keep it running for a very long time, but long enough while typing this response to notice the irregularities…

I think there is a compiler bug (or at least a difference from the Arduino compiler). I used to be able to do something like this at the beginning of a function to ensure the function ran only once per second:

static uint16_t start = millis();
if (((uint16_t)millis()) - start < 1000){
    return;
}
start = millis();

I did this, and printed the subtraction of millis() - start (with the variable casting), and one of the results was negative!

Secs:58
1004
Secs:59
1005
Secs:60
-64530

It then went positive again – but for some reason this was causing my code to hang (still don’t know why…)

I changed my code from this:

#define RUN_ONLY_EVERY(time) \
    static uint16_t start = millis(); \
    if(((uint16_t)millis()) - start < time) \
        return; \
    start = millis();

to this:

#define RUN_ONLY_EVERY(time) \
    static uint16_t start = millis(); \
    if((uint16_t)(millis() - start) < (time)) \
        return; \
    start = millis();

And now my blink goes well over 60 seconds (it was always failing around 60 seconds before, which makes sense as the max uint16 is 65535, i.e. about 65.5 seconds’ worth of milliseconds)

the rest of my code:

#define TIME_DELAY 1000
#define LED_PIN 7

void setup(){
    Serial.begin(115200);
    pinMode(LED_PIN, OUTPUT);
}

void blink(){
    static uint16_t seconds = 0;
    static uint16_t last_run = millis();

    RUN_ONLY_EVERY(TIME_DELAY);
    // Cast after subtracting so the elapsed time wraps as a 16-bit value
    if((uint16_t)(millis() - last_run) > 1300){
        Serial.println("Delayed!");
    }
    Serial.println((uint16_t)(millis() - last_run));
    last_run = millis();
    digitalWrite(LED_PIN, !digitalRead(LED_PIN));  // toggle the LED
    Serial.print("Secs:");
    Serial.println(seconds);
    seconds ++;
}

void loop(){
    blink();
    //delay(TIME_DELAY);
}

This is a pretty odd gotcha, I wonder if it has to do with the 32 bit operating system? I guess subtracting two unsigned numbers can give you a negative in this compiler!

Edit
I am now seeing the problems you were having. Every once in a while the Spark Core will just sit there – it won’t blink, it won’t do anything. This is a pretty significant problem – hopefully it can get addressed soon!

The millis() function returns an unsigned long (32-bit unsigned integer), so your cast is throwing away the upper 16 bits. Try unsigned long or uint32_t as the type of the variables that hold millis() values.
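
The negative value in the log is consistent with C/C++ integer promotion rather than a compiler bug. Where int is 32 bits (as on the Core's ARM processor, and on most desktops), each uint16_t operand is promoted to int before the subtraction, so the result is a signed int. On an 8-bit Arduino, where int is 16 bits, the same expression stays unsigned and wraps. A small sketch, assuming a 32-bit int, that reproduces the -64530 figure; the function names are made up for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Original form: each uint16_t operand is promoted to int, so the
// difference is a signed int and goes negative once the low 16 bits wrap.
int elapsed_promoted(uint32_t now_ms, uint16_t start) {
    return static_cast<uint16_t>(now_ms) - start;
}

// Corrected form: subtract first, then truncate the result to 16 bits.
uint16_t elapsed_wrapped(uint32_t now_ms, uint16_t start) {
    return static_cast<uint16_t>(now_ms - start);
}

// Worked numbers: start was sampled at 65000 ms, now is 66006 ms,
// i.e. 1006 ms later. The low 16 bits of 66006 are 66006 - 65536 = 470.
void demo_wraparound() {
    const uint16_t start = 65000;
    const uint32_t now = 66006;
    assert(elapsed_promoted(now, start) == -64530);  // 470 - 65000
    assert(elapsed_wrapped(now, start) == 1006);     // true elapsed time
}
```

Casting the whole difference, as in the second RUN_ONLY_EVERY macro, works for any interval shorter than 65536 ms.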

I know I am throwing it away – who needs to keep track of more than 64 seconds???

Hello,

I ran a regular blink and the code had no problems for a good long while (over 1000 seconds). I then switched back to running my code, which uses the TCPClient. My code attempts to connect with the TCPClient every 5 seconds (it never connects because the server is currently down). The code runs great for 310 seconds, and then every 15 seconds it has a long delay. There are a couple of possibilities here:

  • It is delayed every 3rd attempt at connecting to the server
  • It fails every 15 seconds for another reason (I don’t think it is my code)

I am going to shorten the connection time to test the first theory and get back to you.