Electron freezes with solid green or cyan LED

@scruffr, on your BTW, I think there is a place for deploying 0.7.0, but the place for that is not on devices in the field. The issue I am trying to fix is for a few devices situated in a place a few hours from where I work. If it makes you feel better I have 0.7.0 on a test rig right now. So far so good, but I was burned badly with 0.6.0, so I won’t be taking chances with it.

When Serial.print lags, can Serial.flush be used to counteract that. For example, if I place a Serial.flush after Log.trace(“TACH3”), will the ensure TACH3 will be sent to Serial prior to moving to the next piece of code to execute?

I see, but I read your OP as if you seem to experience the issue in a new version of your code (not yet deployed)

Should be.

Update on this issue: Last night I ran two tests, one on a device with 0.6.1 and one on a device with 0.7.0, with this modified code snippet, where I am using Serial.flush to hopefully get a specific location of crash:

int dataPin = D3;
int status;
uint32_t timerStart;
uint32_t timerStop;
uint16_t rpm;

volatile unsigned long rpmCount;

void getRpm() {
  Log.trace("TACH0");
  rpm = 0;
  delayTimer = 3000;

  rpmCount = 0;
  attachInterrupt(dataPin, interrupt, RISING);

  timerStart = millis();

  Log.trace("TACH1");
  while(millis() - timerStart < delayTimer) {
    Particle.process();
  }
  Log.trace("TACH2");

  timerStop = millis();
  Log.trace("TACH3");  
  //Disable interrupt when calculating
  detachInterrupt(dataPin);
  Log.trace("TACHDET");
  Serial.flush();
  Log.trace("TACH4");
  Serial.flush();

  rpm = (uint16_t)(rpmCount / ((timerStop - timerStart) / 1000));
  Log.trace("TACH5");
  Serial.flush();
  
  rpmCount = 0;

  Log.trace("TACH6");
  Serial.flush();
  if(rpm >= 0 && rpm < 5000) {
    status = SENSOR_SUCCESS;
  } else {
    status = SENSOR_FAIL;
    rpm = 0;
  }
  
  return;
}

void interrupt() {
  rpmCount++;
  return;
}

Both 0.6.1 and 0.7.0 devices froze at the same place, right after they logged “TACH3” to serial. The 0.6.1 device froze after 4 hours 12 minutes of running and the 0.7.0 device froze after 7 hours 9 minutes.

It looks to me like they froze on detachInterrupt. Is this a shared system resource?

How can I guard against a freeze on detachInterrupt? Or at least eliminate it from possibilities?

I’m also logging free memory, and this doesn’t seem to be an issue. 0.6.1 is running with 93296 free memory and 0.7.1 is running with 83096 free memory, neither of which change over the run times of the device.

@hwestbrook, can you provide context/code as to where and how getRpm() is called? If you can post more of your code, especially your globlal (vars, object instantiations, timers, etc) and setup() code, it might be helpful.

Couple of notes: return; is not required in your ISR. Also, you may want to disable interrupts prior to detaching the interrupt if that interrupt can possibly occur during that call. You can enable interrupts immediately after the detach.

@peekay123, thanks for the suggestion – I will try disabling and then enabling interrupts using noInterrupts() and interrupts(). I’ve just got that test rig going, so now just wait 12 hours or so… :grimacing:

I know this is annoying from your perspective, but our app is just too complex to post here. The general place this function runs is:

  • 5 or so sensors are read and saved into variables (including RPM from this function and this IC: LM1815)
  • Depending on sensor output and preset settings, our device actuates farm equipment
  • Sensor data and actuation data are uploaded over CoAP to our own backend server
  • Our backend server responds with any new settings
  • Repeat

This all happens in the loop function which is not a FSM (which is something I need to move to when I have time). We have 1000’s of combined hours of testing in the application, so know it is pretty stable.

I will follow up with something I can post on here that will crash the Electron that is just the code above and maybe some cloud activity.

Here is that shorter example code:

#include "application.h"

SYSTEM_MODE(MANUAL);
SYSTEM_THREAD(ENABLED);
STARTUP(System.enableFeature(FEATURE_RETAINED_MEMORY));

SerialLogHandler logHandler(LOG_LEVEL_ALL);

#define SENSOR_FAIL 2
#define SENSOR_SUCCESS 3

int dataPin = D3;
int powerPin = C0;
int status;
uint32_t timerStart;
uint32_t timerStop;
uint16_t rpm;
uint32_t delayTimer = 3000;

volatile unsigned long rpmCount;

void getRpm();
void interrupt();

void setup() {
  Serial.begin(9600);
  pinMode(dataPin, INPUT);
  pinMode(powerPin, OUTPUT);
}

void loop() {
  getRpm();
  Log.info("RPM: %u", rpm);
  delay(1000);
}

void getRpm() {
  digitalWrite(powerPin, LOW);
  delay(50);

  Log.trace("TACH0");
  rpm = 0;

  rpmCount = 0;
  attachInterrupt(dataPin, interrupt, RISING);

  timerStart = millis();

  Log.trace("TACH1");
  while(millis() - timerStart < delayTimer) {
    Particle.process();
  }
  Log.trace("TACH2");

  timerStop = millis();
  Log.trace("TACH3");  
  //Disable interrupt when calculating
  detachInterrupt(dataPin);
  Log.trace("TACHDET");
  Serial.flush();
  Log.trace("TACH4");
  Serial.flush();

  rpm = (uint16_t)(rpmCount / ((timerStop - timerStart) / 1000));
  Log.trace("TACH5");
  Serial.flush();
  
  rpmCount = 0;

  Log.trace("TACH6");
  Serial.flush();
  if(rpm >= 0 && rpm < 5000) {
    status = SENSOR_SUCCESS;
  } else {
    status = SENSOR_FAIL;
    rpm = 0;
  }
  
  digitalWrite(powerPin, HIGH);
  return;
}

void interrupt() {
  rpmCount++;
}

This crashed my Electron after 29 minutes. The electron running this code is on 0.7.0. The RPM pulse signal is about 3700 pulses per second. A link to the log file is here: https://pastebin.com/DFVM65CL

In parallel I ran a version of the code turning interrupts on and off before and after detachInterrupt, but in my main application. That too crashed that Electron. I will try this on the simpler code above and report back.

Both of these happened quicker than I would have expected, but I don’t think this bug is related to time, so it might be more random.

One obvious item to remove would be Particle.process, although I’m not sure why this would cause this crash?

@hwestbrook, my big concern was how your call calling the function but it sounds like you are calling it from loop() so that’s fine. One thing to remember is that the Log capability was never designed to be synchronous with events, only to log the events. For farm equipment I would have to say that an FSM and exception handling are a must!

At 3700 RPM, the Electron will be interrupted every 270us so disabling interrupts is important IMO. Given that you are using system threading, you should try without Particle.process() calls. I’m not quite certain why you have so many Log.trace() calls in getRPM(). If you only have one when you enter and another when you exit, does the Electron crash? What if you remove all of them in getRPM()?

@peekay123, the only reason I have the logging calls in this code is in order to trace the bug back to its source – all of these Log.trace calls have been added since the bug showed up. My production firmware does not log anything to Serial (it logs to a 200 byte buffer, but does not log anything related to getRpm).

My first versions of this tracing was just on function enter and exit, so I think I’ve already covered that one. And since my main firmware does not log anything in this function, I think that I’ve already covered this as well.

For faster output to Serial, should I be calling Serial.print as opposed to Log.trace? Is there some risk that the program is crashing somewhere other than detachInterrupt?

Here is an updated version with noInterrupts and a delay before detachInterrupt. This is on my test rig now:

#include "application.h"

SYSTEM_MODE(MANUAL);
SYSTEM_THREAD(ENABLED);
STARTUP(System.enableFeature(FEATURE_RETAINED_MEMORY));

SerialLogHandler logHandler(LOG_LEVEL_ALL);

#define SENSOR_FAIL 2
#define SENSOR_SUCCESS 3

int dataPin = D3;
int powerPin = C0;
int status;
uint32_t timerStart;
uint32_t timerStop;
uint16_t rpm;
uint32_t delayTimer = 3000;

unsigned long rpmCountFrozen = 0;
volatile unsigned long rpmCount;

void getRpm();
void interrupt();

void setup() {
  Serial.begin(9600);
  pinMode(dataPin, INPUT);
  pinMode(powerPin, OUTPUT);
}

void loop() {
  getRpm();
  Log.info("RPM: %u", rpm);
  delay(1000);
}

void getRpm() {
  digitalWrite(powerPin, LOW);
  delay(50);

  Log.trace("TACH0");
  rpm = 0;

  rpmCount = 0;
  rpmCountFrozen = 0;
  attachInterrupt(dataPin, interrupt, RISING);

  timerStart = millis();

  Log.trace("TACH1");
  while(millis() - timerStart < delayTimer) {
    Particle.process();
  }
  Log.trace("TACH2");

  timerStop = millis();
  rpmCountFrozen = rpmCount;
  Log.trace("TACH3");  
  //Disable interrupt when calculating
  noInterrupts();
  delay(1);
  detachInterrupt(dataPin);
  interrupts();
  Log.trace("TACHDET");
  Serial.flush();
  Log.trace("TACH4");
  Serial.flush();

  rpm = (uint16_t)(rpmCountFrozen / ((timerStop - timerStart) / 1000));
  Log.trace("TACH5");
  Serial.flush();
  
  rpmCount = 0;

  Log.trace("TACH6");
  Serial.flush();
  if(rpm >= 0 && rpm < 5000) {
    status = SENSOR_SUCCESS;
  } else {
    status = SENSOR_FAIL;
    rpm = 0;
  }
  
  digitalWrite(powerPin, HIGH);
  return;
}

void interrupt() {
  rpmCount++;
}

Also, I agree on FSM and exception handling, but I don’t want to derail this discussion, as I believe the code above should operate for at least 24 hours. (and should indefinitely, but I don’t want to set my expectations too high!)

@hwestbrook, I said my bit on FSMs so let’s leave it at that :wink:

I would remove all logging for now and replace the calls with Serial.print() in getRPM(). I am assuming that you came to the conclusion that things crash in getRPM() by using the Log.trace() calls, correct? From what I see, you are not using any software timers or other asynchronous bits, correct? How are you read “5 or so” sensors? Do they each have an interrupt?

If you want to narrow down on single lines of code you can use the D7 led instead of trace prints.
You switch it on before the line you suspect and then off after the suspected line.

For multiple checkpoints you could use a PWM pin and check its frequency on an scope.

@peekay123, no, I didn’t come to the conclusion of Log.trace calls are causing crash. They’ve been added after the fact and I think they are only pointing to where I think the problem is. I’m not sure what the evidence for Log calls causing crash? I do not have these calls in my normal firmware.

The code below and above is the exact code I am running on my test devices, so there are not other sensors or anything else being read from. In my normal firmware, there are other sensors, like a ds18b20, but nothing that requires pulse counting like this.

Here is a revision with change to Serial.print’s:

#include "application.h"

SYSTEM_MODE(MANUAL);
SYSTEM_THREAD(ENABLED);
STARTUP(System.enableFeature(FEATURE_RETAINED_MEMORY));

#define SENSOR_FAIL 2
#define SENSOR_SUCCESS 3

int dataPin = D3;
int powerPin = C0;
int status;
uint32_t timerStart;
uint32_t timerStop;
uint16_t rpm;
uint32_t delayTimer = 3000;

unsigned long rpmCountFrozen = 0;
volatile unsigned long rpmCount;

void getRpm();
void interrupt();

void setup() {
  Serial.begin(9600);
  pinMode(dataPin, INPUT);
  pinMode(powerPin, OUTPUT);
}

void loop() {
  getRpm();
  Serial.print("RPM: ");
  Serial.print(rpm);
  Serial.print(", millis: ");
  Serial.println(millis());
  delay(1000);
}

void getRpm() {
  digitalWrite(powerPin, LOW);
  delay(50);

  Serial.println("TACH0");
  rpm = 0;

  rpmCount = 0;
  rpmCountFrozen = 0;
  attachInterrupt(dataPin, interrupt, RISING);

  timerStart = millis();

  Serial.println("TACH1");
  while(millis() - timerStart < delayTimer) {
    Particle.process();
  }
  Serial.println("TACH2");

  timerStop = millis();
  rpmCountFrozen = rpmCount;
  Serial.println("TACH3");  
  //Disable interrupt when calculating
  noInterrupts();
  delay(1);
  detachInterrupt(dataPin);
  interrupts();
  Serial.println("TACHDET");
  Serial.flush();
  Serial.println("TACH4");
  Serial.flush();

  rpm = (uint16_t)(rpmCountFrozen / ((timerStop - timerStart) / 1000));
  Serial.println("TACH5");
  Serial.flush();
  
  rpmCount = 0;

  Serial.println("TACH6");
  Serial.flush();
  if(rpm >= 0 && rpm < 5000) {
    status = SENSOR_SUCCESS;
  } else {
    status = SENSOR_FAIL;
    rpm = 0;
  }
  
  digitalWrite(powerPin, HIGH);
  return;
}

void interrupt() {
  rpmCount++;
}

@hwestbrook, the question was whether you had ascertained that it was getRPM() where you crash was occurring. I am assuming that your test units are crashing consistently with the code. For the sake of non-blocking, can you change loop() to this:

uint32_t getRPMdelay = 0;

void loop() {
  if (millis() - getRPMdelay >= 1000) {
    getRPMdelay = millis();
    getRpm();
    Serial.print("RPM: ");
    Serial.print(rpm);
    Serial.print(", millis: ");
    Serial.println(millis());
    }
}

I would also get rid of the delay(1); after the noInterrupts(); call. Since you have retained memory enabled, you could create a counter and set it to zero. Increment the counter whenever you want to “count” a step in your code. Reset the counter before calling getRPM. When you reboot, if the counter is non-zero in setup() then print the value. From this you will be able to figure out which step the code had completed before it crashed.

The DS18B20 uses OneWire which also makes heavy use of interrupts - including noInterrupts()/interrupts(), hence we were asking for more info about your surrounding code.
We do have a feeling what may interfere with your code, but only when we know about these apparently unlrealated but actually very relevant facts.

However, if you really see the same issue without any extras, there is something rather fishy.

Yes, I am pretty sure it is getRpm where the crash is happening. I slowly put more and more Log.trace statements into my full application code and every time it crashed it crashed right in the middle of getRpm. There was never a time where it crashed on any other function or sensor. If it was a crash from too many Logs or something, I wouldn’t expect it to always crash in the same spot at different times.

From the traces that it was at getRpm, I simplified code down to what is at post #12 and then post #17. I think I should be able to iterate on this code (removing serial prints, etc…) until I can get it to run for 24 hours without a crash.

Do you agree that I should be able to run at least a version of this code for 24 hours without a crash?

The next time I get a crash I will remove all serial prints in getRpm and go with your idea of a counter to find position.

Do you think its OK to have a non-blocking serial print in my loop, like in your example above? It seems like it should be possible to print out data to Serial indefinitely with an Electron. Is there some limit to how many times you can call Serial.print?

The code at post #17 crashed, so now I am trying the following:

#include "application.h"

SYSTEM_MODE(MANUAL);
SYSTEM_THREAD(ENABLED);
STARTUP(System.enableFeature(FEATURE_RETAINED_MEMORY));

#define SENSOR_FAIL 2
#define SENSOR_SUCCESS 3

int dataPin = D3;
int powerPin = C0;
int status;
uint32_t timerStart;
uint32_t timerStop;
uint16_t rpm;
uint32_t delayTimer = 3000;

unsigned long rpmCountFrozen = 0;
volatile unsigned long rpmCount;

retained uint8_t location;
uint32_t getRpmDelay = 0;

void getRpm();
void interrupt();

void setup() {
  Serial.begin(9600);
  while(!Serial.available()) {}
  Serial.print("Last location: ");
  Serial.println(location);
  pinMode(dataPin, INPUT);
  pinMode(powerPin, OUTPUT);
}

void loop() {
  if(millis() - getRpmDelay > 1000) {
    location = 0;
    getRpm();
    Serial.print("RPM: ");
    Serial.print(rpm);
    Serial.print(", location: ");
    Serial.print(location);
    Serial.print(", millis: ");
    Serial.println(millis());
    getRpmDelay = millis();
  }
}

void getRpm() {
  digitalWrite(powerPin, LOW);
  delay(50);

  location = 1;
  rpm = 0;
  rpmCount = 0;
  rpmCountFrozen = 0;
  location = 2;
  attachInterrupt(dataPin, interrupt, RISING);

  location = 3;
  timerStart = millis();

  location = 4;
  while(millis() - timerStart < delayTimer) {
    Particle.process();
  }

  location = 5;
  timerStop = millis();
  rpmCountFrozen = rpmCount;
  
  //Disable interrupt when calculating
  location = 6;
  noInterrupts();
  location = 7;
  detachInterrupt(dataPin);
  location = 8;
  interrupts();
  
  location = 9;
  rpm = (uint16_t)(rpmCountFrozen / ((timerStop - timerStart) / 1000));
  
  location = 10;
  rpmCount = 0;
  if(rpm >= 0 && rpm < 5000) {
    status = SENSOR_SUCCESS;
  } else {
    status = SENSOR_FAIL;
    rpm = 0;
  }

  location = 11;
  digitalWrite(powerPin, HIGH);
  location = 12;
  return;
}

void interrupt() {
  rpmCount++;
}

Both with and without the Particle.process call in the while loop.

This shouldn't be an issue with the current code, but may become a time bomb if you ever decide to reduce the delayTimer to less than 1000 as this might cause a DIV/0 exeption.

I'm currently running this code as a test with 0.6.4

//SYSTEM_MODE(MANUAL);
SYSTEM_THREAD(ENABLED);
STARTUP(System.enableFeature(FEATURE_RETAINED_MEMORY));

enum SENSOR_STATUS 
{ SENSOR_FAIL                   = 2
, SENSOR_SUCCESS                = 3
};

const uint32_t    msMeasuring   =  3000;
const uint32_t    msRefresh     =     0; // as rapid as possible 10000;
const int         pinData       = D3;
const int         pinPower      = C0;
const int         pinPWM        = D0;           // for isolated test bridge pinPWM to pinData

volatile uint32_t rpmCount;
uint32_t          rpm;
uint32_t          timerStart;
uint32_t          timerStop;
SENSOR_STATUS     status;

void getRpm();
void interrupt();

void setup() {
  Serial.begin();
  pinMode(pinData , INPUT);
  pinMode(pinPower, OUTPUT);
  pinMode(pinPWM  , OUTPUT);                    // for isolated test
}

void loop() {
  static uint32_t msLastRefresh = 0;
  static uint32_t pwmFrequency  = 0;
  if (millis() - msLastRefresh >= msRefresh) {
    getRpm();
    Serial.printlnf("RPM: %4lu, PWM: %4lu, millis: %10lu --------|---------|---", rpm, pwmFrequency, millis());
    Serial.flush();
    pwmFrequency = random(10, 5000);
    analogWrite(pinPWM, 1, pwmFrequency);       // set the test frequency for next turn
    msLastRefresh = millis();
  }
}

void getRpm() {
  pinResetFast(pinPower);
  delay(50);

  rpmCount   = 0;
  timerStart = millis();
  attachInterrupt(pinData, interrupt, RISING);
  while(millis() - timerStart < msMeasuring) Particle.process();  // in MANUAL mode no need
  timerStop = millis();
  noInterrupts();
  detachInterrupt(pinData);
  interrupts();

  rpm    = (1000 * rpmCount) / (timerStop - timerStart);
  status = (0 <= rpm && rpm < 5000) ? SENSOR_SUCCESS : SENSOR_FAIL;
  
  pinSetFast(pinPower);
}

void interrupt() {
  rpmCount++;
}

@ScruffR, yes, I agree that is a time bomb – not only for the reason you pointed out, but also because it limits my measurement intervals to whole seconds. Thanks for pointing that out.

@peekay123, the code in post #20 made it to location 8 and then crashed. Which I think means it crashed when interrupts() was called, re-enabling interrupts. Is this the right way to read this? Is there any other possible answer?

I can now see the same issue (but only at high trigger frequencies 30000+) and am now trying this

  attachInterrupt(pinData, interrupt, RISING);
  while(millis() - timerStart < msMeasuring) Particle.process();  // in MANUAL mode no need
  timerStop = millis();
  //noInterrupts();
ATOMIC_BLOCK() {
  detachInterrupt(pinData);
}
  //interrupts();

let’s see if that helps


Update:
That didn’t help.
Now testing 0.7.0


Update:
0.7.0 is no better in this regard.
How about not detaching the interrupt?
You depower the sensor, so no interrupts should happen anyway, right?

My code is failing when I attempt to start interrupts back up

location = 6;
noInterrupts();
location = 7;
detachInterrupt(dataPin);
location = 8;
interrupts();

location prints out to 8 when I start the system back up

With the interrupts() in place like in your code the same happens to me.
With ATOMIC_BLOCK() the freeze happens in there.
Without noInterrupts()/interrupts() or ATOMIC_BLOCK() it happens with detachInterrupt().

This is definetly a bug that needs reporting, but for the time being you may have to find a workaround, right?