Main Loop getting blocked when Cell Connection isn't made (or something like that)

I’m using several Electron’s with OS 1.2.1 and for some reason, the device bricks every few hours with the corresponding system logs:

0000087940 [system] INFO: ARM_WLAN_WD 2
0000087940 [system] INFO: CLR_WLAN_WD 1, DHCP success
0000087940 [system] INFO: Cloud: connecting
0000087943 [system] INFO: Read Server Address = type:1,domain:$id.udp.particle.io
0000087944 [system] ERROR: Failed to load session data from persistent storage

Visually, the LED is fast blinking cyan - I’ve tried @rickkas7 electron-clouddebug, but it seems like everything is fine.

I’m operating in the following mode:

SYSTEM_MODE(SEMI_AUTOMATIC);
SYSTEM_THREAD(ENABLED);

I use Particle.connect in the setup loop, which my understanding shouldn’t block the main loop. It most certainly does. After switching to OS 1.3.0-rc1, the connection issues seem to resolve themselves, but the main loop still will not start until after the particle connects.

Is there an easy way to ensure main loop continues to run with or without a connection - I have a fairly critical application and this is causing serious issues…

1 Like

Can you exclude heap fragmentation from the list of possible causes?
You can also try SYSTEM_MODE(MANUAL) - in conjunction with SYSTEM_THREAD(ENABLED) you should not need to change too much - if at all - in your code.

@ScruffR After doing some reading up on heap fragmentation - it seems likely this is the root cause as I do dynamically generate a String - I will re-factor this to use a char array and snprintf instead. Is there a way to verify/validate heap fragmentation is the cause?

AFAICT, not straight forward way.
When you encounter such an issue you could trigger a function that tries to get the max size of any available chunk on the heap.
One possible way to do that could look like in this StackOverflow answer

If you compare that to the size you'd get on a fresh start you could get a feeling if this may the the cause or not.

1 Like

So I haven’t quite tested for heap fragmentation yet, but in preparing to, I build a pretty tiny program and still experience the same issue (mostly). I am logging the available free memory and it hovers around 68k - there is nothing dynamic, so hopefully that ensures there are no heap issues.

The logs that are spit out include;

0000006569 [system] INFO: ARM_WLAN_WD 2
0000006569 [system] INFO: CLR_WLAN_WD 1, DHCP success
0000006569 [system] INFO: Cloud: connecting
0000006575 [system] INFO: Read Server Address = type:1,domain:$id.udp.particle.io
0000006577 [system] ERROR: Failed to load session data from persistent storage
0000017014 [system] INFO: Discarding session data
0000017015 [app] INFO: Free Memory = 69968
0000017027 [system] INFO: Resolved 2e0029000b47373336373936.udp.particle.io to 107.23.86.140
0000017099 [system] INFO: Cloud socket connected
0000017101 [system] INFO: Starting handshake: presense_announce=0

This is rapidly followed by:

0000022809 [system] ERROR: received from a different address 107.23.86.140:5684

It never connects to the cloud, and either is very quickly flashing cyan, entering into an SOS, or blinking green - typically some combination of the three.

Here is the entire program, minus additional libraries and files which i can post if needed:

#include "SdFat.h"
#include "SdLog.h"
#include "structs.h"
#include "utils.h"
#include "globals.h"
#include "PublishQueueAsyncRK.h"
#include "DeviceKeyHelperRK.h"
#include "CellularHelper.h"
#include "ConnectionEvents.h"
#include "ConnectionCheck.h"
#include "AppWatchdogWrapper.h"
#include "SessionCheck.h"
#include <vector>
#include <numeric>
#include <algorithm>

#define CS_SD D5

// Init functionality on startup
void startup() {
  System.enableFeature(FEATURE_RETAINED_MEMORY);
  System.enableFeature(FEATURE_RESET_INFO);
}

STARTUP(startup());
SYSTEM_MODE(MANUAL);
SYSTEM_THREAD(ENABLED);

// Init a serial log handler so debug information will be dumped to serial
SerialLogHandler logHandler(LOG_LEVEL_WARN, {
  {"app", LOG_LEVEL_INFO},
  {"app.network", LOG_LEVEL_WARN},
  {"comm.sparkprotocol", LOG_LEVEL_ERROR},
  {"comm.sparkprotocol.handshake", LOG_LEVEL_ERROR},
  {"comm.protocol", LOG_LEVEL_WARN},
  {"comm.coap", LOG_LEVEL_WARN},
  {"comm.dtls", LOG_LEVEL_WARN},
  {"system", LOG_LEVEL_INFO},
  {"app.pubq", LOG_LEVEL_ERROR},
});

SdCardLogHandler logHandler1(LOG_LEVEL_WARN, {
  {"app", LOG_LEVEL_WARN},
  {"app.network", LOG_LEVEL_WARN},
  {"comm.sparkprotocol", LOG_LEVEL_ERROR},
  {"comm.sparkprotocol.handshake", LOG_LEVEL_ERROR},
  {"comm.protocol", LOG_LEVEL_WARN},
  {"comm.coap", LOG_LEVEL_WARN},
  {"comm.dtls", LOG_LEVEL_WARN},
  {"system", LOG_LEVEL_WARN},
  {"app.pubq", LOG_LEVEL_ERROR},
});
 
// declare the options for the state of the device
enum State {
  STARTUP,             // start i2c/SPI devices
  IDLE,                // wait between cycles
  PROCESS_STRING,      // take a measurement
  PUSH_DATA_AND_CLEAR, // push data to the server
  PRINT_DATA,          // print the data to serial
  GET_DEVICE_LOCATION,  // get the device's GPS location
};

//
ConnectionEvents connectionEvents("connEventStats");

// Connection Check
ConnectionCheck connectionCheck;

// DEBUG START
AppWatchdogWrapper watchdog(60000);
SessionCheck sessionCheck(3600*3);

// Create a buffer to hold queued data
retained uint8_t publishQueueRetainedBuffer[2048];
PublishQueueAsync publishQueue(publishQueueRetainedBuffer, sizeof(publishQueueRetainedBuffer));

// // Initialize Variables
String bufferStr, goodStr;
String LOGSTRING, CLOUDSTRING;

FirmwareSettings FW;

// Set up the SD card
SdFat sd(&SPI1);
File LFILE;

// Set up the device key helper to store keys on the SD card
DeviceKeyHelperSdFat deviceKeyHelper("keys");

long TIME_START;
unsigned long MILLIS_START, MILLIS_LAST, MILLIS_NOW;
unsigned long LED_MILLIS;
int FLAG;
String LOGTIME, FILENAME, RECORD, TMP_LOGTIME;
String strGas, strPM, metaResp;
double PRESSURE, BOX_TEMP, MANIFOLD_TEMP, MANIFOLD_RH, DEW_POINT,  NOISE, SOLAR, WD, WS;
double TMP_BOX_TEMP, TMP_PRESSURE, TMP_MANIFOLD_TEMP, TMP_MANIFOLD_RH, TMP_NOISE, TMP_SOLAR, TMP_WD, TMP_WS;
double TMP_DEW_POINT;
double CO_WE, CO_AE, NO2_WE, NO2_AE, OX_WE, OX_AE, NO_WE, NO_AE;
double TMP_CO_AE, TMP_CO_WE, TMP_NO_WE, TMP_NO_AE, TMP_OX_WE, TMP_OX_AE, TMP_NO2_WE, TMP_NO2_AE;
double SAMPLE_TIME, FLOWRATE, CO2, VOC, TMP_CO2, TMP_VOC;
double TMP_FLOWRATE, TMP_SAMPLE_TIME;
double BIN_DIAM_0, BIN_DIAM_1, BIN_DIAM_2, BIN_DIAM_3, BIN_DIAM_4, BIN_DIAM_5, BIN_DIAM_6, BIN_DIAM_7;
double BIN_DIAM_8, BIN_DIAM_9, BIN_DIAM_10, BIN_DIAM_11, BIN_DIAM_12, BIN_DIAM_13, BIN_DIAM_14, BIN_DIAM_15;
double BIN_DIAM_16, BIN_DIAM_17;
double BIN_COUNTS_0, BIN_COUNTS_1, BIN_COUNTS_2, BIN_COUNTS_3, BIN_COUNTS_4, BIN_COUNTS_5, BIN_COUNTS_6;
double BIN_COUNTS_7, BIN_COUNTS_8, BIN_COUNTS_9, BIN_COUNTS_10, BIN_COUNTS_11, BIN_COUNTS_12, BIN_COUNTS_13;
double BIN_COUNTS_14, BIN_COUNTS_15, BIN_COUNTS_16, BIN_COUNTS_17;
double TMP_BIN_DIAM_0, TMP_BIN_DIAM_1, TMP_BIN_DIAM_2, TMP_BIN_DIAM_3, TMP_BIN_DIAM_4, TMP_BIN_DIAM_5;
double TMP_BIN_COUNTS_0, TMP_BIN_COUNTS_1, TMP_BIN_COUNTS_2, TMP_BIN_COUNTS_3, TMP_BIN_COUNTS_4, TMP_BIN_COUNTS_5;
double INT_VAL_0, INT_VAL_1, INT_VAL_2, INT_VAL_3, INT_VAL_4, INT_VAL_5, INT_VAL_6, INT_VAL_7, INT_VAL_8, INT_VAL_9;
double INT_VAL_10, INT_VAL_11, INT_VAL_12, INT_VAL_13, INT_VAL_14, INT_VAL_15, INT_VAL_16, INT_VAL_17;
double TMP_INT_VAL_0, TMP_INT_VAL_1, TMP_INT_VAL_2, TMP_INT_VAL_3, TMP_INT_VAL_4, TMP_INT_VAL_5;
int BIN_SET_ON, BIN_SET;
int RECORD_NUMBER, OPC_FLAG;
int LAST_RECORD_NUM = -1;
int TOTAL_ITERS;
int COUNT_SET1, COUNT_SET2, COUNT_SET3; // the number of times a given OPC bin iteration has been seen
int COUNT_CO2;
int BLAST_NUMBER = 5;
int FLAG_LOCAL = 0;
int CYCLE_S;
bool DEBUG;
int BUF_NUM;
int LAST_HOUR = 99;
float LAT, LON;
int ALT, UNCERTAINTY;

// CellularHelperLocationResponse locResp;

std::vector<double> VEC_COWE;
std::vector<double> VEC_COAE;
std::vector<double> VEC_NOWE;
std::vector<double> VEC_NOAE;
std::vector<double> VEC_OXWE;
std::vector<double> VEC_OXAE;
std::vector<double> VEC_NO2WE;
std::vector<double> VEC_NO2AE;
std::vector<double> VEC_CO2;

// create a vector of strings
std::vector<String> BUF_STRINGS;


const size_t bufferLength = 512;
char buf[bufferLength];
size_t bufferOffset = 0;
const char CR = 13; // carraige return
const char LF = 10; // Line Feed
int idx;

// set the initial state to the startup state
State STATE = STARTUP;

// Write a serial event that reads and dumps the data from the Zilog Chip
// This runs in between every loop iteration
void serialEvent1(){
  while (Serial1.available()){
    char c;
    
    if (bufferOffset < bufferLength){
      c = Serial1.read();

      // add chars to the buffer as long as it't not a CR or LF
      if (!(c == CR) | (c == LF)){
        buf[bufferOffset++] = c;
      } else {
        buf[bufferOffset] = 0;
        bufferOffset = 0; 

        // at this point, we have a final string!
        bufferStr = String(buf).trim();

        Log.trace(bufferStr);

        if (bufferStr.charAt(0) == 0x02){
          bufferStr.remove(0, 1);
        }
        // if the first 3 chars are GET, we have good dataz!
        if (bufferStr.startsWith("GET")){
          goodStr = bufferStr;
          STATE = PROCESS_STRING;
        } else {
          bufferStr = "";
          STATE = STATE;
        }
      }
    } else {
      // reset the buffer position
      bufferOffset = 0;
    }

    // Keep the process alive
    Particle.process();
  }
}

// initialize a function to for quantaq's smart average function
double quantaq_smartavg(std::vector<double> vect);


void setup() {
  // Serial is used to output data over the normal micro-usb connector
  Serial.begin(9600);

  // Setup serial1 to read from the Zilog chip
  Serial1.begin(57600);

  pinMode(D7, OUTPUT);

  // Set up the particle variables and functions
  Particle.variable("OPC_FLAG", OPC_FLAG);
  Particle.variable("CYCLE_S", CYCLE_S);
  Particle.variable("DEBUG", DEBUG);
  Particle.variable("BLAST_NUM", BLAST_NUMBER);
  Particle.variable("BUF_NUM", BUF_NUM);
  Particle.function("changeCycle", changeCycleTime);
  Particle.function("changeState", changeState);
  Particle.function("changeThresh", changeThreshold);
  Particle.function("changeWarmup", changeWarmup);
  Particle.function("changeBlast", changeBlast);

  // Setup the publishQueue
  publishQueue.setup();

  // turn off the red charging status LED
  setCharging(false);

  // get the time set up
  Time.setFormat(TIME_FORMAT_ISO8601_FULL); // permanently make the time ISO formatted

  // let's give it a rest...
  delay(1000);

  if (!sd.begin(CS_SD))
  {
    Log.trace("Could not open SD card.");
  }
  else
  {
    Log.trace("Opened SD card.");

    // get the filename for today
    FILENAME = currentFile();

    // monitor the backup keys
    deviceKeyHelper.startMonitor();

    // if the file doesn't exist, make a new file
    if (!sd.exists(FILENAME))
    {
      makeNewFile(FILENAME);
    }
  }

  // connect to the particle.io server (NOTE: this is blocking!)
  Particle.connect();

  // start the connection events
  connectionEvents.setup();
  connectionCheck.setup();
  sessionCheck.setup();
}

void loop()
{
  switch (STATE)
  {
    case STARTUP:
    {
      EEPROM.get(_EEPROM_ADDR_, FW);

      if (FW.VERSION != 6)
      {
        FirmwareSettings tmpSettings;
        FW = tmpSettings;

        EEPROM.put(_EEPROM_ADDR_, tmpSettings);
      }

      // Read the settings
      EEPROM.get(_EEPROM_ADDR_, FW);

      DEBUG = FW.DEBUG;
      CYCLE_S = FW.CYCLE_S;

      // set some times
      TIME_START = Time.now();
      MILLIS_START = millis();
      MILLIS_LAST = millis();

      STATE = IDLE;
      break;
    }
    case IDLE:
    {
      MILLIS_NOW = millis();
      if ((int)(MILLIS_NOW - MILLIS_LAST) >= CYCLE_S * 1000)
      {
        MILLIS_LAST = MILLIS_NOW;
        STATE = PUSH_DATA_AND_CLEAR;
      }
      else if (Time.hour() != LAST_HOUR)
      {
        LAST_HOUR = Time.hour();
        STATE = GET_DEVICE_LOCATION;
      }

      if ((int)(MILLIS_NOW - LED_MILLIS) >= 250)
      {
        digitalWrite(D7, !digitalRead(D7));
        LED_MILLIS = MILLIS_NOW;

        Log.info("Free Memory = " + String(System.freeMemory()));
      }
      break;
    }
    case PROCESS_STRING:
    {
      STATE = IDLE;
      break;
    }
    case PUSH_DATA_AND_CLEAR:
    {
      STATE = IDLE;
      break;
    }
    case GET_DEVICE_LOCATION:
    {
      STATE = IDLE;
      break;
    }

  } // end switch()

  // Keep debugging processes running
  // connectionEvents.loop();
  // connectionCheck.loop();
  // sessionCheck.loop();
} // end loop()

Last comment - I just downgraded the OS to 1.1.1 and everything works perfectly! Seems like there might be something in the newer OS that causes the issue…

When stating that always tell how many blinks follow the SOS code too. SOS without that number is near useless info.

BTW, reading code like this always gives me the shivers - I'd not even start reading code that starts off like that - sorry :blush:

Please (:pray:) use arrays!!!

Also, it's commonly not helpful for quickly browsing code (for seasoned coders) to see all these capital letters in variable names - all-capital names are usually "reserved" for constants and #define macro names.
While this is not a "law" it makes reading code much easier - also using camel case rather than underscores (_) allows for shorter better readable names (unless company coding policy requires it otherwise).

Just another note to this

There is no need for that since \r (carriage Return) and \n (New line) already exist for that.
So you'd usually read this instead

if (!(c == '\r') || (c == '\n'))

(also note that I used the logical OR while you used the binary. You should make it a habit to use the logical version in conditional expressions as the binary will sometimes render a result you wouldn't expect)
BTW, do you really mean to apply the NOT (!) only to the first equality check?
If so, the second part of the OR operation is superfluous as that is already satisfied by not being the other :wink:

1 Like

Yea, thanks for the style suggestions - I don’t write in C++ very often and should grab a style guide and read it - do you suggest one?

Also, it seems I was a bit premature stating that v1.1.1 fixed everything - it fixed it for ~1hr and then the same issues creeped up. I’ll re-write this to be a bit more readable and re-post.

Particle.connect itself is never blocking (it simply sets a flag in Device-OS) but it will trigger the system thread to begin a connect attempt. This won’t directly block your application thread and main loop but certain API calls that depend on the system thread (ex registering a Particle variable) if done after Particle.connect will block on the system thread and delay execution of your main loop until it connects.

So… Are you sure it is Particle.connect that is blocking? I expect if you narrowed in on it a little more carefully you would find some of your follow-on setup functions are calling into a potentially blocking API and getting stuck until the device connects.

1 Like

@joel Thanks for the tip - I’m checking now and don’t see anything obvious. What is the best way to get the most verbose, but useful, logs that could reveal this?

With Workbench you could try and debug to see where it gets stuck. Otherwise you’re stuck with good ol’ print debugging. Sprinkle some Log.info around and see how far it gets to narrow in on the offending line. Assuming my hunch is right…

Alright, so it seems like it wasn’t necessarily a software/OS issue, but a hardware issue that manifested itself as a software issue? There are still some behaviors I don’t quite understand, but here is a rundown:

  • it seems like much of this stems from a lack of sufficient power to both the SD card and the Electron on startup
  • We are powering the Electron via VIN with a 470 uF cap as mentioned in the docs to deal with current transients
  • We also were logging cell information from the Log to the SD card (via SdLogger), which is fairly often on startup - by removing this, things began to work much better
  • strangely, when the SD logging was still enabled, the Electron would fire up okay, but would go straight to breathing green, which I understand as being connected to the internet, but not the Particle cloud - however, while in this state, I was easily able to interact with the device via the console and events were still sent as expected. I’m not sure if it was an LED error? Or if I understand the state of Electron operation incorrectly.
  • the overall behavior I saw actually follows what at least one other has noticed, as described here.

Last, @ScruffR - I did go ahead and re-write the whole thing last night to follow best (c++) practices as much as I could, even if some of the standards (i.e. camelCase for variables) is hard to look at as a python dev. Thanks for the recs though.

2 Likes