Deep update leads to SOS Hard Fault on previously working code

Hi, same over here.

We had working firmware, but after the Deep update we can no longer call the Spark.function() remotely via the SparkCloud. The 'automated' Over The Air flash using the Spark CLI also no longer works, resulting in:

flash core got error: {"code":"ECONNRESET"}

First we thought our SparkCore firmware was maybe too old (roughly one month).
Today we have cloned the latest master branch and the problem is still there.

Below is the part for the Spark.function(); it is nothing fancy, just a test to see if it is working.

/* Function prototypes -------------------------------------------------------*/
int configData(String args);

void setup() {
    //Register the Duikel functions
    Spark.function("configData", configData);
}

void loop() {
}

int32_t getValue(String args) {
    String value = args.substring(args.indexOf(":")+1);
    return (int32_t)value.toInt();
}

int configData(String args) {
    return getValue(args);
}

BTW:
We also use Spark.publish(), which still works after the Deep update.
Our SparkCores without the Deep update still work fine with the latest Spark firmware.

To us it seems that something is wrong in the receiving part of the SparkCore after the Deep update, because the sending side still works (Spark.publish()).

Thanks,
Henk

Hey Guys,

Thanks for the heads up, I’ll ping @zachary and @satishgn and see if we can’t look into this today. We haven’t updated the compile-server2 branch in a while, so nothing should be different about code compiled on the Build IDE.

Any chance anyone can send us code examples that are crashing consistently with the new firmware that didn’t before, and if you’re building locally, which branches / commits you’re at?

Thanks,
David

Just a quick counter-point: I tested all my tutorials involving publishing and functions and variables after deep update and everything was fine.

In the code above, do you really call Spark.function() outside of setup()?

Substring can be a memory-intensive operation, since it dynamically allocates a new String every time.
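
If that ever becomes a problem, here is a minimal sketch of parsing the value in place without building a substring (my own illustration, using only indexOf/charAt/length):

    // Parse the integer after the first ':' without allocating a new String.
    int getValueInPlace(const String &args) {
        int sep = args.indexOf(":");
        if (sep < 0)
            return -1;                      // no separator found
        int value = 0;
        for (unsigned int i = sep + 1; i < args.length(); i++) {
            char c = args.charAt(i);
            if (c < '0' || c > '9')
                break;                      // stop at the first non-digit
            value = value * 10 + (c - '0');
        }
        return value;
    }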

I did make a change very recently that caused the cloud to ask the core what functions and variables were exposed a few seconds after the handshake. This made the OTA flash more stable for deep update on older firmwares, but if you are waiting 5+ seconds to register your functions and variables, they may not be caught in time for the request. I’ll work on a fix for this.

Thanks!
David

Also @mrOmatic (and @nika8991 if you’re building locally), please make sure you’ve pulled the latest master in all 3 repos (core-communication-lib: aef1c7f3, core-common-lib: 160e2dfc, core-firmware: 9b01d795) and also do a clean build with make clean all.

Like @bko, I tested lots of different firmware variations and never saw this issue, so I’m quite surprised. If you still see the problem after a clean build on those commits, please do share your code with us. If you need to keep it secret, share a secret gist with me (towynlin) on GitHub, but of course more people than just me can help if you post it publicly. :wink:

Looking forward to getting you past that hard fault ASAP!


@bko, the Spark.function() registration is in setup(). The functions configData() and getValue() are defined outside of setup() and loop().

I don't see any strange behaviour of the RGB LED, so I don't think a memory problem is causing this issue. The build output also looks fine:

arm-none-eabi-size --format=berkeley core-firmware.elf
   text       data        bss        dec        hex    filename
  84356       1264      12624      98244      17fc4    core-firmware.elf

Thanks,
Henk

@Dave, the Spark.function() registration is done in setup(), so it happens immediately after the SparkCore starts.

BTW
I just realized that we start without the WiFi and the SparkCloud connection because we want to save as much power as possible. Only at the end of the whole procedure, when we have all the data, do we connect to the WiFi and the Cloud, send over the data, and shut down again. While writing this I think we have to move the Spark.function() registration to the point where we get the connection to the SparkCloud.
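
A simplified sketch of our flow (details trimmed; configData() is the handler posted above), with the registration moved to right after the connection comes up:

    #include "application.h"
    #include "spark_disable_wlan.h"   // boot with WiFi off to save power
    #include "spark_disable_cloud.h"  // boot with the cloud connection off

    int configData(String args);

    void setup() {
        // Radios stay off here; we only collect data.
    }

    void loop() {
        // ... gather measurements with WiFi off ...

        // Once everything is collected, bring the connection up briefly.
        WiFi.on();
        while (WiFi.status() != WIFI_ON)
            SPARK_WLAN_Loop();            // run background task until WiFi is up
        Spark.connect();
        while (!Spark.connected())
            SPARK_WLAN_Loop();

        // Move the registration here, right after connecting:
        Spark.function("configData", configData);

        // ... Spark.publish() the data, then shut everything down again ...
        Spark.disconnect();
        WiFi.off();
    }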

Thanks again,
Henk


I have also experienced similar problems. I have used two ‘black’ cores, of which one has had the deep update applied.
Both have been loaded with previously known working firmware. I have recompiled and downloaded through the cloud and also through the local compiler.
The core without the deep update runs as expected and responds over TCP/IP as well as to cloud variables. The one with the deep update only accepts cloud variables, with nothing happening over TCP/IP.
Both devices have been factory reset!

Hi @stevespark

I think the Spark team would be very interested in your code that does not do TCP after a deep update. Can you share the code?

There was a change at the same time as deep update that currently forces you to declare Spark variables and functions early in setup(). Doing it after a delay was not working for some folks. You seem to have the opposite problem: the Spark protocol stuff works but TCP does not. Another side effect of deep update is that TI removed the autonomous ping answer code, so if your TCP connection depends on being able to ping the core for some reason, it will no longer work. Would any of this apply to your code?

Pinging @Dave and @zachary


I moved the function registration to the point where we have an established connection to the SparkCloud, but the SparkCore still does not respond as it should.
This is the response we get when we have loaded the Deep update:

Successfull duikelConfig
{ ok: false, error: 'Timed out.' }

This is the response we get from a SparkCore without the Deep update; the return value is the value we expected:

Successfull duikelConfig
{ id: '50ff6d065067545608240587',
  name: 'chip-0005',
  last_app: null,
  connected: true,
  return_value: 200 }

Thanks,
Henk

Edit:
I have added the server part that calls the Spark.function()

var request = require("request");   // node 'request' module

function duikelConfigFunction(arg, sessionCoreId, funcParam, accessToken, callback) {
    var functionName = "configData";
    request({
        uri: "https://api.spark.io/v1/devices/" + sessionCoreId + "/" + functionName,
        method: "POST",
        form: {
            arg: funcParam,
            access_token: accessToken
        },
        json: true
    },
    function (error, response, body) {
        if (error) {
            console.log("Error duikelConfig");
            console.log(error);
        }
        else {
            console.log("Successfull duikelConfig");
            //console.log(response);
            console.log(body);
            callback(arg*2);
        }
    });
}

So I’ve solved my Hard Fault problems.

I had some MQTT publish messages being sent on every pass through loop(). Simply not pushing the connection so fast seems to restore stability. The reason it was sending so fast was an earlier design decision, and I didn’t really need it like that anymore, so it seems I’m all good.
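
For anyone hitting the same thing, the change amounted to rate-limiting the publishes (a sketch; the interval, topic, and mqttPublish stand-in are made up, not my actual code):

    #include "application.h"

    // Stand-in for my real MQTT publish call (hypothetical; the real client
    // talks to a local broker).
    void mqttPublish(const char* topic, const char* payload) {
        // ... real code hands this to the MQTT client ...
    }

    static const unsigned long PUBLISH_INTERVAL_MS = 1000;  // made-up interval
    static unsigned long lastPublish = 0;

    void setup() {
    }

    void loop() {
        // Publish at most once per interval instead of on every pass
        // through loop(), which was what overwhelmed the connection.
        if (millis() - lastPublish >= PUBLISH_INTERVAL_MS) {
            lastPublish = millis();
            mqttPublish("sensors/reading", "42");
        }
    }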

I did try to write a test case to reproduce the problem, but it seems that, while I hadn’t hit it before, it is possible to crash a non-patched core in a similar way. Given that my code needs a local MQTT broker, it’s probably not the most useful for testing the problem anyway.

I’m guessing the problem has to do with traffic over time, changes to the TCP buffers, or something like that.


Hi @nika8991,

It sounds like your firmware is managing the connection very closely? Are you giving the core enough time to acknowledge the function call w/o blocking? Any chance you can share more code?

Thanks,
David


Hey All,

We’re also talking about this here - https://community.spark.io/t/local-cloud-on-raspberry-pi/5708/52

I have a theory I want to test out. :slight_smile:

Thanks,
David

Here is the version of the code I had working before the deep update:

    //MONITOR-REV5.0

// This #include statement was automatically added by the Spark IDE.
#include "flashee-eeprom/flashee-eeprom.h"
using namespace Flashee;

#include "application.h"
#include "spark_disable_wlan.h"
#include "spark_disable_cloud.h"

#define tempPin A0
#define batPin A1
#define solarmAPin A2
#define loadmAPin A3
#define latchAPin D0        
#define latchBPin D1
#define batAorBPin D2

static uint16_t tableindex;
static short mVA = 0;
static short mVB = 0;
static short loadmA = 0;
static short solarmA = 0;

int Latch(String command);
int CheckTime();
void GetData();
void Table();
void ReadBat();
void TCPClientCall();
void Coreinfo();

static float zoneoffset = 2;                   // Sets time offset to UTC
static int sleepflag = 1;
int year;
int month; 
int date;
int hours; 
int mins;
int secs;


unsigned long  datetimestamp;         // unsigned 32 bit number in format YYMMDDhhmm so limited to year 2042 without increasing bytes


static const int MINUTE_MILLIS = 60 * 1000;       // 1 minute milliseconds count
static const int HOUR_MILLIS = 60*60*1000;    // 1 hour milliseconds count
static const int DATA_BLOCK_SIZE = 60;
static const int MAX_DATA_BLOCK = 2000;


float tempC;
static const float tempoffset = 0;                  // Offset correction for temp sensor location and intrinsic error
static const short mVoffset = 60;                    // Correction for voltage loss between battery terminal and pcb regulator
static const float mAmVfactor =0.62;                   //Correction for load mA
static const float Vfactor = 1.61;                    // factor to account for reading through potential divider 
static const float mAfactor = 2;                //load mamp current proportional to mvolts across load resistor  0.5 ohms
static const float loagain = 1.2;       // load op amp correction for current multiplier
static const float loaoffset = -0.088;
static const float soagain = 4.412;          //solar op amp correction for current multiplier  
static const float soabiasfactor =0.18;     //diff amplifier bias to reduce supply voltage measure
static const  float soaoffset = -9;          // mA offset at night or with panel disconnected 


//char table_data[MAX_DATA_BLOCK*DATA_BLOCK_SIZE];
char NVdata[DATA_BLOCK_SIZE];


char Coredata[52] = "awaiting MAC, SN, Build, Version";


char Battdata[DATA_BLOCK_SIZE] = {"Wait a minute!"};

TCPServer server(8002);
FlashDevice* flash;


void setup() 
{
    WiFi.on();
    while(WiFi.status() != WIFI_ON)
        SPARK_WLAN_Loop();    // run background task until WiFi is up
   
    Spark.connect();
    delay(1000);
   
    
   // flash = Devices::createAddressErase();      // Set up extenal EEPROM for data logging etc.
    flash = Devices::createWearLevelErase();
        
    uint8_t valid = EEPROM.read(20);
    if(valid == 82)                        // ASCII for 'R' you can read a valid table address
        tableindex = EEPROM.read(21) + (256*EEPROM.read(22));
    else
    {
        tableindex = 0;
        EEPROM.write(21,0); 
        EEPROM.write(22,0);
        EEPROM.write(20,82);                     //table index now ok to read
    }
 //   valid = EEPROM.read(3);
 //   if(valid == 82)                        // ASCII for 'R' you can read valid calibration data
 //   {
 //       tempoffset = EEPROM.read(4);
 //       tempoffset += EEPROM.read(5)/100;
//    }    
 //   else
  //  {
  //      tempoffset = 0.0;
  //      EEPROM.write(4,0);
  //      EEPROM.write(5,0);
  //      EEPROM.write(3,82);                 //Calibration OK to read
  //  }

            
    delay(200);
  //  for(int i = 0; i <MAX_DATA_BLOCK; i++)
 //   {
 //      flash->writeString(Battdata,i*DATA_BLOCK_SIZE);
//       delay(2);
//    }


   
    Time.zone(zoneoffset);                                   // Set timezone for Core location
    Spark.function("Latch", Latch );            // 100ms square wave to latch switch
 
    
    Spark.variable("Battery",Battdata, STRING);      // Returns Centigrade from TMP36 device

    server.begin();
    Serial.begin(9600);
    delay(100);

    pinMode(tempPin, INPUT);        //analog input tied to 3.3v reference 12 bit adc ( giving approx 0.8mV per unit)
    pinMode(batPin, INPUT);        //analog input tied to 3.3v reference 12 bit adc ( giving approx 0.8mV per unit)
    pinMode(loadmAPin, INPUT);        //analog input tied to 3.3v reference 12 bit adc ( giving approx 0.8mV per unit)
    pinMode(solarmAPin, INPUT);   //analog input tied to 3.3v reference 12 bit adc ( giving approx 0.8mV per unit)
    pinMode(batAorBPin, INPUT);       // digital input true if above 3v threshold
    pinMode(latchAPin, OUTPUT);     // digital output for pulse on latch reset
    pinMode(latchBPin, OUTPUT);     // digital output for pulse on latch set

    
    Spark.syncTime();
    delay(1000);
    year = Time.year();
    month = Time.month();
    date = Time.day();
    hours = Time.hour();
    mins = Time.minute();
    secs = Time.second();
    Coreinfo();   // Called once to get local connection data

}


void loop() 
{
 
 //int x = 0;
 
// if(IWDG_SYSTEM_RESET == 1)
 //   x = 1;
    

static unsigned long lastdata = millis();

  
    if(CheckTime() == 1)  // Update time and check to see if new hour
    {
        GetData();
        Table(); 
    }
 
    if (millis() - lastdata > 2*MINUTE_MILLIS)  // every two minutes
    {
        if (!Spark.connected())
            Spark.connect();
        else
            Spark.disconnect();
        GetData();           // Read analog and digitalinputs 
        lastdata = millis();
    } 
    TCPClientCall();
   
}


// Checks to see if it is a new hour and 

int CheckTime()  
{

hours = Time.hour();
mins = Time.minute();
secs = Time.second();
static int lasthour;

    if(mins ==  0)  // ensure once per hour
    {
        if((hours == 0 || hours ==12) && sleepflag == 1)  // ensure once at midnight and midday
        {
            Spark.syncTime();  
            year = Time.year();
            month = Time.month();
            date = Time.day();
            //Spark.disconnect();
            WiFi.off();
            sleepflag = 0;
        }
        else if (hours == 8 || hours == 20)            // wake once at 8am and 8pm
        {
            WiFi.on();        // Wake up!!
            while(WiFi.status() != WIFI_ON)
                SPARK_WLAN_Loop();    // run background task until WiFi is up
            Spark.connect();
            delay(1000);
            sleepflag = 1;
        }
        if(lasthour != hours )
        {
            lasthour = hours;
            return 1;
        }
        else
            return -1;
    }
    else return -1;
}



// This function to read data from inputs
void GetData()
{
float tempread; 
float tempvolts;
static short solarmApoint[5];
static int i;
short tempmA = 0;
int j;


    tempread = analogRead(tempPin);
    tempvolts = tempread* 3.3;                 // Reference voltage supplied by Spark Core
    tempvolts /= 4096;                          // Approx 0.8mV per integer unit
    tempC = (tempvolts - 0.5)*100;              // TMP36 gives 10mV per C with 500mV offset at 0C
    tempC += tempoffset;                        // temperature correction for sensor location and intrinsic error
 
    delay(300);                                 // Delay to compensate for ADC sampling rate
    
  
    tempread = analogRead(loadmAPin);             
    tempvolts = tempread*3.3;                       // Reference voltage supplied by Spark Core
    tempvolts /= 4096;                              // Approx 0.8mV per integer unit
    tempvolts = tempvolts*loagain;    // Converting mvolts to amps through resistor
    tempvolts += loaoffset;
    tempvolts *= mAfactor;
    loadmA = (1000 * tempvolts) + 0.5;           //Conversion and rounding of positive  amps to milliamps     
    
    delay(300);                                     //Delay to compensate for ADC sampling rate
    
    tempread = analogRead(solarmAPin);             
    tempvolts = tempread*3.3;                       // Reference voltage supplied by Spark Core
    tempvolts /= 4096;                              // Approx 0.8mV per integer unit
    if (mVB == 0)
        tempvolts -= (mVA*soabiasfactor/1000);        // Allowance for reference offset due to 17k resistor etc.
    else
        tempvolts -= ((mVA+mVB)*soabiasfactor/1000);      // Battery B connected, then reference voltage doubles to about 13v
    tempvolts = - tempvolts;
    tempvolts = tempvolts*soagain;     // converting mvolts to mamps
    tempvolts *= mAfactor;
    tempmA = (1000 * tempvolts) + soaoffset+0.5;          //conversion and rounding of positive  amps to milliamps  and offset correction
    if (tempmA < 0)
        tempmA = 0;
    
    if(i>=0 && i <4)
        i++;
    else
        i = 0;
    solarmApoint[i] = tempmA;
    tempmA = 0;                             // initialise to zero before calculating moving  average
    for(j = 0; j <=4; j++)                  // moving average of 5 points
        tempmA += solarmApoint[j]; 
    solarmA = tempmA/5;
    
    delay(300);                                 //Delay to compensate for ADC sampling rate
    
    ReadBat();                              // Read which battery is connected and get its voltage
    
    datetimestamp = ((year-2000)*100000000) + (month*1000000) + (date*10000) + (hours*100) + mins;
    sprintf(Battdata,"%10lu,%.1fC,A: %4dmV,B: %4dmV,L: %4dmA,S: %4dmA",datetimestamp,tempC,mVA,mVB,loadmA,solarmA);
    Battdata[59] = '\0';   // ensure no corruption when wake up from WiFi off.
    
}

// This function writes a data block to flash with the datetimestamp
void Table()
{
    EEPROM.write(21,tableindex%256);            // store current tableindex every hour (same addresses read back in setup)
    EEPROM.write(22,tableindex/256);
    delay(50);

    flash->writeString(Battdata,tableindex*DATA_BLOCK_SIZE);
    delay(50);
   
    if(tableindex < (MAX_DATA_BLOCK-1))                              // barrel log resets to the beginning
        tableindex++;
    else
        tableindex = 0;
}


void ReadBat()
{
    
float tempread;
float tempvolts;

   
    tempread = analogRead(batPin);             
    tempvolts = tempread*3.3;                   // Reference voltage supplied by Spark Core
    tempvolts /= 4096;                          // Approx 0.8mV per integer unit
    tempvolts = Vfactor*tempvolts*(tempvolts - 1);            // Voltage increased by factor of potential divider
    tempvolts = (1000 * tempvolts) +mVoffset + (mAmVfactor*loadmA) +0.5;                               //conversion and rounding of positive  volts to millivolts 
    if (digitalRead(batAorBPin) == LOW)
        mVA = tempvolts;
    else
        mVB = tempvolts;
}

int Latch(String command)                       // Allows devices to be turned on or off through 1A latch relay
{
    
    if(command == "ON")
    {
        digitalWrite(latchAPin, LOW);        // ensure digital output is low before pulse
        digitalWrite(latchAPin, HIGH);
        delay(20);                           // 20ms pulse low-high-low
        digitalWrite(latchAPin, LOW);
        return 1;
    }
     else if(command == "OFF")
    {
        digitalWrite(latchBPin, LOW);        // ensure digital output is low before pulse
        digitalWrite(latchBPin, HIGH);
        delay(20);                           // 20ms pulse low-high-low
        digitalWrite(latchBPin, LOW);
        return 1;
    }
    else
        return -1;
}


// Create client connection through DNS server

void TCPClientCall()
{
int tempindex;
char tempbuf[7];
int count = 1;
char nnbuf[20] = "-01 days";
int readbuf = 0;


 

   // listen for incoming clients
    TCPClient client = server.available();

    if (client)
    {
       // Spark.disconnect();
        Serial.println("new client");
        // an http request ends with a blank line
        boolean currentLineIsBlank = true;
        while (client.connected())
        {
            if (client.available())
            {
                char c = client.read();
                if ( c == '=' )
                    readbuf = 1;
               else if (readbuf  ==1 || readbuf == 2)
               {
                    nnbuf[readbuf] = c;
                    if (c > '9')
                        readbuf =3;
                    else if (readbuf ==1)
                        count = (c - '0')*10;
                    else
                        count += c -'0';
                   readbuf++;
                }
                Serial.write(c);
                // if you've gotten to the end of the line (received a newline
              //   character) and the line is blank, the http request has ended,
              //   so you can send a reply
                if (c == '\n' && currentLineIsBlank)
                {
                    // send a standard http response header
                    client.println("HTTP/1.1 200 OK");
                    client.println("Content-Type: text/html");
                    client.println("Connection: close");  // the connection will be closed after completion of the response
                    client.println();
                    client.println("<!DOCTYPE HTML>");
                    client.println("<html>");
                    client.print(Coredata);
                    client.println("<br />");
                    client.print(Battdata);
                    client.println("<br />");
                    client.println("<br />");
                    client.print("NV MEMORY DATA");
                    if (readbuf == 4)
                        client.println("-command error - value between 00 and 99 please");
                    else
                        client.print(nnbuf);
                    client.println("<br />");
                    sprintf(tempbuf,"I:%4d",tableindex);
                    client.println(tempbuf);
                    client.println("<br />");
                    for (int i = count*24; i > 0; i--) 
                    {
                       tempindex = tableindex - i;
                       if(tempindex >=0)
                       {
                         // flash->read(NVdata,(tableindex-1)*60,60);
                           flash->read(NVdata,tempindex*DATA_BLOCK_SIZE,DATA_BLOCK_SIZE);
                           NVdata[59] = '\0';                                   // Ensure any string corruption is restricted
                           client.print(NVdata);
	                        client.println("<br />");
                        }
                       // else
                        //    break;
                    }
                    client.println("</html>");
                    break;
                } 
                if (c == '\n') 
                {
                    // you're starting a new line
                    currentLineIsBlank = true;
                }
                else if (c != '\r')
                {
                    // you've gotten a character on the current line
                    currentLineIsBlank = false;
                }
            }
        }
        // close the connection:
        client.flush();                      // discard any unread input before closing
        client.stop();
        Serial.println("client disconnected");
      //  Spark.connect();
    }
}


void Coreinfo()                                 // Called in setup to pull network connection details
{                                               // One or more of these function calls prevented WiFi.off/on and Spark.sleep from functioning when in main loop
    char macstr[24];
    byte mac[6];
    
    Network.macAddress(mac);   //get MAC address
    sprintf (macstr,"MAC %02x:%02x:%02x:%02x:%02x:%02x,",mac[5],mac[4],mac[3],mac[2],mac[1],mac[0]);


//	for (int i =0; i<12;i++)
//	    SerialID[i] = EEPROM.read(i);
//	SerialID[12] = '\0';
//	for (int i = 0; i < 6; i++)
//	    BuildDate[i] = EEPROM.read(i+12);
//	BuildDate[6] = '\0';
//	Version[0] = EEPROM.read(18);
//	Version[1] = EEPROM.read(19);
//	Version[2] = '\0';

	sprintf(Coredata,"MAC:%s,SN0123456789,140722,5.1",macstr);
	
}

Hi @stevespark

Thanks for posting the code! I added the magic markup to make the code look nice in your post.

Did you know that the wear-leveling code in Flashee is causing problems for other folks every time they restart their core? Sometimes it works; often you read garbage or, worse, old data from a previous write. Could that be messing up your TCPServer?

The author @mdma is away on holiday for a bit but promised to look at it when he comes back. See this thread for details:

I would also try moving your Spark.function() and Spark.variable() calls up to right after you know the connection is up, i.e. right after the 1-second delay after Spark.connect(). There have been problems with a change put into the cloud at the same time as deep update that requires Spark functions to be registered within a few seconds of the Spark protocol starting up. This change helped with the checking of the deep update status that happens automatically each time a core starts its cloud connection, but it hurt some users with long delays in setup() or who manage the connection closely.
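
A sketch of that reordering, based on the setup() you posted (only the order changes; the slower flash and EEPROM initialisation moves after the registrations):

    void setup()
    {
        WiFi.on();
        while (WiFi.status() != WIFI_ON)
            SPARK_WLAN_Loop();              // run background task until WiFi is up

        Spark.connect();
        delay(1000);

        // Register with the cloud first, so the query the cloud sends a few
        // seconds after the handshake finds the function and variable in time.
        Spark.function("Latch", Latch);
        Spark.variable("Battery", Battdata, STRING);

        // ... flashee setup, EEPROM reads, pinMode() calls, Time.zone(),
        // Spark.syncTime(), etc. follow here, exactly as before ...
    }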


Hi @Dave,

You were right.
We now have a delay between the moment we have a connection to the SparkCloud (Spark.connected() == true) and the moment we start publishing. We tested a number of delay times:

- 3 seconds: goes wrong
- 4 seconds: is unstable
- 5 seconds: goes well
- 10 seconds: that is what we have running now, to be on the safe side
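
In code, the workaround now looks roughly like this (simplified sketch; the event name and payload are placeholders):

    void setup() {
        Spark.connect();
        while (!Spark.connected())
            SPARK_WLAN_Loop();            // service the background task while waiting

        delay(10000);                     // 10 s grace period before the first publish

        Spark.publish("duikelData", "42");   // placeholder event name and payload
    }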

Is it possible to get some kind of feedback from the SparkCloud so that we know when we can start publishing?
To me that feels more reliable than depending on a delay.

Thanks,
Henk


Hi @nika8991,

Hmm, either the handshake calls should probably block, or the status shouldn’t change fully until the connection is truly ready to go. Something like a 5 second delay is really long. We’re looking into why the handshake stuff doesn’t seem to be blocking quite right here: https://github.com/spark/spark-server/issues/18
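
If you need a stopgap in firmware while we sort that out, something like this sketch (not an official API, just the pattern) keeps the background task serviced through a grace period instead of trusting Spark.connected() alone:

    // Block until connected, then keep running the background task for a
    // grace period so the handshake can fully settle. graceMs is made up.
    void waitForCloud(unsigned long graceMs) {
        while (!Spark.connected())
            SPARK_WLAN_Loop();               // run background task until connected

        unsigned long start = millis();
        while (millis() - start < graceMs)
            SPARK_WLAN_Loop();               // keep the cloud serviced while waiting
    }

    // Usage in setup(): Spark.connect(); waitForCloud(10000);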

Thanks!
David


Thanks for the feedback. I am aware of the garbled information occasionally appearing with Flashee. While the Flashee libraries appear to be compromising my own code, I think it is related to their size. I have now built a Core totally separate from the cloud, running the Flashee wear-levelling libraries, to see if I can isolate particular problems such as string size and speed of read/write operations. I look forward to @mdma’s return, as Flashee is critical for those of us with embedded data-logging requirements.


Hi @Dave,

I’m the colleague of Henk (@nika8991). The publish behaviour looks the same after the deep update, but the Spark.function() and the OTA are only possible with a time delay of about 5 seconds between Spark.connect() and the start of publishing. In our case the order after Spark.connect() is: first publishing, then the Spark.function(), and then the OTA.

The Spark.function() and the OTA are not possible when there is no delay between Spark.connect() and the start of publishing. Before the deep update that extra delay was not needed.

In fact, Spark.connected() comes too early: after Spark.connected() we assume that the SparkCloud connection is fully established.

Our assumption is that the connection from the SparkCloud to the SparkCore is not yet fully established when Spark.connected() is reported. Publishing only uses the direction from the SparkCore to the SparkCloud, and that one is established. But the Spark.function() and the OTA need the other, not yet fully established, direction (from SparkCloud to SparkCore)…

BTW
We notice that sometimes the first published messages are lost, even though we are sure we sent them out. But this problem also existed before the deep update.

Regards.
Albert.

Hi @elnavdo,

I haven’t forgotten about you! We’re pushing out another big round of firmware improvements to the Build IDE today, and I’m hoping they improve the hard-fault-after-connection issue we’re seeing here and on the local cloud.

Thanks,
David