Random EEPROM write/put failures - ARGON/XENON

So (thinking aloud) perhaps this is where the issue lies (IMHO) - in single threaded mode, access is no issue as all access is happening “sequentially” in the same thread but when is SYSTEM_THREAD(ENABLED) - there is concurrent access to the LittleFS library and the issue arises - same problem for example, when using this mode and the LCD library and software timers - it causes corruption as the underlying drivers are not thread aware and can be ungraciously interrupted as DeviceOS assumes its in control. Even though SYSTEM_THREAD(ENABLED) works well in most cases, all built in libraries (like EEPROM, WIRE, SPI etc) would need to have “transactional” support refined going forward.

@shanevanj, yup, thread safe behaviour for shared resources in the DeviceOS seems to not be consistent. Transnational devices (SPI, I2C, etc) also need to be protected, which seems to be lacking (eg the LCD example).

Wild thought here - could we start a tutorial type exercise to convert the current wire library to a transactional based one - I for one would love to learn how to do something like this as my code skills are not much beyond pointers, never mind classes and transactional code etc. I could not contribute in terms of code but would be happy to destructively test …

1 Like

@shanevanj, I’m not sure why you think the existing Wire support in DeviceOS is not transactional. Remember that it is tied to the hardware and may be implemented in the Nordic SoftDevice stack. Furthermore, I2C is intrinsically transactional. Perhaps you can explain more about what you want to achieve?

The SPI API does have a locking semantic.

https://docs.particle.io/reference/device-os/firmware/electron/#begintransaction-

The Wire I2C API does as well via Wire.lock/unlock but I can’t see it in the latest API documentation unless I’m missing something.

1 Like

I may have used the wrong layer here :-), and I may also have mangled some concepts - on reflection, if I take the issue I had with LCD corruption - I see now that the issue is going to be in the I2C LCD library and not the wire library - so this requirement of mine should change to say how do we make the I2C LCD library “transactional” i.e. if app thread is updating LCD and software timer (system thread) fires an lcd update - the two somehow play nice so you don’t end up with a corrupted display - an example would be a software timer updating the time of day in top left corner of LCD, while the app thread updates values from a sensor in the second line of the LCD. Currently with system_thread(enabled) this results in a corrupted display as the (in my understanding) LCD library is not thread aware (if thats even possible) nor is there a way to “take exclusive” ownership of the lcd. Perhaps it is as simple as implementing a few new functions like

  • bool lcd.available() - returns a boolean if not locked.
  • int lcd.lock() - returns a int (lockNumber) if completed, -1 if locked already
  • bool lcd.release(int lockNumber) - true completed, false lock number not found

@shanevanj, I2C is transactional. The problem is when the software timer interrupts an ongoing transaction being done by the user application. The typical solution is to create a MUTEX which prevents the single resource, the I2C hardware, from being interrupted in the middle of a transaction until it is complete. Thus “with lock” and “release” you mention.

Another solution, a la @rickkas7, is to create a thread that handles all display transactions with a buffer whose elements are built in the user application and/or software timer. This way ONLY the thread transacts with the I2C device.

1 Like

Retested with DeviceOS1.4.0 - it happens far less frequently, but it still fails. If system thread disabled, it does not fail in 24hrs of iterations of test code below. Device is locked up and needs a hard reset.

SYSTEM_THREAD(ENABLED);

const int dbOffset = 50;
const int sizeOfAttributeName = 4;
const int largestAttributeValue = 20;
const int numOfAttributes = 150;
retained unsigned int iterationCounter = 0;

struct
{ // common struct for attributes stored in EEPROM
  int attributeNum;
  char attributeValue[largestAttributeValue + 1];
  time_t lastChanged;
} attributes;

// ------------------------------------------------------------------------------
void eepromCreateDataTable() //set NVram to defaults
// ------------------------------------------------------------------------------
{

  for (int recno = 0; recno < numOfAttributes; recno++)
  { //

    attributes.attributeNum = recno;
    attributes.lastChanged = Time.now();
    snprintf(attributes.attributeValue, largestAttributeValue, "%05d", recno);

    EEPROM.put((recno * sizeof(attributes)) + dbOffset, attributes);

    Serial.printf("\nRecord %d added", recno);
  }
}

// ------------------------------------------------------------------------------
void selectAll()
// ------------------------------------------------------------------------------
{
  Serial.print("\n#\tName\tUpdated\t\tValue");

  for (int recno = 0; recno < numOfAttributes; recno++)
  {

    EEPROM.get((recno * sizeof(attributes)) + dbOffset, attributes);

    Serial.printf("\n%d\t%d\t\t%lu\t%s", recno, attributes.attributeNum, attributes.lastChanged, attributes.attributeValue);
  }
}

// ------------------------------------------------------------------------------
void setup()
// ------------------------------------------------------------------------------
{
  Serial.begin();               // open serial over USB
  while (!Serial.isConnected()) // wait for Host to open serial port
    Particle.process();
}
// ------------------------------------------------------------------------------
void loop()
// ------------------------------------------------------------------------------
{
  Serial.print("\nStarting EEPROM test");
  Particle.publish("EEPROM_TEST", "Started", PRIVATE);
  eepromCreateDataTable();
  selectAll();

  Particle.publish("EEPROM_TEST", String::format("Iteration %u completed", iterationCounter++), PRIVATE);
  Serial.printf("\nIteration %u completed", iterationCounter++);

  delay(5000);
}

This code fails every time

SYSTEM_THREAD(ENABLED);
STARTUP(System.enableFeature(FEATURE_RETAINED_MEMORY));

const int dbOffset = 0;
const int sizeOfAttributeName = 4;
const int largestAttributeValue = 20;
retained unsigned int iterationCounter = 0;
int maxRecords = 0;

struct
{ // common struct for attributes stored in EEPROM
  int attributeNum;
  char attributeValue[largestAttributeValue + 1];
  time_t lastChanged;
} attributes;

// ------------------------------------------------------------------------------
void eepromCreateDataTable()
// ------------------------------------------------------------------------------
{
  Serial.print("\nCreating data table");

  for (int recno = 0; recno < maxRecords; recno++)
  {
    attributes.attributeNum = recno;
    attributes.lastChanged = Time.now();
    snprintf(attributes.attributeValue, largestAttributeValue, "%05d", recno);

    EEPROM.put((recno * sizeof(attributes)) + dbOffset, attributes);

    //Check what we wrote
    // EEPROM.get((recno * sizeof(attributes)) + dbOffset, attributes);
    // if (atoi(attributes.attributeValue) != recno)
    // {
    //   Serial.print("\n#\tName\tUpdated\t\tValue  - # Name and Value should all be the same face value");
    //   Serial.printf("\n%d\t%d\t\t%lu\t%s", recno, attributes.attributeNum, attributes.lastChanged, attributes.attributeValue);
    //   Serial.printf(
    //       "- Error record %d value is %d",
    //       recno,
    //       attributes.attributeValue);
    // }
  }
}

// ------------------------------------------------------------------------------
void selectAll()
// ------------------------------------------------------------------------------
{
  Serial.print("\nVerifying data table");

  for (int recno = 0; recno < maxRecords; recno++)
  {

    EEPROM.get((recno * sizeof(attributes)) + dbOffset, attributes);

    if (atoi(attributes.attributeValue) != recno)
    {
      Serial.print("\n#\tName\tUpdated\t\tValue  - # Name and Value should all be the same face value");
      Serial.printf("\n%d\t%d\t\t%lu\t%s", recno, attributes.attributeNum, attributes.lastChanged, attributes.attributeValue);
      Serial.printf(
          "- Error record %d value is %d",
          recno,
          attributes.attributeValue);
    }
  }
}

// ------------------------------------------------------------------------------
void setup()
// ------------------------------------------------------------------------------
{
  Serial.begin();
  while (!Serial.isConnected())
    Particle.process();

  maxRecords = (EEPROM.length() - 1) / sizeof(attributes);
}
// ------------------------------------------------------------------------------
void loop()
// ------------------------------------------------------------------------------
{
  Serial.printf(
      "\nStarting EEPROM test - EEPROM size - %d",
      EEPROM.length());
  Serial.printf(
      "\nSpace for %d records of test struct of %d bytes",
      maxRecords,
      sizeof(attributes));
  Serial.print("\nNo progres shown unless error is detected");

  Particle.publish("EEPROM_TEST", "Started", PRIVATE);

  eepromCreateDataTable();
  selectAll();

  Particle.publish("EEPROM_TEST", String::format("Iteration %u completed", iterationCounter), PRIVATE);
  Serial.printf("\nIteration %u completed", iterationCounter++);

  delay(5000);
}

In what way does it fail exactly?
Do your serial prints provide any hints about how far it works and where it breaks?
Can you add more to get a clearer picture?
With SYSTEM_THREAD(ENABLED) you must ensure Particle.connected() == true before trying to Particle.publish().

BTW, it doesn’t matter for your code as is, but for your maxRecords calculation you’d also need to deduct dbOffset for the case dbOffset may not be zero anymore.

1 Like

So this is interesting .... sometimes it just stops in the Creating table function.

If I uncomment the //Check what we write section, then it takes much, much longer to lock up, which is why o wanted to use the retained iterationCounter - but that is not behaving as I expect see:

I have left the Argon off for 4 hours now and just connected it and so far has done 150 iterations all ok .. so I will let it run overnight and see what I get?

So there is definitely an anomaly that randomly happens

Test code:

SYSTEM_THREAD(ENABLED);

const int dbOffset = 0;
const int sizeOfAttributeName = 4;
const int largestAttributeValue = 20;
int maxRecords = 0;

struct
{ // common struct for attributes stored in EEPROM
  int attributeNum;
  char attributeValue[largestAttributeValue + 1];
  time_t lastChanged;
} attributes;

// ------------------------------------------------------------------------------
void eepromCreateDataTable()
// ------------------------------------------------------------------------------
{

  Serial.println("\nCreating data table");

  for (int recno = 0; recno < maxRecords; recno++)
  {
    attributes.attributeNum = recno;
    attributes.lastChanged = Time.now();
    snprintf(attributes.attributeValue, largestAttributeValue, "%05d", recno);

    EEPROM.put((recno * sizeof(attributes)) + dbOffset, attributes);
    printDots();

    // Check what we wrote
    EEPROM.get((recno * sizeof(attributes)) + dbOffset, attributes);
    if (atoi(attributes.attributeValue) != recno)
    {

      Serial.print("\n#\tName\tUpdated\t\tValue  - # Name and Value should all be the same face value");
      Serial.printf("\n%d\t%d\t\t%lu\t%s", recno, attributes.attributeNum, attributes.lastChanged, attributes.attributeValue);
      Serial.printf(
          "- Error record %d value is %d",
          recno,
          attributes.attributeValue);
    }
  }
}

// ------------------------------------------------------------------------------
void selectAll()
// ------------------------------------------------------------------------------
{
  Serial.println("\nVerifying data table");

  for (int recno = 0; recno < maxRecords; recno++)
  {

    EEPROM.get((recno * sizeof(attributes)) + dbOffset, attributes);
    printDots();

    if (atoi(attributes.attributeValue) != recno)
    {
      Serial.print("\n#\tName\tUpdated\t\tValue  - # Name and Value should all be the same face value");
      Serial.printf("\n%d\t%d\t\t%lu\t%s", recno, attributes.attributeNum, attributes.lastChanged, attributes.attributeValue);
      Serial.printf(
          "- Error record %d value is %d",
          recno,
          attributes.attributeValue);
      while (1)
      {
        Particle.process();
      }
    }
  }
}

// ------------------------------------------------------------------------------
void printDots()
// ------------------------------------------------------------------------------
{
  const int maxDotWidth = 40;
  static int dotWidth = 0;

  if (dotWidth++ < maxDotWidth)
  {
    Serial.print(".");
  }
  else
  {
    Serial.println();
    dotWidth = 0;
  }
}

// ------------------------------------------------------------------------------
void setup()
// ------------------------------------------------------------------------------
{
  Serial.begin();
  while (!Serial.isConnected())
    Particle.process();

  maxRecords = (EEPROM.length() - 1 - dbOffset) / sizeof(attributes);
}
// ------------------------------------------------------------------------------
void loop()
// ------------------------------------------------------------------------------
{
  static int iterationCount = 0;

  Serial.printlnf(
      "\nStarting EEPROM test - EEPROM size - %d \
      \nSpace for %d records of test struct of %d bytes",
      EEPROM.length(),
      maxRecords,
      sizeof(attributes));

  eepromCreateDataTable();
  selectAll();

  Serial.printf("\nIteration %u completed", iterationCount++);

  delay(5000);
}

Sometimes only a manual DFU mode will bring it back.

If the SYSTEM_THREAD(ENABLED); is commented out - it runs perfectly okay.

Update:

SYSTEM_THREAD is NOT enabled - 2150 iteration completed with no issue

With SYSTEM_THREAD(ENABLED) fails within first iteration

1 Like

Just to confirm the test code when bang on first test on Argon running 1.4.2 and is probably the cause of all the hassles we have had with devices going solid cyan in the field affecting about 400 customers. Thank you so much for posting this.

@shanevanj Did you get a resolution to the original github issue or Particle support ticket raised? I just read through this long topic and I am now thinking that my use of EEPROM to store parameters on a Xenon Ethernet mesh network gateway might be the cause of occasional lock-ups (loss of cloud connection and a failure to reconnect) - I have a work around which is to call reset if the cloud connection is lost for more than 30 seconds - not ideal but works.

I have not run the benchmark test from the GitHub issue recently - but I will do so this weekend to check on the same hardware I used originally and latest OS

Ok. Run test APP on 1.4.4 and it seems to be running as expected. No freezes in the first few iterations - I will leave on for a few hours and see how it goes

Test App

SYSTEM_THREAD(ENABLED);

const int dbOffset = 50;
const int sizeOfAttributeName = 4;
const int largestAttributeValue = 20;
const int numOfAttributes = 75;

struct
{ // common struct for attributes stored in EEPROM
  int attributeNum;
  char attributeValue[largestAttributeValue + 1];
  time_t lastChanged;
} attributes;

// ------------------------------------------------------------------------------
void eepromCreateDataTable() //set NVram to defaults
// ------------------------------------------------------------------------------
{

  Serial.printf("\nStarting EEPROM test");

  for (int recno = 0; recno < numOfAttributes; recno++)
  { //

    attributes.attributeNum = recno;
    attributes.lastChanged = Time.now();
    snprintf(attributes.attributeValue, largestAttributeValue, "%05d", recno);

    SINGLE_THREADED_BLOCK()
    {
      EEPROM.put((recno * sizeof(attributes)) + dbOffset, attributes);
    }
    Serial.printf("\nRecord %d added", recno);
  }
}

// ------------------------------------------------------------------------------
void selectAll()
// ------------------------------------------------------------------------------
{
  Serial.print("\n#\tName\tUpdated\t\tValue");

  for (int recno = 0; recno < numOfAttributes; recno++)
  {

    EEPROM.get((recno * sizeof(attributes)) + dbOffset, attributes);

    Serial.printf("\n%d\t%d\t\t%lu\t%s", recno, attributes.attributeNum, attributes.lastChanged, attributes.attributeValue);
  }
}

// ------------------------------------------------------------------------------
void setup()
// ------------------------------------------------------------------------------
{
  Serial.begin();               // open serial over USB
  while (!Serial.isConnected()) // wait for Host to open serial port
    Particle.process();

}

// ------------------------------------------------------------------------------
void loop()
// ------------------------------------------------------------------------------
{
  static int iterationCount = 0;
  eepromCreateDataTable();
  selectAll();
  Serial.printlnf("\n\n----------------------------\nIteration:%d\n----------------------------", ++iterationCount);
  delay(5000);

}

I spoke too soon …

XENON:
Standalone - no issue
When plugged into a ethernet wing - as soon as it connects to particle cloud - it freezes on the write operations

ARGON:
Freeze almost immediately

I would not call EEPROM from within a SINGLE_THREADED_BLOCK. It might work sometimes, but it can also very easily deadlock in the following scenario:

On Gen 3 devices EEPROM is implemented in a file on the LittleFS file system, which is on the external QSPI flash.

SPI transactions are guarded by a mutex so multiple threads will not try to simultaneously access the SPI bus, but thread swapping is still enabled during the operations.

The deadlock can occur if your code is swapped in while the system thread is performing a SPI transaction. The system is still holding the SPI mutex, and this is fine. However, if your code enters a SINGLE_THREADED_BLOCK and tries to make a call that also needs to use SPI, it will block forever. Your code cannot obtain the SPI mutex because the system thread is holding it. However the system can’t release it because the SINGLE_THREADED_BLOCK prevents it from swapping in.

Basically, you must never use a SINGLE_THREADED_BLOCK around a mutex-protected object. Since it’s difficult to know what resources are mutex protected, it’s best to just never use SINGLE_THREADED_BLOCK,.

1 Like

I understand what you are saying. I am using a Xenon in an ethernet featherwing and I don’t believe I am using SINGLE_THREADED_BLOCK in my application and not certainly not around EEPROM.put or EEPROM.get. I am using I2C bus with the FRAM and the PublishAsyncQueue - both your libraries - without looking at them I assume that you would not have used SINGLE_THREADED_BLOCK in either (I haven’t checked yet). What I am seeing is something different which is the cloud connection is lost and then is never re-established unless reset is called.