Random EEPROM write/put failures - ARGON/XENON

xenon
argon
Tags: #<Tag:0x00007fe21dd84148> #<Tag:0x00007fe21dd83f40>

#1

I have been noticing a random issue affecting two gen3 devices an Argon and a Xenon. In summary after uploading firmware and then doing a series of more than 10 EEPROM put/write (or sometimes even a read/get - will cause the device to SOS 10 (or sometimes lock up solid cyan) - if the device self resets it will run the app again but will NOT fault. It only happens after a firmware upload.

This code below will cause the issue almost every time (It may be my code and dont be shy to tell me - I did try two completely different methods of using the EEPROM to check and both show same issue)

SYSTEM_THREAD(ENABLED);

const int dbOffset = 50;
const int sizeOfAttributeName = 4;
const int largestAttributeValue = 20;
const int numOfAttributes = 75;

struct {  // common struct for attributes stored in EEPROM
  int attributeNum;
  char attributeValue[largestAttributeValue + 1];
  time_t lastChanged;
} attributes;


// ------------------------------------------------------------------------------
void eepromCreateDataTable() //set to defaults
// ------------------------------------------------------------------------------
{
    
    Serial.printf("\nStarting EEPROM test");
    
    for ( int recno = 0; recno < numOfAttributes; recno++ ) { //
    
        attributes.attributeNum = recno;
        attributes.lastChanged = Time.now();
        snprintf(attributes.attributeValue, largestAttributeValue, "%05d", recno  );
    
        EEPROM.put((recno * sizeof(attributes)) + dbOffset, attributes);
        Serial.printf("\nRecord %d added", recno);
    }
}

// ------------------------------------------------------------------------------
void selectAll()
// ------------------------------------------------------------------------------
{
    Serial.print("\n#\tName\tUpdated\t\tValue");
    
    for (int recno = 0; recno < numOfAttributes ; recno++) {

        EEPROM.get((recno * sizeof(attributes)) + dbOffset, attributes);
        
        Serial.printf("\n%d\t%d\t\t%lu\t%s", recno, attributes.attributeNum, attributes.lastChanged, attributes.attributeValue);
    }

}

// ------------------------------------------------------------------------------
void setup() 
// ------------------------------------------------------------------------------
{
    Serial.begin();   // open serial over USB
    while(!Serial.isConnected()) // wait for Host to open serial port
        Particle.process();
    
    eepromCreateDataTable();
    selectAll();
    
    Serial.print("\nTest completed");
    
}

// ------------------------------------------------------------------------------
void loop() 
// ------------------------------------------------------------------------------
{
    delay(5000);
    Serial.print(".");
}


This is a show stopper as I need to store about 800 bytes as configuration parameters for a project.


#2

Even with

ATOMIC_BLOCK() {
             EEPROM.put((recno * sizeof(attributes)) + dbOffset, attributes);
        }

It still fails randomly


#3

So… some progress

If the above code is called with SYSTEM_THREAD(ENABLED); it failed every time (90%) - if without it works every time - I would suspect a cross thread issue or something in DeviceOS (I dont have the skills to even know where to start looking)

My workaround is

SYSTEM_THREAD(ENABLED);
SYSTEM_MODE(SEMI_AUTOMATIC);

and then once my EEPROM stuff is done

Particle.connect();

Not pretty - but it works reliably!


#4

You probably could wrap the EEPROM stuff in a SINGLE_THREADED_BLOCK

ATOMIC_BLOCK is a nuclear option (only meant for very specific short blocks) that may also interfere with interrupts causing other troubles when the EEPROM action takes longer (e.g. involving data relocation and/or a page erase - particularly on Gen1 & 2).
Hence your random problems I guess.


#5

Has this been flagged as a github issue?


#6

@ScruffR I’ll test that out
@peekay123 - I dont know how to - let me test the above first and I’ll then give it go


#7

Nope no joy - solid lock up after adding 47 records - so I need to raise issue


#8

Flagged here : https://github.com/particle-iot/device-os/issues/1832


#9

I don’t think it’s safe to use EEPROM.write() in a SINGLE_THREADED_BLOCK.

On Gen3 devices, EEPROM is implemented as a file in the LittleFS file system, not actually as EEPROM emulation in flash, as in the Gen2 STM32 devices.

You could definitely run into a problem in the following scenario, which may or may not be related to the problem you are seeing:

When manipulating the file system, the LittleFS uses a Mutex to prevent simultaneous access to the file system from different threads.

FreeRTOS can thread swap during the operation because swapping and interrupts are still enabled.

However, if you do an EEPROM write in a SINGLE_THREADED_BLOCK the EEPROM code will try to access the file system, where it will hit the file system mutex and halt waiting for it. However, since it’s in a SINGLE_THREADED_BLOCK, you’ll get into a situation where the mutex can never be obtained because the other thread won’t release it, because it can’t swap in.

Using a mutex instead of preventing thread swap will probably work better.


#10

I only posted the code with the SINGLE_THREADED_BLOCK as it was a diagnostic attempt based on @ScruffR suggestion. The code still still fails without it.


#11

@rickkas7, that seems to be a somewhat quirky approach especially since it isn’t really obvious that, why and under what circumstances the system thread would lock the LittleFS mutex all on its own.
Is any of that documented?

Since the SINGLE_THREADED_BLOCK appears to prevent interference of other threads during the action is would seem logical to use that in cases like the one outlined in this thread.


#12

It would seem to me that neither of these approaches is acceptable since the HAL should remove all this under-the-hood complexity and keep the API as-is. The command should work as documented under all circumstances.


#13

I fully agree on that.

However, when reading this

The “logical” first thing that jumps to mind in order to work around that “bug” would be SINGLE_THREADED_BLOCK which in itself should not create another issue :wink:

In which and how many other cases would that “mutex” issue (or anything comparable) come and bite us in the butt again??? :flushed:


#14

The idea to use SINGLE_THREADED_BLOCK to workaround the issue is a reasonable idea but fundamentally dangerous. Device-OS will often use a mutex to try and provide thread-safe operation in Device-OS resulting in possible deadlocks.

In general SINGLE_THREADED_BLOCK should be used for the shortest time necessary, do the minimum amount of work possible, and not unconditionally access any shared resource that could potentially be owned by another thread. Serial ports, cellular, WiFi, EEPROM, dynamic memory alloc/free, etc are all shared resources.

If reproduced this will likely have to be fixed in Device-OS.


#15

This needs to be in the reference documentation and list out the potential shared resources - I had a similar issue when I started my current Gen3 project - I was using the DeviceOS timers to update a LCD over I2C and this caused corruption as I was unaware of the shared nature of the various resources. It is obvious now in hindsight - but to someone starting out in a multithreaded world - there are some traps…


#16

Agreed, for the “shortest time necessary” from a useres perspective a EEPROM.put() should fullfill that condition but how would one know which combos and circumstances may lead to a deadlock (stretching the anticipated shortes time indefinetly) if it wasn’t clearly documented? :wink:

Yup, exactly that - for EEPROM and SINGLE_THREADED_BLOCK alike.
e.g. SINGLE_THREADED_BLOCK may need to check for system mutexes before entry and provide some means to bail out (with feedback).


#17

Can this also affect v2 devices (photon/p1) or is the implementation significantly different?


#18

@Elco, EEPROM in Gen3 devices is implemented as a LittleFS file in flash. On Gen2 devices, it is implemented entirely differently so the issue has not come up.


#19

What else is stored in LittleFS - does DeviceOS use it for internal purposes?


#20

@shanevanj, yes, DeviceOS uses LittleFS to store data in the external flash. There is a github PR asking for access to the unused portion of external flash for user access.