SD Card Crashes periodically in embedded PCB with Electron

Hey y'all.

Been tearing my hair out recently trying to troubleshoot and determine the root cause of frequent SD card crashes in our embedded systems. This is related to a discussion between myself, @SCM and @mbyamukama that was started a while back (see link below), but I wanted to move the discussion to its own thread.

Some relevant info:

  • We are using SDFat (SPI running at 10 MHz) running on Particle OS release/0.6.1
  • We are also running an LCD display off of this SPI bus (write-only)
  • We are using authentic SanDisk 8 GB Standard and 16 GB Ultra microSD cards
  • We are powering the SD card off of the 3v3 power supply that is built into the Electron
  • I have implemented a cooldown period between calls to Particle.publish() and SDFat sync operations (writing info to the SD card) to try and avoid transmission-induced voltage supply transients causing problems with SD card writes
  • We have observed that SD card crashes are much more prevalent when there are local network connectivity issues (which presumably would cause the Electron to increase its transmission power, thus increasing potential for EMI and voltage supply dips/transients)
  • The SDFat error codes and error data associated with SD card crash events seem to be widely variable
  • (edit) I'm running SYSTEM_THREAD(ENABLED) and SYSTEM_MODE(SEMI_AUTOMATIC)
  • (edit) I'm using SDFat V 1.0.5 downloaded directly from GitHub (not the SDFat from Particle libraries)
  • (edit) I'm using the primary SPI peripheral (the one that uses pins A3, A4, A5)

The current working theory:

Electron transmission bursts are causing voltage supply dips/transients and/or EMI that are in turn causing the SD card to crash

Currently planned PCB re-design solutions to this problem:

  • Move the location of the SD card holder such that it will be located away from the side of the Electron that the cellular antenna connects to
  • Add a dedicated 3v3 regulator (with EN pin) to power the SD card (can be power cycled programmatically by the Electron in the event of an SD crash)

Questions:

  • Anyone else having issues with SD cards crashing in embedded applications with the Electron?
  • What about in non-Electron embedded applications?
  • Anyone else have any theories on what the root cause of these SD crashes might be?
  • Does anyone know of any way to force an Electron to transmit cellular bursts at full intensity so that I can test the current working theory?

Do you have ongoing SD Card operations interspersed with networking commands? I use SDFat regularly, but I’m generally doing writes of about a block once every 5 seconds, so I just force a sync on every write, so I never have a networking function called when I am un-synced. What SYSTEM_MODE are you using? If you’re single threaded, it’s possible you’re getting a long blocking networking reconnect attempt that’s causing some kind of timeout for the SD library.

Voltage transients could definitely be a cause. How quickly can you detect the crash - would you be able to use some other signal as a trigger for a scope capture? Assuming that you are running off of non-battery power, have you tried adding the LiPo battery and observing if the behavior changes? You could also simply scope with a one-shot trigger at the rated min supply voltage for your SD card and see what you get.

Either way, having the ability to power cycle your SD card sounds like a good idea.

To test higher power usage, you can try wrapping the antenna in aluminum foil and sticking it in a metal water bottle - I normally get -89dB RSSI and that takes me right to the minimum strength ~ -110dB RSSI. I ran a test like that for a few weeks and never got an SD card crash with SanDisk 8GB cards, and I would expect that if the issue was a voltage transient, that it doesn’t matter what else you’re doing read/write wise.


Every time I publish data using Particle.publish(), I also record that published data to the SD card so that if any published data doesn't make it to my web server, I can call a function on the Electron to get it to re-publish the missed publish.

Any time I write to any of my SD files, I immediately sync them before doing anything else (including working with another SD file). So, I should never be "un-synced".
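For anyone wanting to enforce that write-then-sync rule in one place, here is a minimal sketch of a helper that does both steps and reports which one failed. The names `writeAndSync` and `SdWriteResult` are made up for illustration. It assumes the file object exposes `write(buf, len)` returning the number of bytes written (short or negative on error) and `sync()` returning a bool, which is the shape of SdFat's file classes as far as I know. It is templated so it can be exercised with a fake file off-device.

```cpp
#include <cstddef>

// Hypothetical result type: tells the caller which step failed.
enum class SdWriteResult { Ok, WriteFailed, SyncFailed };

// File is assumed to provide write(const void*, size_t) returning a byte
// count (short or negative on error) and sync() returning bool.
template <typename File>
SdWriteResult writeAndSync(File& file, const void* data, size_t len) {
    // Anything other than the full length means the write did not land.
    long long written = static_cast<long long>(file.write(data, len));
    if (written != static_cast<long long>(len)) {
        return SdWriteResult::WriteFailed;
    }
    // Flush data and directory entry before touching any other file.
    if (!file.sync()) {
        return SdWriteResult::SyncFailed;
    }
    return SdWriteResult::Ok;
}
```

On the device this would be called with the real file object, and anything other than `SdWriteResult::Ok` would go straight to the SD error handler.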

I'm using SYSTEM_THREAD(ENABLED) and SYSTEM_MODE(SEMI_AUTOMATIC)

Assuming the crash is occurring during a read/write to the SD card, then I can detect it pretty much as soon as the SDFat function that was doing the reading/writing returns. In the case of read/write timeout errors, the SDFat function will not return until a timeout is reached (which I believe is about 300 ms).

If the crash is not happening during an SD card read/write then I guess I have no way of knowing when the crash is occurring.

I have the LiPo battery plugged into all currently deployed Electrons (in Tanzania). Here on the development Electron (in Halifax), I usually have the LiPo battery removed to try and force an SD crash but haven't noticed any difference. Also, on my development Electron I am using a modified USB cable with the V+ wire cut so that I am not supplying any USB power to the Electron while collecting serial debug logs.

I have written a program that I can run on a second Electron sitting in the same enclosure as the deployed Electron. It continually takes analog readings of the deployed Electron's 3v3 bus, and then takes a bunch of samples whenever the voltage dips below some pre-set threshold. It then publishes info about the waveform observed (minimum, maximum, average, us below threshold, etc). I plan on implementing this in a deployed hub at some point soon.
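The summary step of a monitor like that could look roughly like the sketch below. This is not the actual program, just an illustration: the struct and function names are hypothetical, samples are assumed to be already converted to millivolts, and the sampling interval is assumed to be uniform. Whether the ADC samples fast enough to catch a short transient is a separate question that this does not answer.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical summary of one captured burst of 3v3 samples.
struct DipSummary {
    uint16_t minMv;             // lowest sample, in millivolts
    uint16_t maxMv;             // highest sample, in millivolts
    uint32_t avgMv;             // integer mean of all samples
    uint32_t usBelowThreshold;  // time spent below the threshold
};

// samplesMv: readings in millivolts taken every samplePeriodUs microseconds.
DipSummary summarizeDip(const std::vector<uint16_t>& samplesMv,
                        uint16_t thresholdMv, uint32_t samplePeriodUs) {
    DipSummary s{0, 0, 0, 0};
    if (samplesMv.empty()) {
        return s;  // nothing captured, report all zeros
    }
    s.minMv = samplesMv[0];
    s.maxMv = samplesMv[0];
    uint64_t sum = 0;
    uint32_t below = 0;
    for (uint16_t mv : samplesMv) {
        s.minMv = std::min(s.minMv, mv);
        s.maxMv = std::max(s.maxMv, mv);
        sum += mv;
        if (mv < thresholdMv) {
            ++below;  // count samples under the trigger level
        }
    }
    s.avgMv = static_cast<uint32_t>(sum / samplesMv.size());
    // Each sample under the threshold stands for one sampling interval.
    s.usBelowThreshold = below * samplePeriodUs;
    return s;
}
```

The four fields are small enough to fit into a single publish payload, which keeps the monitor's own cellular traffic low.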

Great tip! Thanks.

@jaza_tom, are you using SPI DMA transfers? If so, are you blocking during the transfer or did you specify a callback?

Incidentally @jaza_tom, I never asked which SPI port you are using. Is it the primary SPI with DMA? Have you tried bit-banging using other digital pins to observe any change? SDFat supports this. I doubt very much that moving the card a few cm away will change any EMI effects, or that EMI is causing this in the first place. It really has all the characteristics of a power issue.
The wireless nodes in my referenced thread use the FatFs library when hooked up with SD cards and we power everything using an LP2590 3.0V LDO regulator. You can source up to 250mA so your solution of a dedicated supply sounds promising. In fact, you should probably hook the new regulator to the LCD instead and source power for the SD card directly from the Electron.

If you could share the code, or some abridged version of it, we could test and see if network connectivity is in fact a cause or a trigger.

Not that I'm aware of. I'm using SdFat 1.0.5 downloaded straight from GitHub (as opposed to the library available in Particle Libraries)... which seems to support DMA but I don't know how to tell if it is actually using DMA or not.

If I do a global search through all the SDFat files for the transfer() function, here is what I get:

So, based on those function signatures, I would say that if indeed SDFat is using DMA transfers when syncing to the SD card, it is doing so synchronously (since no callbacks are specified). Also, these search results would seem to indicate that the LCD is not using DMA (the ST7565.cpp hit is only transferring 8 bits)

Here are the relevant pieces of code which I'm guessing would determine whether or not SDFat is using DMA:

SdFat sd;            //The instance of the SDFat utility
SdFile file;   //Instance of the SdFile class (from SDFat library)

when initializing SDFat instance:

   if (!sd.begin(chipSelect, my_spi_settings)) {
      //Initialization error
      SD_error_handler(__LINE__);
      //Done
      return;
   }

where

SPISettings my_spi_settings(10 * MHZ, MSBFIRST, SPI_MODE0);

and later on when I'm done altering a file, I use

bool syncSuccess = file.sync();

Yes, it is the primary SPI with DMA (the SPI object in the Particle firmware reference that uses pins A3, A4, A5)

I haven't tried this yet. Do you know how to force SDFat to use software (bit-banged) SPI instead of the hardware peripheral?

I'll keep this in mind

You're talking about the library linked to here, right? Why did you opt to use FatFS instead of SDFat? Does it compile into a smaller program size or have higher reliability or some other benefit over SDFat?

What makes you say that? I thought the whole idea of a dedicated power supply is getting the SD card powered from a different supply than the one powering the Electron so that cellular transmission bursts don't affect the output voltage of the power supply that's powering the SD card. Am I missing something?

Thank you for the generous offer! I will work on putting together an abridged version of the code that you could test out for us on your Electrons running in Uganda.

I just tried connecting the antenna and putting it in a metal thermos... didn't seem to make any difference to received signal strength (pressing MODE button once resulted in 4 blinks, i.e. 4 bars). Didn't have any aluminum foil handy however.

I normally do not have the antenna connected at all (just a bare uFL connector) in order to try and simulate poor network conditions. Without an antenna, I get 0 blinks (i.e. 0 bars). Yet somehow the Electron is still able to connect.... I think there are some cellular network repeaters literally installed on the roof of our offices...

I haven't been able to cause an SD crash on the development Electron running with no antenna (0 bars), which to me would indicate that power supply transients are unlikely to be the issue (since this Electron would be transmitting at maximum power, right?)

@jaza_tom, it looks like the SD is being run with non-DMA SPI. Why did you choose not to use the SDFat library available on the Web IDE? Using DMA would allow for faster block transfers. Are you using the SPI in a software timer or ISR?

If I recall correctly, it wasn't actually a decision I ever had to make. When I started using SDFat, I believe it was before you could access the library files via the Web IDE, and was definitely before the support for libraries was integrated into the desktop IDE.

Looking at the web IDE version now, I see that clicking on the GitHub icon brings me to a GitHub project called greiman/SdFat-Particle

I didn't know there was a Particle-specific version of SDFat!

I'm guessing the main difference between

and

is that the former supports DMA SPI features on Particle devices. Is that correct?

I tried looking through:

but failed to find any Particle-specific documentation that differs from the SdFat repository doxygen documentation.

In any case, do you think that using SPI with DMA would help my situation? Or is there any other difference between the two versions of SdFat that could cause problems?

@jaza_tom, I can’t vouch for how you architected your code. However, if you are sending blocks of data to the SD, having DMA will make things a lot faster. The Web IDE library was tuned for the Particle platform and supports multiple SPI models, not just DMA. It may be worth testing a branch of your code with the IDE library.

However, you did not answer my question re Software Timers or ISRs.

Whoops, right you are. Sorry :open_mouth:

I am not using Software Timers or attachInterrupt() anywhere in my code, so I assume that means that I am not using the SPI in a software timer or ISR.

I do use the SPI in a system callback though. Basically like this:

System.on(reset_pending + reset, reset_pending_callback);

void reset_pending_callback(system_event_t event, int data){
   //If event is a reset pending event
   if(event == reset_pending){
      //Save cellular data consumption info to SD card
      CellularData dataUsage;
      //Fill the struct (pass the CellularData object, not the int event data)
      Cellular.getDataUsage(dataUsage);
      function_that_writes_celldata_to_SD_card(dataUsage);
      //Authorize reboot
      System.enableReset();
      return;
   }
}

Does that count?

Hi again.

Yes. Look at lines 23-28 in the TryMeFirst.cpp example here. The good thing is you still use the same Sd object. Everything else stays the same apart from the changed pin connections.

Yes. The wireless sensor nodes we are using are based on the ATMEGA256RFR2 and running the Contiki OS. FatFs implementations for ATMEGA devices were readily available. We kind of "dived straight into" writing the driver and it worked almost immediately so we never looked back. FatFS however is generic while SDFat is developed for SD cards. It's the kind of solution you'd want if, say, you needed to use a USB stick.

Yes. Sorry, it was I who was mistaken here. I'd lost that all-important fact in the flurry of detail and forgotten that this is what we want to do.
This got me curious to look at the design schematics in more detail. We read here that

The output (3.8V net) of the PMIC is fed directly to the u-blox cellular module and a 3.3VDC high efficiency switching regulator (TPS62290). This 3.3VDC regulator helps power the microcontroller, fuel gauge and the buffers.

So the 3.3V line is not directly shared with the modem (the modem needs >3.3V to start anyway according to the Sara U2 series datasheet, page 25). The TPS62290 datasheet doesn't specify any line regulation data, but it is plausible that we have some dips on that bus when the modem powers on.
Do you have a high frequency oscilloscope available? Before redesigning, it would be prudent to have a look at the detailed voltage profile on the 3.3V pin when the modem switches on and starts transmitting.

By the way, do they ever crash indefinitely or is it only once in a while?

I attempted to use SDFat for a project about 9 months ago, and it would not work consistently for me.
The SD cards were constantly becoming corrupted and needed to be reformatted.

I am probably going to keep an eye on this thread as I would like to be able to use SD cards with my particle projects.

My use case is to log 1 second data for less than 20 floating point variables, periodically compress that data and send it up to the cloud.


I have done this with a 100 MHz digital storage oscilloscope but did not detect any appreciable dip on the 3V3 bus. I think I'll hook it up again and let it sit running for a while with my scope set to trigger on voltage dips.

Not sure what you mean. In most of our deployments, an SD error will be encountered once every few days. I would say that about 80% of the errors we encounter can be solved by rebooting the Electron. My app automatically reboots when there is an SD error so that usually fixes things immediately. This still leaves me with that other 20% of the time though, in which case the Electron will sit in a custom "safe mode" that I coded, and informs me hourly that the SD card needs to be power cycled in order to recover. We then send out a technician to power cycle the SD card and attempt to recover any corrupt files from a previous backup.

Daily backups is one technique that has been saving our bacon lately, as it seems that even when an SD card gets corrupted, the backup folders are usually still usable.

I think you could probably get your use case to work well with SdFat as long as you design such that you are only using the SD card when the modem is off. Also, you should be able to recover from most SD errors if you design your hardware such that the Electron can programmatically power cycle the SD card if/when it crashes.
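A rough sketch of what that programmatic recovery could look like: try a plain re-initialization first, and only then start power cycling the card. Everything here is hypothetical. The hooks stand in for whatever your hardware provides (a wrapper around `sd.begin(...)`, a GPIO driving the regulator's EN pin, a delay), and the retry count and off time are placeholder values, not measured ones.

```cpp
#include <functional>

// Hypothetical hardware hooks, supplied by the application.
struct SdRecoveryHooks {
    std::function<bool()> reinit;          // e.g. wraps sd.begin(...)
    std::function<void(bool)> setPower;    // drives the regulator EN pin
    std::function<void(unsigned)> waitMs;  // blocking delay
};

enum class SdRecovery { ReinitOk, PowerCycleOk, Failed };

// Try the cheap fix first, then escalate to power cycling the card.
SdRecovery recoverSd(const SdRecoveryHooks& hw,
                     int maxPowerCycles = 3, unsigned offTimeMs = 500) {
    if (hw.reinit()) {
        return SdRecovery::ReinitOk;
    }
    for (int i = 0; i < maxPowerCycles; ++i) {
        hw.setPower(false);    // cut the card's supply
        hw.waitMs(offTimeMs);  // let the rail discharge
        hw.setPower(true);     // restore the supply
        hw.waitMs(offTimeMs);  // give the card time to start up
        if (hw.reinit()) {
            return SdRecovery::PowerCycleOk;
        }
    }
    return SdRecovery::Failed;  // needs a technician or a safe mode
}
```

Returning which stage succeeded makes it easy to log how often a plain re-init is enough versus how often the card actually needed its power cut.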

Just an update on this issue (thanks to @grantWTI for reaching out to prod me on this).

Our fleet of Electrons seems to be behaving much better in terms of rate of SD card errors and corruption.

Here are the things that we changed that seem to have really helped us:

  • Realized that corruption of the SD card can only happen when writing to it (read operations cannot corrupt the SD card as far as I can tell)
  • Realized that we could back up all the files on the SD card into subfolders on a daily basis, and that these subfolders for some reason seem to be protected entirely from corruption (only files in root folder seem to get corrupted)
    • If an SD error does occur, then you can always just revert to your last saved backup
  • Visited each of our deployed Electrons and repositioned the cellular antennas such that they were as far as possible from any conductors that connect to the SD card
  • Added a blocking 250 ms cooldown period after any call to Particle.publish() if a write request to the SD card is requested (i.e. file.sync() in SdFat library is called)
    • In other words, right before calling file.sync(), block until it has been at least 250 ms since the last time you called Particle.publish()
  • Running the SPI bus at 10 MHz seems to work pretty well
  • Went through my code and looked for opportunities to skip writing to the SD card (for example, if you are backing up a RAM value to the SD card once per minute, consider changing it to once per 5 minutes)
  • Refactored my higher level SD card interface code to do lots of error checking. Every time you call an SDFat library function, check the returned value. As soon as an SD error is detected, immediately terminate all pending SD operations that your code would otherwise have coming, and then call your SDFat initialization procedure
  • Ensured that we were using authentic Kingston 32GB SD cards (cheaper cards were corrupting way more frequently, some were just unusable)
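Two of the items above (the 250 ms cooldown and the "stop everything on the first SD error" rule) can be sketched in a few lines. This is an illustration only and not the code referred to below. `cooldownRemainingMs` and `SdGuard` are hypothetical names, and timestamps are assumed to come from a 32-bit millisecond counter such as `millis()`.

```cpp
#include <cstdint>

// Milliseconds still to wait before an SD write, given the timestamp of the
// last Particle.publish(). Unsigned subtraction keeps the result correct
// when the 32-bit millisecond counter rolls over.
uint32_t cooldownRemainingMs(uint32_t nowMs, uint32_t lastPublishMs,
                             uint32_t cooldownMs = 250) {
    uint32_t elapsed = nowMs - lastPublishMs;
    return elapsed >= cooldownMs ? 0 : cooldownMs - elapsed;
}

// Hypothetical guard: after the first failed SD call, every later operation
// is skipped until the caller has re-run its SdFat initialization.
class SdGuard {
public:
    // op is any callable returning true on success (e.g. a lambda that
    // wraps one SdFat call).
    template <typename Op>
    bool run(Op op) {
        if (faulted_) {
            return false;  // drop pending operations after an error
        }
        if (!op()) {
            faulted_ = true;  // latch on the first failure
        }
        return !faulted_;
    }
    bool faulted() const { return faulted_; }
    void clearAfterReinit() { faulted_ = false; }

private:
    bool faulted_ = false;
};
```

On the device, the cooldown would be applied as something like delay(cooldownRemainingMs(millis(), lastPublishMs)) right before file.sync(), with lastPublishMs updated after every Particle.publish(). Each SdFat call would be wrapped in guard.run(...), and once guard.faulted() is true the code would re-initialize SdFat and call clearAfterReinit().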

Here is my high level SD code, it may be useful to you:

I can’t recall, but I think I may have tweaked a few files in SDFat to make it play nicely with the Particle platform.
