Possible issues with large SPI DMA transfers

I have a project that uses the SPI DMA capabilities built into DeviceOS (in this case 3.2.0) running on a Boron. Part of the code calls for a DMA transfer of about 26 KB of data (at 32 Mbps), in 16-byte chunks, to the (dedicated) SPI1 port once every minute. The code copies eight 16-bit words to a buffer (doing a byte-swap in the process), then does the SPI DMA transfer with no callback. This is repeated until all 26 KB of data have been transferred.
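Roughly, the transfer loop looks like this (a simplified sketch; sendFrame() and the buffer names are placeholders, not the actual project code):

// Simplified sketch of the chunked transfer (placeholder names, not the real code).
// 'frameData' is the ~26 KB source (16-bit words), sent 8 words (16 bytes) at a time.
static uint16_t chunk[8];

void sendFrame(const uint16_t* frameData, size_t numWords) {
    for (size_t offset = 0; offset < numWords; offset += 8) {
        size_t words = (numWords - offset < 8) ? (numWords - offset) : 8;

        // Byte-swap each 16-bit word into the small DMA buffer
        for (size_t i = 0; i < words; i++) {
            uint16_t w = frameData[offset + i];
            chunk[i] = (w >> 8) | (w << 8);
        }

        // DMA transfer with no callback: blocks until this chunk completes
        SPI1.transfer(chunk, NULL, words * 2, NULL);
    }
}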

Generally, the code works well, but after some unpredictable amount of time the DeviceOS cloud connectivity goes sideways: the device no longer publishes, though the cloud continues to indicate it is connected. Another symptom is that the device no longer responds to USB commands (e.g. particle usb dfu).

I noticed in the DeviceOS SPI code that when no callback is specified, the function will block until the DMA transfer is completed:

void SPIClass::transfer(const void* tx_buffer, void* rx_buffer, size_t length,
                        wiring_spi_dma_transfercomplete_callback_t user_callback)
{
    hal_spi_transfer_dma(_spi, tx_buffer, rx_buffer, length, user_callback);
    if (user_callback == NULL)
    {
        hal_spi_transfer_status_t st;
        do
        {
            hal_spi_transfer_dma_status(_spi, &st);
        } while (st.transfer_ongoing);
    }
}

During a large transfer (segmented into smaller chunks), can this code affect the thread scheduler or other RTOS functionality during the wait in a way that might create the symptoms stated above?

What’s in the code that surrounds the large transfer? Anything that uses a SINGLE_THREADED_BLOCK or disables interrupts would be a red flag. Also, is the code correctly wrapping the transfer in a SPI transaction?

The wait inside the SPI transfer has interrupts enabled so task swaps should still occur. However, the SPI transaction will hold the SPI mutex for the length of the transfer.

The problem with SINGLE_THREADED_BLOCK is that if you obtain the SPI mutex within the block (by using a transaction) while the system thread has already locked SPI, the system will deadlock: the mutex can never be freed because a task swap cannot occur.

The problem with not having a SPI transaction is that the system can attempt to use SPI while it’s already in use and bad things happen.
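For illustration, a minimal sketch of the pattern being described (placeholder names and SPI settings, not the project's actual code):

// Recommended pattern: transaction around the blocking DMA transfer,
// no SINGLE_THREADED_BLOCK.
void sendChunkSafely(const void* txBuf, size_t len) {
    SPI1.beginTransaction(SPISettings(32000000, MSBFIRST, SPI_MODE0)); // takes the SPI1 mutex

    // Blocking DMA transfer (no callback). The SPI1 mutex stays held for the
    // whole transfer, but interrupts remain enabled so task swaps can still occur.
    SPI1.transfer(txBuf, NULL, len, NULL);

    SPI1.endTransaction(); // releases the SPI1 mutex
}

// Anti-pattern, shown only for contrast: the same transaction inside a
// SINGLE_THREADED_BLOCK. If another thread already holds the SPI1 mutex,
// beginTransaction() waits for it, but no task swap can occur to let that
// thread release it, so the device deadlocks.
void sendChunkDeadlockRisk(const void* txBuf, size_t len) {
    SINGLE_THREADED_BLOCK() {
        SPI1.beginTransaction(SPISettings(32000000, MSBFIRST, SPI_MODE0));
        SPI1.transfer(txBuf, NULL, len, NULL);
        SPI1.endTransaction();
    }
}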

Those might not be the problem, but are what came to mind as possibilities.

@rickkas7, the only device on SPI1 is an ST7789 display with no CS (always low). I use the Adafruit_GFX_RK library coupled with an optimized Arduino_ST7789_Fast library that I modified to use SPI DMA transactions.

The code makes use of SPI transactions and has no SINGLE_THREADED_BLOCK anywhere. The code uses an Adafruit_GFX Canvas object to draw a portion of the screen to RAM and then uses an optimized drawImage(), which uses those bursts of 16-byte SPI DMA transfers, to paint the image to the screen. In addition, fonts are primarily rendered using the (optimized) fillRect().

The refresh of screen portions using fonts is done in a round-robin fashion, from loop(), with each portion of the screen being updated every 2 seconds or so, and only if its value has changed.
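In rough form, the round-robin looks something like this (the region count, timing, and the helpers readValueForRegion(), drawRegionToCanvas(), and paintRegion() are illustrative assumptions, not the actual code):

// Hypothetical helpers, stand-ins for the project's actual drawing code
int  readValueForRegion(int region);
void drawRegionToCanvas(int region, int value);  // render into the GFXcanvas in RAM
void paintRegion(int region);                    // drawImage(): bursts of 16-byte SPI DMA

const int NUM_REGIONS = 4;
int nextRegion = 0;
int lastValue[NUM_REGIONS] = {0};
unsigned long lastRefresh = 0;

void loop() {
    if (millis() - lastRefresh >= 500) {         // one region per pass, ~2 s per full cycle
        lastRefresh = millis();

        int v = readValueForRegion(nextRegion);
        if (v != lastValue[nextRegion]) {        // redraw only when the value changed
            lastValue[nextRegion] = v;
            drawRegionToCanvas(nextRegion, v);
            paintRegion(nextRegion);
        }
        nextRegion = (nextRegion + 1) % NUM_REGIONS;
    }
}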

What I find strange is that everything keeps working except that the cloud connection stops and USB stops responding to commands. When I say everything works, I mean:

  • Display updates continue
  • Writes to a microSD on “regular” SPI port continue
  • Reads from an I2C sensor continue
  • Posts to PublishQueueAsyncRK (using retained RAM) continue but are queued instead of published
  • Particle functions and variables stop working, of course.

I’m not even sure how to debug this since it seems to be a DeviceOS “failure”. The device keeps thinking it is still connected (breathing cyan), even after 30 minutes or more.

One oddity I have seen is that even though the publishes (done every minute) are set to WITH_ACK, PublishQueueAsyncRK never seems to indicate a publish failure (with logging set to ALL) with publish failed, will retry in XX ms. I have also not seen any communication error messages. I will do another run and let it go well past the failure to see if these messages come up in the log output.

I’m guessing that the system thread is blocked waiting on a mutex that’s not getting released. That’s why the application thread is still running, but all of the cloud stuff stops. I have no idea which mutex is blocked, but that’s the behavior you’d see.

The device doesn’t indicate offline because the system thread is what updates that state, and if it’s blocked by a mutex, the state won’t update.

@rickkas7, are there any mutexes related to SPI1? I will disable all display updates to see if the problem arises again. Is there a way to poll the state of mutexes in the DeviceOS to see if they are locked or not once the problem occurs?

@rickkas7, is it possible that calling Cellular.RSSI() from a separate thread every 35s might cause an issue?

void rssiFunction(void *param) {
    while (true) {
        os_mutex_lock(RSSImutex);   // block until mutex unlocked in setup()

        // Cellular.RSSI() can disrupt the cloud connection, so wait until
        // the cloud is connected before running it
        if (Particle.connected()) {
            CellularSignal sig = Cellular.RSSI();
            strength = (int)(sig.getStrength() * 10.0);
        }
        else {
            strength = 0;
        }

        os_mutex_unlock(RSSImutex);   // unblock mutex for the next run after the delay

        // Delay so the thread runs every xx secs.
        // Using delay() allows the thread to yield.
        delay(35 * 1000);
    }
    // You must not return from the thread function
}
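For context, the mutex and thread are created along these lines (a sketch of what the snippet above assumes; the actual project code may differ):

os_mutex_t RSSImutex;
volatile int strength = 0;          // written by rssiFunction(), read elsewhere
Thread* rssiThread = NULL;

void rssiFunction(void *param);     // the thread function shown above

void setup() {
    os_mutex_create(&RSSImutex);
    os_mutex_lock(RSSImutex);       // hold the worker back until setup() finishes

    rssiThread = new Thread("rssi", rssiFunction, NULL);

    // ...rest of setup()...

    os_mutex_unlock(RSSImutex);     // let the worker make its first pass
}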

Cellular.RSSI() obtains the modem mutex since it needs to access the cellular modem. This shouldn’t cause a problem, and it should be safe to call Cellular.RSSI() from a worker thread. However, the system thread, the RSSI thread, and possibly also the application thread will all be attempting to lock the modem; while that’s probably not the exact cause, it might still be involved somehow, though I’m not quite sure how.

@rickkas7, in this case it would be the RSSI worker thread, the PublishQueueAsyncRK thread, the system thread, and the user thread. I have configured the code with a Particle.function() to disable different parts of the code for further testing. I am also going to check (again) that the ST7789 library doesn’t overwrite any memory boundaries, though I doubt this is the case since the unit works for up to 48 hours before “failing”.
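Something along these lines (the flag and command names are hypothetical, just to show the shape of the toggle):

bool displayEnabled = true;   // checked before screen refreshes
bool rssiEnabled = true;      // checked before Cellular.RSSI() in the worker thread

// Registered with Particle.function("test", testControl) in setup()
int testControl(String cmd) {
    if      (cmd == "display=off") displayEnabled = false;
    else if (cmd == "display=on")  displayEnabled = true;
    else if (cmd == "rssi=off")    rssiEnabled = false;
    else if (cmd == "rssi=on")     rssiEnabled = true;
    else return -1;
    return 0;
}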

@rickkas7, I bypassed the call to Cellular.RSSI() in the dedicated thread and now I no longer get the failure (in this case, likely a mutex deadlock). So it seems this is the source of the problem. Is there any mechanism (e.g. a mutex) that I can query before calling Cellular.RSSI() to prevent this deadlock? Or is there another method or existing source of data from which I can get the signal strength?

UPDATE: As the devices are fixed (not mobile) and powered externally, there is less need to read the RSSI often. As such, I am now testing a 15 min interval instead of 30 secs to see if the mutex deadlock condition still occurs.

