Photon 2 SPI DMA transfer deadlock - Take 2

Following up on my earlier question, which I had not resolved yet, and its haunting me now.

I get these SPI transfer freezes now and then, and since I could not locate it, I radically simplified the code to a debug project: It intializes LVGL in setup(), there’ an animated widget, the main thread only calls lv_timer_handler() to drive the display update.

I send the data using the DMA transfer function. As in the last post, Periodically, it just does not return.

I noticed that this happens during WiFi connect. I tried setting SYSTEM_MODE(MANUAL); - et voila - it does not hang. When going back to AUTOMATIC, it usually hangs. If I have a tight animation loop (and thus basically constantly transfer data over SPI) at the time the WiFi connects, it is almost guaranteed to hang.

I can reproduce with the following snippet:

/*
 * Project myProject
 * Author: Your Name
 * Date:
 * For comprehensive documentation and examples, please visit:
 * https://docs.particle.io/firmware/best-practices/firmware-template/
 */

// Include Particle Device OS APIs
#include "Particle.h"

// Let Device OS manage the connection to the Particle Cloud
SYSTEM_MODE(AUTOMATIC);

// Run the application and system concurrently in separate threads
SYSTEM_THREAD(ENABLED);

// Show system, cloud connectivity, and application logs over USB
// View logs with CLI using 'particle serial monitor --follow'
SerialLogHandler logHandler(LOG_LEVEL_INFO);

char buffer[15360];

void setup()
{
  waitFor(Serial.isConnected, 5000);

  SPI1.begin();
}

int loop_count = 0;
system_tick_t time_next_log = 0;

void loop()
{
  loop_count++;
  auto now = millis();
  if (now > time_next_log)
  {
    Log.info("Main Heartbeat %d loops per second",
             loop_count);
    time_next_log = ((now / 1000) + 1) * 1000;
    loop_count = 0;
  }

  SPI1.beginTransaction(SPISettings(40 * MHZ, MSBFIRST, SPI_MODE0));
  SPI1.transfer(buffer, nullptr, sizeof(buffer), nullptr);
  SPI1.endTransaction();
}

which produces the following logs - as you can see, the main loop just gets stuck.

0000004232 [app] INFO: Main Heartbeat 1 loops per second
0000005001 [app] INFO: Main Heartbeat 150 loops per second
0000006003 [app] INFO: Main Heartbeat 196 loops per second
0000006011 [ncp.rltk.client] INFO: Try to connect to ssid: schneipfer (70:3a:cb:6b:b1:8d)
0000007004 [app] INFO: Main Heartbeat 196 loops per second
0000008001 [app] INFO: Main Heartbeat 195 loops per second
0000009002 [app] INFO: Main Heartbeat 196 loops per second
0000010466 [app] INFO: Main Heartbeat 152 loops per second
0000010598 [ncp.rltk.client] WARN: Disconnect linkError=00000100 reason=001e: assoc stage, assoc reject (assoc rsp status > 0)
0000011001 [app] INFO: Main Heartbeat 102 loops per second
0000011101 [ncp.rltk.client] INFO: Try to connect to ssid: schneipfer (70:3a:cb:6b:b1:8d)
0000012001 [app] INFO: Main Heartbeat 194 loops per second
0000013004 [app] INFO: Main Heartbeat 196 loops per second
0000014002 [app] INFO: Main Heartbeat 195 loops per second
0000015001 [app] INFO: Main Heartbeat 195 loops per second
0000015093 [net.ifapi] INFO: Netif wl3 link UP, profile=schneipfer
0000015103 [system.nm] INFO: State changed: IFACE_UP -> IFACE_LINK_UP
0000015140 [hal] INFO: DNS server list changed
0000015143 [system.nm] INFO: State changed: IFACE_LINK_UP -> IP_CONFIGURED
0000015168 [system] INFO: Cloud: connecting
0000015189 [system] INFO: Read Server Address = type:1,domain:$id.udp.particle.io
0000015223 [system] INFO: Loaded cloud server address and port from session data
0000015258 [system] INFO: Cloud socket=0, connecting to 3.222.143.118#5684 using if 4
0000015297 [system] INFO: Bound cloud socket to lwip if 4 ("wl3")
0000015323 [system] INFO: Cloud socket connected
0000015344 [system] INFO: Starting handshake: presense_announce=0
0000015374 [comm.protocol.handshake] INFO: Establish secure connection
0000015409 [comm.dtls] INFO: session has 0 uses
0000015437 [comm.dtls] INFO: (CMPL,RENEG,NO_SESS,ERR) restoreStatus=0
0000015465 [comm.dtls] INFO: out_ctr 0,1,0,0,0,0,1,61, next_coap_id=133
0000015494 [comm.dtls] INFO: restored session from persisted session data. next_msg_id=307
0000015528 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 2
0000016636 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 4
0000016668 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 3
0000016695 [comm.protocol.handshake] INFO: Sending HELLO message
0000036207 [net.ifapi] INFO: Netif wl3 link DOWN, profile=NONE
0000036209 [system.nm] INFO: State changed: IP_CONFIGURED -> IFACE_LINK_UP

When setting SYSTEM_MODE(MANUAL); it goes on forever.

I suspect that some DeviceOS internals conflict with a open SPI transaction? I’m using 6.3.3 on a Photon 2.

Thanks,

MikeS

Device OS does not use SPI1 on P2/Photon 2, and even if it did, the beginTransaction/endTransaction pair should prevent simultaneous access to the SPI1 peripheral through the SPI1 transaction mutex.

If you are doing a very large number of SPI transactions I would consider restructuring your code like this, which may or may not help.

Move the entire SPI transaction into a separate worker thread. You can trigger it from some sort of queue or global flag.

Use the SPI1.transfer call with a callback function, but the callback function shouldn't do anything other than release a mutex. The completion callback is an ISR and there are many limitations on what you can do.

After the transfer call, block on the mutex. The reason for this is that it will halt the thread but immediately relinquish the thread time slice, which will be more efficient than the nullptr for callback which busy waits until the transaction is complete. This is especially important if you are doing large SPI transfers.

In any case, I can't be sure this will help at all, but what you're seeing isn't a known problem.

Thanks for the answer - I’ll give it a try. Before, I want to ask a few clarifying questions:

The API use in my example is in line with the documentation / intented usage, right? I do understand that the busy wait underneath is not what I want, but I also would have expected that the DeviceOS uses a mutex internally to block.

In my actual code, I want to tell LVGL with lv_display_flush_ready that the transfer is done, so that would be perfect from within that callback (docs). However, I also would need to set the CS signal high, but the particle documentation explicitly says this is not allowed either

You should not set the CS pin high from the callback ISR. The callback is called when the DMA transaction is completed, which can be before the data is actually transmitted out of the SPI FIFO.

So, the documentation / use of the SPI API seems to have all the proper pieces, but I cannot use them easily:

  • If I use the null callback version (I did that on an extra thread), DeviceOS busy waits and seems to cause problems with the rest of the OS, and eventually hangs. This post from another user seems to hint at the same.
  • If I use a callback, I have no guarantee when the SPI transfer is actually complete - I woud have to delay for some unknown buffer amount to be transmitted before I can set CS high again.

Since I also need to do the endTransaction after the callback fired, but not from within the callback, do I understand correctly that the recommended implementation for the user is to start a thread AND use the callback? The thread's sole purpose would be to trigger the transfer, then call endTransaction when the done signal is received via a mutex from the callback.

I see if it actually fixes the issue, but I’m surprised that the API does not take away the burden from me. Is my understanding above correct?

Thanks,
MikeS

The way you used it should work. I just suggested this to narrow down the problem.

Basically, the purpose is to move the busy wait from inside SPI.transfer to your own code, so you could possible instrument that in the next phase of debugging,

You can’t release the transaction from the callback, but if you just set a flag from the callback and service it from your code, in hopes of figuring out more precisely where it’s getting stuck.

I’ve tried that suggestion, to no avail. This is what I ended doing in a separate thread that is only there to flush display frames: (the rest of the code can be found here)

 spi_interface_.transfer(request.px_map, NULL, len, [] {
    // Signal DMA complete callback and semaphore, rather than NULL callback:
    // The null callback will busy way, burning through precious cycles.
    os_semaphore_give(Display::instance_->dma_complete_semaphore_, false);
  });
  
  os_semaphore_take(dma_complete_semaphore_, CONCURRENT_WAIT_FOREVER, false);

In a desperate attempt to debug, I added counters all over the code, and I can confirm that I always call transfer, but wont get the callback anymore when the WiFi starts connecting.

Is this something the particle team can look into?

Thanks,
MikeS

1 Like

If it hangs, cancelling the transfer with SPI.transferCancel() seems to work.See this commit.

This of course is only a workaround, but I I guess i can re-try the transmit, and I’ll be able to guess the timeout pretty well, based on transfer speed and sice. In that last commit, I added a count on how often I have to cancel, and its shockingly frequent: .2% of all SPI transfers failed.

I would love to see this fixed, burned quite some time on this.

1 Like

I’ve not used the SPI on the Photon 2 but I have seen similar delays in user code when wifi is connecting, even on high priority threads. I’ve not looked into it as much as you have though.