Boron 404X using SPI Slave Mode - Unexpected Rebooting

Afternoon All,

We're running an IoT solution that uses a Boron 404X as a general supervisor, and it brings a Raspberry Pi 4 online to do some special purpose tasks throughout the day.

The architecture we use has the Boron configured as an SPI Slave, with the Raspberry Pi having access to a selection of functions over that interface.

It's been working fine for a couple of years now, but I've just discovered a really serious issue. When I power up the Pi, it reaches a stage in the process where there's a brief strobe on the Boron's CS line, along with a falling edge on the SCK line. As soon as this glitch happens, the Boron freezes absolutely solid. The status light that was breathing cyan even locks at whatever level it was at when the glitch occurred. The Boron stays in this jammed state until my application launches on the Pi, and it manipulates the CS and SCK lines to make a normal transaction.

The issue that I've found today is that after some amount of time in the frozen state (it doesn't seem to be deterministic or repeatable, but roughly 5-15 seconds), the Boron will reset with a white flash on the status LED. No red flashes at all, nothing printed on the USB serial, just a straight reset. I don't even get a reset reason publish when it reconnects to the cloud, just the spark/status offline and spark/status online publishes.

Previously the Pi would be up and running, and would presumably make an SPI hit fast enough that this Boron reset would never occur, but it's happening more and more often on our systems, possibly correlated with slightly slower Pi boots as I bring more features online.

This one's got me pulling my hair out at the moment, as the hardware is all out in the field. It's been running fine for the last couple of years, with only the occasional reset, but now the systems are dropping data 3-4 times a week on this issue.

Any thoughts on:

  1. What the cause may be on the Particle side?
  2. What I can do to remediate it remotely?

Details:

  • Boron 404X running Device OS 5.3.0
  • Boron WDT is enabled, on a 2-minute timeout, and getting poked regularly.
  • Power rails are all 100% stable (Boron's backed by on-board LiPo, and 5V is derived from 130 Ah of LiFe cells)

P.S. I am also looking into whether I can make the Pi not generate these glitches on the SPI interface, but it looks like it would require major OS-level work to achieve anything, which I can't really do on fielded devices.

Do you have logs from the Boron when this happens?
What is the duration of the CS/SCK glitches?
How long are your SPI traces?

5.3.1 has a WDT fix that might be useful, though it won't address this issue: "nRF52: watchdog timeout is not accurate" by XuGuohui (Pull Request #2635, particle-iot/device-os on GitHub).

Hi Chris:

Thanks for the follow-up questions.

I don't have any logs at all from the Boron that indicate the failure. I've added a regular heartbeat Log.info() message in loop() that goes off every 3 seconds, and those stop coming while the Boron's frozen. It also stops responding to pings from the cloud side. Unfortunately, there's no logging from my SPI wrapper class: most of it runs at ISR level, so I can't use Log.info() or anything similar in there without getting red-flash reboots. There's a service method that gets called from loop(), but that freezes up just like the heartbeats.
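
One common way around the "no logging from an ISR" limitation is to have the ISR push small event codes into a lock-free ring buffer and let loop() drain and print them later. The following is only a minimal sketch of that idea in portable C++, not the actual SPI wrapper from this thread: the `SpiEvent` codes, the `IsrEventLog` class, and the caller-supplied microsecond timestamp are all hypothetical names introduced for illustration. It also cannot report anything while loop() itself is stalled; events would only surface on the occasions where the device recovers without resetting.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical event codes (illustrative only, not from the original firmware).
enum class SpiEvent : uint8_t { CsFalling = 1, CsRising = 2, TransferDone = 3 };

// Single-producer (ISR) / single-consumer (loop()) ring buffer.
// push() does no allocation, locking, or string formatting.
template <size_t N>
class IsrEventLog {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");

public:
    // Call from the ISR. Returns false and drops the event if the buffer is full.
    bool push(SpiEvent ev, uint32_t timestampUs) {
        const size_t head = head_.load(std::memory_order_relaxed);
        const size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) {
            return false;  // full: never block inside an interrupt
        }
        slots_[head & (N - 1)] = Entry{timestampUs, ev};
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Call from loop(). Returns false when no events are pending.
    bool pop(SpiEvent& ev, uint32_t& timestampUs) {
        const size_t tail = tail_.load(std::memory_order_relaxed);
        const size_t head = head_.load(std::memory_order_acquire);
        if (tail == head) {
            return false;
        }
        const Entry e = slots_[tail & (N - 1)];
        ev = e.ev;
        timestampUs = e.timestampUs;
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    struct Entry {
        uint32_t timestampUs;
        SpiEvent ev;
    };
    Entry slots_[N] = {};
    std::atomic<size_t> head_{0};  // written only by the producer
    std::atomic<size_t> tail_{0};  // written only by the consumer
};
```

In use, the chip-select and transfer-complete handlers would call `push()` with an event code and a timestamp, and the service method called from loop() would call `pop()` in a loop and pass each entry to the normal logger, where formatting is safe.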

On the occasions where it doesn't reset during this little hang-up, it basically behaves as if it had been spinning in an ISR or something. Based on the behaviours I think the clocks are all still running while it's 'frozen', but I'm not running a JTAG debugger so I can't be sure.

The CS glitch is about 1.6 ms long. The SCK line goes low at the same time as CS, but stays low until the Pi initiates its first SPI transaction. This can be up to 20 seconds depending on how long the Pi takes to boot and do some initial housekeeping.

The SPI traces are ~50 mm long, and I'm only clocking at 250 kHz when it's running, since I don't need it to be fast. I've scoped both ends and I'm not seeing any skew or noise going on with them. I'm not running them 5 m parallel to a 20 A motor power feed or anything silly like that. :sweat_smile:
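
To put the numbers above in context, here is a small worked calculation in C++. It uses only the 1.6 ms CS width and the 250 kHz clock quoted above; the function names and the "fewer than eight clock edges" heuristic are assumptions made for illustration, not something the Boron's SPI peripheral or Device OS is known to do.

```cpp
#include <cstdint>

constexpr uint32_t kSckHz = 250000;   // normal clock rate quoted in the post
constexpr uint32_t kBitsPerByte = 8;

// Time needed to clock `bytes` bytes at `sckHz`, in microseconds.
constexpr uint32_t transferTimeUs(uint32_t bytes, uint32_t sckHz = kSckHz) {
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(bytes) * kBitsPerByte * 1000000ULL) / sckHz);
}

// Whole bytes that could be clocked while CS is held low for `csLowUs`.
constexpr uint32_t bytesThatFit(uint32_t csLowUs, uint32_t sckHz = kSckHz) {
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(csLowUs) * sckHz) / (kBitsPerByte * 1000000ULL));
}

// Hypothetical heuristic: a CS strobe that saw fewer than eight SCK falling
// edges cannot have carried a complete byte.
constexpr bool looksSpurious(uint32_t sckFallingEdges) {
    return sckFallingEdges < kBitsPerByte;
}

static_assert(transferTimeUs(1) == 32, "one byte takes 32 us at 250 kHz");
static_assert(bytesThatFit(1600) == 50, "a 1.6 ms window could hold 50 bytes");
static_assert(looksSpurious(1), "a single falling edge is not a full byte");
```

The point of the arithmetic is that the glitch is not short relative to a real transfer: 1.6 ms is long enough for 50 bytes at the normal clock rate, so pulse width alone would not separate it from a genuine transaction, whereas the single SCK edge would.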

-- Dave W