Chasing a HardFault... into FreeRTOS list.c?

I've been chasing a HardFault for some time. I recently set up hardware on my bench and have caught it twice, as described below.

My program is currently halted within Particle's prvGetRegistersFromStack(), so I welcome any input/ideas, especially from the FW gurus at Particle. Happy to have discussions here or elsewhere.

TLDR findings:

  • Hit the breakpoint on HardFault_Handler(): the fault was indicated in app_thread, but the IMPRECISERR bit of the BusFault Status Register was set (so LR and PC don't indicate the exact fault location). The backtrace reported the fault down in FreeRTOS code (taken with a grain of salt because IMPRECISERR=1).
  • I set the DISDEFWBUF bit of the ACTLR register (disables write buffering, which should force most IMPRECISE errors to become PRECISE) and reset the system.
  • 24h later: Hit breakpoint HardFault_Handler(): same region of app_thread still showing fault in FreeRTOS code, but this time with PRECISERR and BFARVALID, so I trust the backtrace more.
  • Last frame of my code called delay(1).
  • Debugger indicates the fault occurred in uxListRemove( ListItem_t * const pxItemToRemove ) of list.c
  • The call stack for one of my threads is incorrect when halted on the breakpoint, and appears as a duplicate of the app_thread. When I pause program execution manually, it shows up as expected.

Background:

  • Borons, DeviceOS v6.2.1
  • The fault always hits eventually (sometimes after 5 minutes, sometimes after 4 days), and it occurs on every Boron running my recent firmware.
  • I have four user threads:
    • app_thread (processes Particle events, measures stuff (I2C ADCs), controls stuff (I2C GPIO expanders))
    • PublishQueueAsyncRK (@rickkas7's async pub event queue lib)
    • ThreadedSDInter (uses os_queue to do async SD card writes; a rough sketch of the pattern follows this list)
    • appwdt (set with 1024 byte stack, dropped in long ago as a debugging aid along with the HW WDT)
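
For context, ThreadedSDInter follows the usual os_queue producer/consumer pattern. A rough sketch (not my actual code; the struct, field names, and queue depth are made up for illustration):

#include "Particle.h"

// Illustrative request type; names and sizes are placeholders
struct SdWriteRequest {
  char path[32];
  char data[128];
  size_t len;
};

os_queue_t sdQueue = nullptr;

void sdWriterThread(void) {
  SdWriteRequest req;
  while (true) {
    // Block until app_thread posts a request, then do the slow SD write here
    if (os_queue_take(sdQueue, &req, CONCURRENT_WAIT_FOREVER, nullptr) == 0) {
      // ... open req.path, write req.len bytes from req.data, close ...
    }
  }
}

void startSdWriter() {
  os_queue_create(&sdQueue, sizeof(SdWriteRequest), 8 /* depth */, nullptr);
  new Thread("ThreadedSDInter", sdWriterThread);
}

// Producer side, called from app_thread:
//   SdWriteRequest req = { /* ... */ };
//   os_queue_put(sdQueue, &req, 0, nullptr);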

First debugger catch (yesterday)

  • Set a breakpoint on HardFault_Handler, caught it ~19.5h later
  • Backtrace is below
  • Info from the Configurable Fault Status Registers
// Entire CFSR
print/x *(uint32_t *)0xE000ED28
$15 = 0x400

// OK... so flag is in BusFault Status Register
print/x *(uint8_t *)0xE000ED29
$17 = 0x4
  • This is the IMPRECISERR flag of the BFSR, indicating the hardware can't tell exactly where the fault occurred (likely because it happened on a buffered store, which completes asynchronously from program execution). See the excellent Memfault blog post on debugging these errors. (There's a small snippet decoding these bits after this list.)
  • So the code likely advanced, but not too far, before it was halted, so we're in the neighborhood.
  • Checked stack usage of app_thread with the xRTOS viewer (I assume this can be trusted?), seems fine: 1140 bytes used, 3440 bytes peak
  • In previous efforts I instrumented (and ruled out) heap problems, even though I avoid dynamic allocation like the plague (I'll admit to rare usage of std::string). Devices typically have ~47k free, with the largest contiguous block at 46.6k after running 17h, so a heap problem seems unlikely.
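
Here's that decode of the raw values above; the register addresses and bit positions are from the ARMv7-M architecture reference:

// CFSR lives at 0xE000ED28; the byte at offset 1 is the BusFault Status Register (BFSR)
uint32_t cfsr = *(volatile uint32_t*)0xE000ED28;
uint8_t bfsr = (cfsr >> 8) & 0xFF;

bool impreciseErr = bfsr & (1u << 2); // IMPRECISERR: a buffered store faulted, stacked PC is not exact
bool preciseErr   = bfsr & (1u << 1); // PRECISERR: stacked PC points at the faulting instruction
bool bfarValid    = bfsr & (1u << 7); // BFARVALID: BFAR holds the faulting data address
uint32_t bfar     = *(volatile uint32_t*)0xE000ED38; // BusFault Address Register (meaningful only if BFARVALID)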

Backtrace from first catch (some function/filenames changed/redacted):

bt
#0  prvGetRegistersFromStack (pulFaultStackAddress=0x2003eef8) at ../../../hal/./src/nRF52840/core_hal.c:123
#1  <signal handler called>
#2  uxListRemove (pxItemToRemove=pxItemToRemove@entry=0x20027bf0) at ./freertos/FreeRTOS/Source/list.c:183
#3  0x00096b5e in xTaskIncrementTick () at ./freertos/FreeRTOS/Source/tasks.c:2883
#4  0x0007d16c in SysTick_Handler () at ../../../third_party/nrf5_sdk/./nrf5_sdk/external/freertos/portable/CMSIS/nrf52/port_cmsis_systick.c:75
#5  0x00049fde in SysTickChain () at ../../../hal/./src/nRF52840/core_hal.c:219
#6  <signal handler called>
#7  nrf_rtc_event_pending (event=NRF_RTC_EVENT_OVERFLOW, p_reg=0x40024000) at ../../../hal/../third_party/nrf5_sdk/nrf5_sdk/modules/nrfx/hal/nrf_rtc.h:353
#8  (anonymous namespace)::getOverflowCounter () at ../../../hal/./src/nRF52840/timer_hal.cpp:143
#9  (anonymous namespace)::getOverflowCounter () at ../../../hal/./src/nRF52840/timer_hal.cpp:135
#10 0x0005bb6c in (anonymous namespace)::getCurrentTimeWithTicks (baseOffset=<synthetic pointer>, micros=<synthetic pointer>, ticks=<synthetic pointer>) at ../../../hal/./src/nRF52840/timer_hal.cpp:201
#11 hal_timer_micros (reserved=reserved@entry=0x0) at ../../../hal/./src/nRF52840/timer_hal.cpp:350
#12 0x0005bc64 in HAL_Timer_Get_Micro_Seconds () at ../../../hal/./src/nRF52840/timer_hal.cpp:361
#13 0x0005f376 in hal_i2c_request_ex (i2c=<optimized out>, config=0x20016a38, reserved=<optimized out>) at ../../../hal/./src/nRF52840/i2c_hal.cpp:501
#14 0x000c7bd8 in TwoWire::requestFrom (this=<optimized out>, transfer=...) at src/spark_wiring_i2c.cpp:96
#15 0x000c7c00 in TwoWire::requestFrom (this=<optimized out>, address=<optimized out>, quantity=quantity@entry=1, sendStop=sendStop@entry=1 '\001') at ./inc/spark_wiring_i2c.h:63
#16 0x000c7c0c in TwoWire::requestFrom (this=<optimized out>, address=<optimized out>, quantity=quantity@entry=1) at src/spark_wiring_i2c.cpp:91
#17 0x000be7cc in PCA9538::readI2CBuffer (this=this@entry=0x2003a4f0 <io>, dest=dest@entry=0x20016adf "", startRegister=<optimized out>, len=len@entry=1) at //////lib/pca9538/src/PCA9538.cpp:274
#18 0x000be8d2 in PCA9538::readI2CRegister (this=this@entry=0x2003a4f0 <io>, dest=dest@entry=0x20016adf "", registerAddress=registerAddress@entry=PCA9538_REGISTER_OUTPUT_PORT) at //////lib/pca9538/src/PCA9538.cpp:307
#19 0x000be9ee in PCA9538::write (this=this@entry=0x2003a4f0 <io>, pin=pin@entry=3 '\003', value=value@entry=1 '\001') at //////lib/pca9538/src/PCA9538.cpp:120
#20 0x000b8ad2 in bsp::PetWDT () at /home//////src/bsp.cpp:206
#21 bsp::PetHardwareWDT () at ///////src/bsp.cpp:198
#22 0x000c0e30 in std::function<void ()>::operator()() const (this=0x20027e1c) at /home/frost-lab1/.particle/toolchains/gcc-arm/10.2.1/arm-none-eabi/include/c++/10.2.1/bits/std_function.h:622
#23 methaniel::UartDriver::pet_wdt (this=this@entry=0x20027e00) at //////lib/methaniel_comms/src/methaniel_comms.cpp:428
#24 0x000c0ecc in methaniel::UartDriver::receive_response (buffer_size=<optimized out>, timeout_ms=1500, buffer=0x20028328 "\020", this=0x20027e00) at //////lib/methaniel_comms/src/methaniel_comms.cpp:353

I re-set the trap, but this time with write buffering disabled (per the Memfault blog) in hopes of getting closer to the bug:

void setup() {
  // ...
  // ACTLR (Auxiliary Control Register) on the Cortex-M4 is at 0xE000E008
  volatile uint32_t* actlr = (volatile uint32_t*)0xE000E008;
  // Set the DISDEFWBUF bit (bit 1) to disable the write buffer
  *actlr |= 0x02;
  // ...
}
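
If the CMSIS headers happen to be reachable from the application build (I haven't verified that under DeviceOS), the same thing can be written with the named constants for the Cortex-M4:

// Equivalent form using CMSIS names (assumes core_cm4.h is available)
SCnSCB->ACTLR |= SCnSCB_ACTLR_DISDEFWBUF_Msk;  // disable the default write buffer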

Second catch (this AM)

print/x *(uint32_t *)0xE000ED28
$1 = 0x8200

print/x *(uint8_t *)0xE000ED29
$3 = 0x82

Nice, this means that both the

  • PRECISERR bit and
  • BFARVALID bit

are set, so we should have caught the error, and the stacked $LR and $PC (from the exception frame on $psp) should more accurately capture the location of the fault:
p/a *(uint32_t[8] *)$psp
$18 = {
0x20027bf8, 
0xcc08080a, 
0x6c2edb4, 
0x1d3d11e3, 
0xfa000000, 
0x96b5f <xTaskIncrementTick+174>, 
0x95bd4 <uxListRemove+6>, 
0x21000000
}
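
For reference, those eight words are the hardware-stacked Cortex-M exception frame; the struct name below is just for illustration, but the layout is fixed by the architecture:

// Frame pushed by the core onto the process stack ($psp) on exception entry
struct StackedExceptionFrame {
  uint32_t r0;    // 0x20027bf8
  uint32_t r1;    // 0xcc08080a
  uint32_t r2;    // 0x06c2edb4
  uint32_t r3;    // 0x1d3d11e3
  uint32_t r12;   // 0xfa000000
  uint32_t lr;    // 0x96b5f  -> return address, xTaskIncrementTick+174
  uint32_t pc;    // 0x95bd4  -> faulting instruction, uxListRemove+6 (trustworthy here because PRECISERR is set)
  uint32_t xpsr;  // 0x21000000
};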

This matches the call stack indicated in the left-hand side of the debugger:

If we trust the output of the debugger and the resulting backtrace, the fault occurred when I called delay(1) in my watchdog service code, which looks like this:

void bsp::PetHardwareWDT() {
  if (enableWdt) {
    bsp::GetIoExpander().write(PIN_GPIO1_EXP, HIGH);
    delay(1);
    bsp::GetIoExpander().write(PIN_GPIO1_EXP, LOW);
  }
}

Here are some screenshots showing the execution path after the call to delay(1):






Looking at the local variables at the offending location, I see this:

If you're still with me, I will also note that the backtrace for one of my other threads is totally buggered up.


Note that the name of the thread is incorrect, the address of the thread pointer is wrong, and the thread's stack boundaries are nonsense.

Here's how it should appear (paused program execution manually, no fault condition):

Next steps
I am at the boundary of my debugging/GDB/FreeRTOS knowledge, and welcome any/all ideas.

  • Should I believe that the hard fault is in fact occurring down in FreeRTOS as the debugger indicates? This would be surprising to me, as I naturally assume it would be on my side of the fence.
  • What can I examine now, while things are halted, to gain additional information?
  • The fact that one of my other threads doesn't have correct stack info is suspicious, but I lack the intuition to say whether this is a red herring.
  • I could disable that thread altogether, but because this bug is infrequent and unpredictable I would want to wait many times (3-10x, maybe?) the longest uptime I've observed before having confidence that the bug resides in that thread.

Given the corrupted data for another thread, and the location where you faulted deep inside FreeRTOS, the most likely reason is that the heap was corrupted. Thus the bug is somewhere else, typically where you have a buffer overrun, an underrun, use of a freed memory block, or an invalid pointer.
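
To make that concrete, even something this small (a contrived example, not from your code) silently writes past the end of its block, into allocator metadata or a neighboring allocation, which could easily be another thread's TCB or a kernel list node:

#include <stdlib.h>

void overrunExample() {
  char* buf = (char*)malloc(16);
  for (int i = 0; i <= 16; i++) {  // note <= : writes 17 bytes into a 16-byte block
    buf[i] = 0;                    // buf[16] lands outside the allocation
  }
  free(buf);
}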

Thank you for the sanity check, @rickkas7. I agree it seems highly unlikely the fault would actually originate in one of the most fundamental (and most-exercised) regions of the kernel, but I lack significant experience debugging multi-threaded RTOSes, so I didn't know how much to trust GDB when it halted there, yet in the context of my main task.

I'll dig into the expected memory region for the apparently corrupted task (not statically allocated, unfortunately), given it looks like the Task Control Block is corrupted. Thinking out loud: I would posit that the fault occurred on a switch to/from that task, and operating on the corrupted TCB triggered the hard fault.

If the bug is reproducible, for memory corruption issues I'd recommend taking a "divide and conquer" approach: disable or comment out half the code to see if the problem remains, and narrow down the problem region from there. Good luck!

Indeed, if the bug occurred more frequently, I would have opted to trim the possibility tree by commenting out blocks/modules branch by branch. Unfortunately sometimes it takes 5 days to crop up, and I only have one piece of hardware on the bench to run experiments on, so I can't run 10 different treatments in parallel.

Currently I'm looking at heap-allocated objects near the above-mentioned clobbered TCB and checking for corresponding over/under-writes.

Hey folks

I would recommend upgrading to 6.3.3 (available in early release) and giving this another check. It contains a few memory-corruption fixes as well as the ability to report the crash mini-stack back to the platform (as vitals).

100% agree on the memory corruption issue. I hate to suggest 'increase the stack size' on the threads that look unusual, but it's usually my first port of call to eliminate things. I believe 6.3.3 has better heap/stack protection as well (start/end markers) that asserts automatically if corruption is detected.
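
For what it's worth, if those threads are created with the Wiring Thread class, the stack size is just the fourth constructor argument; something like this (the name, function, and 8 kB figure are placeholders):

// Sketch: create a user thread with an explicit, larger stack
void myThreadFunction(void) {
  while (true) {
    // ... thread work ...
    delay(10);
  }
}

Thread* myThread = new Thread("my_thread", myThreadFunction,
                              OS_THREAD_PRIORITY_DEFAULT, 8 * 1024 /* stack bytes */);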

Thanks

Nick

Thanks for your thoughts, @mrlambchop.

I tried setting a hardware watchpoint on the memory location that holds the task name within the Task Control Block (watch *(unsigned char*)0x2002...), but after 6 days the device faulted again and was caught on the HardFault_Handler() breakpoint, not on a change in the memory contents of the clobbered TCB. Alas.

I'm embracing the try-everything approach at this point rather than searching for a smoking gun. I've upgraded my bench DUT to v6.3.3, increased the stack size by 50% on the suspected thread (still seems unlikely, as I see an ~25% stack high-water mark after many days of uptime), and upgraded the 3rd-party lib used in that thread (SdFat by Bill Greiman) to the latest release.