Loop() hangs after several hours. Please give debugging suggestions!

For months now, something in my code has caused loop() to hang a few times a day. I use a watchdog to reboot the system when this happens, but I’d love to properly fix the bug. I have plenty of debug Serial prints but don’t see an obvious pattern or culprit. What is the best practice here? Is there a way to use the watchdog to track which function is stuck in a loop, or which one was called last, or whether there’s a memory issue?

My code is 6000+ lines and employs several libraries, so I don’t expect anyone to examine it, but it’s up on GitHub if you’re curious.

While debugging this, I enabled the system thread, but that has had no effect besides allowing the device to maintain its connection to the cloud.

I’ve read that Strings can cause memory problems, and my code makes extensive use of them. Is converting as many of these as possible to char arrays the first order of business?
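
For illustration, here is a minimal sketch of what such a conversion might look like. The `formatReading` helper and its arguments are hypothetical and not taken from the actual project. It formats into a caller-supplied fixed buffer with `snprintf` instead of concatenating String objects, so building the message does not allocate from the heap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper (not from the original project): formats "name=value"
// into a caller-supplied buffer instead of building it from String pieces.
// Returns true if the whole message fit, false if it was truncated.
bool formatReading(char* out, std::size_t outSize, const char* name, int value) {
    if (out == nullptr || outSize == 0) {
        return false;
    }
    // snprintf writes at most outSize bytes, including the terminating '\0'.
    int written = std::snprintf(out, outSize, "%s=%d", name, value);
    return written >= 0 && static_cast<std::size_t>(written) < outSize;
}
```

The return value makes truncation visible, which is the main thing to watch for when replacing a String that used to grow on demand.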

I have noticed that disabling the Webserver library and corresponding code appears to greatly diminish the frequency of the problem (if not remove it entirely), but I need to research this more. I don’t know if this simply frees up memory or actually keeps an offending process from running.

I would say so.
But what exactly does the device do when it’s “blocking”?
Have you been able to locate where loop() hangs?
Have you tried monitoring System.freeMemory()?

I would try something like this.
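
A minimal sketch of the idea, written as plain C++ so it can be tried off-device. The step functions (`readSensors`, `readSdCard`, `serveWebRequests`) and the `checkpoint` helper are hypothetical stand-ins, not names from the actual project. On the device, `Serial.printf` would presumably take the place of `std::printf`.

```cpp
#include <cassert>
#include <cstdio>

// Last checkpoint reached. Assumption: on the device this could live in
// memory that survives a watchdog reset, so it can be reported after reboot.
static volatile int lastCheckpoint = 0;

// Record and print a checkpoint. Flushing makes sure the line is out
// before the next step gets a chance to hang.
void checkpoint(int id, const char* label) {
    lastCheckpoint = id;
    std::printf("checkpoint %d: %s\n", id, label);
    std::fflush(stdout);
}

// Hypothetical stand-ins for the real work done in loop().
void readSensors() {}
void readSdCard() {}
void serveWebRequests() {}

// One pass of the main loop with a checkpoint before each step. If the
// device hangs, the last line printed names the step that never returned.
void loopOnce() {
    checkpoint(1, "readSensors");
    readSensors();
    checkpoint(2, "readSdCard");
    readSdCard();
    checkpoint(3, "serveWebRequests");
    serveWebRequests();
    checkpoint(4, "loop done");
}
```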





This way, just look at the printf output when it hangs and you will see which one didn’t print.

Thanks guys!

That’s a good question, and exactly what I’m trying to find out. I have tried monitoring System.freeMemory() and it doesn’t appear to change much from its ~50k free state.
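
Because a typical reading can hide brief dips, one option is to track only the lowest value seen. This is a minimal sketch, assuming a hypothetical `FreeMemoryTracker` that is fed readings from System.freeMemory() on each pass of loop(). It is written as plain C++, so the reading is passed in as a parameter.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: tracks the lowest free-memory reading seen so far.
// A slow leak or a short spike in usage may show up as a falling minimum
// even when the typical reading looks stable.
struct FreeMemoryTracker {
    std::uint32_t lowest = UINT32_MAX;

    // Feed one reading (on the device, the value of System.freeMemory()).
    // Returns true and logs when this reading is a new minimum.
    bool update(std::uint32_t freeBytes) {
        if (freeBytes < lowest) {
            lowest = freeBytes;
            std::printf("new free-memory low: %lu bytes\n",
                        static_cast<unsigned long>(freeBytes));
            return true;
        }
        return false;
    }
};
```

Logging only new minimums keeps the serial output quiet, so a sudden drop just before a hang is easier to spot.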

Cool, this is exactly what I did after posting this thread, and I managed to observe two hangs in the same vicinity. Both appeared near SD reads, so I changed the SdFat initialization from SPI_FULL_SPEED to SPI_HALF_SPEED. So far, so good: it’s been running for about 10 hours, which is not unprecedented but is a good sign. I’m using my own PCB and the components are very close to one another, but perhaps things aren’t quite tight enough for full speed. I’ll follow up once this is more conclusive.

Since you mentioned SD speed possibly influencing your failure rate, check your PSU: do you have ample power available?