Tracking down memory stomp

I could use some guidance on how memory is allocated and where to look for the cause of a memory stomp/overwrite error. I have a large Particle firmware design that’s been out in the field for over a year with multiple users, and a change I introduced recently causes a memory stomp where a variable gets corrupted roughly once every 2-3 days of operation. It’s not clear to me which variable to examine to find where the overwrite is happening, because variables defined globally and sequentially don’t appear to be allocated contiguously in memory.

I have a number of variables defined globally at the top of the program. As an experiment, I wrote past the end of an array to see whether the variables defined immediately before or after it would be overwritten, but they didn’t appear to be.

// near top of the program

char tempBuffer1[10];
char tempBuffer2[10];
char tempBuffer3[10];

// ......

void StartupSequence()
{
  // typical startup sequence stuff, works correctly
  SYSTEM_MODE(SEMI_AUTOMATIC);
  // ..... etc.
}
void setup()
{
  // typical setup sequence including serial port, etc.  works correctly
}

void loop()
{
        sprintf(tempBuffer1, "0123456789");
        sprintf(tempBuffer2,"0123456789");
        sprintf(tempBuffer3,"0123456789");
        
        Serial.printlnf("Running overwrite test");
        
        for (int i = 0; i <15; i++)
        {
            tempBuffer2[i] = i % 256;
        }
        Serial.printlnf("Before temp = %s",tempBuffer1);
        Serial.printlnf("After temp = %s",tempBuffer3);

   //.... loop repeats

}

Neither tempBuffer1 nor tempBuffer3 is affected by writing past the end of tempBuffer2. Their contents remain the same.

Any guidance on how memory is allocated and how to detect that I’ve written past the end of an array?

Thanks…

I’d recommend that you use the safer snprintf() function which will help you avoid the possibility of over-running the end of your char arrays.

There are a number of these safer functions in the C standard library.

Good idea, in the couple of places where I do this type of operation I’ll make that change. The problem is that in much of the code I’m modifying individual array entries (this code is fairly complex), and it is an index that has gone wrong somewhere. I’m hoping to find a way to identify which incorrectly indexed array write might have gone out of bounds. Knowing the memory map of how variables are allocated might help. Evidently it’s more complex than looking at the variable defined before or after… Any ideas how Particle allocates memory locations for arrays?
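One way to catch a bad index at the moment it is used, rather than days later when the damage shows up, is to route suspicious array writes through a checked helper. The sketch below is only an illustration: checkedWrite, CHECKED_WRITE and g_boundsViolations are invented names, and it logs with printf so it can be compiled anywhere (on a device you would presumably log over Serial or publish the counter instead).

```cpp
#include <cstddef>
#include <cstdio>

// Number of rejected writes so far (hypothetical counter for this example).
static unsigned int g_boundsViolations = 0;

// Hypothetical helper: stores value in arr[index] only when index is in range.
// The array size N is deduced by the compiler, so it cannot get out of sync.
template <typename T, std::size_t N, typename U>
bool checkedWrite(T (&arr)[N], long index, U value, const char *file, int line)
{
    if (index < 0 || static_cast<std::size_t>(index) >= N) {
        ++g_boundsViolations;
        std::printf("bounds violation: index %ld, size %u at %s:%d\n",
                    index, static_cast<unsigned int>(N), file, line);
        return false;  // the out-of-range write is skipped
    }
    arr[index] = static_cast<T>(value);
    return true;
}

// Records the source location of the offending write automatically.
#define CHECKED_WRITE(arr, index, value) \
    checkedWrite((arr), (index), (value), __FILE__, __LINE__)
```

Replacing tempBuffer2[i] = i % 256; with CHECKED_WRITE(tempBuffer2, i, i % 256); would report the file and line for every i from 10 upward. This only works on real arrays, not on pointers, because the size has to be known at compile time.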

Your issue seems obvious to me:

Global arrays of size 10:

char tempBuffer1[10];
char tempBuffer2[10];
char tempBuffer3[10];

Here you write 11 bytes to each array:

sprintf(tempBuffer1, "0123456789");
sprintf(tempBuffer2, "0123456789");
sprintf(tempBuffer3, "0123456789");

because sprintf() appends the null terminator and has no problem running off the reservation.

You should look at a C array as merely a pointer; nothing checks the bounds for you…

Sorry, that was just a simple example to illustrate what I was looking for, and you are correct. I just re-ran the example with the strings changed to “012345678” so we don’t get caught up in my example, and the result is the same: overwriting tempBuffer2 doesn’t affect the variable defined before or after it. What I need to understand is how variables are laid out in memory, so that when an array index goes out of bounds I know where the damage would land, as a debugging aid. I appreciate you trying to help; obviously I didn’t explain my issue very well.

Writing past the boundary of any array is always problematic! It doesn’t matter what’s adjacent to your array.

Patient:

Doctor, it hurts when I do this…

Doctor:

Well, don’t do that!

I don’t know what your experience level with these types of problems is, but I’ve found them to rarely be the result of the thing I think is going on. Usually something I’ve changed has shifted the memory layout, resulting in a much older bug making itself known…

If you can get the JTAG header on one of your units, this would be a really good time to get GDB running…

The reason why you don’t see the overwrite happening directly might well be rooted in several factors.

  1. You are dealing with a 32-bit controller, and hence variables are typically located at 4-byte boundaries by default
  2. How far do you overwrite?
  3. The optimizer might play some tricks on you. Variables that won’t ever be changed in your code may well be substituted by their literal representation in flash, not ending up in the variable map at all.

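When you have two char[10] buffers placed one after the other, you typically have a gap of 2 bytes at the end of the first before the second starts. So if you only overshoot by one or two bytes, you won’t see any effect on the neighbor, but the boundary violation still happened. Also keep in mind that the toolchain is free to place globals in a different order than you declared them.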

If you want to get a feeling for the variable allocation, you can always print out a memory map.
If you can build with a local toolchain, you will get that as a by-product. If you are cloud-building, you can just have some test project that calls Serial.printlnf("varX: %08x", (unsigned int)&varX) for each of the variables you are interested in.
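The address-printing idea can be taken one step further by sorting the variables by address, so the actual neighbors and the gaps between them become visible. This is just a sketch with invented names (VarInfo, VAR_INFO, sortByAddress, printMap); it prints with printf so it also compiles on a desktop, and on a device the output call would presumably be swapped for Serial.printlnf.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// One row of the home-made memory map (hypothetical type for this example).
struct VarInfo {
    const char *name;
    std::uintptr_t address;
    std::size_t size;
};

// Captures a variable's name, address and size in one go.
#define VAR_INFO(v) VarInfo{ #v, reinterpret_cast<std::uintptr_t>(&(v)), sizeof(v) }

// Returns the entries ordered by ascending address.
std::vector<VarInfo> sortByAddress(std::vector<VarInfo> vars)
{
    std::sort(vars.begin(), vars.end(),
              [](const VarInfo &a, const VarInfo &b) { return a.address < b.address; });
    return vars;
}

// Prints each entry plus the number of unused bytes before the next one.
void printMap(const std::vector<VarInfo> &sorted)
{
    for (std::size_t i = 0; i < sorted.size(); ++i) {
        std::printf("%-12s addr=0x%08llx size=%u", sorted[i].name,
                    static_cast<unsigned long long>(sorted[i].address),
                    static_cast<unsigned int>(sorted[i].size));
        if (i + 1 < sorted.size()) {
            long long gap = static_cast<long long>(sorted[i + 1].address) -
                            static_cast<long long>(sorted[i].address + sorted[i].size);
            std::printf(" gap_to_next=%lld", gap);
        }
        std::printf("\n");
    }
}
```

Calling printMap(sortByAddress({VAR_INFO(tempBuffer1), VAR_INFO(tempBuffer2), VAR_INFO(tempBuffer3)})); shows which buffer really sits after tempBuffer2 in that particular build. The layout can change with every rebuild, so the map is only valid for the binary that printed it.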

Also, if you happen to use pointer operations, you might not even need one variable to overshoot into another; a wrong pointer (of any origin) may just happen to point to your “victim variable” by chance.
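For overruns out of a specific suspect array, guard bytes ("canaries") are a cheap way to notice the damage soon after it happens. The following is a sketch under my own assumptions, not anything from the Particle firmware: GuardedBuffer and GUARD_BYTE are invented names. Members of one struct are laid out in declaration order, and because everything here is a byte array there is in practice no padding, so even a one-byte overshoot lands in a guard.

```cpp
#include <cstddef>

// Arbitrary marker value, chosen only because it is unlikely to occur by accident.
const unsigned char GUARD_BYTE = 0xA5;

// Hypothetical wrapper that places marker bytes directly before and after
// the payload array.
template <std::size_t N>
struct GuardedBuffer {
    unsigned char guardBefore[4];
    char data[N];
    unsigned char guardAfter[4];

    GuardedBuffer()
    {
        for (std::size_t i = 0; i < 4; ++i) {
            guardBefore[i] = GUARD_BYTE;
            guardAfter[i] = GUARD_BYTE;
        }
        for (std::size_t i = 0; i < N; ++i) {
            data[i] = 0;
        }
    }

    // True while both guards still hold their marker bytes.
    bool intact() const
    {
        for (std::size_t i = 0; i < 4; ++i) {
            if (guardBefore[i] != GUARD_BYTE || guardAfter[i] != GUARD_BYTE) {
                return false;
            }
        }
        return true;
    }
};
```

Declaring GuardedBuffer<10> tempBuffer2; and using tempBuffer2.data in place of the old array lets you call tempBuffer2.intact() once per loop() and report the first time it fails. It will not catch a wild pointer that skips over the guards, and an overrun longer than 4 bytes still damages whatever follows.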


For things that can be separated out, I’ve used the technique in the test directory of the JsonParserGeneratorRK library. I have a minimal set of firmware features, like String and millis() and other things that are commonly used, that can be compiled into a native C++ Linux binary.

Then, under Linux, you can run the binary under valgrind, a utility that can detect even a single byte of buffer overwrite and also detect memory leaks.

It’s great for unit tests, but you can’t run your full firmware that way.
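To give an idea of what such a host-side stand-in can look like, here is a guess at a desktop replacement for millis(). It is not taken from the JsonParserGeneratorRK test directory; the 32-bit return type is my assumption about what the firmware's millis() returns.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical desktop stand-in for the firmware's millis(): milliseconds
// elapsed since the first call, wrapped to 32 bits.
std::uint32_t millis()
{
    using namespace std::chrono;

    // Initialized once, on the first call; acts as the "boot" time.
    static const steady_clock::time_point start = steady_clock::now();

    return static_cast<std::uint32_t>(
        duration_cast<milliseconds>(steady_clock::now() - start).count());
}
```

With stand-ins like this linked in, the code under test builds as an ordinary Linux program and can be run as something like valgrind --leak-check=full ./your_test_binary, where the binary name is whatever your build produces.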


Yep, you are quite astute in your observation. The tricky ones for me have been just as you say. I’m trying to get it to occur locally. It’s happening at 2 remote customer sites right now, and I’ve been using Particle’s excellent infrastructure to post changing variable values out to the cloud, with IFTTT emailing me when they go out of bounds. I’ll investigate GDB… haven’t taken that plunge yet.

Very helpful. As I’ve added more debugging info to track down the problem, it moved and a different variable got corrupted. I didn’t realize an optimizer was in play and your tip of printing out the memory location should do the trick. I’m also going to do a code read and see if I have any wrong pointers that might be the cause.


I didn’t know about your JSON parser and wish I’d had it available last summer. I have a number of places where the current code interacts with a web server, with JSON conversion going on at both ends, and it looks like you have a nice solution. On your test method: when you compiled to native C++ on Linux, did you have to change much in the code before testing, apart from calls like millis()?

My apologies to anyone who’s trying to read this-- next time I’ll put my replies into one place. I thought they would interleave.