Dynamic linking

dan · March 17, 2017, 7:20pm

I noticed you guys have a dynamic linking library that you use for linking firmware. Do you think we could use that library to lazily link in a library that is fetched from the web? We are trying to bring in and out code as needed to get around memory constraints.

zachary · March 24, 2017, 11:03pm

Good question. My impression is that there’s no easy way for you to use this. If you want to do a lot of work, of course, anything’s possible. @mdma Any ideas here?

mdma · March 25, 2017, 12:00pm

It is possible, but full general dynamic linking of arbitrary modules would take a lot of effort.

There are some things you can do to try to simplify things:

- Only use dynamic memory. This avoids the module needing to statically allocate RAM, which isn’t really possible: how would each module get assigned a region of RAM? Using dynamic memory only side-steps this (e.g. implementing the functionality as a class and then doing new MyClass()).
- If it is just to save space, you could use (for example) the region reserved for EEPROM emulation, assuming your application doesn’t use it. You can then duplicate our dynamic linking code and save your own module there. Alternatively, the OTA backup region could also be used, although be cautious in threaded mode, since OTA updates can run while the application is running.
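
A minimal, host-testable sketch of the "dynamic memory only" idea follows. The names Module, Counter, createModule and destroyModule are hypothetical and not part of Particle's firmware; the only point is that every piece of module state lives inside an object created with new, so no RAM region has to be assigned to the module at link time.

```cpp
#include <cassert>

// Hypothetical interface the application would know about at build time.
struct Module {
    virtual ~Module() {}
    virtual int step() = 0;   // do one unit of work and return a status value
};

// Hypothetical module: all of its state is a data member, so it lives on
// the heap inside the object rather than in .data or .bss.
class Counter : public Module {
public:
    explicit Counter(int start) : count_(start) {}
    int step() override { return ++count_; }
private:
    int count_;
};

// Single entry point a module could expose (an assumption of this sketch).
Module *createModule(int start) {
    return new Counter(start);
}

// Releases everything the module allocated.
void destroyModule(Module *m) {
    assert(m != nullptr);
    delete m;
}
```

The application would hold only a Module pointer, call step() as needed, and call destroyModule() before discarding the code. A class with virtual functions still needs its vtable stored somewhere readable, which a real loader would have to account for.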

What other options have you explored to reduce the memory footprint? 128k of code is quite a large application! If you have large data resources, it would be simpler to load these dynamically rather than load code.

I hope that plants some seeds of thought

zachary · March 27, 2017, 1:19am

This isn’t what you originally asked for in this thread @dan, but based on our other conversations, I made this PoC that demonstrates that the fundamental thing you need to accomplish is possible. Consider me officially nerd sniped.

const unsigned char d7on[] = {
    0x40, 0x21, 0x09, 0x02, 0x02, 0x20, 0x01, 0x43,
    0x09, 0x04, 0x20, 0x20, 0x00, 0x02, 0x88, 0x61,
    0xf6, 0xe7 };

void setup() {
    pinMode(D7, OUTPUT);
    delay(10000);
    void *p = malloc(18);
    memcpy(p, d7on, 18);
    goto *p;
}

The d7on array holds machine code for turning on D7 in an infinite loop. It turns on the D7 LED by setting PA13 using the GPIOA_BSRR register. That is, it writes the value 0x00002000 (bit 13) to memory address 0x40020018. Then it branches back to the beginning of that array.
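
To make the byte array easier to follow, here is a host-side sketch that replays the first eight instructions in plain C++. The mnemonics in the comments are a hand decoding of the bytes, not a listing from the original post, and traceD7on and Store are names invented for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// The single memory write performed by the machine code.
struct Store {
    uint32_t address;  // where the final str writes
    uint32_t value;    // what it writes
};

// Replays the instructions encoded in the first 16 bytes of d7on.
Store traceD7on() {
    uint32_t r0 = 0, r1 = 0;
    r1 = 0x40;    // 2140  movs r1, #0x40
    r1 <<= 8;     // 0209  lsls r1, r1, #8   -> 0x00004000
    r0 = 0x02;    // 2002  movs r0, #2
    r1 |= r0;     // 4301  orrs r1, r0       -> 0x00004002
    r1 <<= 16;    // 0409  lsls r1, r1, #16  -> 0x40020000
    r0 = 0x20;    // 2020  movs r0, #0x20
    r0 <<= 8;     // 0200  lsls r0, r0, #8   -> 0x00002000
    Store s;
    s.address = r1 + 0x18;  // 6188  str r0, [r1, #0x18]
    s.value = r0;
    assert(s.address % 4 == 0);  // word stores must be aligned
    return s;
}
```

The constants are built with shifts because a 16-bit movs can only load an 8-bit immediate.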

I expected, after 10 seconds of breathing cyan, to see either (1) D7 turn on if this works or (2) some kind of fault if the heap wasn’t allowed to execute. It turns on D7. (Shortly afterward it also disconnects from the cloud, and the status LED starts breathing green, because the infinite loop never hands control back, unless you add SYSTEM_THREAD(ENABLED).)

zachary · March 29, 2017, 6:34pm

I haven’t verified this, but I believe if you change the following 2 things you can call into the array, then return and free the memory, instead of going into an infinite loop.

Change the final two bytes of the array
- from 0xf6, 0xe7 (an unconditional branch to 20 bytes before the current program counter, which is 4 bytes ahead of the currently executing instruction)
- to 0x70, 0x47 (corresponding to bx lr, branch exchange with the link register, basically a “return” statement)
Instead of goto *p, call it as a subroutine, like p().
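
The branch arithmetic can be checked on a host with a small sketch. It assumes the 16-bit Thumb unconditional branch encoding (top five bits 11100 followed by an 11-bit signed halfword offset); the function names are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assembles two bytes stored little-endian into one halfword.
uint16_t halfword(uint8_t lo, uint8_t hi) {
    return static_cast<uint16_t>(lo | (hi << 8));
}

// Decodes a 16-bit Thumb unconditional branch and returns the byte offset
// it adds to the program counter.
int32_t thumbBranchOffset(uint16_t insn) {
    assert((insn & 0xF800) == 0xE000);      // must be the 11100:imm11 encoding
    int32_t imm11 = insn & 0x07FF;
    if (imm11 & 0x0400) imm11 -= 0x0800;    // sign-extend the 11-bit field
    return imm11 * 2;                       // the field counts halfwords
}

// Offset of the branch target within the array, given the offset of the
// branch instruction itself. The PC reads as instruction address + 4.
int32_t thumbBranchTarget(int32_t insnOffset, uint16_t insn) {
    return insnOffset + 4 + thumbBranchOffset(insn);
}
```

For the array above, the branch sits at byte offset 16, so thumbBranchTarget(16, halfword(0xf6, 0xe7)) lands on offset 0, the start of the array.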

dan · March 30, 2017, 5:04pm

You guys are awesome. So glad I was able to nerd-snipe you @zachary, please look out for big trucks.

So, if I’m following along well enough, I think @zachary has proved that the memory is not execute protected (prerequisite!), and @mdma has suggested where we might be able to store the downloaded module.

Say we picked the first option and we implemented our functionality in a class - did you have an idea how we’d interact with that module at runtime? It will come in as (probably) precompiled binary (probably via pubsub) and we can stick it into a memory location, but how do we link into that memory location and use the functionality within at runtime?

I’m wary of using the OTA region, since I know you guys have done a great job making that rock-solid, and I don’t really want to mess around in there. The EEPROM region is an interesting choice as well, when not being used.

Thanks for the other tips on reducing memory overhead - we’ve started looking into a couple of those (debugging options at compilation, some potential (re)moving of data overhead etc).

ScruffR · March 30, 2017, 6:15pm

The const array lives in flash, so in connection with this

you may have to explicitly check whether this is also possible with a RAM based array or do you intend to push your dynamic firmware into a flash area anyway?

peekay123 · March 30, 2017, 6:37pm

@ScruffR, the STM32 memory map is flat meaning code can be run out of flash or RAM. The big difference is that the flash areas can be locked.

ScruffR · March 30, 2017, 6:50pm

Hmm, I thought I’d read of a fuse to prevent code from running in RAM for safety reasons.
I’d have to dig up the STM programming guides again, but I can’t help the feeling of having read it.

peekay123 · March 30, 2017, 6:58pm

@ScruffR, I believe that’s correct and needs to be considered. Not sure what Particle does with that fuse.

zachary · March 30, 2017, 10:56pm

I don’t know about such a fuse. In the code example above I’m using malloc to pull from the heap, so I’m pretty sure that’s RAM. I’m open to being proven wrong as always.

zachary · March 30, 2017, 10:59pm

If you use pub-sub, keep in mind that the binary must be encoded as ASCII. Try base64, or base85 if you want a denser encoding.
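
As an illustration of the encoding step, here is a small standard base64 encoder that runs on a host. It is a sketch with invented function names, not the encoder Particle or the posters used; any conforming base64 implementation would serve.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Number of characters base64 needs for len input bytes (with padding).
size_t base64EncodedSize(size_t len) {
    return ((len + 2) / 3) * 4;
}

// Encodes len bytes as padded base64 using the standard alphabet.
std::string base64Encode(const uint8_t *data, size_t len) {
    assert(data != nullptr || len == 0);
    static const char table[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(base64EncodedSize(len));
    for (size_t i = 0; i < len; i += 3) {
        // Pack up to three bytes into a 24-bit chunk.
        uint32_t chunk = static_cast<uint32_t>(data[i]) << 16;
        if (i + 1 < len) chunk |= static_cast<uint32_t>(data[i + 1]) << 8;
        if (i + 2 < len) chunk |= data[i + 2];
        // Emit four 6-bit symbols, padding with '=' past the end of input.
        out += table[(chunk >> 18) & 0x3F];
        out += table[(chunk >> 12) & 0x3F];
        out += (i + 1 < len) ? table[(chunk >> 6) & 0x3F] : '=';
        out += (i + 2 < len) ? table[chunk & 0x3F] : '=';
    }
    return out;
}
```

Base64 grows the payload by a third, so the 18-byte array from earlier in the thread becomes 24 characters. The encoded module has to fit within whatever size limit applies to event data.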

ScruffR · March 31, 2017, 3:56am

@zachary, I couldn't find my original source of "wisdom", but this is what I dug up instead
http://www.st.com/content/ccc/resource/technical/document/application_note/group0/bc/2d/f7/bd/fb/3f/48/47/DM00272912/files/DM00272912.pdf/jcr:content/translations/en.DM00272912.pdf

Especially 2.3 XN-Flag in MPU_RASR register

But if you are not setting any of this, all's fine

Sorry, I missed your malloc(); I just looked at the opcode array.

csaba · May 15, 2017, 8:43pm

We have made some progress on this, but we’ve found that the Photon will go into red SOS flashing whenever we try to actually execute a function loaded into memory via new or malloc, followed by memcpy. We can, however, read & write a variable; it is only executing a function that causes the crash. Also, we were able to reproduce @zachary’s proof of concept; however, the changes for adding a “return” statement also cause red SOS. Does anyone have an idea why executing a function specifically could lead to a crash? Is there still some form of memory execute protection enabled? I’ve read about the memory protection unit (detailed in the doc posted by @ScruffR above), but as far as I can tell it appears to not be explicitly enabled in Particle’s firmware.
We are able to produce the desired functionality as a test on x86 and ARM (raspberry pi): specifically, we can compile a binary, encode it via base64, decode it at runtime and load it into memory via new or malloc, and memcpy. Then we can execute a function via a pointer to the loaded memory. In order to avoid a seg fault, we had to use these specific flags on these systems to disable stack protection: -fno-stack-protector -D_FORTIFY_SOURCE=0 -z execstack
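
The decode-at-runtime step described above could look roughly like the following host-testable sketch. It assumes standard padded base64; the function names are invented, and the copy into executable memory is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Maps one base64 character to its 6-bit value, or -1 if it is not valid.
int base64Value(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

// Decodes padded base64 text into out. Returns false on malformed input.
bool base64Decode(const std::string &text, std::vector<uint8_t> &out) {
    out.clear();
    if (text.size() % 4 != 0) return false;
    for (size_t i = 0; i < text.size(); i += 4) {
        int v[4];
        int pad = 0;
        for (int k = 0; k < 4; ++k) {
            char c = text[i + k];
            if (c == '=') {
                // Padding may only fill the last one or two slots of the
                // final group.
                if (i + 4 != text.size() || k < 2) return false;
                v[k] = 0;
                ++pad;
            } else {
                if (pad > 0) return false;   // data after padding
                v[k] = base64Value(c);
                if (v[k] < 0) return false;  // character outside the alphabet
            }
        }
        assert(pad <= 2);
        uint32_t chunk = (v[0] << 18) | (v[1] << 12) | (v[2] << 6) | v[3];
        out.push_back(static_cast<uint8_t>((chunk >> 16) & 0xFF));
        if (pad < 2) out.push_back(static_cast<uint8_t>((chunk >> 8) & 0xFF));
        if (pad < 1) out.push_back(static_cast<uint8_t>(chunk & 0xFF));
    }
    return true;
}
```

Rejecting malformed text before anything is copied matters here, since the decoded bytes are about to be executed.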

pra · May 16, 2017, 12:12am

The ARM Cortex-M3 has an MPU (memory protection unit), and regions that are marked execute-never, either by the default memory map or by an MPU setting, raise a memory protection fault if you attempt to execute code from them. These settings can be changed; see ST Micro’s application note AN4838, Managing Memory Protection Unit.

In the past I have successfully dynamically loaded and executed loadable code modules in external static ram (32Meg) on an Embedded Artists LPC 1788 dev board. Here are the code snippets I used.


mpu_setup.h

#define MPU_TYPE    0xe000ed90                  // Type register
#define MPU_CTRL    0xe000ed94
#define MPU_RNR     0xe000ed98
#define MPU_RBAR    0xe000ed9c
#define MPU_RASR    0xe000eda0
#define MPU_RBAR_A1 0xe000eda4
#define MPU_RBAR_A2 0xe000edac
#define MPU_RBAR_A3 0xe000edb4
#define MPU_RASR_A1 0xe000eda8
#define MPU_RASR_A2 0xe000edb0
#define MPU_RASR_A3 0xe000edb8

// Type Register
#define MPU_IREGION 0x00FF0000
#define MPU_IREGION_SFT 16
#define MPU_DREGION 0x0000FF00
#define MPU_DREGION_SFT 8
#define MPU_SEPERATE 1

// Ctrl Register
#define MPU_ENABLE   1
#define MPU_HFNMIENA 2
#define MPU_PRIVENA  4

//Region Number Register

// Region Base Address register
#define MPU_REGION_VALID 0x010

//Region Base attributes and Size Register
#define MPU_SETXN   0x10000000
#define MPU_AP_RW   0x03000000
#define MPU_AP_RO   0x06000000
#define MPU_MEM_TEX 0
#define MPU_MEM_C   0x00020000
#define MPU_MEM_B   0x00010000
#define MPU_MEM_S   0x00040000
#define MPU_32MEG   0x00000032
#define MPU_512K    0x00000026
#define MPU_64K     0x0000001E
#define MPU_32K     0x0000001C

#define R0          0
#define R0_ADDR     0x00000000        // addr 0
#define MPU_RO_SET  (MPU_AP_RW | MPU_MEM_C | MPU_MEM_B | MPU_MEM_S | MPU_512K | MPU_ENABLE)
#define R1          1
#define R1_ADDR     0x10000000
#define MPU_R1_SET  (MPU_AP_RW | MPU_MEM_C | MPU_MEM_B | MPU_MEM_S | MPU_64K | MPU_ENABLE)
#define R2          2
#define R2_ADDR     0x20000000
#define MPU_R2_SET  (MPU_AP_RW | MPU_MEM_C | MPU_MEM_B | MPU_MEM_S | MPU_32K | MPU_ENABLE)
#define R3          3
#define R3_ADDR     0xA0000000
#define MPU_R3_SET  (MPU_AP_RW | MPU_MEM_C | MPU_MEM_B | MPU_MEM_S | MPU_32MEG | MPU_ENABLE)

code

        volatile uint32_t   *mpuReg;   // volatile: every write must reach the hardware register
        // setup MPU unit to allow execution from off chip SRAM
        mpuReg = (uint32_t*) MPU_CTRL;
        *mpuReg = MPU_PRIVENA;
        mpuReg = (uint32_t*) MPU_RNR;
        *mpuReg = R0;
        mpuReg = (uint32_t*) MPU_RBAR;
        *mpuReg = R0_ADDR;
        mpuReg = (uint32_t*) MPU_RASR;
        *mpuReg = MPU_RO_SET;
        mpuReg = (uint32_t*) MPU_RNR;
        *mpuReg = R1;
        mpuReg = (uint32_t*) MPU_RBAR;
        *mpuReg = R1_ADDR;
        mpuReg = (uint32_t*) MPU_RASR;
        *mpuReg = MPU_R1_SET;
        mpuReg = (uint32_t*) MPU_RNR;
        *mpuReg = R2;
        mpuReg = (uint32_t*) MPU_RBAR;
        *mpuReg = R2_ADDR;
        mpuReg = (uint32_t*) MPU_RASR;
        *mpuReg = MPU_R2_SET;
        mpuReg = (uint32_t*) MPU_RNR;
        *mpuReg = R3;
        mpuReg = (uint32_t*) MPU_RBAR;
        *mpuReg = R3_ADDR;
        mpuReg = (uint32_t*) MPU_RASR;
        *mpuReg = MPU_R3_SET;
        mpuReg = (uint32_t*) MPU_CTRL;
        *mpuReg = (MPU_PRIVENA | MPU_ENABLE) ;

Basically this sets up 4 memory regions for the flash, internal SRAM, I/O buffer area and external SRAM available on that dev board. This will of course differ on Particle. I think the device only has flash and internal SRAM, so just 2. Just make sure that the SRAM region does not have the execute-never bit (MPU_SETXN above) set. Also (obviously) the region base addresses and sizes above will have to change for the STM. The MPU and its registers are part of the ARM Cortex-M spec and sit at fixed addresses, but the memory layout is up to the manufacturer.
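
When adapting the region sizes, the size constants can be derived rather than copied. The sketch below assumes the ARMv7-M rule that a region covers 2^(SIZE+1) bytes, with SIZE stored in bits [5:1] of MPU_RASR; the helper names are invented and the code only computes values, it does not touch any registers.

```cpp
#include <cassert>
#include <cstdint>

const uint32_t RASR_ENABLE  = 1u << 0;    // region enable
const uint32_t RASR_AP_FULL = 3u << 24;   // privileged and user read/write
const uint32_t RASR_XN      = 1u << 28;   // execute-never

// Returns the SIZE field, already shifted into bits [5:1], for a region of
// regionBytes. Regions must be a power of two and at least 32 bytes.
uint32_t mpuSizeField(uint64_t regionBytes) {
    assert(regionBytes >= 32);
    assert((regionBytes & (regionBytes - 1)) == 0);
    uint32_t exponent = 0;
    while ((1ull << exponent) < regionBytes) ++exponent;
    return (exponent - 1) << 1;
}

// Builds a RASR value for a read/write region; executable unless xn is true.
uint32_t mpuRasr(uint64_t regionBytes, bool xn) {
    return RASR_AP_FULL | (xn ? RASR_XN : 0u) | mpuSizeField(regionBytes) | RASR_ENABLE;
}

// True when code may be fetched from a region with this RASR value.
bool rasrAllowsExecution(uint32_t rasr) {
    return (rasr & RASR_XN) == 0;
}
```

By this formula 64 KB and 32 KB give 0x1E and 0x1C, matching MPU_64K and MPU_32K in the header above. 512 KB and 32 MB come out as 0x24 and 0x30, so the header's 0x26 and 0x32 appear to describe regions twice as large, which may have been deliberate.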

Hope this is of help

zachary · May 16, 2017, 8:53am

Fun times! I tried this, saw the same hard fault, and solved it. There are two issues. One is manually handling register state when the assembler isn't helping you. The other is ARM/Thumb-2 interworking.

TL;DR

Always prepend 0xff, 0xb4 to your machine code.
Always append 0xff, 0xbc, 0x70, 0x47 to your machine code.
Call one byte after your pointer.

Explanation

Handling register state

Your code, whatever it is, changes some register values. When you return to the caller, those registers cause a fault because they don't hold their expected values, and they get used in some unintended way.

In assembly this problem is typically solved by pushing registers onto the stack at the start of a function (b4ff), and then popping them off the stack at the end (bcff). The ff in both cases says to push/pop all the low registers (r0–r7), just to be safe. There also exist general registers r8–r12, but the 16-bit push and pop encodings can't reach those.

The final 4770 is just the bx lr "return" I suggested on March 29.

(The 16-bit Thumb instructions used here are 2 bytes wide and stored little-endian, so logical 4770 is stored in memory as 7047.)
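
The prepend/append recipe and the byte order can be captured in a small host-side helper. This is a sketch under the assumptions stated above (16-bit encodings, little-endian storage); wrapThumbBody and appendHalfword are invented names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

const uint16_t PUSH_R0_R7 = 0xB4FF;  // push {r0-r7}
const uint16_t POP_R0_R7  = 0xBCFF;  // pop  {r0-r7}
const uint16_t BX_LR      = 0x4770;  // bx lr

// Appends one 16-bit instruction in little-endian byte order.
void appendHalfword(std::vector<uint8_t> &bytes, uint16_t insn) {
    bytes.push_back(static_cast<uint8_t>(insn & 0xFF));
    bytes.push_back(static_cast<uint8_t>(insn >> 8));
}

// Wraps raw Thumb code with the register save, restore and return.
std::vector<uint8_t> wrapThumbBody(const std::vector<uint8_t> &body) {
    assert(body.size() % 2 == 0);   // Thumb code is halfword-aligned
    std::vector<uint8_t> out;
    appendHalfword(out, PUSH_R0_R7);
    out.insert(out.end(), body.begin(), body.end());
    appendHalfword(out, POP_R0_R7);
    appendHalfword(out, BX_LR);
    return out;
}
```

Because r0 is restored by the final pop, a body wrapped this way cannot hand a return value back in r0. That is fine for code that only has side effects, such as the D7 example.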

Address-based Interworking

ARM instructions are 4 bytes wide, whereas Thumb instructions are 2 bytes wide. They both are always aligned in memory to start at even addresses.

In general a program can have both ARM and Thumb instructions, and so the processor needs to know which mode it's in. When you call a function, you might be switching between ARM and Thumb mode. Half the types of branch instructions determine what mode to use by looking at the least significant bit of the address. They can do this because the addresses are really always even, so the least significant bit is always zero — thus ARM turned that bit into a mode flag.

Quoting a great article titled Branch and Call Sequences Explained on the ARM Processors blog:

Address-based interworking uses the lowest bit of the address to determine the instruction set at the target. If the lowest bit is 1, the branch will switch to Thumb state. If the lowest bit is 0, the branch will switch to ARM state. Note that the lowest bit is never actually used as part of the address as all instructions are either 4-byte aligned (as in ARM) or 2-byte aligned (as in Thumb).

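Particle firmware is built entirely in Thumb mode to save code space (and the Cortex-M3 can only execute Thumb instructions), so we always want bit 0 of the branch address to be 1. Branching to an even address requests a switch to ARM state, which faults.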

You can do the casting dance suggested on stackoverflow to force the compiler to let you turn a data pointer into a function pointer (on other architectures they're not necessarily the same size) and assign it to the right address like this:

void (*f)();           // declare function pointer f
*((void**)&f) = p + 1; // assign f to p+1, simpler syntax won't compile
f();                   // call f

However, I personally find the single line of inline assembly simpler:

asm( "blx %0" : /* no outputs */ : "mr" (p+1) );

Example Code

const unsigned char d7on[] = {
    0xff, 0xb4,
    0x40, 0x21, 0x09, 0x02, 0x02, 0x20, 0x01, 0x43,
    0x09, 0x04, 0x20, 0x20, 0x00, 0x02, 0x88, 0x61,
    0xff, 0xbc,
    0x70, 0x47 };

const size_t CODE_LEN = sizeof(d7on) / sizeof(d7on[0]);

void setup() {
    pinMode(D7, OUTPUT);
}

void callMemFunc() {
    void *p = malloc(CODE_LEN);
    memcpy(p, d7on, CODE_LEN);

    // either cast void* to a function pointer
    //void (*f)();
    //*((void**)&f) = p + 1;
    //f();

    // or just write the single branch instruction
    asm( "blx %0" : /* no outputs */ : "mr" (p+1) );

    // Don't forget to clean up after yourself.
    free(p);
}

void loop() {
    delay(10000);
    callMemFunc();
    Particle.publish("I called a function in memory.");
}

csaba · May 16, 2017, 6:33pm

Awesome, thanks for the very informative reply @zachary - I’m now able to get the function call working from the binary we are loading at runtime. The trick was the function pointer offset by +1, and compiling my standalone binary with the right flags for Thumb mode (-mcpu=cortex-m3 -mthumb).

vchavb · June 30, 2020, 3:31pm

Thanks for the great explanation, however I still don't understand how I can create the binary image correctly.

I wrote some code based on @zachary's example:

static int f1(int a){
	return a+3;	
}

int addition(int a,int b){
	return f1(a)+b;	
}

The assembly code looks like this

00000000 <f1>:
   0:   3003            adds    r0, #3
   2:   4770            bx      lr

00000004 <__2b2ae8740__addition>:
   4:   b508            push    {r3, lr}
   6:   f7ff fffb       bl      0 <f1>
   a:   4408            add     r0, r1
   c:   bd08            pop     {r3, pc}

0000000e <addition>:
   e:   e92d 4200       stmdb   sp!, {r9, lr}
  12:   b403            push    {r0, r1}
  14:   f04f 011c       mov.w   r1, #28
  18:   6809            ldr     r1, [r1, #0]
  1a:   4678            mov     r0, pc
  1c:   4788            blx     r1
  1e:   4681            mov     r9, r0
  20:   bc03            pop     {r0, r1}
  22:   f7ff ffef       bl      4 <__2b2ae8740__addition>
  26:   e8bd 8200       ldmia.w sp!, {r9, pc}
  2a:   0000            movs    r0, r0

The binary image, produced with a script I made, is

const unsigned char image[] = {
0xff, 0xb4,
0x03, 0x30, 0x70, 0x47, 0x08, 0xB5, 0xFF, 0xF7, 0xFB, 0xFF, 0x08, 0x44, 0x08, 0xBD, 0x2D, 0xE9, 0x00, 0x42, 0x03, 0xB4, 0x4F, 0xF0, 0x1C, 0x01, 0x09, 0x68, 0x78, 0x46,
0x88, 0x47, 0x81, 0x46, 0x03, 0xBC, 0xFF, 0xF7, 0xEF, 0xFF, 0xBD, 0xE8, 0x00, 0x82, 0x00, 0x00,
0xff, 0xbc,
0x70, 0x47
};

As recommended, I prepended 0xff,0xb4 and appended 0xff,0xbc,0x70,0x47.

When I try the following code it kind of works…

	void *p = malloc(CODE_LEN);
	memcpy(p, image, CODE_LEN);
	int (*imagePtr)(int,int);
	// add the Thumb bit to the data pointer first, then cast to a function pointer
	imagePtr = (int (*)(int,int))((char *)p + 1);
	result = imagePtr(5,3);

The problem is that the result is 8, but f1 should also be called, which would give (5+3)+3 = 11.

I don't know why the inner function is not being called at all. Also, I'm not sure whether imagePtr should point to p+1 or to where the addition function starts in the binary. I tried moving the pointer to where the addition function starts and inserting 0xff,0xb4 before it, but that gave me a HardFault.

Further information:
HW: stm32 cortex-m4
Toolchain: atollic-arm
compiler flags for image= -fPIE -msingle-pic-base -mcpu=cortex-m4 -mthumb -fomit-frame-pointer -fno-inline -fno-section-anchors
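
One possible reading of the listing above: with the two prefix bytes in place, the code that follows them is f1, so entering at p+1 runs the push and then f1 alone, and f1(5) returns 5+3 = 8. The sketch below shows how an entry address for a specific symbol could be computed instead. It is a host-side illustration with an invented helper name, and it does not address what the wrapper at offset 0xe expects to find at the absolute address it loads from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Address to branch to for a Thumb function that starts symbolOffset bytes
// into the object code, when the image is copied to base with prefixLen
// extra bytes placed in front of the object code. Bit 0 selects Thumb state.
uintptr_t thumbEntry(uintptr_t base, size_t prefixLen, size_t symbolOffset) {
    assert(base % 2 == 0);                          // code must be halfword-aligned
    assert((prefixLen + symbolOffset) % 2 == 0);
    return (base + prefixLen + symbolOffset) | 1u;
}
```

For example, with the image at a hypothetical base of 0x20001000 and the two-byte prefix, the function at offset 4 in the disassembly would be entered at thumbEntry(0x20001000, 2, 4). Compiler-generated functions save and restore the registers they need and end in their own return, so when calling one directly the extra push/pop bytes are probably unnecessary, and the pop would never be reached anyway.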
