[Solved]Asm volatile rrx?


I’ve never touched assembly code before today. I’m trying to port some code over from my Arduino which had some assembly code.

I thought I had figured it out but I guess not since I’m getting a ‘hard fault’ a red blink on my Spark Core.

Here is the suspect code:

    #define add_one_pin_to_byte(sendByte, counter, tempPWMValues) \
    { \
        byte pwmval=*tempPWMValues; \
        asm volatile ("cmp %0, %1" : /* No outputs */ : "r" (counter), "r" (pwmval)); \
        asm volatile ("rrx %0, %1" : "=r" (sendByte) : "r" (sendByte)); 	\

This code is building 1 bit at a time for a byte to be sent to a shift register. Here is the Arduino code that works correctly:

#define add_one_pin_to_byte(sendbyte, counter, tempPWMValues) \
{ \
    byte pwmval=*tempPWMValues; \
    asm volatile ("cp %0, %1" : /* No outputs */ : "r" (counter), "r" (pwmval): ); \
    asm volatile ("ror %0" : "+r" (sendbyte) : "r" (sendbyte) : ); 	\

I was following along here: http://www.st.com/web/en/resource/technical/document/programming_manual/CD00228163.pdf to help me with the assembler. Not sure if that’s the right guide to be following.

Basically the goal in pseudo code is:

if(pwmValue > counter)
    right shift sendByte by 1
    put '1' into leftmost bit of sendByte

Anyone have any ideas or guidance?

Have you considered seeing if it works without writing it in assembly?

[Edit] Looking at your code I remembered how much I dislike assembly. Why don’t you walk us through the arduino version and then your spark version and perhaps we can all help together?


@MrRocketman, I agree with @harrisonhjones here.
With some more background info about the intended use, we might be able to give some better advice.

I guess you want to rotate the carry flag, set by the compare operation (for clarity I would use cmps, to indicate you are directly interested in the flags, instead) into bit 31 of sendbyte. I don’t know of a “asm-quick” C way to code this off the top of my head.

One thing I’m not sure about is your use of cp in your second example - I guess it’s a typo.

And since you want to write your result back to its original variable, you could try this

  asm volatile ("rrx %[sb], %[sb]" : [sb] "+r" (sendbyte));
  // or
  asm volatile ("rrx %0, %0" : "=r" (sendbyte) : "0" (sendbyte));

Yes, please try it with C first. The clock rate for the core is much faster than an Arduino, so when your loop() is not distracted by cloud/wifi/internet stuff, the straight-line execution speed of the CPU and the efficiency of the GCC-ARM compiler should make hand crafted assembly like this unneeded except for the most extreme cases.

@MrRocketman, if you tell us which Arduino library you are trying to port, it would be very useful for us to help you :smile:

It seems like what you want is something like this (we know sendByte is type byte or uint8_t right?):

if (pwmValue>counter) {
  sendByte = (sendByte>>1) | (1<<7);

@bko, it’s a pitty that we all have to assume things since we don’t get more info of @MrRocketman :worried:

Your assumtion that we are dealing with uint8_t is logical, given the wording sendbyte and that the original code comes from an 8bit library.
I did miss that with my answer. I was focused too much on the asm rrx which only works as expected when it’s dealing with 32bit (I’m not aware of a byte version like rrxb).
So even in asm you’d have to “rebuild” the functionality of an 8bit rrx and so - I’d guess - you’ll end up with no speed gain by using asm whatsoever.

While your code does what @MrRocketman 's pseudo code asks for, interpreting the (not working) asm code I’d come up with this

  sendbyte >>= 1;          // rrx would always right shift by 1 
  if (pwmCalue > counter)  // if cmp had set the carry flag (counter-pwmval->carry)
    sendbyte |= 0x80;      // it would need to be "inserted" into bit 7

Who knows what’s actually wanted???

Sorry I wasn’t online.

More details: I tried running with C which was an exclellent suggestion! Turns out the hard fault is due to something else that I have yet to track down. So that’s good news sort of.

So this bit of code comes from the ShiftPWM library. I have heavily modified it for my purposes which is for dimming AC lights. I have a zero cross circuit, and opto isolated triacs connected to a shift register. I have everything working great on my little arduino.

Yes the value I was sending in as sendbyte is a byte. I have a byte array called pwmValues. 1 byte per strand of lights I’m controlling. So I have a timer interrupt running at 120Hz * 255 brightness levels. = 30,000 + times a second. Then I have a counter that counts from 255 down to 0 in this interrupt which I pass in as ‘counter’. So the purpose of the asm is to check each strand of lights pwmValue to see if it’s time to turn on. Store the result in sendbyte, and repeat for the other 7 channels of the shift register. Then start over for the next shift register. So with 32 strings of lights (4 shift registers) this code could be running 1 million times a second.

I definitely wasn’t taking into account the 32bit nature of the Spark. That’s got to be the problem I was seeing.

Thank you everyone for your feedback and ideas! I will look into changing things to be 32 bit compatible and report back.

Thanks again! And sorry for being slow

Yes bko,

That’s the code I’m trying to replicate in asm. sendbyte is the byte that gets sent over SPI to the shift register. This code gets called 8 times with the same sendbyte variable. But with each different string of light’s ‘pwmValue’. The goal is to build the byte for each shift register as fast as possible.


Not sure what that means! The interrupt runs at 30KHz or so (about 33us per interrupt). You most likely calculate all channels on each pass of the interrupt. In the ISR you service 32 count values, 32 compare/bit set operations, 4 byte values for the shift registers and I assume you write out those bytes to the SPI on each interrupt (so 4 SPI writes). I am also assuming you are clocking the SPI at DIV1 for max speed (36 MHz if I recall).

So I can see why optimizing the compare/bit set operation is important. It might be interesting to use some tight C++ code to see how the compiler optimizes things. Knowing you have 4 bytes of shift registers, you could do all your shift calculations with a 32-bit var and then write out each byte to the SPI :smile:

I don’t know what optimization level the web gui compiles to but -O4 optimization out of the compiler, with possible re-swizzling of the C code to make sure the optimizer has a clear shot, should be able to achieve almost max speed.

1 Like

@AndyW, I have to say I am impressed by your level of expertise and experience as I have never re-swizzled C code before. I feel dumb all of a sudden :stuck_out_tongue:

1 Like

objdump -S is your friend in this battle.

When you got a local toolchain, this is a nice line to see what the compiler makes from your C code

arm-none-eabi-gcc -c -g -mcpu=cortex-m3 -mthumb -Ofast -Wa,-a,-ad test.c > test.txt

In test.txt you can see the C code along with its asm representation after build.

Alternative optimizer settings would be -O2,-O3 and -O4 however you like it, but it would be good tp know what WebIDE usually does.

@MrRocketman, one thing you have to know with this community - it never sleeps and these guys are rocket fast, so once you ask your question, don’t be surprised if you got your answer before you pressed Create Topic or Reply :wink:

Is there really a -O4 optimization flag? I thought -O3 was the max.

The web ide and local builds both use -Os optimization flag to optimize for size so the binary will actually fit on the core - without that the binary is much much larger.

Fortunately, you can instruct the compiler to optimize individual functions for speed like this:

 // first declare the function - with optimization attribute
 my_super_fast_function() __attribute__((optimize("O4")));

 // then the function body
  my_super_fast_function() {


I too agree with comments that you really rarely need asm - gcc does a fantastic job of optimizing code. And as always try to avoid premature optimization! Profile first, then write asm if it’s truly needed. :smile:

Okay, I think you are trying to port my ShiftPWM library.

That trick in assembly works on AVR, but I do not know enough about the Spark Core’s ALU to give you the spark equivalent. At least you would have to account for the Spark being 32 bits instead of 8.

The original code is here:

The first line of assembly fetches a duty cycle setting from memory (2 clock cycles)
The second line compares the duty cycle with a counter. (1 clock cycle).
The result of this will be stored in the carry on AVR.
With a rotate over carry instruction, the compare result is shifted into a byte. (1 clock cycle).
Do that 8 times, and you’ll have 8 pin states ready in a byte to send out over SPI.

With for loop overhead, this takes about 5 clock cycles per pin to update all pins. With hardware SPI the SPI transfer can overlap with calculating the pin states. This gives very fast software PWM.

ShiftPWM supported hardware SPI and bitbanging ports (software SPI). Paul Stofregen (the man behind Teensy) ported bitbanging to the 32-bit teensy 3, see this pull request:

That might be a better starting point in porting this to Spark.
I have stopped supporting ShiftPWM and selling the boards because I am way to busy with BrewPi.


I have to admit, I only know up to -O3 either, but I took @AndyW 's word for it and added this, too :blush:

But thanks for the clarifucation about '-Os` and the inline attribute :+1:

You guys are awesome!

So I’m thinking I’ll just use normal C. Although the asm does seem to work if I shift the result over 24 bits. The problem now is with my interrupt timer. Forgive me if this is dumb but I’ve been hunting down the issue between this timer/asm code for the past 2 days to no avail.

So first off here is my code:

So I’ve narrowed down the hard fault I’m getting to be whenever I call my *(–tmpPWMValues) within my timer interrupt handler (The one that happens ~ 30kHz). So lines 663 for asm or 677 for the C version. So I messed around and moved the same code up to my hardware zero cross interrupt (line 639) and it works fine (which happens ~120Hz). So to me this means it’s not some volatile issue but something else. I just can’t wrap my head around why the same code works in the hardware interrupt but not the timer interrupt.

I feel like this would be WAY easier if I had a JTAG debugger.

Any ideas?

@MrRocketman, this is not related to your error but make sure you get the latest version of SparkIntervalTimer from the IDE or my repo. I made some minor but important changes to the interrupt priority level used by the timer ISR so it would not interfere with key firmware interrupts. :smile:

Awesome, I pulled your update @peekay123. Thanks for the heads up!