Bug bounty: OTA flash reliability issue

Great, thanks for trying - I’m interested to hear how it works for you!

That’s correct, no interrupt masking. It’s only the code path that actually tries to acquire the bus that’s affected. There are a few other states where the interrupt handler is just changing state variables, and those can proceed unhindered.

The meat of it is here - it tries to acquire the bus, and if successful changes state and starts the SPI read as before. If unsuccessful, it enters a new state to flag the request as pending. When the SPI bus is released, the handle_spi_request() function (just a few lines up) is called to check and service any outstanding request.
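
A minimal sketch of that arbitration pattern is below. All the names here are illustrative, not the actual core-common-lib identifiers - this just shows the try-acquire / pending-flag / service-on-release flow described above:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative state machine for the CC3000 SPI arbitration described
 * above; names are hypothetical, not the real implementation. */
typedef enum { SPI_IDLE, SPI_BUSY, SPI_REQUEST_PENDING } spi_state_t;

static volatile spi_state_t spi_state = SPI_IDLE;
static volatile bool bus_locked = false;

/* Try to take the bus; returns true on success. */
static bool try_acquire_spi_bus(void) {
    if (!bus_locked) { bus_locked = true; return true; }
    return false;
}

static void start_spi_read(void) { /* kick off the SPI transfer */ }

/* Called from the CC3000 IRQ handler. */
void spi_irq(void) {
    if (try_acquire_spi_bus()) {
        spi_state = SPI_BUSY;
        start_spi_read();                 /* proceed as before */
    } else {
        spi_state = SPI_REQUEST_PENDING;  /* flag the request as pending */
    }
}

/* Services any outstanding request once the bus is free again. */
void handle_spi_request(void) {
    if (spi_state == SPI_REQUEST_PENDING && try_acquire_spi_bus()) {
        spi_state = SPI_BUSY;
        start_spi_read();
    }
}

/* Called when e.g. the flash code is done with the bus. */
void release_spi_bus(void) {
    bus_locked = false;
    handle_spi_request();
}
```

The key point is that the IRQ handler never blocks: if the bus is held, it records the request and returns, and the release path completes the deferred work.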

Here’s an external flash test I wrote to check the delay(20); monkey-business. In the past, you needed the delay(20)'s or this would fail quickly.

This hits one sector of the flash chip pretty hard, so you shouldn’t run this forever. Obviously another loop over more sectors would spread the load.

I will try to test this against your patch tonight, but I thought I would put it out here.


#define FLASHADDR 0x81000

uint8_t byteBuffer[256];
char failStr[64];

void setup() {
}

void loop() {
    static uint8_t baseValue = 100;  // static so each pass writes a new pattern
    
    writeSector(baseValue);
    checkSector(baseValue++);
    delay(5000);
}

void writeSector(uint8_t startValue) {
    uint8_t value = startValue;
    for(int i=0;i<256;i++) {
        byteBuffer[i] = value++;
    }
    //delay(20);
    sFLASH_EraseSector(FLASHADDR);
    //delay(20);
    sFLASH_WriteBuffer(byteBuffer, FLASHADDR, 256);   
    //delay(20);
}

void checkSector(uint8_t startValue) {
    uint8_t value = startValue;
    bool didItFail = false;
    memset(byteBuffer,0,256);
    
    //delay(20);
    sFLASH_ReadBuffer(byteBuffer, FLASHADDR, 256);
    //delay(20);
    
    for(int i=0;i<256;i++) {
        if (byteBuffer[i] != value++) {
            sprintf(failStr, "%d:%u:%u", i, byteBuffer[i], value-1);
            Spark.publish("FLASHFAIL", failStr);
            didItFail = true;
            break;
        }
    }
    if (!didItFail) {  // note: ~didItFail is bitwise NOT and is always truthy
        Spark.publish("FLASHFAIL", "OK");
    }
    
}

@bko, @zachary I just ended up at the above step and agree regarding the disable/enable of interrupts within the get|set_socket_active_status() calls. It seems odd and is not part of the original TI code.

This commit was done way back in mid Feb: https://github.com/spark/core-common-lib/commit/ed1689e3af7ba69683949bdfed3bdba3a2d024e1

If I revert back to the original TI code, spark_protocol’s blocking_send() at least works.


My eyebrows were raised when I saw that code too. Turning interrupts on/off is heavy-handed here. The get_socket_status and set_socket_status operations can be done without disabling interrupts by using atomic operations such as compare-and-swap.

And if the socket_status variable is really shared between threads, then it should at least be declared volatile, to be sure the value is always fetched; otherwise subsequent fetches can be optimized away by the compiler.
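
To sketch what that could look like - assuming a 32-bit status bitmask, which is an illustration rather than the actual core-common-lib code - a compare-and-swap retry loop replaces the interrupt disable/enable around the read-modify-write (GCC's `__sync_bool_compare_and_swap` compiles to LDREX/STREX on the Cortex-M3):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical socket active-status bitmask; volatile so every read
 * actually fetches the current value. */
static volatile uint32_t socket_active_status = 0;

void set_socket_active_status(uint32_t socket, bool active) {
    uint32_t oldval, newval;
    do {
        oldval = socket_active_status;
        newval = active ? (oldval |  (1u << socket))
                        : (oldval & ~(1u << socket));
        /* retry if an interrupt changed the value between read and swap */
    } while (!__sync_bool_compare_and_swap(&socket_active_status,
                                           oldval, newval));
}

bool get_socket_active_status(uint32_t socket) {
    return (socket_active_status >> socket) & 1u;
}
```

No interrupt is ever masked; if an IRQ fires between the read and the swap, the CAS fails and the loop simply retries with the fresh value.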

OK @mdma, your mutex code is running in my OTA test fixture. Looking good after a handful of passes.

Will let it run >= 50 passes before reporting back and tinkering further.

I had to add:

#include <stdint.h>

to the top of spi_bus.c to get it to compile. Not sure how your git tree was building…

OK - 49 of 50 iterations passed with the vanilla @mdma code. I’m not too concerned about the single failure at this point.

I am now testing a hybrid which obviates the need for handle_spi_request() and simply masks the CC3000 IRQ before acquiring the lock. The IRQ masking code is just a proof-of-concept hack, because the stm32f10x_exti.c functions are obfuscated, overly complex and essentially useless.

That is in test now, I’ll report back.

Hybrid code passed 50 of 50 iterations so far. I will leave it running tests overnight.

Proof-of-concept code is on GitHub; changes are limited to core-common-lib. The commit message contains a preliminary TODO list of work needed before this is anywhere near ready for consideration for prime time.
Apologies in advance if you have to manually merge/hack stuff because the baseline for this branch causes you build headaches.

@mdma, please note I did not pull over some of the additional atomic changes you had made in your repository.

I invite testing, especially with stress tests that simultaneously access the WiFi interface and the external flash. Post results here.


@AndyW, I’m surprised you got a test failure - it ran successfully for me with 1000s of iterations. Can you share your test code so I can try?

What’s the benefit of masking the interrupt and removing the call to handle_spi_request()? That prevents IRQ code from executing that can safely run unhindered, since it doesn’t touch the SPI bus. You could disable the IRQ handler only when the state is IDLE, which is when it uses the bus, but then we are getting into a very tight coupling of the IRQ behaviour and the arbitration logic. The handle_spi_request() function takes care of this without coupling the arbitration logic to the IRQ behaviour.

That shouldn't be necessary. It's not using anything from stdint.h - just the standard data type int. I just checked out the code and it compiles as is. Which line is failing for you? Could you try again - I just did a push - not sure if there were unpushed changes, but at least now I know we are seeing the same thing. This might also explain the test failure if it wasn't the latest code.


It was the int_least8_t definition that was required by stdatomic.h that was tripping this up, for my toolchain (gcc version 4.8.4 20140526 (release) [ARM/embedded-4_8-branch revision 211358]) this required stdint.h to compile.

As I said, I’m not too concerned about the single failure (my testing of the hybrid model also exhibited similar issues at approx 1% rate) - I think that is an artifact of the testbed, and not really related.

My test code is a simple shell script as follows:

    #!/bin/bash
    #
    # Tune DEVICE/TOKEN/FIRMWARE with the correct values for your situation.
    #
    TOKEN=xxxxxxxxxx
    DEVICE=yyyyyyyyyy
    FIRMWARE=$HOME/spark/vanilla/core-firmware/build/core-firmware.bin
    
    COUNT=0
    PASS=0
    FAIL=0
    
    while true
    do
      COUNT=$((COUNT+1))
      curl --max-time 5 https://api.spark.io/v1/devices/$DEVICE?access_token=$TOKEN
      curl  --max-time 5 -H "Authorization: Bearer $TOKEN" https://api.spark.io/v1/devices/$DEVICE -X PUT -F file=@$FIRMWARE
    
      if [ $? -ne 0 ]
      then
        FAIL=$((FAIL+1))
      else
        PASS=$((PASS+1))
      fi
    
      echo
      echo $COUNT/$PASS/$FAIL
      sleep 60
    done

WRT IRQ masking - there’s always more than one way to skin a cat. I’m used to masking/unmasking interrupts to avoid collisions, rather than setting flags and calling manually duplicated code from some other place. At this point, however we do it, we’re in sow’s ear/silk purse territory.


The flags are just a new state in the state machine which is already part of the cc3000 spi implementation, so to my mind it seems a natural direction to go in. And of course, I can refactor the duplicated code for the production version of this - I kept the duplicate so it was easier to read and understand what’s happening.

Using a mutex will also work in cases where someone wants to pull data from the network on the main thread and write it to flash, while pulling it out of flash on a timer interrupt (e.g. to render as audio) - there’s been talk of this for implementing streamed audio playback/recording. Accessing flash from an interrupt wouldn’t be possible if the SPI bus were managed with interrupt masking, since you don’t know in general which interrupts to mask!

The mutex solves this simply, and would allow flash access from both the main thread and on an interrupt. Furthermore, the same general approach with mutexes can be applied to secure any shared resources that we want concurrent access to. (This was my thought when I heard @satishgn was making the code multi-threaded with user and system code running in separate threads - there was no concurrency control in the present system, so this wouldn’t work without taking steps to apply arbitration to all shared resources.)
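
As a rough sketch of that pattern - with illustrative names, not the actual implementation - a try-lock built on compare-and-swap is safe to attempt from both the main thread and an interrupt handler; only the caller's retry policy differs:

```c
#include <stdbool.h>

/* Hypothetical lock word guarding the external flash SPI bus. */
static volatile int flash_bus_lock = 0;

/* Atomically take the lock if free; safe in thread or ISR context.
 * The GCC builtin compiles to LDREX/STREX on the Cortex-M3. */
static bool flash_try_lock(void) {
    return __sync_bool_compare_and_swap(&flash_bus_lock, 0, 1);
}

static void flash_unlock(void) {
    flash_bus_lock = 0;
}

/* Main thread: can afford to spin until the bus is free. */
void flash_write_blocking(void) {
    while (!flash_try_lock()) {
        /* bus busy - e.g. an ISR is reading audio data */
    }
    /* sFLASH_WriteBuffer(...); */
    flash_unlock();
}

/* Timer ISR (e.g. audio render): must not spin; skip this tick if busy. */
void flash_read_from_isr(void) {
    if (flash_try_lock()) {
        /* sFLASH_ReadBuffer(...); */
        flash_unlock();
    }
}
```

The same shape generalizes to any shared resource: the lock doesn’t care who the contenders are, which is what makes it viable once user and system code run in separate threads.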


@mdma, I tried your code over here https://github.com/spark/core-common-lib/commits/m-mcgowan/spibus_arbiter with the compiler error fix prescribed by @AndyW, but the normal OTA process would end up with the Core not running after OTA completion and reset (no RGB display), probably due to firmware corruption.


Ok @mdma, successive tests are passing after the first 2 failed attempts. Not sure what went wrong initially.

@Dave, can you please use your dual curl commands to test this code https://github.com/spark/core-common-lib/commits/m-mcgowan/spibus_arbiter (git checkout m-mcgowan/spibus_arbiter)?

I’m testing this now, would you rather I test the “bug-fix/ota-flash-reliability” branch?

Did a quick run of 12 tests; the first one failed (maybe a fluke), and one didn’t seem to get the update-begin (didn’t turn purple), but the rest passed. So 10 passes out of 12 tests.

Thanks,
David

Hi,
Any update on this situation? A good general fix in the works?

Reliable OTA flashing is critical for our product, and we are kind of stalled till this gets resolved.

Hey @faraday, what I know is that this will be worked on this week, and at the least there will be a deep update .bin which can be applied via DFU over USB.

This will allow users to perform an update while the OTA bug gets fixed

Thanks. If I understand you correctly, you mean this will allow updating the firmware even if OTA is broken (which it still is for me).

If that’s the case, I think (for me at least) it’s easier just to use ST-Link JTAG. Pretty easy, and it seems to always work.

But before putting much more time into development of our product, we need to know that the OTA issues will be fixed; otherwise the whole thing is a no-go, as we need OTA.

Just via USB will do :slight_smile: no need for JTAG.

It will update the CC3000 firmware and improve performance, and the patch file can be applied over USB.