Bug bounty: OTA flash reliability issue

OK I wish you good luck!!

Yeah - it’s not a great situation.

It should have been addressed much earlier, and validated by a regression test, but now it’s just time to roll up the sleeves and get it done.

I don’t know if this helps, but I have a Spark on my bench that seems to be constantly falling off the network, and needing to reconnect. The thing is, I have also managed to get the same device to be fairly reliable with staying connected to the network.

In my case, I am feeding RS232 data into the Spark about every three seconds, as described in another thread. With the RS232 source connected, the Spark is much less reliable at staying connected. Remove it, with the same code still running on the Core, and it is more reliable.

I am wondering if this is an issue with an interrupt being called internally by the Serial hardware, and the Serial device is effectively stealing an interrupt from the CC3000. If this is the case, then there might be other interrupts that are being lost.

In this case, is the Spark receiving RS232 data (and presumably pushing it out via WiFi somewhere)?

Do you physically disconnect the RS232 source, or just halt it (whilst still physically connected)?

Does the Spark generate the same amount of network traffic regardless of RS232 state, or does the CC3000 activity also drop off when the RS232 traffic halts?

I actually remove the RS232 input connection, thanks to the Spark breadboard. The code I am running reads the RS232 data into a structure, and the Core is interrogated for that structure periodically. This way, the network traffic is identical whether RS232 is there or not. Also, to make sure there is no difference, I first send enough data to fill the structure with real data, which then never changes.

Can you sling a scope on the 3V3 line and see if it behaves differently with/without the RS232 connection?

The 3.3V line is to all intents and purposes the same regardless of RS232 state, at about 3.3V. There are a few mV of noise on the line, but not much, and it does not vary with RS232.

OK, good to hear.

Are you using the stock Serial1 or Serial2 class?

It is the stock Serial2 class. And just to make sure that the serial processing was not taking too much CPU, I drive an LED output high whenever the main loop is polling the serial port. The code runs for about 25 ms every three seconds, and 2-3 µs every 100 ms.

Ack.

Baud rate ?

9600 bps, so it is not fast. An actual sample of the data appears below:

0,58,128,233,0.55,0.46
1,54,77,233,0.33,0.70
2,78,99,231,0.43,0.79
3,37,144,232,0.62,0.26
4,166,245,232,1.06,0.67
5,41,103,233,0.44,0.40
6,54,146,233,0.63,0.37
7,49,94,232,0.41,0.52
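
For readers following along, here is a minimal sketch of how one of the lines above could be read into a structure, as described earlier in the thread. The struct layout and field names are assumptions (the thread never says what the six columns mean), and `parse_sample` is a hypothetical helper, not the poster's actual code.

```c
#include <stdio.h>

/* Hypothetical layout: the thread does not name the six columns. */
typedef struct {
    int    index;   /* first column, counts 0..7 in the sample */
    int    a;       /* second column (meaning unknown) */
    int    b;       /* third column (meaning unknown) */
    int    c;       /* fourth column (meaning unknown) */
    double x;       /* fifth column (meaning unknown) */
    double y;       /* sixth column (meaning unknown) */
} sample_t;

/* Parse one line such as "0,58,128,233,0.55,0.46".
 * Returns 1 on success, 0 if six fields could not be read.
 * The output is only written when the whole line parses. */
int parse_sample(const char *line, sample_t *out)
{
    sample_t tmp;
    int n = sscanf(line, "%d,%d,%d,%d,%lf,%lf",
                   &tmp.index, &tmp.a, &tmp.b, &tmp.c, &tmp.x, &tmp.y);
    if (n != 6) {
        return 0;
    }
    *out = tmp;
    return 1;
}
```

Writing to the output only after a full parse means a truncated line at 9600 bps leaves the previously stored values untouched.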

[Cross posting from Github]

OK, so I noticed that get_socket_active_status() disables and then conditionally re-enables interrupts, which seems odd but it came that way from TI I guess. I was playing around with socket status and discovered that merely reading socket status in the prepare function made it work again.

I can’t explain this, and I don’t feel that this is a fix by any means, but it must be a big clue. Maybe @satishgn or @david_s5 understands this code better.

void Spark_Prepare_For_Firmware_Update(void)
{
  SPARK_FLASH_UPDATE = 1;
  TimingFlashUpdateTimeout = 0;
  // Workaround under investigation, not a fix: merely reading the socket
  // status here makes the update work again. The value itself is unused.
  int stat = get_socket_active_status(sparkSocket);
  FLASH_Begin(EXTERNAL_FLASH_OTA_ADDRESS);
}

I’ve not been able to reproduce the OTA problems, but I did create a test that reads both from a socket and from flash in a loop. After 9-15 iterations, the loop would hang indefinitely when reading from the socket. It’s not clear if this is the same issue as the OTA bug, but at the very least, the test highlights a fairly serious bug that needs a fix.

I have coded a fix: an arbiter for the SPI bus that locks the bus on use and unlocks it when finished. The only tricky part is handling the case where the CC3000 wants to acquire the bus from an IRQ but the bus is already in use because external flash is using it on the main thread. In this case, the SPI read from the CC3000 is not done there and then (since we can’t use the bus); instead, the CC3000 SPI state machine is set to a new state, eSPI_STATE_READ_PREP_IRQ. When the SPI bus is later released, the release code checks for this state and then performs the SPI read from the CC3000. So the outstanding request is fetched as soon as the bus becomes free again, minimizing delays.

This fixes the deadlock I was getting in my test case, and the test runs for 1000 iterations successfully (compared to 9-15 iterations without arbitration.)

The SPI bus arbiter code is in a branch of core-common-lib. The application is in core-firmware. The mutex is a spin lock, but in practice there is no spinning. It’s a scalable solution: it will work if more devices are added to the internal SPI bus in future.
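
The arbitration described above can be sketched roughly as follows. This is a host-side simulation under stated assumptions, not the code in the branch: only eSPI_STATE_READ_PREP_IRQ is named in this thread, and every other identifier (`bus_try_acquire`, `bus_release`, `cc3000_irq_handler`, the owner enum, the other state names) is hypothetical.

```c
#include <stdbool.h>

/* Simplified state machine; only eSPI_STATE_READ_PREP_IRQ comes from the thread. */
typedef enum {
    eSPI_STATE_IDLE,           /* hypothetical name */
    eSPI_STATE_READ_IRQ,       /* hypothetical name: read in progress */
    eSPI_STATE_READ_PREP_IRQ   /* read requested while the bus was busy */
} spi_state_t;

typedef enum { BUS_OWNER_NONE, BUS_OWNER_FLASH, BUS_OWNER_CC3000 } bus_owner_t;

static volatile bus_owner_t bus_owner = BUS_OWNER_NONE;
static volatile spi_state_t cc3000_state = eSPI_STATE_IDLE;
static int cc3000_reads_done = 0;   /* counts simulated CC3000 reads */

/* Take the bus if it is free. On real hardware this test-and-set would
 * have to be atomic with respect to the CC3000 IRQ. */
bool bus_try_acquire(bus_owner_t who)
{
    if (bus_owner != BUS_OWNER_NONE) {
        return false;
    }
    bus_owner = who;
    return true;
}

/* Placeholder for the actual SPI transfer from the CC3000. */
static void cc3000_do_read(void)
{
    cc3000_reads_done++;
    cc3000_state = eSPI_STATE_IDLE;
}

/* Release the bus, then service a CC3000 read that was deferred meanwhile. */
void bus_release(bus_owner_t who)
{
    if (bus_owner != who) {
        return;                      /* not the owner: ignore */
    }
    bus_owner = BUS_OWNER_NONE;
    if (cc3000_state == eSPI_STATE_READ_PREP_IRQ &&
        bus_try_acquire(BUS_OWNER_CC3000)) {
        cc3000_state = eSPI_STATE_READ_IRQ;
        cc3000_do_read();
        bus_owner = BUS_OWNER_NONE;
    }
}

/* CC3000 IRQ: read now if the bus is free, otherwise record the request. */
void cc3000_irq_handler(void)
{
    if (bus_try_acquire(BUS_OWNER_CC3000)) {
        cc3000_state = eSPI_STATE_READ_IRQ;
        cc3000_do_read();
        bus_release(BUS_OWNER_CC3000);
    } else {
        cc3000_state = eSPI_STATE_READ_PREP_IRQ;
    }
}

/* Flash holds the bus, the CC3000 IRQ fires, then flash releases.
 * Returns how many CC3000 reads ran as a result of the release. */
int demo_overlap(void)
{
    bus_try_acquire(BUS_OWNER_FLASH);
    cc3000_irq_handler();            /* bus busy: request is deferred */
    int before = cc3000_reads_done;
    bus_release(BUS_OWNER_FLASH);    /* deferred read runs here */
    return cc3000_reads_done - before;
}
```

The point of the sketch is that the IRQ handler never waits for the bus; it either reads immediately or leaves a marker that the releasing side picks up.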

I’d like to check this fixes the OTA bug, but I can’t reproduce it. If someone could try this code and let me know if it helps with the OTA bug, I’d be very grateful!

OK - can people try this patch to core-common-lib/SPARK_Firmware_Driver/src/sst25vf_spi.c?

It is a horrendously crude mutex hack that uses EXTI_IMR to mask the CC3000 IRQ line whenever the flash driver enters a critical section.
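
As a rough illustration of the masking idea (not the patch itself), the sketch below clears and restores one bit of an interrupt mask register around a critical section. It uses a plain variable in place of the real EXTI_IMR register so it can run on a host, and the EXTI line number is an assumption; the thread does not say which line the CC3000 IRQ uses. The function names are hypothetical.

```c
#include <stdint.h>

/* Stand-in for the EXTI interrupt mask register so the sketch runs on a
 * host. A set bit means the corresponding line is unmasked. */
static volatile uint32_t fake_exti_imr = 0xFFFFFFFFu;

/* Hypothetical: the EXTI line carrying the CC3000 IRQ is not stated in the thread. */
#define CC3000_EXTI_LINE_MASK (1u << 11)

/* Mask the CC3000 IRQ line and return its previous mask bit,
 * so nested critical sections restore the right state. */
uint32_t flash_enter_critical(void)
{
    uint32_t was_enabled = fake_exti_imr & CC3000_EXTI_LINE_MASK;
    fake_exti_imr &= ~CC3000_EXTI_LINE_MASK;
    return was_enabled;
}

/* Restore the CC3000 IRQ line to the state it had on entry. */
void flash_exit_critical(uint32_t was_enabled)
{
    fake_exti_imr |= was_enabled;
}
```

Saving and restoring the previous bit, rather than unconditionally unmasking on exit, keeps the line masked if an outer critical section is still active.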

I think the test regime that @Dave suggested overwrites my test firmware, rendering the test useless. I’m not even sure this patch catches all the cases, but it works 100% of the time after a dfu load in my testing.

Without details, there’s not much I can do with that apart from file it under anecdote, sorry.

Pattern sensitive failures are real, and I’ve encountered plenty of them; but unless someone is doing compression that I’m not aware of, I don’t think the datapath from cloud to external flash seems particularly likely to be pattern sensitive.

I’m going to continue to concentrate on the mutex theory first.

I know I’m not going to be able to test that until I get to the bottom of the mutex issue, sorry.

To the best of my knowledge, this is not correct. The CC3000 can assert the IRQ line to indicate that it has a response and that the SPI master should later assert the CC3000’s chip select. There are so-called unsolicited responses where the CC3000 wants to send a response without there being any previous request, i.e. no previous related assert of the CS line.

EDIT: ah, just saw your edit at the bottom!

Yes - took me a while to get back into the groove with the CC3000 and its SPI protocol.

Sadly, the CC3000 doesn’t use the interrupt for anything very useful at the socket level, it is purely low-level housekeeping. I had to remind myself of that.

In other news, I think my crude mutex patch might narrow the window where the CC3000 IRQ can cause problems, but it does not come close to eliminating it.

I took the approach of forcing mutual exclusion on the SPI bus using a mutex. I think it’s the right solution, and it seems very solid given the tests I did with overlapping flash / WiFi use. It should completely eliminate contention on the SPI bus.

Yeah - will try it tonight.

Am I right in thinking you defer IRQ handling when the lock is already taken until the lock is released?
And you do that by code, not by masking/unmasking of interrupts?