Bug bounty: OTA flash reliability issue

pra · June 30, 2014, 9:55pm

OK I wish you good luck!!

AndyW · June 30, 2014, 9:59pm

Yeah - it’s not a great situation.

It should have been addressed much earlier, and validated by a regression test, but now it’s just time to roll up the sleeves and get it done.

vk2tds · June 30, 2014, 10:00pm

I don’t know if this helps, but I have a Spark on my bench that seems to be constantly falling off the network, and needing to reconnect. The thing is, I have also managed to get the same device to be fairly reliable with staying connected to the network.

In my case, I am sending RS232 data in about every three seconds, as described in another thread. When the RS232 is connected, the Spark is much less reliable at staying connected. Remove it, keeping the same code running on the core, and it is more reliable.

I am wondering if this is an issue with an interrupt being called internally by the Serial hardware, and the Serial device is effectively stealing an interrupt from the CC3000. If this is the case, then there might be other interrupts that are being lost.

AndyW · June 30, 2014, 10:08pm

In this case, is the Core receiving RS232 data (and presumably pushing it out via wifi somewhere) ?

Do you physically disconnect the RS232 source, or just halt it (whilst still physically connected) ?

Does the Core generate the same amount of network traffic regardless of RS232 state, or does the CC3000 activity also drop off with the halting of the RS232 traffic ?

vk2tds · June 30, 2014, 10:16pm

I actually remove the RS232 input connection, thanks to the Spark breadboard. The code I am running reads the RS232 data into a structure, and the Core is periodically queried for that structure. In this way, the network traffic is identical whether RS232 is there or not. Also, to make sure that there is no difference, I ensure that enough data has been sent to fill the structure with real data, which then never changes.

AndyW · June 30, 2014, 10:26pm

Can you sling a scope on the 3V3 line and see if it behaves differently with/without the RS232 connection ?

vk2tds · June 30, 2014, 10:34pm

The 3.3V line is to all intents and purposes the same regardless of RS232 state, at about 3.3V. There are a few mV of noise on the line, but not much, and it does not vary with RS232.

AndyW · June 30, 2014, 10:41pm

OK, good to hear.

Are you using the stock Serial1 or Serial2 class ?

vk2tds · June 30, 2014, 10:49pm

It is the stock Serial2 class. And just to make sure that the serial processing was not taking too much CPU, I put an LED output on the processing, that would be high whenever the loop was polling the Serial Port in the main loop. The code runs for about 25 mSec every three seconds, and 2-3 uSec every 100 mSec.

AndyW · June 30, 2014, 10:51pm

Ack.

Baud rate ?

/* blah blah discourse fodder */

vk2tds · June 30, 2014, 10:54pm

9600 BPS... So it is not fast. An actual sample of data appears below:

0,58,128,233,0.55,0.46
1,54,77,233,0.33,0.70
2,78,99,231,0.43,0.79
3,37,144,232,0.62,0.26
4,166,245,232,1.06,0.67
5,41,103,233,0.44,0.40
6,54,146,233,0.63,0.37
7,49,94,232,0.41,0.52
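
For anyone who wants to reproduce the serial side, here is a minimal sketch of parsing one such line into a structure. The struct layout and the names `sample_t` and `parse_sample` are hypothetical: the thread does not say what the six fields mean, so they are kept generic, and this is not the code running on the Core.

```c
#include <stdio.h>

/* Hypothetical record layout: six comma-separated fields per line,
 * four integers followed by two decimals. Field meanings are unknown. */
typedef struct {
    int   index;
    int   a;
    int   b;
    int   c;
    float x;
    float y;
} sample_t;

/* Parse one line such as "0,58,128,233,0.55,0.46".
 * Returns 1 on success, 0 if the line does not contain all six fields.
 * On failure *out is left untouched, so the last good record survives. */
int parse_sample(const char *line, sample_t *out)
{
    sample_t tmp;
    int n = sscanf(line, "%d,%d,%d,%d,%f,%f",
                   &tmp.index, &tmp.a, &tmp.b, &tmp.c, &tmp.x, &tmp.y);
    if (n != 6) {
        return 0;
    }
    *out = tmp;
    return 1;
}
```

Parsing into a temporary and copying only on success means a truncated line never leaves a half-updated structure behind.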

bko · July 1, 2014, 2:37am

[Cross posting from Github]

OK, so I noticed that get_socket_active_status() disables and then conditionally re-enables interrupts, which seems odd but it came that way from TI I guess. I was playing around with socket status and discovered that merely reading socket status in the prepare function made it work again.

I can’t explain this, and I don’t feel that this is a fix by any means, but it must be a big clue. Maybe @satishgn or @david_s5 understands this code better.

void Spark_Prepare_For_Firmware_Update(void)
{
  SPARK_FLASH_UPDATE = 1;
  TimingFlashUpdateTimeout = 0;
  // Experiment: merely reading the socket status here made OTA work again.
  // The returned value is not used.
  int stat = get_socket_active_status(sparkSocket);
  FLASH_Begin(EXTERNAL_FLASH_OTA_ADDRESS);
}

mdma · July 1, 2014, 3:22am

I’ve not been able to reproduce the OTA problems, but I did create a test that reads both from a socket and from flash in a loop. After 9 - 15 iterations, the loop would hang indefinitely when reading from the socket. It’s not clear if this is the same issue as the OTA bug, but at the very least, the test highlights a fairly serious bug that needs a fix.

I have coded a fix - an arbiter for the spi bus: lock the bus on use, unlock when finished. The only tricky part is handling the case when the cc3000 wants to acquire the bus as part of an IRQ but the bus is already in use - external flash is using it on the main thread. In this case, the spi read from the cc3000 is not done there and then (since we can’t use the bus) - instead the cc3000 spi state machine is set to a new state - eSPI_STATE_READ_PREP_IRQ. When the SPI bus is later released, the release code checks for this state and then performs the spi read from the cc3000. So the outstanding request is fetched as soon as the bus becomes free again, minimizing delays.
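
A rough host-side sketch of the arbitration idea described above, to make the deferred-read path concrete. Only `eSPI_STATE_READ_PREP_IRQ` is a name from the actual change; every other identifier here (`spi_bus_try_lock`, `spi_bus_unlock`, `cc3000_irq_handler`, the owner enum, the read counter) is hypothetical and stands in for the real code in the branch. It models the logic only and does not touch hardware.

```c
#include <stdbool.h>

typedef enum {
    eSPI_STATE_IDLE,            /* hypothetical name for "nothing pending" */
    eSPI_STATE_READ_PREP_IRQ    /* CC3000 raised its IRQ while the bus was busy */
} spi_state_t;

typedef enum { BUS_OWNER_NONE, BUS_OWNER_FLASH, BUS_OWNER_CC3000 } bus_owner_t;

volatile bus_owner_t bus_owner = BUS_OWNER_NONE;
volatile spi_state_t cc3000_state = eSPI_STATE_IDLE;
int cc3000_reads_done = 0;      /* counts simulated CC3000 reads */

/* Stand-in for the real SPI read from the CC3000. */
static void cc3000_spi_read(void) { cc3000_reads_done++; }

/* Try to take the bus. On real hardware this test-and-set would have to
 * be made atomic, e.g. by briefly disabling interrupts. */
bool spi_bus_try_lock(bus_owner_t who)
{
    if (bus_owner != BUS_OWNER_NONE) {
        return false;
    }
    bus_owner = who;
    return true;
}

/* Release the bus, then service a CC3000 read that was deferred
 * because the IRQ arrived while the bus was in use. */
void spi_bus_unlock(void)
{
    bus_owner = BUS_OWNER_NONE;
    if (cc3000_state == eSPI_STATE_READ_PREP_IRQ) {
        cc3000_state = eSPI_STATE_IDLE;
        bus_owner = BUS_OWNER_CC3000;
        cc3000_spi_read();
        bus_owner = BUS_OWNER_NONE;
    }
}

/* CC3000 IRQ: read immediately if the bus is free, otherwise record
 * the request so that spi_bus_unlock() picks it up. */
void cc3000_irq_handler(void)
{
    if (spi_bus_try_lock(BUS_OWNER_CC3000)) {
        cc3000_spi_read();
        spi_bus_unlock();
    } else {
        cc3000_state = eSPI_STATE_READ_PREP_IRQ;
    }
}
```

In this model the flash driver would call `spi_bus_try_lock(BUS_OWNER_FLASH)` before a transfer and `spi_bus_unlock()` afterwards; an IRQ that lands in between is not lost, it is serviced at unlock time.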

This fixes the deadlock I was getting in my test case, and the test runs for 1000 iterations successfully (compared to 9-15 iterations without arbitration.)

The spi bus arbiter code is in a branch of core-common-lib. The application is in core-firmware. The mutex is a spin lock, but in practice there is no spinning. It’s a scalable solution - it will work if more devices are added to the internal spi bus in future.

I’d like to check this fixes the OTA bug, but I can’t reproduce it. If someone could try this code and let me know if it helps with the OTA bug, I’d be very grateful!

AndyW · July 1, 2014, 5:34am

OK - can people try this patch to core-common-lib/SPARK_Firmware_Driver/src/sst25vf_spi.c

It is a horrendously crude mutex hack that uses EXTI_IMR to mask the CC3000 IRQ line whenever the flash driver enters a critical section.
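
The shape of the hack, as a sketch that runs on a host rather than the actual patch: `fake_exti_imr` stands in for the STM32 `EXTI->IMR` register, and `CC3000_EXTI_LINE_MASK`, `flash_critical_enter` and `flash_critical_exit` are hypothetical names. The line number used for the CC3000 IRQ is an assumption for illustration only.

```c
#include <stdint.h>

/* Stand-in for EXTI->IMR so the sketch can run off-target. */
volatile uint32_t fake_exti_imr = 0xFFFFFFFFu;

/* Hypothetical EXTI line for the CC3000 IRQ pin. */
#define CC3000_EXTI_LINE_MASK (1u << 11)

/* Mask the CC3000 interrupt line before the flash driver uses the bus.
 * Returns the previous state of that one mask bit so the caller can
 * restore it, which keeps nested critical sections safe. */
uint32_t flash_critical_enter(void)
{
    uint32_t was_enabled = fake_exti_imr & CC3000_EXTI_LINE_MASK;
    fake_exti_imr &= ~CC3000_EXTI_LINE_MASK;
    return was_enabled;
}

/* Restore the CC3000 mask bit to what it was on entry.
 * Other EXTI lines are never modified. */
void flash_critical_exit(uint32_t was_enabled)
{
    fake_exti_imr |= was_enabled;
}
```

Each flash operation would be bracketed as `uint32_t s = flash_critical_enter(); /* SPI transfer */ flash_critical_exit(s);`.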

I think the test regime that @Dave suggested overwrites my test firmware, rendering the test useless. I’m not even sure this patch catches all the cases, but it works 100% of the time after a dfu load in my testing.

AndyW · July 1, 2014, 5:44am

Without details, there’s not much I can do with that apart from file it under anecdote, sorry.

Pattern sensitive failures are real, and I’ve encountered plenty of them; but unless someone is doing compression that I’m not aware of, I don’t think the datapath from cloud to external flash seems particularly likely to be pattern sensitive.

I’m going to continue to concentrate on the mutex theory first.

AndyW · July 1, 2014, 5:51am

I know I’m not going to be able to test that until I get to the bottom of the mutex issue, sorry.

mdma · July 1, 2014, 10:51am

To the best of my knowledge, this is not correct. The CC3000 can assert the IRQ line to indicate that it has a response and that the SPI master should later assert the CC3000's chip select. There are so-called unsolicited responses where the CC3000 wants to send a response without there being any previous request, i.e. no previous related assert of the CS line.

EDIT: ah, just saw your edit at the bottom!

AndyW · July 1, 2014, 2:27pm

Yes - took me a while to get back into the groove with the CC3000 and its SPI protocol.

Sadly, the CC3000 doesn’t use the interrupt for anything very useful at the socket level, it is purely low-level housekeeping. I had to remind myself of that.

In other news, I think my crude mutex patch might narrow the window where the CC3000 IRQ can cause problems, but it does not come close to eliminating it.

mdma · July 1, 2014, 2:54pm

I took the approach of forcing mutual exclusion on the SPI bus using a mutex. I think it’s the right solution, and seems very solid given the tests I did with overlapping flash / wifi use. It should completely eliminate contention on the spi bus.

AndyW · July 1, 2014, 3:38pm

Yeah - will try it tonight.

Am I right in thinking you defer IRQ handling when the lock is already taken until the lock is released ?
And you do that by code, not by masking/unmasking of interrupts ?
