Bug bounty: Kill the 'Cyan flash of death'

Found a bug that kills the SPI Dirver.

In this Pic the ISR have stopped on the right.

This is the second transaction from the right

This is the last transaction from the right.

To translate from CC3000 to English. Send(sd,buf,1) returned 01 then is returned -10 For the Same one Send!

Test Case:

A) Create 3 sockets the 3rd is the SparkSocket, the first 2 are throwaways
and deleted to get SparSocket # to not be resused to test connect bug.

B) Close operations have 1 byte writes before to reduce close time. see

http://e2e.ti.com/support/low_power_rf/f/851/p/304212/1062330.aspx#1062330

c) TCPClient does HTTP get without read.

When things fail InternetTest is called before the reconect.
On the close InternetTest test socket the writeā€™s status comes back 2
times. The second is without a ballenced SimpleLinkWaitxxxx. This caused
the hci_unsolicited_event_handler to not resume the SPI interrupts.

I fixed this by rewriting the SPI Driver. Pull request forthcoming.

David

4 Likes

Sparks arrived yesterday, also been noticing buggy behavior with wifi. Just ordered a JTAG interface (already got the JTAG shield for the spark) to join in on the fun :slight_smile:

Found an Issue:

If the userā€™s code has a socket they open and close regularly This should have a socket handle value of #1 with the sparkSocket being the socket handle value 0. However, if the sparkSocket is closed due to a error or a timeout in the CC3000 and the main loop calls into the userā€™s loop before it detects the spark socket is closed the users can reopen itā€™s socket and get handle 0. Then the Spark protocol ends up reading the USERS data an will most likely over write memory.

bool SparkProtocol::handle_received_message(void)
{
  last_message_millis = callback_millis();
  expecting_ping_ack = false;
  int len = queue[0] << 8 | queue[1];

  if (0 > blocking_receive(queue, len)) <------- what is length!!!!!
  {
2 Likes

Nice find! I'll pass that along for the firmware guys to checkout

Sorry if this is a silly response, I just saw the "what is length" thing and I just wanted to chime in and say the core->server protocol prefixes each message with a two byte length packet.

I have made more progress. See Below for details

http://nscdg.com/spark/good-bad_every_n_core-firmware.zip

I have added lot's of logging in the debug build.

I have re-written the TI cc3000_driver.
I fixed an dead hang in the write code where the race of a write to the device lost to a read request and was stuck waiting on the WIFI_INT pin.

I have fixed a ti bug in the event handler logic to prevent the situation documented above in "Found a bug that kills the SPI Dirver." post

I have added a watchdog to the TI event loop that will not let it run on longer then the inactivity timeout (+250 ms) and returns errors to the callers.

I do not have a good fix for post "Found an Issue:" other then extending the inactivity timer from 20 seconds to 32 seconds. But I believe this will not fix the issue, just mask it because of the async nature. of B, to D below

Need to discuss with you guy's best way to resolve this.

Anyway Please give these 4 builds a shot and report back....

2 Likes

There needs to be central management of socket descriptors on the STM32. Only call connect() from one place and keep track of the sockets that have been opened. If connect() returns the same descriptor of another socket which is thought to be open, mark it as closed and give the socket descriptor to the current caller. Next time through the spark(or user) loop sees that its socket is closed and makes another call to the connect() wrapper and is issued the next free descriptor.

Great catch on the blocking_read() with undefined length btw.

1 Like

@david_s5
So here is all the firmware tested with their results.

  1. " bad_every_1_core-firmware.bin" seems to restart the Core every ~30 seconds and then after doing that 10 or so times started to broadcast 6 red blinks followed by a series of faster blinks
  2. " bad_every_3_core-firmware.bin" the same behaviour as in test 1 but just took a bit longer for the issue to appear
  3. ā€œgood_20_sec_inactivity_core-firmware.binā€ had 6 red blinks followed by a couple of faster red blinks in less then 30 second of loading the firmware

ā€œgood_32sec_inactivitiy_core-firmware.binā€ seems to be ticking along nicely and no error is reporting. Will run this test for a bit longer to see if any strange comes along.

Yes if you count the blinks after the SOS ā€¦ā€”ā€¦ Blinks and report back the number that is the error code. My guess is it will be 6. See ā€œFound an Issueā€ above

Please give good_32_sec_inactivity_core-firmware.bin and let us know if it works ok.

@david_s5 I updated my answer with the latest result.

I agree that managing the sockets in one place will help. But what that management needs to do on the failure case is not clear.

Thank you!

Did you ever have the Cyan flash of death with the original FW?

yes unfortunately I had it regularly. Later firmwares have made it less frequent though

You should be a good test candidate then!

Was that just with the stock tinker or were you running something else?

You you have an USB to 3.3V converter you can plug onto tx,rx and gnd?

If you do 115200 will be the log. Looking forward to hearing how you make out!

Aloha,

David

@david_s5
Thanks for all the efforts!

I tried out all the binaries. good_32sec_inactivitiy_core-firmware.bin seems to be the most stable but it still ends up in CFOD after a few minutes. Unfortunately, it did not recover.

Update: It does recover most of the time!

Iā€™m not monitoring the serial log yet. Will update soon.

Mohit

Here is a sample log when it repeatedly fails to reconnect:

0000167702:<DEBUG> int Spark_Send(const unsigned char*, int) (348):bytes_sent 18
0000168221:<DEBUG> void loop1() (93):connecting
0000168226:<DEBUG> void loop1() (98):connection failed state 1
0000170751:<DEBUG> void loop1() (93):connecting
0000170756:<DEBUG> void loop1() (98):connection failed state 1
0000173276:<DEBUG> void loop1() (93):connecting
0000173281:<DEBUG> void loop1() (98):connection failed state 1
0000175801:<DEBUG> void loop1() (93):connecting
0000175806:<DEBUG> void loop1() (98):connection failed state 1
0000177811:<DEBUG> int Spark_Connect() (638):sparkSocket Now =0
0000177817:<DEBUG> int Spark_Disconnect() (674):
0000177821:<DEBUG> int Spark_Disconnect() (683):Close
0000181425:<DEBUG> int Spark_Disconnect() (685):Closed retVal=0
0000181431:<DEBUG> int Spark_Connect() (645):socketed sparkSocket=0
0000181437:<DEBUG> int Spark_Connect() (665):connet
0000211496:<DEBUG> int Spark_Connect() (667):connected connect=-1
0000211502:<DEBUG> int Internet_Test() (596):socket
0000211507:<DEBUG> int Internet_Test() (598):socketed testSocket=0
0000211512:<DEBUG> int Internet_Test() (619):connect
0000243767:<ERROR> hci_event_handler (248):Timeout now 243767 start 211517 elapsed 32250 cc3000__event_timeout_ms 32250
0000243777:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
0000243786:<DEBUG> int Internet_Test() (621):connected testResult=-1
0000243792:<DEBUG> int Internet_Test() (628):Close
0000243797:<DEBUG> int Internet_Test() (630):Closed rv=0
0000245924:<DEBUG> int Spark_Connect() (638):sparkSocket Now =0
0000245930:<DEBUG> int Spark_Disconnect() (674):
0000245934:<DEBUG> int Spark_Disconnect() (683):Close
0000245939:<DEBUG> int Spark_Disconnect() (685):Closed retVal=-57
0000245945:<DEBUG> int Spark_Connect() (645):socketed sparkSocket=0
0000245951:<DEBUG> int Spark_Connect() (665):connet
0000278206:<ERROR> hci_event_handler (248):Timeout now 278206 start 245956 elapsed 32250 cc3000__event_timeout_ms 32250
0000278216:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
0000278225:<DEBUG> int Spark_Connect() (667):connected connect=-1
0000278231:<DEBUG> int Internet_Test() (596):socket
0000278236:<DEBUG> int Internet_Test() (598):socketed testSocket=1
0000278242:<DEBUG> int Internet_Test() (619):connect
0000281497:<DEBUG> int Internet_Test() (621):connected testResult=-1
0000281503:<DEBUG> int Internet_Test() (628):Close
0000281509:<DEBUG> int Internet_Test() (630):Closed rv=0
0000283635:<DEBUG> int Spark_Connect() (638):sparkSocket Now =0
0000283641:<DEBUG> int Spark_Disconnect() (674):
0000283645:<DEBUG> int Spark_Disconnect() (683):Close
0000283650:<DEBUG> int Spark_Disconnect() (685):Closed retVal=-57
0000283656:<DEBUG> int Spark_Connect() (645):socketed sparkSocket=0
0000283662:<DEBUG> int Spark_Connect() (665):connet
0000315917:<ERROR> hci_event_handler (248):Timeout now 315917 start 283667 elapsed 32250 cc3000__event_timeout_ms 32250
0000315927:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
0000315936:<DEBUG> int Spark_Connect() (667):connected connect=-1
0000315942:<DEBUG> int Internet_Test() (596):socket
0000435920:<ERROR> hci_event_handler (248):Timeout now 435920 start 403670 elapsed 32250 cc3000__event_timeout_ms 32250
0000435930:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
0000435939:<DEBUG> int Spark_Connect() (667):connected connect=-1
0000435945:<DEBUG> int Internet_Test() (596):socket
0000435950:<DEBUG> int Internet_Test() (598):socketed testSocket=1
0000435956:<DEBUG> int Internet_Test() (619):connect
0000441501:<DEBUG> int Internet_Test() (621):connected testResult=-1
0000441507:<DEBUG> int Internet_Test() (628):Close
0000441512:<DEBUG> int Internet_Test() (630):Closed rv=0
0000443639:<DEBUG> int Spark_Connect() (638):sparkSocket Now =0
0000443645:<DEBUG> int Spark_Disconnect() (674):
0000443649:<DEBUG> int Spark_Disconnect() (683):Close
0000443654:<DEBUG> int Spark_Disconnect() (685):Closed retVal=-57
0000443660:<DEBUG> int Spark_Connect() (645):socketed sparkSocket=0
0000443666:<DEBUG> int Spark_Connect() (665):connet

Here is the serial output from running good_32_sec_inactivity_core-firmware.bin for 1 minute without any problem. How long test do you want to have? It is hard to force a CFOD and it might take hours so thats quite long logs.

http://pastebin.com/EQV0aBvZ

Will do the other and try to capture when the error happens and the SOS is signaling.

Iā€™ve the :spark: Coreā€™s worst test network at the office here. Its a collection of Aerohive APs, that make the Core CFOD consistently within a few minutes. MIMO is CC3000ā€™s worst nightmare.

Here is the logs for ā€œbad_every_1_core-firmware.binā€ which ended with a 6 red blink after the SOS blink.

http://pastebin.com/ve7zyuzw

ā€œgood_20_sec_inactivity_core-firmware.binā€ which also ended with 6 red blinks

http://pastebin.com/N8w30zHC

Just realized my terminal software, CoolTerm, truncates the ouput. Is this enough logs @david_s5 or you want the full ones?

Focus the tests on good_32_sec_inactivity_core-firmware.bin if you can keep and send the whole log with observations of RGB LED that would be great.