Bug bounty: Kill the 'Cyan flash of death'

david_s5 · February 1, 2014, 9:52pm

Found a bug that kills the SPI Dirver.

In this Pic the ISR have stopped on the right.

This is the second transaction from the right

This is the last transaction from the right.

To translate from CC3000 to English. Send(sd,buf,1) returned 01 then is returned -10 For the Same one Send!

Test Case:

A) Create 3 sockets the 3rd is the SparkSocket, the first 2 are throwaways
and deleted to get SparSocket # to not be resused to test connect bug.

B) Close operations have 1 byte writes before to reduce close time. see

http://e2e.ti.com/support/low_power_rf/f/851/p/304212/1062330.aspx#1062330

c) TCPClient does HTTP get without read.

When things fail InternetTest is called before the reconect.
On the close InternetTest test socket the write’s status comes back 2
times. The second is without a ballenced SimpleLinkWaitxxxx. This caused
the hci_unsolicited_event_handler to not resume the SPI interrupts.

I fixed this by rewriting the SPI Driver. Pull request forthcoming.

David

wtfuzz · February 1, 2014, 10:30pm

Sparks arrived yesterday, also been noticing buggy behavior with wifi. Just ordered a JTAG interface (already got the JTAG shield for the spark) to join in on the fun

david_s5 · February 3, 2014, 8:01pm

Found an Issue:

If the user’s code has a socket they open and close regularly This should have a socket handle value of #1 with the sparkSocket being the socket handle value 0. However, if the sparkSocket is closed due to a error or a timeout in the CC3000 and the main loop calls into the user’s loop before it detects the spark socket is closed the users can reopen it’s socket and get handle 0. Then the Spark protocol ends up reading the USERS data an will most likely over write memory.

bool SparkProtocol::handle_received_message(void)
{
  last_message_millis = callback_millis();
  expecting_ping_ack = false;
  int len = queue[0] << 8 | queue[1];

  if (0 > blocking_receive(queue, len)) <------- what is length!!!!!
  {

Dave · February 3, 2014, 9:03pm

Nice find! I'll pass that along for the firmware guys to checkout

Sorry if this is a silly response, I just saw the "what is length" thing and I just wanted to chime in and say the core->server protocol prefixes each message with a two byte length packet.

david_s5 · February 3, 2014, 9:19pm

I have made more progress. See Below for details

http://nscdg.com/spark/good-bad_every_n_core-firmware.zip

I have added lot's of logging in the debug build.

I have re-written the TI cc3000_driver.
I fixed an dead hang in the write code where the race of a write to the device lost to a read request and was stuck waiting on the WIFI_INT pin.

I have fixed a ti bug in the event handler logic to prevent the situation documented above in "Found a bug that kills the SPI Dirver." post

I have added a watchdog to the TI event loop that will not let it run on longer then the inactivity timeout (+250 ms) and returns errors to the callers.

I do not have a good fix for post "Found an Issue:" other then extending the inactivity timer from 20 seconds to 32 seconds. But I believe this will not fix the issue, just mask it because of the async nature. of B, to D below

Need to discuss with you guy's best way to resolve this.

Anyway Please give these 4 builds a shot and report back....

mattande · February 3, 2014, 9:41pm

There needs to be central management of socket descriptors on the STM32. Only call connect() from one place and keep track of the sockets that have been opened. If connect() returns the same descriptor of another socket which is thought to be open, mark it as closed and give the socket descriptor to the current caller. Next time through the spark(or user) loop sees that its socket is closed and makes another call to the connect() wrapper and is issued the next free descriptor.

Great catch on the blocking_read() with undefined length btw.

sjunnesson · February 3, 2014, 10:04pm

@david_s5
So here is all the firmware tested with their results.

" bad_every_1_core-firmware.bin" seems to restart the Core every ~30 seconds and then after doing that 10 or so times started to broadcast 6 red blinks followed by a series of faster blinks
" bad_every_3_core-firmware.bin" the same behaviour as in test 1 but just took a bit longer for the issue to appear
“good_20_sec_inactivity_core-firmware.bin” had 6 red blinks followed by a couple of faster red blinks in less then 30 second of loading the firmware

“good_32sec_inactivitiy_core-firmware.bin” seems to be ticking along nicely and no error is reporting. Will run this test for a bit longer to see if any strange comes along.

david_s5 · February 3, 2014, 10:48pm

Yes if you count the blinks after the SOS …—… Blinks and report back the number that is the error code. My guess is it will be 6. See “Found an Issue” above

Please give good_32_sec_inactivity_core-firmware.bin and let us know if it works ok.

sjunnesson · February 3, 2014, 10:54pm

@david_s5 I updated my answer with the latest result.

david_s5 · February 3, 2014, 11:01pm

I agree that managing the sockets in one place will help. But what that management needs to do on the failure case is not clear.

david_s5 · February 3, 2014, 11:03pm

Thank you!

Did you ever have the Cyan flash of death with the original FW?

sjunnesson · February 3, 2014, 11:30pm

yes unfortunately I had it regularly. Later firmwares have made it less frequent though

david_s5 · February 3, 2014, 11:48pm

You should be a good test candidate then!

Was that just with the stock tinker or were you running something else?

You you have an USB to 3.3V converter you can plug onto tx,rx and gnd?

If you do 115200 will be the log. Looking forward to hearing how you make out!

Aloha,

David

mohit · February 3, 2014, 11:56pm

@david_s5
Thanks for all the efforts!

I tried out all the binaries. good_32sec_inactivitiy_core-firmware.bin seems to be the most stable but it still ends up in CFOD after a few minutes. Unfortunately, it did not recover.

Update: It does recover most of the time!

I’m not monitoring the serial log yet. Will update soon.

Mohit

mohit · February 4, 2014, 12:18am

Here is a sample log when it repeatedly fails to reconnect:

0000167702:<DEBUG> int Spark_Send(const unsigned char*, int) (348):bytes_sent 18
0000168221:<DEBUG> void loop1() (93):connecting
0000168226:<DEBUG> void loop1() (98):connection failed state 1
0000170751:<DEBUG> void loop1() (93):connecting
0000170756:<DEBUG> void loop1() (98):connection failed state 1
0000173276:<DEBUG> void loop1() (93):connecting
0000173281:<DEBUG> void loop1() (98):connection failed state 1
0000175801:<DEBUG> void loop1() (93):connecting
0000175806:<DEBUG> void loop1() (98):connection failed state 1
0000177811:<DEBUG> int Spark_Connect() (638):sparkSocket Now =0
0000177817:<DEBUG> int Spark_Disconnect() (674):
0000177821:<DEBUG> int Spark_Disconnect() (683):Close
0000181425:<DEBUG> int Spark_Disconnect() (685):Closed retVal=0
0000181431:<DEBUG> int Spark_Connect() (645):socketed sparkSocket=0
0000181437:<DEBUG> int Spark_Connect() (665):connet
0000211496:<DEBUG> int Spark_Connect() (667):connected connect=-1
0000211502:<DEBUG> int Internet_Test() (596):socket
0000211507:<DEBUG> int Internet_Test() (598):socketed testSocket=0
0000211512:<DEBUG> int Internet_Test() (619):connect
0000243767:<ERROR> hci_event_handler (248):Timeout now 243767 start 211517 elapsed 32250 cc3000__event_timeout_ms 32250
0000243777:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
0000243786:<DEBUG> int Internet_Test() (621):connected testResult=-1
0000243792:<DEBUG> int Internet_Test() (628):Close
0000243797:<DEBUG> int Internet_Test() (630):Closed rv=0
0000245924:<DEBUG> int Spark_Connect() (638):sparkSocket Now =0
0000245930:<DEBUG> int Spark_Disconnect() (674):
0000245934:<DEBUG> int Spark_Disconnect() (683):Close
0000245939:<DEBUG> int Spark_Disconnect() (685):Closed retVal=-57
0000245945:<DEBUG> int Spark_Connect() (645):socketed sparkSocket=0
0000245951:<DEBUG> int Spark_Connect() (665):connet
0000278206:<ERROR> hci_event_handler (248):Timeout now 278206 start 245956 elapsed 32250 cc3000__event_timeout_ms 32250
0000278216:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
0000278225:<DEBUG> int Spark_Connect() (667):connected connect=-1
0000278231:<DEBUG> int Internet_Test() (596):socket
0000278236:<DEBUG> int Internet_Test() (598):socketed testSocket=1
0000278242:<DEBUG> int Internet_Test() (619):connect
0000281497:<DEBUG> int Internet_Test() (621):connected testResult=-1
0000281503:<DEBUG> int Internet_Test() (628):Close
0000281509:<DEBUG> int Internet_Test() (630):Closed rv=0
0000283635:<DEBUG> int Spark_Connect() (638):sparkSocket Now =0
0000283641:<DEBUG> int Spark_Disconnect() (674):
0000283645:<DEBUG> int Spark_Disconnect() (683):Close
0000283650:<DEBUG> int Spark_Disconnect() (685):Closed retVal=-57
0000283656:<DEBUG> int Spark_Connect() (645):socketed sparkSocket=0
0000283662:<DEBUG> int Spark_Connect() (665):connet
0000315917:<ERROR> hci_event_handler (248):Timeout now 315917 start 283667 elapsed 32250 cc3000__event_timeout_ms 32250
0000315927:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
0000315936:<DEBUG> int Spark_Connect() (667):connected connect=-1
0000315942:<DEBUG> int Internet_Test() (596):socket
0000435920:<ERROR> hci_event_handler (248):Timeout now 435920 start 403670 elapsed 32250 cc3000__event_timeout_ms 32250
0000435930:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
0000435939:<DEBUG> int Spark_Connect() (667):connected connect=-1
0000435945:<DEBUG> int Internet_Test() (596):socket
0000435950:<DEBUG> int Internet_Test() (598):socketed testSocket=1
0000435956:<DEBUG> int Internet_Test() (619):connect
0000441501:<DEBUG> int Internet_Test() (621):connected testResult=-1
0000441507:<DEBUG> int Internet_Test() (628):Close
0000441512:<DEBUG> int Internet_Test() (630):Closed rv=0
0000443639:<DEBUG> int Spark_Connect() (638):sparkSocket Now =0
0000443645:<DEBUG> int Spark_Disconnect() (674):
0000443649:<DEBUG> int Spark_Disconnect() (683):Close
0000443654:<DEBUG> int Spark_Disconnect() (685):Closed retVal=-57
0000443660:<DEBUG> int Spark_Connect() (645):socketed sparkSocket=0
0000443666:<DEBUG> int Spark_Connect() (665):connet

sjunnesson · February 4, 2014, 12:19am

Here is the serial output from running good_32_sec_inactivity_core-firmware.bin for 1 minute without any problem. How long test do you want to have? It is hard to force a CFOD and it might take hours so thats quite long logs.

http://pastebin.com/EQV0aBvZ

Will do the other and try to capture when the error happens and the SOS is signaling.

mohit · February 4, 2014, 12:23am

I’ve the Core’s worst test network at the office here. Its a collection of Aerohive APs, that make the Core CFOD consistently within a few minutes. MIMO is CC3000’s worst nightmare.

sjunnesson · February 4, 2014, 12:24am

Here is the logs for “bad_every_1_core-firmware.bin” which ended with a 6 red blink after the SOS blink.

http://pastebin.com/ve7zyuzw

sjunnesson · February 4, 2014, 12:30am

“good_20_sec_inactivity_core-firmware.bin” which also ended with 6 red blinks

http://pastebin.com/N8w30zHC

Just realized my terminal software, CoolTerm, truncates the ouput. Is this enough logs @david_s5 or you want the full ones?

david_s5 · February 4, 2014, 1:09am

Focus the tests on good_32_sec_inactivity_core-firmware.bin if you can keep and send the whole log with observations of RGB LED that would be great.

Topic		Replies	Views
Sparkcore lost connection after some time connected Troubleshooting	10	1796	June 10, 2014
Flashing Cyan after 5-10min Troubleshooting	6	3014	January 13, 2014
Simple LED flashing program bombs out after an hour or two Troubleshooting	4	1831	January 10, 2014
Dropping the connection to spark cloud Troubleshooting	94	12534	February 21, 2016
Core losing connection after few hours Troubleshooting	2	1873	February 7, 2014

Bug bounty: Kill the 'Cyan flash of death'

Related topics