Bug bounty: Kill the 'Cyan flash of death'

david_s5 · February 4, 2014, 1:10am

Yep 6 blinks is tha panic code for 0000112426: bool SparkProtocol::handle_received_message() (752):len(15393) > arraySize(queue)

This is due the following:

david_s5:

"Found an Issue:"

If the user's code has a socket they open and close regularly This should have a socket handle value of #1 with the sparkSocket being the socket handle value 0. However, if the sparkSocket is closed due to a error or a timeout in the CC3000 and the main loop calls into the user's loop before it detects the spark socket is closed the users can reopen it's socket and get handle 0. Then the Spark protocol ends up reading the USERS data an will most likely over write memory.

bool SparkProtocol::handle_received_message(void)
{
last_message_millis = callback_millis();
expecting_ping_ack = false;
int len = queue[0] << 8 | queue[1];

if (0 > blocking_receive(queue, len)) <------- what is length!!!!!
{

other then extending the inactivity timer from 20 seconds to 32 seconds. But I believe this will not fix the issue, just mask it because of the async nature. of B, to D below

death-Capture.JPG925x83 28.9 KB

david_s5 · February 4, 2014, 1:20am

Understood - In the context of the failure it is data read from the users connection by the spark protocol. oops.

The value of length is rasndom data on the connection at the time and kills the core by overwriting memory, In my case it is the HT of HTTP… 0x5348 …

I have pull request up for the 4 repos with all this reworked

david_s5 · February 4, 2014, 1:27am

mohit:

0000167702:<DEBUG> int Spark_Send(const unsigned char*, int) (348):bytes_sent 18
0000168221:<DEBUG> void loop1() (93):connecting
0000168226:<DEBUG> void loop1() (98):connection failed state 1
0000170751:<DEBUG> void loop1() (93):connecting
0000170756:<DEBUG> void loop1() (98):connection failed state 1
0000173276:<DEBUG> void loop1() (93):connecting
0000173281:<DEBUG> void loop1() (98):connection failed state 1
0000175801:<DEBUG> void loop1() (93):connecting
0000175806:<DEBUG> void loop1() (98):connection failed state 1
0000177811:<DEBUG> int Spark_Connect() (638):sparkSocket Now =0
0000177817:<DEBUG> int Spark_Disconnect() (674):
0000177821:<DEBUG> int Spark_Disconnect() (683):Close
0000181425:<DEBUG> int Spark_Disconnect() (685):Closed retVal=0
0000181431:<DEBUG> int Spark_Connect() (645):socketed sparkSocket=0
0000181437:<DEBUG> int Spark_Connect() (665):connet
0000211496:<DEBUG> int Spark_Connect() (667):connected connect=-1
0000211502:<DEBUG> int Internet_Test() (596):socket
0000211507:<DEBUG> int Internet_Test() (598):socketed testSocket=0
0000211512:<DEBUG> int Internet_Test() (619):connect
0000243767:<ERROR> hci_event_handler (248):Timeout now 243767 start 211517 elapsed 32250 cc3000__event_timeout_ms 32250
0000243777:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
0000243786:<DEBUG> int Internet_Test() (621):connected testResult=-1
0000243792:<DEBUG> int Internet_Test() (628):Close
0000243797:<DEBUG> int Internet_Test() (630):Closed rv=0
0000245924:<DEBUG> int Spark_Connect() (638):sparkSocket Now =0
0000245930:<DEBUG> int Spark_Disconnect() (674):
0000245934:<DEBUG> int Spark_Disconnect() (683):Close
0000245939:<DEBUG> int Spark_Disconnect() (685):Closed retVal=-57
0000245945:<DEBUG> int Spark_Connect() (645):socketed sparkSocket=0
0000245951:<DEBUG> int Spark_Connect() (665):connet
0000278206:<ERROR> hci_event_handler (248):Timeout now 278206 start 245956 elapsed 32250 cc3000__event_timeout_ms 32250
0000278216:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
0000278225:<DEBUG> int Spark_Connect() (667):connected connect=-1
0000278231:<DEBUG> int Internet_Test() (596):socket
0000278236:<DEBUG> int Internet_Test() (598):socketed testSocket=1
0000278242:<DEBUG> int Internet_Test() (619):connect
0000281497:<DEBUG> int Internet_Test() (621):connected testResult=-1
0000281503:<DEBUG> int Internet_Test() (628):Close
0000281509:<DEBUG> int Internet_Test() (630):Closed rv=0
0000283635:<DEBUG> int Spark_Connect() (638):sparkSocket Now =0
0000283641:<DEBUG> int Spark_Disconnect() (674):
0000283645:<DEBUG> int Spark_Disconnect() (683):Close
0000283650:<DEBUG> int Spark_Disconnect() (685):Closed retVal=-57
0000283656:<DEBUG> int Spark_Connect() (645):socketed sparkSocket=0
0000283662:<DEBUG> int Spark_Connect() (665):connet
0000315917:<ERROR> hci_event_handler (248):Timeout now 315917 start 283667 elapsed 32250 cc3000__event_timeout_ms 32250
0000315927:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
0000315936:<DEBUG> int Spark_Connect() (667):connected connect=-1
0000315942:<DEBUG> int Internet_Test() (596):socket
0000435920:<ERROR> hci_event_handler (248):Timeout now 435920 start 403670 elapsed 32250 cc3000__event_timeout_ms 32250
0000435930:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
0000435939:<DEBUG> int Spark_Connect() (667):connected connect=-1
0000435945:<DEBUG> int Internet_Test() (596):socket
0000435950:<DEBUG> int Internet_Test() (598):socketed testSocket=1
0000435956:<DEBUG> int Internet_Test() (619):connect
0000441501:<DEBUG> int Internet_Test() (621):connected testResult=-1
0000441507:<DEBUG> int Internet_Test() (628):Close
0000441512:<DEBUG> int Internet_Test() (630):Closed rv=0
0000443639:<DEBUG> int Spark_Connect() (638):sparkSocket Now =0
0000443645:<DEBUG> int Spark_Disconnect() (674):
0000443649:<DEBUG> int Spark_Disconnect() (683):Close
0000443654:<DEBUG> int Spark_Disconnect() (685):Closed retVal=-57
0000443660:<DEBUG> int Spark_Connect() (645):socketed sparkSocket=0
0000443666:<DEBUG> int Spark_Connect() (665):connet

Are you sure you can call it a network

The above shows the connection WD is working: 1007 is the connection

0000435920:<ERROR> hci_event_handler (248):Timeout now 435920 start 403670 elapsed 32250 cc3000__event_timeout_ms 32250
0000435930:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode **0x1007**

If you would upload to pastbin the whole log from reboot to failure through to recovery that would help

Or is it that it does not recover and just stays in this mode?

What does the log show before this point?

sjunnesson · February 4, 2014, 1:32am

ok here is a 1 hour log from “good_32sec_inactivitiy_core-firmware.bin” No problem what I could see. Not sure if that is interesting to view but since it is generated anyway here it is.

Regarding the logs, do you @david_s5 need all the data in them? They become quite large when you run longer tests and if we could cut down on the counting for example to only output every 10 number that will make them a bit more nimble.

david_s5 · February 4, 2014, 1:37am

For now more data is better. How big will a zip file be?

david_s5 · February 4, 2014, 1:51am

This is 1.29 Hour of running good_32_sec_inactivity_core-firmware

It panic-ed because the SprakProtocol read the user’s data.

> 0004605563:<DEBUG> void loop1() (159):client.available() 0
> 0004605679:<DEBUG> void loop1() (159):client.available() 0
> 0004605794:<DEBUG> void loop1() (159):client.available() 0
> 0004605905:<DEBUG> void loop1() (194):


> 0004605908:<DEBUG> void loop1() (195):3890 total Bytes Read


> 0004606014:<DEBUG> virtual void TCPClient::stop() (181):_sock 1 closesocket=0
> 0004606021:<DEBUG> void loop1() (197):connection closed state 4
> 0004608540:<DEBUG> void loop1() (93):connecting
> 0004608547:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (61):socket=1
> 0004640805:<ERROR> hci_event_handler (248):Timeout now 4640805 start 4608555 elapsed 32250 cc3000__event_timeout_ms 32250
> 0004640815:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
> 0004640824:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (78):_sock 1 connected=0
> 0004640833:<DEBUG> virtual void TCPClient::stop() (181):_sock 1 closesocket=0
> 0004640840:<DEBUG> void loop1() (98):connection failed state 1
> 0004663361:<DEBUG> void loop1() (93):connecting
> 0004663368:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (61):socket=0

> ->>>4663477:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (78):_sock 0
------------------------------------------------------------------------

connected=1 
> 0004663491:<DEBUG> void loop1() (106): Send
> 0004663496:<DEBUG> void loop1() (114): Sent Doing Read
> 0004663641:<DEBUG> virtual int TCPClient::available() (135):recv(=236)
> 0004663648:<DEBUG> void loop1() (146):Ready
> 0004663652:<DEBUG> int Spark_Receive(unsigned char*, int) (392):bytes_received 2
> 0004663659:<PANIC> bool SparkProtocol::handle_received_message() (752):len(12857) > arraySize(queue)

wtfuzz · February 4, 2014, 4:30am

FIX FIX FIX Hopefully… Need more testing

EDIT: Don’t smash usRxEventOpcode if there is currently something pending.

There is a race condition on the expected command (usRxEventOpcode). I have a heavily modified SPI driver as well, which fixes a potential race condition in SpiWrite(). I’ve also simplified/cleaned up the driver. TCPServer/TCPClient has been modified as well and is quite stable. While the following patch to hci.c will probably make the driver work a lot better, these other fixes may be contributing to my stability as well. I will follow up with a github fork with all the changes.

hci_command_send() is called, which initiates an SPI transaction. If the interrupt occurs before SimpleLinkWaitEvent() is called, which normally sets usRxEventOpcode, the check in the receive handler will discard the event.

if (usReceivedEventOpcode == tSLInformation.usRxEventOpcode)
{
  tSLInformation.usRxEventOpcode = 0;
}

Applying this patch has worked wonders on the driver. This ensure the next expected event to wait for is set prior to sending the command. I tested this by adding a while(1){} loop in the above condition, and with this patch it runs strong, if I remove it I get an almost instant lockup.

diff --git a/CC3000_Host_Driver/hci.c b/CC3000_Host_Driver/hci.c
index bd75f00..17ef141 100644
--- a/CC3000_Host_Driver/hci.c
+++ b/CC3000_Host_Driver/hci.c
@@ -73,8 +73,13 @@ hci_command_send(unsigned short usOpcode, unsigned char *pucBuff,
        UINT8_TO_STREAM(stream, HCI_TYPE_CMND);
        stream = UINT16_TO_STREAM(stream, usOpcode);
        UINT8_TO_STREAM(stream, ucArgsLength);
+
+       while((tSLInformation.usRxDataPending != 0) || (tSLInformation.usRxEventOpcode != 0))
+       {
+       }
        
        //Update the opcode of the event we will be waiting for
+       tSLInformation.usRxEventOpcode = usOpcode;
        SpiWrite(pucBuff, ucArgsLength + SIMPLE_LINK_HCI_CMND_HEADER_SIZE);
        
        return(0);
@@ -111,6 +116,10 @@ hci_data_send(unsigned char ucOpcode,
        UINT8_TO_STREAM(stream, ucOpcode);
        UINT8_TO_STREAM(stream, usArgsLength);
        stream = UINT16_TO_STREAM(stream, usArgsLength + usDataLength + usTailLength);
+
+       while((tSLInformation.usRxDataPending != 0) || (tSLInformation.usRxEventOpcode != 0))
+       {
+       }
        
        // Send the packet over the SPI
        SpiWrite(ucArgs, SIMPLE_LINK_HCI_DATA_HEADER_SIZE + usArgsLength + usDataLength + usTailLength);

david_s5 · February 4, 2014, 2:32pm

As I understand the original TI code the hci_command_send is called by the foreground (non-interrupt context) to send data to the cc3000 by calling SpiWrite.

SpiWrite ends in

	while (eSPI_STATE_IDLE != sSpiInformation.ulSpiState)
	{
	}

which will wait until the complete write completes before returning to the caller. Which calls
SimpleLinkWaitEvent, SimpleLinkWaitEvent sets the opcode and begins waiting for tSLInformation.usEventOrDataReceived != 0) to process the event (all in the foreground FG)

The Interrupt caused by the falling edge of the WIFI_INT will load the DMA and the DMA's completion ISR will provide a corresponding call out to the SpiReceiveHandler in the background (BG interrupt context) . WITH the WIFI_INT interrupt disabled.

SpiReceiveHandler will set

tSLInformation.usEventOrDataReceived = 1;
tSLInformation.pucReceivedData = (unsigned char 	*)pvBuffer;

and calls:
hci_unsolicited_event_handler();

The hci_unsolicited_event_handler() calls the hci_unsol_event_handler() to the event discrimination and if it is an one of the following:

HCI_EVNT_WLAN_KEEPALIVE:
HCI_EVNT_WLAN_UNSOL_CONNECT:
HCI_EVNT_WLAN_UNSOL_DISCONNECT:
HCI_EVNT_WLAN_UNSOL_INIT:
HCI_EVNT_WLAN_ASYNC_SIMPLE_CONFIG_DONE:
HCI_EVNT_WLAN_UNSOL_DHCP:
HCI_EVNT_WLAN_ASYNC_PING_REPORT:
HCI_EVNT_BSD_TCP_CLOSE_WAIT:

then the hci_unsolicited_event_handler() will ultimately call the WLAN_Async_Callback which ONLY can set flags and can not do anything else. On exit of hci_unsol_event_handler() if the event was handled the WIFI_INT interrupt is re-enabled.

If the data at pucReceivedData is not an EVENT or not one handled by hci_unsol_event_handler it does nothing but leaves the WIFI_INT interrupt disabled and

tSLInformation.usEventOrDataReceived = 1;
tSLInformation.pucReceivedData = (unsigned char 	*)pvBuffer;

to be consumed by the call SimpleLinkWaitEvent which runs in the FG.

What is the path through the code that will allow hci_command_send to be called in the FG

where:

while((tSLInformation.usRxDataPending != 0) || (tSLInformation.usRxEventOpcode != 0))

Could be true prior to the call to SpiWrite?

tSLInformation.usRxDataPending is set by the BG and cleared by the FG
tSLInformation.usRxEventOpcode is set by the FG and cleared by the FG

if (usReceivedEventOpcode == tSLInformation.usRxEventOpcode)
{
	tSLInformation.usRxEventOpcode = 0;
}

The above runs in the FG with the WIFI_INT interrupt disabled.

Could it be that the changes you made to the spi driver changed this?

david_s5 · February 4, 2014, 3:17pm

@mohit Do you have a wireless sniffer? I ask because this AM I thought I had the CFOD.

Duhhh…I forgot the AP I attached to last night is connected to a firewall that has a curfew. So the
log was spewing

0037196230:<DEBUG> void loop1() (93):connecting
0037196337:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (61):socket=1
0037196348:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (78):_sock 1 connected=0
0037196357:<DEBUG> virtual void TCPClient::stop() (181):_sock 1 closesocket=-57
0037196364:<DEBUG> void loop1() (98):connection failed state 1
0037198880:<DEBUG> void loop1() (93):connecting
0037198987:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (61):socket=1
0037198998:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (78):_sock 1 connected=0
0037199007:<DEBUG> virtual void TCPClient::stop() (181):_sock 1 closesocket=-57
0037199014:<DEBUG> void loop1() (98):connection failed state 1
0037201020:<DEBUG> int Spark_Send(const unsigned char*, int) (348):bytes_sent 18
0037201157:<DEBUG> int Spark_Receive(unsigned char*, int) (392):bytes_received 2
0037201165:<DEBUG> int Spark_Receive(unsigned char*, int) (392):bytes_received 16
0037201554:<DEBUG> void loop1() (93):connecting

but only for the user’s code. (This suggests that the spark stayed connected through the firewall from before the curfew began.)

Anyway:                                             [tap]
The results where 100% good spi traffic with no AP<---+--->WAN traffic at the tap.

This posed the question to me what is happening in the SparkCore<------->AP side when a CFOD occurs?

Do you have a wireless sniffer and traces of that?

wtfuzz · February 4, 2014, 5:17pm

Right, but hci_command_send() is called with interrupts enabled, which begins sending the command. SimpleLinkeWaitEvent() is where usRxEventOpode is normally set, also with interrupts enabled.

Say usRxEventOpcode == 1

hci_command_send() -> SimpleLinkWaitEvent()

Say an interrupt occurs before SimpleLinkWait() sets usRxEventOpcode:
usEventOrDataReceived == 1
hci_event_handler() calls SpiResumeSpi() in the while(1) loop

After re-enabling interrupts, it won’t exit the loop since usRxEventOpcode != 0, as a result of usRxEventOpcode not being set to 0 since usReceivedEventOpcode != tSLInformation.usRxEventOpcode

  SpiResumeSpi();      
  if ((tSLInformation.usRxEventOpcode == 0) && (tSLInformation.usRxDataPending == 0))
  {
    return NULL;
  }	
  // continue while(1), interrupts enabled

david_s5 · February 4, 2014, 6:51pm

I am still not seeing it…

hci_command_send() -> SimpleLinkWaitEvent()->sets data and op code ---->hci_event_handler()
ISR…^…^…^…^…^

Anywhere ^ is safe…

If the opcode matches the

		if (usReceivedEventOpcode == tSLInformation.usRxEventOpcode)
				{
					tSLInformation.usRxEventOpcode = 0;
				}

will clear the opcode and no matter what set usEventOrDataReceived (DATA or EVENT) then usRxDataPending is cleared before SpiResumeSpi is called.

The code exits the loop, even at that point there will be no ISR corresponding to a new Sent. The only possible thing is an unsolicited event and it does not touch tSLInformation.usRxEventOpcode and will be processed by the loop and exit

mohit · February 4, 2014, 6:58pm

Thanks @david_s5
I don’t have a wireless sniffer, but I’ve heard that a mac book pro can do that natively. I’ll see if I can have one of my work-colleagues-mac-owners join me on that.

The previous log remained the same, since the Core never got back online.

I’ll gather some more data in the meatime. Do you have a fork link for the firmware you are working on?

Thanks!

david_s5 · February 4, 2014, 7:14pm

yes https://github.com/davids5/ all 4 repos have been updated on my master. The only thing is that application.cpp has test code and tinker.

wtfuzz · February 4, 2014, 7:16pm

Two interrupts between hci_command_send() and setting the opcode is what I think is unsafe. The received message won't be handled (unexpected opcode), interrupts are re-enabled and the next interrupt will start overwriting the rx buffer. Then the expected opcode of the first message is written.

david_s5 · February 4, 2014, 7:21pm

There can not be 2 because the call out is done with the ints off. The second can only happen after the SpiResumeSpi() and if it does it will take the hci_unsol_event_handler((char *)pucReceivedData) path and not touch the opcode and exit.

wtfuzz · February 4, 2014, 7:26pm

The while(1) loop in hci_event_handler calls SpiResumeSpi()

If the expected opcode does not match, it won’t reset usRxEventOpcode, and it will fall through to SpiResumeSpi() below, and continue at the top of the while(1)

if (usReceivedEventOpcode == tSLInformation.usRxEventOpcode)
				{
					tSLInformation.usRxEventOpcode = 0;
				}

			SpiResumeSpi();
			
			// Since we are going to TX - we need to handle this event after the 
			// ResumeSPi since we need interrupts
			if ((*pucReceivedData == HCI_TYPE_EVNT) &&
					(usReceivedEventOpcode == HCI_EVNT_PATCHES_REQ))
			{
				hci_unsol_handle_patch_request((char *)pucReceivedData);
			}
			
			if ((tSLInformation.usRxEventOpcode == 0) && (tSLInformation.usRxDataPending == 0))
			{
				return NULL;
			}

sjunnesson · February 4, 2014, 9:39pm

I got some more logs of the “good_32_sec_inactivity_core-firmware” It hung with just 1 red blink between SOS this time.

david_s5 · February 4, 2014, 11:17pm

Thank you it shows we still have a problem with the code crashing in a Hardfault (1 blink)
Here is what the codes mean in build you have.

1 (Faults,RGB_COLOR_RED,HardFault) 1
2 (Faults,RGB_COLOR_RED,NMIFault)
3 (Faults,RGB_COLOR_RED,MemManage)
4 (Faults,RGB_COLOR_RED,BusFault)
5 (Faults,RGB_COLOR_RED,UsageFault)

6 (Cloud,RGB_COLOR_RED,InvalidLenth)

7 (System,RGB_COLOR_RED,Exit)
8 (System,RGB_COLOR_RED,OutOfHeap)
9 (System,RGB_COLOR_RED,SPIOverRun)

10 (Softare,RGB_COLOR_RED,AssertionFailure)
11 (Softare,RGB_COLOR_RED,InvalidCase)

Also you had 629 connect attempts with 135 timeouts in the 3.17 hrs it ran

Thank for the data.

mohit · February 5, 2014, 6:21pm

Update:
In another news, @AndyW and I have been genreating debug logs for the CC3000 and sending it TI’s way. We also preped a Core with the debug lines exposed and shipped it to their tech support group. We also now have a direct email contact with the tech support group that has allowed us to bypass the TI’s forums.

The process is a little slow since TI’s uses proprietry tools to demystify the logs, but we are on it!

In case you are curious to know how we manged to sneek in the debug lines on the Core:

The calipers were used just to lift the module The yield rate for the process has been a low at 50% but we have enough of these now to keep continuing.

Thanks to everyone who has been patient and contributing to this thread!

david_s5 · February 5, 2014, 9:00pm

Amazing! Wow! You guys rock!

Topic		Replies	Views
Sparkcore lost connection after some time connected Troubleshooting	10	1796	June 10, 2014
Flashing Cyan after 5-10min Troubleshooting	6	3014	January 13, 2014
Simple LED flashing program bombs out after an hour or two Troubleshooting	4	1831	January 10, 2014
Dropping the connection to spark cloud Troubleshooting	94	12534	February 21, 2016
Core losing connection after few hours Troubleshooting	2	1873	February 7, 2014

Bug bounty: Kill the 'Cyan flash of death'

Related topics