Bug bounty: Kill the 'Cyan flash of death'

Yep 6 blinks is tha panic code for 0000112426: bool SparkProtocol::handle_received_message() (752):len(15393) > arraySize(queue)

This is due the following:

Understood - In the context of the failure it is data read from the users connection by the spark protocol. oops.

The value of length is rasndom data on the connection at the time and kills the core by overwriting memory, In my case it is the HT of HTTPā€¦ 0x5348 ā€¦

I have pull request up for the 4 repos with all this reworked

1 Like

Are you sure you can call it a network :smile:

The above shows the connection WD is working: 1007 is the connection

0000435920:<ERROR> hci_event_handler (248):Timeout now 435920 start 403670 elapsed 32250 cc3000__event_timeout_ms 32250
0000435930:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode **0x1007**

If you would upload to pastbin the whole log from reboot to failure through to recovery that would help

Or is it that it does not recover and just stays in this mode?

What does the log show before this point?

ok here is a 1 hour log from ā€œgood_32sec_inactivitiy_core-firmware.binā€ No problem what I could see. Not sure if that is interesting to view but since it is generated anyway here it is.

Regarding the logs, do you @david_s5 need all the data in them? They become quite large when you run longer tests and if we could cut down on the counting for example to only output every 10 number that will make them a bit more nimble.

For now more data is better. How big will a zip file be?

This is 1.29 Hour of running good_32_sec_inactivity_core-firmware

It panic-ed because the SprakProtocol read the userā€™s data.

> 0004605563:<DEBUG> void loop1() (159):client.available() 0
> 0004605679:<DEBUG> void loop1() (159):client.available() 0
> 0004605794:<DEBUG> void loop1() (159):client.available() 0
> 0004605905:<DEBUG> void loop1() (194):


> 0004605908:<DEBUG> void loop1() (195):3890 total Bytes Read


> 0004606014:<DEBUG> virtual void TCPClient::stop() (181):_sock 1 closesocket=0
> 0004606021:<DEBUG> void loop1() (197):connection closed state 4
> 0004608540:<DEBUG> void loop1() (93):connecting
> 0004608547:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (61):socket=1
> 0004640805:<ERROR> hci_event_handler (248):Timeout now 4640805 start 4608555 elapsed 32250 cc3000__event_timeout_ms 32250
> 0004640815:<ERROR> hci_event_handler (249):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
> 0004640824:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (78):_sock 1 connected=0
> 0004640833:<DEBUG> virtual void TCPClient::stop() (181):_sock 1 closesocket=0
> 0004640840:<DEBUG> void loop1() (98):connection failed state 1
> 0004663361:<DEBUG> void loop1() (93):connecting
> 0004663368:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (61):socket=0

> ->>>4663477:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (78):_sock 0
------------------------------------------------------------------------

connected=1 
> 0004663491:<DEBUG> void loop1() (106): Send
> 0004663496:<DEBUG> void loop1() (114): Sent Doing Read
> 0004663641:<DEBUG> virtual int TCPClient::available() (135):recv(=236)
> 0004663648:<DEBUG> void loop1() (146):Ready
> 0004663652:<DEBUG> int Spark_Receive(unsigned char*, int) (392):bytes_received 2
> 0004663659:<PANIC> bool SparkProtocol::handle_received_message() (752):len(12857) > arraySize(queue)

FIX FIX FIX Hopefullyā€¦ Need more testing

EDIT: Donā€™t smash usRxEventOpcode if there is currently something pending.

There is a race condition on the expected command (usRxEventOpcode). I have a heavily modified SPI driver as well, which fixes a potential race condition in SpiWrite(). Iā€™ve also simplified/cleaned up the driver. TCPServer/TCPClient has been modified as well and is quite stable. While the following patch to hci.c will probably make the driver work a lot better, these other fixes may be contributing to my stability as well. I will follow up with a github fork with all the changes.

hci_command_send() is called, which initiates an SPI transaction. If the interrupt occurs before SimpleLinkWaitEvent() is called, which normally sets usRxEventOpcode, the check in the receive handler will discard the event.

if (usReceivedEventOpcode == tSLInformation.usRxEventOpcode)
{
  tSLInformation.usRxEventOpcode = 0;
}

Applying this patch has worked wonders on the driver. This ensure the next expected event to wait for is set prior to sending the command. I tested this by adding a while(1){} loop in the above condition, and with this patch it runs strong, if I remove it I get an almost instant lockup.

diff --git a/CC3000_Host_Driver/hci.c b/CC3000_Host_Driver/hci.c
index bd75f00..17ef141 100644
--- a/CC3000_Host_Driver/hci.c
+++ b/CC3000_Host_Driver/hci.c
@@ -73,8 +73,13 @@ hci_command_send(unsigned short usOpcode, unsigned char *pucBuff,
        UINT8_TO_STREAM(stream, HCI_TYPE_CMND);
        stream = UINT16_TO_STREAM(stream, usOpcode);
        UINT8_TO_STREAM(stream, ucArgsLength);
+
+       while((tSLInformation.usRxDataPending != 0) || (tSLInformation.usRxEventOpcode != 0))
+       {
+       }
        
        //Update the opcode of the event we will be waiting for
+       tSLInformation.usRxEventOpcode = usOpcode;
        SpiWrite(pucBuff, ucArgsLength + SIMPLE_LINK_HCI_CMND_HEADER_SIZE);
        
        return(0);
@@ -111,6 +116,10 @@ hci_data_send(unsigned char ucOpcode,
        UINT8_TO_STREAM(stream, ucOpcode);
        UINT8_TO_STREAM(stream, usArgsLength);
        stream = UINT16_TO_STREAM(stream, usArgsLength + usDataLength + usTailLength);
+
+       while((tSLInformation.usRxDataPending != 0) || (tSLInformation.usRxEventOpcode != 0))
+       {
+       }
        
        // Send the packet over the SPI
        SpiWrite(ucArgs, SIMPLE_LINK_HCI_DATA_HEADER_SIZE + usArgsLength + usDataLength + usTailLength);

As I understand the original TI code the hci_command_send is called by the foreground (non-interrupt context) to send data to the cc3000 by calling SpiWrite.

SpiWrite ends in

	while (eSPI_STATE_IDLE != sSpiInformation.ulSpiState)
	{
	}

which will wait until the complete write completes before returning to the caller. Which calls
SimpleLinkWaitEvent, SimpleLinkWaitEvent sets the opcode and begins waiting for tSLInformation.usEventOrDataReceived != 0) to process the event (all in the foreground FG)

The Interrupt caused by the falling edge of the WIFI_INT will load the DMA and the DMA's completion ISR will provide a corresponding call out to the SpiReceiveHandler in the background (BG interrupt context) . WITH the WIFI_INT interrupt disabled.

SpiReceiveHandler will set

tSLInformation.usEventOrDataReceived = 1;
tSLInformation.pucReceivedData = (unsigned char 	*)pvBuffer;

and calls:
hci_unsolicited_event_handler();

The hci_unsolicited_event_handler() calls the hci_unsol_event_handler() to the event discrimination and if it is an one of the following:

HCI_EVNT_WLAN_KEEPALIVE:
HCI_EVNT_WLAN_UNSOL_CONNECT:
HCI_EVNT_WLAN_UNSOL_DISCONNECT:
HCI_EVNT_WLAN_UNSOL_INIT:
HCI_EVNT_WLAN_ASYNC_SIMPLE_CONFIG_DONE:
HCI_EVNT_WLAN_UNSOL_DHCP:
HCI_EVNT_WLAN_ASYNC_PING_REPORT:
HCI_EVNT_BSD_TCP_CLOSE_WAIT:

then the hci_unsolicited_event_handler() will ultimately call the WLAN_Async_Callback which ONLY can set flags and can not do anything else. On exit of hci_unsol_event_handler() if the event was handled the WIFI_INT interrupt is re-enabled.

If the data at pucReceivedData is not an EVENT or not one handled by hci_unsol_event_handler it does nothing but leaves the WIFI_INT interrupt disabled and

tSLInformation.usEventOrDataReceived = 1;
tSLInformation.pucReceivedData = (unsigned char 	*)pvBuffer;

to be consumed by the call SimpleLinkWaitEvent which runs in the FG.

What is the path through the code that will allow hci_command_send to be called in the FG

where:

while((tSLInformation.usRxDataPending != 0) || (tSLInformation.usRxEventOpcode != 0))

Could be true prior to the call to SpiWrite?

tSLInformation.usRxDataPending is set by the BG and cleared by the FG
tSLInformation.usRxEventOpcode is set by the FG and cleared by the FG

if (usReceivedEventOpcode == tSLInformation.usRxEventOpcode)
{
	tSLInformation.usRxEventOpcode = 0;
}

The above runs in the FG with the WIFI_INT interrupt disabled.

Could it be that the changes you made to the spi driver changed this?

@mohit Do you have a wireless sniffer? I ask because this AM I thought I had the CFOD.

Duhhhā€¦I forgot the AP I attached to last night is connected to a firewall that has a curfew. So the
log was spewing

0037196230:<DEBUG> void loop1() (93):connecting
0037196337:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (61):socket=1
0037196348:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (78):_sock 1 connected=0
0037196357:<DEBUG> virtual void TCPClient::stop() (181):_sock 1 closesocket=-57
0037196364:<DEBUG> void loop1() (98):connection failed state 1
0037198880:<DEBUG> void loop1() (93):connecting
0037198987:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (61):socket=1
0037198998:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (78):_sock 1 connected=0
0037199007:<DEBUG> virtual void TCPClient::stop() (181):_sock 1 closesocket=-57
0037199014:<DEBUG> void loop1() (98):connection failed state 1
0037201020:<DEBUG> int Spark_Send(const unsigned char*, int) (348):bytes_sent 18
0037201157:<DEBUG> int Spark_Receive(unsigned char*, int) (392):bytes_received 2
0037201165:<DEBUG> int Spark_Receive(unsigned char*, int) (392):bytes_received 16
0037201554:<DEBUG> void loop1() (93):connecting

but only for the userā€™s code. (This suggests that the spark stayed connected through the firewall from before the curfew began.)

Anyway:                                             [tap]
The results where 100% good spi traffic with no AP<---+--->WAN traffic at the tap.

This posed the question to me what is happening in the SparkCore<------->AP side when a CFOD occurs?

Do you have a wireless sniffer and traces of that?

Right, but hci_command_send() is called with interrupts enabled, which begins sending the command. SimpleLinkeWaitEvent() is where usRxEventOpode is normally set, also with interrupts enabled.

Say usRxEventOpcode == 1

hci_command_send() -> SimpleLinkWaitEvent()

Say an interrupt occurs before SimpleLinkWait() sets usRxEventOpcode:
usEventOrDataReceived == 1
hci_event_handler() calls SpiResumeSpi() in the while(1) loop

After re-enabling interrupts, it wonā€™t exit the loop since usRxEventOpcode != 0, as a result of usRxEventOpcode not being set to 0 since usReceivedEventOpcode != tSLInformation.usRxEventOpcode

  SpiResumeSpi();      
  if ((tSLInformation.usRxEventOpcode == 0) && (tSLInformation.usRxDataPending == 0))
  {
    return NULL;
  }	
  // continue while(1), interrupts enabled

I am still not seeing itā€¦

hci_command_send() -> SimpleLinkWaitEvent()->sets data and op code ---->hci_event_handler()
ISRā€¦^ā€¦^ā€¦^ā€¦^ā€¦^

Anywhere ^ is safeā€¦

If the opcode matches the

		if (usReceivedEventOpcode == tSLInformation.usRxEventOpcode)
				{
					tSLInformation.usRxEventOpcode = 0;
				}

will clear the opcode and no matter what set usEventOrDataReceived (DATA or EVENT) then usRxDataPending is cleared before SpiResumeSpi is called.

The code exits the loop, even at that point there will be no ISR corresponding to a new Sent. The only possible thing is an unsolicited event and it does not touch tSLInformation.usRxEventOpcode and will be processed by the loop and exit

Thanks @david_s5
I donā€™t have a wireless sniffer, but Iā€™ve heard that a mac book pro can do that natively. Iā€™ll see if I can have one of my work-colleagues-mac-owners join me on that.

The previous log remained the same, since the Core never got back online.

Iā€™ll gather some more data in the meatime. Do you have a fork link for the firmware you are working on?

Thanks!

yes https://github.com/davids5/ all 4 repos have been updated on my master. The only thing is that application.cpp has test code and tinker.

Two interrupts between hci_command_send() and setting the opcode is what I think is unsafe. The received message won't be handled (unexpected opcode), interrupts are re-enabled and the next interrupt will start overwriting the rx buffer. Then the expected opcode of the first message is written.

There can not be 2 because the call out is done with the ints off. The second can only happen after the SpiResumeSpi() and if it does it will take the hci_unsol_event_handler((char *)pucReceivedData) path and not touch the opcode and exit.

The while(1) loop in hci_event_handler calls SpiResumeSpi()

If the expected opcode does not match, it wonā€™t reset usRxEventOpcode, and it will fall through to SpiResumeSpi() below, and continue at the top of the while(1)

if (usReceivedEventOpcode == tSLInformation.usRxEventOpcode)
				{
					tSLInformation.usRxEventOpcode = 0;
				}
			SpiResumeSpi();
			
			// Since we are going to TX - we need to handle this event after the 
			// ResumeSPi since we need interrupts
			if ((*pucReceivedData == HCI_TYPE_EVNT) &&
					(usReceivedEventOpcode == HCI_EVNT_PATCHES_REQ))
			{
				hci_unsol_handle_patch_request((char *)pucReceivedData);
			}
			
			if ((tSLInformation.usRxEventOpcode == 0) && (tSLInformation.usRxDataPending == 0))
			{
				return NULL;
			}

I got some more logs of the ā€œgood_32_sec_inactivity_core-firmwareā€ It hung with just 1 red blink between SOS this time.

Thank you it shows we still have a problem with the code crashing in a Hardfault (1 blink)
Here is what the codes mean in build you have.

1 (Faults,RGB_COLOR_RED,HardFault) 1
2 (Faults,RGB_COLOR_RED,NMIFault)
3 (Faults,RGB_COLOR_RED,MemManage)
4 (Faults,RGB_COLOR_RED,BusFault)
5 (Faults,RGB_COLOR_RED,UsageFault)

6 (Cloud,RGB_COLOR_RED,InvalidLenth)

7 (System,RGB_COLOR_RED,Exit)
8 (System,RGB_COLOR_RED,OutOfHeap)
9 (System,RGB_COLOR_RED,SPIOverRun)

10 (Softare,RGB_COLOR_RED,AssertionFailure)
11 (Softare,RGB_COLOR_RED,InvalidCase)

Also you had 629 connect attempts with 135 timeouts in the 3.17 hrs it ran

Thank for the data.

Update:
In another news, @AndyW and I have been genreating debug logs for the CC3000 and sending it TIā€™s way. We also preped a Core with the debug lines exposed and shipped it to their tech support group. We also now have a direct email contact with the tech support group that has allowed us to bypass the TIā€™s forums.

The process is a little slow since TIā€™s uses proprietry tools to demystify the logs, but we are on it!

In case you are curious to know how we manged to sneek in the debug lines on the Core:

The calipers were used just to lift the module :slight_smile: The yield rate for the process has been a low at 50% but we have enough of these now to keep continuing.

Thanks to everyone who has been patient and contributing to this thread!

2 Likes

Amazing! Wow! You guys rock!

1 Like