Bug bounty: Kill the 'Cyan flash of death'

So the TI CC3000 Serial Port Interface (SPI) docs say:

"Example Implementation of SPI
..... The second implementation is based on the LM3S9B96 Stellaris, which uses DMA for data transfer to or from the SPI.
.....
A limitation of the LM3S9B96 data buffer prevents more than 1024 bytes from being copied in one transaction of the DMA from RAM to the SPI FIFO; thus, two DMA transactions are required."

The STM32F10x DMA count register is NDT[15:0]: number of data items to be transferred (0 up to 65535).

Time for a rewrite. But first, a test to prove it is broken.


I’m going to see if @satishgn has any inputs on this.

Prototype: void SpiReadData(unsigned char *data, unsigned short size);

I just looked at this: evnt_buff is a pointer, so +10 just advances it 10 bytes into the pRxPacket of struct sSpiInformation. That would be just past the header information... so it's pointing at actual data now.

This one is setting the pointer to 1034 bytes into the pRxPacket ... let's see if we can figure this out...

Here's what I think is happening:

enter void SPI_DMA_IntHandler(void)

data_to_recv = 0; clears the count.

STREAM_TO_UINT16((char *)(evnt_buff + SPI_HEADER_SIZE), HCI_DATA_LENGTH_OFFSET, data_to_recv); data_to_recv gets set to a new 16bit unsigned int value of data length. evnt_buff points to the Rx stream, and HCI_DATA_LENGTH_OFFSET is the offset into the stream where the Rx data length is stored.

// Read the last portion of data
//
//
// We need to read the rest of data..
//
data_to_recv -= SPI_WINDOW_SIZE;

This suggests data_to_recv is already larger than SPI_WINDOW_SIZE (1024) so it's trying to get the last portion of data that is larger than 1024. evnt_buff = sSpiInformation.pRxPacket coming in here, so I'm guessing at this point the Rx buffer is filled up to 1024 plus 10 for header info.

if (!((HEADERS_SIZE_EVNT + data_to_recv) & 1))
{
	data_to_recv++;
}

Some kind of error correction...

SpiReadData(sSpiInformation.pRxPacket + 10 + SPI_WINDOW_SIZE, data_to_recv);
and then the good part: we start reading the rest of the data at the Rx start address + 10 for header info + 1024 from the first portion of the read. And we tack data_to_recv more bytes (that 16-bit length value) onto the end of the Rx buffer.

At least that's pretty close to what I think it's doing.


You have summed it up well, except for:

That bit ensures that the number of bytes read from the cc3000 is even.

Since sSpiInformation.pRxPacket and the alias evnt_buff both point to wlan_rx_buffer, and that buffer is 1024 bytes, there would be a 10-byte and a 1034-byte overwrite. I was about to prove it when the power went out here.....


I see what you are saying, but the code structure suggests the SPI Rx needs to accommodate transfers larger than 1024. If that's true, then SPI_BUFFER_SIZE might need to be set to 2048. Another place in the code directly compares data_to_recv to SPI_WINDOW_SIZE:

if (data_to_recv >= SPI_WINDOW_SIZE)
{
	data_to_recv = eSPI_STATE_READ_FIRST_PORTION;
	SpiReadData(evnt_buff + 10, SPI_WINDOW_SIZE);
	sSpiInformation.ulSpiState = eSPI_STATE_READ_FIRST_PORTION;
}

You can see it does a direct 1024-byte read in the code above and sets the state to eSPI_STATE_READ_FIRST_PORTION... so when it gets to the code you say causes the 1034 overwrite, it's trying to read the rest of the data... sounds like SPI_WINDOW_SIZE needs to be larger.

Some of the comments in there don't make sense either, like they might have been copy/pasted… grr…

You guys seem to be knee-deep in this, but just thought I'd add:

I’ve been running a test with the latest cloud ide firmware. Once a second I send a simple udp packet to a node server, which prints it to the screen. It also flashes the D7 LED.
If I call Spark.disconnect() initially, after about an hour the node server stops receiving the packets; however, the Spark code is still executing (D7 flashing once a second).

I’m sure you’re already aware of this given the TI post, but my experience is with UDP datagrams rather than TCP sockets, which is what is mentioned in the TI post.

Also, I read that as a temporary fix, the WiFi chip is reset if a disconnect occurs. Is this only detected when a connection to the cloud is active (i.e., Spark.disconnect() hasn't been called)?

Edit: Just to add that if I do not call Spark.disconnect(), after about 1-2 hours the core reconnects to the cloud when the connection drops, but UDP no longer works until after a reset.

That note is saying that the LM3S9B96 has a broken DMA engine, not that the CC3000 has a limitation.

I am aware of no errata about the STM32F103 that would suggest we need this handling in the core firmware.

I have traces taken directly off the SPI bus using a logic analyser that are decoded into CC3000-speak, and there are no protocol issues that I can see on the wire. If this damn forum would allow the uploading of arbitrary files, I could share these with you for analysis & confirmation, but given that they are not a jpg, you’ll just have to trust me.

Oh - and I’m waiting on TI to say whether we can gather their debug dumps without error yet.
The turnaround time on the TI forum for non-trivial questions is running over 24 hours right now.

Here is the thing. The code as it was inherited from TI had a notion of SPI_WINDOW_SIZE (1024) as the maximum the DMA on the TI LM3S9B96 could handle. It has nothing to do with the CC3000!

TI originally had compile time constants CC3000_RX_BUFFER_SIZE and CC3000_TX_BUFFER_SIZE that statically allocated wlan_rx_buffer and wlan_tx_buffer based on the #ifdef chain found below for reference.

The defines took into account all the constants floating around in the TI code for buffer manipulation (and the infamous 10). (The code looks to have been written by a very junior person. It is a real mess, in that it does not use struct or sizeof operations, just lots of #defines. Arrgh.)

So the first thing: the code we have been discussing does not belong on the STM32F103. It is overly complex in order to handle the limited DMA capabilities of TI's LM3S9B96, none of which is needed on the STM32.

Secondly, the code uses pointers to arrays of bytes but does not size-constrain them properly. It is just a bad implementation on TI's part. (In a classic receive side of a communication stack, the buffer is allocated at the bottom and each layer reserves its fields, exposing to the layer above only a pointer past its own fields.) TI did not do this, and the results are predictable.

The Spark team chose 1024 as the size for CC3000_RX_BUFFER_SIZE and CC3000_TX_BUFFER_SIZE.
The problem is the low-level driver will overwrite memory if you make a recv call asking for more than 1014 bytes.

Mostly, the Spark protocol will not do large reads, and the buffers for TCP and UDP clients are now set at 512. Also, the linker (with my tools, CodeSourcery) has graciously placed wlan_tx_buffer right after wlan_rx_buffer, so the overwrite ends up in the tx buffer, but I have verified the overwrites do happen.

The Spark cc3000 code inherited from TI has too many buffers and interdependencies on the sizes of the buffers. This should be cleaned up to make the platform rock solid.

    //TX and RX buffer sizes - allow to receive and transmit maximum data at length 8.
#ifdef CC3000_TINY_DRIVER
#define TINY_CC3000_MAXIMAL_RX_SIZE 44
#define TINY_CC3000_MAXIMAL_TX_SIZE 59
#endif
//Defines for minimal and maximal RX buffer size. This size includes the spi header and hci header.
//The maximal buffer size derives from:
// MTU + HCI header + SPI header + sendto() args size
//The minimum buffer size derives from:
// HCI header + SPI header + max args size
//
// This buffer is used for receiving events and data.
// Note that the maximal buffer size as defined here is a maximal at all
// The packet can not be longer than MTU size and CC3000 does not support
// fragmentation. Note that the same buffer is used for reception of the data and
// events from CC3000. That is why the minimum is defined.
// The calculation for the actual size of buffer for reception is:
// Given the maximal data size MAX_DATA that is expected to be received by application, the required buffer is:
// Using recv() or recvfrom():
// max(CC3000_MINIMAL_RX_SIZE, MAX_DATA + HEADERS_SIZE_DATA + fromlen + ucArgsize + 1)
// Using gethostbyname() with minimal buffer size will limit the host name returned to 99 bytes only.
// The 1 is used for the overrun detection
#define CC3000_MINIMAL_RX_SIZE      (118+1)
#define CC3000_MAXIMAL_RX_SIZE      (1519 + 1)

/*Defines for minimal and maximal TX buffer size.
  This buffer is used for sending events and data.
  The packet can not be longer than MTU size and CC3000 does not support
  fragmentation. Note that the same buffer is used for transmission of the data
  and commands. That is why the minimum is defined.
  The calculation for the actual size of buffer for transmission is:
  Given the maximal data size MAX_DATA, the required buffer is:
  Using Sendto():

   max(CC3000_MINIMAL_TX_SIZE, MAX_DATA + SPI_HEADER_SIZE
   + SOCKET_SENDTO_PARAMS_LEN + SIMPLE_LINK_HCI_DATA_HEADER_SIZE + 1)

  Using Send():

   max(CC3000_MINIMAL_TX_SIZE, MAX_DATA + SPI_HEADER_SIZE
   + HCI_CMND_SEND_ARG_LENGTH + SIMPLE_LINK_HCI_DATA_HEADER_SIZE + 1)

  The 1 is used for the overrun detection */

#define  CC3000_MINIMAL_TX_SIZE      (118 + 1)
#define CC3000_MAXIMAL_TX_SIZE      (1519 + 1)

//In order to determine your preferred buffer size, change CC3000_MAXIMAL_RX_SIZE and CC3000_MAXIMAL_TX_SIZE
//to a value between the minimal and maximal specified above.
// Note that the buffers are allocated by SPI. In case you change the size of those buffers, you might also need
// to change the linker file, since for example on MSP430 FRAM devices the buffers are allocated in the FRAM section
// that is allocated manually and not by IDE. Thus its size changes...

#ifndef CC3000_TINY_DRIVER

    #define CC3000_RX_BUFFER_SIZE   (CC3000_MAXIMAL_RX_SIZE)
    #define CC3000_TX_BUFFER_SIZE   (CC3000_MAXIMAL_TX_SIZE)
    #define SP_PORTION_SIZE         512

//if defined TINY DRIVER we use smaller rx and tx buffer in order to minimize RAM consumption
#else
    #define CC3000_RX_BUFFER_SIZE   (TINY_CC3000_MAXIMAL_RX_SIZE)
    #define CC3000_TX_BUFFER_SIZE   (TINY_CC3000_MAXIMAL_TX_SIZE)
    #define SP_PORTION_SIZE      32

#endif

YUCK!


I agree with both of you… sounds like low level SPI needs to be looked at and cleaned up to work with the STM32F103 specifically.

You may be right that this stuff should be cleaned up, although we did increase buffer sizes in a previous attempt to resolve this issue:

There was no improvement, so we reverted the change to save RAM. Unfortunately while this does deserve some scrubbing, I think it’s a red herring as it relates to CFOD.

Also, I wanted to share a quick report that I spoke to someone today who used the CC3000 with an Arduino and ran into the same issues we’re seeing here (CC3000 unable to hold a connection forever). This helps to confirm that the issue is internal to CC3000, although it might have to do with implementation of the host drivers that happened to be common to both the Core and his prototype.

Just wanted to say to everyone here - thanks for your patience! We are diligently testing and debugging at the Spark office, while also trying to stay on TI’s case so that they give us some feedback on this.


I think I was able to force the CFOD by doing an HTTP GET of a large page using the TCP client (socket #1) and delaying the read of the results by 15 seconds (not a hard wait: 15000 iterations of loop with a 1 ms delay). The host keeps sending retransmits, and I believe this starves the sparkSocket connection of buffers. This causes the dropped sparkSocket connection and reconnections. With a short aucInactivity (20), the TCP client (socket #1) and the sparkSocket (socket #0) are closed, and when reopened, both ended up being socket #0 (don't ask me how).

SparkProtocol::handle_received_message(void) then read data from the TCP client connection, and the code
int len = queue[0] << 8 | queue[1]; resulted in a length of 0x4854 (the "HT" of "HTTP"). This then overwrote queue, spilling onto

 tNetappIpconfigRetArgs ip_config;

volatile int8_t  WLAN_MANUAL_CONNECT = 0; //For Manual connection, set this to 1
volatile uint8_t WLAN_DELETE_PROFILES;
volatile uint8_t WLAN_SMART_CONFIG_START;
volatile uint8_t WLAN_SMART_CONFIG_STOP;
volatile uint8_t WLAN_SMART_CONFIG_FINISHED;
volatile uint8_t WLAN_SERIAL_CONFIG_DONE;
volatile uint8_t WLAN_CONNECTED;
volatile uint8_t WLAN_DHCP;
volatile uint8_t WLAN_CAN_SHUTDOWN;

void (*announce_presence)(void);

Things really went south then!

Setting aucInactivity to 60 prevented both sockets from reopening at 0, but delays the recovery by 60 seconds.

I think I have a fix for the CFOD, and I would like somebody who is having CFOD to try 2 bin files:
one that uses the TCP client correctly and one with the delay. Then report back their findings. Anyone up for that?

send them along to me and I can put it on 2 different Cores here at my desk.

This is a bad failure, but I do not believe this is the root cause of the majority of CFOD.

There are no errors either on the low-level SPI bus transfers or the STM32 firmware in the classic CFOD failure case.

All,

I am just adding the final touches to error logging and signaling faults via the RGB LED. Here is a preview of the compile-time and run-time controlled debug output.

Define DEBUG_BUILD=y on the make command line and add the following to your Application.cpp:

void debug_output_(const char *p)
{
  static boolean once = false;
  if (!once)
  {
    once = true;
    Serial1.begin(115200);
  }

  Serial1.print(p);
}

The API looks like

LOG("Want %d more cores",count);
WARN("Running %s on cores, only %d more left","Low",count);
DEBUG("connection closed %d",retValue);
ERROR("Flash write Failed @0x%0x",badd_address);
PANIC(HardFault,"Hit Hardfault");

Output looks like:

0003187040:<DEBUG> int Spark_Connect() (639):connet
0003187179:<DEBUG> int Spark_Connect() (641):connected connect=0
0003233369:<DEBUG> void loop1() (164):0 total Bytes Read
0003233374:<DEBUG> virtual void TCPClient::stop() (175):closesocket=134277261
0003233381:<DEBUG> void loop1() (166):connection closed state 4
0003235908:<DEBUG> void loop1() (84):connecting
0003235916:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (61):socket=1
0003236023:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (78):connected=1
0003236037:<DEBUG> void loop1() (97): Send
0003236042:<DEBUG> void loop1() (99): Sent
0003260515:<DEBUG> int Spark_Disconnect() (648):
0003260519:<DEBUG> int Spark_Disconnect() (653):send
0003260524:<DEBUG> int Spark_Disconnect() (656):Close
0003260529:<DEBUG> int Spark_Disconnect() (659):Closed retVal=-57
0003260535:<DEBUG> int Spark_Connect() (617):socket
0003260540:<DEBUG> int Spark_Connect() (619):socketed sparkSocket=0
0003260546:<DEBUG> int Spark_Connect() (639):connet
0003261230:<DEBUG> int Spark_Connect() (641):connected connect=0
0003311025:<DEBUG> void loop1() (164):0 total Bytes Read
0003311030:<DEBUG> virtual void TCPClient::stop() (175):closesocket=134277261
0003311037:<DEBUG> void loop1() (166):connection closed state 4
0003313554:<DEBUG> void loop1() (84):connecting
0003313561:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (61):socket=1
0003313668:<DEBUG> virtual int TCPClient::connect(IPAddress, uint16_t) (78):connected=1
0003313682:<DEBUG> void loop1() (97): Send
0003313687:<DEBUG> void loop1() (99): Sent
0003334431:<DEBUG> int Spark_Disconnect() (648):
0003334435:<DEBUG> int Spark_Disconnect() (653):send
0003334440:<DEBUG> int Spark_Disconnect() (656):Close
0003334445:<DEBUG> int Spark_Disconnect() (659):Closed retVal=-57
0003334451:<DEBUG> int Spark_Connect() (617):socket
0003334456:<DEBUG> int Spark_Connect() (619):socketed sparkSocket=0
0003334462:<DEBUG> int Spark_Connect() (639):connet

Once this is done I will make the 2 bin file available for testing


Totally agree with the comments here, low-level handling is ripe for overhaul.

There are things that can be done to make bad things happen because of poor structure/coding of the current underlying TI driver. However, until we eradicate the more serious show stoppers, these will likely fall into the category of "if it hurts, stop doing that".

Looks very cool and useful @david_s5! I've noticed issues with lots of debugging with Serial and Serial1 though, so to make sure you see all of the logs you might want to add this to the end of your debug_output_():

Still haven't heard if this works yet, but in theory it should

Files are ready for testing http://nscdg.com/spark/good-and-bad-core-firmware.zip
This FW will write to Serial1 (TX/RX, 3.3V).

good uses the TCPClient to do an HTTP GET and then reads the data, 4800 bytes or so; the pattern is "0.1.2.3. .998.999." (The output is truncated on the serial port.)

bad does the same thing but does not read the data, causing the CC3000 to become buffer-depleted.

If there is a hard crash the device will issue an SOS on the red LED
... --- ... [code blink] ... --- ...

Please count the code blinks and report back whether the CFOD occurs, and which codes you see.

@BDub and @timb, you guys always inspire me. I liked your idea and put this together -

/*
 * Let's get the WiFi information
 */

//*****************************************************************************
//
//! wlan_ioctl_get_scan_results
//!
//! @param[in] scan_timeout parameter not supported
//! @param[out] ucResults scan result (_wlan_full_scan_results_args_t)
//!
//! @return On success, zero is returned. On error, -1 is returned
//!
//! @brief Gets entry from scan result table.
//! The scan results are returned one by one, and each entry
//! represents a single AP found in the area. The following is a
//! format of the scan result:
//! - 4 Bytes: number of networks found
//! - 4 Bytes: The status of the scan: 0 - aged results,
//! 1 - results valid, 2 - no results
//! - 42 bytes: Result entry, where the bytes are arranged as follows:
//!
//! - 1 bit isValid - is result valid or not
//! - 7 bits rssi - RSSI value;
//! - 2 bits: securityMode - security mode of the AP:
//! 0 - Open, 1 - WEP, 2 WPA, 3 WPA2
//! - 6 bits: SSID name length
//! - 2 bytes: the time at which the entry has entered into
//! scans result table
//! - 32 bytes: SSID name
//! - 6 bytes: BSSID
//!
//! @Note scan_timeout, is not supported on this version.
//!
//! @sa wlan_ioctl_set_scan_params
//
//*****************************************************************************

long err = wlan_ioctl_get_scan_results(0, ucResults);
int _numEntry = ((uint8_t) ucResults[3] << 24) | ((uint8_t) ucResults[2] << 16) | ((uint8_t) ucResults[1] << 8) | ((uint8_t) ucResults[0]);

if (err == 0 && _numEntry > 0) {
/*
* Lets decode the status of this entry
*/
int _stat = ((uint8_t) ucResults[7] << 24) | ((uint8_t) ucResults[6] << 16) | ((uint8_t) ucResults[5] << 8) | ((uint8_t) ucResults[4]);

  /*
   * Lets test if the table entry is valid
   */
  bool _valid = (uint8_t) ucResults[8]  & 0x1;
  /*
   * Lets decode the RSSI value
   */
  int _rssi = (uint8_t) ucResults[8] >> 1;
  /*
   * Decode the security mode (the low 2 bits of byte 9)
   */
  int _mode = (uint8_t) ucResults[9] & 0x3; // the original's "| 0xC0" was a no-op under the & 0x3 mask
  /*
   * Decode the length of the SSID
   */
  int _ssidlen = (uint8_t) ucResults[9] >> 2;
  char ssid[33]; // 32-byte max SSID + 1 for the null terminator
  int idx = 0;
  while(idx < _ssidlen) { // Copy over the string using its length
  	ssid[idx] = ucResults[idx+12];
  	idx++;
  }
  ssid[_ssidlen] = '\0';  // Let's make sure we null terminate the string
  Serial.print("\nWiFi scan status: ");
  switch (_stat) {
  	case 0:
  		Serial.print("aged, ");
  		break;
  	case 1:
  		Serial.print("valid, ");
  		break;
  	case 2:
  		Serial.print("no results, ");
  		break;
  }
  Serial.print(_numEntry);
  Serial.print(" nets found. ");
  Serial.print(ssid);
  if (_valid)
  	Serial.print(" is valid, RSSI: ");
  else
  	Serial.print("not valid, RSSI: ");
  Serial.print(_rssi);
  Serial.print(", mode: ");
  switch (_mode) {
  	case 0:
  		Serial.println("OPEN");
  		break;
  	case 1:
  		Serial.println("WEP");
  		break;
  	case 2:
  		Serial.println("WPA");
  		break;
  	case 3:
  		Serial.println("WPA2");
  		break;
  }
    delay(100);
} else {
	Serial.print("\n");
	delay(100);
}

netapp_ipconfig(&ipconfig);
Serial.print("Connected to ");
Serial.print((char*)ipconfig.uaSSID);

The only thing I have not done is add a while loop to read all of the APs in the table. I only have one, so it did not matter to me.

@mohit is doing most of the physical testing at Spark HQ — Mohit, at some point in this sprint, please try the binaries posted by @david_s5 at

http://nscdg.com/spark/good-and-bad-core-firmware.zip

and report to this thread.

Keep up the heavy debugging all; exactly what we need right now! Everyone grab your shovels!
