Bug bounty: Kill the 'Cyan flash of death'

@Stevie are you sure you updated to the very latest from TI? What process did you use to update? if applied correctly, this should no longer be an issue.

@Zach, I took the CC3000 patch programmer from you, downloaded the newest release (1.13.7) from the TI site, unpacked, and overwrote the wlan_drv_patch, drv_length, fw_patch, and fw_length variables in your installer with those from TI. Then I uploaded the patcher to the Spark as described on your pages and executed. So it should be okay, but of course I am never 100% sure of anything :-). I will try again to see if that helps. Is there a way to check the CC3000 firmware version and dump to serial?

nvmem_read_sp_version() will give you the service pack version that the CC3000 is using. It should be 1.28 if it is patched properly.

Thanks again. Yes, it outputs 1.28.

Here is an update: Today my core actually stayed up all the time. Apparently it is now stable unless you do the following: If an accepted connection is waiting for receiving data and then the connection is closed from the other side, then the core gets stuck, it won’t receive any more data. I will analyze a bit more, but it seems this was the problem last time I tried: The browser was connected to the core and keeping the connection alive. Then I closed down my computer and when I later tried again the core ided.

Thanks for the update! Interesting failure mode; would you be willing to share the code that causes this failure? Would be helpful for us to dig in further.

1 Like

Hi Zach, I am working on making a real small reproducer for the problem. Will post the code once I am there.

Hi Zach,

please find below a reproducer application. I have tested this on the core with the newest version from TI. Connect to the code via telnet and press enter twice (tried from my Mac). Then the core will send ā€œokayā€ and close the connection. So far so good. If you connect and then do not press enter twice but instead close the telnet connection, then the problem begins for me.

Usually the output will be like this:

Start of loop
No socket 291
Start of loop
No socket 292
Start of loop
No socket 293
Start of loop
Accepted
Received 2
Received 2
Closed
Start of loop
No socket 294
Start of loop
No socket 295

But if I close the telnet when it is connected but before typing enter it looks like this:

Start of loop
No socket 298
Start of loop
No socket 299
Start of loop
Accepted
Received 1
Received 120
Received -1
Received negative bytes
Done closing socket
Start of loop
No socket 300
Start of loop
No socket 301

Note that the ā€œReceived 1ā€ followed by ā€œReceived 120ā€ is obvious nonsense. It should be ā€œReceived -1ā€ immediately. The ā€œReceived -1ā€ appears more than 20 seconds or so later. From then one the whole core is extremely sluggish, each new line only appears after about 20 seconds. After a while the core then loses the connection to the cloud and reconnects - probably your timeout code. After that it works again. If I disable the cloud then I have to reset the core to make it work again.

Here is the code:

#include "application.h"
//#include "spark_disable_cloud.h"

int g_counter = 0;
int g_listenSocket = -1;

void setup()
{
    Serial.begin(9600);

    g_listenSocket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (g_listenSocket < 0)
        return;
    long optval = SOCK_ON;
    if (setsockopt(g_listenSocket, SOL_SOCKET, SOCKOPT_ACCEPT_NONBLOCK, &optval, sizeof(optval)) < 0)
        return;

    sockaddr listenAddress;
    listenAddress.sa_family = AF_INET;
    listenAddress.sa_data[0] = 0;
    listenAddress.sa_data[1] = 80;
    listenAddress.sa_data[2] = 0;
    listenAddress.sa_data[3] = 0;
    listenAddress.sa_data[4] = 0;
    listenAddress.sa_data[5] = 0;
    if (bind(g_listenSocket, (sockaddr*)&listenAddress, sizeof(listenAddress)) < 0)
        return;
    if (listen(g_listenSocket, 0) < 0)
        return;
}


void loop()
{
    Serial.println("Start of loop");
    sockaddr clientAddress;
    socklen_t addressLength = sizeof(clientAddress);
    int fd = accept(g_listenSocket, (sockaddr*)&clientAddress, &addressLength);
    if (fd < 0)
    {
        Serial.print("No socket ");
        Serial.println(g_counter++);
        delay(500);
        return;
    }

    Serial.println("Accepted");
    char inBuffer[512];
    inBuffer[0] = '\0';
    int readOffset = 0;
    while (!strstr(inBuffer, "\r\n\r\n"))
    {
        int readBytes = recv(fd, inBuffer + readOffset, sizeof(inBuffer) - readOffset - 1, 0);
        Serial.print("Received ");
        Serial.println(readBytes);
        if (readBytes < 0)
        {
            Serial.println("Received negative bytes");
            closesocket(fd);
            Serial.println("Done closing socket");
            fd = -1;
            return;
        }
        readOffset += readBytes;
        inBuffer[readOffset] = '\0';
    }

    send(fd, "okay\n", 5  , 0);
    delay(50);
    closesocket(fd);
    Serial.println("Closed");
}

A few comments. I have just upgraded my Spark Cores to 1.28, and things seem more reliable.

  • If you follow the instructions on GitHub link, using the GIT command line at step 2 works. Downloading the ZIP files at step 2 [method 2] leaves you with version 1.24. I think the zip files need to be upgraded.
  • I am pretty sure that the magenta does not flash at 200 mSec during update as noted in the readme either. I am sure it is a lot slower.
  • It would be great if there was the option to send this firmware over the air pre-compiled, even if you have to acknowledge that physical action is needed afterwards.
  • Under 1.28, I am still getting the Cyan flash of death.
  • I am just wondering - how stable is the back end infrastructure? At times I have had two devices disconnect simultaneously, with a third remaining connected. One possible explanation is that there are multiple back end servers and the server two devices were connected to had a a minor issue

Hope this all helps… Darryl

Thanks, good feedback!

Does it turn on/off every 200ms? That would be a 400ms "flash".

We're working on that, will be released next week! That's why we haven't really made announcements on it yet.

Under what circumstances? Can you provide more info? We've generally seen CFOD disappear under any reasonable circumstances, but if there are edge cases we'd love to fix them.

Back-end infrastructure is pretty stable; there are times when we reset the server and all devices will disconnect and reconnect, but it's quite infrequent. We're not yet at the point where we can guarantee every message gets through, but we're getting pretty close. We have, however, seen behaviors where two Cores on the same network will interact, perhaps because of how the router is handling DHCP leases. Have you tried them on two different routers, and do you see the same behavior?

Hi All

Today I have been monitoring three devices on my bench, and logging when they have a CFOD. To be clear, the CFOD is when the device does a fast flash for a period, and is therefore offline for about a minute. All these devices are running 1.28. The times (UTC+10) so far for CFOD are:

9:54AM - Opal
10:27AM - Relay
10:28AM - Red
10:48AM - Opal
11:23AM - Opal
11:29AM - Opal
12:10PM - Opal
12:28PM - Relay

As can been seen, Opal is far less reliable than the other two. This may have to do with the code running in each. Opal is being monitored every 60 seconds through a long (120 byte from memory) string, and also has serial data coming through constantly. Relay is being monitored through a single integer every five minutes, and has some custom timer code. Red is connected to a TCP server, and is sending strings every 7 minutes but that is all.

From this, it would appear that there is a correlation with the number of network requests (or load on the device) and the likelihood of a CFOD.

My infrastructure is that these units are connected to an Apple Time Capsule. They are about 1m from the AP. They are the only devices running on 2.4 GHz - I did have unified naming for both bands but now changed naming so that devices are on 5 GHz if possible. The Time Capsule is in bridge mode, with DHCP being offered by the Home Automation server.

The DHCP logs are below... They come from udhcpd running on a linux box.

Jun 23 08:27:09 lights udhcpd[1105]: Sending OFFER of 192.168.1.59 Relay
Jun 23 08:28:30 lights udhcpd[1105]: Sending OFFER of 192.168.1.53 Red
Jun 23 08:46:08 lights udhcpd[1105]: Sending OFFER of 192.168.1.52 Opal
Jun 23 09:55:21 lights udhcpd[1105]: Sending OFFER of 192.168.1.52 Opal
Jun 23 10:27:44 lights udhcpd[1105]: Sending OFFER of 192.168.1.59 Relay
Jun 23 10:29:11 lights udhcpd[1105]: Sending OFFER of 192.168.1.53 Red
Jun 23 10:49:49 lights udhcpd[1105]: Sending OFFER of 192.168.1.52 Opal
Jun 23 11:25:04 lights udhcpd[1105]: Sending OFFER of 192.168.1.52 Opal
Jun 23 11:31:14 lights udhcpd[1105]: Sending OFFER of 192.168.1.52 Opal
Jun 23 12:12:00 lights udhcpd[1105]: Sending OFFER of 192.168.1.52 Opal
Jun 23 12:28:23 lights udhcpd[1105]: Sending OFFER of 192.168.1.59 Relay

These logs tend to suggest that I have missed some disconnections with my manual logging.

I don't know if any of this helps

Darryl

2 Likes

OK. I have traced this down a bit further. It seems that there is an edge case with the Serial port. I am not sure if this is specifically with the serial port, or just with delays induced by the serial port. My code how it appears is something like:

void loop() {
  while (Serial1.available()){
    // Turn D7 on
    ... Do something
    // Turn D7 off
  }
  delay (100);
}

The serial data comes in at 9600 bps and this data is repeated about every three seconds.

0,58,128,233,0.55,0.46
1,54,77,233,0.33,0.70
2,78,99,231,0.43,0.79
3,37,144,232,0.62,0.26
4,166,245,232,1.06,0.67
5,41,103,233,0.44,0.40
6,54,146,233,0.63,0.37
7,49,94,232,0.41,0.52

Trying to chase things down, I unplugged the serial line, and the device operated much more reliably. Overnight with the serial plugged in, the device probably reset itself 20-30 times. Once the serial was removed, it became a lot more stable.

Concerned that the serial code might be taking up too much CPU time, I added the code to turn D7 on whilst processing Serial code. What I found was that the LED flashes on very quickly. [Bitscope results below]

I would be suggesting my code is bad next, but the resets are not coincident with the red SOS message.

I have put the D7 LED on the Saleae16 bitscope and found that the serial code is running for about three lots of 25 mSec every three seconds or so, followed by 2-3 uSec pulses every 100 mSec.

I am just wondering if the problem might be the occasional joint interrupt between the serial port and the CC3000, where the CC3000 interrupt is lost.

Not sure if this qualifies as a an ā€˜edge case’ or even if it is the CFOD but I suspect it is related.

For days I have had my core resetting itself randomly, 10 or 12 times overnight. I used a counter in the start() section to update a value in EEPROM to measure this. I then finally discovered this thread and updated the CC3000 according to the instructions above and thought I had fixed it, but now I get a different behaviour. Every few hours the code just freezes for 100 secs and then continues on it’s way - no reset. I only witnessed it once and the LED was normal and breathing cyan, the D7 LED (which is used in my code) was frozen in the ON state as were all the outputs - normally toggling on or off every few seconds. Eventually the whole core went dark (no LEDs) or outputs for 30 secs (I think) and then came back to normal via solid cyan -> breathing cyan.

I have inserted some extra code to measure how long one of my longer loops actually took (it is normally 150 to 4000ms) and if it exceeds that time I call it a freeze and record the time it actually took. I’ve recorded 3 freezes in 8 hours and they are 100 seconds long, give or take a few secs due to the variable length loop I am using. I also record the network.RSSI() signal strength thinking my wifi might be the problem and it shows as +2dB at the conclusion of the freeze, whereas it normally runs at -50 to -60dB - impossible? I’m using a netcomm myzone 3G modem/router but the same problem has happened with a different router. It maybe that it was also exhibiting this behaviour, although much more frequently, before the CC3000 patch but resetting itself as part of that.

Not sure if there are any answers to this one or if the ā€˜official’ TI patch will fix it. Fingers crossed. The core is going in to a kinetic artwork and it won’t look so good in the gallery if the artwork locks up every couple of hours. I have to decide in the next few days whether to replace the core with an Arduino Uno, but that will require me visiting the gallery every day to update the code.

Fabulous product BTW, and I’m thinking of more uses for it all the time. It just needs to be reliable.

3 Likes

Not sure if you have been following the udp issues but simple programs that just send a udp packet every 200 ms CFOD as well. UDP issues and workarounds - #13 by SomeFixItDude - Troubleshooting - Particle Even if you take the cloud out of the picture the user loops stops running in that issue that I linked to. I have been eagerly awaiting some sort of spark core update to the core's firmware to make it more stable... The core is pretty much useless if it can't stay up for more than an hour. Like AndrewS' case I too need remote programming and the spark core looked like the answer. These stability problems require me to have someone around to reset the core every few hours.

Couldn't agree more with that statement. I don't want to seem like I am just adding to the complaints... Just really hoping to see things get fixed. Please make this awesome product more stable.

2 Likes

I’m starting to look at the stability issues again. This is writing out to a TCP socket as quickly as possible.

(gdb) bt
#0  0x0800b5e4 in Delay_Microsecond (uSec=<optimized out>, uSec@entry=100000) at ../SPARK_Firmware_Driver/src/hw_config.c:260
#1  0x0800c9c8 in panic_ (code=<optimized out>) at ../SPARK_Firmware_Driver/src/panic.c:62
#2  0x08008674 in HardFault_Handler () at ../src/stm32_it.cpp:88
#3  <signal handler called>
#4  0x080074cc in TCPClient::available (this=0x0, this@entry=0x20000328 <client>) at ../src/spark_wiring_tcpclient.cpp:150
#5  0x08005b48 in loop () at ../src/application.cpp:67
#6  0x08005d9c in main () at ../src/main.cpp:199

The application is

void loop()
{
  int now = Time.now();

  if(now - last_status >= 5)
  {
    last_status = now;
    Serial.print("Now: ");
    Serial.println(now);

    Serial.print("Received: ");
    Serial.println(received);

    // Prints out the network parameters over Serial
    Serial.println(Network.SSID());
    Serial.println(Network.gatewayIP());
    Serial.println(Network.subnetMask());
    Serial.println(Network.localIP());
  }

  if (client.connected()) {

    server.print(millis());
    server.print(" ");
    server.println(sent);

    sent++;

    while (client.available()) {
      received++;
      sent++;
      server.write(client.read());
    }
  } else {
    client = server.available();
  }
}

Client state:

(gdb) frame 5
#5  0x08005b48 in loop () at ../src/application.cpp:67
67          while (client.available()) {
(gdb) print client
$5 = {<Client> = {<Stream> = {<Print> = {_vptr.Print = 0x8015270 <vtable for TCPClient+8>, write_error = 0}, _timeout = 1000,
      _startMillis = 1073756172}, <No data fields>}, static _srcport = <optimized out>, _sock = 2,
  _buffer = "\377\373%\377\375\003\377\373\030\377\373\037\377\373 \377\373!\377\373\"\377\373'\377\375\005\377\375%\377\373\003\377\376\030\377\376\037\377\376 \377\376!\377\376\"\377\376'\377\374\005\000 Q\264\000\b\000\000\000\000\000\000\000\000\005\000\000\000`O\000 \000\000\000\000\371\377\377\377h\341\002\000\000\000\000\000\005\000\000\000\000\000\000\000`O\000 \b\020\000\000`O\000 ɖ\000\b\001\000\000\000\210\026\000 6\000\000", _offset = 0, _total = 0}

(gdb) print server
$6 = {<Print> = {_vptr.Print = 0x8015350 <vtable for TCPServer+8>, write_error = 0}, _port = 23, _sock = 1,
  _client = {<Client> = {<Stream> = {<Print> = {_vptr.Print = 0x8015270 <vtable for TCPClient+8>, write_error = 0},
        _timeout = 1000, _startMillis = 1073756172}, <No data fields>}, static _srcport = <optimized out>, _sock = 2,
    _buffer = "\020\000\000\000\025\000\000\000\000\000\000\000\200", '\000' <repeats 16 times>, "\060\000\000\000\000\000\000\025\000\000\000p\026\000 \001\000\000\000\034\026\000 \224O\000 Q\264\000\b\000\000\000\000\000\000\000\000\005\000\000\000`O\000 \000\000\000\000\371\377\377\377h\341\002\000\000\000\000\000\005\000\000\000\000\000\000\000`O\000 \b\020\000\000`O\000 ɖ\000\b\001\000\000\000\210\026\000 6\000\000", _offset = 0, _total = 0}}
1 Like

Ok, so a basic UDP echo server on my laptop will induce a CFOD quickly. Every time this happens, it

Breakpoint 1, HardFault_Handler () at ../src/stm32_it.cpp:88
88              PANIC(HardFault,"HardFault");
(gdb) bt
#0  HardFault_Handler () at ../src/stm32_it.cpp:88
#1  <signal handler called>
#2  UINT32_TO_STREAM_f (p=0xd "\005", p@entry=0xa "", u32=<optimized out>) at ../CC3000_Host_Driver/cc3000_common.c:88
#3  0x0800cc14 in simple_link_send (sd=<optimized out>, buf=0x20004fc0, len=4, flags=0, to=to@entry=0x200002b2 <udp+30>,
    tolen=tolen@entry=16, opcode=opcode@entry=131) at ../CC3000_Host_Driver/socket.c:1044
#4  0x0800ccc2 in sendto (sd=<optimized out>, buf=<optimized out>, len=<optimized out>, flags=<optimized out>,
    to=0x200002b2 <udp+30>, tolen=16) at ../CC3000_Host_Driver/socket.c:1132
#5  0x08008c52 in UDP::write (this=0x20000294 <udp>, buffer=0x20004fc0 "\221\257", size=4) at ../src/spark_wiring_udp.cpp:147
#6  0x08005c7e in loop () at ../src/application.cpp:49
#7  0x08005f7e in main () at ../src/main.cpp:199

Note that the tSLInformation pucTxCommandBuffer pointer is now NULL!

(gdb) print tSLInformation.pucTxCommandBuffer
$8 = (UINT8 *) 0x0

It looks like wlan_stop() is being called somewhere which sets the buffer to NULL (really no reason to, it’s not ā€˜deallocating’, but the application loop is regaining control and attempting to use a socket.

We can see that wlan is currently stopped, and SPARK_WLAN_Loop() won’t start it again until the next iteration, since the check for the flag is in an else block.

(gdb) print SPARK_WLAN_STARTED
$9 = 0 '\000'

As a test, I added

  if(SPARK_WLAN_STARTED == 0)
    return;

to loop() before touching the UDP sockets.

Now, I get CFOD after about 30 seconds,but no HardFault. Unfortunately the UDP sockets stop working, although the core went back to breathing cyan, and SPARK_WLAN_STARTED == 1

More digging…

1 Like

Has anyone had success soldering jumpers onto the STM32? I’m tempted to wire the SPI/interrupt pins up to my saleae Logic

Edit: Looks like SCLK, MISO, MOSI can be had on the SST25 :slight_smile: Interrupt on pin 22 and CS on 25 (which is on a corner).

Yes. It is pretty easy to bring out the spi bus. I have some python scripts to decode the spi decoded output from a logic into cc3000 speak. Crude but useful.

I think any overrun problems are caused by the buffer management code. We should be able to write a socket at full speed and have the system throttle to match the buffer availability. All these TCP & UDP bug reports suggest that is broken in a major way.

Here’s the last sent command (HCI_CMND_CONNECT). The CC3000 does not respond to this. One interesting note is the payload is not zero padded to an even number. Edit: Still unclear if the payload length should be even, or the whole packet? (5 byte header + payload)

There is no enforcement of NULL termination after the memcpy() of the domain before gethostbyname in Spark_Connect()