Unwanted but reproducable disconnect in TCPClient

mdma · June 22, 2014, 4:20am

I’ve ported over the webduino webserver to the spark and have been stumped for hours why I’m not getting the correct response from a simple example, yet some of the built in web pages (like favicon.ico) do work.

I’ve reduced the problem down to how the message is printed. Here are two functions that handle page output - they only differ in the how they print the response body. The first case prints using 3 separate prints:

void cmd1(WebServer &server, WebServer::ConnectionType type, char * data, bool)
{
  server.printP("HTTP/1.0 200 OK" CRLF);    
  server.printP("Content-Type: text/html" CRLF CRLF);
  server.printP("Hello world");
}

This doesn’t work. Connecting with telnet and typing GET /1 HTTP/1.0' <CRLF> <CRLF>
I get a “connection closed by remote host” after the 2nd line has been printed, before Hello world is printed. 100% reproducable.

Here’s the code that works:

void cmd2(WebServer &server, WebServer::ConnectionType type, char * data, bool)
{
  server.printP("HTTP/1.0 200 OK" CRLF);    
  server.printP("Content-Type: text/html" CRLF CRLF "Hello world");
  server.printP("It's 42");
}

This works, I now see “Hello world” in the browser, but not “It’s 42”. Again, I get a disconnect before the third print.

Trying to make sense of the code:

server.printP() calls `TCPClient.write(string, strlen(string));’, which is implemented as

status() ? send(...) : -1

To be sure it wasn’t a problem with status() dropping calls to send, I just made the send unconditional. No change.

send() calls simple_link_send(sd, buf, len, flags, NULL, 0, HCI_CMND_SEND); so then it’s down to the CC3000.

This is strange behaviour, but good that it is 100% reproducable, so there is a chance of tracking down the bug. But I’m out of my depth debugging the CC3000 - I hope someone can help.

mdma · June 23, 2014, 3:34am

I’ve done a little more digging.

If I add a delay(20) between the writes to the TCPClient socket, then it works as expected.

I’m not writing lots of data - a handful of bytes.

This is the data recieved:

GET /1 HTTP/1.1<CR><LF>
User-Agent: curl/7.21.7 (amd64-pc-win32) libcurl/7.21.7 OpenSSL/0.9.8r zlib/1.2.5<CR><LF>
Host: 192.168.1.101<CR><LF>
Accept: */*<CR><LF>
<CR><LF>

And this is the data sent (with delays to make it work):

HTTP/1.0 200 OK
Content-Type: text/html

Hello world

I just upgraded to the latest cc3000 patch just to see if that helps, but no.

AndyW · June 23, 2014, 3:55am

Sure as hell looks like the host driver is not correctly accounting for buffer utilization in the CC3000, it should keep a running count of the number of free buffers in the CC3000 at any time, and hold off issuing additional send/sendto until there is a buffer available.

mdma · June 23, 2014, 4:06am

I was thinking the same. I was hoping it wasn’t buffer starvation, since some form of management should take care of that.

I’ve disabled the cloud for my testing. It doesn’t make any difference. What I do notice is that when I run curl, sometimes the response comes back immediately, other times curl has to wait a couple of seconds for the response. My debug logs shows that the TCP connection to webserver socket from curl sometimes takes a couple of seconds to produce a new TCPClient connection. And the strange thing is, whether the next connection is fast or slow depends upon the prevoius URL requested - presumably something to do with the amount of data previously sent.

Is there source code I can look at to try to track down this problem?

AndyW · June 23, 2014, 4:14am

All the code is freely available on github

The host driver (originally from TI as modified/ported by ) is in core-common-lib/CC3000_Host_Driver.

That’s not meant to be a curt 'hey, there’s the source - you’re on your own"; but rather a pointer in the same direction I am looking (and suggesting others do too.) It could be as simple as an off-by-one error, or something subtle like a mutex problem. I’m personally not a big fan of the host driver code from TI tracks the CC3000 buffer usage.

zachary · June 23, 2014, 6:00am

Thanks @mdma for the clear debug case. Here’s where simple_link_send is specifically FYI:

github.com

particle-iot/core-common-lib/blob/master/CC3000_Host_Driver/socket.c

/*****************************************************************************
 *
 *  socket.c  - CC3000 Host Driver Implementation.
 *  Copyright (C) 2011 Texas Instruments Incorporated - http://www.ti.com/
 *
 *  Updated: 14-Feb-2014 David Sidrane <david_s5@usa.net>
 *  Redistribution and use in source and binary forms, with or without
 *  modification, are permitted provided that the following conditions
 *  are met:
 *
 *    Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *
 *    Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the
 *    distribution.
 *
 *    Neither the name of Texas Instruments Incorporated nor the names of
 *    its contributors may be used to endorse or promote products derived

This file has been truncated. show original

AndyW · June 23, 2014, 1:34pm

If it turns out that the host driver is doing what it says on the tin, and returning -2 (the moral equivalent of EAGAIN, or ENOBUFS) - then we should consider modifying the default behaviour of the interface exposed to user code, and loop w/ delay (or something better, but not at the expense of RAM.)

EAGAIN handling should not be required around every call to a server.printP()-like substance.

mdma · June 23, 2014, 1:55pm

Thanks. I’d seen those files a long time ago, but kind of forgot about them while I’ve been busy in application-land!

I see 2 potential issues in the code:

lack of atomicity when updating the buffer count. This is updated in two places - one as derivative of the socket send() function, and the other as a IRQ from the SPI DMA handler. Since increments and adds are not atomic, it’s possible to have an interleaving where updates to the buffer count are lost. I am looking into what atomic operations are available on GCC/STM32 to code around this using busy-wait (without disabling interrupts or relying on a heavy critical section/mutex etc…).
the possibility that less data is sent: the send() function isn’t guaranteed to send all the data. In fact, the caller should be prepared to resend all or part of the data again. The current code in TCPClient simply calls send(), and returns the response from send() as the bytes written. There are some corner cases where this is incorrect, and TCPClient should do a better job of attempting a resend.

BDub · June 26, 2014, 9:58am

You're probably well past this point by now, but this should work just fine... correct?

void cmd2(WebServer &server, WebServer::ConnectionType type, char * data, bool)
{
  server.printP("HTTP/1.0 200 OK" CRLF "Content-Type: text/html" CRLF CRLF "Hello world It's 42");
}

I know it's not ideal for your application... I'm just looking at different ways of isolating the underlying issue.

I think it also only returns -2 if SEND_NON_BLOCKING mode is used, and this is blocking only, right?

mdma · June 26, 2014, 10:25am

Thanks for the suggestion @bdub.

Putting everything on one line works for this test case, but like you said, it’s not an ideal solution in general.

A clunky fix is to add a delay of 20ms before each write (and for now, that’s what I put in the webduino library port). It’s also happens to be the same fix that @bko has suggested for working around issues accessing the external flash. I’m wondering if both of these problems comes down to use of the internal SPI bus.

My list of todo’s is piling up in relation to the free time I have to spend on this, but I’d really like to look at adding arbitration to the internal SPI bus, since at present there is none - yet it seems there is a clear concurrency issue, since requests can come in from the main thread, or from an interrupt from the CC3000, and both are accessing the same shared state, which can result in incorrect behavior depending upon exactly when the interrupt occurs.

As a long-time specialist in concurrent code, I’m not able to explain how the system works well as it does without the arbitration in place. Is it working most of the time by pure chance?

AndyW · June 26, 2014, 1:47pm

I think currently very little is happening in IRQ context, because all comms with the CC3000 are basicly polled at present. The IRQ from the CC3000 is not used the way you’d think/hope but rather is used to acknowledge the chip select and signal the spi interface is ready.

That’s not to say there are no mutex issues, though.

mdma · June 26, 2014, 2:24pm

Ah, thanks for for the clarification @AndyW. I’m still going through the common-lib code.

I’m just came across SPI_EXTI_IntHandler - I’m guessing this is the interrupt you’re talking about? I see that it asserts CC3000’s CS line. Do you know how we can be sure there isn’t already another peripheral using the SPI bus?

bko · June 26, 2014, 2:25pm

As I understand it from @david_s5 there is no arbitration between the CC3000 and the external flash, so problems do come up pretty quickly when you are trying to use both. I cannot speak to the exact nature of the problems, but I can say writes to flash are corrupted in my tests if you do not delay before and after the flash write or read calls. I don’t know why this would be, but my theory is that now that delay() runs the Spark loop, the CC3000 gets serviced enough that the probability of collisions is reduced greatly. I did confirm that write data was corrupted by downloading the test data via dfu-util.

mdma · June 26, 2014, 2:53pm

Thanks @bko. I’m still going through the code, so I’ll have more questions than answers right now (although of course I hope that will change once I’ve built a complete picture of how this works!)

delay() just calls Delay(), which is simply a loop.

void Delay(uint32_t nTime)
{
    TimingDelay = nTime;    
    while (TimingDelay != 0x00);
}

TimingDelay is decremented as part of the SysTick handler. So, that means any CC3000 work being done is asynchronously - it’s not happening as a direct consequence of that loop.

But as you were saying, what this does mean is that the delay gives time for the asynchronous ops on the CC3000 to happen, so then the spi bus is free for the external flash.

The CC3000 can arbitrarily post asynchronous events (unsolicited events in the literature), using the same mechanism to indicate a response is available to a previous request - assert the IRQ line and wait for the host to assert the CS line to receive the event data.

So, unless I’m totally on the wrong path, it does seem as if we need to manage the assert of the CS line in the SPI_EXTI_IntHandler if the bus is already in use.

bko · June 26, 2014, 3:00pm

Hi @mdma

delay() the user function in spark_wiring,cpp calls the SPARK_WLAN_Loop and kicks the watchdog.

mdma · June 26, 2014, 3:02pm

Sorry, just noticed I was looking at my own repo, which hasn’t been updated in ages. D’oh. :facepalm:

But even so, I think that interrupt handler asserting the CS line for the CC3000 is the cause of our troubles. I’ll write some stress tests to try to reproduce the corruption of flash, and then have a go at a fix. The fix will add a mutex for the spi bus, which the interrupt routine will have to acquire before asserting the CS line. If it can’t lock the bus, then set a flag so that the event from the CC3000 is picked up later. This should avoid contention on the bus.

Topic		Replies	Views
TCP client dies Troubleshooting	29	5171	October 30, 2014
Hard fault caused by the TCPServer example from the Spark docs and a simple Python client Troubleshooting	53	12951	April 15, 2016
TCPClient Strangeness Troubleshooting	21	4337	July 30, 2015
WiFi unstable, goes dead temporarily at varying intervals Troubleshooting	38	6975	September 18, 2014
TCPClient reliability problems [SOLVED] Product Creation	13	4265	October 14, 2014

Unwanted but reproducable disconnect in TCPClient

Related topics