TCPClient slowing to a crawl

Hi everyone,

I’m experiencing issues with TCPClient, and I’m hoping someone here can help me out. My actual goal is audio streaming, but it turns out I can’t even get the basic ‘Google’ example from the docs to work properly! This one here.

The behavior I’m seeing is that the download starts just fine but slows down after a couple of seconds (always at the same byte count), turning into a slow crawl of intermittent data bursts with ever-longer pauses, until almost no data is received at all. With some source addresses (I ran a couple of tests) the Core manages to finish the download, after minutes in the case of the Google page; with others it just hangs or restarts.

Building in the Web IDE or in my local environment (latest master branches) does not seem to make a difference. To rule out my home network as the cause I also tried a mobile hotspot, with the same results.

Any ideas what’s going on here? Modifying the basic Google example a bit to print, every 512 bytes, the milliseconds since the start of the download (instead of the response body) illustrates the issue pretty well:

connecting...
connected
3115 - download total: 512
5935 - download total: 1024
8890 - download total: 1536
11455 - download total: 2048
18645 - download total: 2560
27726 - download total: 3072
30405 - download total: 3584
39365 - download total: 4096
48445 - download total: 4608
57526 - download total: 5120
60092 - download total: 5632
69166 - download total: 6144
78246 - download total: 6656
80811 - download total: 7168
89881 - download total: 7680
98961 - download total: 8192
101691 - download total: 8704
110602 - download total: 9216
119682 - download total: 9728 ...

You can see that after two minutes of downloading, not even 10 KB have arrived.

Anyone? :slight_smile:

Hi @mleonh,

Interesting! I’m researching a potentially related issue. If you want the greatest throughput, I recommend compiling locally and editing inc/spark_wiring_tcpclient.h to change the buffer size to something bigger, like 256. Also try making the packet sizes from your source a weird number of bytes, like 255: there’s a glitch in the host driver around packet sizes at (or at multiples of) 256 bytes. TCP payloads only go up to about 1500 bytes anyway, so you want your buffer as close to your packet size as possible, and at the moment 256/512-byte packets might be problematic.
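For reference, that’s a one-line change: the buffer size is the TCPCLIENT_BUF_MAX_SIZE macro in that header (128 by default), so the edit looks like, e.g.:

// in core-firmware/inc/spark_wiring_tcpclient.h
#define TCPCLIENT_BUF_MAX_SIZE 256  // default is 128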

Thanks!
David

Could you share the code you used to get the times?

Sure, here it is:

#include "application.h"

TCPClient client;
byte server[] = { 74, 125, 224, 72 }; // Google

system_tick_t start;

void setup()
{
  // Make sure your Serial Terminal app is closed before powering your Core
  Serial.begin(9600);
  // Now open your Serial Terminal, and hit any key to continue!
  while(!Serial.available()) SPARK_WLAN_Loop();

  Serial.println("connecting...");

  if (client.connect(server, 80))
  {
    Serial.println("connected");
    client.println("GET /search?q=unicorn HTTP/1.0");
    client.println("Host: www.google.com");
    client.println("Content-Length: 0");
    client.println();
    start = millis();
  }
  else
  {
    Serial.println("connection failed");
  }
}

long total = 0;
void loop()
{
  if (client.available()) {
    client.read();
    total++;
    if (total % 512 == 0) {
      Serial.print(millis() - start);
      Serial.print(" - download total: ");
      Serial.println(total);
    }
  }

  if (!client.connected())
  {
    Serial.println();
    Serial.println("disconnecting.");
    client.stop();
    for(;;);
  }
}

As I said, it’s pretty much the basic example from the docs with some timing additions. Outputting the received bytes via Serial and/or taking the time measurements does not seem to influence the stalling.

@Dave: Thanks for your reply. Throughput is not (yet) my primary concern; right now I’d be happy to get receiving data to work reliably at all! :slight_smile: I already played around with the TCP buffer sizes yesterday, sadly without much success.

I noticed something strange with Serial.read() in one of my programs that did the same thing.

I found that adding a 10 ms delay inside the if (Serial.available()) loop fixed it. I’m not sure whether it’s the checking of available() or the read itself that messes things up, but it fixed the problem.
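Roughly like this, as a sketch of the workaround (not my exact code):

if (Serial.available()) {
  char c = Serial.read();
  // ... use the byte ...
  delay(10);  // this small pause is what fixed it for me
}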

I tried many different approaches so that I didn’t slow the serial transfer down; I tried adding delays every 256 bytes instead of after each byte, thinking it might be a buffering issue, with no luck…

Anyway, good luck. I’m AFK for the next 12 hours.

Hi @mleonh,

No problem! It looks like you’re using client.read(), which pulls in one byte at a time. I tried increasing the buffer size to 512 and modified your sketch a bit to read in larger chunks. How about something like this? (Note: this requires editing core-firmware/inc/spark_wiring_tcpclient.h to increase the buffer size.)

#include "application.h"
#include <math.h>

TCPClient client;
//byte server[] = { 74, 125, 224, 72 }; // Google

system_tick_t start;
uint8_t buffer[511];
uint8_t done = 0;

void setup()
{
  // Make sure your Serial Terminal app is closed before powering your Core
  Serial.begin(9600);
  // Now open your Serial Terminal, and hit any key to continue!
  while(!Serial.available()) SPARK_WLAN_Loop();

  Serial.println("connecting...");

  if (client.connect("google.com", 80))
  {
    Serial.println("connected");
    client.println("GET /search?q=unicorn HTTP/1.0");
    client.println("Host: www.google.com");
    client.println("Content-Length: 0");
    client.println();
    delay(250);
    start = millis();
  }
  else
  {
    Serial.println("connection failed");
  }
}

long total = 0;
void loop()
{
  if (done) {
    return;
  }

  // read it fast! (note: available() must be > 0, not >= 0, or this
  // loop never exits and the disconnect check below is unreachable)
  int count = 0;
  while ((count = client.available()) > 0) {
    for (int i = 0; i < 511; i++) { buffer[i] = 0; }

    //Serial.println(String(count) + " bytes available");
    total += client.read(buffer, min(count, 511));
    //Serial.println((char*)buffer);

    Serial.print(millis() - start);
    Serial.print(" - download total: ");
    Serial.println(total);
  }

  if (!client.connected())
  {
    Serial.println();
    Serial.println("disconnecting.");
    client.flush();  // discard any leftover data before closing
    client.stop();
    done = 1;
    delay(50);
  }
}

You’re definitely right that things slow down. I’d want to watch this request with Wireshark, but my guess is that the data is coming too fast, and the delays are incremental backoffs as packets are retransmitted. I might be crazy, but I suspect this is an artifact of the buffers on the radio itself. In any case, I want to research this more… :slight_smile: But this should help a little.

Thanks,
David

Thanks for your help. I tried your code with a TCPCLIENT_BUF_MAX_SIZE of 512 bytes. The stalling still occurs – in fact, it’s even more visible and happens sooner than before:

7 - download total: 218
87 - download total: 436
111 - download total: 654
551 - download total: 872
575 - download total: 1090
1438 - download total: 1308
1463 - download total: 1526
3183 - download total: 1744
3205 - download total: 1962
6626 - download total: 2180
6651 - download total: 2398
13482 - download total: 2616
13506 - download total: 2834
27150 - download total: 3052
27181 - download total: 3270
54457 - download total: 3488
54487 - download total: 3706
and so on …

What’s interesting is that including “spark_disable_cloud.h” (and using the IP byte array instead of the hostname) makes the issue disappear, resulting in a very regular download that runs through to the end.
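For anyone trying to reproduce this, disabling the cloud is just an include at the top of the sketch, plus connecting by IP. A minimal sketch of the change (assuming the header works as its name suggests):

#include "spark_disable_cloud.h"  // build with the cloud connection disabled
#include "application.h"

TCPClient client;
byte server[] = { 74, 125, 224, 72 };  // connect by IP instead of hostname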

I couldn’t get the built-in gethostbyname() to work (the function that does the DNS lookup), so I made my own here.

Oooh, that’s a great clue. Okay, let’s open an issue for this, I think:
https://github.com/spark/core-firmware/issues/239

Thanks!
David

Could it be that the while (client.available()) loop blocks the cloud connection?

What happens if you add a delay() (which calls SPARK_WLAN_Loop()) instead of disabling the cloud?
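Something like this inside the read loop, I mean (just a sketch):

while (client.available()) {
  client.read();
  total++;
  SPARK_WLAN_Loop();  // or delay(1), which services the cloud loop internally
}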


Hmm… I tried leaving the cloud on and adding SPARK_WLAN_Loop() between reads, and it didn’t seem to help. Maybe this is something @satishgn could check out?

Thanks,
David

One thing I notice is that there’s no way to tell if the buffer is getting emptied. That is: client.available() gets to 0, the while loop goes false, the code continues; the client is still connected, so the if block gets skipped and we’re back at the top of loop(); the first if (done) is false, then data is available again, so the while loop runs again…

218 seems to be a magic number that the total increments by each time around, even as it slows right down.

I left my Core in my room this morning, otherwise I’d give this a quick try myself.

Based on some further experiments, I’m starting to believe that all of this is not directly related to the cloud connection. I think having the connection enabled is just one way of triggering the stalling.

If the cloud connection is disabled, the download works perfectly, at a speed good enough for what I originally set out to do: audio streaming. Transferring a WAV (8 kHz/8-bit for testing) from a very simple node.js relay works great. This is pretty much the same as downloading bits of HTML from Google, as in the basic example.

However, guess what happens if I don’t read at full speed (just discarding data), but instead put the data into a ring buffer that is simultaneously drained by an interrupt running at the sample rate of the WAV? Exactly: after about a second or two the pauses start again, with the gaps between data bursts increasing exponentially. This is exactly the behavior seen when ‘just downloading’ with the cloud connection on.

To me it seems like the problem occurs when available data is not read immediately, but only in specific cases. For example, adding small delays in the read loop (basic example, cloud disabled) does not cause stalling, just a slower transfer. What I can’t figure out is what exactly the cloud connection/my playback code is doing to trigger the issue.
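For context, my playback path looks roughly like this (a minimal sketch rather than my actual code; the buffer size, output pin, and the 8 kHz timer hookup are placeholders):

// single-producer/single-consumer ring buffer between TCP reads and playback
#define RING_SIZE 1024
volatile uint8_t ring[RING_SIZE];
volatile uint16_t head = 0;  // advanced by loop() as TCP data arrives
volatile uint16_t tail = 0;  // advanced by the sample-rate interrupt

// producer, called from loop() while client.available()
bool ringPut(uint8_t b) {
  uint16_t next = (head + 1) % RING_SIZE;
  if (next == tail) return false;  // buffer full: incoming TCP data has to wait
  ring[head] = b;
  head = next;
  return true;
}

// consumer, called from a timer interrupt at the WAV sample rate (8 kHz)
void sampleISR() {
  if (tail != head) {
    uint8_t sample = ring[tail];
    tail = (tail + 1) % RING_SIZE;
    analogWrite(A0, sample);  // placeholder: PWM output on A0
  }
}

The point is that when the ring buffer is full, the sketch stops pulling from TCPClient until the interrupt has drained some samples, and that pause appears to be what triggers the stalling.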


I found this thread via the corresponding GitHub issue.

I think the “problem” here is normal TCP exponential back-off at work. If the Core cannot take in more TCP data, incoming packets go unacknowledged and the sender backs off before retransmitting, as it should. This repeats (sometimes quickly), and the sender’s rate stays slowed for the life of that connection. You can even see the doubling in the log above: the gaps between bursts grow from roughly 7 to 14 to 27 to 54 seconds.

Other cases of this effect come from the sender using a packet size (MTU) that is larger than the (Ethernet-minimum-sized) buffers on the TI CC3000. Using the cloud, or any other network service that consumes packet buffers on the chip, will tend to make this worse.

There are some possible work-arounds like using chunked transfers or closing and reopening the connection, but none of these are very attractive.

Hi @bko,

Thanks for your post – it helps to explain and confirms my suspicions about the issue I’m facing. I’m seeing a very similar situation with a websocket implementation that works with Spacebrew. That implementation requires extended payloads beyond what TCPClient can handle. The root of my issue is the websocket sending continuous streams of data to the Spark Core at short intervals, which I suppose is similar to what @mleonh is facing.

Long story short, I had to resort to MANUAL mode and rely on a disconnect/reconnect routine in my Spark code to get around the buffer issue. Running my code with Spark cloud connectivity results in a repeatable hard-fault SOS.

Hi @chuank

Sorry you are having trouble! Can I ask if you are in control of the Spacebrew server? If so, you might be able to set the MTU on the server to a smaller number so that the server packetizes your payload into smaller chunks.

Another idea would be to quickly dump any data you don’t need. For instance, @BDub’s Facebook-likes pushup man just throws away the first 512 bytes quickly, because he knows his good data is not in the first part of the returned data.

Yes, it’s my own server, with an MTU of 1500. Based on the buffer size set in TCPClient (128), does this mean I should start with an MTU of 128 and work in both directions to find the ‘sweet spot’?

It’s certainly worth a try, although I’m averse to reducing MTUs just to get this working with the Spark – there are other services running on my server that I don’t wish to impact down the line – assuming MTU changes apply across the entire Ethernet interface (please advise if otherwise!).

If this works, however, it will probably make sense to run my Spacebrew server solo…

I will test it on my local Mac first, and then a Raspberry Pi. Feels like I’m really close to fixing this issue. Thanks again.

So an MTU of 1500 should already be OK, I’m afraid. Is there any way you can dump unneeded data quickly?

Another thing to know is that currently using client.print() causes one packet per character to be sent. This is getting fixed, but switching to client.write() can give big improvements.
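In other words, something like this (a sketch; the request string is just an example):

// one TCP packet per character on current firmware, so this is slow:
client.print("GET / HTTP/1.0\r\n\r\n");

// building the text yourself and sending it in a single write is much faster:
const char request[] = "GET / HTTP/1.0\r\n\r\n";
client.write((const uint8_t*)request, sizeof(request) - 1);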

The data is formatted as JSON. I already have a means of discarding data that isn’t properly terminated with the closing curly braces, but there’s no way for me to bail out of the reads earlier – except maybe on reads with unexpected opening characters.

In any case, that approach makes it impossible for me to handle websocket messages beyond a single frame size (126 bytes), so it won’t be a long-term fix.

I’m using client.write(), as well as the as-yet-undocumented client.read(buf, len) method of TCPClient that @ekbduffy discovered. The problem I’m facing is more on the incoming side: client.read(buf, len) causes hard faults at moments where I suspect the encoded length in the websocket header exceeds or doesn’t match the actual unread buffer. Using client.read() instead causes inexplicable freezes and disconnections.
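I’m thinking the safest pattern might be to size every read by what has actually arrived and what my buffer can hold, never by the length field in the frame header. Something like this (a sketch; the buffer size is arbitrary and the websocket framing is omitted):

uint8_t buf[128];
int avail = client.available();
if (avail > 0) {
  // clamp to both available() and the local buffer, regardless of what the
  // websocket header claims the payload length is
  int n = client.read(buf, min(avail, (int)sizeof(buf)));
  if (n > 0) {
    // parse/accumulate the n bytes here
  }
}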


Hi @chuank

Another option, if you can compile locally, is to increase the TCPClient buffer size. I think controlling these sizes will eventually be made easier by the Spark team, but right now there are too many trade-offs to balance across differing applications, so a relatively small fixed size was selected.