TCPClient connect Issues

kkessler · May 1, 2015, 3:05pm

I’ve run into an issue and a work-around that I would like to run by the forum to see if others have come across this before. I have code which reads data from a LinkSprite Camera, TCPClient connects to a NodeJS program running on a RaspberryPi and dumps the JPG there. What I have found is often times the TCPClient->connect fails (returns 0). I’ve written loops to retry the connect on failures, destroying the TCPClient object and recreating it, etc., but after it fails, it just doesn’t seem to be able to connect again. What I’ve found, though, is if I get a failure, and let the code execute back through the loop() and through the firmware code, the TCPClient->connect will work. The following pseudo-code illustrates the work around:

loop() {
    if(i_want_to_tcpconnect) {
        TCPClient *c = new TCPClient();
        if(c->connect(server_string,port)) {
            c->write(somedata);
            ...
            i_want_to_tcpconnect=false;
        }
        else {
            delete c;
            c=NULL;
        }
    }
}

If I just put some sort of for(int n=0;n<5;n++) loop around the if(i_want_to_tcpconnect) statement and the connect fails (which happens quite often for some reason), it will always fail in that tight loop. If I let it flow through the outside loop(), and in through the Spark firmware, the next connect attempt will most probably work. In real-world programs, getting this flow to happen is quite complicated, though. Has anyone else dealt with this issue, and found a better solution?
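The "one attempt per pass through loop()" flow described above can be sketched as a small state holder. This is only an illustration, not code from the thread: `ConnectTask`, `request` and `poll` are hypothetical names, and `Client` stands for any type that offers `connect(host, port)` and `stop()`, as TCPClient does.

```cpp
#include <cstdint>

// Hypothetical helper: makes at most one connect attempt per call, so the
// caller returns to loop() (and the system firmware) between attempts.
template <typename Client>
class ConnectTask {
public:
    ConnectTask(const char* host, uint16_t port) : host_(host), port_(port) {}

    void request() { pending_ = true; }          // ask for a connection
    bool pending() const { return pending_; }
    unsigned failures() const { return failures_; }

    // Call once per loop() pass. Returns true once the client is connected.
    bool poll(Client& client) {
        if (!pending_) return false;
        if (client.connect(host_, port_)) {
            pending_ = false;
            return true;
        }
        client.stop();   // release the socket before the next attempt
        ++failures_;
        return false;    // retry on the next pass, not in a tight loop
    }

private:
    const char* host_;
    uint16_t port_;
    bool pending_ = false;
    unsigned failures_ = 0;
};
```

In a sketch, one global client and one global `ConnectTask<TCPClient>` would be enough: call `request()` when there is data to send, call `poll(client)` from loop(), and write the data on the pass where it returns true.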

Also, while researching this issue on the forum, I found a lot of people having problems trying to do a connect("192.168.1.9",12345), which doesn’t work because the connect has to go through a DNS lookup to succeed (I know because I struggled with that myself a bit). I never saw anyone mention a trick to make this work by using the free wildcard DNS service, xip.io. As long as you have connectivity to the internet, you can do a connect("192.168.1.9.xip.io",12345), and the xip DNS will return 192.168.1.9 (or whatever address you put in there), and you’re good to go. Just thought I’d pass that along.

rsteckler · May 1, 2015, 3:07pm

Starting with the easy stuff: are you closing the connection?

The cc3000 only has 4 sockets available, I believe.

bko · May 1, 2015, 3:17pm

Hi @kkessler

The comment by @rsteckler is right-on: There are seven total sockets available in the TI CC3000, four of which can be used for UDP or TCP and one of those is used by the cloud if you are using the cloud. The other sockets are used for things like DNS lookup, ARP, etc.

Every time you call client.connect(), you are using a new socket.

If you think that servicing the cloud is somehow giving you better/different results (which I can believe but don’t have any real evidence either way) then calling delay(1000); should do the same thing as going around loop() since that calls the relevant parts of the cloud servicing code.

Wow! That xip.io service (from Basecamp) looks like a big win for development! Good find!

kkessler · May 1, 2015, 3:50pm

The limitation on the number of sockets is very useful information that I did not know, but my app is pretty good about closing connections, because that is what signals the NodeJS app on the Raspberrypi to close the JPG file on disk. I’ve done a hundred netstats on the RaspberryPi to see if there are connections in weird half connected states or anything, and I never saw multiple connections or half connections; either one ESTABLISHED or nothing. I will go over my code when I get home (I haven’t pushed my latest version to github, so I can’t look at it now), with an eye to making sure I close even failed attempts (even though I would think the ~TCPClient does that as well), but I know I’ve had multiple connection failures on my first attempt after a reboot.

I can tell you the delay(1000) did not work, because that is the first thing I tried, if for no other reason than to be able to Spark.publish the results of my connect attempt.

bko · May 1, 2015, 4:29pm

I went and read through the TCPClient code again and the destructor is declared virtual ~TCPClient() {}; and does zero cleanup. It is just there so you can interface over TCPClient with your own class.

This means that unless you call client.stop() before destroying the client, the socket stays open on the TI CC3000 until it times out (I don’t know the actual timeout value, but it seems to take 5-6 seconds). If you are out of sockets and try to create another one, client.connect() will fail and return 0. Later on, after the timeout, client.connect() will get a socket from the TI part, connect, and return 1, indicating success.
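Since the destructor does no cleanup, one way to make sure stop() runs on every exit path is a small scope guard. This is a sketch under assumptions, not part of the firmware: `StopGuard` and `sendOnce` are hypothetical names, and `Client` is any type with `connect`, `write` and `stop` shaped like TCPClient's.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical RAII guard: calls stop() on the wrapped client when it goes
// out of scope, so every exit path releases the socket.
template <typename Client>
class StopGuard {
public:
    explicit StopGuard(Client& client) : client_(client) {}
    ~StopGuard() { client_.stop(); }
    StopGuard(const StopGuard&) = delete;
    StopGuard& operator=(const StopGuard&) = delete;

private:
    Client& client_;
};

// Connects, sends one buffer, and always stops the client afterwards.
// Returns true only if the connect succeeded and the whole buffer was written.
template <typename Client>
bool sendOnce(Client& client, const char* host, uint16_t port,
              const uint8_t* data, size_t len) {
    StopGuard<Client> guard(client);
    if (!client.connect(host, port)) return false;
    return client.write(data, len) == len;
}
```

The client itself can stay a single long-lived object; only the guard is created and destroyed per attempt.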

When I was debugging another user’s UDP code this confused me at first, since it seemed like the connection would go through only every Nth try, but what was really happening was that the TI part timed out the socket and closed it, so a new socket became available. I figured this out by writing code that dumped the connected status of all 7 sockets, which clearly showed why my connections were failing: there were no sockets available.

You are also asking for memory trouble since you are returning a 128-byte buffer inside client only to ask for it again and any memory fragmentation effects get magnified in this case. That should improve in the Photon case since they are making memory allocation work better. I would not call delete and new with such frequency.

I know the client method return values are not documented but they are not documented for Arduino either. They are, however, the key to understanding what is happening with your connections.

kkessler · May 1, 2015, 6:25pm

Thank you, this is really informative. I can certainly stop creating and destroying the TCPClient object. My original implementation just statically allocated this object once, but I was desperately deleting the object, trying to clean things up so I would start from scratch and get a connection. I think my problem was thinking of the TCPClient like I would a Java or C socket, and not as a wrapper around some internal firmware implementation of a connection.

Is it possible that the trip through the firmware in the loop() is different from delay(1000) with respect to timing out the connection? Yesterday I was doing delay(1000) between connect attempts and getting a 50-60% failure rate on my downloads, and today, just taking the 5ms trip through the loop(), I’m having 100% success.

WebDust21 · August 15, 2015, 7:39pm

@bko: Could you share your code for this?

bko · August 15, 2015, 7:55pm

This is only for the Core, before the HAL layer comes in–still interested?

WebDust21 · August 15, 2015, 7:55pm

Sure. I’m using a Core, and there’s a way to hack the info out despite the HAL layer.

bko · August 16, 2015, 4:04am

OK, on a Core and only on a Core, you can do

for(int i=0; i<7; i++) {
  int res = get_socket_active_status(i);
  if (SOCKET_STATUS_ACTIVE == res) {
    // do something
  } else {
    // do something else
  }
}

You might have to include event_handler.h. At the time I was doing this, I had a completely instrumented version of UDP but I no longer have that sandbox.

WebDust21 · August 16, 2015, 5:52pm

@bko: Thanks a lot. (A million, perhaps?) That should help debug some issues I’m having…

EDIT: bko wrote: “You might have to include event_handler.h”

Just for the record, that’s “evnt_handler.h”, part of the CC3000 host driver.

WebDust21 · August 16, 2015, 11:58pm

@bko. Does this code work on the latest DEVELOP? It seems that a lot of WiFi features are broken/unreliable on anything more than the default Particle Dev IDE firmware.
The main reasons I left the “official” firmware of the Dev IDE are:

1. TCPServer.stop()
2. smaller firmware size (or at least flexibility to manually reduce it)
3. hope for bugfixes and improved reliability.

I got the first two, but apparently not the last one. WiFi.ping() doesn’t work (I’m not alone with this one: Ping not working on develop branch?)…and with the latest DEVELOP, WiFi.connecting() doesn’t appear to work either.

I wrote some code to stack all the socket statuses into one variable, via the socket status code you kindly provided. Here’s the code I’m using to stack it all together:

```
mBase = 0;
for(mWork = 6; mWork <= 0; mWork--)
{
    mBase <<= 1;
    mBase |= get_socket_active_status(mWork);        // Returns SOCKET_STATUS_ACTIVE = 1, INACTIVE = 0.  That's the only options.
}
Serial.print("SocketStat: "); Serial.println(mBase, HEX);
```
And I always get...0, regardless of whether the Core is connected to the cloud or not.
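One thing worth noting about the snippet as posted: the loop condition `mWork <= 0` is already false when `mWork` starts at 6, so the body never executes and `mBase` stays 0 no matter what the sockets are doing. A corrected packing loop might look like the sketch below. `packSocketStatus` is a hypothetical name, and the status function is passed in as a parameter so the same code can be tried off-device.

```cpp
#include <cstdint>

// Packs the status of sockets 0..count-1 into one word, bit i for socket i.
// statusOf is any callable that takes a socket index and returns an int;
// a nonzero result sets the corresponding bit.
template <typename StatusFn>
uint32_t packSocketStatus(StatusFn statusOf, int count) {
    uint32_t packed = 0;
    for (int i = count - 1; i >= 0; --i) {   // note: >= 0, not <= 0
        packed <<= 1;
        packed |= (statusOf(i) != 0) ? 1u : 0u;
    }
    return packed;
}
```

On a Core this would presumably be called as `packSocketStatus(get_socket_active_status, 7)`. Which return value means "active" is a separate question that depends on the CC3000 driver's definitions.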

I peeked into the firmware "evnt_handler.c" and found that `get_socket_active_status` is really bit-masking an internal variable with the socket statuses in it.  As in:

//*****************************************************************************
//
//! get_socket_active_status
//!
//! @param Sd Socket IS
//! @return Current status of the socket.
//!
//! @brief Retrieve socket status
//
//*****************************************************************************

INT32 get_socket_active_status(INT32 Sd)
{
    long rv = SOCKET_STATUS_INACTIVE;
    if(M_IS_VALID_SD(Sd))
    {
        rv = (socket_active_status & (1 << Sd)) ? SOCKET_STATUS_INACTIVE : SOCKET_STATUS_ACTIVE;
    }
    return rv;
}


As mentioned before, I always get 0, regardless of whether the TCPClient succeeds or fails to connect to an external server (in this case, [textbelt.com][1]).  Any suggestions?  Or perhaps I should return to Particle Dev IDE?  I suppose the way to synthesize `TCPServer.stop()` is with `WiFi.off(); WiFi.on();`?


  [1]: http://textbelt.com

bko · August 17, 2015, 12:47am

Hi @WebDust21

I have not used that code in quite some time (1 year? and certainly only on the old three repo setup) so I can’t really speak to how it works on Develop branch.

But the Develop branch is “bleeding-edge” where a lot of branch integration work is done. I would switch to the Latest branch for more stability in any case. Other than at release points, anything on the Develop branch could be broken briefly at any time. It is certainly a “non-goal” to integrate non-working code, but it happens sometimes.

My recollection is that the socket id’s were simple integers and 0 was always the cloud connection on the Core at the time. In the packed int32 socket_active_status, a “1” means inactive and a “0” means active. On a Core with a TI CC3000, you have to call the CC3000 driver to close the socket, see closesocket() in socket.c for instance. And that is what TCPClient.stop() does.
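To make the bit convention concrete, here is a minimal decoding sketch. It assumes exactly what the post states (a set bit means inactive, a clear bit means active); `socketIsActive` and `countActiveSockets` are hypothetical names, not driver functions.

```cpp
#include <cstdint>

// Assumption taken from the post above: in the packed status word a set bit
// means the socket is inactive and a clear bit means it is active.
inline bool socketIsActive(uint32_t statusMask, int sd) {
    return (statusMask & (1u << sd)) == 0;
}

// Counts the active sockets among descriptors 0..count-1.
inline int countActiveSockets(uint32_t statusMask, int count) {
    int active = 0;
    for (int sd = 0; sd < count; ++sd) {
        if (socketIsActive(statusMask, sd)) ++active;
    }
    return active;
}
```

With this convention, a mask of 0x7E over seven sockets would mean only socket 0 is active.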

I am not clear what problem you are trying to debug/fix at this point: What is not working for you?

mdma · August 17, 2015, 1:00am

Hey @WebDust21 - for features not working the best way to ensure these issues get some attention is to post in github issues, with a test case that illustrates the problem. That will allow the firmware team to reproduce the issue and be able to diagnose and fix.

Thanks,
mat.

WebDust21 · August 17, 2015, 1:01am

@bko. I'm having difficulty with a TCPClient, specifically with regard to a REST API client. It is being extremely unreliable, for reasons unknown to me.

First of all: WiFi.ping() doesn't work in anything but the Particle Dev IDE's build. No biggie--it took a couple of seconds to return anyway, and all I'm interested in is whether the remote host can be reached. Therefore, I'm using TCPClient to connect to the host. If the connection is successful, disconnect and indicate that the server is valid.

int PingHost(char *wServer)
{
  if (!cClient.connect(wServer, 80)) {Serial.print("Fail Connect"); return 0;}    // An error if we can't connect to the server.
  // We are connected.  That means the host is present.
  delay(250);                    // DEBUG...wait
  cClient.flush();
  cClient.stop();
  return 1;
}

This works...most of the time. I'll note that it's only called on startup, when everything's fresh.
BUT, when I try to send a REST command, the following happens:

1. Successfully connects
2. Sends the HTTP POST request
3. Waits for a response...and hears absolutely nothing
4. My code times out (7 seconds later) and fails. No 200, no 404, etc. Just absolutely nothing.

Conversely, TCPClient works properly in a webserver example, sending and receiving code. Hmm.

Thanks for the reminder. I'll switch to the most recent LATEST (now that I know what files NOT to merge if I want a 11K-smaller BIN) and see if that fixes the issues.

@mdma: Thanks...I'll do that as I confirm issues. After all, that's about the extent of my GitHub abilities...cheers!

bko · August 17, 2015, 1:16am

Hi @WebDust21

You have to be careful with ping (ICMP Echo): since it has been used in DDoS attacks in the past, a lot of web hosts (like Amazon AWS) don’t answer pings. These days a lot of web hosts are not really one computer with a single address either, and they use DNS hacks to assign a closer host to do the work. I have not investigated the ping issues, but you should always test against a host you know answers pings. I will try to have a look at ping in the Develop branch and see if I see anything. When I contributed that, there were changes from the TI driver on up, so there are a lot of moving parts to check there.

My experience with the Core is that client.flush() and client.stop() are not really sufficient for dealing with the return data. I would set up two while loops, one to wait for data to start coming back and one to read the data (dumping it in the bit bucket if you like) until there is no more. This has to do with the two-level buffering scheme, with packets being buffered in the TI CC3000 and then data being buffered again in the TCPClient. client.flush() only hits the latter, while the former is a somewhat scarce resource shared by all networking clients in the TI CC3000. You can look around the forum for my previous examples with timeouts or search for @mtnscott’s code, which was well done too.
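The two-loop approach might look roughly like the sketch below. It is an illustration under assumptions, not the code referred to in the post: `drainResponse` is a hypothetical name, `Client` needs `available()` and `read()` as TCPClient has, and `nowMs` is any callable returning elapsed milliseconds (on the device, `millis` would be the natural choice).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper.
// Phase 1 waits up to firstByteTimeoutMs for data to start arriving.
// Phase 2 reads and discards bytes until none has arrived for idleTimeoutMs.
// Returns the number of bytes discarded.
template <typename Client, typename Clock>
size_t drainResponse(Client& client, Clock nowMs,
                     uint32_t firstByteTimeoutMs, uint32_t idleTimeoutMs) {
    uint32_t start = nowMs();
    while (client.available() <= 0) {
        if (static_cast<uint32_t>(nowMs() - start) >= firstByteTimeoutMs) {
            return 0;   // nothing ever came back
        }
    }

    size_t discarded = 0;
    uint32_t lastData = nowMs();
    while (static_cast<uint32_t>(nowMs() - lastData) < idleTimeoutMs) {
        if (client.available() > 0) {
            client.read();          // drop the byte
            ++discarded;
            lastData = nowMs();     // restart the idle timer
        }
    }
    return discarded;
}
```

Calling client.stop() after the drain returns would then close the connection with nothing left in either buffer. The timeout values are application choices; none are given in the thread.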

WebDust21 · August 17, 2015, 1:42am

@bko, @mdma: Downloaded the most recent LATEST from GitHub (via “Download as ZIP”). Extracted it…popped my code in (whoops: had to replace all the Particle.xxx() with Spark.xxx())…failed compile with 4,200 bytes too large. I expected that, and replaced module.mk with a month-old copy. Successful compile with 6,000 bytes to spare. (Can I get a comment about this, or the changes in module.mk?)

Only now…TCPClient fails outright in my test code! (It still works fine in the webserver.) Calling .connect() always fails. Maybe I’ll go back to the month-old LATEST…or see if I can cram my code into the old firmware build.

I get the same strange powerup LED sequence with the most recent LATEST as the DEVELOP branch: power up -> white breath (1.5sec) -> blue (1sec) -> flash green (1sec) -> off (until WiFi connected) -> my code finally is granted control of the RGB LED.

Thanks for the info regarding WiFi.ping(). But I can build code in the Particle Dev IDE and get a valid ping, then take the exact same code and server to ANY of my local builds (LATEST or DEVELOP) and get a return value of 0. That includes WiFi.ping(WiFi.gatewayIP()).

I’ll experiment with other methods of discarding data. BUT…when I connect to an HTTP server, it doesn’t send a single byte: the client has to initiate the conversation. That means that there is nothing to discard, no?

WebDust21 · August 17, 2015, 2:14pm

@bko: The most recent LATEST didn’t work for me, so I tried switching to the firmware-release-0.4.3 build. TCPClient connects fine (might have been my bad with trying to cClient.flush(); cClient.stop(); before calling cClient.connect(wServer, 80);), but I still don’t get a single byte back from the server. I can CURL the server manually from my computer, and it works perfectly.

Other notes: 0.4.3 doesn’t give me the strange RGB-LED bootup pattern, and I didn’t change my code at all. It also gives a slightly smaller output file than the hacked LATEST.
FWIW, it seems that the 11K-larger size of the DEVELOP and LATEST builds is due to COMPILE_LTO ?= n in module-defaults.mk. I’ve done text differencing on the old (= small build) file and the new (= big build), and that’s the ONLY change. Strangely, if I change it to COMPILE_LTO ?= y…it still fails with “too big.”

WebDust21 · August 17, 2015, 3:01pm

@bko: Awful lonely out here, it seems. I tried the following code on RELEASE 0.4.3:

/* executes once at startup */
void setup() {
    Serial.begin(19200);       // for debug purposes.  Has to be first in the code, or the USB-RS232 driver doesn't start
    delay(5000);
}

/* executes continuously after setup() runs */
void loop() {
    Serial.println("Resolve Test");
    delay(1000);
    Serial.println(WiFi.resolve("textbelt.com"));
    delay(1000);
    Serial.println(WiFi.resolve("particle.io"));
    delay(1000);
    Serial.println(WiFi.resolve("api.pushingbox.com"));
    delay(1000);
    Serial.println(WiFi.resolve("google.com"));
    delay(1000);
}

Output:

Resolve Test
104.28.20.107
54.230.88.192
0.0.0.0
173.194.46.96
Resolve Test
104.28.21.107
54.230.91.123
213.186.33.19
173.194.46.99
Resolve Test
0.0.0.0
0.0.0.0
213.186.33.19
173.194.46.104
Resolve Test
104.28.20.107
205.251.253.175
213.186.33.19
173.194.46.98
Resolve Test
104.28.21.107
54.192.89.11
213.186.33.19
173.194.46.100
Resolve Test
104.28.20.107
54.192.90.137
0.0.0.0
173.194.46.103
Resolve Test
104.28.21.107
54.192.90.137
213.186.33.19
173.194.46.97

Basically, WiFi.resolve() (=gethostbyname) randomly fails, and when it fails, TCPClient can’t connect to the server. As TCPClient uses gethostbyname to resolve the IP address to connect to…that means that TCPClient randomly fails to connect.
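Given the intermittent 0.0.0.0 results above, a simple mitigation is to retry the lookup a few times before giving up. The sketch below is hypothetical: `resolveWithRetry` is not a firmware function, and addresses are modelled as plain 32-bit values with 0 standing for 0.0.0.0. How the result of WiFi.resolve() is converted to that form is left to the caller.

```cpp
#include <cstdint>

// Hypothetical helper: calls resolve(host) up to maxAttempts times and
// returns the first non-zero address, or 0 if every attempt fails.
template <typename ResolveFn>
uint32_t resolveWithRetry(ResolveFn resolve, const char* host, int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        uint32_t addr = resolve(host);
        if (addr != 0) return addr;   // got a usable address
    }
    return 0;                         // all attempts returned 0.0.0.0
}
```

Connecting by the resolved address afterwards avoids a second lookup inside the connect call, assuming the client offers an address-based connect.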

Well, I’m feeling a little stupid, and I’m not one of those people gifted with chocolate socks. Why was TCPClient not returning anything??? 'Cause the server was waiting for me to send a closing CRLF on the last line… All these days I’ve wasted 'cause of that. Give it the closing CRLF, and

bko · August 17, 2015, 3:05pm

Hi @WebDust21

The way that TCPClient fails when DNS fails was also the subject of a recent bug fix. In the future, this will not connect and you can retry. I would look at your router logs to see if you can figure out why DNS is failing randomly for you.

Glad you figured out the CRLF issue!
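For anyone hitting the same silence from a server: in HTTP/1.1 every header line ends with CRLF, and an empty CRLF line marks the end of the headers. The thread does not say exactly which terminator was missing, so the following is only a sketch of a correctly terminated request. `buildPostRequest` is a hypothetical name, the content type is an arbitrary choice, and it is written for a desktop C++11 compiler (on the device, building the request with print calls may be more practical).

```cpp
#include <string>

// Builds a minimal HTTP/1.1 POST request. Every header line ends with CRLF
// and an empty CRLF line separates the headers from the body.
std::string buildPostRequest(const std::string& host, const std::string& path,
                             const std::string& body) {
    std::string req;
    req += "POST " + path + " HTTP/1.1\r\n";
    req += "Host: " + host + "\r\n";
    req += "Content-Type: application/x-www-form-urlencoded\r\n";
    req += "Content-Length: " + std::to_string(body.size()) + "\r\n";
    req += "Connection: close\r\n";
    req += "\r\n";   // blank line: end of headers
    req += body;
    return req;
}
```

Without the blank line, the server keeps waiting for more headers, and the client sees no response at all until something times out.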
