WiFi unstable, goes dead temporarily at varying intervals

Thanks for the detailed explanation; it really helps to understand why you are doing what you are doing.

I can say from my own experience that TCPClient and HTTP GET have been very stable for me. I have a few apps with TCPServer that I have played with, and they worked for the short time I tested them, but I don’t use that very much. So let’s focus on TCPClient and the status messages, as you say.

It looks like you are not interested in the response from the server to your HTTP GET request, but that doesn’t mean it isn’t going to send one to the TI CC3000. I think you will do better if you wait for client.available() (perhaps with a timeout for safety) after sending the GET request, and then call client.read() in a loop or client.flush() before calling client.stop(), rather than just blindly waiting 10 ms and assuming it is a good time to stop.

There is nothing wrong with client.print and I use client.print exclusively in my web fetching apps. On both Spark and Arduino it can send multiple packets for one print statement leading to lower performance, but this does not hurt apps that only send occasionally. A common failure mode is trying to receive data in a web server application and then trying to resend the received data one character at a time. The TI CC3000 overflows its buffers in this mode since one packet coming in with 100 characters requires 100 packets sent out and there is no flow control here.
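As a rough illustration of avoiding many small sends, the whole request can be assembled in memory first and handed to the client in one call. This is only a sketch: buildStatusRequest is a hypothetical helper, it uses std::string so it can be tried on a PC, and the path and headers are simply copied from the code discussed in this thread.

```cpp
#include <string>

// Hypothetical helper (not part of the Spark firmware): assemble the full
// HTTP request in memory so it can be passed to the TCP client in a single
// call instead of many small print() calls.
std::string buildStatusRequest(const std::string& host,
                               const std::string& type,
                               const std::string& message) {
    std::string request;
    request.reserve(160);  // avoid repeated reallocation while appending
    request += "GET /scripts/response.php?type=" + type;
    request += "&message=" + message;
    request += " HTTP/1.0\r\n";
    request += "Connection: close\r\n";
    request += "Host: " + host + "\r\n";
    request += "Content-Length: 0\r\n\r\n";
    return request;
}
```

On the Core, the resulting buffer would then go out with a single client.print() or client.write() call. Whether that actually reduces the packet count on the CC3000 is an assumption that would need testing.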

In past revisions of my code, at the end of my HTTP GET I had the following code:

...
client.print("\r\nContent-Length: 0\r\n\r\n");
while (client.connected() && clientTimeoutCounter < 100) {
    delay(10);
    clientTimeoutCounter++;
}
client.flush();
client.stop();
...

which would wait up to a second for the client to disconnect on its own. But when I was trying to simplify things, I tried to remove all logic and loops from the HTTP GET. I do like the idea of doing a flush and then doing a client read to make sure it is a good time to stop (definitely with a timeout). I’ll change it to:

...
client.print("\r\nContent-Length: 0\r\n\r\n");
while (client.read() > 0 && clientTimeoutCounter < 100) {
    client.flush();
    clientTimeoutCounter++;
    delay(10);
}
client.stop();
...

and implement it that way. Do you think that a 1 second timeout is enough, or should I let it wait a bit longer?

Thank you for the productive suggestions and feedback.

If it is a local server (like I believe it is) then that should be OK.

What I like to do is loop while client.available() is false and a counter is less than, say, 1000, just waiting. I follow that by looping while client.available() is true and another counter is less than the maximum number of bytes I expect, calling client.read() each time. Then I call client.flush() and client.stop() after the while loop.

That way I wait for the response to start coming in and I try to dump it all.
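A minimal sketch of that two-loop pattern, written so it can be tried on a PC. The names waitAndDrain and FakeClient are made up for illustration; FakeClient only stands in for TCPClient and assumes nothing beyond the available() and read() calls.

```cpp
#include <cstddef>

// First loop: poll until data shows up or the poll budget runs out.
// Second loop: read and discard bytes, up to a fixed cap.
// Returns how many bytes were discarded.
template <typename Client>
std::size_t waitAndDrain(Client& client, unsigned maxWaitPolls,
                         std::size_t maxBytes) {
    unsigned polls = 0;
    while (!client.available() && polls < maxWaitPolls) {
        ++polls;  // just waiting for the response to start
    }
    std::size_t discarded = 0;
    while (client.available() && discarded < maxBytes) {
        client.read();  // throw the byte away
        ++discarded;
    }
    return discarded;
}

// Hypothetical stand-in for TCPClient: reports no data for the first
// delayPolls calls to available(), then exposes pending bytes.
struct FakeClient {
    int delayPolls;
    int pending;
    int available() {
        if (delayPolls > 0) {
            --delayPolls;
            return 0;
        }
        return pending;
    }
    int read() {
        if (pending <= 0) return -1;
        --pending;
        return 'x';
    }
};
```

On the Core, the same function would be followed by client.flush() and client.stop(), as described above.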

Lots of people have reported that delay(10); or delay(20); is required in some places in their code but I don’t have any delays in my client code.

After implementing that idea (and the while client.available() == false loop as well), the failure rate was pretty much the same. So to further focus on what I think is the problem, I trimmed out almost all of the other code.

The results are the worst that I have seen since I started on this branch of the code: the core sent the status message seemingly flawlessly for 2 hours and 10 minutes. At that point the core stopped responding for 5 hours and 44 minutes. Then the core sent status messages for 1 hour, then stopped again and had been silent for an hour when I checked the results. This is the longest the core has gone being unresponsive and yet still recovered with my code. A few interesting points though:

  • When I checked the results and the core was not sending messages, the D7 LED was lit indicating that the core was stuck in the section of code where the HTTP GET is being sent. Also, the core LED is breathing cyan.
  • In my code the very first status message sent after a reset contains “SparkCoreBoot”, so it does not appear that the core is resetting.
  • The recovery after the 5 hour unresponsive period shows that the messageCounter is still being incremented. Thinking about this, I realized that the messageCounter has continued to increment during every failure period, and yet in all of my recent tests the D7 pin has been stuck on, indicating a hang within the code at that point. This makes me think that the code is still performing the messageCounter increment while the code is stuck within the HTTP GET, which is puzzling.
  • Keeping the last point in mind (and considering the core was breathing cyan), I tried to push the code to the core again at the end of this test (with D7 lit)… no response. I didn’t think this would work, but I did it to see if I could catch the code executing the loop while the function was hung (which is a possibility considering the previous point). When I pushed this code last night, I did do a second push of it while it was executing, so I know that refusing a push is also failure behavior.

Not sure where to go from here.

Here is the code.

SYSTEM_MODE(SEMI_AUTOMATIC);
   int heartbeatTimer;
    
// Declare logCounter
    int messageCounter = 0;
    
// Declare / Initialize bootNotificationSent
    bool bootNotificationSent = false;

// Begin: Declare/Initialize for relay shield
    int RELAY1 = D0;
    int RELAY2 = D1;
    int RELAY3 = D2;
    int RELAY4 = D3;
    int INDICATORLED = D7;
// End:   Declare/Initialize for relay shield

// Begin: Declare / Initialize for TCPServer
    TCPClient client;
// End:   Declare / Initialize for TCPServer

    // Function to perform operation received from client
    void performOperation(String operation) {
        // Turn on INDICATORLED while sending message
        char hostname[] = "10.3.2.106";
        int port = 80;
        int clientTimeoutCounter;
        String operationPerformed = "";
        // Perform actions based on operation
        if (operation == "statusHeartbeat") {
            // Report that Core has booted before first heartbeat
            if (bootNotificationSent == false) {
                operationPerformed = "SparkCoreBoot";
                bootNotificationSent = true;
            } else {
                // Every 15th status, reset counter
                if (messageCounter < 15) {
                    messageCounter++;
                } else {
                    messageCounter = 0;
                }
                operationPerformed = "status";
            }
        } else {
            // Invalid command, abort
            return;
        }
        // Send http GET
        // Make sure client isn't connected first
        client.flush();
        client.stop();
        // Turn on INDICATORLED
        digitalWrite(D7, HIGH);
        if (client.connect(hostname, port)) {
            client.print("GET /scripts/response.php?type=");
            client.print(operationPerformed);
            client.print("&message=");
            client.print(WiFi.SSID()); // suffix[0]: WiFi SSID of connected network
            client.print("_");
            client.print(WiFi.RSSI()); // suffix[1]: WiFi signal strength
            client.print("_");
            client.print(Spark.connected()); // suffix[2]: Status to determine if connected to Spark Cloud
            client.print("_");
            client.print("0"); // suffix[3]: Status of RELAY1
            client.print("_");
            client.print("0"); // suffix[4]: Status of RELAY2
            client.print("_");
            client.print(messageCounter); // suffix[5]: logCounter
            client.print(" HTTP/1.0\r\n");
            client.print("Connection: close\r\n");
            client.print("Host: ");
            client.print(hostname);
            client.print("\r\nContent-Length: 0\r\n\r\n");
            clientTimeoutCounter = 0;
            while (client.available() == false && clientTimeoutCounter < 1000) {
                clientTimeoutCounter++;              
            }
            clientTimeoutCounter = 0;
            while (client.read() != -1 && clientTimeoutCounter < 100) {
                client.flush();
                clientTimeoutCounter++;
                delay(10);
            }
            client.stop();
        }
        // Turn off INDICATORLED
        digitalWrite(D7, LOW);

    }

void setup() {

    // Connect to Spark Cloud (System mode is set to semi-automatic)
        Spark.connect();

    // Begin: Setup for relay shield

        // Initialize the relay control pins as outputs
        pinMode(RELAY1, OUTPUT);
        pinMode(RELAY2, OUTPUT);
        pinMode(RELAY3, OUTPUT);
        pinMode(RELAY4, OUTPUT);
        pinMode(INDICATORLED, OUTPUT);
        // Initialize all relays to an OFF state
        digitalWrite(RELAY1, LOW);
        digitalWrite(RELAY2, LOW);
        digitalWrite(RELAY3, LOW);
        digitalWrite(RELAY4, LOW);
        digitalWrite(INDICATORLED, LOW);
    // End:   Setup for relay shield

    // Initialize heartbeatTimer
    heartbeatTimer = Time.now();
}

void loop() {

    // IF for heartbeat functionality
    if (Time.now() > heartbeatTimer + 60) {
        heartbeatTimer = Time.now();
        performOperation("statusHeartbeat");
    }
}

It sounds like something is still going very wrong, but I am not seeing it. The only place your code could get stuck with the LED on while still breathing cyan is in the second while() loop, where you call delay(10). I am not sure why you want to call delay there. I would also try moving the call to flush() to after the end of that while loop. It shouldn’t do anything, but it is safe there. Flush just resets the pointers to the buffer; it is client.read() calling client.available() that pulls data in from the WiFi chip. I would do:

unsigned long lastTime = millis();
while( client.available() && millis()-lastTime<10000 ) {  //ten seconds
  client.read();
}
client.flush();  //for safety

It is also worth asking about your server: is it a well-debugged web server like Apache or something more homegrown? Since you can’t re-flash, it is clear you have core firmware issues, but there could be issues on the other side too. Maybe you can use a regular PC to simulate the core output and test the server overnight as well.

Maybe this is related to your flashing issue: a week or so ago, I also had problems flashing over the air using the IDE. The IDE reported flashing was successful, but I didn’t see any magenta blinking. After a while I found out that this problem crops up even before the IDE reports 100% usage of code space (at 97.6% in my case). So if your code is getting close to 100%, you may have encountered the same issue.

@bko,

I was calling a delay simply to increase the wait time to at least 1 second (100 max loop iterations x 10 ms). For some (dumb) reason I didn’t even think to use actual time (even though I use it elsewhere in my code, doh!). I’ll fix that.

I misunderstood client.flush() and client.read() a bit; I completely agree with your evaluation of that.

As far as my webserver is concerned, I had previously been using the webserver plugin in EventGhost (which has proven to be buggy at best), but when things started going wrong with the Spark Core I stood up a Raspberry Pi with Raspbian and Apache (with no special configuration). After setting it up I did some light soak testing on it and it hasn’t shown any problems.

I still keep coming back to this question: if the code is getting stuck in the loop with the delay, how is the messageCounter variable getting incremented?

In any case, I updated the code and pushed it… waiting for results:

SYSTEM_MODE(SEMI_AUTOMATIC);
   int heartbeatTimer;
    
// Declare logCounter
    int messageCounter = 0;
    
// Declare / Initialize bootNotificationSent
    bool bootNotificationSent = false;

// Begin: Declare/Initialize for relay shield
    int RELAY1 = D0;
    int RELAY2 = D1;
    int RELAY3 = D2;
    int RELAY4 = D3;
    int INDICATORLED = D7;
// End:   Declare/Initialize for relay shield

// Begin: Declare / Initialize for TCPServer
    TCPServer server = TCPServer(80);
    TCPClient serverClient;
    TCPClient client;
// End:   Declare / Initialize for TCPServer

    // Function to perform operation received from client
    void performOperation(String operation) {
        // Turn on INDICATORLED while sending message
        char hostname[] = "10.3.2.106";
        int port = 80;
        int clientTimeoutCounter;
        String operationPerformed = "";
        // Perform actions based on operation
        if (operation == "statusHeartbeat") {
            // Report that Core has booted before first heartbeat
            if (bootNotificationSent == false) {
                operationPerformed = "SparkCoreBoot";
                bootNotificationSent = true;
            } else {
                // Every 15th status, reset counter
                if (messageCounter < 15) {
                    messageCounter++;
                } else {
                    messageCounter = 0;
                }
                operationPerformed = "status";
            }
        } else {
            // Invalid command, abort
            return;
        }
        // Send http GET
        // Make sure client isn't connected first
        client.flush();
        client.stop();
        // Turn on INDICATORLED
        digitalWrite(D7, HIGH);
        if (client.connect(hostname, port)) {
            client.print("GET /scripts/response.php?type=");
            client.print(operationPerformed);
            client.print("&message=");
            client.print(WiFi.SSID()); // suffix[0]: WiFi SSID of connected network
            client.print("_");
            client.print(WiFi.RSSI()); // suffix[1]: WiFi signal strength
            client.print("_");
            client.print(Spark.connected()); // suffix[2]: Status to determine if connected to Spark Cloud
            client.print("_");
            client.print("0"); // suffix[3]: Status of RELAY1
            client.print("_");
            client.print("0"); // suffix[4]: Status of RELAY2
            client.print("_");
            client.print(messageCounter); // suffix[5]: logCounter
            client.print(" HTTP/1.0\r\n");
            client.print("Connection: close\r\n");
            client.print("Host: ");
            client.print(hostname);
            client.print("\r\nContent-Length: 0\r\n\r\n");
            clientTimeoutCounter = millis();
            while (client.available() == false && millis() - clientTimeoutCounter < 1000) {
                // Just wait up to 1000 millis
            }
            clientTimeoutCounter = millis();
            while (client.read() != -1 && millis() - clientTimeoutCounter < 1000) {
                client.read();
            }
            client.flush();
            client.stop();
        }
        // Turn off INDICATORLED
        digitalWrite(D7, LOW);

    }

void setup() {

    // Connect to Spark Cloud (System mode is set to semi-automatic)
        Spark.connect();

    // Begin: Setup for relay shield

        // Initialize the relay control pins as outputs
        pinMode(RELAY1, OUTPUT);
        pinMode(RELAY2, OUTPUT);
        pinMode(RELAY3, OUTPUT);
        pinMode(RELAY4, OUTPUT);
        pinMode(INDICATORLED, OUTPUT);
        // Initialize all relays to an OFF state
        digitalWrite(RELAY1, LOW);
        digitalWrite(RELAY2, LOW);
        digitalWrite(RELAY3, LOW);
        digitalWrite(RELAY4, LOW);
        digitalWrite(INDICATORLED, LOW);
    // End:   Setup for relay shield

    // Initialize heartbeatTimer
    heartbeatTimer = Time.now();
}

void loop() {

    // IF for heartbeat functionality
    if (Time.now() > heartbeatTimer + 60) {
        heartbeatTimer = Time.now();
        performOperation("statusHeartbeat");
    }
}

@maxint, thanks for the suggestion. Here are the usage numbers on compile from the code pushed last night:

Flash used	78936 / 110592	71.40%
RAM used	13612 / 20480	66.50%

and I am able to flash over the air just fine while the code is not hung. All I had to do was reset the core. Had the core not been hung in the code, I think it would have flashed normally.

I have been watching these numbers, the usage numbers from the full code are:

Flash used	80872 / 110592	73.10%
RAM used	13612 / 20480	66.50%

and I have yet to see those higher than 75% / 68%.

Looking better to my eye, but time will tell.

The millis() function returns an unsigned 32-bit value so declaring your clientTimeoutCounter as int could be a problem when it wraps.
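To show why the unsigned type matters, here is a small sketch that can be compiled on a PC. The helper names elapsedMs and timedOut are made up for illustration; the point is only that subtracting two unsigned 32-bit readings gives the right elapsed time even when the counter has wrapped once (which a millisecond counter does after roughly 49.7 days).

```cpp
#include <cstdint>

// Elapsed milliseconds between two readings of a 32-bit millisecond counter.
// Unsigned subtraction is modulo 2^32, so the result is still correct when
// the counter has wrapped once between start and now.
std::uint32_t elapsedMs(std::uint32_t now, std::uint32_t start) {
    return now - start;
}

// True once at least timeoutMs milliseconds have passed since start.
bool timedOut(std::uint32_t now, std::uint32_t start,
              std::uint32_t timeoutMs) {
    return elapsedMs(now, start) >= timeoutMs;
}
```

For example, with start = 0xFFFFFFF0 and now = 5 (just after a wrap), elapsedMs returns 21, so a 1000 ms timeout has correctly not expired yet.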

crap… (fixed)

Just a note: when I went to push the new code, the core had sent status messages for 26 minutes, hung for 10 minutes (during which I couldn’t push the new code), and then recovered on its own, at which point I was able to push the code.

Now that I have changed the declaration of clientTimeoutCounter from int to unsigned long, I’ll let it run and see what I get. Nice catch @bko, I really should have looked at that more carefully, I appreciate you taking the time to read the code.

I stumbled across this topic hoping for a fix or workaround. I have it narrowed down (in my case at least) and maybe this will be a breadcrumb for you.

I have a project that uses MQTT (internally it uses TCPClient) and was stable. It would run for days without any incident and it would only stop because I reflashed the code. Then I got the idea to add a signal strength indicator and started calling WiFi.RSSI(). Once I did that, the Spark became unstable and I started seeing the symptoms that you described. I have auto-reconnect code and output to serial, so I know the Spark was trying to reconnect but failing. It seems that WiFi was off in the weeds due to asking for signal strength.

Gut all references to the WiFi object and see where that gets you. I removed it from my project and it’s back to working, even using TCPClient.

Interesting idea @brianw13a, I have had the WiFi.RSSI() for a while. I’ll remove that and WiFi.SSID() from my code for the next test. Thanks for the info!

It looks like WiFi.RSSI() blocks for up to 1 second of real time while waiting for the TI CC3000 to send the scan table results, so it is very possible that this could interfere with other things happening on the TI side.

I was under the impression that the TI CC3000 is scanning all the time and this method is just collecting and reporting the scanning results but there could be a priority problem in the TI driver where this is treated as a high priority somehow. It could also be that this triggers a scan and takes resources away from other things in the TI part. Interesting!

The SSID is just read out of the same structure that holds the current IP address and the current gateway information, so it should be safe to access at any time since it was read out of the TI chip when you got your IP address.

After 3 days of testing:

  • No errors lasting longer than 10 minutes, after each error the core seemingly recovered completely.
  • Every 1-3 hours there has been an error in the heartbeat message, each lasting less than 10 minutes.
  • During the 3 days, there were three reboots of the core for unknown reasons. The intervals between these reboots were not evenly spaced at all, so they don’t appear to be caused by a build-up of some sort of memory leak.

While I’m still seeing frequent errors, the length of time of those errors has drastically reduced and recovery is seemingly happening every time, so I still see it as progress.

@bko, I thought that could be the case for WiFi.SSID(), but I honestly don’t really need it and only put it in because it looked good next to WiFi.RSSI() in the code (though it is good information to know).

Also, I was informed that “Connection: close” is not supported in HTTP 1.0, and after a little research I am switching to HTTP 1.1 for the next round of testing.

The next round of testing will only remove the WiFi function calls and change to HTTP 1.1, and I’ll post the results in a day or so.

I’m at a loss. Everything was looking great… for 23 hours. Between 2pm yesterday and 1pm today, there were only 3 failures, and each was only a single missed status (so each lasted less than 2 minutes), which works out to a reliability rate of 99.78% with an estimated mean failure time of less than 1 minute. This is still not perfect, but definitely a massive improvement.
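For reference, the 99.78% figure can be reproduced with a few lines, assuming one heartbeat per minute (the 60-second interval in the code) over 23 hours, so 1380 expected messages with 3 missed. The function name reliabilityPercent is made up for this sketch.

```cpp
// Percentage of expected heartbeats that actually arrived.
// Assumes "expected" counts one heartbeat per minute of the test window.
double reliabilityPercent(unsigned missed, unsigned expected) {
    if (expected == 0) {
        return 0.0;  // no data, so report nothing delivered
    }
    return 100.0 * (1.0 - static_cast<double>(missed) / expected);
}
```

With missed = 3 and expected = 23 * 60 = 1380, this gives roughly 99.78.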

Unfortunately at 1:20 pm, the core stopped responding until 1:55. During this 35 minute failure:

  • There is no indication that the core rebooted
  • The messageCounter of the last status message was 14. When the core started sending messages again, the messageCounter of the first status message was 15. Based on previous failures, I don’t think that the core paused the entire time, and looking at the other results it takes about 35 minutes for the core to do 30 messages, so instead it looks like the core kept looping at the same rate, just without sending the status message.

I am not sure where to go from here. The core is getting more stable, but I am not sure what more I can strip out and still test functionality of the core in a way that I can build back to an application that will work for my garage.

I’m going to run the core for another 24 hours and see what I get, if there is another error I can compare the results, and maybe that can provide me with some idea of what is going on and where to improve the code.

Current code:

SYSTEM_MODE(SEMI_AUTOMATIC);
   int heartbeatTimer;
    
// Declare logCounter
    int messageCounter = 0;
    
// Declare / Initialize bootNotificationSent
    bool bootNotificationSent = false;

// Begin: Declare/Initialize for relay shield
    int RELAY1 = D0;
    int RELAY2 = D1;
    int RELAY3 = D2;
    int RELAY4 = D3;
    int INDICATORLED = D7;
// End:   Declare/Initialize for relay shield

// Begin: Declare / Initialize for TCPServer
    TCPServer server = TCPServer(80);
    TCPClient serverClient;
    TCPClient client;
// End:   Declare / Initialize for TCPServer

    // Function to perform operation received from client
    void performOperation(String operation) {
        // Turn on INDICATORLED while sending message
        char hostname[] = "10.3.2.106";
        int port = 80;
        unsigned long clientTimeoutCounter;
        String operationPerformed = "";
        // Perform actions based on operation
        if (operation == "statusHeartbeat") {
            // Report that Core has booted before first heartbeat
            if (bootNotificationSent == false) {
                operationPerformed = "SparkCoreBoot";
                bootNotificationSent = true;
            } else {
                // Every 15th status, reset counter
                if (messageCounter < 15) {
                    messageCounter++;
                } else {
                    messageCounter = 0;
                }
                operationPerformed = "status";
            }
        } else {
            // Invalid command, abort
            return;
        }
        // Send http GET
        // Make sure client isn't connected first
        client.flush();
        client.stop();
        // Turn on INDICATORLED
        digitalWrite(D7, HIGH);
        if (client.connect(hostname, port)) {
            client.print("GET /scripts/response.php?type=");
            client.print(operationPerformed);
            client.print("&message=");
// Put in to test the removal RSSI
client.print("NetInfoDisabled_-00_"); 
//            client.print(WiFi.SSID()); // suffix[0]: WiFi SSID of connected network
//            client.print("_");
//            client.print(WiFi.RSSI()); // suffix[1]: WiFi signal strength
//            client.print("_");
            client.print(Spark.connected()); // suffix[2]: Status to determine if connected to Spark Cloud
            client.print("_");
            client.print("0"); // suffix[3]: Status of RELAY1
            client.print("_");
            client.print("0"); // suffix[4]: Status of RELAY2
            client.print("_");
            client.print(messageCounter); // suffix[5]: logCounter
            client.print(" HTTP/1.1\r\n");
            client.print("Connection: close\r\n");
            client.print("Host: ");
            client.print(hostname);
            client.print("\r\nContent-Length: 0\r\n\r\n");
            clientTimeoutCounter = millis();
            while (client.available() == false && millis() - clientTimeoutCounter < 1000) {
                // Just wait up to 1000 millis
            }
            clientTimeoutCounter = millis();
            while (client.read() != -1 && millis() - clientTimeoutCounter < 1000) {
                client.read();
            }
            client.flush();
            client.stop();
        }
        // Turn off INDICATORLED
        digitalWrite(D7, LOW);

    }

void setup() {

    // Connect to Spark Cloud (System mode is set to semi-automatic)
        Spark.connect();

    // Begin: Setup for relay shield

        // Initialize the relay control pins as outputs
        pinMode(RELAY1, OUTPUT);
        pinMode(RELAY2, OUTPUT);
        pinMode(RELAY3, OUTPUT);
        pinMode(RELAY4, OUTPUT);
        pinMode(INDICATORLED, OUTPUT);
        // Initialize all relays to an OFF state
        digitalWrite(RELAY1, LOW);
        digitalWrite(RELAY2, LOW);
        digitalWrite(RELAY3, LOW);
        digitalWrite(RELAY4, LOW);
        digitalWrite(INDICATORLED, LOW);
    // End:   Setup for relay shield

    // Initialize heartbeatTimer
    heartbeatTimer = Time.now();
}

void loop() {

    // IF for heartbeat functionality
    if (Time.now() > heartbeatTimer + 60) {
        heartbeatTimer = Time.now();
        performOperation("statusHeartbeat");
    }
}

I ultimately had a failure even after removing RSSI. Just on a whim I decided to see how long my loop() was taking and was shocked to see ~500ms. Thinking this was way too long I split most of the network comms up using a loop counter. I’m now seeing a typical loop() execution time of 5ms with the occasional 120ms which I attribute to the RSSI call. It’s a far cry from 500ms and it doesn’t seem to cause any problems so I’m not really concerned about it at this point.
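One way to read "splitting the network comms up using a loop counter" is to do only one piece of work per loop() pass, chosen by the counter. The sketch below is a guess at the general shape, not the actual project code: the Task names and taskForPass are hypothetical.

```cpp
#include <cstdint>

// Hypothetical task list; the post does not say how the work was divided.
enum class Task { CheckInputs, SendStatus, ReadReply, Housekeeping };

// Number of enumerators in Task; must be kept in sync with the enum above.
constexpr std::uint32_t kTaskCount = 4;

// Pick exactly one task per loop() pass so that no single pass performs all
// of the network work and loop() returns quickly.
Task taskForPass(std::uint32_t loopCounter) {
    return static_cast<Task>(loopCounter % kTaskCount);
}
```

In loop(), a counter incremented on every pass would be fed to taskForPass and the result handled in a switch, so each pass stays short.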

According to the spark docs, there is some internal voodoo that executes between loop() calls and I wonder if the stability is related to timing. I also suspect that part of the problem is exceeding the networking buffer(s) because sending lots of data certainly causes me problems too.

I am also having strange results using the server capability. I would be happy to hear whether anyone has had success with a particular sequence of calls.
Following the example from the docs did not get me too far.
A general question: should the TCPServer be able to work together with Spark Cloud functions?

Yes is the answer, and as I understand it, the Spark team tests each new release with firmware that runs several TCP connections and a cloud connection at the same time.

I personally don't have much experience with TCPServer, but I have used TCPClient a good deal, both with and without cloud variables and published events, and with some mildly careful programming I don't have any real problems.

I have just created some test code that reproduces the problem I am having. If the Spark team is willing to check it, I can email it to them. Maybe the order of my code is wrong, or a certain sequence triggers a bug?