Active TCPServer / Client appears to crash Photon when WiFi disconnects

After a lot of testing I am pretty sure that there is problem with TCPServer/Client and WiFi interaction which causes the entire Photon to reset when WiFi connectivity disappears at a point when there isTCP activit - I’ve tried firmware revisions 0.5.3 and 0.6.0 and see the same outcome with both.

I have tried various test programs including a very simple web server (code below based on an example) and a test program in Python to access it I have also set up a WiFi access point for testing that only this one device connects to… If I run the test program under normal conditions it works fine and doesn’t fail any web requests. But if I then disconnect the Ethernet cable to the access point the Photon first loses connection to the particle cloud (cyan flashing) and then in around 1 case in 3 it results in a full reset - sometimes the LED goes to flashing green before going to flashing red and other times straight to flashing red.

I have monitored the WiFi.ready() result and in all cases the result is still true (WiFi is ready) when the reset occurs so there seems to be no way to catch this situation and avoid the call to the TCP code which I believe is triggering the crash.

SYSTEM_MODE(AUTOMATIC);
char ssid[] = "rdint02";
char password[] = "testtesttest";
TCPServer server(80);

void setup() {
  Serial.begin(115200);      // initialize serial communication
  delay(3000);
  Serial.printlnf("System version %s", System.version().c_str());

  WiFi.on();
  WiFi.setCredentials(ssid,password);
  WiFi.connect();

  while (WiFi.connecting()) {
    // print dots while we wait to connect
    Serial.print(".");
    delay(300);
  }

  IPAddress localIP = WiFi.localIP();
  while (localIP[0] == 0) {
    localIP = WiFi.localIP();
    Serial.println("waiting for an IP address");
    delay(1000);
  }

  Serial.printf("SSID: %s, IP: ", WiFi.SSID());
  Serial.println(WiFi.localIP());
  server.begin();
}

int lastWiFiReady = false;
int lastParticleConn = false;
void loop() {
    if ((lastWiFiReady != WiFi.ready()) || (lastParticleConn != Particle.connected()))
        Serial.printlnf("WiFiReady %d ParticleConn %d", WiFi.ready(), Particle.connected());
    lastWiFiReady = WiFi.ready();
    lastParticleConn = Particle.connected();

    TCPClient client = server.available();
    if (client) {
        bool currentLineIsBlank = true;
        while (client.connected()) {
            if (client.available()) {
                char c = client.read();
                /*Serial.write(c);*/
        if (c == '\n' && currentLineIsBlank) {
              // send a standard http response header
              client.println("HTTP/1.1 200 OK");
              client.println("Content-Type: text/html");
              client.println("Connection: close");  // the connection will be closed after completion of the response
              client.println();
              client.println("<!DOCTYPE HTML><html>Hello World</html>");
              break;
        }
        if (c == '\n') {
              // you're starting a new line
              currentLineIsBlank = true;
        }
        else if (c != '\r') {
          // you've gotten a character on the current line
          currentLineIsBlank = false;
        }
      }
    }
    // give the web browser time to receive the data
    delay(1);
    // close the connection:
    client.stop();
  }
}

For reference this is the Python code I’m using to test:

import time
import serial
import threading
import requests

comport = "COM8"
failureCounts = {
    "success": 0,
    "requestOtherException": 0,
    "requestTimeouts": 0,
    "requestConnErr": 0,
    "serialTimouts": 0,
    "serialExceptions": 0,
    "ReopenedSerialPorts": 0,
    "non200StatusCodes": 0
}

# Read data from serial port and echo
def serialRead(ser):
    global serialIsClosing
    serExcept = False
    while True:
        # Handle closing down
        if serialIsClosing:
            break
        # Get a char if there is one
        try:
            if ser.isOpen():
                val = ser.read(1)
                if len(val) == 0:
                    continue
                print(val.decode("utf-8"), end="")
            else:
                ser = serial.Serial(port=comport, baudrate=115200, timeout=1)
                if ser.isOpen():
                    print("Reopened Serial Port")
                    failureCounts["ReopenedSerialPorts"] += 1
            serExcept = False
            serTimeoutExcept = False

        except serial.SerialTimeoutException:
            failureCounts["serialTimouts"] += 1
            if not serTimeoutExcept:
                print("Serial timeout exception")
                serTimeoutExcept = True
                serExcept = False
                ser.close()
        except serial.SerialException:
            failureCounts["serialExceptions"] += 1
            if not serExcept:
                print("Serial exception")
                serExcept = True
                serTimeoutExcept = False
                ser.close()

print("Basic web server stress test")

# Serial connection
serialIsClosing = False
serPort = serial.Serial(port=comport, baudrate=115200, timeout=1)

# Thread for reading from port
thread = threading.Thread(target=serialRead, args=(serPort,))
thread.start()

def showFailCounts():
    print(".......", end="")
    for key, value in failureCounts.items():
        print(" ", key, value, end="")
    print()

# Test web using request
for i in range(100000):
    try:
        r = requests.get('http://192.168.0.129', timeout=10.0)
        # print(r.status_code)
        if r.status_code != 200:
            failureCounts["non200StatusCodes"] += 1
        else:
            failureCounts["success"] += 1
    except requests.exceptions.ConnectTimeout:
        failureCounts["requestTimeouts"] += 1
    except requests.exceptions.ConnectionError:
        failureCounts["requestConnErr"] += 1
    except requests.exceptions.RequestException as excp:
        failureCounts["requestOtherException"] += 1
        print("Other exception", excp)
    showFailCounts()
    time.sleep(1)

I’m not sure why your test is crashing, but here’s what I tried:

The Photon code is a simple web server that returns an index.html file. The server side code makes a new HTTP request to the Photon every second. I would disconnect the Ethernet to the access point for about 5 seconds, long enough to stop incoming connections but not go to blinking cyan. Then wait for things to stabilize again, then disconnect for long enough to go to blinking cyan. I repeated this 5 times, which was about 330 connections. I didn’t run into any trouble.

That’s not to say a bug doesn’t exist, but I haven’t seen it happen yet. I tested with 0.5.3.

Here’s the Photon code:

#include "Particle.h"

SYSTEM_THREAD(ENABLED);

// Pages
// [start a87cffa2-e342-4f2c-9070-72b710d606c3]
// name=/index.html contentType=text/html size=293 modified=2016-11-08 12:50:47
const char fileData0[] =
"<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
"<!DOCTYPE html>\n"
"<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
"<head>\n"
"<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />\n"
"\n"
"<title>Test Page</title>\n"
"\n"
"</head>\n"
"<body>\n"
"\n"
"<div id=\"main\">\n"
"\n"
"<p>Test Page</p>\n"
"\n"
"</div> <!-- main -->\n"
"\n"
"</body>\n"
"</html>\n"
"";

typedef struct {
	const char *name;
	const char *mimeType;
	const uint8_t *data;
	size_t dataSize;
	unsigned long modified;
	bool isBinary;
} FileInfo;

const FileInfo fileInfo[] = {
	{"/index.html", "text/html", (const uint8_t *)fileData0, sizeof(fileData0) - 1, 1478627447, FALSE},
	{NULL, NULL, 0, 0, FALSE}
};
// [end a87cffa2-e342-4f2c-9070-72b710d606c3]

static const char *dateFormatStr = "%a, %d %b %Y %T %z";

enum State {
	FREE_STATE,
	READ_REQUEST_STATE,
	WRITE_HEADER_STATE,
	WRITE_RESPONSE_STATE
};

const int MAX_CLIENTS = 5;
const int LISTEN_PORT = 7123;
const int CLIENT_BUF_SIZE = 1024;
const int MAX_TO_WRITE = 1024;
const unsigned long INACTIVITY_TIMEOUT_MS = 30000;

class ClientConnection {
public:
	ClientConnection();
	virtual ~ClientConnection();

	void loop();
	bool accept();

protected:
	void clear();
	void readRequest();
	void generateResponseHeader();
	void writeResponse();

private:
	uint8_t clientBuf[CLIENT_BUF_SIZE+1];
	State state;
	int clientId;
	TCPClient client;
	int readOffset;
	int writeOffset;
	unsigned long lastUse;
	time_t startTime;

	// Response data
	int responseCode;
	String responseStr;
	const FileInfo *fileToSend;

	const uint8_t *sendBuf;
	size_t sendOffset;
	size_t sendLen;

};


String localIP;
TCPServer server(LISTEN_PORT);
ClientConnection clients[MAX_CLIENTS];
int nextClientId = 1;


void setup() {
	Serial.begin(9600);

	waitUntil(Particle.connected);

	// From CLI, use something like:
	// particle get test5 localip
	// to get the IP address of the Photon (replace "test5" with your device name)
	localIP = WiFi.localIP(); // localIP must be a global variable
	Particle.variable("localip", localIP);
	Serial.printlnf("server=%s:%d", localIP.c_str(), LISTEN_PORT);

	server.begin();
}

void loop() {
	// Handle any existing connections
	for(int ii = 0; ii < MAX_CLIENTS; ii++) {
		clients[ii].loop();
	}

	// Accept a new one if there is one waiting (and we have a free client)
	for(int ii = 0; ii < MAX_CLIENTS; ii++) {
		if (clients[ii].accept()) {
			break;
		}
	}

}


ClientConnection::ClientConnection() : state(FREE_STATE) {
	clear();
}

ClientConnection::~ClientConnection() {
}

void ClientConnection::loop() {
	if (state == FREE_STATE) {
		return;
	}

	if (client.connected()) {
		switch(state) {
		case READ_REQUEST_STATE:
			readRequest();
			break;

		case WRITE_HEADER_STATE:
		case WRITE_RESPONSE_STATE:
			writeResponse();
			break;
		}

		if (millis() - lastUse > INACTIVITY_TIMEOUT_MS) {
			Serial.printlnf("%d: inactivity timeout", clientId);
			client.stop();
			clear();
		}
	}
	else {
		Serial.printlnf("%d: client disconnected", clientId);
		client.stop();
		clear();
	}
}

bool ClientConnection::accept() {
	if (state != FREE_STATE) {
		return false;
	}

	client = server.available();
	if (client.connected()) {
		lastUse = millis();
		state = READ_REQUEST_STATE;
		clientId = nextClientId++;
		startTime = Time.now();
		Serial.printlnf("%d: connection accepted", clientId);
	}
	return true;
}

void ClientConnection::clear() {
	lastUse = 0;
	readOffset = 0;
	writeOffset = 0;
	state = FREE_STATE;
	fileToSend = 0;
}

void ClientConnection::readRequest() {
	// Note: client.read returns -1 if there is no data; there is no need to call available(),
	// which basically does the same check as the one inside read().

	size_t toRead = CLIENT_BUF_SIZE - readOffset;
	if (toRead == 0) {
		// Didn't get end of header
		Serial.printlnf("%d: didn't receive end-of-header", clientId);
		client.stop();
		return;
	}

	int count = client.read(&clientBuf[readOffset], toRead);
	if (count > 0) {
		readOffset += count;
		clientBuf[readOffset] = 0;

		if (strstr((const char *)clientBuf, "\015\012\015\012")) {
			// Ignore the actual request and just return the index.html data
			responseCode = 200;
			responseStr = "OK";
			fileToSend = &fileInfo[0];

			Serial.printlnf("%d: sending %s", clientId, fileToSend->name);
			generateResponseHeader();
		}
		lastUse = millis();
	}
}



void ClientConnection::generateResponseHeader() {
	char *dst = (char *)clientBuf;
	char *end = &dst[CLIENT_BUF_SIZE];

	// Generate HTTP response header
	// HTTP/1.0 200 OK
	dst += snprintf(dst, end - dst, "HTTP/1.0 %d %s\r\n", responseCode, responseStr.c_str());

	// Date
	String s = Time.format(Time.now(), dateFormatStr);
	dst += snprintf(dst, end - dst, "Date: %s\r\n", s.c_str());

	if (responseCode == 200 && fileToSend) {
		// Content-Type
		if (fileToSend->mimeType) {
			dst += snprintf(dst, end - dst, "Content-Type: %s\r\n", fileToSend->mimeType);
		}

		// Content-Length is the length if known. contentLength is initialized to -1 (not known)
		// but it's good to set it if you know, because not settings a content length means keepalive
		// cannot be used.
		// For HEAD, Content-Length is the length the body would be, not the actual length (0 for HEAD).
		if (fileToSend->dataSize >= 0) {
			dst += snprintf(dst, end - dst, "Content-Length: %d\r\n", fileToSend->dataSize);
		}

		// Last-Modified
		if (fileToSend->modified != 0) {
			s = Time.format(fileToSend->modified, dateFormatStr);
			dst += snprintf(dst, end - dst, "Last-Modified: %s\r\n", s.c_str());
		}
	}


	// End of header
	dst += snprintf(dst, end - dst, "\r\n");

	// Now send
	sendBuf = clientBuf;
	sendOffset = 0;
	sendLen = dst - (char *)clientBuf;
	state = WRITE_HEADER_STATE;
}


void ClientConnection::writeResponse() {
	if (sendOffset == sendLen) {
		if (state == WRITE_HEADER_STATE && fileToSend) {
			// Write body now
			sendOffset = 0;
			sendBuf = fileToSend->data;
			sendLen = fileToSend->dataSize;
		}
		else {
			// Done
			Serial.printlnf("%d: send complete", clientId);
			client.stop();
			return;
		}
	}
	size_t bytesToWrite = sendLen - sendOffset;
	if (bytesToWrite >= MAX_TO_WRITE) {
		bytesToWrite = MAX_TO_WRITE;
	}

	int count = client.write(&sendBuf[sendOffset], bytesToWrite);
	if (count == -16) {
		// Special case on Photon; buffer is full, retry later
	}
	else
	if (count > 0) {
		sendOffset += count;
	}
	else {
		Serial.printlnf("%d: error writing %d", clientId, count);
		client.stop();
	}

}

Thanks for the quick response. I’ve tested your code and get the same behavior as before. Please could you try it and wait a bit longer after disconnecting the Ethernet cable. The problem occurs some time after the Ethernet is disconnected - i.e. around 1 to 3 minutes - and it does not happen every time - perhaps 1 time in 3.

I have recorded a short video that shows the problem occurring with the test code you sent me. Hopefully you can see it here:

https://goo.gl/photos/a8X1mMZHxSkH68RL8

In the video the first time I disconnect the Ethernet cable it goes into fast green flashing and continues to cycle there - so I reconnected the cable an tried again (after re-connection to the particle cloud) - I then disconnected a second time and, around 30 seconds after the loss of connection, the photon reboots.

In more detail what is happening in the video is as follows:

  • Ethernet disconnected at around 50 seconds into the video
  • continues to breath cyan for around 20 seconds
  • flashes cyan for around 30 seconds
  • flashes green fast for around 30 seconds
  • magenta very briefly?
  • flashes green fast again and cycles through here
  • then I reconnect the Ethernet cable
  • quickly goes back to breathing cyan
  • Disconnect Ethernet again at around 2 minutes 48 in the video
  • continues to breath cyan for around 15 seconds
  • flashes cyan for around 20 seconds
  • flashes red indicating that the Photon is rebooting

In the scenario where: I disconnect the Ethernet from the access point and wait long enough for the Photon to blink red once then go back to blinking green and then blinking cyan, then reconnect the Ethernet. And repeat the whole process again, I get an SOS after the second time I reconnect the Ethernet.

It might be a few days before I can run this with the debugger and see where it’s crashing, however.

Thanks again for the quick response and I’m glad that you have reproduced what I was seeing. Please let me know if you make any progress on it. It would be fantastic to achieve a bullet-proof web server as I feel that is an essential component of many IoT applications that would leverage the Photon’s WiFi capabilities. Also, please let me know if there is anything I can do to help with this.

I have ordered a Particle debug board and will hope to try to discover more myself when it arrives. If you make any progress in the interim please let me know.

I now have a debug environment up and running. The problem is manifesting as a “double fault”. This is the information from GDB at the point that the program crashes:

stm32f2x.cpu – clearing lockup after double fault
Program received signal SIGINT, Interrupt.
0xae094710 in ?? ()

Not too much information there I guess!

I’ve tried to isolate to a specific call to TCPServer or TCPClient but so far have been unable to do so. The problem does not occur if no TCP connection is made but I don’t think it occurs in the calling thread directly. The double fault implies an ISR I think so maybe it is in servicing the socket.

Another piece of potentially relevant information in tracking this down is that the RedBear Duo also exhibits the same behavior despite having a very different WiFi implementation (I believe). So I guess this points to some common code between the two which might be relevant.

I’m really hoping not to have to dig too deeply into the socket code so if you guys are able to discern anything from what I’ve found so far and fix the problem that would be great!

Hi @rickkas7 have you by any chance had time to look at the TCP/WiFi stability problem? I have tried to make progress on it but there is a lot of code dealing with the WICED chip that I don’t understand. I have found that the same problem occurs on the RedBear Duo so I guess it must be in code that is shared between Photon and Duo. Not sure if that helps narrow things down though. Any progress would be greatly appreciated. I think I’ve seen at least one other thread where the same kind of problem may occur but I didn’t make a note of it and a recent search came up with nothing.

I haven’t tried it yet, but I’ll try running it with OpenOCD and see if I can get a better idea of what’s happening, hopefully next week.

Thanks that would be great - I really want to find a way to achieve stable operation regardless of what happens to the WiFi - Rob

Hi @rickkas7 - did you get anywhere with this?

I tried a test using a monolithic debug build, system firmware 0.6.0, with the debugger attached. I tried a couple times using the technique above, and got it to stop responding to network requests, so something got messed up in the network stack, but I haven’t gotten it to system fault with the debugger attached yet, unfortunately.

I’d guess that the set of operations is probably causing memory corruption or using a free object or something along those lines, but that’s really a pain to track down.

I had a similar problem a while back and made sure the client socket object of a tcp_sever is removed from its list after the disconnection. I created a pull request. Do you mind trying it out?

I have tried this fix now. I’m pleased to report that I have not seen the device hang up or enter SOS state with this fix. Having said that I also tested the exact same code base without the fix and I don’t see the same behavior as previously (again no SOS kernel panic) but I did get one lock-up without the fix. I’m planning to set up an automated test bed again but that will probably be in a few days time as I’ve got a busy week ahead. Thanks so much @tlangmo and @rickkas7 for investigating this and providing what looks like a great fix.

I have tested this code with an automated test which uses a raspberry pi to switch a WiFi hub on and off at intervals and then exercises a test web server running on the Photon. The Photon no longer goes into panic mode so that is a great step forwards. However, the Photon also does not reconnect the TCPServer socket when the WiFi is reconnected and, hence, the test web server code doesn’t work after the failure. I’ve tried a few variations of code to attempt to delete the TCPServer and create a new one but, so far, haven’t hit on a winning formula. I’ll continue to test and attempt to understand what is happening at the WICED level in order to see why the socket doesn’t get reconnected.

1 Like