TCP write lockup

Jseiler · February 26, 2018, 9:18pm

Hello,

I have been working on a project for the past 3 months to upgrade or add functionality to an existing product using either the Photon or Redbear Duo.

Our system is a nurse call system that communicates with individual devices using a proprietary protocol very close to Zigbee. Our receivers take this information and return it to the main computer via a slow CANbus, 20kbits/s.

My project is to create a wireless link between two CANbus nodes to tie two systems together when no other wired link exists.

What I am finding is that a call to client.write can lock the system up, seemingly permanently, warranting a push of the reset button. Recently, though, I noticed that it sometimes clears itself, occasionally within minutes but other times only after an hour or so.

I have tried the application watchdog but this apparently has no effect when running in threaded mode as it doesn't reset the photon after a minute of the loop being stuck.

This is getting long, so let me get to some numbers and a few questions.

We process between 1.5 and 2 million CANbus messages per day. These arrive in batches of up to 150 messages per second, roughly 6.7ms between messages. They are transferred immediately on receipt over TCP as a bundle of 22 bytes for each CANbus message. Checking the number of bytes written shows that when client.write returns 0, the Photon locks up and no longer runs through the loop. Time between lockups can be anywhere from a few minutes to well over a week. The basic system, which does not use TCP to transfer messages, is rock solid and has been running for well over 100 million CANbus messages. It's only now that I'm trying to add TCP that I'm running into trouble.
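For a sense of scale, the figures above can be worked through as a back-of-envelope calculation. This is only a sketch: the constant names are illustrative, and the byte rate counts the 22-byte payloads only, not TCP/IP header overhead.

```cpp
#include <cstdint>

// Figures taken from the post; the names are illustrative only.
constexpr std::uint32_t kPeakMessagesPerSecond = 150;
constexpr std::uint32_t kBytesPerMessage = 22;
constexpr std::uint32_t kMessagesPerDay = 2000000;  // upper end of 1.5 to 2 million
constexpr std::uint32_t kSecondsPerDay = 24 * 60 * 60;

// Peak TCP payload rate: 150 * 22 = 3300 bytes per second.
constexpr std::uint32_t kPeakBytesPerSecond = kPeakMessagesPerSecond * kBytesPerMessage;

// The same rate in bits: 3300 * 8 = 26400 bits per second.
constexpr std::uint32_t kPeakBitsPerSecond = kPeakBytesPerSecond * 8;

// Gap between messages at peak, in microseconds: 1000000 / 150 = 6666 (about 6.7ms).
constexpr std::uint32_t kPeakGapMicros = 1000000 / kPeakMessagesPerSecond;

// Average rate over a full day: 2000000 / 86400 = 23 per second (integer division).
constexpr std::uint32_t kAverageMessagesPerSecond = kMessagesPerDay / kSecondsPerDay;
```

The peak payload rate is only a few kilobytes per second, so the lockups are unlikely to be a matter of raw bandwidth alone.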

There are a number of other threads, and the following pull request,

github.com/particle-iot/device-os

Photon/P1: TCPClient: non-blocking, blocking, blocking with timeout writes support

particle-iot:develop ← particle-iot:feature/socket-send-nonblocking-timeout

opened 08:47PM - 31 Jan 18 UTC

avtolstoy

+170 -49

### Problem

Unlike `TCPClient::read()`, `TCPClient::write()` is a blocking call and whenever TCP send queue gets completely filled (see #1461 for a particular case on Photon/P1 with low wifi signal), the application thread may be blocked for an indefinite amount of time. There needs to be a way to specify the maximum amount of time the write should take.

### Solution

New `TCPClient::write()` overloads are added:
- `size_t write(uint8_t b, system_tick_t timeout);`
- `size_t write(const uint8_t *buf, size_t size, system_tick_t timeout);`

### Additional notes

`write()` methods return `size_t`, which is an unsigned integer, however in case of an error, `TCPClient::write()` method may return a negative error code cast to `size_t`. We are still keeping this behavior, with a thought of deprecating it in the next release. Applications should use `TCPClient::getWriteError()/clearWriteError()` methods to check for errors, while `TCPClient::write()` should only return the number of bytes written.

### Steps to Test

- `wiring/api`

### Example App

```c++
#include "application.h"

SYSTEM_MODE(SEMI_AUTOMATIC);

Serial1LogHandler dbg(115200, LOG_LEVEL_ALL);

TCPServer tcpServerNonBlocking(5000);
TCPServer tcpServerBlockingTimeout(5001);
TCPServer tcpServerBlocking(5555);

void setup() {
    WiFi.on();
    WiFi.connect();
    WiFi.selectAntenna(ANT_EXTERNAL);
}

static void sendBunchOfData(TCPClient&& client, system_tick_t timeout) {
    while (client.connected()) {
        uint8_t buf[500];
        memset(buf, 'a', 500);
        int rc = client.write(buf, sizeof(buf), timeout);
        LOG(TRACE, "%d", rc);
    }
}

void loop() {
    sendBunchOfData(tcpServerNonBlocking.available(), 0);
    sendBunchOfData(tcpServerBlockingTimeout.available(), 1000);
    sendBunchOfData(tcpServerBlocking.available(), SOCKET_WAIT_FOREVER);
}
```

Connect to the appropriate port using telnet or something similar, while looking at the log place a finger onto the external antenna connector. Device should be unable to send data, causing TCP send queue to fill up.

The behavior on each of the ports should be the following:
- 5000: as soon as the tcp queue is full, the log should be filled with a constant flow of -16 error codes
- 5001: as soon as the tcp queue is full, the log should show -16 error codes every second
- 5555: as soon as the tcp queue is full, there should be no log entries, indicating that the application thread is blocked

### References

- #1461
- particle-iot/photon-wiced#25
- [CH10639]

---

### Completeness

- [x] User is totes amazing for contributing!
- [x] Contributor has signed CLA ([Info here](https://github.com/spark/firmware/blob/develop/CONTRIBUTING.md))
- [x] Problem and Solution clearly stated
- [ ] Run unit/integration/application tests on device
- [ ] Added documentation **NEEDS TO BE UPDATED**
- [ ] Added to CHANGELOG.md after merging (add links to docs and issues)

that I have looked at and it appears I am not alone.
The last one is probably the final solution and I will try this after I ask a few questions.

When opening a TCP connection when should you close it? I suspect that I should not open and close the connection every 6ms. Our system runs 24 hours a day with a maximum lull in communication of about 30 seconds. I suspect you open the connection and use it till client.connected fails or an artificial keep alive signal fails in which case you close the connection and try to open a new one. I find that sometimes client.connected fails many times a minute and other times it runs for days without failure, I guess depending on how noisy the Wifi environment is.

When client.write returns 0, what should be done? Is the message stuck in a buffer? Is it lost? If this happens, should I close the connection?

If the solution offered in system firmware 0.8.0-rc.2 is what I should try, then what does the timeout mean? Again, is the message stuck in a buffer or is it lost? Should I close the connection after the timeout?

I am running two photons, one server, one client. System firmware 0.7.0-rc.7
Both in threaded mode
RSSI is good between 58 and 62
Free memory as reported by System.freeMemory() is at 41984 during moments without traffic, dipping to around 32000 when there is traffic.

I'm not a programmer by profession but have been learning over the last year as best I can. Any help with this issue is greatly appreciated. What I'm looking for is a way to keep the photon running without having to reset it as the final product is not readily accessible. The information carried by our CANbus is time sensitive so we can't afford to either lose messages or delay them by minutes let alone hours.

Thanks in advance,
Joerg Seiler

Elco · February 26, 2018, 11:06pm

For reference, I’m shooting at multiple bugs related to TCP and WiFi.

Some fixes have been made by @avtolstoy, but I have uncovered more bugs since.

I’m reporting my findings under the first issue where he mentioned the fixes, because it’s easier to follow and multiple bugs are targeted at once.

The write timeout is indeed a good start, but the function still needs to support all overloads of Print to be easy to use. An alternative implementation is that timeout is made a property of the class, which will simplify use and doesn’t require overloading Print.

TcpClient.read() should also return a negative value on error, but I don’t think that negative value is properly propagated to the caller. I do see errors being logged to Serial when compiling with DEBUG_BUILD=y defined, but my own error handler never sees the negative return value.

Jseiler · February 27, 2018, 12:36am

Thank you for your prompt reply @Elco

I have been following you digging deep into this problem and have incorporated some of your workarounds into my code, making it more robust, especially the part about calling client.stop before accepting a re-connection. I also incorporated something very similar to your tic toc example: if I don’t hear anything back in a few seconds (I wait for 10 seconds), I call client.stop and try to connect again. This usually works almost instantly unless there is an issue with Wifi, like you’ve already discovered.
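The silence check described here could be sketched roughly as follows. This is a minimal illustration, not the actual code from either project: `KeepAliveMonitor` and its method names are hypothetical, and timestamps are passed in as plain millisecond counts so the class can be tried off-device.

```cpp
#include <cstdint>

// Hypothetical helper: remembers when the peer was last heard from and
// reports when it has been silent for longer than the allowed time.
class KeepAliveMonitor {
public:
    explicit KeepAliveMonitor(std::uint32_t timeoutMs) : timeoutMs_(timeoutMs) {}

    // Call whenever any data, including a keep-alive reply, arrives.
    void noteTraffic(std::uint32_t nowMs) { lastHeardMs_ = nowMs; }

    // True when the peer has been silent for longer than the timeout.
    // Unsigned subtraction keeps the result correct when the millisecond
    // counter wraps around.
    bool expired(std::uint32_t nowMs) const {
        return static_cast<std::uint32_t>(nowMs - lastHeardMs_) > timeoutMs_;
    }

private:
    std::uint32_t timeoutMs_;
    std::uint32_t lastHeardMs_ = 0;
};
```

On the device, the idea would be to call `noteTraffic(millis())` after every successful read and, once `expired(millis())` returns true, call `client.stop()` before attempting to connect again.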

One other thing that I have discovered is that the tic toc is very erratic, most of the time there is an instant response but as the speed increases messages seem to be hoarded somewhere. There will be a break of a few seconds and then a batch of messages all show up at once. I don’t know whether the photon is doing this or whether this is a router thing. I’m not connected to the particle cloud either, although there was a suggestion that doing so makes TCP more stable, I haven’t noticed this. I am on a specific channel so no channel switching in my home setup. I will check at work tomorrow to see how that is setup as I see disconnects there quite often around the same time of day and all 8 of my devices seem to disconnect at the same time, after a few minutes they all reconnect and all is quiet again.

I have no need to use client.print as all of our data is binary, I find that using client.print for this creates more of a mess, especially for those of us who aren’t master programmers.

@avtolstoy, may I ask what happens when client.write times out? Is the message lost, so that I need to try sending again? Or is it just sitting in a buffer waiting to go out? Or is the timeout a signal that I need to disconnect and reconnect? Since for me messages flow in every 6.7ms or so, what is a reasonable timeout value? Is 6ms too short of a timeout?

Sorry if these questions seem very basic but the TCP/IP world is very new to me and the few books I’ve gotten from the local library are scaring me a little as it seems that one can make a career just working with TCP/IP when all I want to do is transfer messages back and forth reliably.

Thanks,
Joerg

Elco · February 27, 2018, 12:46am

I see this too. Sometimes messages don’t arrive for a while and then 10 of them arrive at the same time.
I don’t know where these are queued either.

It could be a hiccup in the network and the system framework keeps them in the queue, but it could also be that the user thread schedules them and the system thread is busy doing something else and is not handling tcp for a while, so they are not actually sent. I don’t know but would like to find out the details. Just confirming that I see the same behavior.

avtolstoy · February 27, 2018, 2:27am

I'd say the timeout should be chosen depending on the requirements of your application. If you can't afford to wait longer than 6ms blocked in write() operation, then such a value sounds reasonable.

I'll quote from the unreleased documentation for TCPClient::write:

This function also takes an optional argument timeout, which allows the caller to specify the maximum amount of time the function may take. If timeout value is specified, write operation may succeed partially and it's up to the caller to check the actual number of bytes written and schedule the next write() call in order to send all the data out.

The application code may additionally check if an error occurred during the last write() call by checking getWriteError() return value. Any non-zero error code indicates an error during write operation.

NOTE: write() currently may return negative error codes. This behavior will change in the next major release (0.9.0). Applications will be required to use getWriteError() to check for write errors.

So, if you provide a timeout value for write(), you need to check the returned value:

If it's <= 0 none of the supplied data was written into TCP send queue
If it's > 0, you need to calculate how much of the data from the supplied buffer was not written into the TCP send queue, and schedule the next write() as appropriate
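The two cases above can be sketched as a small retry loop. This is an illustration under assumptions rather than the firmware's own code: `writeWithRetry` and `WriteResult` are hypothetical names, and `Client` stands for any type offering the three-argument `write(buf, size, timeout)` overload described in the pull request, such as `TCPClient` on firmware that includes it.

```cpp
#include <cstddef>
#include <cstdint>

// Outcome of trying to push one message into the TCP send queue.
struct WriteResult {
    std::size_t written;  // bytes accepted by the send queue so far
    bool complete;        // true when the whole buffer was accepted
};

// Hypothetical helper. Client is any type with
//   size_t write(const uint8_t* buf, size_t size, <tick type> timeout)
template <typename Client>
WriteResult writeWithRetry(Client& client, const std::uint8_t* buf, std::size_t size,
                           std::uint32_t timeoutMs, int maxAttempts) {
    std::size_t sent = 0;
    for (int attempt = 0; attempt < maxAttempts && sent < size; ++attempt) {
        // write() may return a negative error code cast to size_t, so the
        // result is inspected as a signed value.
        const int rc = static_cast<int>(client.write(buf + sent, size - sent, timeoutMs));
        if (rc <= 0) {
            break;  // nothing was queued this time; the caller decides what to do
        }
        // Partial success: skip the bytes already queued and send the rest.
        sent += static_cast<std::size_t>(rc);
    }
    return WriteResult{sent, sent == size};
}
```

When `complete` comes back false, the caller still holds the unsent tail of the message (from offset `written` onward) and can retry it later or treat the stall as a reason to reconnect. Stopping on the first `rc <= 0` rather than spinning is a design choice that keeps the loop from blocking the application thread.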

The data that was successfully written will sit in the TCP send queue until it is received and ACKed by the other side or the connection errors out. Please do note that TCP is a streaming protocol and the data may be queued up due to packet losses, buffer overruns or other reasons, so it is not possible to guarantee that your transmissions will happen exactly every 6.7ms for example, even if Nagle's algorithm is disabled. If you have tight timing requirements, perhaps UDP is a better candidate, however you will have to deal with packet losses yourself.
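Because TCP delivers a byte stream rather than discrete messages, the receiving side cannot assume that one read returns exactly one 22-byte message: several may arrive together, or one may be split across reads. A minimal sketch of reassembling fixed-size frames follows; `FrameAssembler` and `kFrameSize` are hypothetical names, and the 22-byte size is taken from the original post.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kFrameSize = 22;  // bytes per CANbus message bundle, from the post

// Hypothetical helper: collects bytes from a TCP stream and hands out
// complete fixed-size frames, however the stream happened to be chunked.
class FrameAssembler {
public:
    // Feed the bytes from one read. onFrame(const uint8_t*) is called once
    // per completed frame. Returns the number of frames delivered.
    template <typename Handler>
    std::size_t feed(const std::uint8_t* data, std::size_t len, Handler onFrame) {
        std::size_t frames = 0;
        while (len > 0) {
            const std::size_t room = kFrameSize - fill_;
            const std::size_t take = len < room ? len : room;
            std::memcpy(frame_ + fill_, data, take);
            fill_ += take;
            data += take;
            len -= take;
            if (fill_ == kFrameSize) {
                onFrame(frame_);
                fill_ = 0;
                ++frames;
            }
        }
        return frames;
    }

    // Bytes of an incomplete frame currently held back.
    std::size_t pending() const { return fill_; }

private:
    std::uint8_t frame_[kFrameSize] = {};
    std::size_t fill_ = 0;
};
```

With this approach, a burst of ten messages arriving in one read is simply delivered as ten frames, and a message split across two reads is delivered once its second half arrives.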

Jseiler · February 27, 2018, 3:56am

Thank you, that helps a lot.

Since TCP is at least an order of magnitude faster than our slow CANbus, if not more, I suspect that writing 22 bytes to the TCP send queue and reporting back the number of bytes written happens in a matter of microseconds. Therefore, if the write times out after 6ms, something is already wrong, which tells me not to write any more data to the queue until whatever is wrong is fixed.

Three questions:

How big is the send queue?
When the number of bytes written is returned by client.write(), does that mean written and received by the other side? Or just written to the send queue?
When you say the data will sit in the TCP send queue "until it is received and ACKed by the other side or the connection errors out", how long is the timeout? How long is the wait for ACK before it errors out, and when it errors out is this when client.connected() fails? Or is some other error thrown?

I was looking at UDP for the very reason that it is recommended for time critical applications but it appears really daunting as all of the "automagic" buried in TCP is now left up to the programmer. It has been my understanding that TCP takes the complexity out of UDP. I'm a little leery of going down a rabbit hole I might not be able to get out of.

I did find this relatively easy to understand example by @ScruffR and @rickkas7

but I have no idea what kind of handshaking and error checking is going to be involved to make it a robust alternative. When TCP works it's awesome, when it breaks it's a bit of a nightmare.
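As one small piece of that error checking, a common starting point with UDP is to prepend a sequence number to every datagram so the receiver can at least detect losses. The sketch below assumes a 16-bit counter added by the sender; `SequenceTracker` is a hypothetical name, and detecting a gap is only the first step, since what to do about it (request a resend, raise an alarm) is left open.

```cpp
#include <cstdint>

// Hypothetical receiver-side tracker for a 16-bit sequence number that the
// sender would prepend to every datagram.
class SequenceTracker {
public:
    // Returns how many datagrams appear to be missing before this one.
    // In-order arrivals return 0. Duplicates and late arrivals also return 0
    // and leave the expected sequence number unchanged.
    std::uint16_t accept(std::uint16_t seq) {
        if (!started_) {
            started_ = true;
            next_ = static_cast<std::uint16_t>(seq + 1);
            return 0;
        }
        // Modulo-65536 distance from the expected number, so wrap-around works.
        const std::uint16_t gap = static_cast<std::uint16_t>(seq - next_);
        if (gap >= 0x8000u) {
            return 0;  // older than expected: duplicate or reordered datagram
        }
        next_ = static_cast<std::uint16_t>(seq + 1);
        return gap;
    }

private:
    bool started_ = false;
    std::uint16_t next_ = 0;
};
```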

Thanks,
Joerg

Jseiler · February 27, 2018, 4:15am

I see this visually, as our product has 3 communication LEDs: one for Zigbee traffic, one for CAN traffic, and of course one for TCP traffic. I send a keep alive message every two seconds and the TCP lights on sender and receiver come on in unison. This is literally like ping pong: message sent, message returned, done. When I send a burst of three messages, the third message is often a straggler by a couple of hundred milliseconds or so. When I send 130 messages a second for 15 seconds, there are quite a few breaks in communication where the send light is on but the receiver light blinks on and off, with the receiver getting the last of the messages about a second after the sender finishes sending. If I send longer streams the receiver keeps up, but messages are received in fits and starts.

The lights are turned on right before client.write() and client.read() and off by a timer routine 50ms later.

Maybe if I’m feeling brave I’ll look at UDP if TCP is too flakey at the moment.

Jseiler · February 27, 2018, 9:22pm

Just wanted to tie this post to another post based on an observation I made between the different behaviour of Redbears and Photons using TCP.

Elco · April 8, 2020, 11:41pm

I’m still having issues with the TCP stack.
I’m wondering whether you guys have found a solution?

I created an issue here:

What I see is that TCPClient writes block indefinitely. UDP might be an option, did you explore that road?
