Upload response: TIMEOUT - B524

I have an issue where I get these "Upload response: TIMEOUT" events, and it is affecting a lot of our B524 modules.

I get more than 200 upload response timeouts every 12 hours. We have upgraded some of our modules to the newest OS, but this does not help; these modules also get the fault.

This issue started maybe 2 months ago; before that we had no issues. It looks like it started as we got more B524 modules into our fleet.

Every time we get this issue, the modules do not upload and we do not get any data, but if I then "RESET" the module, it uploads all data instantly and runs for a while before the problem returns.

How does the upload process work? By publishing events or using some proprietary system?

In any case, the symptoms you've described are common if you have a memory leak. Over time, less memory, or more fragmented memory, is available, eventually leading to a situation where a required block of memory can't be allocated. If you have frequent enough device vitals you might be able to see declining free memory over time, but this won't detect memory fragmentation.

Aside from fixing the leak, adding an out-of-memory handler is the best way to work around the problem. Another option is to periodically call System.reset() on the device, perhaps once a week.
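A minimal sketch of both suggestions, using the documented Device OS out_of_memory system event (the flag-and-reset-from-loop pattern and the weekly interval are choices for illustration, not a verified implementation):

```cpp
#include "Particle.h"

// Set by the handler when an allocation fails; checked from loop() so the
// reset happens from a safe context rather than inside the event handler.
volatile bool outOfMemory = false;

void outOfMemoryHandler(system_event_t event, int param) {
    // param is the size of the allocation that failed
    outOfMemory = true;
}

void setup() {
    System.on(out_of_memory, outOfMemoryHandler);
}

void loop() {
    if (outOfMemory) {
        Log.info("out of memory, resetting");
        System.reset();
    }
    // Fallback: reset once a week regardless (7 days in milliseconds
    // fits in 32 bits; millis() wraps at about 49 days)
    if (millis() > 7UL * 24 * 60 * 60 * 1000) {
        System.reset();
    }
}
```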

The upload process goes from the Particle API to Microsoft Azure Service Bus, through a measurement logger, and then into our SQL database.

We have roughly 200 B524s, and the problem is affecting all of them in random order. So calling a reset once a week will not help, as we get 200 timeouts every 12 hours, with loss of incoming data.

Where is the timeout reported? Is it a webhook timeout?

The receiver of the webhook must respond with an HTTP status code within 20 seconds or the request will fail with a timeout error.

If that's occurring, you need to add more servers, or add jitter to the times your reports are uploaded so the devices don't all send at the same time. There is no way to increase the webhook timeout.
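One way to add that jitter is to randomize each device's reporting interval slightly, so a fleet that booted together drifts apart over time. A small sketch (the function name and the 10% window are illustrative choices):

```cpp
#include <cstdint>
#include <cstdlib>

// Given a base reporting interval in milliseconds, return the next
// interval with up to +/- 10% random jitter. Over many cycles this
// spreads out devices that would otherwise upload in lockstep.
uint32_t jitteredIntervalMs(uint32_t baseMs) {
    uint32_t jitterRange = baseMs / 5;                 // total 20% window
    uint32_t offset = (uint32_t)rand() % (jitterRange + 1);
    return baseMs - baseMs / 10 + offset;              // baseMs +/- 10%
}
```

For a 15-minute interval (900000 ms) this yields something between 810000 and 990000 ms each cycle.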

I get this "ESOCKETTIMEDOUT"

I think it is a webhook timeout; it is just reported in the console event log. We have over 200 B524 units in the field, some of which transmit data every 15 minutes. All the response I get is the text "Upload Response: Timeout", but I guess it is because they do not get the status code within 20 seconds, as you describe.

Is the webhook not designed to handle all of these units? We have other units transmitting the same way, as described earlier, and they do not have this issue, but of course they do not use the Particle webhook.

Socket timeout indicates the recipient of the webhook did not respond within 20 seconds.

The Particle webhook service can handle thousands of devices; the problem is that it expects a response within 20 seconds or it assumes that the destination server is no longer up. If this happens repeatedly, requests to the server can be throttled to prevent overloading the server.

The main error is not the few "ESOCKETTIMEDOUT" errors we get. We get a timeout running PublishQueuePosix::instance().publish().

We are running modified versions of your libraries rickkas7/BackgroundPublishRK (library for doing a Particle publish from a worker thread) and rickkas7/PublishQueuePosixRK (version of the publish queue that stores events on the POSIX flash file system).
They are older versions (PublishQueuePosixRK pre 0.0.2).

The timeout happens inside the "BackgroundPublish::thread_f()" method

We have this code:

if (strncmp(event_name, "HLP1up", 6) == 0) {

    // then wait for publish to complete
    while (/*!ok.isDone()*/ (zHttpResp == WAITING) && (millis() - lastPublish < 10000) && state != BACKGROUND_PUBLISH_STOP)
    {
        // yield to rest of system while we wait
        delay(1);
    }

    if (completed_cb)
    {
        completed_cb(/*ok.isSucceeded()*/ zHttpResp == OK,
            event_name,
            event_data,
            event_context);
    }

    if (zHttpResp == OK && pubRetrying == true) {
        snprintf(dLbufs, 6+1, "^%s", event_data + 17);
        devLog(dLbufs);
        pubRetrying = false;
    }
    //else { // for testing
    //    if (zHttpResp == OK) devLog("PubOK");
    //}

    if (zHttpResp == WAITING) {
        zHttpResp = TIMEOUT;
        devLog("NORESP");
        Log.info("No HTTP response, TIMEOUT");
        if (Particle.connected()) Particle.publish("HLP1log", "Upload response: TIMEOUT", PRIVATE, NO_ACK); // Errors handled in webhook response handler
    }

    // zHttpResp == ERROR is handled in the error handler

    pubRetrying = (zHttpResp != OK); // Update status for next time
    //devLog("PubEnd"); // &tst
}

We have subscriptions on "hook-response/HLP1up_" + System.deviceID() and "hook-error/HLP1up_" + System.deviceID() that should change zHttpResp to OK and ERROR respectively.
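For context, those subscriptions look roughly like this (handler names are placeholders; zHttpResp and its values come from the code above):

```cpp
// Sketch of the webhook response subscriptions (names are illustrative).
// Device OS delivers webhook results back to the device as
// hook-response/<event> and hook-error/<event> events.
void hookResponseHandler(const char *event, const char *data) {
    zHttpResp = OK;
}

void hookErrorHandler(const char *event, const char *data) {
    zHttpResp = ERROR;
}

void setup() {
    Particle.subscribe("hook-response/HLP1up_" + System.deviceID(), hookResponseHandler);
    Particle.subscribe("hook-error/HLP1up_" + System.deviceID(), hookErrorHandler);
}
```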

However, the error we keep seeing is the "Upload response: TIMEOUT"

From this I gather that we either have issues publishing the data from the module to the Particle cloud, or issues receiving the response. My money is on the former, considering the HLP1up event isn't triggered.

Restarting the devices does help, but how long it helps for varies a lot.

There are a lot of variables at play here, so it's hard to say for sure. However it looks like you're only waiting 10 seconds for the publish to occur. That's too short, because if a CoAP retry needs to occur, it can take up to 20 seconds.
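In the wait loop quoted above, that would mean extending the deadline, for example to 30 seconds to leave headroom for a CoAP retransmit (the constant is a suggestion, not a verified value):

```cpp
// Wait up to 30 s instead of 10 s so a CoAP retry (~20 s worst case)
// can still complete before we declare a timeout.
const unsigned long PUBLISH_WAIT_MS = 30000;

while ((zHttpResp == WAITING) && (millis() - lastPublish < PUBLISH_WAIT_MS) && state != BACKGROUND_PUBLISH_STOP) {
    delay(1); // yield to the rest of the system while we wait
}
```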

Ideally it would be best to remove BackgroundPublishRK and use the CloudEvent API built into Device OS 6.3.0 and later. That handles asynchronous publish without requiring a worker thread and will greatly simplify your logic which may make it easier to debug what is actually happening.
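A rough sketch of what that migration might look like; this assumes the CloudEvent interface as described in the Device OS 6.3.0 documentation, so check the exact method names against the docs for your Device OS version:

```cpp
#include "Particle.h"

// Sketch only: publish asynchronously with CloudEvent instead of a
// worker thread, then poll completion from loop().
CloudEvent uploadEvent;
bool publishInFlight = false;

void startUpload(const char *data) {
    uploadEvent.name("HLP1up");     // same event name as before
    uploadEvent.data(data);
    Particle.publish(uploadEvent);  // returns immediately, no thread needed
    publishInFlight = true;
}

void loop() {
    if (publishInFlight && uploadEvent.isComplete()) {
        publishInFlight = false;
        // inspect the event's success state here and log/retry as needed
    }
}
```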

I developed the module and was asked to take a look. Thank you for helping out with these suggestions @rickkas7

The overall pattern, emerging across an increasing number of modules on old, stable software, seems to fit with a webhook endpoint that can only take one or two uploads at a time. Is there such a limit on a heavily used webhook, or can many B524s access the same webhook simultaneously?

Load seems minimal at up to 22 uploads per minute?

Details

The modules run two software versions, a 2.5-year-old one on OS 2.2.0 and a 1-year-old one on OS 4.2.0, both without problems. Diagnostics show stable memory over time. The suggested memory-leak handling is included but not triggered.

The Particle webhook for uploading data to the Microsoft Service Bus has, over 10 days, had 1 (rarely 2) ESOCKETTIMEDOUT per day out of about 7-8 thousand requests, so that is fine.

Following the SSE feed, the upload timeout happens across different module IDs, the two software versions, countries, network operators, small/large data sizes and so on. It only happens on publishes to the upload webhook (the one with the most load). So it is not specific modules.

When the upload webhook is not responsive, the timeout can repeat e.g. 7 times in a row and then start working again (but it triggered a customer alarm). The PublishQueuePosixRK waitAfterFailure is changed from 30 s to 5 min.
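For reference, that backoff change is made in setup(); this assumes the library's withWaitAfterFailure setter, so the value style may differ in older versions like ours:

```cpp
// Sketch: lengthen the retry backoff after a failed publish from the
// library default of 30 s to 5 minutes, then run the queue as usual.
void setup() {
    PublishQueuePosix::instance().withWaitAfterFailure(5 * 60 * 1000);
    PublishQueuePosix::instance().setup();
}

void loop() {
    PublishQueuePosix::instance().loop();
}
```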

After each failed upload, the module logs this with a publish to a logging webhook to Loggly, and that never fails (seen from the expected device pattern). So the connection seems OK...

That seems like a capacity issue on your back-end.

The Particle webhook server can rate limit if the destination server experiences errors, but this shows up as skipped or sleep, and it only occurs in response to multiple failed requests: either a timeout, or an HTTP error such as a 4xx or 5xx response. Failed requests are measured per destination URL, not per webhook or per account.

There are a large number of Particle webhook servers, and more can be spun up on demand, so there isn't a fixed limit. Multiple requests that come in at the same time will be routed to multiple webhook servers and will be processed in parallel, regardless of the account or destination URL.

The timeout error occurs when the server for the destination URL does not respond within 20 seconds of the request. This timeout cannot be changed.

This is the current upload webhook status. There have been no reported errors today sending to the Service Bus.

At the same time, watching the SSE stream live, publishes to the webhook still time out about 2-7 times per hour.

The modules are all LTE; could there be a bottleneck at the network operator?

Oh, sorry, I just realized you were talking about a time out on-device, in Particle.publish, not the webhook timeout. The on-device timeout occurs because the publish does not get a CoAP ACK within 20 seconds of the publish. There's an internal retry somewhere in that time as well. That typically occurs because of bad cellular connectivity.

Ah, sorry for not being more clear.

Thanks, I will increase the device timeout to 25s and see if it solves it.

It is just strange that we have not had this happen in a visible way for more than 2 years, until May this year, when it suddenly started happening all the time.