Upload response: TIMEOUT - B524

I have an issue where I get these "Upload response: TIMEOUT" errors, and it is affecting a lot of our B524 modules.

I get more than 200 upload response timeouts every 12 hours. We have upgraded some of our modules to the newest OS, but this does not help; those modules also get the fault.

This issue started maybe 2 months ago; before that we had no issues. It looks like it started as we got more B524 modules into our fleet.

Every time we get this issue, the module does not upload and we do not get any data, but if I then "RESET" the module, it uploads all data instantly and runs for a while before the problem returns.

How does the upload process work? By publishing events or using some proprietary system?

In any case, the symptoms you've described are common if you have a memory leak. Over time, less memory, or more fragmented memory, is available, eventually leading to a situation where a required block of memory can't be allocated. If you have frequent enough device vitals you might be able to see declining free memory over time, but this won't detect fragmented memory.

Aside from fixing the leak, adding an out of memory handler is the best way to work around the problem. Another solution is periodically calling System.reset() on the device, perhaps once a week.
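
A minimal sketch of both workarounds (the out_of_memory system event handler documented in Device OS, plus a periodic reset; the one-week interval and reset-on-flag policy are example values, not a recommendation for your product):

#include "Particle.h"

// Set from the out_of_memory system event handler; checked from loop().
static volatile bool outOfMemoryFlag = false;

void outOfMemoryHandler(system_event_t event, int param) {
    outOfMemoryFlag = true;
}

void setup() {
    System.on(out_of_memory, outOfMemoryHandler);
}

void loop() {
    // Reset once an allocation has failed (flag is set by the event handler).
    if (outOfMemoryFlag) {
        System.reset();
    }

    // Fallback: reset roughly once a week regardless.
    if (millis() > 7 * 24 * 60 * 60 * 1000UL) {
        System.reset();
    }
}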

The upload process goes from the Particle API to Microsoft Azure Service Bus, through a measurement logger, and then into our SQL database.

We have roughly 200 B524s, and the problem is affecting all of them in random order. So calling a reset once a week will not help, as we get 200 timeouts every 12 hours, with incoming data being lost.

Where is the timeout reported? Is it a webhook timeout?

The receiver of the webhook must respond with an HTTP status code within 20 seconds or the request will fail with a timeout error.

If that's occurring, you need to add more servers, or add jitter to the time your reports are uploaded so the devices don't all send at the same time. There is no way to increase the webhook timeout.
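
A simple way to add jitter on the device side is to randomize the period between uploads (a sketch; the 15-minute base period and the 60-second jitter window are arbitrary values for illustration):

unsigned long lastUploadMs = 0;
unsigned long uploadIntervalMs = 15 * 60 * 1000UL;

void loop() {
    if (millis() - lastUploadMs >= uploadIntervalMs) {
        lastUploadMs = millis();

        // ... queue the upload here ...

        // Randomize the next period by 0-60 seconds so devices drift apart
        // instead of all reporting at the same wall-clock time.
        uploadIntervalMs = 15 * 60 * 1000UL + random(60000);
    }
}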

I get this "ESOCKETTIMEDOUT"

I think it is a webhook timeout; it is just reported in the console event log. We have over 200 B524 units in the field, and some of them transmit data every 15 minutes. All the response I get is the text "Upload Response: Timeout", but I guess it is because they do not get the status code within 20 seconds, as you describe.

Is the webhook not designed to handle all of these units? We have other units transmitting the same way, as described earlier, and they do not have this issue, but of course they do not use the Particle webhook.

Socket timeout indicates the recipient of the webhook did not respond within 20 seconds.

The Particle webhook service can handle thousands of devices; the problem is that it expects a response within 20 seconds or it assumes that the destination server is no longer up. If this happens repeatedly, requests to the server can be throttled to prevent overloading the server.

The main error is not the few "ESOCKETTIMEDOUT" we get. We get a timeout running PublishQueuePosix::instance().publish

We are running modified versions of your BackgroundPublishRK (library for doing a Particle publish from a worker thread) and PublishQueuePosixRK (version of the publish queue that stores events on the POSIX flash file system).
They are older versions (PublishQueuePosixRK pre 0.0.2).

The timeout happens inside the "BackgroundPublish::thread_f()" method

We have this code:

if (strncmp(event_name, "HLP1up", 6) == 0) {

    // then wait for publish to complete
    while (/*!ok.isDone()*/ (zHttpResp == WAITING) && (millis() - lastPublish < 10000) && state != BACKGROUND_PUBLISH_STOP)
    {
        // yield to rest of system while we wait
        delay(1);
    }

    if (completed_cb)
    {
        completed_cb(/*ok.isSucceeded()*/ (zHttpResp == OK),
            event_name,
            event_data,
            event_context);
    }

    if (zHttpResp == OK && pubRetrying) {
        snprintf(dLbufs, 6 + 1, "^%s", event_data + 17);
        devLog(dLbufs);
        pubRetrying = false;
    }
    //else { // for testing
    //    if (zHttpResp == OK) devLog("PubOK");
    //}

    if (zHttpResp == WAITING) {
        zHttpResp = TIMEOUT;
        devLog("NORESP");
        Log.info("No HTTP response, TIMEOUT");
        if (Particle.connected()) Particle.publish("HLP1log", "Upload response: TIMEOUT", PRIVATE, NO_ACK); // Errors handled in webhook response handler
    }

    // zHttpResp == ERROR is handled in the error handler

    pubRetrying = (zHttpResp != OK); // Update status for next time
    //devLog("PubEnd"); // &tst
}

We have subscriptions on "hook-response/HLP1up_" + System.deviceID() and "hook-error/HLP1up_" + System.deviceID() that should change zHttpResp to OK and ERROR respectively.
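
The subscription side is roughly of this shape (a simplified sketch of the pattern rather than our exact code; handler names are illustrative):

void hookResponseHandler(const char *event, const char *data) {
    zHttpResp = OK;
}

void hookErrorHandler(const char *event, const char *data) {
    zHttpResp = ERROR;
}

void setup() {
    Particle.subscribe(String("hook-response/HLP1up_") + System.deviceID(), hookResponseHandler);
    Particle.subscribe(String("hook-error/HLP1up_") + System.deviceID(), hookErrorHandler);
}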

However, the error we keep seeing is the "Upload response: TIMEOUT"

From this I gather that we either have issues publishing the data from the module to the Particle Cloud, or issues receiving the response. My money is on the former, considering the HLP1up event isn't triggered.

Restarting the devices does help, but it varies a lot for how long.

There are a lot of variables at play here, so it's hard to say for sure. However, it looks like you're only waiting 10 seconds for the publish to complete. That's too short, because if a CoAP retry needs to occur, it can take up to 20 seconds.
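
For example, the 10000 in your wait loop could become something above the retry window (the 25000 ms here is just an example value, 20 s plus some margin):

// Allow for a CoAP retry (up to ~20 s) plus margin before declaring TIMEOUT.
const unsigned long PUBLISH_WAIT_MS = 25000;

while ((zHttpResp == WAITING) && (millis() - lastPublish < PUBLISH_WAIT_MS)
        && state != BACKGROUND_PUBLISH_STOP) {
    delay(1); // yield to the rest of the system while we wait
}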

Ideally, it would be best to remove BackgroundPublishRK and use the CloudEvent API built into Device OS 6.3.0 and later. It handles asynchronous publishing without requiring a worker thread and will greatly simplify your logic, which may make it easier to debug what is actually happening.

I developed the module and was asked to take a look. Thank you for helping out with these suggestions @rickkas7

The overall pattern, emerging across an increasing number of modules on old, stable software, seems to fit a webhook endpoint that can only take one or two uploads at a time. Is there such a limit on a heavily used webhook, or can many B524s access the same webhook simultaneously?

The load seems minimal, at up to 22 per minute?

Details

The modules run two software versions, a 2.5-year-old one on OS 2.2.0 and a 1-year-old one on OS 4.2.0, without problems. Diagnostics show stable memory over time. The suggested memory leak handling is included but not triggered.

In 10 days, the Particle webhook for uploading data to the Microsoft Service Bus had 1, and rarely 2, ESOCKETTIMEDOUT per day out of about 7-8 thousand requests, so that is fine.

Following the SSE feed, the upload timeout happens across different module IDs, the two software versions, countries, network operators, small/large data sizes and so on. It only happens on publishes to the upload webhook (the one with the most load). So it is not specific modules.

When the upload webhook is not responsive, it can repeat e.g. 7 times in a row and then start working again (but it triggered a customer alarm). The PublishQueuePosixRK waitAfterFailure is changed from 30 s to 5 min.

After each failed upload, the module logs this with a publish to a logging webhook to Loggly, and that never fails (seen from the expected device pattern). So the connection seems OK ...

That seems like a capacity issue on your back-end.

The Particle webhook server can rate limit if the destination server experiences errors, but this shows up as skipped or sleep, and it only occurs in response to multiple failed requests, either a timeout or an HTTP error such as a 4xx or 5xx response. The failed requests are measured per destination URL, not per webhook or per account.

There are a large number of Particle webhook servers, and more can be spun up on demand, so there isn't a fixed limit. Multiple requests that come in at the same time will be routed to multiple webhook servers and will be processed in parallel, regardless of the account or destination URL.

The timeout error occurs when the server for the destination URL does not respond within 20 seconds of the request. This timeout cannot be changed.

This is the current upload webhook status. There have been no reported errors today sending to the Service Bus.

At the same time, publishes to the webhook still time out about 2-7 times per hour, watching the SSE stream live.

The modules are all LTE; could there be a bottleneck at the network operator?

Oh, sorry, I just realized you were talking about an on-device timeout, in Particle.publish, not the webhook timeout. The on-device timeout occurs because the publish does not get a CoAP ACK within 20 seconds of the publish. There's an internal retry somewhere in that time as well. That typically occurs because of bad cellular connectivity.
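
One way to separate the two cases on-device is to check the publish future itself, the pattern your commented-out ok.isDone()/ok.isSucceeded() calls were using: if the future fails, the event never got a CoAP ACK; if it succeeds but zHttpResp stays WAITING, the publish reached the cloud but the webhook reply never came back. A sketch (the 30-second wait is an arbitrary example value):

// Sketch: distinguish "publish never ACKed" from "webhook reply not received".
auto ok = Particle.publish(event_name, event_data, PRIVATE | WITH_ACK);
unsigned long startMs = millis();

while (!ok.isDone() && (millis() - startMs < 30000) && state != BACKGROUND_PUBLISH_STOP) {
    delay(1); // yield while the publish is in flight
}

if (!ok.isSucceeded()) {
    Log.info("publish itself failed (no CoAP ACK)");    // connectivity problem
}
else if (zHttpResp == WAITING) {
    Log.info("publish ACKed but no webhook reply yet"); // cloud/webhook side
}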

Ah, sorry for not being more clear.

Thanks, I will increase the device timeout to 25s and see if it solves it.

It's just strange that we have not had this happen in a visible way for more than 2 years, until May this year, when it suddenly started happening all the time.

Sorry for the long post. The increased timeout does not have an effect.

The significant increase in TIMEOUT reports from devices happens when the upload webhook's response from the Service Bus, which is forwarded to the device to confirm the upload, is not received by the device.

An example of what it should look like. The timestamps are not a perfect match since they come from different services.

First is our on-device log and below the SSE feed. "X" marks the OK ack back to the device.

 84K -78  i15  ^0  
071618:33:44 >RESP
071618:33:50 >RESP 
071618:33:56 >RESP 
071618:33:58 >RESP  <<< 4/4
 84K -78  i15  ^0  

2025-07-16T18:34:16.804   e00fce68e15685ca4a338b64 HLP1up "data":"{\"T\":\"2025-07-16T18:33:42Z\",\"R\":-78,\"I\":15,\"D\":\"{...
2025-07-16T18:34:16.906   particle-internal        hook-sent/HLP1up 
2025-07-16T18:34:17.325   particle-internal        hook-response/HLP1up_e00fce68e15685ca4a338b64/0 X
2025-07-16T18:34:18.864   e00fce68e15685ca4a338b64 HLP1up "data":"{\"T\":\"2025-07-16T18:33:42Z\",\"R\":-78,\"I\":15,\"D\":\"{...
2025-07-16T18:34:18.937   particle-internal        hook-sent/HLP1up 
2025-07-16T18:34:19.371   particle-internal        hook-response/HLP1up_e00fce68e15685ca4a338b64/0 X
2025-07-16T18:34:24.926   e00fce68e15685ca4a338b64 HLP1up "data":"{\"T\":\"2025-07-16T18:33:42Z\",\"R\":-78,\"I\":15,\"D\":\"{...
2025-07-16T18:34:24.991   particle-internal        hook-sent/HLP1up 
2025-07-16T18:34:25.320   particle-internal        hook-response/HLP1up_e00fce68e15685ca4a338b64/0 X
2025-07-16T18:34:30.699   e00fce68e15685ca4a338b64 HLP1up "data":"{\"T\":\"2025-07-16T18:33:42Z\",\"R\":-78,\"I\":15,\"D\":\"{...
2025-07-16T18:34:30.766   particle-internal        hook-sent/HLP1up 
2025-07-16T18:34:31.159   particle-internal        hook-response/HLP1up_e00fce68e15685ca4a338b64/0 X

Here is an example where the webhook reply is not received, resulting in a TIMEOUT log entry.
I have included the connection log to show that the signal strength is fairly OK.

071618:48:44 >RESP 
071618:48:46 >RESP 
071618:48:48 >RESP 
071618:49:14 NORESP <<< results in a timeout log entry
071618:54:15 >RESP 
071618:54:15 ^18:48
 84K -78  i15  ^0  


2025-07-16T18:49:16.817  e00fce68e15685ca4a338b64 HLP1up "data":"{\"T\":\"2025-07-16T18:48:43Z\",\"R\":-78,\"I\":15,\"D\":\"{...
2025-07-16T18:49:16.925  particle-internal        hook-sent/HLP1up 
2025-07-16T18:49:17.369  particle-internal        hook-response/HLP1up_e00fce68e15685ca4a338b64/0 X
2025-07-16T18:49:18.895  e00fce68e15685ca4a338b64 HLP1up "data":"{\"T\":\"2025-07-16T18:48:43Z\",\"R\":-78,\"I\":15,\"D\":\"{...
2025-07-16T18:49:19.045  particle-internal        hook-sent/HLP1up 
2025-07-16T18:49:19.397  particle-internal        hook-response/HLP1up_e00fce68e15685ca4a338b64/0 X
2025-07-16T18:49:20.885  e00fce68e15685ca4a338b64 HLP1up "data":"{\"T\":\"2025-07-16T18:48:43Z\",\"R\":-78,\"I\":15,\"D\":\"{...
2025-07-16T18:49:20.991  particle-internal        hook-sent/HLP1up 
2025-07-16T18:49:21.329  particle-internal        hook-response/HLP1up_e00fce68e15685ca4a338b64/0 X
[...]
2025-07-16T18:49:48.114  e00fce68e15685ca4a338b64 HHH1log Upload response: TIMEOUT
[...]
2025-07-16T18:54:47.962  e00fce68e15685ca4a338b64 HLP1up "data":"{\"T\":\"2025-07-16T18:48:43Z\",\"R\":-78,\"I\":15,\"D\":\"{...
2025-07-16T18:54:48.073  particle-internal        hook-sent/HLP1up 
2025-07-16T18:54:48.491  particle-internal        hook-response/HLP1up_e00fce68e15685ca4a338b64/0 X


2025-07-17T06:05:02.433Z,LTE,H3G,238,06,50202,58131056,LTE,83.33,%,-80,dBm,RSRP,37.5,%,-13,dB,RSRQ,connected,0,0,2,unknown,connected,0,1,0,none,26,14,0,387,0,-210,unknown,unknown,43248,81876,166300,Open,ok,32428,148
2025-07-17T00:05:02.464Z,LTE,H3G,238,06,50202,58131006,LTE,57.5,%,-92,dBm,RSRP,62.5,%,-9,dB,RSRQ,connected,0,0,2,unknown,connected,0,1,0,none,25,14,0,5896,0,-210,unknown,unknown,21648,81876,166300,Open,ok,10828,52
2025-07-16T18:05:02.389Z,LTE,H3G,238,06,50202,58130976,LTE,40,%,-99,dBm,RSRP,50,%,-10,dB,RSRQ,connected,0,0,1,unknown,connected,0,1,0,none,15,13,0,36833,0,-210,unknown,unknown,48,82004,166300,Open,ok,37,2,725
2025-07-16T17:16:26.632Z,LTE,H3G,238,06,50202,58131006,LTE,57.5,%,-92,dBm,RSRP,62.5,%,-9,dB,RSRQ,connected,0,0,1,unknown,connected,0,1,0,none,7,3,0,5836,0,-210,unknown,unknown,11,81916,166316,Open,ok,5,2,
2025-07-16T13:47:56.777Z,LTE,H3G,238,06,50202,58130996,LTE,52.5,%,-94,dBm,RSRP,37.5,%,-13,dB,RSRQ,connected,0,0,1,unknown,connected,0,1,0,none,7,0,0,273,0,-210,unknown,unknown,5,81844,166316,Open,ok,5,6,

We were lucky to catch this on an office device that uploads 4 chunks every 15 minutes, showing this behaviour on average every 3 batches of uploads for a few hours. It happens at any of the four chunks in a batch.

It also happens on devices with just one small upload once an hour. The phenomenon spans device IDs, network operators, countries, time of day, OS 2.2.0, 4.2.0 and 6.2.1, and 2-year and 1-year-old device code. It naturally happens more frequently on mobile units, but also randomly across stationary units with excellent signal over time.

The excellent included libraries PublishQueuePosixRK and BackgroundPublishRK may have covered this for some time. But since early May there is a tendency for some units to stop receiving the webhook reply for minutes up to hours, resulting in customer alarms. Yet the connection is often working well, as the device's publishing of the TIMEOUT log goes through.

A working but heavy work-around has been for our server to reset a device when it reports TIMEOUT, preventing customer issues. (This includes a reset of the cloud session.) I am currently testing an updated device-side work-around, a self reset that includes the cloud session, as a lighter and more precise alternative.

The consequences of this have been experienced since early May. I remember we experienced and reported a more significant lack of webhook replies on the platform maybe 4? years ago, and some time after it was suddenly fixed.

This time it happens a varying number of hundreds of times out of around 8 thousand uploads per 24 h, but enough to disrupt the service before the work-arounds.

So you're using an end-to-end confirmation via subscription. That's a completely different problem.

Subscriptions can be lost because of poor connectivity, and there is no guaranteed delivery. There are other situations which probably are not happening in your case: too many events for the device to process, so they are discarded, or the cloud has lost the ability to send to the device due to a mobile network or keep-alive issue, though that rarely happens immediately after a publish.

The lack of guaranteed delivery is one reason why we recommend handling the acknowledgement in your back-end server by making a Particle function call to the device instead of using publish and subscribe. The Particle Cloud API can detect a failed function call and retry it, unlike webhook replies.
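
The device side of that pattern is small (a sketch; the function name and what it clears are hypothetical, and the back-end would call the function through the Cloud API, POST /v1/devices/{deviceID}/uploadAck, after the data has been safely stored):

// Hypothetical acknowledgement function called by the back-end once the
// upload has been persisted. The Cloud API reports whether the call reached
// the device, so the back-end can retry it if it fails.
int uploadAckHandler(String arg) {
    // arg could identify the confirmed upload (sequence number, timestamp, ...)
    zHttpResp = OK;  // or clear whatever "waiting for confirmation" state you keep
    return 0;
}

void setup() {
    Particle.function("uploadAck", uploadAckHandler);
}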

In lieu of that, setting up a system that does a retransmission from the device side when it does not receive the ACK may make sense. This will still require work to prevent duplicates, however.
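
One way to keep the duplicate problem manageable is to tag each upload with a per-device sequence number so your back-end can discard repeats (a sketch; the field names, buffer size, and helper name are illustrative, not a recommended payload format):

static uint32_t uploadSeq = 0; // persist in retained memory or on the file system in practice

void queueUpload(const char *payloadJson) {
    char buf[1024];
    // Wrap the payload with a sequence number the back-end can de-duplicate on.
    snprintf(buf, sizeof(buf), "{\"seq\":%lu,\"data\":%s}",
        (unsigned long)uploadSeq++, payloadJson);
    PublishQueuePosix::instance().publish("HLP1up", buf, PRIVATE);
}

On the back-end, an upload whose seq has already been stored is acknowledged again but not inserted into the database a second time.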

Another thing going forward will be using extended publish, which allows events up to 16 Kbytes, so you won't need to send in chunks. Also, the CloudEvent API is fully asynchronous, so you can eliminate the background publish library, which will reduce complexity.

Thank you for taking a look.

Yes, in order not to lose customer data, by design we must use end-to-end confirmation, as it is clearly needed. Each device only uploads one set of data at a time, waiting for confirmation each time, so sequencing is built in, and there is no other device communication going on during this.

Thank you for the suggestions going forward.

Since we upload to a Microsoft Service Bus, confirming via device function calls sounds less practical, as it ties together parts of the system that were decoupled on purpose. Extended publish sounds promising once it has matured into LTS releases, as that could lessen the load on the low-capacity webhook.

It sounds like, while webhooks scale without limit towards other services, they can be quite limited/lossy towards devices.

Yesterday there were 10,464 uploads and 344 timeouts (3.3%) where webhook replies were not received. This post is to communicate that this is nothing like what we had experienced for years before, in the hope that it might be considered at some point, hopefully in the near future.