Troubleshooting PublishQueueAsyncRK webhook failures and skips

We have started using PublishQueueAsyncRK by @rickkas7 after discovering that the cell signal at our field test site is simply terrible. We have a 32K queue in regular RAM, which allows our devices to be offline for 13+ hours. This seems to be working well and has moved our telemetry success rate from the 50% to 70% range to 100%. But we have also gone from nearly no webhook failures and skips to 100s of failures/day and 10s of skips/day.

There are 30 Electron E-Series modules sending telemetry on a 5-minute period. It is not unusual for at least one device's PublishQueueAsyncRK instance to be sending out publishes spaced 1 s apart that were queued up because of poor cell signal. That could be just a few publishes, or a hundred or more.

Our path now is: publishQueue.publish(…WITH_ACK) -> webhook -> HTTP POST API in Azure -> MySQL DB in Azure
– and it used to be –
Old path: Particle.publish(…NO_ACK) -> webhook -> HTTP POST API in Azure -> MySQL DB in Azure

We switched to PublishQueueAsyncRK in the morning on 9/1 and the history graph looks like this:

Obviously there is a NO_ACK vs. WITH_ACK difference here that is relevant. Previously we just flung the publish at the cloud and flung the data from the cloud to our DB. So using PublishQueueAsyncRK and WITH_ACK gives us the failure and skip problems. It seems that all of the failures are HTTP 409 CONFLICT responses from the API, indicating that the data is already in the DB and an attempt was made to insert it again, which violates the primary key. The error log looks like this:

The errors appear in groups of 3, roughly 30 s apart, with the same data in each. In every case, I have verified that the data is indeed in the DB, so the 409 CONFLICT is accurate. The HTTP response is:

HTTP/1.1 409 Conflict
Date: Fri, 04 Sep 2020 15:51:08 GMT
Server: Kestrel
Content-Type: application/problem+json; charset=utf-8
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked

This would seem to indicate that a webhook POST attempt prior to the group of 3 errors succeeded, but the webhook infrastructure didn't see the usual 201 CREATED response for some reason and so tried 3 more times. Since we only have a history log 10 deep and we have lots of cloud traffic, it's difficult to identify where the problem is.

So here are the questions:

  1. How do I get a longer stream of cloud messages to examine these failures? I need to determine whether this is: a firmware problem (using WITH_ACK, the success doesn't get back to PublishQueueAsyncRK, so it thinks it needs to send again), a Particle cloud problem (maybe a timeout getting a response back), or an API problem (the API chokes when traffic is heavy). Perhaps switching back to NO_ACK makes this go away and there are no more 409 CONFLICT issues, but that seems like sweeping the dirt under the rug.
  2. The docs indicate that the SKIP entries reported in the history (10s/day) are caused by too many 4XX errors, so in our case too many 409 CONFLICT errors. Is it safe to assume that if question 1 is solved, the SKIPs also magically go away? Presumably if there are no 409s there should be no SKIPs.

If at all possible, avoid returning a 4xx or 5xx error for things that aren’t actually errors. For example, if you have a non-fatal conflict, you’re better off returning a 200 and including a separate field within the response body that indicates that the data was already uploaded.

The reason is that all 4xx and 5xx conditions are considered to be errors, and if you get a sufficiently large number of errors, the webhook system will throttle calls to that hostname because of excess errors. This is designed to prevent overloading servers by making excessive webhook calls. The skipped count indicates that the webhook request was not made because of excessive errors and the event was discarded.

So, yes, eliminating the 409 conflict errors will solve the skipped errors.
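A minimal sketch of that suggestion, using Python and SQLite as stand-ins for the actual ASP.NET/MySQL API (the table and field names here are invented): the insert is attempted as usual, but a primary-key conflict is reported back as a 200 with a duplicate flag rather than a 409, so webhook retries never count as errors.

```python
import sqlite3

# In-memory DB standing in for the Azure MySQL database (illustration only).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE telemetry (device_id TEXT, published_at TEXT, payload TEXT, "
           "PRIMARY KEY (device_id, published_at))")

def handle_post(device_id, published_at, payload):
    """Insert a telemetry row; report a duplicate as 200, not 409."""
    try:
        with db:  # commits on success, rolls back on exception
            db.execute("INSERT INTO telemetry VALUES (?, ?, ?)",
                       (device_id, published_at, payload))
        return 201, {"duplicate": False}
    except sqlite3.IntegrityError:
        # Row already exists: this is a webhook retry. Returning 2xx keeps
        # the Particle cloud from counting it as an error and throttling.
        return 200, {"duplicate": True}

print(handle_post("device-01", "2020-09-04T15:51:08Z", "{}"))  # → (201, {'duplicate': False})
print(handle_post("device-01", "2020-09-04T15:51:08Z", "{}"))  # → (200, {'duplicate': True})
```

The same idea applies whatever the server stack is: catch the key violation, treat it as "already uploaded", and reserve 4xx/5xx for genuine failures.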

There is no way to view more than 10 webhook calls back, as we don't save them. The best way to keep large numbers of events is to subscribe to the Server-Sent Events stream on a server you control and save the events to a database.
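For reference, a sketch of the parsing side of that approach. The Particle event stream delivers each event as an `event:` line naming it plus a `data:` line of JSON, terminated by a blank line; the HTTP streaming and database-save steps are omitted here, and the sample payload is hypothetical.

```python
import json

def parse_sse(lines):
    """Parse Server-Sent Events lines into (event_name, data_dict) tuples."""
    events, name, data = [], None, []
    for line in lines:
        if line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "" and name is not None:
            # A blank line terminates an event; the data field is JSON.
            events.append((name, json.loads("\n".join(data))))
            name, data = [], None  # reset for the next event
            name, data = None, []
    return events

# Hypothetical fragment in the shape the Particle event stream uses.
stream = [
    'event: telemetry',
    'data: {"data":"23.5","ttl":60,"published_at":"2020-09-04T15:51:08.000Z","coreid":"e00fce68xxxx"}',
    '',
]
print(parse_sse(stream))
```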

Ok, thanks, that makes sense. We’ll change our API code to morph the HTTP 409 into an HTTP 2XX. Or I guess alternatively we can go with NO_ACK for a similar result.

We are also thinking, after observing over the long weekend, that it is most likely the original HTTP 201 CREATED never makes its way back to the Electron, and so the PublishQueueAsyncRK library is the source of the additional tries. For instance, the Electron does the publish and it actually makes it to the cloud, but it loses the cell/cloud connection before the HTTP response makes it back. Perhaps we could verify this with some certainty if changing the firmware to NO_ACK also eliminates the failures and skips.
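That failure mode, where the publish succeeds but the ACK is lost so the library resends, is the classic at-least-once delivery pattern. A toy simulation (all names invented, not the library's actual API) of why the receiving end then sees duplicates:

```python
def send_with_ack(event, deliver, max_tries=4):
    """Resend until an ACK comes back, like WITH_ACK plus queue retries."""
    for _ in range(max_tries):
        if deliver(event):          # True means the ACK made it back to the device
            return
    # in the real library the event would stay queued; dropped here for brevity

received = []

def flaky_deliver(event):
    received.append(event)          # the publish itself always reaches the cloud...
    return len(received) >= 3       # ...but the first two ACKs are lost in transit

send_with_ack("temp=23.5", flaky_deliver)
print(received)                     # the server saw the same event 3 times
```

The only cures are either dropping the ACK (at-most-once, which loses data instead) or making the server idempotent so duplicates are harmless.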

I can confirm this. In a recent trial, >3% of the webhook responses did not make it back to the Argon under excellent network conditions (Device OS 1.5.3). This was with a 5-second timeout waiting for the response on the device and no lost connections.

Checking a couple of the incidents live in the Particle console and also looking through tracked SSEs, the Particle cloud received the response promptly but did not fire the response back to the device.

So it was clear that we cannot rely on the response always being fired back to the device even under ideal conditions, and the end service has to be OK with duplicates from PublishQueueAsyncRK without responding with an error.

Good to know - thanks for your confirmation. And in our case, we have the added factor of seriously degraded cellular connections.