Troubleshooting PublishQueueAsyncRK webhook failures and skips

We have started using PublishQueueAsyncRK by @rickkas7 after discovering that the cell signal at our field test site is simply terrible. We have a 32K queue in regular RAM, which allows our devices to be offline for 13+ hours. This seems to be working well and has moved our telemetry success rate from the 50% to 70% range to 100%. But we have also gone from nearly no webhook failures and skips to 100s of failures/day and 10s of skips/day.

There are 30 Electron E-Series modules sending telemetry on a 5-minute period. It is not unusual for at least one device's PublishQueueAsyncRK instance to be sending out publishes spaced 1 s apart that were queued up because of poor cell signal. That could be just a few publishes, or a hundred or more.

Our path now is: publishQueue.publish(…WITH_ACK) -> webhook -> HTTP POST API in Azure -> MySQL DB in Azure
– and it used to be –
Old path: Particle.publish(…NO_ACK) -> webhook -> HTTP POST API in Azure -> MySQL DB in Azure

We switched to PublishQueueAsyncRK in the morning on 9/1 and the history graph looks like this:

Obviously there is a NO_ACK vs. WITH_ACK difference here that is relevant. Previously we just flung the publish at the cloud and flung the data from the cloud to our DB. So using PublishQueueAsyncRK and WITH_ACK gives us the failure and skip problems. It seems that all of the failures are HTTP 409 CONFLICT responses from the API, indicating that the data is already in the DB and an attempt was made to insert it again, which violates the primary key. The error log looks like this:

The errors appear in groups of 3, roughly 30 s apart, with the same data in each. In every case, I have verified that the data is indeed in the DB, so the 409 CONFLICT is accurate. The HTTP response is:

HTTP/1.1 409 Conflict
Date: Fri, 04 Sep 2020 15:51:08 GMT
Server: Kestrel
Content-Type: application/problem+json; charset=utf-8
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked

This would seem to indicate that a webhook POST attempt prior to the group of 3 errors succeeded, but the webhook infrastructure didn't see the usual 201 CREATED response for some reason and so tried 3 more times. Since we only have a history log 10 deep and we have lots of cloud traffic, it's difficult to identify where the problem is.

So here are the questions:

  1. How do I get a longer stream of cloud messages to examine these failures? I need to determine whether this is: a firmware problem (using WITH_ACK, the success doesn't get back to PublishQueueAsyncRK, so it thinks it needs to send again), a Particle cloud problem (maybe a timeout getting a response back), or an API problem (the API chokes when traffic is heavy). Perhaps switching back to NO_ACK makes this go away and there are no more 409 CONFLICT issues, but that seems like sweeping the dirt under the rug.
  2. The docs indicate that the SKIP entries reported in the history (10s/day) are caused by too many 4XX errors, so in our case too many 409 CONFLICT errors. Is it safe to assume that if question 1 is solved, the SKIPs also magically go away? Presumably if there are no 409s there should be no SKIPs.

If at all possible, avoid returning a 4xx or 5xx error for things that aren’t actually errors. For example, if you have a non-fatal conflict, you’re better off returning a 200 and including a separate field within the response body that indicates that the data was already uploaded.

The reason is that all 4xx and 5xx conditions are considered to be errors, and if you get a sufficiently large number of errors, the webhook system will throttle calls to that hostname because of excess errors. This is designed to prevent overloading servers by making excessive webhook calls. The skipped count indicates that the webhook request was not made because of excessive errors and the event was discarded.

So, yes, eliminating the 409 conflict errors will solve the skipped errors.
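A minimal sketch of that suggestion, using Python and SQLite as stand-ins for the actual ASP.NET/MySQL API (the table and field names here are invented): the insert is attempted as usual, but a primary-key conflict is reported back as a 200 with a duplicate flag rather than a 409, so webhook retries never count as errors.

```python
import sqlite3

# In-memory DB standing in for the Azure MySQL database (illustration only).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE telemetry (device_id TEXT, published_at TEXT, payload TEXT, "
           "PRIMARY KEY (device_id, published_at))")

def handle_post(device_id, published_at, payload):
    """Insert a telemetry row; report a duplicate as 200, not 409."""
    try:
        with db:  # commits on success, rolls back on exception
            db.execute("INSERT INTO telemetry VALUES (?, ?, ?)",
                       (device_id, published_at, payload))
        return 201, {"duplicate": False}
    except sqlite3.IntegrityError:
        # Row already exists: this is a webhook retry. Returning 2xx keeps
        # the Particle cloud from counting it as an error and throttling.
        return 200, {"duplicate": True}

print(handle_post("device-01", "2020-09-04T15:51:08Z", "{}"))  # → (201, {'duplicate': False})
print(handle_post("device-01", "2020-09-04T15:51:08Z", "{}"))  # → (200, {'duplicate': True})
```

The same idea applies whatever the server stack is: catch the key violation, treat it as "already uploaded", and reserve 4xx/5xx for genuine failures.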

There is no way to view more than 10 webhook calls back, as we don't save them. The best way to keep large numbers of events is to subscribe to the Server-Sent Events stream on a server you control and save the events to a database.
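For reference, a sketch of the parsing side of that approach. The Particle event stream delivers each event as an `event:` line naming it plus a `data:` line of JSON, terminated by a blank line; the HTTP streaming and database-save steps are omitted here, and the sample payload is hypothetical.

```python
import json

def parse_sse(lines):
    """Parse Server-Sent Events lines into (event_name, data_dict) tuples."""
    events, name, data = [], None, []
    for line in lines:
        if line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "" and name is not None:
            # A blank line terminates an event; the data field is JSON.
            events.append((name, json.loads("\n".join(data))))
            name, data = [], None  # reset for the next event
            name, data = None, []
    return events

# Hypothetical fragment in the shape the Particle event stream uses.
stream = [
    'event: telemetry',
    'data: {"data":"23.5","ttl":60,"published_at":"2020-09-04T15:51:08.000Z","coreid":"e00fce68xxxx"}',
    '',
]
print(parse_sse(stream))
```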

Ok, thanks, that makes sense. We’ll change our API code to morph the HTTP 409 into an HTTP 2XX. Or I guess alternatively we can go with NO_ACK for a similar result.

We are also thinking, after observing over the long weekend, that it is most likely the original HTTP 201 CREATED never makes its way back to the Electron, and so the PublishQueueAsyncRK library is the source of the additional tries. For instance, the Electron does the publish and it actually makes it to the cloud, but it loses the cell/cloud connection before the HTTP response makes it back. Perhaps we could verify this with some certainty if changing the firmware to NO_ACK also eliminates the failures and skips.
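That failure mode, where the publish succeeds but the ACK is lost so the library resends, is the classic at-least-once delivery pattern. A toy simulation (all names invented, not the library's actual API) of why the receiving end then sees duplicates:

```python
def send_with_ack(event, deliver, max_tries=4):
    """Resend until an ACK comes back, like WITH_ACK plus queue retries."""
    for _ in range(max_tries):
        if deliver(event):          # True means the ACK made it back to the device
            return
    # in the real library the event would stay queued; dropped here for brevity

received = []

def flaky_deliver(event):
    received.append(event)          # the publish itself always reaches the cloud...
    return len(received) >= 3       # ...but the first two ACKs are lost in transit

send_with_ack("temp=23.5", flaky_deliver)
print(received)                     # the server saw the same event 3 times
```

The only cures are either dropping the ACK (at-most-once, which loses data instead) or making the server idempotent so duplicates are harmless.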

I can confirm this. In a recent trial, >3% of the webhook responses did not make it back to the Argon under excellent network conditions (Device OS 1.5.3). This was with a 5-second timeout waiting for the response on the device and no lost connections.

Checking a couple of the incidents live in the Particle console and also looking through tracked SSEs, the Particle cloud received the response promptly but did not fire the response back to the device.

So it was clear that we cannot rely on the response always being fired back to the device even under ideal conditions, and the end service has to be OK with duplicates from PublishQueueAsyncRK without responding with an error.

Good to know - thanks for your confirmation. And in our case, we have the added factor of seriously degraded cellular connections.