PublishQueuePosixRK with confirm from end service

thrmttnw · November 9, 2021, 5:01pm

I want to use PublishQueuePosixRK, but whereas PublishQueuePosixRK deletes an event on acknowledge, I have a requirement to delete uploads only on confirmation back from the end service.

It seems I need to modify these lines, starting at line 83 of BackgroundPublishRK/BackgroundPublishRK.cpp (main branch of rickkas7/BackgroundPublishRK on GitHub):

        auto ok = Particle.publish(event_name, event_data, event_flags);

        // then wait for publish to complete
        while(!ok.isDone() && state != BACKGROUND_PUBLISH_STOP)
        {
            // yield to rest of system while we wait
            delay(1);
        }

The goal is to report OK only when a confirmation comes back through a subscription to the end service reply, like:

    Particle.subscribe(String("hook-response/upLoad_" + System.deviceID()), uploadResponseHandler, MY_DEVICES);
    Particle.subscribe(String("hook-error/upLoad_" + System.deviceID()), uploadErrorHandler, MY_DEVICES);

As BackgroundPublishRK runs in a separate thread, can I call Particle.subscribe in the main app thread and, in its handler, modify a variable that is checked in the BackgroundPublishRK thread above? Does that variable need to be declared or protected in a special way?

rickkas7 · November 9, 2021, 6:49pm

You’ll probably want to make sure you have a unique identifier in your event payload. Because events can be out-of-order, delayed, or duplicated, you need to make sure the ACK you get back is for the event you think it is. An integer serial number that increases on publish is probably sufficient.

I don’t think you’ll have any special threading issues because you’re probably only passing an integer or boolean between threads. Those are basically atomic in this scenario.
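A minimal sketch of the serial-number idea, written as plain C++ so it can be compiled off-device. The class name `AckTracker` and its methods are hypothetical and not part of PublishQueuePosixRK or BackgroundPublishRK. `std::atomic` is used here as a conservative choice for the values shared between threads; the sketch does not settle whether it is strictly required on the device.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: remembers which serial number the publishing
// thread is currently waiting on, and whether it has been confirmed.
class AckTracker {
public:
    // Called by the publishing thread just before it sends an event.
    // Returns the serial number to embed in the event payload.
    std::uint32_t beginPublish() {
        std::uint32_t serial = ++nextSerial_;
        // Store the new serial first, so a late reply for the previous
        // event can no longer match, then clear the confirmed flag.
        pendingSerial_.store(serial);
        confirmed_.store(false);
        return serial;
    }

    // Called from the subscription handler with the serial echoed back
    // by the end service. Stale, duplicated, or out-of-order replies
    // carry a different serial and are ignored.
    void onResponse(std::uint32_t echoedSerial) {
        if (echoedSerial == pendingSerial_.load()) {
            confirmed_.store(true);
        }
    }

    // Polled by the publishing thread while it waits.
    bool isConfirmed() const { return confirmed_.load(); }

private:
    std::uint32_t nextSerial_ = 0;  // only touched by the publishing thread
    std::atomic<std::uint32_t> pendingSerial_{0};
    std::atomic<bool> confirmed_{false};
};
```

The publishing thread would call `beginPublish()`, put the returned number in the payload, and then poll `isConfirmed()` alongside its existing wait loop. This assumes the end service can echo the number back in its response.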

thrmttnw · November 12, 2021, 2:12pm

Thanks, good point. Just started testing this in its original form. Looks neat!

I will probably try to change the 30 sec retry to 5 minutes, now that data operations count a lot, and a single outage can set off a storm of updates.

thrmttnw · November 22, 2021, 2:50pm

It seems to be running really well so far, and I am wondering about two things.

1. Does PublishQueuePosix::clearQueues() clear the actual files?
2. Maximum number of files: according to AN035 File System | Datasheets | Particle, the 2 MB filesystem should be used with at most 200 files, each with 1 sector for metadata plus 1 sector for content, and each sector is 4096 bytes. So as long as each publish is less than 4 KB of data, can I also get away with 200 files?

thrmttnw · November 24, 2021, 10:31am

On point 1: I found the answer in rickkas7/SequentialFileRK on GitHub (a library for managing sequentially numbered files on the flash file system on Particle Gen 3 devices), and the answer is yes, the whole directory with its files is deleted.

On point 2: I am still not sure I understand AN035 File System | Datasheets | Particle regarding file size. Am I right that one file takes a minimum of 8 KB (two 4 KB sectors), even if the file only carries one byte of actual data?

Please help me understand this. Thanks.
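The arithmetic behind the question can be written out as follows. The two-sectors-per-file figure is the poster's reading of AN035 and is not confirmed anywhere in this thread. The constant and function names are made up for illustration, and 2 MB is taken to mean 2 × 1024 × 1024 bytes.

```cpp
#include <cstddef>

// Figures quoted in the post above.
constexpr std::size_t kSectorBytes = 4096;
constexpr std::size_t kFileSystemBytes = 2u * 1024u * 1024u;
constexpr std::size_t kMaxFiles = 200;

// Assumption under discussion: 1 metadata sector + 1 content sector
// for every file whose content fits in a single sector.
constexpr std::size_t kSectorsPerSmallFile = 2;

// Total number of sectors in the file system.
constexpr std::size_t totalSectors() {
    return kFileSystemBytes / kSectorBytes;
}

// Sectors consumed by a number of small files under the assumption above.
constexpr std::size_t sectorsUsedBy(std::size_t fileCount) {
    return fileCount * kSectorsPerSmallFile;
}

// The same figure expressed in bytes.
constexpr std::size_t bytesUsedBy(std::size_t fileCount) {
    return sectorsUsedBy(fileCount) * kSectorBytes;
}

// Compile-time check that 200 small files fit under these assumptions.
static_assert(sectorsUsedBy(kMaxFiles) <= totalSectors(),
              "200 small files exceed the file system size");
```

Under these assumptions, 200 files would occupy 400 of the 512 sectors, and a one-byte file would still cost 8192 bytes.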

thrmttnw · November 25, 2021, 9:18am

I guess the original question is answered, so I will put the remaining issue in a new topic and set up a support case.

chipmc · November 26, 2021, 7:40pm

@thrmttnw,

I am also using PublishQueuePosix and I have observed something I wanted to run by you.

Some of my devices, especially the ones with less than stellar cellular coverage, will end up with a missed data report. I publish my web hooks before I connect to cellular and then connect if it makes sense to. For this reason, all my web hooks are published once the Boron connects and PublishQueuePOSIX clears its cache of messages.

Occasionally, it seems that these messages do not get through, and in looking at my logs, I see three scenarios:

1. The Boron publishes, but it does not make it to Particle. I can see this happen occasionally in my logs where I can see the device move from the report state to the response wait state without seeing the webhook in the console log. Perhaps the webhook was sent before the data connection was fully established.
2. The webhook is in the console log but the device resets because there is no response from the back-end service. When I look at the back-end service (Ubidots) I do not see the “dot” for that report.
3. The webhook is in the console and a “dot” was created by Ubidots but the response code (HTTP code 201) was not received by the Ubidots handler in my code so the device counts this as an error.

By breaking it down like this, I thought perhaps some approach could be developed for each case. Here is what I was thinking - please let me know if this makes sense:

For the first case: Perhaps I should turn on WITH_ACK. I looked at the documentation and it is not clear what this will do for sending cached publish events. Do you have any sense of this? Do you use this flag?

For the second case, it seems like your idea of a serialized response would make the most sense. In this approach, if a response was not received in a 6 second window (I believe this is Particle’s window for responses) the webhook could be resent.

For the third case, I thought it might be possible for the Boron to send a webhook to the Ubidots API to get the “last activity” for the device which would do the same thing as the 201 code to validate receipt.

lastActivity only for Get All Variables by Device Endpoint

Please note: The query parameter lastActivity is only available for the endpoint
Get All Variables by Device
https://industrial.api.ubidots.com/api/v2.0/devices/<device_key>/variables/

Am I thinking about this right? If you are doing any of these today, any advice would be appreciated.

Thanks,

Chip

chipmc · November 30, 2021, 3:08pm

@rickkas7

Is there any way to get more information on the role WITH_ACK plays when you are using PublishQueuePosix?

In reading the docs, it seems that there will already be multiple attempts (for Cellular devices) even when no ACK flag is set.

Here is my understanding of ACK in Publish for Generation 3 cellular devices when connected to Particle:

Regardless of the ACK flag, messages are republished multiple times (up to 3?) if no acknowledgement is heard.
The ACK flag then is mainly about when execution is resumed - NO_ACK immediate and WITH_ACK after success or timeout / retries are exhausted.

So then what does all this mean when we are using PublishQueuePosix? Since there may be a backlog of messages, it would seem that WITH_ACK could cause quite a delay. Also, if a message is sent and no ACK is received, is it still cleared from the publish queue?

In watching my logs (working through another connectivity issue), I occasionally see this message while the device is Particle.connected():

0001030509 [gsm0710muxer] ERROR: The other end has not replied to keep alives (TESTs) 5 times, considering muxed connection dead

The modem will then disconnect and reconnect to Particle. If PublishQueuePosix is not testing for ACKs before clearing the queue, might all the queued messages be lost?

Said another way, is there a way to (without blocking) ensure that a message is acknowledged by Particle before it is cleared from the queue?

Thanks,

Chip

rickkas7 · November 30, 2021, 3:25pm

If you use NO_ACK mode, there are never any retransmissions, because without an ACK, there’s no way to know if the transmission went through or not.

If you use WITH_ACK (or do not specify the ACK mode, as WITH_ACK is the default), then there are three tries over a period of 20 seconds. At that point, Particle.publish() will return false. However, there is still a chance the event will go out again later after reconnecting, so beware of the possibility of duplicates.

PublishQueuePosix only ever works with a single event at a time. If Particle.connected() returns true, then the first event is published. If using ACK mode (recommended), then the event is not removed from the queue until Particle.publish() returns true, meaning there was an ACK for the event.
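The remove-only-on-success behavior described above can be sketched in plain C++ as follows. `SimpleEventQueue` is a hypothetical stand-in, not the library's real class or API, and the `publish` callback stands in for a publish call that returns true only when an ACK was received.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>

// Hypothetical stand-in for a publish queue; not the library's real API.
class SimpleEventQueue {
public:
    // Returns true only when the event was acknowledged.
    using PublishFn = std::function<bool(const std::string&)>;

    void push(const std::string& event) { events_.push_back(event); }
    std::size_t size() const { return events_.size(); }

    // Works on the oldest event only. The event is removed when publish
    // reports success; otherwise it stays queued for a later attempt.
    // Returns true if an event was published and removed.
    bool processOne(bool connected, const PublishFn& publish) {
        if (!connected || events_.empty()) {
            return false;
        }
        if (!publish(events_.front())) {
            return false;  // no ACK: keep the event at the head of the queue
        }
        events_.pop_front();
        return true;
    }

private:
    std::deque<std::string> events_;
};
```

Because a failed attempt leaves the event at the head of the queue, a dropped connection delays queued events rather than discarding them. The cost is a possible duplicate if the original attempt did get through.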

In the normal case of Particle.publish() and checking the boolean result, the call is blocking. However, it actually returns a future, and you can test for completion without blocking. That’s what PublishQueuePosixRK does.
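A minimal sketch of the non-blocking check, using a stand-in type so it compiles off-device. `FakePublishFuture` is hypothetical: `isDone()` matches the call shown earlier in this thread, while `isSucceeded()` is an assumed name for reading the result and should be checked against the Device OS reference before use.

```cpp
// Hypothetical stand-in for the future returned by Particle.publish().
struct FakePublishFuture {
    bool done = false;       // set when the publish attempt has finished
    bool succeeded = false;  // meaningful only once done is true

    bool isDone() const { return done; }
    bool isSucceeded() const { return done && succeeded; }  // assumed name
};

enum class PublishStatus { Pending, Succeeded, Failed };

// Intended to be called once per pass through loop(). It never waits:
// if the publish has not finished, it reports Pending and returns.
template <typename Future>
PublishStatus pollPublish(const Future& future) {
    if (!future.isDone()) {
        return PublishStatus::Pending;
    }
    return future.isSucceeded() ? PublishStatus::Succeeded
                                : PublishStatus::Failed;
}
```

The caller keeps the future around between passes and acts only once the status leaves `Pending`, so the rest of the application keeps running while the publish is in flight.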

If the device sends a keep-alive ping and does not get an ACK, then the CoAP connection may be reset. This is unlikely to affect publishes in general, because keep-alives are normally only sent 23 minutes after any transmission. So when you publish, the timer is reset. I suppose there is probably a race condition where something will happen. However, even if the CoAP session is reset, it should be no different than if the cloud reconnects during the publish timeout. Depending on how quickly the reconnection occurs, either the publish will time out, or go through.

chipmc · November 30, 2021, 3:42pm

@rickkas7 ,

Thank you very much for the clarification. The explanation of Particle.publish() makes sense and I certainly do want to use ACK to improve message reliability. Based on this, I will use (PRIVATE | WITH_ACK) going forward.

I had not appreciated the power of “future” until thinking about it in the context of publish queues. Thank you for that link and for the amazing work you have done with this library.

Chip

thrmttnw · December 1, 2021, 8:23am

I have found that when upload data are malformed compared to the web hook definition, they will just be thrown away by the web hook service with no feedback.

With good data, sometimes the web hook service still does not fire. And when it does fire, sometimes it will get no response from the end service. And when it does get a response, sometimes the response handler still does not fire. And when it does fire, sometimes the device will not receive it. I have seen all of these, but not so much in the last year or so (the back end must have improved a lot compared to a couple of years back).

Without control over the back end service point (Azure service bus) I could not serialize the way @rickkas7 suggested.

Instead, since the device is always on, I modified the libraries so that a publish waits up to 10 s for a response (or error) handler event, and resends after 5 minutes instead of 30 s. That way a false positive is less likely (e.g. web hooks may retry for a couple of minutes).

I ignore this for simplicity: a device lacking a confirmation simply resends. Uploading the same data twice on rare occasions was OK with the end service in my case.
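The wait-then-resend timing described above could look roughly like this small state machine, written as plain C++ with the clock passed in as milliseconds. All names are hypothetical and this is not the poster's actual library modification. It assumes the 5 minute delay is counted from the moment the 10 s wait gives up or an error arrives.

```cpp
#include <cstdint>

constexpr std::uint32_t kConfirmTimeoutMs = 10u * 1000u;     // wait for confirm
constexpr std::uint32_t kRetryDelayMs = 5u * 60u * 1000u;    // delay before resend

enum class UploadState { Idle, WaitingForConfirm, WaitingToRetry };

// Hypothetical tracker for one in-flight upload.
class ConfirmedUpload {
public:
    // Call right after the event has been published.
    void startPublish(std::uint32_t nowMs) {
        state_ = UploadState::WaitingForConfirm;
        stateStartMs_ = nowMs;
    }

    // Call from the hook-response handler.
    void onConfirm() {
        if (state_ == UploadState::WaitingForConfirm) {
            state_ = UploadState::Idle;
        }
    }

    // Call from the hook-error handler.
    void onError(std::uint32_t nowMs) {
        if (state_ == UploadState::WaitingForConfirm) {
            state_ = UploadState::WaitingToRetry;
            stateStartMs_ = nowMs;
        }
    }

    // Call periodically. Returns true when the caller should publish the
    // same event again (and then call startPublish).
    bool update(std::uint32_t nowMs) {
        // Unsigned subtraction stays correct across millisecond rollover.
        std::uint32_t elapsed = nowMs - stateStartMs_;
        if (state_ == UploadState::WaitingForConfirm &&
            elapsed >= kConfirmTimeoutMs) {
            state_ = UploadState::WaitingToRetry;
            stateStartMs_ = nowMs;
            return false;
        }
        return state_ == UploadState::WaitingToRetry &&
               elapsed >= kRetryDelayMs;
    }

    UploadState state() const { return state_; }

private:
    UploadState state_ = UploadState::Idle;
    std::uint32_t stateStartMs_ = 0;
};
```

A confirmation that arrives after the 10 s wait has expired is ignored here, so the event is resent anyway. That matches the tolerance for occasional duplicates mentioned in the post.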

On sleeping battery driven devices in the past, devices simply waited with resends until next wake, letting deep sleep effectively restart/reset the modem (instead of an elaborate reset algorithm). It is simple and saves a lot of juice.

In a case where uploads were "urgent", when there were waiting uploads just before sleep, I changed the wake period from 1 hour to [5] minutes, and back again when the queue was cleared.

A remaining question is how much time to allocate for wake periods, depending on how poor a connection you will accept.

Depending on your power budget, and not having experience with deep sleeping on this platform, I am thinking: stay connected until the queue is empty, or the device has been connected for max. [15-20s], whichever happens first. On an earlier product on another platform, if it took longer than that to send a small amount of data, it was highly unlikely to happen anyway (2G).
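The two sleep decisions discussed above (a shorter wake period while uploads are pending, and a cap on connected time) can be sketched as two small functions. The names are hypothetical, the 20 s cap is the upper end of the range mentioned in the post, and the queue length is passed in so that no particular library API is assumed.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kNormalWakeSeconds = 60u * 60u;   // 1 hour
constexpr std::uint32_t kUrgentWakeSeconds = 5u * 60u;    // 5 minutes
constexpr std::uint32_t kMaxConnectedMs = 20u * 1000u;    // assumed cap

// Picks the next wake period: short while uploads are still queued,
// back to normal once the queue has been cleared.
std::uint32_t nextWakeSeconds(std::size_t queuedEvents) {
    return queuedEvents > 0 ? kUrgentWakeSeconds : kNormalWakeSeconds;
}

// Decides whether to go to sleep now: either the queue is empty or the
// device has been connected for the maximum allowed time, whichever
// happens first.
bool shouldSleep(std::size_t queuedEvents, std::uint32_t connectedMs) {
    return queuedEvents == 0 || connectedMs >= kMaxConnectedMs;
}
```

On a device, `queuedEvents` would come from the publish queue and `connectedMs` from a timer started when the cloud connection came up.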

What do the data for your devices tell you is a "normal max connection time" for getting data through in your locations?

Have you modified PublishQueuePosixRK to act based on webhook response handlers?

chipmc · December 1, 2021, 2:21pm

@thrmttnw ,

Thank you for the thorough response - very helpful.

Based on your response and some clarifications from Rick, I am a bit less worried about PublishQueueRK either locking up code execution or emptying the queue when the Particle API connection is broken.

My plan is now to:

Turn on WITH_ACK for PublishQueuePosix publishes
Test for an empty queue before sleeping once connected using

size_t PublishQueuePosix::getNumEvents()

Gets the total number of events queued.

size_t getNumEvents()
This is the number of events in the RAM-based queue and the file-based queue. This operation is fast; the file queue length is stored in RAM, so this command does not need to access the file system.

If an event is currently being sent, the result includes this event.

Currently, my devices stay awake for one minute when they wake. This has worked well, but I think I could simply test for an empty queue before sleeping.

I have not modified the PublishQueuePosix library and I would like to avoid doing so if I can. That said, I am a bit concerned that it is a 0.0.1 release - I am putting a lot of faith in its operation.

So far, it is working well with only one issue that was my fault. I published using the library before executing the .setup() command - this caused a hard fault.

Thanks,

Chip

rickkas7 · December 2, 2021, 12:46am

I wouldn’t worry too much about the 0.0.1 version. I fixed all of the known issues and unnecessary complexity in the PublishQueueAsyncRK library and apparently did such a good job that I haven’t had to fix any bugs. Though I hesitate to say that out loud.

chipmc · December 2, 2021, 2:06am

@rickkas7 ,

Yes, sorry, I did not mean to imply otherwise. I have it running on ~150 devices now and it has been a godsend! Thank you!

Chip

system · January 1, 2022, 12:07pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
