PublishQueuePosix - WITH_ACK vs NO_ACK and other questions

"WITH_ACK or not to ACK, that is the question"

So I'm long overdue in migrating from the legacy PublishQueueAsyncRK to the PublishQueuePosix library. As I transition the code to make it more robust, I was wondering whether WITH_ACK has the same behavior in the new library as in the original.

Per the GitHub readme of the original library, PublishQueueAsyncRK:

I recommend using WITH_ACK. The worker thread will wait for the ACK from the cloud before dequeing the event. This allows for several tries to send the event, and if it does not work, the send will be tried again in 30 seconds if cloud-connected. New events can still be queued during this time.

Can I assume the new PublishQueuePosix library has the same functionality under the hood as the original, i.e. events are only removed from the queue after an ACK?

Both modes count as a single data operation, correct? I assume so given this: Particle.function() - Cloud functions | Reference | Particle

Each publish uses one Data Operation from your monthly or yearly quota. This is true for both WITH_ACK and NO_ACK modes.

So what would be a scenario where someone would prefer NO_ACK over WITH_ACK? To me it seems like you'd always want WITH_ACK. Or said differently, what's the disadvantage of using WITH_ACK?

A related question... when are files written to the flash file system? For example, I set the parameters like this:

    PublishQueuePosix::instance().setup();
    PublishQueuePosix::instance().withRamQueueSize(10);
    PublishQueuePosix::instance().withFileQueueSize(50);

My understanding is that only once the RAM queue is filled with more than 10 events will it start writing to flash. However, per the serial log, I'm currently getting this:

0001085058 [app.pubq] TRACE: publishCommon eventName=DataArray eventData=[{MY DATA}]
0001085060 [app.pubq] TRACE: fileQueueLen=1 ramQueueLen=1 connected=1
0001085149 [app.pubq] TRACE: writeQueueToFiles fileNum=4

Which to me indicates it just wrote file #4 to the flash file system despite the RAM queue having a maximum of 10 and the device being connected at the time. I just sleep, wake up, take readings, add the data to the queue, and fall back asleep. Then every 20 minutes it also connects and publishes the data out. In this scenario... why would the file be written to the flash file system?

Final question, per the GitHub repo:

Setting it to 0 means all events will be written to the file system immediately to reduce the chance of losing an event. This has higher overhead and can cause flash wear if you are publishing very frequently.

How frequent is too frequent, i.e. when does this flash wear become something to actually be concerned about? Should I care whether it writes to flash if I normally capture an event every 5 minutes?

ACKs don’t count as data operations, so there’s little reason not to use them. The NO_ACK option primarily exists from the early days when cellular data was 1 MB per month and the 122 bytes of overhead from an ACK mattered more.
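
As a sketch of queuing an event with WITH_ACK (assuming the queue's publish() accepts the same flags as Particle.publish(); the event name and payload here are made up):

    #include "Particle.h"
    #include "PublishQueuePosixRK.h"

    void queueReading(float tempF) {
        char buf[64];
        snprintf(buf, sizeof(buf), "{\"temp\":%.1f}", tempF);
        // Queue the event; the worker thread publishes it WITH_ACK and only
        // removes it from the queue after the cloud acknowledgement arrives.
        PublishQueuePosix::instance().publish("DataArray", buf, PRIVATE | WITH_ACK);
    }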

The RAM queue size is the maximum. There are a number of reasons the queue will be written to before the maximum. Most notably:

  • Before going to sleep
  • On any failure to publish

The sleep case exists because HIBERNATE mode does not preserve RAM, and even in the other sleep modes RAM could be lost if the device were reset while asleep, so it’s safer to always save the queue to the file system before sleeping.
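
A rough sketch of that save-before-sleep idea in application code (assuming writeQueueToFiles() is callable from the application in your library version; the sleep settings are made up for illustration):

    // Persist any RAM-queued events, then sleep. Check the library header for
    // the exact method available in your version.
    void prepareForSleep() {
        PublishQueuePosix::instance().writeQueueToFiles();

        SystemSleepConfiguration config;
        config.mode(SystemSleepMode::ULTRA_LOW_POWER)
              .duration(20 * 60 * 1000); // 20 minutes, in milliseconds
        System.sleep(config);
    }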

The second just simplifies the logic internally. The purpose of the RAM queue is to avoid having to write to flash in the common case of publishing a few events and immediately sending them out. However, once you have a failure, you need to put the item back into the queue. Rather than having two code paths that remove items from both the RAM queue and the file-based queue, PublishQueuePosix only deals with one or the other, which eliminates some weird edge cases.

Unless you have completely filled the flash file system with other data, you can safely write elements to the file system several times per minute. Wear is spread across the entire 2 MB in 4096-byte sectors, so as long as there are enough free sectors, the wear is distributed widely enough that it wouldn’t really be possible to reach 100,000 erase cycles on a single sector.
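
As a rough back-of-the-envelope check (assuming 2 MB / 4096 bytes = 512 sectors, a 100,000-erase-cycle rating per sector, and, pessimistically, one sector erase per queued event): an event every 5 minutes is 288 writes per day. Even if only 100 sectors were free for wear leveling, that is under 3 erases per sector per day, or on the order of 90 years before any single sector reaches 100,000 cycles.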


Thanks @rickkas7 - Very much appreciated.

This library is great! Since I hadn’t touched the publishQueue portion of my code in some time, I was still using NO_ACK; between an occasional reset and a lost message, I’d occasionally have some missing data on the cloud side and never knew why. I am pretty sure using this library, saving to the flash file system with WITH_ACK, will alleviate that. I especially like the getCanSleep() method so I know when the queue is empty and I’m able to sleep. Very nice!

THANK YOU!

On a slightly related note… I’ve been considering creating some sort of “cloud-side” watchdog. Meaning that, since I expect the device to connect and publish data every 20 minutes, I was thinking of using the ACK sent back from the cloud after a publish event to “pet” a cloud-side software watchdog. Something like PublishQueuePosix::instance().LastCloudACKMillis() or maybe PublishQueuePosix::instance().LastCloudACKTime().

Then I can just occasionally compare that timestamp with the current time or current millis() to determine how long it’s been since I had a successful ACK. Once this time is greater than 60 minutes, or even 2–3 hours, it could issue either a simple reset or a deep power down in an attempt to reset itself.

Now that I think about it what about something like:
PublishQueuePosix::instance().MinutesSinceLastCloudACK()

The thought is this would allow for a “catch all” to self-recover from any issue where the device says it’s connected but something is hung up in publishing/ACKing events. In the past, I had a few strange scenarios where a device was connected but not publishing events, and the AB1805 hardware watchdog as well as the AB1805 deep power down were still happy/did not reset the device. The only way it recovered was having the customer turn the device off and on, or flashing new firmware OTA to basically force a reset over the cloud. I was thinking a cloud-side watchdog like this would be a good catch-all to improve self-recovery from odd scenarios like that.

Thoughts? Would adding this to the PublishQueuePosix library be a reasonable approach, or what’s the recommended way to implement a user-application firmware-to-cloud watchdog?


I have considered adding end-to-end confirmation as an option to PublishQueuePosixRK. The way it would work is that the receiver of the event (webhook or SSE) processes the event and then calls a function on the device to indicate that it has received it. Then the device knows for sure that the event really has been received and processed successfully. This covers a whole range of failures, not only in the cloud connection, but also in the webhook service and your own server. It would double the data operations, but it would be very reliable for when you really want to make sure the events get through. The other issue is that the payload would need to be JSON, because it needs to include a packet serial number to know which event is being acknowledged. And of course you need a server to support it. I haven’t decided if this is a useful enough case given the limitations, but it’s feasible.
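
A hypothetical sketch of the device-side half of that scheme (the "eventAck" function name and the "seq" serial number are made up for illustration; the webhook/server side that calls the function back is not shown):

    #include "Particle.h"

    // Each published payload would carry a serial number ("seq"); the receiving
    // server calls this cloud function to confirm which one it processed.
    uint32_t lastConfirmedSeq = 0;

    int eventAckHandler(String arg) {
        lastConfirmedSeq = (uint32_t) arg.toInt();
        return 0;
    }

    void setup() {
        Particle.function("eventAck", eventAckHandler);
    }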


Yeah… interesting. In my personal use case, the data is not critical enough to justify 2X data operations for a true “end to end” confirmation of data transfer. I do have a backend that could accommodate this pretty easily but, like I said, not at the cost of 2X the number of operations.

Rather, the ACK is kind of a data-operation-free response back from the Particle Cloud saying events are being published and acknowledged, so things are likely working as they should. It’s one step closer to a more robust/self-healing application without the cost of 2X data operations.

@jgskarda, but wouldn't you be depending on the cloud-side WD during the very time that you have cloud problems?

The way I've dealt with this in the past was to use a much less frequent webhook and have the device subscribe to its own event, every 2 hours for instance. I used a relay timer board as the WD, since I couldn't find an IC that allowed long cycle times. You get the round-trip confirmation at whatever schedule you can live with, without the excessive overhead.
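
Roughly, that round trip could be sketched like this (the "wdCheck" event name is made up, and the webhook that echoes a response back to the device is assumed to already exist):

    #include "Particle.h"

    unsigned long lastRoundTripMillis = 0;

    // Runs when the webhook response for the watchdog event comes back;
    // only a confirmed round trip counts as a "pet".
    void hookResponseHandler(const char *event, const char *data) {
        lastRoundTripMillis = millis();
    }

    void setup() {
        Particle.subscribe("hook-response/wdCheck", hookResponseHandler);
    }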

Would it be possible for you to hand off WD control after a specific number of "pets", send a WD publish event, and only "pet" after a successful round trip (on the 20th cycle, for instance)? Then WD control goes back to your normal process.
Or do it asynchronously and confirm the WD webhook flag was raised within/after a certain number of pets.

Wouldn't that keep the failsafe intact, while adding a periodic/scheduled round-trip check to the hardware watchdog?

I may be way off base here.


Appreciate the ideas... what I'm thinking is that the cloud ACK is what "pets" a software watchdog residing on the Particle device, so if anything fails in publishing events or in the cloud acknowledging them, it never gets pet and the device resets itself. This protects against (and recovers from) any situation where the device application code is executing but, for whatever reason, the device is not publishing events or not receiving ACKs.

To note, this would be in addition to a hardware watchdog. The hardware watchdog resets the device if the application code hangs and stops petting it. This cloud-side one is more about the case where the application code is running but for some reason is not publishing or not receiving an ACK.

I am conceptualizing the cloud one as a timer on the device that starts at 0 and is always timing up. Every time it receives an ACK back from any publish event, it resets the timer to 0. If the timer ever gets to, say, 2–4 hours, it issues a system reset or maybe even a deep power down (pull power for 30 seconds) via the AB1805.

If I understand correctly, what you are proposing would work as well (i.e. have a dedicated publish event, maybe called "petWDT", have the device publish it on some cadence, then send a response or call a function on the device via subscribe to act as the software watchdog I'm proposing above). This could work as well; it just takes a bit more to configure.

I also thought about monitoring the number of queued events in PublishQueuePosix. If the number of queued events reaches the configured max, then events are not getting out, something must be wrong, so issue a reset. What I didn't like about this approach is that it still requires something in the application code to create the publish events that fill the queue. So it's better than nothing but not really a full "catch all". I think the dedicated event you are proposing is one step closer but has a similar limitation: maybe the code that sends the petWDT watchdog event is working, but the application code that sends out the needed sensor data is hung up for some reason. Just brainstorming on what's the simplest yet widest-reaching "catch all".

I suppose I could keep track in my backend of the timestamp when I last sent a "pet the watchdog" response to each device. With each publish event it processes, if it's been > 2 hours since the device was last pet, send a webhook response or call a function to pet it and update the timestamp in my backend. That way it only pets every 2 hours instead of on every single publish event. This also confirms actual sensor data is making it to my backend and would only add a few data operations here and there. Huh... just brainstorming options.

If the ACK is exposed so you can capture it, why not use that to interrupt (no pun intended) the user code's petting of the hardware watchdog when it's missed an appropriate number of round-trip acknowledgments? That would eliminate the need for the scheduled petWDT event.

Since we’re talking about preventing a scenario that might happen a few times a year, you really don’t want that control to be remote (cloud) based…I wouldn’t think.
This edge case is trying to recover from a Cloud/backend problem in the first place.

So what’s the easiest way to have the hardware WD pull the plug after a certain number of failed round trips, without significantly increasing the Data Operations?
I’m just thinking out loud.


Yeah, very true... it's a once-or-twice-a-year event that I'm trying to prevent. Let's not overcomplicate it. :slight_smile: That's why it seems the ACK messages from publish events are already there to tap into.

I think you and I are saying nearly the same thing; it just might be implemented slightly differently. My understanding of what you are proposing is something like this within loop():

  // It's been less than 120 minutes since the last ACK message, keep petting the hardware watchdog
  if(minsSinceLastACK < 120){
    ab1805.loop();
  }

Or maybe:

  // We have fewer than 24 consecutive failed publish attempts (did not receive an ACK). Keep petting the watchdog. If 24 or more, something is wrong; stop petting and allow it to reset us after the timeout.
  if(numFailedConsecutivePublishes < 24){
    ab1805.loop();
  }

After 120 minutes or 24 consecutive failed publish attempts, the AB1805 will no longer be pet, and after an additional 124 seconds (or whatever the hardware watchdog maximum is set to) an external reset will be issued by the AB1805 since it's no longer being pet. Is this kind of what you are proposing?

Whereas I'm proposing something like:

  // It's been more than 120 minutes since the last ACK message, Issue a System Reset OR a DeepPowerDown reset.
  if(minsSinceLastACK > 120){
    System.reset();
    // Or ab1805.deepPowerDown() as an alternative: remove power from the device for 30 seconds, since it's likely a cloud connectivity issue. Deep-power-down the cellular modem in an attempt to self-heal.
  }

Many times, my device does not "stay awake" very long (it takes readings, publishes data, and falls back asleep). So in normal operation it's not awake long enough for the hardware watchdog to fully time out (I think 2 minutes). So I'm thinking just issue the reset or deep power down immediately once the condition is met. Just my initial thoughts... My ask to Rick, or what I can look into, is whether the library can be updated to provide something like PublishQueuePosix::instance().minsSinceLastACK(). Then it's as simple as:

  // It's been more than 120 minutes since the last ACK message; issue a System Reset OR a DeepPowerDown reset.
  if(PublishQueuePosix::instance().minsSinceLastACK() > 120){
    System.reset();
    // Or ab1805.deepPowerDown() as an alternative: remove power from the device for 30 seconds, since it's likely a cloud connectivity issue. Deep-power-down the cellular modem in an attempt to self-heal.
  }

I hope the function can be called from the device's main app, for custom options like logging.

Again, thank you for PublishQueuePosixRK. I've had it running since last November without issues on a number of devices.

With a requirement for an end-service ACK on always-on devices, for a server not under our control, two modifications were made (avoiding a packet serial number):

  • BackgroundPublish::thread_f(): for a specific "event_name" (the data upload), it waits up to 10 seconds for a variable change that is set in the main app when a subscribed confirmation for that event name arrives (roughly as sketched below).

  • PublishQueuePosix.h: waitAfterFailure changed from 30000 to 300000, for a 5-minute re-upload period that (normally) outlasts webhook retries and reduces data operations during long outages.

It is not bulletproof, and the end service must accept duplicate uploads.
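
The app-side part of the first modification (the flag the modified thread waits on) might look roughly like this; the confirm event name and the flag are placeholders, and the change inside BackgroundPublish::thread_f() itself is not shown:

    #include "Particle.h"

    // Set by the subscribe handler when the end service confirms the upload;
    // the modified worker thread polls this for up to 10 seconds before dequeuing.
    volatile bool uploadConfirmed = false;

    void confirmHandler(const char *event, const char *data) {
        uploadConfirmed = true;
    }

    void setup() {
        // Placeholder event name for whatever confirmation the end service sends back.
        Particle.subscribe("confirm/DataArray", confirmHandler);
    }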

With end-server ACKs implemented, I also check the number of queued events.


@rickkas7 I noticed another minor thing tonight with PublishQueuePosix::instance().getCanSleep(). I was originally evaluating .getCanSleep() immediately after connecting; once it returns true, I know all events are published and I can disconnect and then fall asleep. I.e. I was doing this:

        if (Particle.connected()) {
            // Now let's wait up to 60 seconds until all publish events are complete. Then we can go to sleep.
            int i = 0;
            while (!PublishQueuePosix::instance().getCanSleep() && i <= 60) {
                softDelay(1000);
                i = i + 1;
                Log.info("Waiting for events to finish publishing, so far we waited %i seconds", i);
            }
            newState = st_sleep;
        }

What I noticed is that within the library the value of canSleep is not updated to false until the first event in the queue begins to be published. Sounds fine, except that first event doesn’t publish until after the waitAfterConnect delay of 2 seconds.

Ultimately, this means .getCanSleep() returns true for the first 2 seconds after connecting even when there are events in the queue. Ideally, it would return false whenever there are events in the queue.

For now I just added a short delay before evaluating .getCanSleep(), but I'm wondering if there is a clean way to fix this within the library itself. Was the 2-second waitAfterConnect delay proven to be required, or is it just some safety margin?

        if (Particle.connected()) {
            // Wait just over 2 seconds after Particle.connected() before evaluating .getCanSleep() so it evaluates correctly.
            softDelay(2100);
            // Now let's wait up to 60 seconds until all publish events are complete. Then we can go to sleep.
            int i = 0;
            while (!PublishQueuePosix::instance().getCanSleep() && i <= 60) {
                softDelay(1000);
                i = i + 1;
                Log.info("Waiting for events to finish publishing, so far we waited %i seconds", i);
            }
            newState = st_sleep;
        }

I think I fixed the issue. Let me know if it works.

0.0.5 (2022-10-06)

  • I believe I fixed a situation where getCanSleep() can return true during the waitAfterConnect period after connecting even though there are events in the queue.

This is what I changed:


That was quick… thanks @rickkas7. The changes made to the library look like they will fix this. Thanks!

I’ll give it a try later this evening and post an update here.

Yeah, that did the trick, and I was able to remove the added 2-second delay. Thank you very much @rickkas7. As always, very much appreciated!
