Hardware watchdog in Device OS 5.3.0

The hardware watchdog would detect and reset the device if user firmware entered an infinite loop.

It’s not always a great solution to that problem during development: if the loop occurs soon after boot, the device resets before you can flash it again. In this case, using safe mode (breathing magenta) is a better solution.

It wouldn’t hurt to enable it, as it can be useful in other cases where the device freezes later on. This could occur after corrupting memory (block overrun, under-run, using freed memory, etc.).


In my experiments on the B402, I did not see the hardware watchdog being stopped during sleep. Is it possible for it to stop on the B402? It doesn’t appear to be possible to stop the hardware watchdog at the application level, so is there some additional control available at the Device OS level?

I made a previous post with some example code here: BSoM Internal Watchdog Sleep Question.

Alright... finally getting around to adding this additional Device OS based watchdog to my existing code base.

What I'd like to do is call Watchdog.refresh() on every successful publish (one that received an ACK back from the Particle cloud). This would essentially allow for a cloud-side watchdog without burning extra data operations to pet a watchdog using Particle.subscribe() or Particle.function(). I currently use the PublishQueuePosixRK library to handle queueing and publishing of events. A very nice library, by the way. Digging through the .cpp of the library, it looks like there is a private callback function that gets called when each publish completes:

void PublishQueuePosix::publishCompleteCallback(bool succeeded, const char *eventName, const char *eventData) {
    publishComplete = true;
    publishSuccess = succeeded;
}

@rickkas7 - What do you think of the approach outlined below? If it's a reasonable approach, would you be open to updating the PublishQueuePosix library to enable petting the device watchdog during this callback? I'd be happy to make a pull request. My initial thought is that the option would default to false, and I'd simply call a function like PublishQueuePosix::ACKWatchdogRefresh(true) to enable it. It might need an #ifdef so the library would still work in earlier Device OS versions. Here is what I had in mind:
In setup() in main.cpp:

  Watchdog.init(WatchdogConfiguration()
    .capabilities(WatchdogCap::NOTIFY | WatchdogCap::DEBUG_RUNNING | WatchdogCap::SLEEP_RUNNING)
    .timeout(60min));
  Watchdog.start();

  // Register this callback to be called when the cloud side watchdog timer expires. 
  // Since we also have the AB1805 watchdog, we can just set a flag and then handle the deep power down in loop. The device itself is running fine, it's just not successfully publishing events like it should. 
  Watchdog.onExpired([]() {
    watchdogExpiredFlag = true;
  });

In the main.cpp Loop to handle the watchdog expired event:

  // If the watchdog expired, do a deep power down reset.
  if (watchdogExpiredFlag == true) {
    if (watchdogEnable == 1) {
      Log.error("Device/Cloud Side Watchdog Expired. Let's do a deep power down reset");
      softDelay(100);
      ab1805.deepPowerDown();
    }
    // If the watchdog is not enabled, just disconnect and turn cellular off. We don't want to reset in case the customer has an actuator on.
    else {
      Log.error("Device/Cloud Side Watchdog Expired but watchdog is disabled, let's disconnect from Particle and turn cellular off");
      Particle.disconnect(CloudDisconnectOptions().graceful(true).timeout(8s));
      softDelay(1000);
      Cellular.off();
      softDelay(2000);
    }
  }

Modify PublishQueuePosix.cpp:

void PublishQueuePosix::publishCompleteCallback(bool succeeded, const char *eventName, const char *eventData) {
    publishComplete = true;
    publishSuccess = succeeded;
    Watchdog.refresh(); // Pet the watchdog in the publish-complete callback
}

Is this a reasonable approach?

EDIT
This approach may work on a P2 series but doesn't work as expected on a Boron/B Series due to the different behavior of the watchdog there. The MCU resets itself immediately after the Watchdog.onExpired callback is called.

Instead, I'll just let the watchdog reset the MCU itself rather than calling ab1805.deepPowerDown().

I added a new function to PublishQueuePosixRK version 0.0.6. It registers a callback function to call when the publish is complete. You can add your watchdog call in your callback function.

withPublishCompleteUserCallback()

PublishQueuePosix & PublishQueuePosix::withPublishCompleteUserCallback(std::function<void(bool succeeded, const char *eventName, const char *eventData)> cb)

Adds a callback function to call when a publish is complete.

Parameters
cb Callback function or C++ lambda.

Returns
PublishQueuePosix&

The callback has this prototype and can be a function or a C++11 lambda, which allows the callback to be a class method.

void callback(bool succeeded, const char *eventName, const char *eventData)

The parameters are:

succeeded: true if the publish succeeded or false if it failed
eventName: The original event name that was published (a copy of it, not the original pointer)
eventData: The original event data

Note that this callback will be called from the background thread used for publishing. You should not perform any lengthy operations and you should avoid using large amounts of stack space during this callback.

I would add this in setup() after PublishQueuePosix::instance().setup():

PublishQueuePosix::instance().withPublishCompleteUserCallback([](bool, const char *, const char *) {
    Watchdog.refresh();
});
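
As a rough illustration of the note above about keeping the callback short, here is a minimal host-side sketch in plain C++. The callback only bumps a counter, and slower work such as logging is left to the main loop. The names recordPublishResult, takeSuccesses, and takeFailures are hypothetical and not part of PublishQueuePosixRK; the sketch also assumes std::atomic is acceptable for sharing counters between the publish thread and the main loop.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical counters shared between the publish callback and the main loop.
static std::atomic<unsigned> publishSuccesses{0};
static std::atomic<unsigned> publishFailures{0};

// Same parameter list as the callback prototype quoted above. It only
// increments a counter: no logging, no blocking calls, no large stack buffers.
void recordPublishResult(bool succeeded, const char * /*eventName*/, const char * /*eventData*/) {
    if (succeeded) {
        publishSuccesses.fetch_add(1);
    } else {
        publishFailures.fetch_add(1);
    }
}

// Called from the main loop: returns how many results were recorded since the
// previous call and resets the counter in one atomic step.
unsigned takeSuccesses() { return publishSuccesses.exchange(0); }
unsigned takeFailures() { return publishFailures.exchange(0); }
```

On a device, a function like recordPublishResult could be passed to withPublishCompleteUserCallback(), and loop() could call takeFailures() and log from there instead of logging inside the callback.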

This is perfect! A much more elegant approach. Thank you very much @rickkas7!

In the suggested code, I want to use the succeeded Boolean to conditionally call Watchdog.refresh(), correct? I only want to pet the watchdog when the publish succeeds, not just when it completes. So something like this:

// Perform setup on the publish queue system
PublishQueuePosix::instance().setup();
PublishQueuePosix::instance().withPublishCompleteUserCallback([](bool succeeded, const char *eventName, const char *eventData) {
  if (succeeded) {
    Watchdog.refresh();
    Log.info("Refresh the watchdog... an event was successfully published with ACK: %s", eventName); // Normally don't log here since this is called from the background publish thread.
  }
  else {
    Log.error("Publish failed: %s", eventName); // Normally don't log here since this is called from the background publish thread.
  }
});

I installed the updated library and in my brief testing this morning, it works exactly as expected. This should help eliminate the very few situations where a device would be connected to the cloud but stop publishing data for some unknown reason.
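
To make the intended behavior concrete, here is a small host-side model in plain C++ of a watchdog that only an acknowledged publish can refresh. ModelWatchdog and handlePublishResult are hypothetical names for illustration, time is passed in explicitly rather than read from a clock, and this is not how the Device OS Watchdog class is implemented.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of a watchdog timer, for reasoning only.
class ModelWatchdog {
public:
    explicit ModelWatchdog(uint64_t timeoutMs) : timeoutMs_(timeoutMs) {}

    // Restart the timeout window at the given time.
    void refresh(uint64_t nowMs) { lastRefreshMs_ = nowMs; }

    // True once a full timeout period has passed without a refresh.
    bool expired(uint64_t nowMs) const { return nowMs - lastRefreshMs_ >= timeoutMs_; }

private:
    uint64_t timeoutMs_;
    uint64_t lastRefreshMs_ = 0;
};

// Mirrors the callback logic above: a failed publish does not pet the watchdog.
void handlePublishResult(ModelWatchdog &wd, bool succeeded, uint64_t nowMs) {
    if (succeeded) {
        wd.refresh(nowMs);
    }
}
```

With a 60 minute timeout, a successful publish at minute 10 followed only by failed publishes leaves the model expired at minute 70, which is the case where the real watchdog would reset the device.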


I have a question regarding the documentation for the watchdog that I just need some clarification on, especially the following section:

You can only call init() if the watchdog is not currently running. On the nRF52 platform (Boron, B Series SoM, Argon, Tracker SoM) you cannot stop the watchdog, so you necessarily can only start it once until the device is reset. This also means you cannot change the watchdog time period on the nRF52 platform.

So I cannot stop the watchdog once I've started it, and I cannot change the watchdog period after it has been started. Just to be sure: if I call System.reset(), is the watchdog no longer running after the device starts up again? How about after an OTA update?

The watchdog is always stopped after the device is reset. This allows a software update to occur, DFU mode to be entered, or new firmware that does not use the watchdog to be run, all of which would fail if the watchdog was still running.
