Hardware watchdog in Device OS 5.3.0

On Gen 3 devices (Argon, Boron, B Series SoM, Tracker SoM, Tracker One, Monitor One, E404X) and the P2 and Photon 2, you can now enable the hardware watchdog in the MCU (nRF52840 or RTL8721). This is significantly more effective than the application (software) watchdog, and does not require external hardware.

The documentation is in the Device OS API reference.

This feature is only available in Device OS 5.3.0 and later. It will not be added to 4.x LTS. It will not be added to Gen 2 devices (E Series other than the E404X, Electron, Photon, P1).

There is mention of the Monitor One. Is it available?

I'm unsure whether the documentation for the watchdog is correct, but using

Watchdog.init(WatchdogConfiguration().timeout(180s));
Watchdog.start();

will still reset the device approx. 40-50 seconds after bootup; changing the timeout has no effect at all.

It's a mistake in the documentation, which is now fixed.

It is necessary to periodically call Watchdog.refresh().
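
A minimal sketch of that pattern, assuming a Device OS 5.3.0+ build. The 60 s timeout, the 10 s refresh interval, and the helper name refreshDue are illustrative choices, not values from the documentation. The PLATFORM_ID guard is an assumption that lets the helper compile off-device as well.

```cpp
#include <cstdint>

// Hypothetical helper: true once at least intervalMs has elapsed since the
// last refresh. Unsigned subtraction stays correct across millis() rollover.
bool refreshDue(uint32_t nowMs, uint32_t lastRefreshMs, uint32_t intervalMs) {
    return static_cast<uint32_t>(nowMs - lastRefreshMs) >= intervalMs;
}

#if defined(PLATFORM_ID)  // assumed to be defined only by Device OS builds
#include "Particle.h"

static uint32_t lastRefreshMs = 0;

void setup() {
    // Configure and arm the hardware watchdog once at startup.
    Watchdog.init(WatchdogConfiguration().timeout(60s));
    Watchdog.start();
}

void loop() {
    // ... application work ...

    // Pet the watchdog well inside the timeout; nothing in the system
    // does this on the application's behalf.
    if (refreshDue(millis(), lastRefreshMs, 10000)) {
        Watchdog.refresh();
        lastRefreshMs = millis();
    }
}
#endif
```

If loop() ever stops running, the refresh stops too, and the device resets once the timeout expires.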

This is a great addition! Thank you very much!

A few quick follow-up questions:

  • Is the watchdog still active during sleep modes? I.e., do I need to call something like Watchdog.stop() before sleeping and then Watchdog.start() after I wake up?
  • Is there a max duration the watchdog can be set to? I believe the AB1805 had a max duration of 124 seconds. Is there a limit here?
  • What is your recommendation for a "sleepy" Gen 3 device? Should we just set the watchdog timeout to be longer than the maximum duration of a sleep cycle? This would be ideal, as it acts more like a "catch all" and the watchdog would never turn off once set. If I sleep for, say, 20 minutes at a time, I'd like to just use: Watchdog.init(WatchdogConfiguration().timeout(1260s)); This would be 21 minutes.
  • Does Particle.process() or any other library call Watchdog.refresh() already, or do I need to call it explicitly? I'd very much prefer to keep it isolated if possible, so that we have full control over how we pet the watchdog. My thought is to use Watchdog.refresh() as a "cloud side" watchdog, i.e. refresh only when I get an ACK from a Particle.publish(), or possibly only in a Particle.subscribe() handler. If the refresh stops, something on the cloud connection is hung, and this would be an easy way to catch that.

I would also prefer that the system does not call Watchdog.refresh().

The hardware watchdog is automatically disabled before sleep and when restarting, so you don't have to worry about it during sleep or in safe mode for Device OS upgrades.

Watchdog.refresh() is not called anywhere in the system; you need to call it yourself.

I'll have to look into the maximum duration.

Maximum timeouts:

  • nRF52840: 131071999 milliseconds
  • RTL872x: 8190000 milliseconds
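
For scale, 131071999 ms is roughly 36.4 hours and 8190000 ms is 2 hours 16.5 minutes. A small sketch of guarding a requested timeout against these limits; clampTimeout and the constant names are hypothetical helpers, not part of the Device OS API, and the limits are simply the figures quoted above.

```cpp
#include <chrono>

// Maximum timeouts as quoted in this thread.
constexpr std::chrono::milliseconds kMaxTimeoutNrf52840{131071999};
constexpr std::chrono::milliseconds kMaxTimeoutRtl872x{8190000};

// Hypothetical helper: limit a requested timeout to the platform maximum.
constexpr std::chrono::milliseconds clampTimeout(
        std::chrono::milliseconds requested,
        std::chrono::milliseconds platformMax) {
    return requested > platformMax ? platformMax : requested;
}
```

With these numbers, a 12-hour timeout fits on the nRF52840 but would be clamped to the maximum on the RTL872x.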

It does seem like you could do some sort of a cloud-based refresh of the watchdog. I haven't tested it, but the timeouts are long enough that it could work. Of course that will use more data operations.

With such a long max timeout, could we add an option to keep the watchdog active during sleep? Maybe something like:

Watchdog.init(WatchdogConfiguration().timeout(180s).sleepMode(true));

By default this could be false so if someone doesn't explicitly add it, it'll disable it like it is now but I personally would like to keep it enabled during sleep. It makes it more of a broader "catch all". For example, maybe somehow the sleep duration was set wrong or something else happened that it wasn't falling asleep properly or waking up from sleep properly. What do you think?

I'd love to continue to use the AB1805 watchdog as the short-duration hardware watchdog (i.e. 124 seconds), but would then use this watchdog as a "cloud side" watchdog. With a max duration of over 2 hours, or even much longer for the nRF52840, I'd like to keep it enabled all the time, even during sleep modes, and pet it only on an ACK of a publish event or possibly in a dedicated Particle.function(). My backend that processes webhooks from a device (Python + SQL) would keep track of when the watchdog was last pet and then call a Particle.function() to pet the watchdog, maybe once per hour. I'd have no problem at all burning 24 data operations a day per device for this extra cloud side watchdog functionality. If needed, I could even do it every 2 hours.

Generally speaking, my customers are "OK" with an occasional hang up for 1-2 hours but it's a pain in the butt if they have to travel to the site to give it a hard power off. I've had a few scenarios where the device was "connected". I could ping it, I could push new firmware to it, but it stopped sending data out and it was unable to process Particle functions. The AB1805 wasn't resetting it so it's like user application firmware was running but was still locked up somehow. This would be the catch all for that edge case issue.

It is possible to keep the internal watchdog running during STOP or ULP sleep mode on nRF52840-based platforms (Argon, Boron, Tracker, B SoMs, E SoMX):

WatchdogConfiguration().capabilities(WatchdogCap::SLEEP_RUNNING)
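
A sketch of how this could be combined with the "timeout longer than the sleep cycle" idea from earlier in the thread, assuming an nRF52840-based device. The helper timeoutForSleep and the function name startSleepTolerantWatchdog are hypothetical; the 20-minute sleep and 1-minute margin are the figures from the question above. The PLATFORM_ID guard is an assumption that lets the helper compile off-device.

```cpp
#include <chrono>

// Hypothetical helper: watchdog timeout = longest expected sleep + margin.
constexpr std::chrono::seconds timeoutForSleep(std::chrono::seconds maxSleep,
                                               std::chrono::seconds margin) {
    return maxSleep + margin;
}

#if defined(PLATFORM_ID)  // assumed to be defined only by Device OS builds
#include "Particle.h"

void startSleepTolerantWatchdog() {
    // 20 minutes of sleep plus 1 minute of margin gives 1260 s.
    const std::chrono::seconds timeout = timeoutForSleep(20min, 1min);

    // SLEEP_RUNNING keeps the watchdog counting through STOP/ULP sleep,
    // so a sleep that never ends also leads to a reset.
    Watchdog.init(WatchdogConfiguration()
                      .capabilities(WatchdogCap::SLEEP_RUNNING)
                      .timeout(timeout));
    Watchdog.start();
}
#endif
```

The application still has to call Watchdog.refresh() after each wake, since the counter keeps running across the sleep.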

Excellent! You guys rock! Maybe I missed it but might be good to add to the docs.

Docs are updated with the capabilities option.

@rickkas7 ,

I know you said this will not be coming to LTS 4.x, but as it is focused on reliability, which is likely very important to LTS customers, is there any chance of revisiting this?

Otherwise, I am tempted to move to the 5.x branch. This is a compelling feature.

@jgskarda ,

I think I will keep the AB1805 as well, as it brings the Real Time Clock feature that is missing from Gen 3.

Thanks, Chip

The point of the LTS branch is to limit the number of changes that are made. This means new features are rarely added to an LTS branch.

It is a reasonable pattern to adopt the feature branch (5.x) if you need a new feature, then upgrade to the next LTS branch (6.x) when that comes out, which will contain the new feature.

I agree with @chipmc. I am currently on the 4.x branch with the fleet. Although it is very tempting to migrate to the 5.x branch now for this feature, I'm weighing the time and risks involved in testing and running a non-LTS release on the fleet. It would be great to see this on the LTS 4.x branch as well. I'd have a lot more confidence deploying now, to hopefully fix those few edge cases of devices locking up on the cloud side, if it were a minor revision.

And yes, I'm keeping the AB1805 either way, as I also require an RTC with sub-second accuracy. Good stuff!

Has anyone got it to work? I'm still having problems with the watchdog triggering on a B524, even though I call Watchdog.refresh() more than often enough. Is it not possible to 'dynamically' calculate a watchdog timeout as I do below?

I'm starting the watchdog timer as follows:

// Start watchdog timer
void Class::startWatchdog()
{
   // Calculate watchdog timer using sleep time and a timeout factor
   std::chrono::seconds watchdogTimeout(DeviceConfigurationConstants::WATCHDOG_TIMEOUT_FACTOR * Class::instance().getSleepTime()); // 4 * 60s, i.e. 4 * a std::chrono::seconds of 60

   // Clamp to min. 3 minutes
   if (watchdogTimeout.count() < 180)
   {
      watchdogTimeout = 180s;
   }

   // Initialize Watchdog
   Watchdog.init(WatchdogConfiguration().capabilities(WatchdogCap::SLEEP_RUNNING).timeout(watchdogTimeout));

   // Start HW watchdog
   Watchdog.start();
}

I might be missing something, but I'm scratching my head at this. I really want the timeout to be dynamic in relation to the set sleep time, but my calculations using chrono literals might be acting weird? It might be deserving of a separate topic, but this thread seems to be the main place for testing the new functionality.

An addition: running the following function

// Start watchdog timer
void Class::startWatchdog()
{

   // Initialize Watchdog
   Watchdog.init(WatchdogConfiguration().capabilities(WatchdogCap::SLEEP_RUNNING).timeout(4min));

   // Start HW watchdog
   Watchdog.start();

   while (1)
   {
      delay(1s);
      Log.info("%u", System.uptime());
   }
}

restarts the device after approx. 128 seconds of System.uptime(). This is the core of my issue: setting the timeout to any chrono literal has no effect above some number of seconds.

  1. Setting timeout to 1 min works fine, as it restarts after approx. 58-59 seconds of System.uptime()
  2. Setting it to 2 min works as well, with a restart after approx. 119 seconds
  3. Setting it to 3 min doesn't work, as it restarts after approx. 48 seconds
  4. Setting it to 4 min doesn't work, as it restarts after approx. 128 seconds
  5. Setting it to 5 min doesn't work, as it restarts after approx. 37 seconds

I did not power cycle in between tests, but simply DFU → flash.
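
The thread does not say what the bug turned out to be. One guess, not confirmed anywhere here, is a 32-bit overflow when milliseconds are converted to ticks of the nRF52840's 32.768 kHz watchdog clock. The sketch below only models that hypothesis; wrappedTimeoutMs is a made-up name, not Device OS code.

```cpp
#include <cstdint>

// Hypothesis only: effective timeout if (ms * 32768) were computed in
// 32 bits before dividing by 1000 to get watchdog ticks.
constexpr uint32_t wrappedTimeoutMs(uint32_t requestedMs) {
    // Unsigned multiplication wraps modulo 2^32 once requestedMs > 131071.
    const uint32_t product = requestedMs * 32768u;
    const uint32_t ticks = product / 1000u;
    // Convert the resulting tick count back to milliseconds.
    return static_cast<uint32_t>(static_cast<uint64_t>(ticks) * 1000u / 32768u);
}
```

Under this model, 1 min and 2 min pass through unchanged, 3 min wraps to about 48 s, and 5 min to about 37 s, which lines up with the measurements above. It predicts about 108 s for 4 min rather than the reported 128 s, so the model is at best a partial explanation.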

@jgskarda , a few years ago I started playing with an external relay timer as a "Cloud Side" Watchdog. I would only Pet the Watchdog by subscribing to a Webhook response. This ensured Round-Trip Cloud Connectivity, and more importantly ensured that the backend service (that actually collects the normal data) was working. I used a separate "WDT" Webhook so this schedule was decoupled from normal data publishes (which remained NO_ACK).

It seems like a similar WDT Webhook (with Subscribe) could be used with the Hardware watchdog in Device OS 5.3.0+, since it's a "free" addition at this point.

You could publish to the WDT Webhook every 6 hours to "Pet", and reset after 12 hours if no Cloud & Backend connectivity (round-trip).
This seems safer than relying on a Cloud-Side Function Call during the few times that you need to reset the device because of Cloud-Side or Backend Problems.

I prefer the "Round-Trip" Approach as a fail-safe (but on a longer schedule) because that's what the IoT device is normally used for, pushing data to a backend service. If I can't receive data after 12 hours....then automatically pull the plug and start over. It seems to me the "Round-Trip" is the best approach to prevent the dreaded "Somebody has to go on a field trip to reset the dang thing", as a final measure.

And you can leave your existing external Watchdog alone and let it handle Firmware Crashes on a tighter schedule. This one is a free add-on for OS5.3.0
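
A rough sketch of that round-trip scheme on the built-in watchdog, assuming an nRF52840-based device and a webhook triggered by an event named "wdt". The event name, the "hook-response/wdt" subscription prefix, and the handler name are assumptions for illustration, and this has not been tested against a real webhook. The PLATFORM_ID guard is an assumption that lets the constants compile off-device.

```cpp
#include <chrono>

// Schedule from the post above: pet every 6 hours, reset after 12 hours.
constexpr std::chrono::hours kPublishInterval{6};
constexpr std::chrono::hours kWatchdogTimeout{12};

// nRF52840 maximum quoted earlier in the thread.
constexpr std::chrono::milliseconds kMaxTimeoutNrf52840{131071999};
static_assert(kWatchdogTimeout <= kMaxTimeoutNrf52840,
              "timeout exceeds the quoted nRF52840 maximum");

#if defined(PLATFORM_ID)  // assumed to be defined only by Device OS builds
#include "Particle.h"

static system_tick_t lastPublishMs = 0;

// Runs only when the webhook response arrives, i.e. the device, the cloud,
// and the backend all completed the round trip.
void onWdtResponse(const char *event, const char *data) {
    Watchdog.refresh();
}

void setup() {
    Particle.subscribe("hook-response/wdt", onWdtResponse);
    Watchdog.init(WatchdogConfiguration().timeout(kWatchdogTimeout));
    Watchdog.start();
}

void loop() {
    const system_tick_t intervalMs =
        std::chrono::duration_cast<std::chrono::milliseconds>(kPublishInterval).count();
    if (Particle.connected() && millis() - lastPublishMs >= intervalMs) {
        Particle.publish("wdt");
        lastPublishMs = millis();
    }
}
#endif
```

A 12-hour timeout is above the RTL872x maximum quoted earlier, so on a P2 or Photon 2 the schedule would have to be shortened.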

@Colleen - do you mind seeing this issue to its resolution with @rickkas7 ?

The bug has been found. Not sure what version the fix will be in yet.

This seems pretty cool. I'm not familiar with hardware watchdogs, so is there more documentation about what exactly they do and how they could be used? I read through the Particle API, but I'm still trying to understand how I might use it. For example, students often have devices lock up with infinite loops. Would this be a use case for this?

Thanks for any more information,
Rob