On Gen 3 devices (Argon, Boron, B Series SoM, Tracker SoM, Tracker One, Monitor One, E404X) and the P2 and Photon 2, you can now enable the hardware watchdog in the MCU (nRF52840 or RTL8721). This is significantly more effective than the application (software) watchdog, and does not require external hardware.
This feature is only available in Device OS 5.3.0 and later. It will not be added to 4.x LTS. It will not be added to Gen 2 devices (E Series other than the E404X, Electron, Photon, P1).
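For reference, here is a minimal sketch of the pattern described in this thread (the 60-second timeout is just an arbitrary example): configure and start the watchdog once in setup(), then refresh it from application code.

#include "Particle.h"

void setup()
{
    // Configure a 60-second timeout, then start the hardware watchdog
    Watchdog.init(WatchdogConfiguration().timeout(60s));
    Watchdog.start();
}

void loop()
{
    // Pet the watchdog; if loop() ever stops running for 60 seconds, the MCU resets
    Watchdog.refresh();
}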
Is the watchdog still active during sleep modes? I.e., do I need to call something like Watchdog.stop() before sleeping and then Watchdog.start() after I wake up from sleep?
Is there a max duration the watchdog can be set to? I know the AB1805 had a max duration of, I think, 124 seconds. Is there a limit here?
What is your recommendation for a "sleepy" Gen 3 device? Should we just set the watchdog timeout to be longer than the maximum duration of a sleep cycle? This would be ideal, as it seems more like a "catch all" and the watchdog would never turn off once set. If I sleep for, say, 20 minutes at a time, I'd like to just use: Watchdog.init(WatchdogConfiguration().timeout(1260s)); This would be 21 minutes.
Does Particle.process() or any other library call Watchdog.refresh() already, or do I need to call it explicitly? I'd very much prefer to keep it isolated if possible; that way we have full control of how we want to pet the watchdog. My thought is using Watchdog.refresh() as a "cloud side" watchdog, i.e., if I don't get an ACK from a Particle.publish(), or possibly only call it in a Particle.subscribe() handler, then something on the cloud connection is hung, and this would be an easy way to catch that.
The hardware watchdog is automatically disabled before sleep and when restarting, so you don't have to worry about it during sleep or in safe mode for Device OS upgrades.
Watchdog.refresh() is not called anywhere in the system; you need to call it yourself.
It does seem like you could do some sort of a cloud-based refresh of the watchdog. I haven't tested it, but the timeouts are long enough that it could work. Of course that will use more data operations.
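A minimal sketch of what that could look like (not from the posts above; the event name, payload, and timing are hypothetical): refresh the watchdog only when a WITH_ACK publish is confirmed by the cloud, so a hung cloud connection eventually forces a reset.

#include "Particle.h"

// Hypothetical cadence: publish every 10 minutes, reset if no ACK for 30 minutes
const std::chrono::milliseconds PUBLISH_INTERVAL = 10min;
unsigned long lastPublishMs = 0;

void setup()
{
    Watchdog.init(WatchdogConfiguration().timeout(30min));
    Watchdog.start();
}

void loop()
{
    if (Particle.connected() && millis() - lastPublishMs >= PUBLISH_INTERVAL.count()) {
        lastPublishMs = millis();
        // WITH_ACK waits for the cloud to acknowledge the event
        bool acked = Particle.publish("sensor-data", "{\"ok\":true}", WITH_ACK);
        if (acked) {
            Watchdog.refresh(); // only pet on a confirmed round trip to the cloud
        }
    }
}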
By default this could be false, so if someone doesn't explicitly add it, it'll be disabled like it is now, but I personally would like to keep it enabled during sleep. It makes it more of a broader "catch all". For example, maybe the sleep duration was somehow set wrong, or something else happened so that the device wasn't falling asleep properly or waking up from sleep properly. What do you think?
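For reference, the configuration shown later in this thread suggests sleep behavior is already selectable via a capability flag; a short sketch under that assumption (the 30-minute value is arbitrary):

void setup()
{
    // Keep the hardware watchdog counting through sleep so a missed or
    // mis-configured wake-up still ends in a reset; WatchdogCap::SLEEP_RUNNING
    // is the capability used in the startWatchdog() code further down this thread.
    Watchdog.init(WatchdogConfiguration()
                      .timeout(30min)
                      .capabilities(WatchdogCap::SLEEP_RUNNING));
    Watchdog.start();
}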
I'd love to continue to use the AB1805 watchdog for the short-duration hardware watchdog (i.e. 124 seconds) but would then use this watchdog as a "cloud side" watchdog. With a max duration of over 2 hours, or even much longer for the nRF52840, I'd like to keep it enabled all the time, even during sleep modes, and pet it only on an ACK of a publish event OR possibly via a dedicated Particle.function(). I have a backend that processes webhooks from the devices (Python + SQL); it would keep track of when the watchdog was last pet and then call a Particle.function() to pet the watchdog, maybe once per hour. I'd have no problem at all burning 24 data operations per device per day for this extra cloud-side watchdog functionality. If needed, I could even do every 2 hours.
Generally speaking, my customers are "OK" with an occasional hang up for 1-2 hours but it's a pain in the butt if they have to travel to the site to give it a hard power off. I've had a few scenarios where the device was "connected". I could ping it, I could push new firmware to it, but it stopped sending data out and it was unable to process Particle functions. The AB1805 wasn't resetting it so it's like user application firmware was running but was still locked up somehow. This would be the catch all for that edge case issue.
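A sketch of that dedicated-function approach (the function name and return convention are hypothetical): the backend calls the function on whatever schedule it likes, and the handler simply pets the watchdog.

#include "Particle.h"

// Hypothetical cloud-callable pet function; a backend (e.g. Python + SQL) would
// call it roughly once per hour through the Particle Cloud API.
int cloudPet(String arg)
{
    Watchdog.refresh();
    return 0; // arbitrary convention: 0 = refreshed OK
}

void setup()
{
    // Long "cloud side" timeout; the AB1805 still covers short firmware hangs
    Watchdog.init(WatchdogConfiguration().timeout(2h));
    Watchdog.start();

    Particle.function("petWatchdog", cloudPet);
}

void loop()
{
}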
I know you said this will not be coming to LTS 4.x, but as it is focused on reliability, which is likely very important to LTE customers, is there any chance of revisiting this?
Otherwise, I am tempted to move to the 5.x branch. This is a compelling feature.
The point of the LTS branch is to limit the number of changes that are made. This means new features are rarely added to an LTS branch.
It is a reasonable pattern to adopt the feature branch (5.x) if you need a new feature, then upgrade to the next LTS branch (6.x) when that comes out, which will contain the new feature.
I agree with @chipmc. I am currently on the 4.x branch with the fleet. Although it's very tempting to migrate now to the 5.x branch for this feature, I'm debating the time and/or risks involved in testing and being on the non-LTS release for the fleet. It would be great to see this on the LTS 4.x branch as well. I'd have a lot more confidence deploying now, to hopefully fix those few edge cases of devices locking up on the cloud side, if it was a minor revision.
And yes, I'm keeping the AB1805 either way, as I also require the RTC, which is also sub-second accurate. Good stuff!
Anyone got it to work? I'm still having problems where the watchdog triggers on a B524, even though I call Watchdog.refresh() more than often enough. Is it not possible to "dynamically" calculate a watchdog timeout as I do below?
I'm starting the watchdog timer as follows:
// Start watchdog timer
void Class::startWatchdog()
{
    // Calculate watchdog timeout using sleep time and a timeout factor,
    // e.g. 4 * 60s, i.e. 4 * a std::chrono::seconds of 60
    std::chrono::seconds watchdogTimeout(DeviceConfigurationConstants::WATCHDOG_TIMEOUT_FACTOR * Class::instance().getSleepTime());

    // Clamp to a minimum of 3 minutes
    if (watchdogTimeout.count() < 180)
    {
        watchdogTimeout = 180s;
    }

    // Initialize the watchdog, keeping it running during sleep
    Watchdog.init(WatchdogConfiguration().capabilities(WatchdogCap::SLEEP_RUNNING).timeout(watchdogTimeout));

    // Start HW watchdog
    Watchdog.start();
}
I might be missing something, but I'm scratching my head at this. I really want the timeout to be dynamic in relation to the set sleep time, but my calculations using chrono literals might be acting weird. It might be deserving of a separate topic, but this thread seems to be the main place for testing the new functionality.
Restarts the device after approx. 128 seconds of System.uptime(). This is the core of my issue: setting the timeout to any chrono literal has no effect when above X seconds.
Setting timeout to 1 min works fine as it restarts after approx. 58-59 seconds of System.uptime()
Setting it to 2 min works as well, with a restart after approx. 119 seconds
Setting it to 3 min doesn't work, as it restarts after approx. 48 seconds
Setting it to 4 min doesn't work, as it restarts after approx. 128 seconds
Setting it to 5 min doesn't work, as it restarts after approx. 37 seconds
I did not power cycle in between tests, but simply flashed via DFU.
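One thing worth checking (a debugging suggestion, not from the original post) is whether the chrono arithmetic actually produces the value you expect. A small hypothetical helper that mirrors the calculation and logs it over USB serial before the watchdog is configured:

#include "Particle.h"

SerialLogHandler logHandler;

// Hypothetical helper, not part of the posted code: compute the timeout the
// same way startWatchdog() does and log it, so a wrong value from the chrono
// math (e.g. something other than 240 for 4 * 60s) shows up immediately.
std::chrono::seconds computeWatchdogTimeout(int factor, std::chrono::seconds sleepTime)
{
    std::chrono::seconds timeout = factor * sleepTime;
    // Clamp to a minimum of 3 minutes, matching the original code
    if (timeout < 180s)
    {
        timeout = 180s;
    }
    Log.info("computed watchdog timeout = %ld s", (long)timeout.count());
    return timeout;
}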
@jgskarda, a few years ago I started playing with an external relay timer as a "Cloud Side" Watchdog. I would only Pet the Watchdog by subscribing to a Webhook response. This ensured Round-Trip Cloud Connectivity, and more importantly ensured that the backend service (that actually collects the normal data) was working. I used a separate "WDT" Webhook so this schedule was decoupled from normal data publishes (which remained NO_ACK).
It seems like a similar WDT Webhook (with Subscribe) could be used with the Hardware watchdog in Device OS 5.3.0+, since it's a "free" addition at this point.
You could publish to the WDT Webhook every 6 hours to "Pet", and reset after 12 hours if no Cloud & Backend connectivity (round-trip).
This seems safer than relying on a Cloud-Side Function Call during the few times that you need to reset the device because of Cloud-Side or Backend Problems.
I prefer the "Round-Trip" Approach as a fail-safe (but on a longer schedule) because that's what the IoT device is normally used for, pushing data to a backend service. If I can't receive data after 12 hours....then automatically pull the plug and start over. It seems to me the "Round-Trip" is the best approach to prevent the dreaded "Somebody has to go on a field trip to reset the dang thing", as a final measure.
And you can leave your existing external watchdog alone and let it handle firmware crashes on a tighter schedule. This one is a free add-on for Device OS 5.3.0+.
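A rough sketch of that round-trip pattern (the event and webhook names are hypothetical, and the 6/12-hour numbers are just the ones suggested above): publish to a dedicated WDT webhook on a slow schedule, and only refresh the watchdog when the hook-response makes it back to the device.

#include "Particle.h"

// Hypothetical cadence: publish "wdt" every 6 hours, reset after 12 hours with
// no round trip (assuming the platform's maximum timeout allows 12 hours, which
// the posts above suggest is the case for the nRF52840).
const std::chrono::milliseconds WDT_PUBLISH_INTERVAL = 6h;
unsigned long lastWdtPublishMs = 0;

// Only a webhook response from the backend pets the watchdog
void onWdtHookResponse(const char *event, const char *data)
{
    Watchdog.refresh();
}

void setup()
{
    Watchdog.init(WatchdogConfiguration().timeout(12h));
    Watchdog.start();

    // Webhook responses arrive as "hook-response/<event>/<index>"
    Particle.subscribe("hook-response/wdt", onWdtHookResponse);
}

void loop()
{
    if (Particle.connected() && millis() - lastWdtPublishMs >= WDT_PUBLISH_INTERVAL.count()) {
        lastWdtPublishMs = millis();
        Particle.publish("wdt", "ping"); // normal data publishes can stay NO_ACK
    }
}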
This seems pretty cool. I'm not familiar with hardware watchdogs, so is there more documentation about what exactly they do and how they could be used? I read through the Particle API reference, but I'm still trying to understand how I might use it. For example, students often have devices lock up with infinite loops; would this be a use case for this?
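As an illustration of that use case (a deliberately broken sketch, not something to deploy): if student code gets stuck like this, Watchdog.refresh() is never reached again, and the hardware watchdog resets the device after the timeout instead of leaving it locked up until someone pulls the power.

#include "Particle.h"

void setup()
{
    Watchdog.init(WatchdogConfiguration().timeout(60s));
    Watchdog.start();
}

void loop()
{
    Watchdog.refresh();

    // Bug: millis() is nonzero here, so this loop spins indefinitely,
    // refresh() above is never reached again, and the hardware watchdog
    // resets the MCU roughly 60 seconds later.
    while (millis() != 0) {
    }
}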