Improving Electron Reliability - External Watchdog Timer


#1

All,

I have been moving my park sensors over to the Particle platform and it has been a great experience. With my sensors deployed in remote locations, reliability is key. To that end, I have built a carrier board for the Electron with an external Texas Instruments TPL5010 watchdog timer. You can see the hardware part of this project on this (long) dedicated thread. This thread focuses on the software required to make this work.

First, let me define the objectives of this software:

  1. To reset the Electron if it locks up
  2. To reset the Electron if the code is still running but the data it sends is not getting through

Both of these are common failure modes and I don’t want to have to physically visit the device to reset it.

When I say “reset” in this post, I mean a “pin reset” and not a hard reset (power cycle). The carrier board also supports hard resets but I will cover this separately.

Also, and I apologize if this is obvious, here is how the watchdog works. You set a timer interval by choosing a resistor value - from 100 ms to 2 hours. At each interval, the watchdog sends a signal to your Electron, where you set up an interrupt looking for a transition from LOW to HIGH. Upon seeing that interrupt, the Electron needs to send a signal to the watchdog saying “done”, and this must be done before the interval is up. This is sometimes called “petting” the dog. If the watchdog is not pet in time, it will pull the Electron’s reset pin low. Here is a picture of what this looks like.
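To make the handshake described above concrete, here is a minimal host-side sketch of it. This is not TPL5010 driver code: the class name, the millisecond granularity, and the assumption of fixed periodic intervals are all illustrative, and the real part has timing details that are not modeled here.

```cpp
#include <cassert>
#include <cstdint>

// Simplified host-side model of the watchdog handshake (illustrative only).
class WatchdogModel {
public:
    explicit WatchdogModel(uint32_t intervalMs) : intervalMs_(intervalMs) {
        assert(intervalMs > 0);
    }

    // The microcontroller answers the wake signal ("pets the dog").
    void done() { awaitingDone_ = false; }

    // Advance simulated time by `ms`. Returns true if an interval ended
    // while the previous wake signal was still unanswered, which is when
    // the watchdog would pull the reset pin low.
    bool advance(uint32_t ms) {
        bool reset = false;
        elapsedMs_ += ms;
        while (elapsedMs_ >= intervalMs_) {
            elapsedMs_ -= intervalMs_;
            if (awaitingDone_) reset = true;
            awaitingDone_ = true;   // a new wake signal goes out
        }
        return reset;
    }

private:
    uint32_t intervalMs_;
    uint32_t elapsedMs_ = 0;
    bool awaitingDone_ = false;
};
```

With a 1000 ms interval, calling advance(1000), then done(), then advance(1000) twice reports a reset only on the last call, because the second wake signal was never answered.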

Rather than share all the code, I will post the relevant snippets here. However, there are some elements of the overall program that are important.

  • The software must constantly save its “state” so that a reset does not cause you to lose data or functionality. I back up the state of the machine’s variables to a FRAM chip on the carrier. You can also use the EEPROM on the Electron itself.
  • You will need to think about the timeframes needed for your watchdog timer. For my purposes, a 30 watchdog cycle can cover both the “lockup” and the “data not sent” issues. If the timeframes for these are radically different (i.e., a lockup condition cannot last more than 1 second but I am ok to try to retransmit data for an hour) then you may need two watchdog timers.
  • You need to have a closed loop for your data. In my case, I am sending data to Ubidots via a Webhook and then getting a response back from Ubidots via a Webhook response.
  • You need to track the number of times your watchdog resets the Electron. This should be a rare event. If it is not, you have software quality or signal strength issues you need to focus on.
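As a rough illustration of the state-saving point above, the sketch below stores a small record in a byte buffer with a magic number and checksum, so that blank or corrupted storage is detected after a reset. Everything here is an assumption for illustration: the SavedState fields, the kMagic value, the checksum, and the buffer itself, which stands in for the FRAM or EEPROM reads and writes you would use on the device.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical state record; field names are illustrative.
struct SavedState {
    uint32_t magic;        // marks the record as initialized
    uint16_t hourlyCount;
    uint16_t dailyCount;
    uint8_t  resetCount;
    uint8_t  checksum;     // XOR of the bytes of the summed fields
};

constexpr uint32_t kMagic = 0x57444F47;  // ASCII "WDOG", arbitrary marker

static uint8_t checksumOf(const SavedState& s) {
    uint32_t sum = s.magic + s.hourlyCount + s.dailyCount + s.resetCount;
    return static_cast<uint8_t>(sum ^ (sum >> 8) ^ (sum >> 16) ^ (sum >> 24));
}

// Write the record to storage (a plain buffer standing in for FRAM/EEPROM).
void saveState(uint8_t* storage, SavedState s) {
    assert(storage != nullptr);
    s.magic = kMagic;
    s.checksum = checksumOf(s);
    std::memcpy(storage, &s, sizeof s);
}

// Read the record back. Returns false and yields zeroed defaults if the
// storage is blank or the checksum does not match.
bool loadState(const uint8_t* storage, SavedState& out) {
    assert(storage != nullptr);
    SavedState s;
    std::memcpy(&s, storage, sizeof s);
    if (s.magic != kMagic || s.checksum != checksumOf(s)) {
        out = SavedState{};
        return false;
    }
    out = s;
    return true;
}
```

The idea is to call saveState() whenever a counter changes and loadState() once at startup, so a watchdog reset picks up where the program left off.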

Here is the code you will need to make this work:

  1. Define your variables. You will need to define the pins for the watchdog connection and the state variables for the interrupt service routine:
// Pin Constants
const int donePin = D6;
const int wakeUpPin = A7;    // This is the Particle Electron WKP pin

// Program Variables
int resetCount;                     // Counts the number of times the Electron has had a pin reset
volatile bool watchdogPet = false;  // Keeps track of when we have pet the watchdog
volatile bool doneEnabled = true;   // This enables petting the watchdog
  1. In Setup, you need to set the pinMode and attach the interrupt for the “wake” pin. I used A7 (the WKP pin) as I eventually want to implement sleep and this pin will wake up the Electron. You also need to check to see if the Electron is rebooting after a “pin reset” and keep track of this. Remember, watchdog resets should be rare events. I store this value in FRAM for reporting. Also, you need to subscribe to the Webhook responses.
  pinMode(donePin, OUTPUT);      // Allows us to pet the watchdog
  pinMode(wakeUpPin, INPUT);     // The watchdog signals us on this pin
  attachInterrupt(wakeUpPin, watchdogISR, RISING);   // The watchdog timer will signal us and we have to respond

  // Note: System.resetReason() needs the reset info feature enabled, e.g.
  // STARTUP(System.enableFeature(FEATURE_RESET_INFO)); at file scope
  resetCount = FRAMread8(RESETCOUNT);       // Retrieve the stored reset count from FRAM
  if (System.resetReason() == RESET_REASON_PIN_RESET)  // Check to see if we are starting from a pin reset
  {
    resetCount++;
    FRAMwrite8(RESETCOUNT, resetCount);    // If so, store incremented number - watchdog must have done this
  }
  Serial.print("Reset count: ");
  Serial.println(resetCount);

  Particle.subscribe("hook-response/hourly", myHandler, MY_DEVICES);      // Subscribe to the integration response event
  Particle.subscribe("hook-response/daily", myHandler, MY_DEVICES);       // Subscribe to the integration response event
  1. Define the Watchdog Interrupt Service Routine. Remember to keep it short, with no Serial.print() or delay() calls.
void watchdogISR()
{
  if (doneEnabled)
  {
    digitalWrite(donePin, HIGH);
    digitalWrite(donePin, LOW);
    watchdogPet = true;
  }
}
  1. In the main loop, you can put this optional indicator to help you with troubleshooting. This is not needed in production code.
   if (watchdogPet)
   {
     Serial.println("We have pet the watchdog");
     watchdogPet = false;
   }
  1. When you send data, you need to reset the watchdog timer and then set a flag that prevents petting the dog. This flag will only be reset in two ways - upon receiving the correct response code or upon reset.
bool SendEvent(bool hourlyEvent)
 {
   // Take the temperature and report to Ubidots - may set up custom webhooks later
   digitalWrite(donePin, HIGH);
   digitalWrite(donePin,LOW);     // Pet the dog so we have a full period for a response
   doneEnabled = false;           // Can't pet the dog unless we get a confirmation via Webhook Response and the right Ubidots code.
   Serial.println("Watchdog petting disabled");
   int currentTemp = getTemperature(0);  // 0 argument for degrees F
   stateOfCharge = int(batteryMonitor.getSoC());
   String data = String::format("{\"hourly\":%i, \"daily\":%i,\"battery\":%i, \"temp\":%i}",hourlyPersonCount, dailyPersonCount, stateOfCharge, currentTemp);
   if (hourlyEvent) Particle.publish("hourly", data, PRIVATE);
   else Particle.publish("daily", data, PRIVATE);
   return true;
 }
  1. Since you have subscribed to the Webhook responses, you can parse the response and re-enable watchdog petting if you get the proper response code. This handler is specific to your service - I use @aguspg 's Ubidots.
void myHandler(const char *event, const char *data)
{
  if (!data) {              // First check to see if there is any data
    Serial.print("No data returned from WebHook ");
    Serial.println(event);
    return;
  }
  Serial.print(event);
  Serial.print(" returned ");
  Serial.println(data);
  String response = data;   // If there is data - copy it into a String variable
  int datainResponse = response.indexOf("hourly") + 24; // Find the "hourly" field and add 24 to get to the value
  String responseCodeString = response.substring(datainResponse,datainResponse+3);  // Trim all but the value
  int responseCode = responseCodeString.toInt();  // Put this into an int for comparisons
  switch (responseCode) {   // From the Ubidots API reference https://ubidots.com/docs/api/#response-codes
    case 200:
      Serial.println("Request successfully completed");
      digitalWrite(donePin, HIGH);   // May have missed responding to an interrupt while petting disabled
      digitalWrite(donePin,LOW);     // Pet the dog so we have a full period for a response
      doneEnabled = true;   // Successful response - can pet the dog again
      break;
    case 201:
      Serial.println("Successful request - new data point created");
      dataInFlight = false;  // clear the data in flight flag
      digitalWrite(donePin, HIGH);   // May have missed responding to an interrupt while petting disabled
      digitalWrite(donePin,LOW);     // Pet the dog so we have a full period for a response
      doneEnabled = true;   // Successful response - can pet the dog again
      break;
    case 400:
      Serial.println("Bad request - check JSON body");
      break;
    case 403:
      Serial.println("Forbidden token not valid");
      break;
    case 404:
      Serial.println("Not found - verify variable and device ID");
      break;
    case 405:
      Serial.println("Method not allowed for API endpoint chosen");
      break;
    case 501:
      Serial.println("Internal error");
      break;
    default:
      Serial.print("Ubidots Response Code: ");    // Non-listed code - generic response
      Serial.println(responseCode);
      break;
  }

}
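
The way the ISR, SendEvent() and myHandler() interact through doneEnabled can be hard to see across separate snippets, so here is a small host-side model of just that gate. It is a sketch under assumptions: the PetGate struct and its method names are made up for illustration, and the pets counter stands in for the digitalWrite() pulses on donePin.

```cpp
#include <cassert>

// Host-side model of the doneEnabled gate (illustrative names only).
struct PetGate {
    bool doneEnabled = true;
    int pets = 0;   // counts DONE pulses instead of toggling a pin

    void pet() { ++pets; }

    // Models watchdogISR(): only pet while petting is enabled.
    void onWake() { if (doneEnabled) pet(); }

    // Models SendEvent(): pet once, then block further petting.
    void onSend() { pet(); doneEnabled = false; }

    // Models myHandler(): only a 200 or 201 response re-enables petting.
    void onResponse(int code) {
        if (code == 200 || code == 201) { pet(); doneEnabled = true; }
    }
};

// Walk through one publish cycle.
inline void exampleCycle() {
    PetGate g;
    g.onWake();           // normal operation: the dog is pet
    g.onSend();           // publish: one pet, then petting is blocked
    g.onWake();           // ignored while waiting for the response
    assert(g.pets == 2);
    g.onResponse(201);    // confirmation arrives: petting resumes
    assert(g.doneEnabled && g.pets == 3);
}
```

If the confirmation never arrives, onWake() keeps being ignored and the watchdog eventually resets the device, which is the “data not getting through” case from the objectives.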

So, there you have it. Please take a look and let me know if you have questions. Also, if there is a better way to do this, I am very open to your suggestions.

Thanks,

Chip


#2

Thanks for posting this and including the explanation!


#3

Thanks, this is very helpful. Going to give it a try myself.


#4

Hi Charles,
it’s never obvious, and thank you for your post!!!
Gustavo.


#5

Great setup, thanks for sharing.

I understand pulling the reset pin does not directly reset/power cycle a locked up modem in HW, as that needs a Cellular.off() - does Cellular.off() actually happen implicitly when the reset pin is pulled or would that need to be added to the code below

if (System.resetReason() == RESET_REASON_PIN_RESET)

to be sure a modem lock-up is always fixed?


#6

Pulling RST low does not reset the modem, and because it resets the STM32 directly, the firmware has no chance to call Cellular.off().


#7

@thrmttnw,

I have two levels of reset as my boards are off in the woods and have to operate without human interaction. The first level is the watchdog as described above. But if a “soft reset” does not solve the issue, my carrier board includes a circuit that I can use to power-cycle the Particle. By pulling one pin HIGH, it turns off both the LiPo and Vin to the Electron, causing a “hard reset”. It does not get used often, but it is equivalent to what I would do if I were to travel to the sensor and intervene manually.

Thanks, Chip


#8

Thanks @chipmc and @ScruffR. Well, to avoid the “hard reset” part of the HW/SW design, I am wondering if this could work:

When the watchdog pulls RST low, resetting the STM32, and the Electron is running in SEMI_AUTOMATIC mode and wakes up with System.resetReason() == RESET_REASON_PIN_RESET, could that trigger a call to Cellular.off() before moving on?

Or is Cellular.off() not considered a fully reliable hard reset of the modem (as in HW reset)?


#9

Cellular.off() is probably the better choice, since after a HW reset the cellular module will wake into a half-awake state.
There are some discussions about low power mode where @rickkas7 explains the ins and outs a bit better.


#10

In your external hardware watchdog wake with pin reset case, you’d actually have to do Cellular.on(), Cellular.off(), then I’d probably go into SLEEP_MODE_DEEP for 30 seconds. That will likely clear up any potential modem state issues. Just be careful to not get into a rolling cycle of doing that.


#11

Thanks. It sounds like Cellular.off() done the right way is something that would mostly work. I am thinking that I am after something “certain” when all else fails.

So I looked at schematics for a modem reset pin and I found RESET_UC connected to pin PC4, and an example of how to pull it:

@rickkas7 In the case mentioned earlier with SEMI_AUTOMATIC mode, I would have to do a Cellular.on() before a Cellular.off() if I used that. So I am thinking that instead I might as well pull the RESET_UC pin right away, before moving on with normal business, whenever starting up and detecting a reset with:

System.resetReason() == RESET_REASON_PIN_RESET

@chipmc have you settled on criteria for pulling the hard reset line in your applications?


#12

It’s not really necessary to manually handle RESET_UC in 0.7.0 and later. It’s done as part of Cellular.on() and cellular_on(NULL) now.


#13

@thrmttnw,

At this time, I simply have a counter for the number of resets and I have error conditions that trigger a reset. For example, if a Webhook to Ubidots does not get a “201” response code within a timeout period, it generates a reset. Once the reset count gets to 3, then it is zeroed and a hard reset is performed.
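
The escalation described above can be sketched as a small decision function. This is a host-side illustration, not the actual firmware: the BootAction enum and onPinResetBoot() are hypothetical names, and it assumes the check runs at startup when a pin reset is detected, which the post does not spell out.

```cpp
#include <cassert>
#include <cstdint>

enum class BootAction { Continue, HardReset };

// Called at startup when the reset reason is a pin reset. `resetCount` is
// the persisted counter and is updated in place; the caller is expected
// to write it back to FRAM or EEPROM afterwards.
BootAction onPinResetBoot(uint8_t& resetCount, uint8_t threshold = 3) {
    assert(threshold > 0);
    ++resetCount;
    if (resetCount >= threshold) {
        resetCount = 0;                 // zero the count before power-cycling
        return BootAction::HardReset;
    }
    return BootAction::Continue;
}
```

Starting from a count of zero, the first two pin resets return Continue and the third returns HardReset with the counter back at zero.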

I may in the future come up with a more efficient model but, this process does seem to be working well for me now as I can’t remember the last time a device locked up on me in the field.

Thanks, Chip


#14

@chipmc thanks, sounds like a practical way to do it. Not having locked up devices in the field is primary.

@rickkas7 I think Cellular.off() / Cellular.on() will already have had the chance multiple times before the watchdog reset is followed by a modem hard reset. E.g., if 2-3 wake-uploads failed, it is modem reset time.

In general I find the Electron platform itself extremely stable. But adding a combination of Arduino libraries on top, maybe with sleep-wake cycles, it is not hard to end up with issues over time.


#15

Hi guys,

Recently I read this post on watchdog timers and I learned so much that I wanted to share that info with you.

Some things I did not know (among a trillion others):
image

The original post is here - it contains many other tips:
http://www.ganssle.com/item/great-watchdog-timers.htm

Jack has run The Embedded Muse since 1997. It provides hardware and software tips about building embedded systems.

Subscribe here for nice tips and discussions two times a month!

Gustavo.


#16

I’m adding a watchdog to a design today,

I found the article great, and I’m no longer wishing the Photon’s hardware watchdog was available, knowing external watchdogs are the way to go.


#17

Photons come with one? I thought they had only software watchdogs…


#18

You folks may want to have a peek here:


#19

I said I was going to quit wishing the Photon had a Hardware Watchdog.

I used to think they were just as reliable as an external watchdog circuit, but the article you posted talked about why they are not.


#20

got it, thanks!

Thanks PK, now from reading there I wonder how one can deal with a watchdog (external or internal) when a device goes to sleep.
Do you have any hints?
Thanks