I have a few devices that are producing hard faults in the field, but it is difficult to determine why and I can’t reliably replicate the hard fault in the office. I suspect it happens when the device loses its WiFi connection. What is weird, though, is that on the rare occasion I am able to reproduce a hard fault in the office, the device restarts automatically afterwards and goes back to connecting to WiFi. The devices in the field don’t restart and require someone to physically go and reset them. What determines whether a Photon restarts or stays halted after a hard fault?
If you are following the Troubleshooting topic you will see there have been a number of posts about WiFi and its unreliability/variability over time.
I suggest you try putting the devices on OS 0.8.0-RC.11 and getting the diagnostics history - the hard faulting may be due to a stack overflow or an application code corner condition. Are you using an application watchdog? That should avoid the need to physically press the reset button when the device gets stuck. I have combined this with an event that collects relevant data when the watchdog kicks in, like free memory and the location in the application where the halt occurred. This has helped to identify and cure the causes.
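Roughly like this - just a bare-bones sketch of the idea, not my production code (the 60 s timeout, the EEPROM offset and the handler name are arbitrary):

// record a little diagnostic data in EEPROM, then reset; publish it from setup() on the next boot
void watchdogHandler() {
  uint32_t freeMem = System.freeMemory();   // how much heap was left when we got stuck
  EEPROM.put(0, freeMem);                   // survives the reset
  System.reset();
}

ApplicationWatchdog wd(60000, watchdogHandler);   // loop() must call wd.checkin() regularly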
Ah, great idea actually! I never thought about the application watchdog; I will definitely give that a go. Is the application watchdog able to reset the Photon no matter what SOS error code it went into? Also, the ability to log where in the application the halt occurred sounds super useful. Would you perhaps be able to give an example of how to get that kind of information through the firmware?
I have seen quite a few issues with WiFi disconnections on the forums. Some say updating to the 0.8 RC fixes it, while others say downgrading back to 0.7.0 fixed it for them. Also, perhaps reducing the use of String would be a good idea. Thanks for the help.
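For example, where possible I’m thinking of building publish payloads with a fixed char buffer instead of String (assuming the concern is heap fragmentation; the event name here is just a placeholder):

char msg[64];                                   // fixed buffer instead of a heap-allocated String
snprintf(msg, sizeof(msg), "freeMem=%lu", (unsigned long)System.freeMemory());
Particle.publish("diag", msg);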
I’ve tried testing the watchdog with a simple sketch:
SYSTEM_THREAD(ENABLED);
STARTUP(System.enableFeature(FEATURE_RESET_INFO));   // needed for System.resetReason()

void watchdog_function();                            // forward declaration for the callback
ApplicationWatchdog wd(30000, watchdog_function);    // 30 s timeout

void setup() {
}

void loop() {
  waitUntil(Particle.connected);
  Particle.publish("Reason", String(System.resetReason()));
  Particle.publish("ResetData", String(System.resetReasonData()));
  while (1) {                                        // deliberately hang so the watchdog fires
  }
}

void watchdog_function() {
  System.reset(123);                                 // 123 should come back via resetReasonData()
}
Sorry, I’ve only just started working with this. First of all, when the watchdog triggers after the specified time (30 seconds in this case), it causes the Photon to go into SOS with hard fault as the error code. Is that supposed to happen when the watchdog triggers?
After the hard fault (not sure if that is intended by the watchdog) I expected to see 60 (which is RESET_REASON_WATCHDOG) as the reset reason and “123” as the reset reason data. Instead I get 0 for both (which is RESET_REASON_NONE). Not sure what I’m doing wrong =\
What device OS version are you targeting?
This is a known issue with versions >= 0.7.0:
https://github.com/particle-iot/firmware/issues/1513
You need to assign a bigger stack for the watchdog callback to prevent the SOS panic (see suggestion in the issue report).
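For example, by passing a larger stack size as the third constructor argument (1536 bytes is the commonly suggested starting point; adjust to whatever your callback actually needs):

// third argument reserves a bigger stack for the watchdog callback
ApplicationWatchdog wd(30000, watchdog_function, 1536);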
@ScruffR has just confirmed another reason you should skip 0.7.0. I have downgraded circa 400 production devices to 0.6.3 as the only stable GA Device OS; products in development are all on 0.8.0-RC.11. I haven’t had an application watchdog assertion failure since using 0.8.0-RC.11, and hence I have not increased the stack requirement.
This is how I set up the watchdog. currentPosition is updated throughout the code; in practice you will soon be able to home in on where it is getting stuck, and once that is fixed you can remove this trace.
> void restartResume();
> int currentPosition;
> // watchdog declaration - the application loop must call wd.checkin() within MCHK milliseconds to reset the watchdog timer
> ApplicationWatchdog wd(MCHK, restartResume);
I am using an EEPROM struct to hold the state (as in an FSM) the application was in when it triggered the watchdog - this also holds currentPosition and some other data, which allows the application to determine at startup whether it is recovering from a reset triggered by the watchdog. It can then ‘resume’ what it was doing before and also report by event that it has ‘faulted’. This has been very useful for engineering out the causes.
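Roughly, the pattern looks like this (a sketch only - the EEPROM offset, the MCHK value, the field names and the “wdFault” event name are placeholders, and the real struct carries more fields):

const int WD_EEPROM_ADDR = 0;                  // placeholder EEPROM offset
const unsigned MCHK = 60000;                   // placeholder check-in period (ms)

struct WdTrace {
  uint8_t magic;                               // marks a valid record
  uint8_t fsmState;                            // FSM state when the watchdog fired
  int     currentPosition;                     // last trace point reached
};

uint8_t fsmState;
int currentPosition;

void restartResume() {                         // watchdog callback
  WdTrace t = { 0xA5, fsmState, currentPosition };
  EEPROM.put(WD_EEPROM_ADDR, t);               // persist the trace across the reset
  System.reset();
}

ApplicationWatchdog wd(MCHK, restartResume);

void setup() {
  WdTrace t;
  EEPROM.get(WD_EEPROM_ADDR, t);
  if (t.magic == 0xA5) {                       // recovering from a watchdog reset
    char msg[64];
    snprintf(msg, sizeof(msg), "state=%u pos=%d", (unsigned)t.fsmState, t.currentPosition);
    Particle.publish("wdFault", msg);          // report the fault (assumes the cloud is connected by now)
    t.magic = 0;                               // clear so the next boot is treated as normal
    EEPROM.put(WD_EEPROM_ADDR, t);
    // ...restore the FSM from t.fsmState and resume...
  }
}

void loop() {
  currentPosition = 1;                         // update before each major step
  // ...application work...
  wd.checkin();                                // must be called within MCHK ms
}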
This was the issue, thanks ScruffR. Increasing the allocated stack size stopped it from SOS erroring and it now performs as expected. I’ll implement the concepts pointed out by @armor and hopefully I can get some useful information back. Thanks for all the help. I have a feeling I’ll be back bothering you guys again if something goes wrong.