Some of my customers are currently facing an issue where devices that have been running for hours or days seemingly stop responding. There is no cloud connection, loop() is not running, and the only fix is resetting the device. This is a major issue, as these devices are responsible for temperature control over several weeks.
I am unable to recreate this issue in my work/home environment and was wondering if anybody else has or had this issue and has a way to recreate it.
I have had this issue pop up on Device OS 1.01 and Device OS 1.5.2. It seems to have started between August and November: I have had a product on the market since last year and had not experienced it before that period. That currently leads me to believe it is on the cloud side.
Knowing nothing about your firmware makes it hard to provide useful comments.
However, there could be so many things that affect the stability of a device:
the wifi connection or the router stability
a microwave oven between the device and the router (microwave ovens generally operate around 2.4GHz as well)
a power outage
a new feature in the firmware
whether best coding practices are being used
I have a few devices at home, and every other power outage or so (or whatever, a solar storm? Just kidding, or maybe not so much) I have to reset an otherwise solid device that’s been running for years. This happens maybe once or twice a year per device.
If I multiply this by a factor of 10 or 100 to match the number of customers you have, it could simply be a similar situation.
One way out of this, and a way to add robustness to your product, would be to add a watchdog.
You can read more here and there (see below).
I would recommend you add one to your product, and there would be no more need for a customer to reset their device (ok, maybe you’ll be covered for up to 99% of the cases, let’s be realistic and conservative here).
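To make the idea concrete, here is a minimal host-side sketch of the watchdog logic in C++. It is a software model only, not the external hardware watchdog discussed in the links, and `SoftWatchdog`, `checkin` and `poll` are hypothetical names chosen for illustration. The main loop checks in after every healthy pass, and some context that keeps running even when the loop hangs (a timer or a separate thread, which is an assumption about your setup) polls it and triggers the recovery action.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical software watchdog; the class and method names are
// illustrative and not part of any Particle API.
class SoftWatchdog {
public:
    SoftWatchdog(std::uint32_t timeoutMs, std::function<void()> onExpire)
        : timeoutMs_(timeoutMs), onExpire_(std::move(onExpire)) {
        assert(timeoutMs_ > 0 && "timeout must be non-zero");
        assert(onExpire_ && "an expiry action is required");
    }

    // Call from the main loop each time it completes a healthy pass.
    void checkin(std::uint32_t nowMs) {
        lastCheckinMs_ = nowMs;
        fired_ = false;
    }

    // Call from a context that still runs when the main loop is stuck.
    // Runs the expiry action once per missed deadline; returns true if it ran.
    bool poll(std::uint32_t nowMs) {
        // Unsigned subtraction stays correct when the millisecond counter wraps.
        if (!fired_ && (nowMs - lastCheckinMs_) >= timeoutMs_) {
            fired_ = true;
            onExpire_();
            return true;
        }
        return false;
    }

private:
    std::uint32_t timeoutMs_;
    std::function<void()> onExpire_;
    std::uint32_t lastCheckinMs_ = 0;
    bool fired_ = false;
};
```

On a real device the expiry action would typically be a system reset. Device OS also ships an ApplicationWatchdog class that follows the same check-in pattern; check the firmware reference for its exact usage on your Device OS version. Keep in mind that a software watchdog can only help while whatever polls it is still running, which is why an external hardware watchdog covers more failure modes.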
Here’s a story about other fields using watchdogs:
Launched in January of 1994, the Clementine spacecraft spent two very successful months mapping the moon before leaving lunar orbit to head towards near-Earth asteroid Geographos.
A dual-processor Honeywell 1750 system handled telemetry and various spacecraft functions. Though the 1750 could control Clementine’s thrusters, it did so only in emergency situations; all routine thruster operations were under ground control.
On May 7, 1994, the 1750 experienced a floating point exception. This wasn’t unusual; some 3000 prior exceptions had been detected and handled properly. But immediately after the May 7 event downlinked data started varying wildly and nonsensically. Then the data froze. Controllers spent 20 minutes trying to bring the system back to life by sending software resets to the 1750; all were ignored. A hardware reset command finally brought Clementine back on-line.
Alive, yes, even communicating with the ground, but with virtually no fuel left.
Moral of the story: a watchdog can help you succeed in your mission.
As far as I’m concerned, I can trace a similar issue back to November 11, 2020. Sounds pretty similar.
Everything had been running flawlessly for months before that. I’ve had a few of these Photons for years. A very reliable piece of HW, I have to say.
Currently I have two of these units deployed and in service, one thousands of miles away in Europe and one at my main house in North America, obviously on two different Wi-Fi networks (which rules out the Wi-Fi problems I’ve seen suggested as a potential root cause for this issue).
Both are running the latest FW release: 1.5.2.
Strangely enough, both stop communicating at roughly the same time after several days of good service. It has happened 6 times already since then, and again last night at 11pm EST.
Like you, the only way we can restore connectivity is with a hard RESET (which makes it a bit tricky for the one in Europe, where I have to dispatch a technician).
I keep making small corrections and updating the FW (OTA) in an attempt to work around this issue, without any success.
One last possibility is to test the pre-release version of 1.5.4, or even jump to 2.0.0 if that doesn’t present any API compatibility issues with my existing code.
I have been having similar issues with two of my Photon devices. They have been in service for months without these kinds of issues. Problems started in the last month or two. I just reset the devices as needed and assumed a connection issue. Now I wonder…
@gusgonnet Thanks for the suggestion of using an external watchdog timer; those links were an interesting read. However, most of my devices are already produced and in market, so using this in the next batch is the only option.
@Muskie @philstrick Looking through my affected devices, I had a look at the exported device vitals data and saw that the cloud uptime on some of these connections was 1 ms. I wonder if this is causing it. It would be interesting if you could have a look at your device vitals too and see whether there is a recurring pattern.
Lastly, @Muskie, you mentioned that this started to happen in the last month or two. That is roughly the same time this started to happen to my devices too. One of my products has been released for a year now, and I have only recently received a few complaints about this, all within the past two to three months.
Thanks for the tip. I’ll check into that, but it’s worth noting that for months we were running an old version, 1.4.4, on all our Photons. It was an old FW, but everything was running smoothly.
Back in November this year, in an attempt to bring all my devices to the latest FW version, we pushed 1.5.2 across all our units.
We have experienced communication issues with the backend server ever since.
@Div_IoT they won’t. Unless we are in an OEM agreement, they are just letting the community figure this out among ourselves.
Unfortunately this keeps pointing at their back-end infrastructure, and there isn’t much we can do. In fact, I think there are 2 issues: one in the PHOTON Device OS (more specifically 1.5.2), and one in the back-end server, introduced back in November '20.
The reason is that we are also running an ARGON unit, which fails to communicate with the back-end server consistently at roughly the same time (~11pm EST), regardless of where the devices are located.
However, unlike the PHOTON, the ARGON is capable of recovering after a few minutes/hours of service disruption.
@Muskie @Div_IoT we’re happy to report that we may have circumvented this issue. First off, this is not a fix, just a way to prevent a bug on the Photon device from being exposed.
In our case, we are using an MQTT pub/sub server to broadcast the sensor information. However, as I reported, at around 11pm EST we noticed that the VM hosted in AWS was running out of resources, and the MQTT server was therefore unable to process any events from the publisher (PHOTON) for several minutes.
As a result, we doubled the size of the VM to prevent this bottleneck on the MQTT server, and we haven’t seen any issues since.
Hope this helps.
Note that the ARGON was able to recover from this dead-letter queue issue, but not the PHOTON.
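The thread does not establish why the PHOTON stays stuck after the broker stalls, so the following is only a defensive pattern, not a fix for that bug. It is a minimal C++ sketch, runnable on a host, of spacing out publish retries with exponential backoff and flagging when recovery (for example a reconnect or reset) is due after too many consecutive failures. `PublishBackoff` and its methods are hypothetical names, and the delay and failure limits are arbitrary assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper; not part of any MQTT library or Particle API.
// Tracks consecutive publish failures and spaces out retries.
class PublishBackoff {
public:
    PublishBackoff(std::uint32_t baseDelayMs, std::uint32_t maxDelayMs,
                   unsigned maxFailures)
        : baseDelayMs_(std::min(baseDelayMs, maxDelayMs)),
          maxDelayMs_(maxDelayMs),
          maxFailures_(maxFailures),
          delayMs_(baseDelayMs_) {
        assert(maxFailures_ > 0 && "need at least one allowed failure");
    }

    // True when it is time to try publishing again.
    bool shouldAttempt(std::uint32_t nowMs) const {
        return failures_ == 0 || (nowMs - lastAttemptMs_) >= delayMs_;
    }

    // A publish succeeded: clear the failure count and the delay.
    void recordSuccess() {
        failures_ = 0;
        delayMs_ = baseDelayMs_;
    }

    // A publish failed or timed out: double the wait, capped at maxDelayMs.
    void recordFailure(std::uint32_t nowMs) {
        lastAttemptMs_ = nowMs;
        if (failures_ > 0) {
            delayMs_ = (delayMs_ > maxDelayMs_ / 2) ? maxDelayMs_ : delayMs_ * 2;
        }
        ++failures_;
    }

    // True once enough consecutive failures have piled up that the caller
    // should escalate (reconnect, or let a watchdog reset the device).
    bool needsRecovery() const { return failures_ >= maxFailures_; }

private:
    std::uint32_t baseDelayMs_;
    std::uint32_t maxDelayMs_;
    unsigned maxFailures_;
    std::uint32_t delayMs_;
    std::uint32_t lastAttemptMs_ = 0;
    unsigned failures_ = 0;
};
```

The intent is that the main loop calls `shouldAttempt` before each publish and never waits on the broker itself, so a stalled server slows down publishing instead of stalling the loop. This only helps if the publish call in your MQTT client returns or times out instead of blocking indefinitely, which is worth verifying for the library you use.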