We have determined that, at this time, a full power cycle is the only resolution to the issue. However, the intent is that no external watchdog will be necessary to get out of this state in future Device OS releases. We also fully intend to mitigate this particular issue from the cloud side.
Hey @Mr.Roboto. Sorry about the miscommunication on this. The original incident that was investigated was in relation to webhooks, specifically. It was later understood that device sessions were impacted as well, which prompted a second incident report and a prolonged investigation. We are still investigating this matter, working to build a list of impacted devices, and will be making efforts to communicate with impacted customers.
As of right now, we have determined that fully removing power from a device is one means of getting it back online. Unfortunately, a normal reset will not get the device out of this state.
You may find more information on this topic at Updates for the Status Incident of May 4th
Hey guys, I wanted to let you know we recognize this issue and are continuing to investigate the best means of handling it remotely. We are prioritizing engineering resources for both short-term and long-term solutions to this incident.
We strive for Particle to be a remote solution, which is the very purpose each of you sought Particle out for. While this particular incident may have had us fall short of that purpose, we are doing our best to learn from it and use that knowledge to work toward preventing future occurrences, so that Particle can be the platform each of you believes it should be.
@mstanley, your response seems to indicate that this could affect all devices with the R410M-02B module. For those devices where the issue is seen, is there any recovery option? A replacement is in process now but I'd much rather understand and avoid this issue. Had this device been in the field it would be much more challenging. I have another device that went offline several weeks ago but is several hundred miles away. No one has had the time to go debug it.
I will start a separate thread for this issue…
Appreciate the work going in to figuring this out. I know how these things can be…
Are there any mitigation options that we might be able to manually implement in code until the 1.2.x fixes are available? Right now I'm down to more or less just one device that's still stranded; the rest have resolved in one of the following ways:
- Severed power, reconnected
- Reloaded firmware over USB (ones I had open on the bench)
- Battery died in the field, reconnected when solar recharged
- Random: not sure why, but hey, it started working, so who am I to complain
- One unknown status because it's in a battery deep sleep at the moment by design. Hopefully it comes back to life after the hibernation period expires.
Thanks to the Particle team for the effort. Unfortunately another piece of trust is gone, but at least you're doing the right thing. My advice (from professional experience) is that it's better to be open and apologetic than to try and hide the mess out of sight…
Just to clarify: the investigation regarding the R410M-02B is a separate issue from the incident at hand. We're determining whether there could be any causal link between the current incident and the behavior observed in the R410M-02B module.
Right now, until we release v1.2.x, the one resolution we can recommend with confidence is a hardware watchdog timer. There are code implementations in place in Device OS that should have prevented this, but given the circumstances, this incident fell through the cracks. I would keep an eye on our update topic for future developments. It's possible that, as engineering uncovers the issue, they will recommend near-term solutions in addition to the 1.2.x Device OS release.
Each of the methods you described that got the device into a working state is in line with what we know thus far. The device in deep sleep may be more troublesome to get back online if it is not already. We are still investigating how to recover devices in this state. I believe some progress has been made, and I am awaiting more information from engineering on this.
Right now, the one resolution we can recommend with confidence is a hardware watchdog timer.
By hardware watchdog timer, do you mean a timer attached to the reset pin, or a timer attached to a power switch?
Wanted to cross-post this update from the official thread where we're tracking and posting updates on the issue.
We're in the process of exploring cloud-side solutions that will help individuals with no physical access to their hardware get their devices back online. We'll provide updates on that effort in the thread above.
I'd like to expand on this comment from Matthew. A key tenet of our platform is that we should abstract away connectivity management from our customers and ensure that devices automatically reconnect to the Cloud when their Internet connection is interrupted.
While a hardware watchdog is a great second line of defense that a customer can implement at the application level, I want to make sure it is clear that our team is pursuing a range of platform improvements as a result of this issue which include:
- A cloud-side fix that automatically helps devices stuck in this reconnect loop successfully establish a new session with the Cloud
- A Device OS improvement (scheduled for 1.2.X) that makes stronger guarantees a session is never reused more than a certain number of times, which also prevents an endless reconnect loop (see the illustrative sketch after this list)
- Infrastructure-level improvements to the caching system which minimize the likelihood and impact of similar failures in the future
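To make the second item above concrete, here is a purely illustrative sketch of what capping session reuse could look like. To be clear, this is not the actual Device OS implementation; the structure, names, and limit are invented for illustration:

```cpp
// Illustrative only -- NOT the actual Device OS implementation.
struct CloudSession {
    int reuseCount = 0;
};

const int kMaxSessionReuse = 5;  // hypothetical cap

// Decide whether to resume an existing session or force a fresh handshake.
// Capping reuse breaks any endless reconnect loop built on a stale session.
bool shouldReuseSession(CloudSession& session) {
    if (session.reuseCount >= kMaxSessionReuse) {
        session.reuseCount = 0;
        return false;  // negotiate a brand-new session with the Cloud
    }
    session.reuseCount++;
    return true;  // safe to resume the existing session
}
```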
I'd say a power switch, considering Reset generally couldn't recover from this incident, or the incident 6 weeks prior on March 14.
I've added a cheap ($5) Timer Relay board to my critical Particle Devices, which stay powered 24/7:
The Count-down timer gets reset with a Digital Pin after each successful "WDT" webhook response.
I use the "WDT" publish/subscribe on a 30-minute schedule (adjustable) as a failsafe. If the timer runs out (the webhook response didn't make it back to the Electron/Boron), the Particle's power will be switched off and will remain off for a user-defined length of time (I use 15 seconds).
The separate "WDT" publish/subscribe schedule also allows me to continue sending No-Acknowledge publishes for regular sensor data (thousands per day), so as not to impact cellular data usage significantly.
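For anyone wanting to replicate this, a minimal sketch of the firmware side might look like the following. The D6 pin choice is just an example for illustration, and the "WDT" webhook has to be configured in the console:

```cpp
#include "Particle.h"

const unsigned long WDT_INTERVAL_MS = 30UL * 60UL * 1000UL;  // 30-minute failsafe schedule
unsigned long lastWdtPublish = 0;

// Pulse the pin wired to the relay board's countdown-reset input.
// Only a full round trip (publish -> webhook -> response) pets the timer.
void onWdtResponse(const char* event, const char* data) {
    digitalWrite(D6, HIGH);
    delay(100);
    digitalWrite(D6, LOW);
}

void setup() {
    pinMode(D6, OUTPUT);
    Particle.subscribe("hook-response/WDT", onWdtResponse, MY_DEVICES);
}

void loop() {
    if (millis() - lastWdtPublish >= WDT_INTERVAL_MS) {
        lastWdtPublish = millis();
        Particle.publish("WDT", PRIVATE);  // triggers the "WDT" webhook
    }
    // Regular sensor publishes can stay unacknowledged to save cellular data:
    // Particle.publish("sensor", payload, PRIVATE, NO_ACK);
}
```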
This is not as neat or clean as a "real" watchdog IC solution. But the easily adjustable ON/OFF countdown timer values and the visual feedback are nice, and it's only 1-wire plus power.
I can't risk another Cloud Incident requiring physical visits to all my customers' sites.
It looks like @Rftop provided a pretty in-depth answer.
I want to reiterate, though, that a hardware watchdog is only our initial proposal as an extremely short-term solution to this incident. A hardware watchdog should only be a supplemental safeguard, not something that is expected to be necessary for reliability on Particle's platform.
It's my understanding that engineering has been hard at work today digging into this. Work is geared toward providing immediate solutions for devices in an unreachable state in order to recover them remotely. There is also work on software watchdog solutions for existing Device OS releases, as well as software solutions baked into the tentative 1.2.x release.
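In the meantime, for anyone who wants a software watchdog on existing releases, Device OS provides an ApplicationWatchdog. A minimal sketch is below, with the caveat that a plain reset did not reliably recover devices in this incident, so treat it as general defense in depth rather than a guaranteed fix:

```cpp
#include "Particle.h"

ApplicationWatchdog* wd;

// Called if checkin() is not seen within the timeout.
void watchdogHandler() {
    System.reset();
}

void setup() {
    // 60-second timeout; 1536-byte stack for the watchdog thread.
    wd = new ApplicationWatchdog(60000, watchdogHandler, 1536);
}

void loop() {
    // ... normal application work ...
    wd->checkin();  // pet the watchdog; a hung loop stops these check-ins
}
```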
For now, I would encourage everyone to sit tight as we await updates from engineering. I'm expecting us to have more information soon.
@Rftop,
Do you have a link to the board? I'm looking for them currently. I'm traveling tomorrow, the third time in the past 18 months that I've had to physically intervene: 2-hour drive, 2-hour flight, 2-hour drive, 1-hour side-by-side 4x4 drive to get to the unresponsive unit.
It's getting old, and more than a little expensive. Thank goodness I can piggyback this trip with another purpose.
And to anyone else, do you have a simple solution like the board @Rftop mentioned, but that will work with a 3.7v battery? (unit uses small battery and solar panel)
While I agree, for those of us stuck on older firmware versions (because of RAM constraints), any changes you make to the firmware don't really help us. I happen to currently have a power-switching hardware watchdog using a TPL5110 attached to a power switch, but was considering moving to just a reset-pin timer due to issues with losing RTC timestamps for too long.
I recommend anyone else with a high-reliability application do the same (all my devices recovered in 10-20 minutes). Nonetheless, it is a very frustrating constraint that prevents me from otherwise improving my customers' experience.
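For reference, the firmware side of a TPL5110-style power-switch watchdog is typically just asserting the DONE pin once a work cycle completes. A rough sketch, assuming DONE is wired to D5 (the pin choice and the placeholder work are illustrative):

```cpp
#include "Particle.h"

const int DONE_PIN = D5;  // assumed wiring to the TPL5110/5111 DONE input

void setup() {
    pinMode(DONE_PIN, OUTPUT);
    digitalWrite(DONE_PIN, LOW);
}

void loop() {
    // Placeholder for the real work cycle: connect, read sensors, publish.
    if (Particle.connected()) {
        Particle.publish("status", "ok", PRIVATE);

        // Assert DONE only after a successful cycle; the timer then removes
        // power until its next interval. If the device hangs and DONE is
        // never asserted, the interval expiry can still cycle power (check
        // the datasheet for your configuration).
        digitalWrite(DONE_PIN, HIGH);
        delay(100);  // power drops here; nothing after this line runs
    }
}
```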
For Battery/Solar applications, I'd vote for:
- TPL5110 for Electrons
- TPL5111 for Borons
Thx!
I appreciate all the posts on watchdogs, thanks.
@mstanley @justicefreed_amper Agreed. While I appreciate the platform efforts, until they're proven out (and it would take a lot of months to prove that to myself), we would have to fly around the globe in these situations, and that would kill the company quickly.
@Rftop I like the TPL5111 for sleeping Borons.
My current project is an always-online product including the Boron 2G/3G/LTE, and I am looking for a "one chip" power-cycling watchdog alternative to that relay solution.
I have not found one on the forums, so if anyone has something like that running, I would love to hear about it. Any pointers are appreciated!
Understood. I wanted to make the point that the fixes I described are redundant, in that the cloud-side fix alone should prevent devices from reaching this state. The Device OS improvements are being put in place to provide additional safeguards against this happening.
In case it's helpful for longer-term planning, I wanted to add that the new Gen 3 hardware has twice as much RAM available (256k total vs. 128k total) as our Gen 2 hardware, if that is a bottleneck for your application.
Makes sense, and I appreciate that clarification. I'm not sure if that distinction really makes me feel all that much safer about it, but it's helpful to know.
I look forward to moving to Gen 3, but I am using the Photon & Electron in a production environment, and Gen 3 seems to be a minimum of 6 months away from being ready for production scale (a handful of missing features that I would deem essential are still on the roadmap, plus some stability issues).
Cross-posting this here for reference: