Over 20 Devices just went down

electron

#9

Yeah I’ve got approximately 200 electrons that are unable to connect to the Particle cloud right now. All on v0.6.4. Some photons appear to be affected though none locally so cannot 100% confirm. Happy to run any diagnostic tests I can if helpful for debugging @cullenshane. I have an open MQTT connection to devices and can run stats and query logs.

Edit: some devices appear to be trickling in and reconnecting. Will update if that continues to hold.

Edit2: yeah everything seems to have stabilized after the stated fix happened. My devices automatically self-reset after 7 minutes of no Particle Cloud connection, so I think once they hit that after the fix they’ve been fine for the most part.
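
A minimal sketch of this kind of connection-loss self-reset logic, written in Python purely for illustration (real device firmware would be C++, and this is not the firmware described above). The class name `CloudWatchdog`, the `update` method and the way time is passed in are all hypothetical; only the 7-minute timeout comes from the post.

```python
class CloudWatchdog:
    """Tracks how long the cloud has been unreachable and says when to reset.

    Hypothetical helper for illustration; not a Particle API.
    """

    def __init__(self, timeout_s=7 * 60):
        # 7 minutes without a cloud connection, as described in the post.
        self.timeout_s = timeout_s
        self.last_connected_s = 0.0

    def update(self, now_s, cloud_connected):
        """Call periodically; returns True when the device should reset itself."""
        if cloud_connected:
            # Any successful connection restarts the countdown.
            self.last_connected_s = now_s
            return False
        return (now_s - self.last_connected_s) >= self.timeout_s
```

The main loop would call `update()` with the current uptime and the cloud connection state, and trigger a reset whenever it returns True. A reset like this only helps if the reset actually forces a fresh handshake, which later posts in this thread suggest is not always the case.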


#10

We’re seeing the same behavior on our fleet. It appears that once devices get a new handshake, they can connect again and send webhooks. That’s probably why the manual resets/powercycles mentioned above have worked.


#11

Which fix are you referring to in Edit 2?


#12

Thanks for reporting the issues here everybody, we’re taking steps now to try and resolve this problem. We’re trying to address the problem here: https://status.particle.io/incidents/6rk2fk95462b


#13

Based on what I read in other threads, the new handshake can happen in 7 days… I think Particle should offer a way to force a new handshake from the console… can’t wait 7 days for the devices to come back to life… client will be pissed. And they’re remote, so manual power cycle is NOT an option (think of it, that’s why we opted for electrons in the first place… to NOT have to go on site)


#14

We have the site visit issue as well… Not able to reach devices or gateways. There is for sure something impacting connectivity to the Particle cloud. I very much hope that there is an option to recover devices without site visits… I’m looking at the documentation now and hoping there is some way to reach gateway & edge nodes. Impacted are our Argon, Boron, and Xenon devices.

_Todd


#15

I see that Particle has labeled the issue as “solved” but it is indeed not solved. Almost all of our units (hundreds of them) are still offline from this and they check in every hour.

Also if it has been “resolved” how about some information on what caused this? It’s the least you could do…


#16

Engineer here that works on the original devices noted. To add a little more data:

The devices are in a couple of groups that are ~150 miles apart, one group’s devices are all within 100 feet of each other, another group is half a mile from the next device, and then there’s two more that are 50-100 miles from any other device. I think that rules out a tower problem (these are all LTE devices) at least.

The devices (at least the ones I can physically access) seem to think everything is just fine. They boot, connect, go into a “breathing” (cloud connection?) status and pretend to post data before going back to sleep. I have a couple outside of their cases that give me diagnostic data over serial; they just go through their process with no errors. Weird, right?

Might be important there too: these being remote, solar powered setups they are asleep most of the time. If their batteries are particularly healthy they will stay awake for a few minutes to capture a GPS signal while not connected before connecting, posting data again, and going back to sleep. But otherwise asleep for ~29.5 minutes out of every 30.

All of them stopped registering with Particle at about 2-3:00pm EST on 5/2. One (of ~20) came back online by itself today, and there are a few others that I have de-powered and reset or reloaded the firmware on that are back online. Most of the devices are in hermetically sealed cases and I won’t have the physical access to unplug or reload. Hopefully the Particle team can (re-provision?) them so we don’t have to go on a road trip and break the seals on our devices!


#17

DM sent with device IDs etc.

Thanks for looking into this.
When you deploy a product that relies on someone else’s infrastructure, it is always a bit worrying when things go wrong, so it is nice to see that the issue is being investigated and dealt with, and input is being asked for to help solve the issue. Keep up the good work.

Best regards
Ewan.


#18

Same experience with a developer device. Looking to launch a product on this platform and trying to understand how to mitigate this in the future from device side: Is a HW watchdog used in any of the cases above, and does it use the RST pin, EN pin or power cycle the device?

As I understand it, in this case only a power cycle can solve it. I guess the data usage optimisation on cloud reconnects is involved and that survives everything but a full power cycle?

Any experience with this is very useful, Thanks.


#19

in my case i have 1 2G electron and it lost connection about 5 hours before i got the monthly invoice email from particle on 5/2. ( :slight_smile: ) so at first i thought there was some kind of issue with billing and/or sim authorization. my issue was a continuous blinking green, which means it is trying to connect to cellular. this came after about 2 months of being connected roughly 95% of that time. right now it’s running 1.0.1 and tinker. that’s all. it’s basically to test just how well the electron does by itself. although my experience is not similar to the reported cloud issues, the interesting part, to me, is that the time-frame of my issue is very close to the time-frames others are discussing, sometime late afternoon thursday 5/2 until ~ 1pm friday 5/3. otherwise i’d be thinking that there were currently some local cell tower problems, which might still be the case, but the overlapping time-frames have got me interested in the actual issue/issues that happened. i’d also like to hear/read more and hope particle will post some incident details at some point.


#20

We use 2 watchdogs (Watchdog lib + ApplicationTimer) on the devices. It didn’t help. Devices are healthy, they do what they have to do, wakeup, sample sensors, go to sleep. They just fail connecting to particle…

We’re studying using different hardware with no intermediary provider for the next batch of devices… Reliability is a must. 3G connection has to be either working or not working, so we can attempt other ways (Sat). It can’t be “kind of working, but not really”.


#21

More bad news to the story. I don’t know exactly what has been happening these last days, but I know that the failed connections are wasting our bandwidth: [image]


#22

Maybe a relay in your main power circuit that can trigger on x number of failed data transfers. To me though, it’s just another failure point, and I personally wouldn’t want to have to incorporate such a design. This just needs to be solved by Particle with better implementation and a more robust infrastructure. I am still waiting for an explanation of what happened and a mitigation plan. Like the rest of us, we need to rely on this service, and we pay a premium for both the hardware and service for this expectation.
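
A rough sketch of the counting logic such a relay would need, in Python for illustration only. The class name `PowerCycleGuard`, the `record` method and the default threshold of 5 are hypothetical. It also assumes `transfer_ok` comes from an end-to-end acknowledgement (for example, your own server confirming receipt), since other posts in this thread report devices that see no local errors at all.

```python
class PowerCycleGuard:
    """Counts consecutive failed transfers and says when to cut power.

    Hypothetical helper for illustration; not tied to any specific hardware.
    """

    def __init__(self, max_failures=5):
        self.max_failures = max_failures
        self.failures = 0

    def record(self, transfer_ok):
        """Record one transfer attempt; returns True if the relay should trip."""
        if transfer_ok:
            # A confirmed transfer clears the failure streak.
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            # Start counting afresh after the power cycle.
            self.failures = 0
            return True
        return False
```

Because the counter lives in memory, it would have to be kept in retained or non-volatile storage on a device that sleeps between transfers.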


#23

Hey everyone!

Wanted to give everyone a heads up that we have folks looking into this issue at the moment. We’re seeing a number of symptoms as a result of device service issues and are making an effort to collect as much information as possible (more information can be found on our status page).

We have an on-call staff working this weekend to collect and mitigate issues as best as possible over the weekend. We’ll be making all of these issues top priority this Monday once we have our full team available again.

Please follow the status page for this incident. For any questions or concerns, I recommend posting them to our community here. For any issues containing private or personal information, you may contact me ( @mstanley ) or Dave ( @ParticleD ) or submit a support request.

Due to delays on our support portal at the moment, we recommend using our community where possible in order to get a faster response to your issue.

Thanks everyone! Apologies for any trouble this might’ve caused.


#24

As a heads up, this seems to be getting worse. I had a device that was working the rest of the weekend just recently join the rest in being unable to connect.

I also agree with @peergum in that this is an important issue and multiple days of downtime with the promise to ‘look at it Monday’ does not instill a lot of confidence.


#25

So most of the devices that had disappeared on the 2nd have progressively come back. We disabled/reenabled the SIMs on each of the ones that were still MIA and it may have triggered a forced handshake, which made them come to life too.

So if you can’t reach your devices in the field, that’s my best suggestion.


#26

Certainly understandable. I’m afraid I may not have communicated our escalation on this clearly. We had staff looking into this on Saturday and an emergency short notice meeting today to assess the situation.

Given the surrounding circumstances, this will likely be a significant undertaking to scope and address.

We already took a good look at the problem. Given the nature of what we’re looking at, this is going to be something we reprioritize engineering on starting Monday. We’re not simply looking at this “in our spare time”. We’re putting everything else on hold to ensure all our resources are front and center in place of what our routine plans were for Monday and the days to follow.


#27

Disabling and enabling SIMs was one of our theorized methods of forcing a new handshake to get out of this bad state. Sounds like that may be the go-to attempt for this particular situation.


#28

Roger that, gave it a try on one device but it hasn’t reappeared yet. Any particulars to it? For example, does the SIM need to be disabled for a full cycle (i.e., does the device need to attempt to connect with a disabled SIM)?