Over 20 Devices just went down

More bad news to add to the story. I don’t know exactly what has been happening these last few days, but I know that the failed connections are wasting our bandwidth: [data usage screenshot]
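
Until Particle fixes this on their side, one way to at least cap the wasted data might be a firmware-side backoff: take over reconnection in SEMI_AUTOMATIC mode and wait increasingly long between handshake attempts, instead of letting the device retry continuously. A rough, untested sketch (all timings arbitrary):

#include "Particle.h"

SYSTEM_MODE(SEMI_AUTOMATIC);
SYSTEM_THREAD(ENABLED);

const unsigned long CONNECT_TIMEOUT_MS = 5UL * 60 * 1000; // give up after 5 min
unsigned long backoffMs = 30UL * 1000;                    // first wait: 30 s
unsigned long stateStart;
bool connecting;

void setup() {
    Particle.connect();
    connecting = true;
    stateStart = millis();
}

void loop() {
    if (Particle.connected()) {
        backoffMs = 30UL * 1000; // healthy again: reset the backoff
        stateStart = millis();   // a future drop starts a fresh attempt
        connecting = true;
        return;
    }
    if (connecting && millis() - stateStart > CONNECT_TIMEOUT_MS) {
        Particle.disconnect();   // attempt failed: stop burning data
        Cellular.off();
        connecting = false;
        stateStart = millis();
    } else if (!connecting && millis() - stateStart > backoffMs) {
        if (backoffMs < 60UL * 60 * 1000) {
            backoffMs *= 2;      // double the wait, capped near 1 hour
        }
        Cellular.on();           // wait elapsed: try again
        Particle.connect();
        connecting = true;
        stateStart = millis();
    }
}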

Maybe a relay in your main power circuit that can trigger after x number of failed data transfers. To me, though, it’s just another failure point, and I personally wouldn’t want to have to incorporate such a design. This just needs to be solved by Particle with a better implementation and more robust infrastructure. I am still waiting for an explanation of what happened and a mitigation plan. Like everyone else here, we need to rely on this service, and we pay a premium for both the hardware and the service with that expectation.
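
For reference, the idea would look roughly like this on the firmware side, assuming the relay coil hangs off a GPIO (D7 here, purely hypothetical) and is wired to interrupt the device’s power when energized. Untested, and again, not something I’d want to ship:

#include "Particle.h"

const int RELAY_PIN = D7;   // hypothetical wiring to a power-cutting relay
const int MAX_FAILURES = 5; // arbitrary threshold
int failures = 0;

void setup() {
    pinMode(RELAY_PIN, OUTPUT);
    digitalWrite(RELAY_PIN, LOW);
}

void loop() {
    // Count consecutive failed transfers; a successful publish resets it.
    bool ok = Particle.connected() &&
              Particle.publish("heartbeat", "ping", PRIVATE);
    failures = ok ? 0 : failures + 1;

    if (failures >= MAX_FAILURES) {
        digitalWrite(RELAY_PIN, HIGH); // cut our own power; never returns
        delay(10000);
    }
    delay(60000); // one attempt per minute
}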

Hey everyone!

Wanted to give everyone a heads up that we have folks looking into this issue at the moment. We’re seeing a number of symptoms as a result of device service issues and are making an effort to collect as much information as possible (more information can be found on our status page).

We have on-call staff working this weekend to collect information and mitigate issues as best we can. We’ll be making all of these issues top priority this Monday, once we have our full team available again.

Please follow the status page for this incident. For any questions or concerns, I recommend posting them to our community here. For any issues containing private or personal information, you may contact me ( @mstanley ) or Dave ( @ParticleD ), or submit a support request.

Due to delays on our support portal at the moment, we recommend using our community, if possible, to get a faster response to your issue.

Thanks everyone! Apologies for any trouble this might’ve caused.


As a heads up, this seems to be getting worse. A device that had been working the rest of the weekend just joined the others in being unable to connect.

I also agree with @peergum that this is an important issue, and multiple days of downtime with the promise to ‘look at it Monday’ does not instill a lot of confidence.

So most of the devices that had disappeared on the 2nd have progressively come back. We disabled/re-enabled the SIMs on each of the ones that were still MIA, which may have triggered a forced handshake and brought them back to life too.

So if you can’t reach your devices in the field, that’s my best suggestion.
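
For devices that are still getting online intermittently, another way to force a full handshake from firmware is to publish the special session-end event described in Particle’s docs; the cloud discards the current session and the next connection performs a full handshake. I haven’t verified that this helps with this particular incident, so treat it as a sketch:

#include "Particle.h"

// Ask the cloud to discard the current session so the next connection
// performs a full handshake. Documented Particle behavior, but untested
// against this specific incident.
void forceRehandshake() {
    if (Particle.connected()) {
        Particle.publish("spark/device/session/end", "", PRIVATE);
    }
}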


Certainly understandable. I’m afraid I may not have communicated our escalation of this clearly. We had staff looking into this on Saturday and held a short-notice emergency meeting today to assess the situation.

Given the surrounding circumstances, this will likely be a significant undertaking to scope and address.

We have already taken a good look at the problem. Given the nature of what we’re seeing, this is something we will reprioritize engineering around starting Monday. We’re not simply looking at this “in our spare time”; we’re putting everything else on hold so that all our resources are front and center, in place of our routine plans for Monday and the days that follow.


Disabling and enabling SIMs was one of our theorized methods of forcing a new handshake to get devices out of this bad state. It sounds like that may be the go-to approach for this particular situation.

Roger that; I gave it a try on one device, but it hasn’t reappeared yet. Any particulars to it? For example, does the SIM need to stay disabled for a full cycle (i.e., does the device need to attempt to connect while the SIM is disabled)?

I had a device that stopped being recognized by the Device Cloud this weekend. The device is here, and I can see that it glows cyan, so it thinks the connection is successful. I attempted to disable/enable the SIM. After that, the cellular connection fails (it just blinks green indefinitely). Now I’m seeing this with the cloud debug tool:

clouddebug: press letter corresponding to the command
a - enter APN for 3rd-party SIM card
k - set keep-alive value
c - show carriers at this location
t - run normal tests (occurs automatically after 10 seconds)
or tap the MODE button once to show carriers
starting tests...
turning cellular on...
deviceID=<redacted>
manufacturer=u-blox
model=SARA-R410M-02B
firmware version=L0.0.00.00.05.06 [Feb 03 2018 13:00:41]
ordering code=SARA-R410M-02B
IMEI=<redacted>
IMSI=u-blox
ICCID=<redacted>
0000020797 [app] INFO: enabling trace logging
attempting to connect to the cellular network...
0000118048 [gsm0710muxer] ERROR: The other end has not replied to keep alives (TESTs) 5 times, considering muxed connection dead
0000118048 [gsm0710muxer] ERROR: The other end has not replied to keep alives (TESTs) 5 times, considering muxed connection dead
0000129703 [hal] ERROR: Failed to power off modem
0000129703 [hal] ERROR: Failed to power off modem
0000150805 [hal] ERROR: No response from NCP
0000150805 [hal] ERROR: No response from NCP

I had another device show similar symptoms about a month ago, and Particle replaced it. Could this device have coincidentally died during this Device Cloud issue, or have others seen similar symptoms?

With regard to bandwidth, the huge jump in the bill that just came in is really appreciated. I wish we had received as much data in our DBs as we seem to have consumed…

Thanks for nothing, Particle.

I also have one Electron that went down May 2; its last transmission was at 12:42 pm Arizona time. I have an E Series nearby running the same firmware, which has continued to run correctly.

I’ve just started reading up on this issue and have deactivated and reactivated my SIM card, after reading that this seems to have helped some of the affected units. Fingers crossed.

Any update for us? I’ve got offline Xenons on which I can test firmware if needed. I’m running a Xenon gateway over Ethernet and could try to get some packet captures as well.

Update: I attempted disabling/enabling the SIM on one of my field devices, both short-term and long-term; no dice.

Hi all – we have an update on the source of connectivity issues here:

I'm sorry for the ongoing trouble this may have caused. The issue is complex (as laid out in Matthew's thread); we are still diagnosing the problem and working on a Device OS fix. In the meantime, completely depowering and then repowering devices should get them to reconnect. We are still trying to find a fix for remote devices (do try disabling the SIM and leaving it disabled for 10 minutes or more, which may work).
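
For remote devices that can still take an OTA flash, a firmware-side approximation of that power cycle (a sketch only, not the Device OS fix we're working on) is a watchdog that fully powers the modem down and back up after a prolonged loss of cloud connectivity:

#include "Particle.h"

SYSTEM_MODE(SEMI_AUTOMATIC);
SYSTEM_THREAD(ENABLED);

const unsigned long OFFLINE_LIMIT_MS = 10UL * 60 * 1000; // 10 min, arbitrary
unsigned long lastOnline;

void setup() {
    Particle.connect();
    lastOnline = millis();
}

void loop() {
    if (Particle.connected()) {
        lastOnline = millis();
    } else if (millis() - lastOnline > OFFLINE_LIMIT_MS) {
        Cellular.off();        // full modem power-down
        delay(30000);          // leave it off for a while
        Cellular.on();
        Particle.connect();
        lastOnline = millis(); // restart the timer
    }
}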

If anyone experienced runaway bandwidth consumption, please send your bill to customer service and we'll be happy to refund it. @peergum, I'm sorry you experienced this; you shouldn't be charged for these sorts of problems.

The team is still hard at work on this issue, and we'll keep you posted on further progress.


Hey @wileecoyoti,

Thanks for the update. While it was a popular theory up front, it’s not conclusive that the SIM toggling itself is what works; it’s possible that in our test cases other behavior contributed to devices getting out of the bad state.

My apologies that this did not resolve it for you, but thank you for giving it a try and providing us with the update. Please see the topic that @zach posted for more updates on this matter.

Hi @kenmacd

We have posted an update on this matter. You may find more information in the topic posted by Zach.

Hi @peergum

We recognize that there was a spike in data usage as a result of this incident, and we intend to make it right.

I highly recommend that you, and any other users whose devices fell into this state, contact us through our support portal referencing this reply, along with a list of any device IDs and associated ICCIDs that saw increased data usage as a result of this incident. We will work to ensure that any overages charged as a result of this incident are credited to your account.

Please note that for any E Series LTE, Boron 2G/3G, or Boron LTE devices, it may take up to 7 days for our mobile operator to pass usage records on to us, so users whose devices fell into this bad state may not see the excess usage reflected right away. I encourage owners of these device types to keep an eye on their usage records and to reach out to us if they notice an increase at any point in the next 7 to 10 days.


Hi @trpropst

Recent investigations by our engineering team into a separate issue, outside the scope of this one, have uncovered some undesired behavior in the R410M-02B cellular module found in both our E Series LTE and Boron LTE devices. We are working with our cellular module partner to address these issues and determine next steps for resolving them in existing and future Particle devices. Our engineering team continues to dig into the implications for the LTE cellular module and will work with our documentation team to publish a full errata sheet.


Hi @bkhalterman

Please see @zach’s response and the topic referenced in his post for more information. We acknowledge the shortcomings that led to this incident and are determined to pursue resolutions on both the Device OS and cloud fronts.

Hi @dkryder, please refer to Updates for the Status Incident of May 4th for more information. Feel free to post any questions you may have there.