[Resolved] Outage! Particle cloud down

mterrill · May 31, 2020, 5:18am

I’ve got multiple test units and multiple customers that cannot connect to Particle cloud. Rapid cyan flash, single red flash. It’s basically the same symptom as when the keys are invalid.

Update: Seeing nothing on particle subscribe was a furphy; ‘particle subscribe mine’ no longer works on the latest CLI. However, I’m still having issues with devices connecting reliably.

polystyrene · May 31, 2020, 6:21am

Yeah, I just got blown up with alerts too. It looks like they reset the device service or bounced all connected devices around 5:15am UTC. I’ve seen this before, and it wreaks total havoc on our customer base. Some devices just take huge amounts of time to finally reconnect if they’re not manually reset.

mterrill · May 31, 2020, 6:29am

Yep, it’s a massive PITA. Second time I’ve reported an outage in as many months. I’ve got FB threads full of customers whose devices are in limbo.

Particle cloud seems to have a range of issues going on. Devices are listed as offline but still enumerate their exposed functions etc.; calling one returns ‘Invalid access token’, and variable reads time out.

  Variables:
    version (string)
    temps (string)
    debug (string)
    debugPID (string)
    cookid (string)
  Functions:
    int setpitf (String args) 
    int listenMode (String args) 
    int stopLidOpen (String args) 
    int credsFactory (String args) 
    int credsClear (String args) 
    int credsWrong (String args) 
    int testFan (String args) 
    int reset (String args) 
➜  sdb-firmwareflash git:(master) particle call 33005a001051353338363333 setpitf
Error calling function: `setpitf`: Invalid access token
➜  sdb-firmwareflash git:(master) particle call 33005a001051353338363333 setpitf 1
Error calling function: `setpitf`: Invalid access token
➜  sdb-firmwareflash git:(master) particle login
? Please enter your email address xxxxxxxxxxx
? Please enter your password [hidden]
> Successfully completed login!
➜  sdb-firmwareflash git:(master) particle call 33005a001051353338363333 setpitf 1
Error calling function: `setpitf`: Invalid access token
➜  sdb-firmwareflash git:(master) particle get 33005a001051353338363333 temps
Error: Timed out.
Error while reading value: Some variables could not be read
➜  sdb-firmwareflash git:(master) particle get 33005a001051353338363333 temps
Error: Timed out.
Error while reading value: Some variables could not be read
➜  sdb-firmwareflash git:(master)

mterrill · May 31, 2020, 6:31am

Very simply, with this outage and the last, it seems they don’t have real devices looping through connection tests and alerting if they’re not seen or successful within a minute. status.particle is alllllll green, and that’s quite obviously not the case.
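A rough sketch of that kind of check, in Python: given the last time each canary device completed a connection test, flag any that have not succeeded within the last minute. The device names, the `stale_devices` helper, and the sample timestamps are illustrative assumptions, not anything Particle is known to run.

```python
from datetime import datetime, timedelta, timezone


def stale_devices(last_success, now, max_age=timedelta(minutes=1)):
    """Return the IDs of canary devices whose last successful connection
    test is older than max_age (hypothetical helper, for illustration)."""
    return sorted(
        device_id
        for device_id, seen_at in last_success.items()
        if now - seen_at > max_age
    )


# Hypothetical canary fleet: one device checked in 20 s ago, one 5 min ago.
now = datetime(2020, 5, 31, 5, 20, tzinfo=timezone.utc)
last_success = {
    "canary-a": now - timedelta(seconds=20),
    "canary-b": now - timedelta(minutes=5),
}

for device_id in stale_devices(last_success, now):
    print(f"ALERT: {device_id} has not connected within the last minute")
```

Feeding the alert into the status page, rather than only into an internal channel, is the part that would have helped here.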

polystyrene · May 31, 2020, 6:38am

Honestly, most devices reconnected early on after the event for me, but it may be a numbers game. Definitely been way too long without an update. However, I just saw a big improvement in connected device count in the last few minutes as well. Hopefully they’ve got something sorted out now.

mterrill · May 31, 2020, 6:49am

Devices are turning on and connecting OK now. However, devices that went online while the cloud was down aren’t listening to function calls and will need a physical restart.

ScruffR · May 31, 2020, 11:32am

I guess this comes from the fact that the first (default) parameter has become the event name (i.e. particle subscribe now does what particle subscribe mine did), while the "scope" has become an explicit option: particle subscribe <evtName> --device <deviceName> for a single device, or --all for ALL_DEVICES, where mine used to stand for (all) MY_DEVICES.

Starting with CLI v1.27, particle subscribe mine would subscribe to all your events starting with mine.

mstanley · May 31, 2020, 7:15pm

Hi folks,

Sorry for the troubles here. Particle’s engineering team was made aware of this incident yesterday and was investigating the matter. Based on preliminary investigations, not all devices were experiencing difficulties, thus this was not determined to be a cloud outage.

Engineering’s investigation has continued into Particle’s planned maintenance window scheduled for today. Metrics have determined that all devices have since recovered and should be in a happy, cyan-breathing state.

If anybody is still experiencing issues following this maintenance window for fleet connectivity, please be sure to inform us by filing a support ticket at https://support.particle.io/

policenauts · May 31, 2020, 7:22pm

Hi @mstanley, I’m brand new to Particle (just got my Photon up and running yesterday), but as of this writing, most requests to my device through the cloud API this morning are failing with {"error":"Timed out."}, both using the GET button in the console and using a GET request. It seems to work intermittently, but as soon as I try to make a subsequent request it hangs. Am I doing something wrong?

My code (to simply grab the value from a weight scale):

bool isAvailable = false;
char c;
String string;

void setup(){
    // set up serial comms over the USB for debugging
    Serial1.begin(9600);
    Particle.variable("isAvailable", isAvailable);
    Particle.variable("c",c);
    Particle.variable("string",string);

    Particle.publishVitals(5);

}

void loop(){
    // Serial.println(Serial1.available());.
    isAvailable = Serial1.available();
    c = Serial1.read();
    string = Serial1.readStringUntil('\n');
}

mstanley · May 31, 2020, 7:26pm

Hi @policenauts,

Happy to be of assistance.

Particle concluded a scheduled maintenance window (11am - 12pm PDT) approximately 25 minutes ago. More details on this window may be found at: https://status.particle.io/incidents/bbp57y75c4lw

During this window, core Particle features such as API calls were expected to have intermittent issues.

As our maintenance window has just concluded, I would encourage you to retry all of the calls you were having difficulties with this morning and see if you are able to reproduce the issues.

If you are able to continue reproducing this bad behavior, I would encourage you to file a support ticket at the previously provided support link, providing the following information:

Your account email
API endpoints you are attempting to call
Any device IDs you are attempting to call against
Source code (as provided here)
Status of your Particle device’s LED
Any other relevant information that could be helpful for diagnosis

Do let me know if you have any questions or concerns!

policenauts · May 31, 2020, 7:42pm

Thanks @mstanley, I’ll file a ticket.

A related question - is it common for Particle to do scheduled maintenance during US business hours (even on weekends)? Thanks.

mstanley · May 31, 2020, 8:16pm

Hi @policenauts

This is Particle’s first time doing a scheduled maintenance. Feedback from our community as a result of this maintenance will be helpful in determining when Particle will conduct future maintenance windows.

mterrill · June 1, 2020, 2:52am

Yeah, I checked out the new syntax options. I’m not sure about that versioning though. I had the CLI matching Device OS 1.4.4, so it wasn’t particularly old. I have a bash script called ‘particlegrep’ that I’ve been using for years, and it stopped working when I upgraded to CLI 2.7.0 late last week.

#!/bin/bash
echo 'hi there mark'
particle subscribe mine | grep "$1"

Now the ‘mine’ is redundant after many years of active use.

mterrill · June 1, 2020, 3:16am

Hi @mstanley, this wasn’t during the scheduled hours, it was well before.

Per status.particle: "Particle will be conducting scheduled maintenance on Sunday May 31st from 18:00-19:00 UTC (11am - 12pm PDT)"
For those playing at home, that is:
Epoch timestamp : 1590948000
Timestamp in milliseconds: 1590948000000
Date and time (GMT) : Sunday, May 31, 2020 18:00:00
Date and time (your time zone): Monday, June 1, 2020 4:00:00 GMT+10:00
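
The same conversion can be reproduced with a few lines of Python, using only the standard library. The fixed +10:00 offset stands in for "your time zone" above, and the `window_start` name is just for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+10 offset, matching the "GMT+10:00" shown above (no DST handling).
GMT_PLUS_10 = timezone(timedelta(hours=10))


def window_start(epoch_seconds):
    """Return the instant as (UTC datetime, GMT+10 datetime)."""
    utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return utc, utc.astimezone(GMT_PLUS_10)


utc, local = window_start(1590948000)
print(utc.isoformat())    # 2020-05-31T18:00:00+00:00
print(local.isoformat())  # 2020-06-01T04:00:00+10:00
```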

Any of our customers who rebooted their device during this period could not connect:
Sunday, May 31, 2020 15:00:00 to 5:15:00 GMT+10:00

Devices would simply not connect, per the original rapid blue and single red flash symptom.
Later, towards the end, some devices would also come online in limbo and be orphaned: their functions could not be called without a physical reset, even after the resolution around 5:15 AEST.

I fundamentally don’t agree with your statement: “Based on preliminary investigations, not all devices were experiencing difficulties, thus this was not determined to be a cloud outage.” That’s like saying the person who just lost a leg in a car accident is perfectly healthy unless they’re dead.

Your testing should have identified that new connections were failing. If that wasn’t apparent, and no one in the support team tried turning on physical devices and connecting them, then that’s an area worth investigating. Presuming you did see all the devices fail to connect and publish functions for over 2 hours, that to me warrants an update on status.particle even if previously registered devices were still working. Status.particle is where I check first as then it saves me the effort of carefully testing with physical devices myself.

You’re welcome to jump onto my customer forum and see their opinions on having their afternoon cooks for dinner time interrupted by Particle’s “it’s not butter” outage. One customer even suggested changing providers. https://www.facebook.com/groups/smartfireowners/permalink/2329137957382892/

mstanley · June 1, 2020, 3:47am

Hi Mark,

Perhaps my statement was misunderstood.

As mentioned previously, Particle’s engineering team was made aware of this yesterday and, as such, began a preliminary investigation yesterday. It is recognized this was before the scheduled maintenance today. The statement was to convey that the investigation began yesterday and led into today’s scheduled maintenance window.

Following on from yesterday’s preliminary investigation, today’s routine maintenance window also sought to bring these devices back online.

We did recognize a subset of devices that were having difficulty connecting. However, our observations showed that not all devices were experiencing this issue. What led to this subset of devices having difficulty connecting is something best left to a root cause analysis by Particle’s engineering team.

If you have any concerns regarding this incident, I welcome you to send me a private message here on the community forum and I would be happy to get you in touch with our VP of Customer Success regarding this matter.

mterrill · June 1, 2020, 3:51am

I’ve said my bit. There are even other threads on here with other customers with the same issues. Updating status.particle when there is an identified issue would help everyone.

mstanley · June 1, 2020, 3:55am

Hi Mark

Your concerns are noted and will be relayed internally.

AlbertZeroK · June 1, 2020, 10:27am

I’ve held the keyboard and worked through my own outages with 6k/minute in sales flying out the door, so I understand how difficult dev ops and monitoring is. But I cannot imagine doing maintenance during the day; if it was emergency maintenance, that should be in the communication. OUCH!! This is also not the first time the community has made Particle aware of their own outage. I’d suggest tracking load balancer volume and error rates; it’s not that difficult and would give you a reasonable level of monitoring. Throw it through PagerDuty or Opsgenie and have your engineering team on call. Make sure this is a top-down initiative as well: if the CTO isn’t spitting mad at this point and ready to fix it, fire him.

In 2020, having a customer tell you you’re down, when you’re a cloud service provider, is embarrassing. I’m not kidding. You guys have to fix this if you ever want to survive.
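
As a minimal sketch of the volume and error-rate tracking suggested above, in Python: compare a window of load balancer counters against a baseline and decide whether to page. The thresholds, the counter values, and the function names are assumptions for illustration; the actual paging call (PagerDuty, Opsgenie, or similar) is left out.

```python
def error_rate(total_requests, failed_requests):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


def volume_dropped(total_requests, baseline_requests, min_ratio=0.5):
    """True when traffic fell below min_ratio of the usual baseline."""
    return total_requests < baseline_requests * min_ratio


def should_page(total_requests, failed_requests, baseline_requests,
                max_error_rate=0.05, min_ratio=0.5):
    """Page on either signal. Both thresholds are hypothetical values."""
    too_many_errors = error_rate(total_requests, failed_requests) > max_error_rate
    too_little_traffic = volume_dropped(total_requests, baseline_requests, min_ratio)
    return too_many_errors or too_little_traffic


# Hypothetical one-minute windows against a baseline of 2000 requests.
print(should_page(2000, 40, 2000))   # False: 2% errors, normal volume
print(should_page(2000, 400, 2000))  # True: 20% errors
print(should_page(300, 0, 2000))     # True: volume fell to 15% of baseline
```

Watching volume as well as error rate matters for a case like this one, where devices that cannot connect at all generate no failed requests to count.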

will · June 1, 2020, 7:34pm

Fully agreed, @mterrill – a totally reasonable expectation. Some communication wires may have gotten crossed given the proximity of the issue to the scheduled maintenance window that resulted in a failure of our typical statuspage notifications. I will run a quick post-mort today with our Eng/CS teams.

will · June 1, 2020, 8:23pm

Thanks for the thoughts, Albert.

I want to be very clear that this was an external communications issue, not an internal awareness issue, though I understand how a failure on the former can appear to be a failure for the latter.
