Fast blinking cyan photon

Seems back to normal to me! Thanks guys!!!

Thank you all for the fast responses. Seems ok on my end now. I’ll keep you posted on further developments. Cheers.

EDIT: Seems to connect as expected, but connection drops after about 50s, then the photon immediately reconnects. Repeats indefinitely. Is this behavior known?

Not seeing this here, does the same happen with Tinker too?

Cannot reproduce, either with Tinker or with a minimal blink example. The problem must be on my side. Sorry for the bother.

@mihaigalos Are you publishing any events? If so, how often?

I’m seeing this again today and last night (11/24 21:30 to 22:00 PST). Most of the time it connects after one cycle of rapid cyan (15 to 20 seconds), a quick red burst, then a second or two more of rapid cyan. Sometimes, it takes 2 or 3 cycles of this before it connects, and sometimes it connects right away. It’s definitely a random thing.

After Edit:

Now (22:30 PST on 11/25) it’s even worse. It’s been taking several minutes to connect now. Tested again 5 minutes later, and it’s connecting right away. Tested again (same code all these times) at 22:38, and it went through 7 cycles, taking 3 minutes 13 seconds to connect.

Further Edit:

Same problem this morning (11/26 9:30 PST) with 2 different Photons, including one that’s running the “Blink an LED” sample app. That Photon went through 3 cycles of the rapid cyan (~20 seconds)/quick red flash sequence before connecting. All my Photons are running 0.6.0. Any ideas @BDub, @bryce, @zachary?

Hi all,

I posted in “Particle cloud registration issues? Stuck on rapid Cyan blink (0.6.0/RC2)” about my rapid cyan blinking, and Ric kindly asked me if I’d seen the 1-2 red blinks (which I have).

It seems like a primitive rate limit on the cloud or, as @zachary alluded to, a load-balancing decision that routes to a swamped server. I have 15 perfectly set-up Particles on the table next to me; if I pick one at random, it’s 50/50 whether it connects. Generally, if I turn that Particle off and wait ~3 minutes, it has a 70/30 chance of working the second time.

Perhaps a sticky load balancer to a server that’s run out of session space or incoming request rate is too high for the apache/nginx/whatever you chose?

Either way, I’m glad to find out that it’s an observed issue: a) I’m not going crazy, and b) I can start telling my customers it’s not them, it’s something being worked on.

Ok, time to go back to packing Smartfires for my customers and cooking brisket!

Just letting y’all know we’re investigating. Most devices (i.e., looking at stats that are averages across the whole Particle Cloud) are performing fine; however, handshake failures are a little elevated. We think we know why. I’ll report back here in a bit.

OK, handshake failure rates are back down. There is a known bug in a dependency that we need to upgrade. We’ll schedule that work on a sprint soon. We also created a new alert on a new composite metric that will page the engineer on duty when this happens in the future. Thanks for the heads-up everybody!

Is there a reason some devices would be more likely to suffer this problem than others? All three of the devices I've been using over the last several days have been connecting very slowly a lot of the time.

Thanks for jumping on this!

Just luck-o-the-draw in round robin load balancing — nothing particular to a device — though your local region of the DNS might have had an unhealthy IP cached for a minute. Some instances of the TCP device service were accepting new connections at the normal rate while some instances were degraded and slow to handshake or not accepting new connections at all.

Devices would always retry until they found a healthy server, it just might take a few tries — the max I saw in testing was 3 failures before a success. The UDP device service used by the Electron was unaffected.
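The retry behaviour described above can be illustrated with a small sketch. It assumes, purely for illustration, that each retry lands on the next instance in a fixed round-robin order and that the pool layout below is hypothetical; the real load balancing and DNS caching are not modelled.

```python
def attempts_until_connected(instances, start=0, max_attempts=10):
    """Count connection attempts until a healthy instance is reached.

    instances: list of booleans, True = healthy, False = degraded.
    start: index of the instance the first attempt lands on.
    Returns the number of attempts made, or None if max_attempts runs out.
    """
    n = len(instances)
    for attempt in range(1, max_attempts + 1):
        # Assumption: each retry moves on to the next instance in order.
        if instances[(start + attempt - 1) % n]:
            return attempt
    return None


# Hypothetical pool: three degraded instances in a row, then healthy ones.
pool = [False, False, False, True, True, True]

# Worst case over every possible starting instance.
worst = max(attempts_until_connected(pool, start=i) for i in range(len(pool)))
print(worst)  # 4 attempts: 3 failures, then a success
```

Under these assumptions the number of failures a device sees depends only on where in the rotation its first attempt happens to land, which matches the "luck of the draw" description.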

Just out of curiosity, what does the red flash mean? Is that an indication that it’s retrying another server? (If so, my max was 6 failures.)

The quick red burst does indicate that the key exchange during the handshake didn’t succeed. Hence, when this is a persistent issue, the common remedy is:

particle keys doctor <yourDeviceID>
particle keys server

If the keys got out of sync, that would fix it. But in this case, it’s more likely due to a flaky connection between your device and one or more busy servers.

Hi Scruffr,

I suppose the trick there is how to remedy units in the field.

I just finished having a conversation with a customer complaining about dropouts and reconnection problems. His problems were 13 hours ago, i.e., after Zach’s post.

If it helps, the customer and I are in Australia, Melbourne specifically.

Tonight I’ll fire up a few of the Smartfires I have at home and see if they all connect.

Is the open source Particle cloud still viable for use, i.e., if we decide to roll our own?

Regards,
Mark

Hi, what is the cycle time in seconds for an attempt?

I have an interrupt that forces listening mode if it hasn’t connected after 30 seconds, and I was observing the red flash at roughly the 29-second mark.

Really interested in this

There’s a 20 second timeout on each attempt to receive a given number of bytes from the server. The handshake method contains 2 calls to blocking_receive, one on line 150, the other on line 172.

If I had to guess at what precisely was going on, I’d say you probably got 29 seconds because the degraded server was slow to (1) open the TCP socket and (2) send the initial nonce (9 seconds combined). Then I suspect your device timed out waiting to receive the AES key material.
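The arithmetic behind that guess can be sketched as follows. Only the 20-second per-receive timeout and the 9-second combined delay come from the posts above; the function name and the 4 s / 5 s split between socket open and nonce are hypothetical, chosen just to add up to 9 seconds.

```python
RECEIVE_TIMEOUT_S = 20.0  # per-receive timeout quoted above


def handshake_elapsed(connect_s, receive_waits_s, timeout_s=RECEIVE_TIMEOUT_S):
    """Return (elapsed_seconds, succeeded) for one handshake attempt.

    connect_s: time taken to open the TCP socket.
    receive_waits_s: how long the server takes to answer each blocking
        receive; None means the data never arrives.
    """
    elapsed = connect_s
    for wait in receive_waits_s:
        if wait is None or wait > timeout_s:
            # The receive gives up after the timeout and the attempt fails.
            return elapsed + timeout_s, False
        elapsed += wait
    return elapsed, True


# Hypothetical split of the 9 seconds: 4 s to open the socket, 5 s for the
# nonce; the AES key material never arrives, so the second receive times out.
print(handshake_elapsed(4.0, [5.0, None]))  # (29.0, False)
```

With a 30-second deadline for forcing listening mode, a single attempt like this one uses up almost the whole budget before the red flash appears.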
