Cloud publish/subscribe down this morning?

Muskie · February 22, 2015, 3:15pm

My remote appears to connect and publish, but none of my other cores are receiving their subscriptions. All was working well at midnight (EST) today but nothing this morning.

Is anyone else experiencing this today?

bgzenator · February 22, 2015, 11:10pm

I’ve also experienced connection errors while trying to subscribe to my cores’ events.

kennethlimcp · February 23, 2015, 12:26am

@dave and @jgoggins,

any updates about this?

Dave · February 23, 2015, 1:46am

Hey Guys,

Thanks for the ping. I noticed very high CPU usage earlier on one of the systems responsible for managing pub/sub events, and we restarted it about an hour ago. Subscriptions seem to have returned to normal, but I wasn’t aware of any downtime or impact. I’ll look into monitoring so we can detect this in the future.

Thanks,
David

Muskie · February 23, 2015, 1:53am

Thanks @Dave for the update. My subscribed cores are still not receiving updates, but I’m hoping that changes soon.

jamesMeludia · February 23, 2015, 2:50am

Thanks @Dave for the update. Same issue here. Hope to get back to normal soon. Good work btw

Muskie · February 23, 2015, 4:42pm

Still down today. No subscribes coming into my cores. @Dave any idea what may be causing this or when it will be fixed?

Dave · February 23, 2015, 4:59pm

Hi @Muskie,

Hmm, can you restart your cores or post your code? Or send me your core IDs so I can check the logs?

Thanks,
David

Muskie · February 23, 2015, 8:54pm

@Dave it seems to be publishing now. Just started about 10 minutes ago to receive the publications and it now continues on schedule. May I ask what you did to resolve the problem?

Dave · February 23, 2015, 9:00pm

Hi @Muskie,

After studying your logs, I noticed a place where an assumption I had breaks down. The cloud uses the ping messages from your devices as ‘proof’ that the session with that core is live. As your devices move around the cloud, this keepalive lets old sessions close gracefully, and prevents an old session from interfering with messages going to/from your devices. Most of the time this isn’t necessary, and it is there as a failsafe, since it’s rare for these old sessions to linger.

My assumption was wrong because very chatty devices weren’t sending the pings I was expecting, so in the rare condition where an old session is left around, they weren’t providing that proof. I’m working on a fix that detects when a session is very active but isn’t sending pings, and cleans up old sessions as expected. We’ve started seeing this recently because so many more devices are joining the cloud, which means we’re now testing the horizontal scaling architecture we designed back at the beginning.
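
To make the failure mode above concrete, here is a minimal Python sketch. It is not the Spark Cloud's actual code: `SessionRegistry`, its methods, and the `count_activity_as_proof` flag are hypothetical names, and the cleanup rule (a proof of liveness on one session closes the device's other sessions) is an assumption based on the description above.

```python
from typing import Dict, Set


class SessionRegistry:
    """Toy model of per-device session tracking (hypothetical, for illustration)."""

    def __init__(self, count_activity_as_proof: bool) -> None:
        # False models the original assumption: only pings prove liveness.
        # True models the proposed fix: any traffic on a session proves liveness.
        self.count_activity_as_proof = count_activity_as_proof
        self.sessions: Dict[str, Set[str]] = {}  # device id -> open session ids

    def open(self, device_id: str, session_id: str) -> None:
        """Record a newly opened session for a device."""
        self.sessions.setdefault(device_id, set()).add(session_id)

    def on_ping(self, device_id: str, session_id: str) -> None:
        """A keepalive ping always counts as proof that the session is live."""
        self._prove_live(device_id, session_id)

    def on_message(self, device_id: str, session_id: str) -> None:
        """Ordinary traffic (e.g. a published event) counts only with the fix."""
        if self.count_activity_as_proof:
            self._prove_live(device_id, session_id)

    def _prove_live(self, device_id: str, session_id: str) -> None:
        # Assumed cleanup rule: the proven session becomes the only open one,
        # so any lingering old session for this device is closed.
        self.sessions[device_id] = {session_id}

    def open_sessions(self, device_id: str) -> Set[str]:
        """Return a copy of the session ids currently open for a device."""
        return set(self.sessions.get(device_id, set()))


def simulate_chatty_device(count_activity_as_proof: bool) -> Set[str]:
    """A device reconnects, then publishes constantly and never sends a ping."""
    registry = SessionRegistry(count_activity_as_proof)
    registry.open("core-1", "old-session")
    registry.open("core-1", "new-session")
    for _ in range(5):
        registry.on_message("core-1", "new-session")
    return registry.open_sessions("core-1")
```

In this model, `simulate_chatty_device(False)` leaves both sessions open, which corresponds to the lingering old session interfering with messages, while `simulate_chatty_device(True)` leaves only `new-session`.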

I went to deploy this fix, but quickly rolled it back when I saw slightly different behavior under load. Restarting the service your core was on fixed the problem for you since your old session was cleaned up. My hope is to deploy the fix for this again later today.

Thanks!
David
