Cloud publish/subscribe down this morning?

My remote appears to connect and publish, but none of my other cores are receiving their subscriptions. Everything was working well at midnight (EST), but nothing has come through this morning.

Is anyone else experiencing this today?

I’ve also experienced connection errors while trying to subscribe to my cores’ events.

@dave and @jgoggins,

any updates about the :cloud:?

Hey Guys,

Thanks for the ping. I noticed very high CPU usage earlier on one of the systems responsible for managing pub/sub events, and we restarted it about an hour or so ago. Subscriptions seem to have returned to normal, but I wasn’t aware of any downtime or impact. I’ll look into monitoring so we can detect this in the future.

Thanks,
David

Thanks @Dave for the update. My subscribed cores are still not receiving updates, but I’m hoping that changes soon.

Thanks @Dave for the update. Same issue here. Hope to get back to normal soon. Good work btw

Still down today. No subscribed events are coming into my cores. @Dave any idea what may be causing this or when it will be fixed?

Hi @Muskie,

Hmm, can you restart your cores or post your code? Or send me your core IDs so I can check the logs?

Thanks,
David

@Dave it seems to be publishing now. My cores started receiving the publications about 10 minutes ago, and they are now arriving on schedule. May I ask what you did to resolve the problem?

Hi @Muskie,

After studying your logs, I noticed a place where an assumption I had breaks down. The cloud uses the ping messages from your devices as ‘proof’ that the session with that core is live. As your devices move around the cloud, this keepalive lets old sessions close gracefully, and prevents an old session from interfering with messages going to/from your devices. Most of the time this isn’t necessary, and it is there as a failsafe, since it’s rare for these old sessions to linger.

My assumption was wrong because very chatty devices aren’t sending the pings I was expecting, so in the rare case where an old session is left around, they weren’t providing that proof. I’m working on a fix that detects when a session is very active but isn’t sending pings, and cleans up old sessions as expected. We’ve started seeing this recently because so many more devices are joining the cloud, so we’re testing the horizontal scaling architecture we designed back at the beginning. :slight_smile:
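To make the failure mode concrete, here is a minimal Python sketch of the idea. It is not the cloud's actual code: `SessionRegistry`, `count_traffic_as_proof`, and the method names are all hypothetical, and it assumes the only rule is "a session that proves it is live closes any older session for the same device". With the flag off, only pings count as proof, so a chatty device that never pings leaves its old session open. With the flag on, any message counts, which is roughly the behavior the fix is aiming for.

```python
from typing import Dict, List


class Session:
    """One connection between a device and the cloud (illustrative only)."""

    def __init__(self, session_id: str, device_id: str) -> None:
        self.session_id = session_id
        self.device_id = device_id
        self.closed = False


class SessionRegistry:
    """Hypothetical tracker of open sessions per device, oldest first."""

    def __init__(self, count_traffic_as_proof: bool) -> None:
        # False: only pings prove liveness (the original assumption).
        # True: any message from the device also proves liveness.
        self.count_traffic_as_proof = count_traffic_as_proof
        self._by_device: Dict[str, List[Session]] = {}

    def open(self, session_id: str, device_id: str) -> Session:
        session = Session(session_id, device_id)
        self._by_device.setdefault(device_id, []).append(session)
        return session

    def on_ping(self, session: Session) -> None:
        # A ping is always treated as proof that this session is live.
        self._confirm_live(session)

    def on_message(self, session: Session) -> None:
        # Ordinary traffic only counts as proof when the flag is set.
        if self.count_traffic_as_proof:
            self._confirm_live(session)

    def _confirm_live(self, live: Session) -> None:
        if live.closed:
            return  # a session that was already cleaned up proves nothing
        sessions = self._by_device[live.device_id]
        index = sessions.index(live)
        # Close every session that was opened before the live one.
        for older in sessions[:index]:
            older.closed = True
        self._by_device[live.device_id] = sessions[index:]

    def open_sessions(self, device_id: str) -> List[str]:
        return [s.session_id for s in self._by_device.get(device_id, [])
                if not s.closed]


# A chatty device reconnects and then only sends messages, never pings.
ping_only = SessionRegistry(count_traffic_as_proof=False)
ping_only.open("old", "core-1")
chatty = ping_only.open("new", "core-1")
ping_only.on_message(chatty)
stale_left_behind = ping_only.open_sessions("core-1")

with_fix = SessionRegistry(count_traffic_as_proof=True)
with_fix.open("old", "core-1")
chatty_fixed = with_fix.open("new", "core-1")
with_fix.on_message(chatty_fixed)
cleaned_up = with_fix.open_sessions("core-1")
```

In the first registry the old session is still listed alongside the new one after the message arrives; in the second, only the new session remains. A real implementation would also need timeouts and some care about which session counts as "newer", which this sketch leaves out.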

I went to deploy this fix, but quickly rolled it back when I saw slightly different behavior under load. Restarting the service your core was on fixed the problem for you since your old session was cleaned up. My hope is to deploy the fix for this again later today. :slight_smile:

Thanks!
David