Local Cloud: Cores listed as 'offline', but still breathing cyan / OTA ready

As I continue working on a robust local cloud setup for a trio of test Cores, I’m encountering an interesting issue.

Sometimes the Cores drop out from spark-server, yet they continue a quasi-connected existence: their status LED keeps breathing cyan, and I can still flash them OTA. In this state, however, I can no longer retrieve cloud variables, functions or SSEs, nor list the Cores as being ‘online’, and the only fix so far is either a manual reset, or forcing an OTA flash to remotely reset the Core. The dropout can happen anywhere from half an hour later to overnight, but usually when you least expect it :scream:

Below are the relevant logs for anyone who can help with this issue.

When everything is working smoothly, running particle list gives me this readout, which is correct (core IDs removed). The Cores are also happily publishing SSEs, by the way (I wonder if that could be the cause?)

Checking with the cloud...
chicken_mighty (xxxx) is online
  Variables:
    humidity (double)
    tempC (double)
    dewPt (double)
    dewPtSlow (double)
cranky_banjo (xxxx) is online
  Variables:
    humidity (double)
    tempC (double)
    dewPt (double)
    dewPtSlow (double)
normal_ninja (xxxx) is online
  Variables:
    humidity (double)
    tempC (double)
    dewPt (double)
    dewPtSlow (double)

The corresponding spark-server log when this was run didn’t report anything amiss either:

ListDevices { userID: 'aaaa' }
isDeviceOnline: Pinging core...  { coreID: 'xxxx' }
isDeviceOnline: Pinging core...  { coreID: 'yyyy' }
isDeviceOnline: Pinging core...  { coreID: 'zzzz' }
ListDevices... waiting for connected state to settle  { userID: 'aaaa' }
isDeviceOnline: Device service thinks it is online...  { coreID: 'xxxx' }
isDeviceOnline: Device service thinks it is online...  { coreID: 'yyyy' }
isDeviceOnline: Device service thinks it is online...  { coreID: 'zzzz' }
192.168.2.101 - - [Wed, 20 May 2015 07:26:36 GMT] "GET /v1/devices?access_token=bbbb HTTP/1.1" 200 387 "-" "-"
GetAttr { coreID: 'xxxx', userID: 'aaaa' }
192.168.2.101 - - [Wed, 20 May 2015 07:26:41 GMT] "GET /v1/devices/xxxx?access_token=bbbb HTTP/1.1" 200 224 "-" "-"
GetAttr { coreID: 'yyyy', userID: 'aaaa' }
192.168.2.101 - - [Wed, 20 May 2015 07:26:41 GMT] "GET /v1/devices/yyyy?access_token=bbbb HTTP/1.1" 200 224 "-" "-"
GetAttr { coreID: 'zzzz', userID: 'aaaa' }
192.168.2.101 - - [Wed, 20 May 2015 07:26:41 GMT] "GET /v1/devices/zzzz?access_token=bbbb HTTP/1.1" 200 226 "-" "-"

Now here comes the ‘fun’ part. After a period of time (x hours, overnight?!), one or more Cores start ‘dropping out’, which is perfectly normal, especially if they get an intermittent WiFi signal, and I’ve seen Cores on the Particle Cloud drop out and reconnect again with no issue. On the local cloud, though, the spark-server logs report:

recovering from early data!  { step: 5, data: '410102f58eb174', cache_key: '_1' }
connect.multipart() will be removed in connect 3.0
visit https://github.com/senchalabs/connect/wiki/Connect-3.0 for alternatives
connect.limit() will be removed in connect 3.0
recovering from early data!  { step: 5, data: '4101b074bdb174', cache_key: '_0' }
recovering from early data!  { step: 6,
  data: '5002b075b1450d0d737061726b2f6363333030302d70617463682d76657273696f6eff312e3239',
  cache_key: '_0' }
recovering from early data!  { step: 5, data: '41012697cbb174', cache_key: '_3' }
recovering from early data!  { step: 5, data: '4101f2a032b174', cache_key: '_4' }
recovering from early data!  { step: 5, data: '4101d5ce54b174', cache_key: '_5' }
recovering from early data!  { step: 5, data: '4101459fdfb174', cache_key: '_6' }
recovering from early data!  { step: 5, data: '41010c467db174', cache_key: '_7' }
recovering from early data!  { step: 5, data: '4101e34ef7b174', cache_key: '_9' }
recovering from early data!  { step: 5, data: '4101613f92b174', cache_key: '_11' }
recovering from early data!  { step: 5, data: '4101ed8b00b174', cache_key: '_12' }
recovering from early data!  { step: 5, data: '4101b10507b174', cache_key: '_13' }
recovering from early data!  { step: 5, data: '41013546f5b174', cache_key: '_14' }
recovering from early data!  { step: 5, data: '41015e0be4b174', cache_key: '_15' }
recovering from early data!  { step: 5, data: '41011dd828b174', cache_key: '_16' }
recovering from early data!  { step: 5, data: '41010d21ffb174', cache_key: '_26' }
on ready { coreID: '54ff6e066667515129382367',
  ip: '192.168.2.108',
  product_id: 0,
  firmware_version: 10,
  cache_key: '_26' }
Core online!
(node) warning: possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit.
Trace
    at Object.addListener (events.js:160:15)
    at Object.EventPublisher.subscribe (/home/pi/spark-server/node_modules/spark-protocol/lib/EventPublisher.js:52:14)
    at Object.CoreController.subscribe (/home/pi/spark-server/lib/CoreController.js:145:20)
    at EventsApi.get_events (/home/pi/spark-server/views/EventViews001.js:198:10)
    at callbacks (/home/pi/spark-server/node_modules/express/lib/router/index.js:164:37)
    at param (/home/pi/spark-server/node_modules/express/lib/router/index.js:138:11)
    at param (/home/pi/spark-server/node_modules/express/lib/router/index.js:135:11)
    at pass (/home/pi/spark-server/node_modules/express/lib/router/index.js:145:5)
    at Router._dispatch (/home/pi/spark-server/node_modules/express/lib/router/index.js:173:5)
    at Object.router (/home/pi/spark-server/node_modules/express/lib/router/index.js:33:10)

Can someone share what ‘recovering from early data’ means?

And when I run particle list again, I get, for example:

Checking with the cloud...
chicken_mighty (xxxx) is offline
cranky_banjo (xxxx) is online
  Variables:
    humidity (double)
    tempC (double)
    dewPt (double)
    dewPtSlow (double)
normal_ninja (xxxx) is offline

The EventEmitter memory leak looks very suspect, and I wonder if it has to do with the Spark.publish() lines in my code, where I’m using sprintf to format my char pubOut[63] buffer as JSON output:

sprintf(pubOut, "{\"c\":\"%2.2f\",\"h\":\"%2.2f\", \"d\":\"%2.2f\",\"ds\":\"%2.2f\"}", tempC, humidity, dewPt, dewPtSlow);
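For reference, a client receiving this payload (e.g. as the data field of an SSE message) would need to convert the quoted values back to numbers, since they’re published as strings. A minimal sketch, where the field names mirror the sprintf format string above and the helper name is hypothetical:

```javascript
// Parse the JSON payload published above. The keys ("c", "h", "d", "ds")
// come from the sprintf format string; the values arrive as strings, so
// the client converts them back to floats.
function parseSensorEvent(json) {
  const d = JSON.parse(json);
  return {
    tempC: parseFloat(d.c),
    humidity: parseFloat(d.h),
    dewPt: parseFloat(d.d),
    dewPtSlow: parseFloat(d.ds)
  };
}

// Example: one published payload as it would appear on the wire
const sample = '{"c":"23.50","h":"45.20", "d":"11.30","ds":"11.10"}';
console.log(parseSensorEvent(sample).tempC); // 23.5
```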

Anyone with any idea what’s going on?

We are sharing the 2.4GHz WiFi band with microwave ovens, and when my microwave oven is in use, there is absolutely NO WiFi activity possible. Check whether you can trigger the disconnect issue by running your microwave for a few minutes (boil water if you don’t have any food to cook). It would be nice to eliminate this possibility.

Hi @chuank

A lot of these problems are traceable back to firmware issues. There was a similar thread on a local cloud problem that turned out to be using up all the IP sockets in the core.

Can you try simplifying your firmware to the point where almost nothing can go wrong? You seem to have some variables in your firmware; try removing all the sensor code for these variables and just increment the values each time through loop(), or something similar. Boiling it down to the simplest test case will help a lot.

[EDIT]

One other thought: check your DHCP lease time and see if it corresponds to the time when the Cores go zombie.


Thanks @bko, I’ve been testing the Cores for a while now, and it does appear to be related to my firmware code. Publishing a simple incremental counter no longer produced this error in my logs:

(node) warning: possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit.

… and my Cores seem to be behaving. It’ll be a bit more work (and time) to slowly slot the rest of my code back in, a bit at a time, and wait for dropouts, but let’s see how that goes.

This did raise more questions, though. Does an SSE opened from the client side (in my case, a JS webpage) take up an IP socket on a Core? And is there a maximum limit of up to 11 (or 7, based on the link you shared) concurrent connections per Core to a cloud? The upshot is that I’m now a lot more stringent with my SSE calls on the client side, but I’m curious nonetheless about how SSEs are dealt with on the spark-server side.

My DHCP lease time is 24 hrs, so that wasn’t the cause, but it’s an important check. Thanks again for the additional tip.

@ArtLa, thanks for the advice on microwave ovens. While I do not have one, that’s definitely something to look out for in future installations. I do have a narrow corridor between my router and living room that does horrible things to my WiFi signal strength; perhaps a WiFi RSSI visualiser is on the cards!


You will have to get @dave to comment on the Spark Cloud questions, but I can say on the firmware side that the cloud connection always takes exactly one TCP socket on the Core, and all traffic for Spark functions, variables, events and over-the-air programming happens on that one socket.

Looks like I spoke too soon… a 2am entry in the log last night shows:

particle-server-1 (out): 192.168.2.101 - - [Fri, 22 May 2015 02:09:57 GMT] "GET /v1/events/dht22?access_token=0e20ee37fb82012639080f6caa5585164ba92f1a HTTP/1.1" 200 - "http://iota.local/freeboard/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36"
particle-server-1 (err):     at EventsApi.get_events (/home/pi/spark-server/views/EventViews001.js:198:10)
particle-server-1 (err):     at callbacks (/home/pi/spark-server/node_modules/express/lib/router/index.js:164:37)
particle-server-1 (err):     at param (/home/pi/spark-server/node_modules/express/lib/router/index.js:138:11)
particle-server-1 (err):     at param (/home/pi/spark-server/node_modules/express/lib/router/index.js:135:11)
particle-server-1 (err):     at pass (/home/pi/spark-server/node_modules/express/lib/router/index.js:145:5)
particle-server-1 (err):     at Router._dispatch (/home/pi/spark-server/node_modules/express/lib/router/index.js:173:5)
particle-server-1 (err):     at Object.router (/home/pi/spark-server/node_modules/express/lib/router/index.js:33:10)
particle-server-1 (err): (node) warning: possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit.
particle-server-1 (err): Trace
particle-server-1 (err):     at Object.addListener (events.js:160:15)
particle-server-1 (err):     at Object.EventPublisher.subscribe (/home/pi/spark-server/node_modules/spark-protocol/lib/EventPublisher.js:52:14)
particle-server-1 (err):     at Object.CoreController.subscribe (/home/pi/spark-server/lib/CoreController.js:145:20)
particle-server-1 (err):     at EventsApi.get_events (/home/pi/spark-server/views/EventViews001.js:197:10)
particle-server-1 (err):     at callbacks (/home/pi/spark-server/node_modules/express/lib/router/index.js:164:37)
particle-server-1 (err):     at param (/home/pi/spark-server/node_modules/express/lib/router/index.js:138:11)
particle-server-1 (err):     at param (/home/pi/spark-server/node_modules/express/lib/router/index.js:135:11)
particle-server-1 (err):     at pass (/home/pi/spark-server/node_modules/express/lib/router/index.js:145:5)
particle-server-1 (err):     at Router._dispatch (/home/pi/spark-server/node_modules/express/lib/router/index.js:173:5)
particle-server-1 (err):     at Object.router (/home/pi/spark-server/node_modules/express/lib/router/index.js:33:10)
particle-server-1 (err): recovering from early data!  { step: 5, data: '41010b15d8b174', cache_key: '_45' }

This time, though, my Cores have not turned into zombies, and I was able to reopen my SSE connections and get the Cores publishing again right away, without any resets. I’m just publishing an LDR readout here (a single analogRead call). More testing needed.

So now I’m trying to confirm whether this is an SSE-specific issue related to Spark.publish() on the local cloud:

  • spark-server now fails somewhat gracefully by abandoning broken SSEs
  • the short-term fix is to re-open new SSE connections from the client side (in my case, through EventSource / JS)
  • however, is the previously-made SSE connection never terminated on the spark-server side, causing an EventEmitter listener to be added needlessly each time (which would explain that EventEmitter warning)?

Thing is, my EventSource JS now only attempts to reconnect if the previous state was down, and my Chrome console is not showing any extraneous SSE connections. So I’m wondering just what is causing the so-called memory leak in the EventEmitter code on spark-server.

If anyone has encountered similar issues on a local cloud and SSEs, it’d be good to hear from your experience.

It looks like the situation has improved.

The root cause for my issue seems two-fold:

  1. my external EventSource (SSE) connections were not well-managed and were spamming the cloud server with new persistent connections, something spark-server wasn’t quite happy to deal with. I ended up writing a keep-alive interval check in JS that reconnects only when the previous connection goes down. The EventEmitter memory leak message hasn’t shown up since.
  2. I had included Spark.process() inside the blocking DHT22 acquisition loop in my code, but whenever the sensor froze up, the Core went into limbo too. Switching to an updated DHT library and a non-blocking method of checking for sensor acquisition seems to have worked for me.
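The keep-alive check in point 1 can be sketched roughly like this, a minimal version assuming the standard EventSource readyState constants from the SSE spec (0 CONNECTING, 1 OPEN, 2 CLOSED); the helper name and interval are hypothetical:

```javascript
// Reconnect only when the previous EventSource has actually gone down
// (never created, or readyState 2 = CLOSED per the SSE spec).
const CLOSED = 2; // EventSource.CLOSED

function shouldReconnect(source) {
  return source === null || source.readyState === CLOSED;
}

// In a browser this would run on an interval, e.g.:
// let es = null;
// setInterval(() => {
//   if (shouldReconnect(es)) {
//     es = new EventSource('/v1/events/dht22?access_token=...');
//   }
// }, 15000);
```

This avoids the failure mode above, where a naive client keeps opening fresh persistent connections while the old ones are still alive on the server.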

Looking into the spark-server source, particularly in EventViews001.js, revealed the usual cleanup routines done when a request and/or resource is closed.

This also turned into a good read while figuring out the SSE specification. Interestingly, the reconnection parameter as described here is commented out in the spark-server source. So if spark-server currently doesn’t support resuming SSE connections, the onus seems to be on client-side code to maintain the open connection (and reconnect if it drops for any reason).
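For context, the reconnection parameter in the spec is just a `retry:` line in the event stream, telling the browser how many milliseconds to wait before its automatic reconnect. A sketch of the wire format only (illustrating the SSE spec, not spark-server’s actual implementation):

```javascript
// Format one SSE message, optionally including the spec's "retry:" field,
// which sets the browser's automatic reconnection delay in milliseconds.
function formatSseMessage(eventName, data, retryMs) {
  let msg = '';
  if (retryMs !== undefined) msg += 'retry: ' + retryMs + '\n';
  if (eventName) msg += 'event: ' + eventName + '\n';
  msg += 'data: ' + data + '\n\n'; // a blank line terminates the message
  return msg;
}

console.log(formatSseMessage('dht22', '{"c":"23.50"}', 5000));
```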

@Dave, @nexxy, it’d be great to hear from you what’s the best approach for writing client-side SSE connections that work with Particle devices.

Meanwhile, I’ll be sending my Cores through another overnight test in the hope that the situation’s been resolved completely.

Thanks all!


Hi @chuank,

Good question! Totally, in SSE land it’s typical for the client to be responsible for resuming connections; I’m not sure the server can do much to re-open the connection if the client loses it. I wrote an example SSE client using SparkJS here that resumes when dropped.

I think some delay (at least a second or so), or an exponential backoff, is really important, so you don’t spam your own server! :slight_smile:

Thanks,
David


Thanks @Dave, I ended up implementing a similar keepalive checker that schedules a new setTimeout only after its previous instance goes down. My JS script is intended to run in users’ browser windows that are potentially left idle or asleep for days, and spamming my cloud is something I wasn’t keen on!

The exponential backoff sounds like an interesting approach – thanks for the suggestion. :wink:

Any news on when the local cloud will be updated to include device and (device+event) specific SSEs?

fwiw, I’m still getting occasional ‘EventEmitter memory leak detected’ messages in the spark-server logs, although it’s become a rarer occurrence (once daily?). It’s reported as a warning by Node’s EventEmitter rather than a hard error, and I’m seeing a consistent 55+ MB RAM footprint for spark-server, so all is well, for now. Looking forward to my Photons and rolling out a larger fleet of devices to test the server with.


The EventEmitter in the local server will give a warning if there are more than 10 listeners on a single event. It’s meant to help you find memory leaks, but if you know you’ll routinely have more, you can tell it to chill out with one of these https://nodejs.org/api/events.html#events_emitter_setmaxlisteners_n

I’m hoping the local cloud will get some love this summer. :slight_smile:

Thanks,
David
