Local cloud - SOS panic flash with user firmware [Solved]

Setting up the local cloud is becoming an adventure. Here’s my latest problem:

  • Local cloud all set up correctly as far as I can tell
  • Point the core at the local cloud
  • The core cycles through all the colours of the rainbow
  • The log says the core is connected over and over:

Connection from: 10.0.1.77, connId: 14
on ready { coreID: 'MY_CORE_ID',
ip: '10.0.1.77',
product_id: 0,
firmware_version: 6,
cache_key: undefined }
Core online!

  • The core does this about 10-15 times, then finally connects.

Here is a video of the lights:

Any ideas?

That's what I get, though mine never succeeds even after 15 times, and never shows up in "spark list", unless that's all it's supposed to do?

I just updated spark-cli (just in case) and now it really seems broken; I keep getting this with 0.3.96:

There was an error reading undefined:  [SyntaxError: Unexpected token h]
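
For what it's worth, my best guess (and it is only a guess) is a malformed JSON profile file: "Unexpected token h" smells like a bare http:// URL written into the config without quotes. On my machine the CLI keeps its profile in ~/.spark/spark.config.json; the key names below are just what's in my file, so yours may differ. Something shaped like this parses cleanly:

{
  "apiUrl": "http://10.0.1.100:8080",
  "access_token": "MY_ACCESS_TOKEN"
}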

@pixelboy:

Same here for me; see my post https://community.spark.io/t/spark-local-cloud-beta-collection-of-issues/6146 or here https://community.spark.io/t/deep-update-source-file-for-spark-cli/5872/15

We should check whether an issue is already open on GitHub for this topic. Either of us can open one so that the Spark Team knows about the problem.

Hi @Pixelboy,

Thank you for posting that video! Your core is doing an SOS, and then reporting a hard fault. We have an issue open for this here: https://github.com/spark/spark-server/issues/18

I'm trying to track down the issue, but I'm not able to reproduce it at the office. This is a big priority for me and it'll get fixed; it might take a little while, though.

Thanks,
David

@Dave - so, looking at the issue, this isn't entirely a side-effect of the local cloud then?

I only ever see it when trying to set up the local cloud (and failing!).

With a local build or the remote cloud I've never had flashing like that; only the (broken) local cloud.

Hi @sej7278,

I think the issue is related to the extreme low latency of the local cloud, and isn't necessarily local-cloud specific. I haven't been able to reproduce it yet. The local cloud uses the same libraries and modules as the larger cloud service; the only major difference is how close the server is to you (local vs. global).
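
To illustrate the kind of failure mode I have in mind (a purely hypothetical sketch, not our actual firmware code): if any step in the protocol sends and only then prepares to receive, a server answering within microseconds can win a race that a longer round trip never exposes.

// Hypothetical illustration only -- not real core-firmware code.
#include <cstdio>

static bool receive_ready = false;

void send_hello()            { /* write handshake bytes to the socket */ }
void prepare_receive_state() { receive_ready = true; }

void on_reply_arrived() {    // on a LAN this can fire almost instantly
    if (!receive_ready)
        std::printf("reply beat the setup: undefined state -> hard fault\n");
}

int main() {
    send_hello();
    on_reply_arrived();      // low latency: the reply lands here...
    prepare_receive_state(); // ...before this line has run
}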

@sej7278 - can you tell me more about how and in what environment you're hosting your local cloud, what firmware you're running, etc.?

Thanks,
David

I was hoping I had just messed something up in the setup. I was testing the Spark Core for an installation I'm doing for a client, and I am very happy with the product so far… it works flawlessly on the cloud. The catch is that the installation has no connection to the internet, so I was going to run the local version. I hope we can get to the bottom of this.

@clyde I'll take a look at your posts; perhaps two minds are better than one.

@Dave - If it is a low-latency issue, shouldn't that be readily reproducible? It sounds to me like it is something else.

Here is my setup for the local cloud…

  • MacBook Pro - 10.9.3
  • Apple Airport extreme
  • I have the most recent CLI update
  • I applied the deep update to the core but not the CC3000 patch
  • Currently running tinker tool

Should I apply the CC3000 patch?

Some other weirdness… I gave the core a name, but when I do a spark list it says the name is null.

null (MY_CORE_ID) is online
Functions:
int digitalread(String args)
int digitalwrite(String args)
int analogread(String args)
int analogwrite(String args)

Just to complicate things further… when I plugged in the core just now, it logged on to the local cloud no problem!

Here are some of my server log messages from when I ran spark list:

ListDevices { userID: 'sLNAe+tWVWRjjmAs/O9CJXMXbFN63fWf' }
isDeviceOnline: Pinging core... { coreID: 'MY_CORE_ID' }
ListDevices... waiting for connected state to settle { userID: 'sLNAe+tWVWRjjmAs/O9CJXMXbFN63fWf' }
isDeviceOnline: Device service thinks it is online... { coreID: 'MY_CORE_ID' }
10.0.1.138 - - [Mon, 04 Aug 2014 01:34:20 GMT] "GET /v1/devices?access_token=MY_ACCESS_TOKEN HTTP/1.1" 200 117 "-" "-"
GetAttr { coreID: 'MY_CORE_ID',
userID: 'sLNAe+tWVWRjjmAs/O9CJXMXbFN63fWf' }
10.0.1.138 - - [Mon, 04 Aug 2014 01:34:20 GMT] "GET /v1/devices/MY_CORE_ID?access_token=MY_ACCESS_TOKEN HTTP/1.1" 200 191 "-" "-"
1: Core disconnected: socket error Error: read ETIMEDOUT { coreID: 'MY_CORE_ID',
cache_key: undefined,
duration: 31632.486 }
Session ended for 17

Does this help at all?

I think I may have solved my problem… I don't know if this will solve the problem @clyde is having… my clue was that the core didn't seem to have a name. Just for shits and giggles I claimed the core through "spark setup", and this seems to have fixed the problem. I will test it out a bit more and attempt to put something other than the tinker tool on it.

Oh man it’s so nice to see it breathing cyan on the local cloud!

Update: If I load a compiled Arduino sketch on the core, I get the problem back again. To be clear:

  • If I load the tinker tool firmware and claim the core, it works every time
  • If I load compiled firmware and claim the core, it works the first time, but if I disconnect I'm back to square one

I don’t think this has anything to do with low latency… there is something different about doing this locally.

I took your blink-LED code and saved it as a .ino file. I used the cloud to compile it to a .bin and uploaded it to the core. I started having the same issues as before. I ran spark setup and went through all the steps… and voila! Breathing cyan and a blinking LED. It worked! Then I unplugged it, and I'm back to square one.

However, if I load tinker and claim the core, it works even if I unplug it and plug it back in.

This leads me to believe there is something fundamentally different between uploading code through the CLI and the Cloud.
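
For clarity, these are the two upload paths I'm comparing (commands from memory, so double-check them against spark help):

spark compile blink.ino              # cloud-compiles the sketch into a .bin
spark flash --usb blink.bin          # path A: flash locally over USB (dfu-util)
spark flash MY_CORE_NAME blink.bin   # path B: flash over the air through the cloud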

Let’s take a step back and look at the situation.

I was among the first few to notice this behavior: only the stock tinker firmware connects reliably to the local :cloud:.

Any other generic firmware exhibits SOS flashes when attempting to connect to the local :cloud:.

We can now reproduce the issue easily. Meanwhile, a spark-server I connected to remotely, hosted on a Mac Mini, worked fine even with firmware other than tinker.

So the issue has been isolated to a local-versus-remote server difference. The priority now is to determine what is different between a server running locally and one running remotely.

Ah, you might be onto something there @kennethlimcp, as I never use tinker or the remote cloud, only firmware I've compiled locally, and even the 0.30 firmware exhibits it.

@Dave I'm using Debian Linux locally, not on a shared server or anything: spark-cli 0.3.96, node.js v0.10.29, a deep-update/CC3000-patched Spark, WEP wifi.

@pixelboy I don't think it's possible to claim a core through the local cloud yet, is it? Are you sure you're not connecting to the remote one, and that's why it's working?

I would like to treat the issue as non-firmware-related for now, as I had success with the local cloud on a remote platform.

Claiming isn't available yet, but I remember seeing a commit recently to make it work.

Once I have tested it, I will post here. :smile:

EDIT:

I was able to run stock tinker in @gruvin's local environment without much trouble. He should be waking up happy, I guess :smiley:

@gruvin, the problem seems persistent, but I made some changes to your spark-server configuration.

This will need more debugging and a firmware guy… Maybe enabling debug output on the core-firmware side to understand which section of code isn't handshaking properly…

@sej7278 sorry, perhaps I misspoke… not claim, but setup… whenever I use spark setup and go through the process, it always connects, even with custom firmware.

Is it possible to get a .bin version of tinker? I would like to see if I still have issues after manually loading that firmware.

Thanks for that. Though for the record, I am not seeing any difference since your changes.

I too have seen it working fine, just as you showed in your screenshot. But only after the 'core eventually connects without the SOS and reboot loop. In 99 out of 100 occasions (guessing) my 'core will not connect the first time. Once it finally does, like @pixelboy, if I reset the 'core, then we're back to SOS/reboot for about 3 minutes 50 seconds, before my 'core finally connects without the SOS. From that point, all appears well and good, until the next disconnection -- whether that be by reset or any other means, like shutting down the server or severing the network connection.

As reported elsewhere, each time the SOS/reboot occurs, the server actually reports that a valid handshake occurred and that the connection completed successfully. Then of course, that 'valid' connection ID times out, because the 'core crashed. I assume this is what @pixelboy is seeing, also.

It's very interesting that this problem occurs persistently with just Tinker for me, while for @kennethlimcp and another (I think?) it seems to only be with user firmware.

Random guess/thought: This is a silly long shot in the dark, but I have had issues in other projects where my 64-bit OS X machine compiles with 64-bit length integers, causing problems when transmitting to 32-bit compiled code, which was expecting 32-bit length integers. Hence, 8 bytes arrive instead of 4, causing a buffer overrun. I would not have thought that JavaScript/node.js could have this problem. But I'll look into it anyway, in the context of the handshake's counter variable.
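
To make that concrete, here is the general shape of the hazard I mean. This is an illustration only, not anything I have actually found in the handshake code:

// Illustration of the 64- vs 32-bit hazard described above -- not real Spark
// code. 'long' is 8 bytes on 64-bit OS X/Linux but 4 bytes on the 32-bit ARM
// core, so naive serialisation sends 8 bytes where the peer expects 4.
#include <cstdint>
#include <cstring>
#include <cstdio>

std::size_t pack_counter_naive(long counter, uint8_t *buf) {
    std::memcpy(buf, &counter, sizeof(counter));  // 8 bytes on a 64-bit host
    return sizeof(counter);
}

std::size_t pack_counter_fixed(uint32_t counter, uint8_t *buf) {
    std::memcpy(buf, &counter, sizeof(counter));  // always exactly 4 bytes
    return sizeof(counter);
}

int main() {
    uint8_t buf[16];
    std::printf("naive: %zu bytes, fixed: %zu bytes\n",
                pack_counter_naive(1, buf), pack_counter_fixed(1, buf));
}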

FWIW, is @pixelboy running 64-bit Linux?

This is generally not the case for me. Once in a blue moon, going through spark setup will result in a connection without SOS directly afterwards. But by far the majority of the time, it does not. (It does eventually, after many SOS/reboot cycles.) It's a bit random, so I think you might find that further tests show less reliability in that process than you thought ... maybe. Who knows what other variables are at play, at this stage.

@gruvin,

Since you are able to compile locally, can you quickly hack this code and put some delays in?

Also, I changed some parameters in the spark-server while attempting to fix the issue, so do remember to re-clone the repo. :wink:

The other thing I wanted to try but never got to is to flash a user firmware which disables the :cloud: by default and connects manually in setup(), then watch the behavior.
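
Something like this minimal sketch is what I have in mind, assuming your core-firmware build has the SYSTEM_MODE / Spark.connect() APIs (adjust to whatever your tree supports):

// Keep the cloud off at boot, run user code first, then connect by hand.
#include "application.h"

SYSTEM_MODE(SEMI_AUTOMATIC);   // don't start the cloud connection at boot

void setup() {
    pinMode(D7, OUTPUT);
    digitalWrite(D7, HIGH);    // visible proof that user code ran pre-handshake
    delay(5000);               // let things settle, then connect manually
    Spark.connect();
}

void loop() {
    // watch the LED: does the SOS still appear now that the handshake
    // starts from inside user firmware rather than at boot?
}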

@kennethlimcp … good idea! Actually though (as reported in a different thread), I already spent a few hours putting delays all over the place, in both the firmware and the server, starting directly before and after spark_protocol.send_time_request(); and including deep inside the handshake code itself, firmware-side.

Remarkably, none of those delays made one ounce of difference. I still witnessed the exact same SOS behaviour. This surprised me greatly, and it is why I am not convinced that we're actually dealing with a race condition, even though all the evidence and past experience with such things points that way. (My delays varied from 50 ms to 200 ms.)
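
For the record, the experiments looked roughly like this on the firmware side (names come from my core-firmware checkout, so treat the exact locations as approximate):

// Approximate shape of my delay experiments; the delays ranged from 50 ms to
// 200 ms and none of them changed the SOS behaviour.
#include "application.h"
#include "spark_utilities.h"   // declares the global spark_protocol in my tree

void delayed_time_request() {  // stand-in for the real call site
    delay(100);                // directly before the time request
    spark_protocol.send_time_request();
    delay(100);                // and directly after; similar delays also went
                               // deep inside the handshake code itself
}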

Ah, awesome! That's good to know. At least we now know that a race condition might not be involved…

I still think it's really a latency issue, since the deployment on a remote Mac Mini gave reliable connections with my own user firmware.

We need to find out which portion of the handshaking is causing issues…

Will you be able to turn on DEBUG mode and watch the behavioral differences between the Spark cloud and the local cloud?

We might be able to notice a pattern which causes the failure.

Or you might want to test out the other suggestion first:

The other thing I wanted to try but never got to is to flash a user firmware which disables the cloud by default and connects manually in setup(), then watch the behavior.

Do you happen to know what version of OS X that Mac Mini was running? I don't think this is important, but I'd like to note it anyway, if possible.

Indeed. And believe me, I have tried. But I have been unable to find anything at all going wrong within the handshake code itself, on either side. I was hoping that this was only because I did not have a clear enough understanding of what should be happening, despite the good code commenting that's present in there.

Specifically, the firmware at the 'core end of the handshake appears to exit cleanly and without error (always), so far as I can see. But there are interrupts at play (I presume), so the handshake code could be getting interrupted at any point, which I cannot easily determine.

In that last regard, I was interested by @towynlin's comment (#3) in this GitHub issue. That clear evidence of asynchronous design points again to the likelihood of a race condition being the cause of our woes. Again, though, as far as I have been able to tell so far, it's not happening inside the handshake code itself. Not directly, anyway.
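
To be clear about the kind of race I mean, here is an illustrative shape (not the actual firmware code). Note how every function still "exits cleanly" even when an interrupt corrupts the work mid-flight:

// Illustrative only: an ISR and the handshake sharing state without
// protection can garble output while the handshake still reports success.
#include <cstdint>
#include <cstddef>

volatile std::size_t tx_head = 0;
uint8_t tx_buf[256];

void tx_isr() {                       // may fire between any two statements
    tx_head = 0;                      // e.g. a reset of the shared index
}

bool handshake_write(const uint8_t *p, std::size_t n) {
    for (std::size_t i = 0; i < n; i++)
        tx_buf[tx_head++] = p[i];     // not atomic; an ISR mid-loop garbles it
    return true;                      // a "clean exit" either way
}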

As a cautionary note, we should not rely solely on my work and comments to date. Someone else should go through a similar elimination process, using delays here and there. There's every chance that I may have missed something.

The firmware doesn't include any debug mode, other than commented-out logging lines. I uncommented much of that at one point, as well as adding my own in places. All of it pointed to the handshake completing without error, so far as I could tell. (But again, see the cautionary note above.) Oh wait ... no, that was on the server side. Hmmm. I'm not sure how to get debug info out of the 'core itself, yet.

I don't know how to do this.