Local cloud: Persistent hard fault after core disconnection

Hi

I started playing around with a local cloud spark-server, now that it’s been released. I have a core connecting to a local server and it seemed to be working fine. Yay \o/

However, I quickly noticed a problem. Perhaps the easiest way to explain it is with some steps so others can replicate it …

1: Start spark-server (node main.js).

2: Power up a pre-configured, ready-to-run local 'core (see the configuration sketch just after these steps for what “pre-configured” means here).

The 'core connects, starts breathing cyan and otherwise works as expected. So far, so good.

3: Cause a disconnection between the core and server by any of the following means …
a) reset the core
b) power cycle the 'core
c) terminate the spark-server and re-launch it (optionally reset/power-cycle the 'core.)

At this point, the 'core can no longer establish a working connection (red flashing LED, then reboot, in an endless loop) in spite of the spark-server continuously reporting that it has …

server started { host: 'localhost', port: 5683 }
Connection from: 192.168.1.120, connId: 1
on ready { coreID: '53ff6d065067544853360587',
  ip: '192.168.1.120',
  product_id: 65535,
  firmware_version: 65535,
  cache_key: undefined }
Core online!
Connection from: 192.168.1.120, connId: 2
on ready { coreID: '53ff6d065067544853360587',
  ip: '192.168.1.120',
  product_id: 65535,
  firmware_version: 65535,
  cache_key: undefined }
Core online!
Connection from: 192.168.1.120, connId: 3
on ready { coreID: '53ff6d065067544853360587',
  ip: '192.168.1.120',
  product_id: 65535,
  firmware_version: 65535,
  cache_key: undefined }
Core online!

… and so on.

4: Cycle power on the core

No change

5: Power off core, terminate server, restart server, power up core

Still no change!

6: Wait for at least 3 minutes (maybe more) and repeat step 5. (Sometimes just waiting a long time while the core keeps rebooting will eventually bring it back online.)

Finally, the core connects again. But things repeat just the same, if we go back to step 3.
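
For reference, “pre-configured” in step 2 means roughly the following on my setup. The profile name, IP address and key file name are placeholders from my environment and my reading of the spark-server README, so adjust to suit:

# point spark-cli at the local server's HTTP API (port 8080 by default)
spark config local apiUrl http://192.168.1.10:8080
spark config local

# with the core in DFU mode (flashing yellow), load the local server's
# public key and IP address onto the core so it connects here rather
# than to the public cloud
spark keys server default_key.pub.pem 192.168.1.10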

So in summary, the server reports that the 'core connected OK, but the 'core itself just barfs blood red laser light. Ouch. :stuck_out_tongue:


The reboot cycle LED flashing goes like so …

Solid cyan, fades, then rapid green flash, then rapid cyan for about half a second, then red flashing, which usually (but not always) appears to be “SOS” in continental Morse code.


My guess is that a re-connection occurring before some time-out period has passed results in a response being sent to the core that it does not understand – or perhaps the timing is just too fast on a local system, or something? But I have no idea how to debug this problem further.

Any clues anyone?

I reported this issue on github.

If you run the Tinker app on the core, it should work well. Other user firmware will exhibit the red-flash behaviour.

I’ll hunt that down then. Thanks.

I'm only running Tinker. I didn't want to introduce any other variables. For clarity, that means I have the freshly checked-out, latest version of core-firmware. Oh, wait, maybe I don't? I used my forked copy! Hmmm. I'd better resolve that and get back shortly ...


I now have the latest versions of core-communication-lib/common-lib/firmware and I’m still having the same problem. I need to know how to debug this at the core end of things – or at least see raw logs of the network traffic, somehow.
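
For the network side, at least, something like this on the server machine should capture the raw traffic, since the core talks to spark-server on TCP port 5683 (en0 is assumed to be the Mac’s WiFi interface; adjust to suit):

# capture the core <-> spark-server traffic for later inspection in Wireshark
sudo tcpdump -i en0 -s 0 -w core-handshake.pcap tcp port 5683

The session is encrypted after the handshake, so the payloads won’t be readable, but the timing, direction and size of the packets around the handshake should at least show where things fall over.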

This may only be happening with the latest “deep update” in the core, too. Alas, I updated all my cores, so I cannot test that possibility without a means of downgrading the CC3000 firmware.

Hmmm. Just tested resetting my cloud connected core. I cannot produce any problems with that one and it has been updated also.

I’ve now had a pretty good look through the related local spark-server js code. I cannot find anything to validate my “timing issue” guess from above.

So at this point, without being able to see what is going on inside the 'core directly after being sent the final Hello message in the handshake sequence (presumably the time when it barfs its “SOS”) I can go no further. :frowning:

Everything up to and including that final Hello message looks fine and dandy from the server’s point of view. The 'core is “online”. But the 'core itself is in fact very much not online.

Hi @gruvin,

Hmm. That’s really odd. What branches are you pulling / compiling from the repos? And you’re definitely using just Tinker, with no changes showing if you run git diff or git status?

The spark-protocol module should be almost exactly the same as the main server, and an SOS on the core indicates some kind of firmware crash / issue. Have you tried compiling tinker against the compile-server2 branches?
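
In case it helps, that would look something like this (assuming the compile-server2 branch exists in all three repos; adjust if not):

cd core-communication-lib && git checkout compile-server2 && cd ..
cd core-common-lib && git checkout compile-server2 && cd ..
cd core-firmware && git checkout compile-server2
cd build && make clean program-dfu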

Thanks,
David

Master branch, no changes whatsoever. Fresh clones of all three. Evidence follows ...

spark-test bryan$ git clone https://github.com/spark/core-communication-lib.git
Cloning into 'core-communication-lib'...
remote: Reusing existing pack: 1504, done.
remote: Counting objects: 15, done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 1519 (delta 3), reused 0 (delta 0)
Receiving objects: 100% (1519/1519), 5.72 MiB | 342.00 KiB/s, done.
Resolving deltas: 100% (952/952), done.
Checking connectivity... done.
spark-test bryan$ git clone https://github.com/spark/core-common-lib.git
Cloning into 'core-common-lib'...
remote: Reusing existing pack: 1510, done.
remote: Total 1510 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (1510/1510), 4.73 MiB | 941.00 KiB/s, done.
Resolving deltas: 100% (912/912), done.
Checking connectivity... done.
spark-test bryan$ git clone https://github.com/spark/core-firmware.git
Cloning into 'core-firmware'...
remote: Reusing existing pack: 6366, done.
remote: Total 6366 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (6366/6366), 60.75 MiB | 2.15 MiB/s, done.
Resolving deltas: 100% (4366/4366), done.
Checking connectivity... done.
spark-test bryan$ cd core-firmware/build/
build bryan$ make program-dfu
...
...
DfuSe interface name: "Internal Flash  "
Downloading to address = 0x08005000, size = 78864
..............................................................................
File downloaded successfully

I'm gonna try re-generating keys and stuff, to see if it makes a difference. Doubt it though, since the core does connect the first time (usually).
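
For anyone following along, “re-generating keys and stuff” amounts to roughly this (the core_keys path and file naming are how I remember the spark-server README describing them, so double-check):

# with the core in DFU mode (flashing yellow), regenerate its keypair and
# push the new public key to the currently selected (local) cloud profile
spark keys doctor 53ff6d065067544853360587

# the local server keeps core public keys under core_keys/, named by core ID
ls spark-server/core_keys/53ff6d065067544853360587.pub.pem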

Oh, and if it helps, I notice now that after the SOS on the red LED there is a short delay, then one single flash. Does the firmware try to say what the error was by counting out flashes?

Oh! I just noticed something I didn’t before.

EDIT: No, false observation. It’s not consistent, after all. sigh

Hmmm … just now, I’m seeing something new and kind of opposite to before …

Let’s not dwell on this particular scenario, since it’s only happened once and will no doubt just lead us astray. But, for the record …

I ran spark keys doctor <core_id> to reinstall keys. Now, after my 'core eventually satisfies itself of a valid connection and sits there breathing cyan, the server is displaying this …

1: Core disconnected: socket error Error: read ECONNRESET { coreID: '53ff6d065067544853360587',
  cache_key: undefined,
  duration: 12.745 }
Session ended for 28
Connection from: 192.168.1.120, connId: 29
on ready { coreID: '53ff6d065067544853360587',
  ip: '192.168.1.120',
  product_id: 65535,
  firmware_version: 65535,
  cache_key: undefined }
Core online!
1: Core disconnected: socket error Error: read ETIMEDOUT { coreID: '53ff6d065067544853360587',
  cache_key: undefined,
  duration: 754.44 }
Session ended for 29
1: Core disconnected: socket error Error: read ETIMEDOUT { coreID: '53ff6d065067544853360587',
  cache_key: undefined,
  duration: 618.491 }
Session ended for 29

… yet the core just keeps breathing cyan, like nothing is wrong. :-/

The following post links to a video, recorded minutes after the above, when things returned to “normal”.

I made a video to show exactly what I’m seeing, with the timing of events made clear.

http://www.screencast.com/t/juMQ3yDt

I have not been able to reproduce my previous post’s scenario for the video. So, the video shows only the originally reported problem.

The docs say that SOS, one flash, SOS indicates a hard fault, which usually follows a null pointer dereference, a memory overwrite, or some such thing.


Hi @gruvin,

Thank you for recording that video! We’re noticing what we think might be a race condition in the firmware after handshake that only happens when running in extremely low latency scenarios like the local cloud.

Those error messages (`error: read ETIMEDOUT`) generally mean that sockets from old connections are timing out and getting cleaned up. You can tell which are which by matching the ‘connId’ property between the connect / disconnect events. Another way to test and make sure your core is really online is by sending it a command like a function call / variable request.
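
With Tinker, that can be as simple as the following, assuming you have an access token for your local account and the default API port of 8080 (parameter names here follow the public cloud API, so adjust if the local server differs):

# ask Tinker to read a pin; getting a return_value back proves the core
# really is online, whatever the connection log says
curl http://localhost:8080/v1/devices/53ff6d065067544853360587/digitalread \
  -d access_token=YOUR_LOCAL_ACCESS_TOKEN \
  -d params=D0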

It looks like you’re posting in the issue thread as well, so I’ll try to pick up there :slight_smile: If I can find a way to reliably reproduce this bug here, I’ll be much closer to fixing it.

Thanks!
David

Yes. The timeouts align with the connection that was briefly established before the hard fault at the core end and the subsequent reboot. I probably cut a bit too much out of the middle of the video for that to be clear.

I’ll move to the issue thread.


If you think you might have the skills to have a go at resolving this local cloud problem on MY system, then please let me know.

The issue is described in detail in the posts above. Apparently, I am the only person having this problem in a reliably replicable manner, so far. So, I have set up a remotely accessible Spark Core and a *nix development shell, to allow someone with greater skills/knowledge than I to work on the problem.

There is an SSH login awaiting you, with a live auto-answer (Skype) video feed so you can watch the core in real time. I have written shell scripts that let you place the 'core in DFU or listening mode, factory reset it, etc. (via a custom USB interface connected to the 'core’s buttons). I can send a full set of notes/instructions.

Please PM me login details and send a contact request to Skype user gruvin_nz.

Failure to find a solution is completely acceptable! :slight_smile: You are most welcome to have a go, even just for fun or whatever.

Many thanks!

P.S: @Dave has already offered to try and get someone at Spark to do this. But I believe they are currently overwhelmed with more important workloads. Meanwhile, I have a Sparkcore and to some extent an iMac tied up 24/7, waiting to facilitate this offer. So any help would be much appreciated. Thanks everyone.

P.P.S: The only editor currently installed is vi (vim – my personal preference.) If you prefer something else (shell capable, of course) then let me know and I’ll organise it. Thanks.

Hey @gruvin, I can help you with that, probably.

I previously helped debug a spark-server hosted on a Mac Mini.

You're more than welcome to have a go. I'll PM the login details and notes I sent to @Dave previously.

Note that this is a stock-standard, master-branch Tinker and local cloud server. No custom user code involved whatsoever. :wink:

EDIT: PM sent.

@kennethlimcp @Dave I’m getting the same issues as @gruvin. I have setup the local cloud following the exact steps mentioned on the GitHub page.

It will get stuck in the loop of: [white -> green -> cyan_rapid_blink -> SOS_red] …
Meanwhile, server output for the [n]th iteration is as follows:

Connection from: 192.168.0.198, connId: [**n**]
on ready { coreID: '55ff73065075555326351787',
  ip: '192.168.0.198',
  product_id: 65535,
  firmware_version: 65535,
  cache_key: undefined }
Core online!

… until it successfully makes a connection and starts breathing cyan. Once it is breathing cyan I can actually talk to the core with POST calls but the server also outputs this every so often:

1: Core disconnected: socket error Error: read ETIMEDOUT { coreID: '55ff73065075555326351787',
  cache_key: undefined,
  duration: 617.946 }
Session ended for [**n**]

The session doesn’t actually end; the server just keeps spitting that out, while I’m still able to perform POST calls.

@ndevries ... we ended up thrashing this out more over at this other thread: Local cloud - core attempts to connect up to 10 times before it’s....

I found a work-around, noted in post #79 (roughly) from me, over there. Look for the heading ...

"So we to have a temporary workaround"
