When this all started, I was using gcc-arm-none-eabi-4_8-2013q4. But yesterday I upgraded to gcc-arm-none-eabi-4_8-2014q2 and found no improvement.
/*
if (len > QUEUE_SIZE) { // TODO add sanity check on data, e.g. CRC
    return false;
}
*/
...
Connection from: 192.168.1.136, connId: 46
on ready { coreID: '53ff6d065067544853360587',
ip: '192.168.1.136',
product_id: 65535,
firmware_version: 65535,
cache_key: undefined }
Core online!
Core SOS. So no change whatsoever on my side -- not even the extra message reported by @pixelboy.
(This was using the latest gcc version 4.8.4 20140526 [2014q2].)
When we talk about "tinker", we're really talking about the full source set from the repository, with no local user code modifications. In other words, the repository code is the Tinker app. (Oh, I see you figured that out. Yes -- the stock firmware (specifically application.cpp) is Tinker.)
In any case, the file spark_tinker.bin in the CLI tools will have been manually renamed by someone from build/core-firmware.bin.
Indeed!
I should add that I am using (in all cases) the Mac binary versions of the GNU ARM toolchain, under OS X 10.9.4.
This may be important to note, because it is remotely possible that the Mac binaries differ in some way from the Windows versions. At a guess, that could explain why (if it's even true) some people can compile without the SOS issue while others cannot, even though we are using the same toolchain version.
I suppose I could test for that myself in a Windows VM. If I get time today, I will.
I'm getting interesting results.
This will be the log output if the core managed to connect using spark_7 tinker:
Your server IP address is: 192.168.1.247
server started { host: 'localhost', port: 5683 }
Connection from: 192.168.1.216, connId: 1
on ready { coreID: '53ff6f065075535135261687',
ip: '192.168.1.216',
product_id: 0,
firmware_version: 6,
cache_key: '_0' }
Core online!
routeMessage got a NULL coap message { coreID: '53ff6f065075535135261687' }
got counter 7377 expecting 7376 { coreID: '53ff6f065075535135261687' }
1: Core disconnected: Bad Counter { coreID: '53ff6f065075535135261687',
cache_key: '_0',
duration: 0.002 }
Session ended for _0
SparkCore - sendReply before READY { coreID: '53ff6f065075535135261687' }
Connection from: 192.168.1.216, connId: 2
on ready { coreID: '53ff6f065075535135261687',
ip: '192.168.1.216',
product_id: 0,
firmware_version: 6,
cache_key: '_1' }
Core online!
routeMessage got a NULL coap message { coreID: '53ff6f065075535135261687' }
got counter 25341 expecting 25340 { coreID: '53ff6f065075535135261687' }
1: Core disconnected: Bad Counter { coreID: '53ff6f065075535135261687',
cache_key: '_1',
duration: 0.004 }
Session ended for _1
Connection from: 192.168.1.216, connId: 3
on ready { coreID: '53ff6f065075535135261687',
ip: '192.168.1.216',
product_id: 0,
firmware_version: 6,
cache_key: '_2' }
Core online!
routeMessage got a NULL coap message { coreID: '53ff6f065075535135261687' }
got counter 63852 expecting 63851 { coreID: '53ff6f065075535135261687' }
1: Core disconnected: Bad Counter { coreID: '53ff6f065075535135261687',
cache_key: '_2',
duration: 0.004 }
Session ended for _2
Connection from: 192.168.1.216, connId: 4
on ready { coreID: '53ff6f065075535135261687',
ip: '192.168.1.216',
product_id: 0,
firmware_version: 6,
cache_key: '_3' }
Core online!
I'm testing with SYSTEM_MODE(SEMI_AUTOMATIC), and the function causing the SOS red flashes is Spark.connect().
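For reference, a minimal sketch of that setup (assuming nothing beyond the standard core-firmware application structure; the actual user code may differ):

#include "application.h"

SYSTEM_MODE(SEMI_AUTOMATIC);   // start up without connecting to the cloud

void setup()
{
    // The SOS red flashes appear when this call kicks off the cloud handshake.
    Spark.connect();
}

void loop()
{
}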
Also, another successful connection:
Connection from: 192.168.1.216, connId: 16
on ready { coreID: '53ff6f065075535135261687',
ip: '192.168.1.216',
product_id: 65535,
firmware_version: 65535,
cache_key: '_15' }
Core online!
Connection from: 192.168.1.216, connId: 17
on ready { coreID: '53ff6f065075535135261687',
ip: '192.168.1.216',
product_id: 65535,
firmware_version: 65535,
cache_key: '_16' }
Core online!
1: Core disconnected: socket error Error: read ECONNRESET { coreID: '53ff6f065075535135261687',
cache_key: '_1',
duration: 285.665 }
Session ended for _1
Connection from: 192.168.1.216, connId: 18
on ready { coreID: '53ff6f065075535135261687',
ip: '192.168.1.216',
product_id: 65535,
firmware_version: 65535,
cache_key: '_17' }
Core online!
1.) It seems to me that the server is not handling the connection properly: a socket error will result in a successful connection thereafter. Is there somewhere I can make a change so it times out earlier? That might help create a test case.
2.) The previous sockets start to get socket error Error: read ECONNRESET and get closed one after another the moment a core comes online. I think the key lies in duration: 620.017: when the server starts to kill connections around that timing, the connection gets through. We need to get it to error out faster on the local cloud. However, I tried digging but couldn't figure out which part to change...
3.) The Core disconnected: Bad Counter path, which also somehow killed connections/closed sockets, worked well too...
The culprit for the red SOS is the call to Multicast_Presence_Announcement() within Spark_Handshake() in spark_utilities.cpp.
Temporary solution till we make an official release: comment out Multicast_Presence_Announcement() (since it uses a hardcoded IP), and the local cloud should work fine.
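Illustratively, the workaround in spark_utilities.cpp looks like this (surrounding handshake code abbreviated; the exact function body differs):

void Spark_Handshake(void)   // signature abbreviated; see spark_utilities.cpp
{
    // ... existing handshake steps ...

    // Temporarily disabled: multicasts to a hardcoded IP and triggers the
    // SOS HardFault when connecting to a local cloud.
    //Multicast_Presence_Announcement();

    // ... remainder of the handshake ...
}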
Can you comment more on what the issue is, and why it does not present itself between the core and the Spark cloud?
Thanks for the work!
I got really curious and started digging, but was unable to determine why sending a multicast broadcast message caused the SOS...
@kennethlimcp, I am not sure why we multicast to address 224.0.1.187 and port 5683, as my knowledge of the server code is limited. The firmware commit (Multicast CoAP presence announcement) was made here: https://github.com/spark/core-firmware/commit/b717286dbcffedf211c343cc820c762e70a782d2
Also found that calling UDP's sendto() just once instead of thrice, as below, gets the core connected to the local cloud.
//for (int i = 3; i > 0; i--)   // loop disabled: send the announcement only once
{
    sendto(multicast_socket, announcement, 19, 0, &addr, sizeof(sockaddr));
}
The solution now is to multicast to localhost:5683 for the local cloud and to 224.0.1.187:5683 for the Spark cloud.
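A hypothetical sketch of that selection, filling the destination in the same CC3000 sockaddr style as the existing announcement code (local_cloud is an assumed flag, not an actual core-firmware variable):

sockaddr addr;
addr.sa_family = AF_INET;
addr.sa_data[0] = 0x16;   // port 5683, high byte (0x1633 == 5683)
addr.sa_data[1] = 0x33;   // port 5683, low byte

if (local_cloud)
{
    // 127.0.0.1 -- announce only to the local server
    addr.sa_data[2] = 127;
    addr.sa_data[3] = 0;
    addr.sa_data[4] = 0;
    addr.sa_data[5] = 1;
}
else
{
    // 224.0.1.187 -- the IANA "All CoAP Nodes" multicast group
    addr.sa_data[2] = 224;
    addr.sa_data[3] = 0;
    addr.sa_data[4] = 1;
    addr.sa_data[5] = 187;
}

sendto(multicast_socket, announcement, 19, 0, &addr, sizeof(sockaddr));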
@kennethlimcp, were you able to connect to the local cloud by commenting out Multicast_Presence_Announcement()?
I don't have much knowledge in this area, but it seems like 224.0.1.x is a common multicast range used locally and shouldn't affect whether it's connecting to the local or the Spark cloud.
How did you trace this down?
I have not set up my local build environment, so it's hard for me to test this. Shall wait for @gruvin!
Interesting. Found this at http://www.iana.org/assignments/multicast-addresses/multicast-addresses.xhtml: 224.0.1.187 -> All CoAP Nodes -> http://www.iana.org/go/rfc7252
Will check with our CoAP lead @zachary tonight as to why the presence announcement is made to IP 224.0.1.187 after the handshake.
But I can confirm that while testing on the local cloud, commenting out Multicast_Presence_Announcement() in Spark_Handshake() prevents the core from entering the SOS HardFault handler. Awaiting @kennethlimcp and @gruvin to reconfirm this finding.
I have made a commit for this fix on branch: https://github.com/spark/core-firmware/tree/feature/new-led-interactions
Please do "git checkout feature/new-led-interactions" and rebuild core-firmware to test the fix.
@satishgn @kennethlimcp I just compiled after commenting out that line. It works... but the behavior of the LED is a bit unusual. When I unplug the power and plug it back in, the light breathes blue for 3-6 seconds, then flashes green once and flips to breathing cyan... it bypasses the green flashing connect phase. It works, but I'm not sure what the lights mean.
That's the new LED behavior for v0.3.1 and it's normal. I haven't checked the breathing blue status light, but the behavior is the same when I tested the latest firmware from the Web IDE.
It's weird that you did not manage to get a blinking green. I will only be able to test this weekend, and it would be nice for someone else to report back.
Great that you managed to use the local cloud without any issues now!
For the same reason, I had mentioned building the firmware after doing "git checkout feature/new-led-interactions", which fixes the various LED state changes as shown below. It will be merged into master soon and released.
WIFI OFF => Breathing WHITE
WIFI ON and LISTENING (OR profiles not found) => Blinking BLUE
WIFI DISCONNECTED (WiFi.disconnect() called) => Breathing BLUE
WIFI CONNECTING (trying to connect to stored profiles) => Blinking GREEN
WIFI CONNECTED (IP address issued) => Breathing GREEN
CLOUD CONNECTING => Blinking CYAN
CLOUD CONNECTED => Breathing CYAN
Thanks for tracking this down!
Yes... thank you for your help!
I have tested with the master-build default Tinker firmware and it seems to be working fine with the local cloud. Will test with some custom firmware later.
Thanks!
BOOM!
So, I re-discovered this bug during our brief window of upgrading Node.js on production, which is why we had to roll that back briefly. After a really deep dive into the server / firmware code, I think I found the real cause of this. I just rolled it out to the local server (spark-protocol master branch) as well.
I'll test it some more, and then roll it out to the npm installer when it's ready. If someone wanted to test this and let me know if it helps, that'd be grand.
Thanks,
David
Hi Dave,
I keep having problems using my local cloud server and I'd really like to try this. In order to do that, should I just update the spark-server folder from the git repository and run the server again after that?
Thanks,
Yan