Bug bounty: Kill the 'Cyan flash of death'

This smells like a race condition to me, so adding timeouts is just a kludge that kicks the can down the road.

@mtnscott there are two potential issues to debug: why the Core disconnects, and why it fails to re-connect.

For the former (why it disconnects), the failure, I believe, occurs somewhere in the SparkProtocol::event_loop() function:

As part of the main loop (main.cpp:142), Spark_Communication_Loop() is called (main.cpp:178, spark_utilities.cpp:402). This returns the results of spark_protocol.event_loop() (spark_utilities.cpp:404), which is defined in core-communications-lib, spark_protocol.cpp:125. If this function returns false, I believe that's what causes the CFOD (main.cpp:185).
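
For anyone following along without the source open, here is a rough sketch of that call chain as I read it. Everything in it (FakeProtocol, communication_loop, main_loop_pass, passes_until_disconnect) is a made-up stand-in for illustration, not the actual firmware code; the only thing it tries to capture is that a single false return from event_loop() is enough to drop the connected state.

```cpp
#include <cassert>

// Hypothetical stand-in for SparkProtocol; only the return value of
// event_loop() matters for this sketch.
struct FakeProtocol {
    int ok_calls_left;  // successful iterations remaining before a failure

    bool event_loop() {
        if (ok_calls_left <= 0) return false;  // simulated protocol failure
        --ok_calls_left;
        return true;
    }
};

enum class CloudState { Connected, Disconnected };

// Mirrors the described Spark_Communication_Loop(): forwards the result.
bool communication_loop(FakeProtocol& protocol) {
    return protocol.event_loop();
}

// One pass of the main loop: a false return drops the connected state.
CloudState main_loop_pass(FakeProtocol& protocol, CloudState state) {
    if (state == CloudState::Connected && !communication_loop(protocol)) {
        return CloudState::Disconnected;  // where the cyan flashing would begin
    }
    return state;
}

// Returns how many passes completed before the state dropped to
// Disconnected, or max_passes if it never did.
int passes_until_disconnect(FakeProtocol& protocol, int max_passes) {
    assert(max_passes >= 0);
    CloudState state = CloudState::Connected;
    for (int pass = 0; pass < max_passes; ++pass) {
        state = main_loop_pass(protocol, state);
        if (state == CloudState::Disconnected) return pass;
    }
    return max_passes;
}
```

If this reading is right, the interesting question is which path inside event_loop() returns false, since everything above it just passes the result along.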

Now as for the second issue: in the SPARK_WLAN_Loop(), if the Core is not connected, it calls Spark_Connect() (spark_wlan.cpp:533). For some reason, this is not re-initiating the connection like it's supposed to. Because we never see any further changes to the LED, it suggests to me that Spark_Connect() may be blocking forever.
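
One way to check the "blocking forever" guess would be to count entries and returns around the call, so that a debugger (or a spare LED) can show whether a call is still in flight. This is only a sketch under my own assumptions: CallTrace, TraceGuard and traced_connect are hypothetical names, and the connect argument stands in for whatever Spark_Connect() actually does.

```cpp
#include <cassert>
#include <functional>

// Hypothetical counters; on real hardware these would live somewhere a
// debugger can read them after halting the Core.
struct CallTrace {
    int entered;
    int returned;
    CallTrace() : entered(0), returned(0) {}
    bool in_flight() const { return entered > returned; }
};

// RAII guard: counts the entry on construction and the return on destruction.
class TraceGuard {
public:
    explicit TraceGuard(CallTrace& trace) : trace_(trace) { ++trace_.entered; }
    ~TraceGuard() { ++trace_.returned; }
private:
    CallTrace& trace_;
};

// Wraps a connect attempt so that a call which never returns leaves
// trace.in_flight() stuck at true.
bool traced_connect(CallTrace& trace, const std::function<bool()>& connect) {
    assert(connect);  // an empty std::function would throw when called
    TraceGuard guard(trace);
    return connect();
}
```

If the Core is halted while the LED is frozen and in_flight() is true, the connect call never came back, which would support the blocking theory; if it is false, the problem is somewhere else in the loop.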

dermotos - take a look at this thread - https://community.spark.io/t/core-without-cloud-fixed/1072

Thanks, but I can't really understand that. I'm currently using the cloud IDE, and I haven't had much luck figuring out how to do it all locally.

@dermotos Just follow the instructions from https://community.spark.io/t/the-spark-core-firmware/532

I'm an ARM newbie and got it to work (with OS X)!

Best Wishes,

Frido

I'm testing a potential fix to CFOD (deactivating the watchdog reset), which I have committed to a branch:

https://github.com/spark/core-firmware/tree/feature/debug-cfod

(core-communication-lib and core-common-lib stay on the master branch)

Unfortunately it takes a while to run the test, but if anyone wants to get their Core going with this and tell me if it fixes the issue, let me know!

Here's my test:

Is it possible to perhaps post the .BIN so that I only have to install the dfu upload utility instead of the whole toolchain? I'm able to reliably get a CFOD, so I think I'd be a good test case. I'm just not sure I'm ready to load up yet another tool-chain onto my laptop.

Dave O

If you follow the Github link Zack provided, there appears to be a .BIN file in the Build folder

Indeed! Thanks very much. I hadn't even thought to look there. I'll get dfu installed and give it a whirl!

Dave O

Awesome, thanks for your help!

The annoying thing about this bug is that it takes forever to test. :tired_face:

I just flashed my Core with this firmware over USB. If I do an OTA upload of a new application through the Spark Cloud, does that also overwrite the firmware each time?

Yup, it sure does! Anything you flash OTA is compiled from the compile-server2 branch and not master.

OK, perfect. I have one Core running now with code that was frequently freezing, with plenty of calls through the Spark Cloud to both functions and variables.

After 11 minutes my Core got stuck in a cyan blinking loop, and it took about 2.5 minutes for it to restart, reconnect to the network, and start running the code again.

So I've got the fix running on my Core displaying a countdown on an I2C OLED display.

So my Core froze up again, this time breathing the cyan LED nicely, but no calls through the Spark Cloud API go through. They all return a 408.

After 10 minutes of no response I pressed reset on the Core, and it connected straight back up and became responsive through the cloud API.

@zach Just downloaded the changed main.h file and rebuilt my firmware, I get further, but my core still goes into CFoD. This time it took 67 mins, before it was ~30. Sorry :frowning:

I was able to upload the core-firmware.bin by downloading it from git and then just using the dfu-util to flash it over. However, that will give me "Tinker" installed on my core and not my other simple app (which I know will CFOD within 120 minutes.)

I assume that pushing code via a REST call results in the same pipeline being exercised and that it will revert my core to the compile-server2 branch as well. So it sounds like if I want to test this out, I need to get the dev environment set up?

Trying to help out here, not add to the burden! :slight_smile:

Dave O

@sjunnesson that actually sounds like a different issue that has to do with blocking code and long delays in the loop().

Is that true of your code?
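
If it is, the usual Arduino-style workaround is to compare timestamps instead of calling delay(), so loop() returns quickly on every pass. A minimal sketch of the idea, assuming the timestamp passed in comes from millis(); PeriodicTask is a hypothetical helper written for this post, not something in the firmware:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: reports when a fixed interval has elapsed without
// ever blocking. On the Core, now_ms would be the value of millis().
struct PeriodicTask {
    uint32_t interval_ms;
    uint32_t last_run_ms;

    explicit PeriodicTask(uint32_t interval)
        : interval_ms(interval), last_run_ms(0) {
        assert(interval > 0);
    }

    // Returns true at most once per interval. The unsigned subtraction
    // stays correct when the millisecond counter wraps around.
    bool due(uint32_t now_ms) {
        if (now_ms - last_run_ms >= interval_ms) {
            last_run_ms = now_ms;
            return true;
        }
        return false;
    }
};
```

Inside loop() that would look something like `if (task.due(millis())) { /* do the slow thing */ }`, with no delay() anywhere, so the rest of the main loop keeps getting serviced.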

Yes, that's correct - to test out firmware-in-progress, you'll have to build on your own machine.