Bug bounty: Kill the 'Cyan flash of death'

I’m really digging the improved stability so far. Thanks everyone for all of your hard work discussing, testing and fixing! :blue_heart: :spark:

2 Likes

@BDub Same here man, I can’t get it to jam up now :smile:

Everything that used to eventually cause a lock up no longer causes lockups! Imagine that!

“That’s one small step for man, one giant leap for mankind.” :rocket:

Thanks @david_s5 ! What are you gonna do with all them new Spark Cores?

1 Like

Congratulations!
This is the power of the open source community!
All honors guys!

1 Like

Hi guys, congratulations and a big thank you to everyone! I’ve noticed my Spark getting significantly more stable over the past several weeks. :smile:

To get the latest update, I just have to clone the repos and stay on the Master branch right?

1 Like

@nitnut Great to hear about the improved stability.

The master branch has the latest and greatest code updates by the Spark team and contributors, but those haven’t been as thoroughly tested by the community.

If you build locally, the latest stable GitHub branch is actually compile-server2. This branch matches the firmware pushed to your Core via the Web UI.

3 Likes

I was so excited to get these fixes pushed to the compile-server2 branch (i.e. what the Web based IDE uses) and try them out. You can only imagine how disappointed I was when I found my core CFoD-ing with great regularity. :frowning:

But wait! I read the post listed below and discovered that if you flash the same sketch WITHOUT MODIFYING ANYTHING, it just pushes the "old" compiled binary to the Spark. It doesn't re-compile with the new changes.

So I changed my sketch, hit the "Verify" button, and pushed it to the Spark. My sketch has been running for almost 4 hours now and I've had two "recoveries" which normally would have CFOD-ed the Spark, but it just re-connects and continues on. This is just awesome!
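
For anyone wondering why an untouched sketch would get the "old" binary: one plausible explanation is a build cache keyed only on the sketch source. That is purely my assumption, since nothing in this thread describes how the Web IDE actually caches builds. Here is a toy Python sketch of that idea. The class name, the version labels and the fake "binary" strings are all made up for illustration.

```python
import hashlib


class SourceKeyedBuildCache:
    """Hypothetical build server that caches binaries by sketch source only."""

    def __init__(self):
        # Stand-in label for the firmware libraries the server links against.
        self.firmware_version = "old"
        self.cache = {}

    def build(self, sketch_source):
        # The cache key ignores firmware_version, which is the whole problem.
        key = hashlib.sha256(sketch_source.encode("utf-8")).hexdigest()
        if key not in self.cache:
            # "Compile" the sketch against whatever firmware is current.
            self.cache[key] = f"{self.firmware_version}-firmware+{key[:8]}"
        return self.cache[key]


server = SourceKeyedBuildCache()
sketch = "void loop() {}"

first = server.build(sketch)                  # compiled against old firmware
server.firmware_version = "new"               # the fixes land on the server
stale = server.build(sketch)                  # same source: cached binary reused
fresh = server.build(sketch + " // touched")  # any edit forces a recompile
```

In this toy model `stale` is the same binary as `first`, while `fresh` is built against the new firmware, which matches what I saw after touching my sketch.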

Thank you so much, @david_s5, for your solution to this. Also many thanks to so many others for helping out. Many hands make light work!

Dave O

3 Likes

Exact same thing happened to me :smile:

It’s working out great for me also. I’m putting it through some pretty abnormal situations too, and it’s doing just fine.

1 Like

Quick update re: TI; they sent us an engineering drop for a potential firmware fix but it didn’t have any effect. @mohit has provided them feedback, so we’re still iterating on the underlying root cause with Texas Instruments. Will keep you posted as we hear more!

5 Likes

Update: We’ll be receiving the second round of CC3K patches from TI this Sunday. These patches will focus on resolving the invalidation of the ARP table inside the module. Shall post the results upon testing!

9 Likes

This bug doesn’t stand a chance! Its days are numbered…
:slight_smile:

3 Likes

Any news regarding the cc3k patches?

2 Likes

Thanks for checking in @altery
The patch that was supposed to arrive last week came out this morning. We are testing it right now. Will report shortly.

7 Likes

:+1: any luck with latest firmware from TI? @mohit

1 Like

OK - finally some good news.
We received a firmware workaround to test from TI on Monday.
Results so far are good, including a stress test that has run without error for 42 hours and over 200,000 passes - numbers we have never been able to achieve before.
Important: this test has not required any of the other error recovery mechanisms, the core has not rebooted during this time, a single connection to the cloud has been maintained, no transactions have been lost etc etc.

These are good test results, but that’s not the end of it, yet.

In summary:

  • We should soon be able to release a patch to the CC3000 firmware that vastly reduces the probability of CFOD.
  • The exact mechanism for this release is currently under development.
  • There will need to be subsequent patch(es) to the CC3000 that will fix the root cause of CFOD.

Here are the gory details for those interested:

  1. This workaround has not yet been put through TI’s official test and release procedure; we have a conference call with them tomorrow morning (USA time) to discuss their testing & release schedule. They are looking for a slot in their test plan so that they can release a formal patch as quickly as possible. Look for a further update after that meeting.
  2. We here at Spark need to package up CC3000 firmware upgrades in a manner that is: straightforward; reliable; easy to comprehend; and simple for all users.
  3. I use the word workaround above intentionally. This firmware update does not fix the root cause of the CC3000 failures that lead to CFOD; however it does appear to avoid the most common situations that lead to the CC3000 failure, and thence CFOD. Here is what we believe is going on - I’ll apologise in advance if parts of this are not explained clearly - we have been deep inside this problem for so long it feels like family.
  • The root cause of the CC3000 failures that lead to CFOD is buffer starvation and/or allocation issues that can result in a deadlock.
  • The situation is that the CC3000 has buffers to send, but finds it also needs buffers for either ARP packets or TCP protocol packets before it can proceed and transmit the data, but there are none available.
  • This is a complex problem and anyone who has written a TCP/IP stack on limited resource hardware is familiar with these kinds of issues.
  • For whatever reason, fixing this in the CC3000 is proving extraordinarily difficult.
  • In addition, the current behaviour of the CC3000 is to continually update its ARP cache based on packets it receives, regardless of whether those packets form part of ongoing traffic through the CC3000.
  • With this behaviour, if the ARP cache is already full, then a random ARP cache entry is chosen and replaced.
  • In the Core, if that ARP entry ejected is the one for the default gateway, and there are already packet(s) enqueued in the CC3000 ready to be sent to the cloud, then the CC3000 must ARP to find the MAC address of the default gateway.
  • This is apparently when the CC3000 can find itself in a deadlock, needing buffers to send the ARP request and process the reply, but not having any available.
  • So there are a series of events, each individually contributing to the probability that the CC3000 will fail.
  • This explains why: busy networks made CFOD more likely (more packets to pollute the ARP cache); busy applications made CFOD more likely (more chance that the CC3000 will have pending TCP traffic to the cloud at any moment); time to failure was highly variable (the random chance that the ARP entry ejected would belong to the Core’s TCP connection to the cloud.)
  • The fix we have been testing stops the automated update of the ARP cache based on packets received. It is now only updated on a need-to basis.
  • So, while it does not fix the root cause resource allocation/starvation problem, it appears to vastly reduce the probability that this bug will be triggered in practical use.
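
To make the ARP side of this more concrete, here is a minimal Python simulation of the behaviour described above. It is a toy model under stated assumptions, not the CC3000's actual implementation: the cache size, host count, IP addresses and all class and function names are hypothetical. It only counts how often the gateway entry has been ejected at the moment the device wants to send, which is the window in which the buffer deadlock can apparently occur. It does not model the buffers or the deadlock itself.

```python
import random


class ArpCache:
    """Toy fixed-size ARP cache that ejects a random entry when full."""

    def __init__(self, capacity, learn_from_all_traffic, rng):
        self.capacity = capacity
        self.learn_from_all_traffic = learn_from_all_traffic
        self.rng = rng
        self.entries = {}  # ip -> mac

    def _insert(self, ip, mac):
        if ip not in self.entries and len(self.entries) >= self.capacity:
            # Cache is full: pick a random victim, as described in the post.
            victim = self.rng.choice(sorted(self.entries))
            del self.entries[victim]
        self.entries[ip] = mac

    def on_packet_seen(self, ip, mac):
        """Called for every packet overheard on the network."""
        if self.learn_from_all_traffic:
            self._insert(ip, mac)

    def resolve(self, ip, mac):
        """Called when the device needs to send to ip.

        Returns True on a cache hit. On a miss an ARP exchange would be
        needed; here the answer is simply inserted.
        """
        hit = ip in self.entries
        if not hit:
            self._insert(ip, mac)
        return hit


def count_gateway_misses(learn_from_all_traffic, steps=10_000,
                         hosts=50, capacity=4, seed=1):
    """Count sends that find the gateway entry missing from the cache."""
    rng = random.Random(seed)
    cache = ArpCache(capacity, learn_from_all_traffic, rng)
    gateway = "192.168.1.1"
    cache.resolve(gateway, "gw-mac")  # initial lookup, not counted
    misses = 0
    for _ in range(steps):
        # Background chatter from some other host on the LAN.
        host = f"192.168.1.{rng.randrange(2, 2 + hosts)}"
        cache.on_packet_seen(host, f"mac-{host}")
        # The device sends a packet to the cloud via the default gateway.
        if not cache.resolve(gateway, "gw-mac"):
            misses += 1
    return misses


old_behaviour = count_gateway_misses(learn_from_all_traffic=True)
workaround = count_gateway_misses(learn_from_all_traffic=False)
```

With learning from all traffic enabled, the gateway entry keeps getting ejected by unrelated hosts, so `old_behaviour` is well above zero. With updates only on a need-to basis, the cache in this model only ever holds the gateway, so `workaround` is 0. A busier network (more `hosts`) or a smaller `capacity` makes the first number worse, which lines up with the observations above.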

Note: Errors in this post should be considered mine, not TIā€™s.

17 Likes

Thanks for the detailed writeup on the status for CFOD!

I believe this would already bring the probability of encountering CFOD on a regular basis to a much smaller % and get more people working on the :spark: core

:smile:

Wow this is super great news! Thanks for your work on this @AndyW and the detailed write up :smile:

I’m pretty familiar with running the CC3000 patching firmware, and it’s not too hard for most of the local programming guys… so hopefully it can be released as a BIN for anyone that wants to attempt it the manual way, while some fancy but reliable update mechanism is being worked on for most users that probably just use the Sparkulator.

2 Likes

I’m sure they can just write a sketch that updates the CC3000 over SPI, then take the compiled HEX file and use the Cloud Update mechanism to place an “Update CC3000 Firmware” button on the Sparkulator. Perhaps under each Core in the device list?

It’s not like it’s hard to update the firmware. I’ve done it manually via UART, through a program compiled in CCS and through a sketch with Energia.

The above method is similar to how TI officially does the firmware updates for the EVM with their Windows software. They’ve released a compiled MSP430 program that you upload to your board, and it updates the CC3000 over SPI. (The MSP430 binary contains the firmware.)

1 Like

@AndyW ditto on what @BDub said, you are an awesome engineer and an incredible asset to Spark! Thank you for the great news!

1 Like

@AndyW Thanks for the update and all your hard work on a solution to the CFOD! Thanks to everyone else who contributed as well!

1 Like

Yeah, thanks for staying on top of this @AndyW, not just on behalf of the :spark: Community, but also 43Oh.com and all the people out there with buggy CC3000 modules! :heartpulse:

3 Likes