Bug bounty: Kill the 'Cyan flash of death'

@zach AWESOME!

That Oprah picture put a smile on my face :smile: I’m glad you guys have a good sense of humor.

This Spark community is going to explode over the next year. Can’t wait to see all the different ways people are going to put the Spark to use.

4 Likes

@Frido
@sjunnesson
@dorth
@wtfuzz
@RWB
@bko
@Hypnopompia
@BDub
@pra
@bkize
@mattande
@mtnscott

Can you guys private message me your address and whether you would prefer a Core with a chip antenna or a Core with a u.FL connector? We’re pretty short on inventory at the moment but should have more available soon, so I will get these out to you as fast as I can :smile:

Yay! Thanks @zach (and the rest of the Spark team), @david_s5, and everyone else working so hard to kill this bug and improve the overall stability of the Spark Core.

2 Likes

I’m really digging the improved stability so far. Thanks everyone for all of your hard work discussing, testing and fixing! :blue_heart: :spark:

2 Likes

@BDub Same here man, I can’t get it to jam up now :smile:

Everything that used to eventually cause a lockup no longer causes lockups! Imagine that!

“That’s one small step for man, one giant leap for mankind.” :rocket:

Thanks @david_s5 ! What are you gonna do with all them new Spark Cores?

1 Like

Congratulations!
This is the power of the open source community!
All honors guys!

1 Like

Hi guys, congratulations and a big thank you to everyone! I’ve noticed my Spark getting significantly more stable over the past several weeks. :smile:

To get the latest update, I just have to clone the repos and stay on the master branch, right?

1 Like

@nitnut Great to hear about the improved stability.

The master branch has the latest and greatest code updates by the Spark team and contributors, but those haven’t been as thoroughly tested by the community.

If you build locally, the latest stable GitHub branch is actually compile-server2. This branch matches the firmware pushed to your Core via the Web UI.

3 Likes

I was so excited to get these fixes pushed to the compile-server2 branch (i.e. what the Web-based IDE uses) and try them out. You can only imagine how disappointed I was when I found my core CFOD-ing with great regularity. :frowning:

But wait! I read the post listed below and discovered that if you flash the same sketch WITHOUT MODIFYING ANYTHING, it just pushes the "old" compiled binary to the Spark - it doesn't re-compile with the new changes.

So I changed my sketch, hit the "Verify" button, and pushed it to the Spark. My sketch has been running for almost 4 hours now and I've had two "recoveries" which normally would have CFOD-ed the Spark, but it just re-connects and continues on. This is just awesome!
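
In case anyone else runs into this: any edit to the file is enough to invalidate the cached binary and force a fresh compile against the new branch. Something as harmless as bumping a version string works (a minimal sketch; the name here is just my own convention):

```cpp
// Changing anything in the file -- even this string -- makes "Verify"
// re-compile instead of re-sending the previously cached binary.
const char* const SKETCH_VERSION = "1.0.1";  // was "1.0.0"

void setup() {
    // your usual setup code
}

void loop() {
    // your usual application code
}
```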

Thank you so much, @david_s5, for your solution to this. Also many thanks to so many others for helping out. Many hands make light work!

Dave O

3 Likes

Exact same thing happened to me :smile:

It’s working out great for me too; I’m putting it through some pretty abnormal situations and it’s doing just fine.

1 Like

Quick update re: TI; they sent us an engineering drop for a potential firmware fix but it didn’t have any effect. @mohit has provided them feedback, so we’re still iterating on the underlying root cause with Texas Instruments. Will keep you posted as we hear more!

5 Likes

Update: We’ll be receiving the second round of CC3K patches from TI this Sunday. These patches will focus on resolving the invalidation of the ARP table inside the module. Shall post the results upon testing!

9 Likes

This bug doesn’t stand a chance! Its days are numbered…
:slight_smile:

3 Likes

Any news regarding the cc3k patches?

2 Likes

Thanks for checking in @altery
The patch that was supposed to arrive last week came out this morning. We are testing it as we speak and will report shortly.

7 Likes

:+1: any luck with latest firmware from TI? @mohit

1 Like

OK - finally some good news.
We received a firmware workaround to test from TI on Monday.
Results so far are good, including a stress test that has run without error for 42 hours and over 200,000 passes - numbers we have never been able to achieve before.
Important: this test has not required any of the other error recovery mechanisms: the Core has not rebooted during this time, a single connection to the cloud has been maintained, and no transactions have been lost.

These are good test results, but that’s not the end of it, yet.

In summary:

  • We should soon be able to release a patch to the CC3000 firmware that vastly reduces the probability of CFOD.
  • The exact mechanism for this release is currently under development.
  • There will need to be subsequent patch(es) to the CC3000 that will fix the root cause of CFOD.

Here are the gory details for those interested:

  1. This workaround has not yet been put through TI’s official test and release procedure; we have a conference call with them tomorrow morning (USA time) to discuss their testing & release schedule. They are looking for a slot in their test plan so that they can release a formal patch as quickly as possible. Look for a further update after that meeting.
  2. We here at Spark need to package up CC3000 firmware upgrades in a manner that is: straightforward; reliable; easy to comprehend; and simple for all users.
  3. I use the word workaround above intentionally. This firmware update does not fix the root cause of the CC3000 failures that lead to CFOD; however it does appear to avoid the most common situations that lead to the CC3000 failure, and thence CFOD. Here is what we believe is going on - I’ll apologise in advance if parts of this are not explained clearly - we have been deep inside this problem for so long it feels like family.
  • The root cause of the CC3000 failures that lead to CFOD is buffer starvation and/or allocation issues that can result in a deadlock.
  • The situation is that the CC3000 has buffers to send, but finds it also needs buffers for either ARP packets or TCP protocol packets before it can proceed and transmit the data, but there are none available.
  • This is a complex problem, and anyone who has written a TCP/IP stack on limited-resource hardware is familiar with these kinds of issues.
  • For whatever reason, fixing this in the CC3000 is proving extraordinarily difficult.
  • In addition, the current behaviour of the CC3000 is to continually update its ARP cache based on packets it receives, regardless of whether those packets form part of ongoing traffic through the CC3000.
  • With this behaviour, if the ARP cache is already full, then a random ARP cache entry is chosen and replaced.
  • In the Core, if the ejected ARP entry is the one for the default gateway, and there are already packet(s) enqueued in the CC3000 ready to be sent to the cloud, then the CC3000 must ARP to find the MAC address of the default gateway.
  • This is apparently when the CC3000 can find itself in a deadlock, needing buffers to send the ARP request and process the reply, but not having any available (a toy model of this is sketched just after this list).
  • So there are a series of events, each individually contributing to the probability that the CC3000 will fail.
  • This explains why: busy networks made CFOD more likely (more packets to pollute the ARP cache); busy applications made CFOD more likely (more chance that the CC3000 will have pending TCP traffic to the cloud at any moment); time to failure was highly variable (the random chance that the ejected ARP entry would belong to the Core’s TCP connection to the cloud).
  • The fix we have been testing stops the automated update of the ARP cache based on packets received; the cache is now updated only on an as-needed basis.
  • So, while it does not fix the root cause resource allocation/starvation problem, it appears to vastly reduce the probability that this bug will be triggered in practical use.
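
If it helps to make that concrete, here is a toy model of the deadlock (my own illustration, not TI’s code; the pool size and all names are invented):

```cpp
// A toy model of the buffer-starvation deadlock described above.
#include <cstdio>

const int NUM_BUFFERS = 4;      // assume a small shared buffer pool
int buffersFree = NUM_BUFFERS;

bool allocBuffer() {
    if (buffersFree == 0) return false;
    --buffersFree;
    return true;
}

int main() {
    // 1. Application traffic fills the pool with queued TCP packets.
    for (int i = 0; i < NUM_BUFFERS; ++i) allocBuffer();

    // 2. Unrelated LAN traffic has randomly evicted the default gateway's
    //    ARP entry, so the module must ARP before it can transmit.
    bool gatewayArpValid = false;

    // 3. Deadlock: the ARP request needs a buffer, but every buffer is
    //    held by a TCP packet that cannot go out until the ARP reply lands.
    if (!gatewayArpValid && !allocBuffer()) {
        std::printf("deadlocked: no free buffers, cannot ARP, "
                    "queue never drains (CFOD)\n");
    }
    return 0;
}
```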

Note: Errors in this post should be considered mine, not TI’s.

17 Likes

Thanks for the detailed write-up on the status of CFOD!

I believe this will already bring the probability of regularly encountering CFOD down to a much smaller percentage, and get more people working on the :spark: core.

:smile:

Wow, this is super great news! Thanks for your work on this @AndyW, and for the detailed write-up :smile:

I’m pretty familiar with running the CC3000 patching firmware, and it’s not too hard for most of the local programming guys… so hopefully it can be released as a BIN for anyone who wants to attempt it the manual way, while some fancy but reliable update mechanism is worked out for most users, who probably just use the Sparkulator.

2 Likes

I’m sure they can just write a sketch that updates the CC3000 over SPI, then take the compiled HEX file and use the Cloud Update mechanism to place an “Update CC3000 Firmware” button on the Sparkulator. Perhaps under each Core in the device list?

It’s not like it’s hard to update the firmware. I’ve done it manually via UART, through a program compiled in CCS and through a sketch with Energia.

The above method is similar to how TI officially does firmware updates for the EVM with their Windows software. They’ve released a compiled MSP430 program that you upload to your board, and it updates the CC3000 over SPI. (The MSP430 binary contains the firmware.)
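
Roughly the shape such a sketch could take (everything below is a placeholder of my own, not the real TI/Spark patch-programmer API; the actual SPI bootloader protocol and chunk size would come from TI’s documentation):

```cpp
// Illustrative skeleton only: the cc3000_* helpers are hypothetical
// stand-ins for the real SPI update protocol, stubbed so this compiles.
#include <stdint.h>
#include <stddef.h>

// Patch image baked into flash alongside the sketch (bytes elided).
const uint8_t cc3000_patch[] = { 0x00 /* ...TI's released patch bytes... */ };

void cc3000_enter_bootloader() { /* hold the module in its update mode */ }
void cc3000_write_block(size_t offset, const uint8_t* data, size_t len) {
    /* stream one chunk of the patch image over SPI */
    (void)offset; (void)data; (void)len;
}
void cc3000_reboot() { /* restart the module so it runs the new firmware */ }

void updateCC3000() {
    const size_t CHUNK = 64;  // assumed transfer size
    cc3000_enter_bootloader();
    for (size_t off = 0; off < sizeof(cc3000_patch); off += CHUNK) {
        size_t len = sizeof(cc3000_patch) - off;
        if (len > CHUNK) len = CHUNK;
        cc3000_write_block(off, &cc3000_patch[off], len);
    }
    cc3000_reboot();
}
```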

1 Like