An application that makes heavy TCPClient usage apparently locks the Spark out with respect to further OTA flash programming. There is no way to recover except manual reset.
Full source for the test case is provided, see below.
I took the @bko Bitcoin sample code (great work) as a test case. It works properly as-is, and can be repeatedly OTA flashed without a problem, because it pauses waiting for operator intervention.
To cause the fault I take out the part where it waits for an operator CR to continue. This results in it continuously looping, spending about half the time talking with the bitcoin server, and half the time spewing out results to the serial port.
If (while it is in the part of the loop where it's talking to the server) I start the CLI flash sequence, it will lock up. The LED is still breathing cyan, but the core no longer outputs to the serial port, and will no longer respond to OTA requests. It must be manually reset.
As a reminder, I'm running on a not so fast internet connection which may be contributing. Still, it shouldn't be happening.
I don't see how to attach a zip file, so I've "temporarily" uploaded the whole project directory as a zip in order to facilitate debugging this: http://www.takenwithyou.com/SparkOTAFlashBugBitcoinDemo.zip
Note: I use Serial1 for debug, so you'll have to do a quick replace in files BitcoinApplication.cpp and rest_client.cpp.
It looks like you already found the Github issue that I think applies to this problem. I think Satish's latest fix mostly works to resolve an OTA prep issue, but the fix you need is the one that tries to do a better job of sharing the data lines with the CC3000 / External flash. I think Satish is isolating the best fixes into a nicer / cleaner branch, but thanks for sharing your code / another test case!
Sadly, the latest firmware update of July 17, 2014 doesn't appear to fix this problem. Here is a complete set of files for an easily (10 minutes to do) replicable test case:
Please note I'm sending debug out Serial1. When the program first starts you need to hit CR in the serial console to get it going (or comment out line 49 in BitcoinApplication.cpp).
The application runs, then chokes on TCP/IP traffic (I guess) and will fail to OTA flash from then on. Before the last firmware changes the program didn't choke, it just wouldn't do the OTA flashing. So, in this case things are worse off than before.
Awesome, thanks for the test case! What files should I be looking at, just BitcoinApplication.cpp, or ? I noticed a 1 second delay in your loop, and you're using a client instead of a server, so I don't think you're sending an overwhelming volume of traffic. I'm guessing it's more likely something is running out of memory or crashing.
It looks like you're not clearing your "response" variable during each loop, so I'm guessing it's growing by ~400 some bytes each time until it overflows the RAM? So you probably make it about 10 requests before crashing? I could certainly be wrong, just a guess.
Looking in your code, you commented out the line that clears the response string before each loop:
/*
// Press ENTER in your serial terminal to continue...
if (!Serial1.available())
    return;
obj[0] = Serial1.read(); // Flush the serial buffer to pause next time through
digitalWrite(LED, HIGH);
delay(1000);
response = ""; // Clear the response String
*/
Well, BitcoinApplication is the main one. But there are also:
jsmnSpark.cpp
jsmnSpark.h
rest_client.cpp
rest_client.h
Basically this is exactly the code that BDub published. His code works great IF (and I do mean IF) you let it wait for a CR each time before letting it go out and grab bitcoin data. All I've done is modify it so that it doesn't wait for user input, it just keeps grabbing and printing, grabbing and printing. And now choking (I'd put a smiley in here but I don't see how to do it. Double smiley).
The Rest client you're using concatenates the new responses with your previous responses, so it'll continue to use more and more RAM until you run out of memory. That line you commented out "resets" that variable, so it won't inflate forever. This wouldn't be the result of anything we changed, I don't think.
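To make the growth concrete, here is a minimal desktop sketch of the behavior described above. It is not the actual rest_client code: `fakeGet` and `bufferSizeAfter` are hypothetical names, `std::string` stands in for the Spark `String`, and the 400 bytes per reply is just the rough figure guessed earlier in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a REST client call that appends each reply
// to the caller's buffer instead of overwriting it.
void fakeGet(std::string &response) {
    response += std::string(400, 'x');  // assume roughly 400 bytes per reply
}

// Runs `requests` loop iterations and returns the final buffer size.
// When clearEachLoop is true, the buffer is reset before every request.
std::size_t bufferSizeAfter(int requests, bool clearEachLoop) {
    assert(requests >= 0);
    std::string response;
    for (int i = 0; i < requests; ++i) {
        if (clearEachLoop) {
            response.clear();  // same effect as `response = "";` in the sketch
        }
        fakeGet(response);
    }
    return response.size();
}
```

Without the clear, ten requests leave about 4000 bytes in the buffer; with it, the buffer stays at one reply's worth. Whether that growth is what actually hangs the core in this test case is still only a guess.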
Once you hit the CR it will happily output bitcoin data to Serial1, but you won't be able to OTA flash the device, and it will stop sending the data to Serial1. If the device is reset you can OTA flash it prior to the user hitting the first CR which starts the TCP/IP traffic, but after that it will hang.
I just compiled your app, let it run for 5-6 steps, and then flashed another app on top of it. I changed your "Serial1" statements to just "Serial" since I didn't want to spin up a board to watch the hardware serial port Serial1. As far as I can tell your code works fine when I compile it using the spark-cli. Are you compiling against the compile-server2 branch, or master? I recommend people use compile-server2 if they want to match what the build IDE produces; it is our stable standard branch. Otherwise we also have some branches tagged for release, instead of the "master" branch which is not necessarily guaranteed to be stable.
Can you share other details about your setup? Did you intend to be using Serial1 and not Serial? Is there a reason you want to use the master branch and not the stable branches?
edit: testing some more, want to be sure there isn't a crash...
Hmm... Hitting their API every second or so might be a bit too frequent, unless you're looking for second by second changes. Introducing another 1-5 second delay would save them ~40-50 calls a minute, and still give you very current data. That small delay also ensures your core isn't busy when you want to flash. I also moved this block of variables outside your loop statement:
int i, r;
jsmn_parser p;
jsmntok_t tok[NUM_TOKENS];
char obj[MAX_OBJ_SIZE];
void loop() {
...
}
since they look like they're all reinitialized properly anyway inside the loop, and can safely be globals. I've also seen some of the OTA flashes "flash green" rapidly, since you're taking control of the LED and turning it green.
I could be wrong, if so I apologize, but this is how it appears to me.
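One way to space out the API calls without a long blocking delay is to check elapsed time on each pass through the loop. The sketch below is only an illustration under assumptions: `intervalElapsed` and `PollTimer` are hypothetical names, and the timestamps are assumed to come from a millisecond counter such as `millis()`. It shows the throttling idea only and makes no claim about fixing the OTA lockup itself.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true once `intervalMs` has passed since `lastMs`.
// Unsigned subtraction keeps the result correct when the counter wraps.
bool intervalElapsed(std::uint32_t nowMs, std::uint32_t lastMs,
                     std::uint32_t intervalMs) {
    return static_cast<std::uint32_t>(nowMs - lastMs) >= intervalMs;
}

// Hypothetical poll timer: due() returns true at most once per interval.
struct PollTimer {
    std::uint32_t intervalMs;  // minimum spacing between requests
    std::uint32_t lastMs;      // timestamp of the last accepted request

    bool due(std::uint32_t nowMs) {
        assert(intervalMs > 0);
        if (!intervalElapsed(nowMs, lastMs, intervalMs)) {
            return false;  // too soon, skip the request on this pass
        }
        lastMs = nowMs;
        return true;
    }
};
```

In a sketch, the loop would call something like `timer.due(millis())` and only talk to the server when it returns true, returning quickly otherwise.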
With due respect, the code I provided is a test case. The point of testing is to see if things break, not to fix the test so that things don't break.
I'm afraid the firmware still has a serious problem with OTA, in fact probably a show stopper for a commercial application.
You can't have a device that works 95% of the time, and then locks you out the other 5%.
The main effect of inserting the 5000 millisecond delay at the bottom is to make it harder to test. In other words the fault shows up much less frequently, and in fact appears to go away. But it's still there if you initiate OTA flash approximately 2-3 seconds after Serial printing has stopped. This causes the actual cloud interaction to occur during the TCP/IP code, thus faulting the device.
After 30 minutes of playing with the 5 second delay version I was able to put the device into a state (by OTA flashing it) where it would not recover without hard reset. Not acceptable.
Developer code doesn't run in a sandbox, it's part of the application that's running on the core, so it's very common for that code to impact OTA on the core, since OTA is an intense process. We've been swarming on a number of OTA bugs recently, and we'll keep testing and improving things, but I spent a few hours today trying to help test and understand your use case.
I appreciate the test case! We fixed two major OTA bugs in the last week, and we'll keep working to improve it in the meantime. I've never seen a user firmware that couldn't be fixed to work well with OTA, so I'd be surprised if this were the case here.
There are lots of ways to work around this. For debugging, I like to do this to make sure I can always do an OTA update:
void loop() {
    int pin = digitalRead(D0);
    if (HIGH==pin) {
        for(;;) {
            SPARK_WLAN_Loop();
        }
    }
    ...
So when pin D0 is low, the show goes on, but if you pull D0 high, you loop forever doing the Spark loop (waiting to OTA flash).
This works great unless you are working on code for the external flash, in which case it is possible for your code to clobber important stuff so that a factory reset is required.