OTA Flashing Locked Out with heavy TCPClient usage

An application that makes heavy TCPClient usage apparently locks the Spark out with respect to further OTA flash programming. There is no way to recover except manual reset.

Full source for the test case is provided, see below.

I took the @bko Bitcoin sample code (great work) as a test case. It works properly as-is, and can be repeatedly OTA flashed without a problem, because it pauses waiting for operator intervention.

To cause the fault I take out the part where it waits for an operator CR to continue. This results in it continuously looping, spending about half the time talking with the bitcoin server, and half the time spewing out results to the serial port.

If I start the CLI flash sequence while it is in the part of the loop where it’s talking to the server, it will lock up. The LED is still breathing cyan, but the core no longer outputs to the serial port and will no longer respond to OTA requests. It must be manually reset.

The CLI eventually responds with an ECONNRESET error.

Notes:

  1. I have incorporated @satishgn’s bug fix from last week:
    https://github.com/spark/core-firmware/commit/6d0b3ad20cd7d5ecbbb64bea6c26160290ac0e61

  2. As a reminder, I’m running on a not-so-fast internet connection, which may be contributing. Still, it shouldn’t be happening.

I don’t see how to attach a zip file, so I’ve “temporarily” uploaded the whole project directory as a zip in order to facilitate debugging this:
http://www.takenwithyou.com/SparkOTAFlashBugBitcoinDemo.zip
Note: I use Serial1 for debug, so you’ll have to do a quick replace in files BitcoinApplication.cpp and rest_client.cpp.

Hi @faraday,

It looks like you already found the Github issue that I think applies to this problem. I think Satish’s latest fix mostly works to resolve an OTA prep issue, but the fix you need is the one that tries to do a better job of sharing the data lines with the CC3000 / External flash. I think Satish is isolating the best fixes into a nicer / cleaner branch, but thanks for sharing your code / another test case!

Thanks!
David


Sadly, the latest firmware update of July 17, 2014 doesn’t appear to fix this problem. Here is a complete set of files for an easily replicable test case (about 10 minutes to set up):

http://www.faradayco.com/sparkdev/SparkBitcoinRESTTest.zip

Please note I’m sending debug output to Serial1. When the program first starts you need to hit CR in the serial console to get it going (or comment out line 49 in BitcoinApplication.cpp).

The application runs, then chokes on TCP/IP traffic (I guess) and will fail to OTA flash from then on.  Before the last firmware changes the program didn’t choke, it just wouldn’t do the OTA flashing.  So, in this case things are worse off than before.

Hi @faraday,

Awesome, thanks for the test case! :) Which files should I be looking at, just BitcoinApplication.cpp, or others as well? I noticed a 1 second delay in your loop, and you’re using a client instead of a server, so I don’t think you’re sending an overwhelming volume of traffic. I’m guessing it’s more likely something is running out of memory or crashing.

It looks like you’re not clearing your ‘response’ variable during each loop, so I’m guessing it’s growing by roughly 400 bytes each time until it overflows the RAM? So you probably make it about 10 requests before crashing? I could certainly be wrong, just a guess.

Looking in your code, you commented out the line that clears the response string before each loop:

/*
  // Press ENTER in your serial terminal to continue...
  if (!Serial1.available())
    return;

  obj[0] = Serial1.read(); // Flush the serial buffer to pause next time through

  digitalWrite(LED, HIGH);
  delay(1000);

  response = ""; // Clear the response String
*/

Thanks,
David

Well, BitcoinApplication is the main one. But there are also:
jsmnSpark.cpp
jsmnSpark.h
rest_client.cpp
rest_client.h

Basically this is exactly the code that BDub published. His code works great IF (and I do mean IF) you let it wait for a CR each time before letting it go out and grab bitcoin data. All I’ve done is modify it so that it doesn’t wait for user input, and just keeps grabbing and printing, grabbing and printing. And now choking (I’d put a smiley in here but I don’t see how to do it. Double smiley).

faraday

I had some ninja edits in my post, just making sure you saw my updates, looks like maybe it’d do better with that response resetting code?

Thanks,
David

I don’t understand your comment.

BTW, this code runs for a while, then throws a red SOS 8 (Out of heap memory).

Did the latest changes increase memory usage to where this won’t run now?

Hi @faraday,

The REST client you’re using concatenates the new responses with your previous responses, so it’ll continue to use more and more RAM until you run out of memory. That line you commented out ‘resets’ that variable, so it won’t inflate forever. This wouldn’t be the result of anything we changed, I don’t think.
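To illustrate the growth, here is a minimal host-side sketch rather than the actual sketch code. It assumes std::string as a stand-in for the Spark String class, a hypothetical fetchInto() that appends to the caller's buffer the way a concatenating REST client would, and a reply size of about 400 bytes (the rough guess from earlier in this thread, not a measured value).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the REST client call: it appends the new reply
// to whatever the caller's buffer already holds.
void fetchInto(std::string &response) {
    response.append(400, 'x');  // assumed reply size of ~400 bytes
}

// Run `requests` passes of the loop body and return the final buffer size.
// With clearEachPass == false the buffer is never reset and keeps growing.
std::size_t bufferSizeAfter(int requests, bool clearEachPass) {
    assert(requests >= 0);
    std::string response;
    for (int i = 0; i < requests; ++i) {
        if (clearEachPass) {
            response.clear();  // plays the role of `response = "";`
        }
        fetchInto(response);
    }
    return response.size();
}
```

Under these assumptions, ten passes without the reset leave a 4000-byte buffer, while ten passes with the reset leave 400 bytes. On a device with a small heap, the first pattern would be expected to end in an out-of-memory fault eventually.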

Thanks,
David

I just noticed your comment about my commenting out the response string. Will fix that and see what happens.


Ok, fixed the little response string mistake. Now we’re back to where we were.

With 17 July firmware update this application code still prevents OTA flash. I’ve re-uploaded the fixed test case here:

http://www.faradayco.com/sparkdev/SparkBitcoinRESTTest2.zip

Once you hit the CR it will happily output bitcoin data to Serial1, but you won’t be able to OTA flash the device, and it will stop sending the data to Serial1. If the device is reset you can OTA flash it prior to the user hitting the first CR which starts the TCP/IP traffic, but after that it will hang.

Hi @faraday,

I just compiled your app, let it run for 5-6 steps, and then flashed another app on top of it. I changed your “Serial1” statements to just “Serial” since I didn’t want to spin up a board to watch the hardware serial port Serial1. As far as I can tell your code works fine when I compile it using the spark-cli. Are you compiling against the compile-server2 branch, or master? I recommend people use compile-server2 if they want to match what the build IDE produces; it is our stable standard branch. Otherwise we also have some branches tagged for release, as opposed to the ‘master’ branch, which is not necessarily guaranteed to be stable.

Can you share other details about your setup? Did you intend to be using Serial1 and not Serial? Is there a reason you want to use the master branch and not the stable branches?

edit: testing some more, want to be sure there isn’t a crash…

Thanks,
David

Hi @faraday,

Hmm… Hitting their API every second or so might be a bit too frequent, unless you’re looking for second by second changes. Introducing another 1-5 second delay would save them ~40-50 calls a minute, and still give you very current data. That small delay also ensures your core isn’t busy when you want to flash. I also moved this block of variables outside your loop statement:

int i, r;
jsmn_parser p;
jsmntok_t tok[NUM_TOKENS];
char obj[MAX_OBJ_SIZE];

void loop() {
...
}

since they look like they’re all reinitialized properly anyway inside the loop, and can safely be globals. I’ve also seen some of the OTA flashes “flash green” rapidly, since you’re taking control of the LED and turning it green :)
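As a rough sketch of the throttling idea, one alternative to a blocking delay() is to check elapsed time and return from loop() early when the next request is not yet due. The helper name requestDue() and its parameters are hypothetical and not part of the firmware; on the core the first argument would presumably come from millis().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true once at least intervalMs milliseconds have passed
// since lastMs. The unsigned 32-bit subtraction stays correct when the
// millisecond counter wraps around.
bool requestDue(std::uint32_t nowMs, std::uint32_t lastMs, std::uint32_t intervalMs) {
    return static_cast<std::uint32_t>(nowMs - lastMs) >= intervalMs;
}
```

A loop() built this way would return immediately while requestDue() is false, record the current time in a global when it fires, and only then talk to the server. Whether this closes the timing window discussed in this thread is untested here; it only spaces the requests out without parking the core inside delay().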

I hope that helps! Here are my changes: https://gist.github.com/dmiddlecamp/3affed45070bcefbfafd

I included my compiled ‘bitcoin.bin’ in there, in case you want to try it and see if it behaves any differently.

Thanks,
David

I could be wrong, if so I apologize, but this is how it appears to me.

With due respect, the code I provided is a test case. The point of testing is to see if things break, not to fix the test so that things don’t break.

I’m afraid the firmware still has a serious problem with OTA, in fact probably a show stopper for a commercial application.

You can’t have a device that works 95% of the time, and then locks you out the other 5%.

The main effect of inserting the 5000 millisecond delay at the bottom is to make it harder to test. In other words the fault shows up much less frequently, and in fact appears to go away. But it’s still there if you initiate OTA flash approximately 2-3 seconds after Serial printing has stopped. This causes the actual cloud interaction to occur during the TCP/IP code, thus faulting the device.

After 30 minutes of playing with the 5 second delay version I was able to put the device into a state (by OTA flashing it) where it would not recover without hard reset. Not acceptable.

Well, according to the progress made here:

the OTA issue is really close to being solved.


Hi @faraday,

Developer code doesn’t run in a sandbox, it’s part of the application that’s running on the core, so it’s very common for that code to impact OTA on the core, since OTA is an intense process. We’ve been swarming on a number of OTA bugs recently, and we’ll keep testing and improving things, but I spent a few hours today trying to help test and understand your use case.

I appreciate the test case! We fixed two major OTA bugs in the last week, and we’ll keep working to improve it in the meantime. I’ve never seen a user firmware that couldn’t be fixed to work well with OTA, so I’d be surprised if this were the case here.

Thanks,
David

There are lots of ways to work around this. For debugging, I like to do this to make sure I can always do an OTA update:

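void loop() {
    // Escape hatch: pull D0 high to stop the application code
    int pin = digitalRead(D0);
    if (HIGH == pin) {
        for (;;) {
            SPARK_WLAN_Loop();  // only run the Spark loop, waiting for an OTA flash
        }
    }
...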

So when pin D0 is low, the show goes on, but if you pull D0 high, you loop forever doing the Spark loop (waiting to OTA flash).

This works great unless you are working on code for the external flash, in which case it is possible for your code to clobber important stuff so that a factory reset is required.
