Flash via Serial or OTA fails but OK via DFU

The title says it all.

A small number of devices in the fleet are exhibiting odd behaviour: they will not flash successfully via OTA (using particle flash) but are able to be flashed via DFU (using particle flash --usb).

Am using firmware 0.8.0-rc.1.

Thanks! @UMD

Fixed by downgrading to 0.7.0-rc.7, flashing Tinker, then flashing the app built against the desired firmware, which in my case was 0.8.0-rc.1.

Thank you for putting your solution!

I’m having this same issue on my RedBear Duos and also my E Series devices. Can we reopen?

@elijthomas, have you tried the work around that I tried?

I get this issue from time to time and it is very frustrating, especially when it happens in the field.

I will “unsolve” the issue in the interim.

Just wanted to mention you’re using the latest release candidates, all of which state you ought not to use them in production since there might be unresolved bugs.
I’m not saying that’s what’s happening, but using RCs out in the field has a bit of a risk to it.

Furthermore, the code you are running can interfere with the OTA process, whereas a wired DFU connection doesn’t suffer from that. Try putting Tinker on the device and see if you can flash it then. If so, there might be an issue with your code.

If you’ve got any steps to reproduce the problem, that’d be great for debugging purposes.

@Moors7, good advice re RC’s, but am happy to take the risk!

I had already done exactly as you said - loading Tinker, and that fixed the problem, but it is not clear to me why it worked.

From what you are saying, the application could be a contributor.

There are still puzzles to the issue:

  • its random nature (and therefore very hard to reproduce)
  • once it occurs, thereafter the only way to load a new app is to DFU it. Even putting it in safe mode does not allow OTA updates…

Could these facts be pointing to application memory overwrite? Is this possible to do from within an application?

It might not be random in itself. The code’s contribution may be systematic, with the randomness coming only from timing. When, in terms of your code flow, does the OTA happen? If the OTA triggers during a part of the code that leaves the device fairly responsive, it is more likely to succeed than during sections that keep the controller busier (e.g. during extensive I2C communication the device is typically less responsive, also depending on the implementation of the libraries used).

@ScruffR,

High probability that your theory is on beam!

Thinking more about it, I want to modify this statement:

  • once it occurs, thereafter the only way to load a new app is to DFU it. Even putting it in safe mode does not allow OTA updates…

To:

  • after an application update the only way to load a new app is to DFU it. Even putting it in safe mode does not allow OTA updates…

This squashes “has the program ‘damaged’ memory” line of thinking!

There are a number of libraries and other complications in the app that could be consuming resources: I2C, display driver, serial input, and TCP/IP intranet…

Now that I have a lead to follow, I will continue to investigate the next time I get the issue. Will report back on this ticket.

Hopefully others will kick in with their findings, eg @elijthomas who is suffering with the same.

PS - @ScruffR, why doesn’t safe mode allow update?

That would be a secondary question to clarify. But I think it also makes a difference “how” you got into Safe Mode. Usually manually entering Safe Mode does go via a reset, but AFAIK that’s not always the case.
e.g. via USB baudrate 28800 you can get into SM without a reset and I’m not entirely sure whether or not Safe Mode Healer would first reset the system.

Another thing I have noticed over time is in connection with SYSTEM_THREAD(ENABLED).
Since the download of the new firmware happens in parallel to your still-running program, things might get messed up during that period. And even after an “apparent” software reset following the download, I often get the impression that the old firmware is still present and gets started before the actual transfer from the download area into the program space is initiated.

@ScruffR, I know what you mean re other OTA issues! (That said, it is pretty reliable on the whole).

Good to know about the USB serial access to safe mode.

Luckily I have a client who is bringing back a Photon based device that has the OTA issue, so we have an opportunity to investigate.

Can you suggest how/what I should investigate here? Perhaps performing tests with safe mode? Good news is that I have logging to USBSerial1 - so that might help.

Hard to say.
I guess you haven’t set LOG_LEVEL_ALL in the current firmware?
Having that on would be key to also see what’s going on behind the scenes in system during OTA too.

@ScruffR, drats, have set default to LOG_LEVEL_WARN… but will change moving forward!

@ScruffR, (am still ingesting the mesh kit announcements… way to go!)

The errant Photon has come in from the customer, and confirmed that it could NOT be flashed OTA.

Performed the following experiments:

  • It has my app v3.02.5 installed
  • OTA failed <=== BAD
  • Flashed the same app v3.02.5 via SERIAL
  • OTA failed <=== BAD
  • Flashed new app v3.09.0 via SERIAL
  • App v3.02.5 still installed <=== BAD
  • Flashed new app v3.09.0 via DFU
  • Long time cycling through flashing magenta then momentarily flashing cyan
  • Eventually went into safe mode <=== BAD
  • Flashed new app v3.09.0 via DFU again
  • Went into safe mode <=== BAD
  • Flashed earlier app v3.06.0 via DFU
  • Long time cycling through flashing magenta then momentarily flashing cyan
  • Eventually went into safe mode <=== BAD

So that is the story so far.

To me it looks like the Photon is stuffed as we say in the vernacular and that it is not a misbehaved application per se.

What next?

I really want to recover this Photon if possible so as to develop a procedure for the next time this happens (which, looking back on it, I have suffered a few times before).

How about flashing some other firmware (e.g. Tinker) and then try the OTA?
Also adding a LOG_LEVEL_ALL logging and posting the output would be good.

BTW, is this a product device? Is it marked as developer device in the product?
If “Yes” to first and “No” to second question, I’d guess the magenta phase may be reverting to the product firmware after flashing “unauthorised” firmware.

Updating its system firmware might not hurt either, should you not have tried that already.

@Moors7,

Flashed firmware 0.8.0-rc.1 via DFU
Flashed new app v3.09.0 via OTA
Safe mode <=== BAD
Flashed tinker-0.6.0-rc2 via DFU
Safe mode <=== BAD
Flashed firmware 0.7.0-rc.7 via DFU
Safe mode <=== BAD
Flashed tinker-0.6.0-rc2 via DFU
Safe mode <=== BAD

Flashed tinker-0.6.0-rc2 via DFU
Safe mode <=== BAD but this time looked at the console:

{"data":"{"f":[],"v":{},"p":6,"m":[{"s":16384,"l":"m","vc":30,"vv":30,"f":"b","n":"0","v":100,"d":[]},{"s":262144,"l":"m","vc":30,"vv":30,"f":"s","n":"1","v":206,"d":[]},{"s":262144,"l":"m","vc":30,"vv":26,"f":"s","n":"2","v":206,"d":[{"f":"s","n":"1","v":206,"":""},{"f":"b","n":"0","v":101,"":""}]},{"s":131072,"l":"m","vc":30,"vv":0,"d":[]},{"s":131072,"l":"f","vc":30,"vv":0,"d":[]}]}","ttl":60,"published_at":"2018-02-13T23:47:52.341Z","coreid":"xxxxxxxxxxxxxxxxxxxxxxxx","name":"spark/status/safe-mode"}
spark/device/last_reset pin_reset February 14th at 10:47:51 am
{"data":"pin_reset","ttl":60,"published_at":"2018-02-13T23:47:51.077Z","coreid":"xxxxxxxxxxxxxxxxxxxxxxxx","name":"spark/device/last_reset"}
device came online no data February 14th at 10:47:50 am
{"data":"online","ttl":60,"published_at":"2018-02-13T23:47:50.541Z","coreid":"xxxxxxxxxxxxxxxxxxxxxxxx","name":"spark/status"}

Flashed tinker-0.6.0-rc2 via OTA and got this console entry:

{"data":"8BBF72A9B422C8599918B1B7B15FE5B13CB8F1125BDB13BF32629490A39CB99A","ttl":60,"published_at":"2018-02-3T23:55:07.065Z","coreid":"xxxxxxxxxxxxxxxxxxxxxxxx","name":"spark/device/app-hash"}

Do these console entries give us any clues? It is smelling very much like a flash memory fault…

@ScruffR, in answer to your question, it is not part of a product. Witness the lack of flashing magenta above.

Note am not persisting to save $ but to find a troubleshooting method!
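
One way to read the spark/status/safe-mode event above is to compare each module's declared dependencies with the modules actually installed. The sketch below does that in Python on a hand-transcribed copy of the event's module list. It assumes the usual meaning of the short keys (f = function, with b = bootloader and s = system part; n = index; v = version; d = dependencies), which the thread itself does not spell out, and the helper name unmet_dependencies is made up for illustration.

```python
# Module list transcribed from the safe-mode event (only the fields used here).
# Key meanings are an ASSUMPTION: f = function ("b" bootloader, "s" system part),
# n = module index, v = version, d = list of dependencies, s = size in bytes.
modules = [
    {"s": 16384, "f": "b", "n": "0", "v": 100, "d": []},
    {"s": 262144, "f": "s", "n": "1", "v": 206, "d": []},
    {"s": 262144, "f": "s", "n": "2", "v": 206, "d": [
        {"f": "s", "n": "1", "v": 206},
        {"f": "b", "n": "0", "v": 101},
    ]},
    {"s": 131072, "d": []},  # slots reported without function/version
    {"s": 131072, "d": []},
]


def unmet_dependencies(modules):
    """Return (module, dependency, installed_version) for each unmet dependency.

    Hypothetical rule: a dependency is unmet when the named module is missing
    or its installed version is lower than the required one.
    """
    installed = {(m["f"], m["n"]): m["v"] for m in modules if "f" in m}
    problems = []
    for m in modules:
        for dep in m.get("d", []):
            have = installed.get((dep["f"], dep["n"]))
            if have is None or have < dep["v"]:
                problems.append(
                    ((m["f"], m["n"]), (dep["f"], dep["n"], dep["v"]), have)
                )
    return problems


for module, dep, have in unmet_dependencies(modules):
    print(f"module {module} needs {dep[0]}{dep[1]} >= v{dep[2]}, found v{have}")
```

On this data the check reports a single mismatch: system part 2 asks for bootloader version 101 while version 100 is installed. If the key meanings assumed here are right, that points at a bootloader/system version mismatch rather than at faulty flash memory.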

Hi @UMD,

I can’t downgrade the RedBear Duo as it’s on 3.0.0, and downgrading to 2.4.0 requires access to dfu-util. My devices are in production, in the real world.

I used to be able to update them, so I’m still not sure what’s changed.

I’m going to try and update the firmware to 3.1.0 and see if that makes any difference.

A bit more info,

I’ve got stack trace logging on the device now (see the image). The error is logged after the flash status of success:

    {
      "data": {
        "r": "error",
        "u": {
          "s": 524288,
          "l": "t",
          "vc": 94,
          "vv": 30,
          "u": "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF",
          "f": "u",
          "n": "1",
          "v": 8,
          "p": 88,
          "d": [
            {
              "f": "s",
              "n": "2",
              "v": 8,
              "_": ""
            }
          ]
        }
      }
    }

spark/trace/device-event-spark/device/ota_result

Any ideas?
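
The vc and vv fields in the ota_result payload above look like bitmasks: vc for the checks that were requested and vv for the checks that passed. Under that reading, the bits set in vc but missing from vv are the failed checks. The sketch below decodes them in Python. The flag names are recalled from the module validation flags in the system firmware and should be treated as an assumption, not a reference; failed_checks is a hypothetical helper.

```python
# ASSUMED names for the validation bits; verify against the system firmware
# source before relying on them.
FLAG_NAMES = {
    1 << 1: "integrity",
    1 << 2: "dependencies",
    1 << 3: "range",
    1 << 4: "platform",
    1 << 5: "product",
    1 << 6: "dependencies_full",
}


def failed_checks(vc, vv):
    """Return names of the checks requested in vc that are not set in vv."""
    missing = vc & ~vv
    names = []
    bit = 1
    while bit <= missing:
        if missing & bit:
            # Fall back to a generic label for bits without an assumed name.
            names.append(FLAG_NAMES.get(bit, f"bit_{bit}"))
        bit <<= 1
    return names


print(failed_checks(94, 30))  # vc/vv from the ota_result payload
print(failed_checks(30, 26))  # vc/vv of system part 2 in the earlier safe-mode event
```

The arithmetic itself does not depend on the names: the only bit missing is 64 in the OTA result and 4 in the earlier safe-mode event. If the assumed names are right, both are dependency checks, which would suggest a mismatch between module versions rather than a corrupted image.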

This is a very peculiar behaviour. These status reports do bear some info, but whether or not they indicate some incompatibility is difficult for me to say as I don’t know of a comprehensive doc that lists good vs. bad combinations of firmware modules (@rickkas7?).

Maybe you could also provide the output of particle serial inspect with your device in Listening Mode.

Another thing you could/should try is flashing the bootloader for 0.8.0-rc.1 (and 0.7.0-rc.4+) which has to be done via particle flash --serial not --usb.

However, 0.8.0-rc.1 (and I guess rc.2 too) has an open issue reported by me about connection loss detection
