Particle Boron recovered dead in the field - but power cycle reboots!

I would like to know if anybody has experience similar to the following.

I “upgraded” one of my remote sites (2.5 hour drive both ways) last week with a v2.0.0-rc1 Boron.

I have already highlighted the stupidity of this choice, being that Particle has ruined the Boron’s cellular reconnection stability sometime after 1.3.1-rc1, in this post:

But this is different and even worse. Instead of finding a flashing-green Boron, I found a DEAD Boron with no light at all.

However, with a power cycle, it rebooted and flashed green.

The code was tested for weeks at my house in good cellular conditions. The software watchdog was also in use.

The power supply was +5V into VUSB. I know there were no supply issues because my Teensy 3.5 was working just fine on the same board.

Why did this happen?

Why is Particle’s flagship product habitually so fragile, sensitive, and unpredictable as to whether it will work or not?

Does anybody have any ideas how the Boron could randomly go dead, but then reboot after a power cycle?

Did you try the reset button, or was a power reset specifically necessary? Those usually imply very different kinds of issues in my experience.

Broadly, if you share some code and more info on your use case we can help - with no information, kinda hard to say what might be wrong.

It’s possible for the processor to lock up in such a way that the software watchdog cannot be triggered. It’s kinda one of those things where it may not be likely to happen but no matter how good your hardware and firmware it eventually WILL happen. This is why in critical applications you really nearly always need to use an external hardware watchdog. You can do this on a power switch for the whole thing or just the reset pin.

Have you tested the software watchdog? Are you certain that it is configured properly and triggers as expected? Sharing your code may be helpful here. Again, things are gonna lock up eventually, these safety measures are necessary.
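As a rough illustration of the external-watchdog idea above (a sketch only, not code from this thread): the external part resets or power-cycles the board if it sees no "kick" within its timeout, so the firmware has to kick it on a schedule from the main loop. The names ExternalWatchdogModel and kickDue and the timeout values are hypothetical, and the model runs on a host PC rather than on a Boron. On a device, the time values would come from the millisecond counter and kick() would toggle a GPIO wired to the watchdog.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of an external watchdog part: it would assert
// reset (or cut power) if no kick arrives within timeoutMs.
struct ExternalWatchdogModel {
    uint32_t timeoutMs;
    uint32_t lastKickMs;

    // Called whenever the firmware pulses the kick pin.
    void kick(uint32_t nowMs) { lastKickMs = nowMs; }

    // True once timeoutMs has elapsed without a kick. Unsigned subtraction
    // keeps the result correct across a 32-bit millisecond rollover.
    bool wouldReset(uint32_t nowMs) const {
        return static_cast<uint32_t>(nowMs - lastKickMs) >= timeoutMs;
    }
};

// Firmware-side helper: is it time to kick the watchdog again?
bool kickDue(uint32_t nowMs, uint32_t lastKickMs, uint32_t intervalMs) {
    assert(intervalMs > 0);
    return static_cast<uint32_t>(nowMs - lastKickMs) >= intervalMs;
}
```

With an assumed 10 s timeout and a kick every 5 s, a healthy loop never lets the timeout expire. If the loop stalls, the kicks stop and the external part acts. Wiring that output to a power switch rather than only the reset pin covers the case where reset alone does nothing.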

Keep in mind that testing in “good cellular” may not be a very representative use case. I usually do testing with faraday cages and other techniques for reducing signal strength in a test environment. Not hard to DIY. I’ve found the vast majority of bugs, errors, and issues come from poor signal strength environments.

Finally, anytime you use release candidate firmware, you’re asking for trouble in critical applications. Just use the latest stable release. While usually the release candidates are solid, by definition they may not be stable. EDIT: looks like they’re doing this whole release by default thing now, so I guess that it’s now no longer a release candidate. Confusing terminology. Seems reasonable that you should be using v2.0 at this point!

Thank you @justicefreed_amper for your response.

  1. The Reset button did nothing. Thanks for eliciting this important point. The Reset button did not reboot the device. Only a full power cycle did.

  2. I believe #1 shows that it is not a code issue. The heart of my code simply loops over any incoming UART packets and uploads them over MQTT, with the system thread enabled and SEMI_AUTOMATIC mode. If cellular is not ready() then I call .on() in the loop. It is the same code that has been working without issue at my two Boron sites, which have stayed up for months (using 1.3.1-rc1 and no external hardware watchdog).

  3. Please see this new thread describing a topically separate, though potentially related issue:
    (Video) New Boron FAILS to reboot on VUSB power cycle - external watchdog incapable of resetting device

There are clearly physical Boron electronic issues and/or deep seated software/firmware issues at play because 1) the reset button did not work and 2) the code doesn’t do anything crazy, as proven by other stable sites.

I will try to deploy the same exact setup but with a 1.3.1-rc1 Boron. It’s just unsettling that I will have to spend all that time and driving to not know what the result will be.

“It’s just unsettling that I will have to spend all that time and driving to not know what the result will be.”

I agree! That’s why the external hardware watchdog generally is so highly recommended. It’s possible it might not have fixed this issue (unless you hook it up to a power switch, which is easy to do), but it’s a legitimate solution that is used for pretty much any robust embedded system.

So #1 does not show it is not a code issue, really. It is however an indication that something got really deeply stuck. Sometimes stuff just happens. Cosmic radiation can literally flip an important bit in memory. Rare? Sure, but it definitely happens.

Also two sites is an insufficient sample size. I didn’t start noticing problems with my Electrons and Photons until I had hundreds in the field across 10-20 sites. Not saying you can’t catch problems early, but a lot of this reliability stuff is RNG and statistics based.
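To put a rough number on the sample-size point (a sketch under stated assumptions, not data from this thread): if each device independently has some probability p of hitting a rare lockup over a given period, the chance of seeing at least one failure across n devices is 1 - (1 - p)^n. The function name and the 1% figure used below are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Probability that at least one of n independent devices fails, given an
// assumed per-device failure probability p over the same period.
double probAtLeastOneFailure(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 0);
    return 1.0 - std::pow(1.0 - p, n);
}
```

With an assumed p of 0.01, two devices give about a 2% chance of seeing any failure at all, while 300 devices give roughly 95%. A rare fault can easily stay invisible at two sites and look routine across a large fleet.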

Some food for thought - most of the problems I’ve had have been with system thread. Do you really need it? I do need it, so I’ve worked around it, but if you can get away without it your system will absolutely be more stable. Worth trying if you haven’t already.

Also - is this the same Boron? You could simply have a hardware defect. It happens. Your new thread seems to be the exact same problem, yes? I’m not seeing the difference between that behavior and this. If so I’d merge the threads. I’ll also comment there about the UART stuff.

Thanks for sharing that experience. The problems are separate. I realized today that I forgot to report the following additional critical detail in this thread (I was confused by the other matter):

Once recovered, power cycling will reboot it, but it then dies about 2 seconds later.
Video proof:

In the issue from my other new thread, the Boron does not consistently and permanently die off after 2 seconds. Actually I am posting a new video condensing that thread, showing that the Boron boots normally when connected to USB but won’t restart when VUSB power switched.

The issue in this thread is that I found a Boron that had previously tested as working (on the same circuit board) in a dead state, where Reset did nothing, and where power cycling caused the result in the video above.

I know it’s not a defective Boron because it had been working fine before deployment.

Has this happened to anyone else? Has anyone had a Boron go into a state where it will always die after 2 seconds of being powered?

Dying after a consistent, short time like that sounds like a firmware issue. It could be that your application code got corrupted somehow. Definitely start by flashing tinker or an empty sketch via DFU mode over USB and verifying it is not a hardware issue.

Good recommendation. I just tried safe mode. It crashed/died after 2 seconds, like in the video. Then I tried DFU mode to flash tinker. Indeed, it is able to get into and stay in DFU mode. I flashed the same Tinker version as the device OS on this device, 2.0.0-rc1.

Same result. It dies after 1 second when booting up with clean, bare Tinker.

Now I will try downgrading the OS and seeing if it is permanently destroyed.

Yeah that’s valuable to know. It’s possible for the bootloader or deviceOS to become corrupted. Reflashing deviceOS (and downgrading works too) will rule out the deviceOS. If you have a JTAG debugger you can reflash the bootloader, but if you don’t I wouldn’t bother trying that.

Here is an incredible result which furthers my accusation against Particle for their software updates.

After downgrading to 1.5.0 (both Device OS and Tinker, but not bootloader) and rebooting, this problem has gone away. The Boron is now loading.

Not my code, but Particle’s Tinker app on 2.0.0-rc1 had this 1-second hard crash occurring.

I feel validated in my thread about how 2.0.0-rc1 ruined the cellular reconnection because now we know this version creates a situation where the Boron will fail to load up at all, regardless of version being used.

Thanks. Now I await Particle’s response to rationalize how this happened.

Well, as I mentioned, could also have been corrupted firmware. I would personally recommend trying to reproduce once more on 2.0.0 right after a DFU update. If that fails, then yeah, seems like they borked something in that update and then we can figure that out from there!

After reflashing 2.0.0-rc1 per your suggestion, things remain normal: it is booting up and connecting. I am not sure this absolves 2.0.0-rc1 of guilt. It is possible that this version, unlike others, allows normal operation over a course of time (like my experience) to cause the Boron to go into the 1-second-after-boot-it-always-dies phenomenon captured on video above.

It’s also possible that some hardware or firmware flaw present in all versions of firmware allows this to happen.

I am not sure if I can walk away with any conclusion beyond a reinforced impression of the unreliability of the Boron:

  1. I tested the Boron for 2 weeks in “nice” cellular conditions.

  2. I drove hours to site and deployed it in not as nice, real-world cellular conditions.

  3. It connected once that day (August 1st) and never reconnected afterwards.

  4. I retrieved it 2 weeks later in the dead state above: reset did nothing, power cycling caused boot then instant crash.

  5. Now I reflash the Device OS at my laptop and it works again.

That is not a reliable device. I can’t see what variables were not controlled for. Perhaps I am expecting too much of the Particle platform in expecting reliability. Or perhaps there is some rare and unknown thing I did that is my fault. I doubt that, though, because I rigorously tested the Boron on the same circuit board for weeks before deployment.

Yeah it’s not a good feeling to not find a true root cause. I’m glad we have mostly ruled out a system firmware bug at least. However, it is clear that for whatever reason, the deviceOS program memory got corrupted in a manner that caused an issue. In my experience, that kind of thing is really rare though I saw it once or twice in early deployments.

I will say that if you’re looking for reliability, Gen 2 just has a lot of years on Gen 3. Electron LTE has a shitty lead time right now, but is probably a better choice for many people, honestly. I’m going to be using Gen 3 because of some specific reasons, but if you have that flexibility you might benefit from the extra stability.

I would love to use Electron instead of Boron, but my understanding is that 3G will be dead and inoperable (i.e., no Electron will ever again connect to the internet) in USA in just one year. Is that correct? 2G I understand is already gone.

There is an Electron LTE Cat M1 for sale via the wholesale store. They really haven’t publicized it, but that’s partly because they are still really behind and catching up on orders (partly due to China covid shutdowns in early 2020 I’d guess). Currently a 12-16 week lead time to get your hands on them but you could prototype on an Electron 3G for now.

3G is going away and I can’t recommend it. On T-Mobile they have still not announced an end date but with AT&T starting to sunset in 2021, yes, it’s dying fast.

Btw this assumes of course Particle is your cell provider - the standard format Electron LTE Cat M1 only has an embedded SIM, so you would have to use the E-Series if you wanted to break out a SIM slot IIRC. Edit: hence why I can’t use it :wink:

I am happy you made me realize that the E402 I had purchased and acquired a few months ago - and subsequently abandoned - is LTE capable. I falsely thought it was only 3G. The problem is I would have to solder a USB connector and power supply in order to get started, and even at that point, it seems like it would be functionally the same as the Boron. Or, is the E402 more stable than the Boron if I go through the effort of soldering a USB connector to it and the 2 recommended VIN capacitors? And I can set it up with Particle CLI “particle setup” which does not work on Boron?

The Electron / E Series are on a completely different MCU than Gen 3. They have way more development time behind their current firmware. Definitely more stable.

Overall functionality? Similar at the high level, yeah. Definitely some little differences though.

If you only need a couple for right now, just buy a few E-Series Eval kits (E Series LTE CAT-M1 (NorAm) Evaluation Kit)! They are in stock, have all the connectors soldered in and it’s all ready to go. Otherwise, yeah some soldering and such will be needed. People have been really happy with the E402 from everything I’ve heard.

Pretty sure the particle setup CLI command should work, but I haven’t used it in years - I do it all over the API.

I appreciate the advice!