Thank you! Looking forward to hearing the results!
Hey, here’s an update on our debugging process:
- We’ve run our two 3G Electrons with our full code suite since Wednesday, and they are still going strong! Running this long without a freeze has been rare among our 2G Electrons, so great news there!
- The Tinker software has run on thirteen 2G Electrons over the weekend with unreliable cellular coverage, and none have frozen. This is basically unheard of with 2G Electrons in the environment we run them in.
These results should be good enough to conclude that the Tinker software works with 2G Electrons in our desired setting, meaning there’s something between the Tinker code and our “barebones” code that does not play nice. One of the differences is SYSTEM_MODE: AUTOMATIC vs SEMI_AUTOMATIC. The reason we’re not running AUTOMATIC is that we need the devices to be responsive to user interaction when offline, and as fast as possible when they boot up. It is not acceptable for the devices to be unusable until a connection is established. With that said:
- Are there any important undocumented differences between these modes that may cause our freeze?
- Are there any stock code examples (like Tinker) available with SEMI_AUTOMATIC that we can test on our devices?
In the meantime we will modify our own barebones code to run SYSTEM_MODE(AUTOMATIC) and see what happens.
So far I have not had problems with the 2G Electron freezing. I use SEMI_AUTOMATIC mode.
You can also use the fault tolerance code by @rickkas7.
And the pursuit of a bug free hardware continues…
- Again, the two 3G Electrons are still going strong. No crash after a week of continuous operation, with the exact same code that the 2G Electrons ran and crashed on.
- We tried running the aforementioned crashing code on our 2G Electrons with one change: removing all display related code, and voilà, no crash! This may simply be a coincidence, since we only tried this over one night. But now suddenly the suspect seems to be the physical display, or the display related code. Has anyone had similar problems? It’s a 0.96" SPI screen, with a current draw of ~25mA, and we’re using the Adafruit SSD1306 lib from the web IDE. We’re using the pins A3, A5, D0 & D1 for SCL, SDA, DC & RST respectively.
- We’ve now run the rickkas7 “Electron Debug” code for 48 hours without any crash. This is on Electrons that have a physical display connected, but no display related libraries included.
We’ll continue to investigate the display and see if we’re able to crash a clicker without the Adafruit SSD1306 library included.
I’m having similar issues with a test module, and we don’t use the screen or any other screen code.
We also use a battery well in excess of our needs. The freeze appears to happen after a certain interrupt is triggered.
The interrupt has SMS code in it as well as Particle.publish() calls.
Not sure yet what is causing the system hangs, but we get similar unresponsive behaviour where interrupts are ignored and the device drops off the net.
We use multi-threading too… humm
EDIT: Correction - other interrupts seem to activate functions, but don’t bring it back onto the cloud. The interrupt that triggered the issue doesn’t re-trigger. :\
A rule of thumb is not to put complex stuff into an ISR.
Set a flag and do the complex stuff in the main loop.
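For anyone wondering what that looks like in practice, here is a minimal off-device sketch of the flag pattern in plain C++. The names `onInterrupt`, `loopOnce` and `slowPublish` are made up for illustration; `slowPublish` is only a stand-in for blocking work such as a cloud publish or an SMS, and `g_inIsr` merely simulates "we are in interrupt context" so the rule can be checked with an assert.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <vector>

static std::atomic<bool> g_inIsr{false};         // simulated "in interrupt context" marker
static std::atomic<bool> g_eventPending{false};  // set by the ISR, cleared by the loop
static std::vector<std::string> g_published;     // records what the stub "sent"

// Hypothetical stand-in for slow, blocking work (cloud publish, SMS, ...).
void slowPublish(const std::string& name) {
    assert(!g_inIsr.load());  // must never be reached from interrupt context
    g_published.push_back(name);
}

// Interrupt handler: note that something happened and return immediately.
void onInterrupt() {
    g_inIsr.store(true);
    g_eventPending.store(true);
    g_inIsr.store(false);
}

// One pass of the main loop: the slow work happens here, not in the ISR.
void loopOnce() {
    // exchange() reads and clears the flag in one atomic step, so an
    // interrupt that fires between the read and the clear is not lost.
    if (g_eventPending.exchange(false)) {
        slowPublish("event");
    }
}
```

Note that several interrupts arriving before the next loop pass collapse into one publish; if every event has to be counted, a counter would be needed instead of a single flag.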
Can do - but why is that the case? What should I be aware of that causes that to occur?
I also can’t figure out why my watchdog doesn’t kick in and restart the system to bring it back to life when it hangs…
The watchdog is reset at the end of the loop, but it seems the device can’t escape the interrupt function and hangs when it runs a publish.
That has to do with interrupt priorities, masking, preemption, reentrancy and timing, which would go a bit too far here.
Also, the application watchdog is a software construct which relies on the controller executing system processes, and those will probably be blocked in a deadlock situation.
Dealing with interrupts needs a somewhat more fundamental understanding of the hardware than pure application programming.
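To illustrate the point about a watchdog being a software construct, here is a toy model in plain C++. It is not how the Particle application watchdog is implemented; `SoftWatchdog` and its methods are hypothetical names. What it shows is that a timeout is only noticed when something still running calls `expired()`.

```cpp
#include <cassert>
#include <cstdint>

// Toy software watchdog: the application checks in, a supervisor polls expired().
class SoftWatchdog {
public:
    explicit SoftWatchdog(std::uint32_t timeoutMs) : timeoutMs_(timeoutMs) {
        assert(timeoutMs > 0);
    }

    // Called by the application, e.g. at the end of every loop pass.
    void checkin(std::uint32_t nowMs) { lastCheckinMs_ = nowMs; }

    // Called by a supervising context. If that context is blocked as well,
    // nobody evaluates this and the timeout goes unnoticed.
    bool expired(std::uint32_t nowMs) const {
        // Unsigned subtraction stays correct when the ms counter wraps around.
        return (nowMs - lastCheckinMs_) >= timeoutMs_;
    }

private:
    std::uint32_t timeoutMs_;
    std::uint32_t lastCheckinMs_ = 0;
};
```

A hardware watchdog differs in exactly this respect: the expiry check runs in a peripheral rather than in code that can be blocked along with everything else.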
Thanks - in terms of deadlocks, what would that mean?
What two components are fighting against each other?
E.g. your interrupt calls a function that uses an interface which in turn uses a lower priority interrupt than the one that triggered your ISR. The interface will wait for the calling ISR to end, while that ISR is waiting for the function to return. Add multiple layers of indirection and your debugging nightmare is born.
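That scenario can be written down as a tiny model. This is only a sketch: it assumes the ARM Cortex-M convention where a numerically lower value means a more urgent interrupt, and the function names are invented for illustration.

```cpp
#include <cassert>

// A pending interrupt can preempt the running handler only if it is
// strictly more urgent (numerically lower under the assumed convention).
inline bool canPreempt(int runningPriority, int pendingPriority) {
    return pendingPriority < runningPriority;
}

// An ISR running at isrPriority calls a function that blocks until an
// interrupt at neededPriority has been serviced. If that interrupt cannot
// preempt the ISR, neither side can make progress.
inline bool wouldDeadlock(int isrPriority, int neededPriority) {
    assert(isrPriority >= 0 && neededPriority >= 0);
    return !canPreempt(isrPriority, neededPriority);
}
```

For example, `wouldDeadlock(2, 5)` is true: the ISR at priority 2 waits on work that needs the priority 5 interrupt, which cannot run until the ISR returns.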
Adjusting some of the code now to see what happens - let’s hope for some good changes
Seems to have fixed the problem: running Particle.publish(…) and sending an SMS in an interrupt seems to cause the device to hang.
Surprisingly, the SMS was received successfully (using the onboard SIM to send the SMS) … but Publish never went through.
Spoke too soon… back to fast Cyan flashing and every so often going red…
Humm… what else can cause that?
Thanks for sharing your test processes quite elaborately.
We are facing similar issues (Electron 3G freezes on solid cyan after 2 days of operation). It starts working normally after a reset for another 2 days, then requires a reset again.
We would like to know what you are up to with testing. Did you find out what was causing the freeze?
Your advice would be invaluable for us. Here is a link to our application
I’m beginning to wonder if cellular providers are growing tired of all the “things” that request a line to the network and then, even though next to no signal traffic is taking place, expect the line to stay up with no interruptions of any kind for as long as the “thing” needs it. The fact that cellular providers allow such access to the network is a wonder in itself. I’m old enough to remember the messages the phone companies gave when a handset was picked up and a connection made, but no talking took place. Sooner or later the line was disconnected with a message about hanging up the handset (back when there were pretty much only landlines): “please hang up the phone”. Or if someone picked up the handset and never made a call, that weird beeping tone at high volume would force you to hang up or go crazy listening to it… Over the last month or so I have come to the conclusion that the cellular carriers in my area are growing tired of “things”, and if a connection does not seem to be doing much, it is dropped by the carrier. Anyway, you might spend hours trying to debug an issue thinking something is wrong with the device, when it has nothing to do with the device except that it has a connection and is doing next to nothing with it.
Thanks for sharing your research, I’m having the same issues on connection when I’m waking the device from sleep.
I’m running 4 different Electrons overnight, all running the same code based on the 0.6.2 firmware with both modes enabled:
1 Electron keeps getting stuck every couple of hours, solid cyan, which looks kind of bright white?
We are clueless. Our customers have returned the product because in their environment it freezes more often, I believe due to connectivity issues.
Is there any way to debug the Electron while it is in that state?
I have the same behaviour: after a while, the Electron 3G-EU freezes with a solid green or cyan LED, sometimes after resetting itself out of the blue 5-20 times first.
As @ScruffR pointed out, it’s “most likely a deadlock between […] code and the system”, but I’m currently not able to find it.
- Electron3G - EU (270), 50 devices, firmware 0.7.0
- Each one on a custom PCB also using D0-D1 for I2C devices (and a lot of other IO for busses, ADC, PWM,…). - will test later without custom pcb
- Regular access to power from a 5V 600mA source (99% of the time) + small high current LiPo backup battery, 200mAh 25C. PMIC current limit set to 500mA asap on startup
- interrupt on timer0 @1ms, with a handler used for scheduling (30-50 + == % operations on flags, no function call). - will test later without it
- interrupt on LOW_BAT_UC with a small handler calling millis()
- Automatic mode, - will try semi auto but without faith
- Thread enabled - will try disabled but will probably be a pain for user experience
- uses ~10 Particle.function + ~10 Particle.variables declared at setup()
- uses if(Particle.connected() ) Particle.publish() every time
- uses max 1 Particle.publish() per 30 sec.
- no watchdog - did not test the application watchdog yet. Looking forward to the hardware one on 0.8.0-Rc5
- no sleep (yet). A7 is declared as an input and read every ~20ms.
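For reference, the two publish rules in the list above (only publish when connected, at most one publish per 30 seconds) can be sketched as a small gate in plain C++. `PublishGate` and `tryAcquire` are hypothetical names, not part of the firmware; on the device, `nowMs` would come from `millis()` and `connected` from `Particle.connected()`.

```cpp
#include <cassert>
#include <cstdint>

// Decides whether a publish is allowed right now.
class PublishGate {
public:
    explicit PublishGate(std::uint32_t minIntervalMs) : minIntervalMs_(minIntervalMs) {
        assert(minIntervalMs > 0);
    }

    // Returns true and records the time if a publish may go out at nowMs.
    bool tryAcquire(std::uint32_t nowMs, bool connected) {
        if (!connected) {
            return false;  // never attempt a publish while offline
        }
        // Unsigned subtraction stays correct when the ms counter wraps around.
        if (hasPublished_ && (nowMs - lastMs_) < minIntervalMs_) {
            return false;  // too soon after the previous publish
        }
        lastMs_ = nowMs;
        hasPublished_ = true;
        return true;
    }

private:
    std::uint32_t minIntervalMs_;
    std::uint32_t lastMs_ = 0;
    bool hasPublished_ = false;
};
```

With `PublishGate gate(30000);` the main loop would call `gate.tryAcquire(now, connected)` and only publish when it returns true.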
How to reproduce the problem:
- in one of our test locations, with a poor network and bad RSSI (2G only probably? How can I tell it’s only 2G, by the way?), when taking the device inside an old house with big walls (poor wifi coverage as well, fwiw), the cell RSSI drops even lower, the device rapidly loses connection, and the “death cycle” begins. A startup sound on the custom PCB buzzer makes it easy to notice. The device resets itself (how? low power from the U270 current peak? I have no watchdog) until it finally lands on a solid green or cyan LED.
- during the “death loop” the devices never reach the cloud for a spark/device/last_reset… message.
- at our 2 other test locations (60km and 30km away respectively), we experienced the problem once or twice so far (12 products used 24 hours a day for 3 days)
- better network and less publish leads to “almost no problem” which is not enough :-/
How to clear the problem:
- with a manual reset on the reset pin. Not easy due to the casing. Unacceptable, and unreachable for the final customer
- resetting the device at the same location results in a new “death cycle” shortly after…
How to solve the problem: open
- with more decoupling on the cell module? With a battery capable of bursting enough current? Is the total parasitic inductance relevant?
- with a hardware watchdog (resetting is ok for our application but freezing obviously not)
- by finding the deadlock - can this explain a reset?
I’ll continue to post updates. Any help would be great!
Hi @tdasnoy, unfortunately we weren’t able to fix that issue no matter what we tried.
We ended up not using Particle anymore.
I hope you’ll be able to fix that issue, please keep me posted if you do.
- interrupt on timer0 @1ms, with a handler used for scheduling (30-50 + == % operations on flags, no function call). - will test later without it => changes nothing
- Automatic mode, - will try semi auto but without faith => changes nothing
- Thread enabled - will try disabled but will probably be a pain for user experience => without “SYSTEM_THREAD(ENABLED);” No problem so far… but product almost unusable due to wait time. “most likely a deadlock” confirmed?
I have enabled the application watchdog and am logging the reset reason to an external SPI flash.
Will test after posting this
Then, I’ll try logging as much as possible to find the deadlock cause. Wish me luck!