Hello! I am using a Particle M-SoM with the breakout board running DeviceOS 6.2.1 and using SYSTEM_MODE(AUTOMATIC);
I'm located in a suburban area that typically has good cell coverage. However, the building where I do my development is an industrial warehouse made of metal beams, metal siding, and a metal roof. Thus, my cell signal strength and quality drop significantly when inside. Even so, I can still make calls, send text messages, and browse the web on my iPhone without much difficulty despite the less-than-ideal warehouse situation.
Most of the time, my Particle device is able to connect to the cell network and Particle cloud without much trouble. However, there are times when it drops and simply does not recover for hours or longer. A simple restart fixes the issue, and I'm up and running again. When I'm outside the building, the device works great and never disconnects in the first place.
The problem is that the devices I'm programming will eventually be sent to very rural areas (areas I fear will have cell reliability similar to that of my warehouse) and flipping a switch to reset them is not practical. How can I prevent the device from disconnecting from the cell network or at least reconnect quickly if a connection is lost?
I've seen rickkas7's electronsample code (GitHub: rickkas7/electronsample, Particle Electron sample code for fault tolerance and problem debugging), where he gets debug info from an Electron device and resets the modem, but I'm not sure that this will work for me. I am reading and storing a history of sensor data every 5 seconds. Thus, depowering the device or putting the device to sleep is not an option. I need to stay awake to ensure the data has a consistent history. I'm also using Particle's Software Timers (i.e. multithreading) to manage some time-critical sensor reads. I've read that manually managing the cell connection is often problematic, especially when multithreading.
What can I do to make sure my device stays online or at least comes back online quickly? I can handle outages of 2-3 minutes, but 20+ minutes to hours is not acceptable. Thanks in advance for any help!
You should find out what the device is doing when it's in the state where it is unable to connect. Add this to your code:
SerialLogHandler logHandler(LOG_LEVEL_TRACE);
then monitor the USB serial debug. This will determine what the solution is.
Absolutely do not try the techniques in the electronsample. Almost all of the useful techniques are already implemented in Device OS and implementing them again will make things significantly worse.
Also what version of Device OS are you using? 6.3.3 is recommended for M-SoM.
OK, I updated to Device OS 6.3.3 and added LOG_LEVEL_TRACE to my code. I have not seen it disconnect from the cell network yet, but I will keep monitoring it.
However, I am getting thousands of logs saying:
0000013595 [system.nm] TRACE: Request to power off interface 5
Here's where they begin:
0000013189 [comm.coap] TRACE: Sending CoAP message
0000013203 [comm.coap] TRACE: CON POST /E/particle/device/updates/enabled size=44 token= id=5
0000013229 [comm.coap] TRACE: Sending CoAP message
0000013243 [comm.coap] TRACE: CON POST /E/particle/device/updates/forced size=44 token= id=6
0000013268 [system] INFO: Sending application DESCRIBE
0000013283 [comm.coap] TRACE: Sending CoAP message
0000013297 [comm.coap] TRACE: CON POST /d?\x02 size=99 token=06 id=7
0000013317 [system] INFO: Sending subscriptions
0000013330 [comm.protocol] INFO: Sending subscriptions
0000013344 [comm.coap] TRACE: Sending CoAP message
0000013358 [comm.coap] TRACE: CON GET /e/particle size=15 token= id=8
0000013378 [comm.coap] TRACE: Sending CoAP message
0000013392 [comm.coap] TRACE: CON GET /e/spark size=12 token= id=9
0000013411 [system] TRACE: Waiting until all handshake messages are processed by the protocol layer
0000013438 [system.nm] TRACE: Request to power off interface 5
0000013458 [system.nm] TRACE: Request to power off interface 5
0000013480 [ncp.at] TRACE: > AT+COPS=3,2
0000013504 [ncp.at] TRACE: < OK
0000013510 [ncp.at] TRACE: > AT+COPS?
0000013513 [ncp.at] TRACE: < +COPS: 0,2,"310410",8
0000013514 [ncp.at] TRACE: < OK
0000013516 [ncp.at] TRACE: > AT+QCSQ
0000013518 [ncp.at] TRACE: < +QCSQ: "eMTC",-69,-100,89,-18
0000013521 [ncp.at] TRACE: < OK
0000013495 [system.nm] TRACE: Request to power off interface 5
0000013543 [system.nm] TRACE: Request to power off interface 5
0000013595 [system.nm] TRACE: Request to power off interface 5
0000013630 [system.nm] TRACE: Request to power off interface 5
0000013664 [system.nm] TRACE: Request to power off interface 5
0000013682 [system.nm] TRACE: Request to power off interface 5
0000013700 [system.nm] TRACE: Request to power off interface 5
0000013720 [system.nm] TRACE: Request to power off interface 5
0000013739 [system.nm] TRACE: Request to power off interface 5
0000013757 [system.nm] TRACE: Request to power off interface 5
0000013775 [system.nm] TRACE: Request to power off interface 5
0000013793 [system.nm] TRACE: Request to power off interface 5
0000013811 [system.nm] TRACE: Request to power off interface 5
Where can I find documentation on what these TRACE logs mean?
The logs are not really documented, except for the cellular debug logs.
It can see an AT&T tower (310410), LTE Cat M1 (8).
< +COPS: 0,2,"310410",8
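To make the fields in that response concrete, here is a small sketch in plain C++ that pulls the operator code and access technology out of a +COPS line. It is not part of Device OS; the names CopsInfo, parseCops, and actName are made up for illustration. The only mapping taken from this thread is that 8 means LTE Cat M1 on this modem; the other labels follow the usual 3GPP numbering and should be checked against the modem's AT command manual.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper types and names, for illustration only.
struct CopsInfo {
    std::string mcc;  // mobile country code: first 3 digits of the operator code
    std::string mnc;  // mobile network code: remaining 2 or 3 digits
    int act = -1;     // access technology number reported by the modem
};

// Parses a line containing e.g.  +COPS: 0,2,"310410",8
// Returns false if the line does not have that shape.
bool parseCops(const std::string& line, CopsInfo& out) {
    const std::string::size_type tag = line.find("+COPS:");
    if (tag == std::string::npos) return false;

    // The numeric operator code is the quoted field.
    const std::string::size_type open = line.find('"', tag);
    if (open == std::string::npos) return false;
    const std::string::size_type close = line.find('"', open + 1);
    if (close == std::string::npos) return false;

    const std::string plmn = line.substr(open + 1, close - open - 1);
    if (plmn.size() != 5 && plmn.size() != 6) return false;
    for (char c : plmn) {
        if (!std::isdigit(static_cast<unsigned char>(c))) return false;
    }

    // The access technology follows the closing quote and a comma.
    if (close + 2 >= line.size() || line[close + 1] != ',' ||
        !std::isdigit(static_cast<unsigned char>(line[close + 2]))) {
        return false;
    }

    out.mcc = plmn.substr(0, 3);
    out.mnc = plmn.substr(3);
    out.act = std::atoi(line.c_str() + close + 2);
    return true;
}

// Label for the access technology. Only 8 is confirmed by this thread.
const char* actName(int act) {
    switch (act) {
        case 0: return "GSM";
        case 7: return "LTE";
        case 8: return "LTE Cat M1";  // per this thread, for this modem
        default: return "other (see the modem's AT manual)";
    }
}

// Example: decode the response line from the log above.
void exampleUsage() {
    CopsInfo info;
    const bool ok = parseCops("< +COPS: 0,2,\"310410\",8", info);
    assert(ok && info.mcc == "310" && info.mnc == "410" && info.act == 8);
}
```

For the line in the log, this yields MCC 310 and MNC 410 (the AT&T code mentioned above) with access technology 8.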
However, the repeated power-off requests are not expected. Interface 5 should be the Wi-Fi interface, and it's not clear why it would be attempting and failing to power it down. This may require someone who knows more about automatic connection management, which may be a factor here.
[system.nm] TRACE: Request to power off interface 5
Have you ever configured Wi-Fi on this device, especially for an AP in a different location?
Yes, I have configured WiFi for some applications. I have a web variable that allows me to switch between Cell and WiFi on the fly. When in Cell mode, I call WiFi.off(), but there was a bug in my code that called it repeatedly and triggered the excessive "Request to power off interface 5" logs. I have now fixed that and my logs appear more normal. Thank you for the tip on interface 5 being WiFi.
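For anyone hitting the same flood of logs, the fix amounts to acting only when the requested mode actually changes. Below is a rough sketch of that idea in plain C++, not my actual firmware. The names Link and LinkSwitch are made up, and the callback stands in for whatever the application does on a switch (in my case, calling WiFi.off() when going to cell mode).

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical names, for illustration only.
enum class Link { Cellular, WiFi };

class LinkSwitch {
public:
    // 'apply' is invoked once per real change of mode.
    LinkSwitch(Link initial, std::function<void(Link)> apply)
        : current_(initial), apply_(std::move(apply)) {
        assert(apply_ != nullptr);
    }

    // Safe to call on every loop() pass with the value of the web variable.
    // Returns true if a switch was performed, false if nothing changed.
    bool request(Link wanted) {
        if (wanted == current_) {
            return false;  // already in this mode: do not power anything off again
        }
        current_ = wanted;
        apply_(wanted);
        return true;
    }

    Link current() const { return current_; }

private:
    Link current_;
    std::function<void(Link)> apply_;
};
```

With this guard, calling request(Link::Cellular) on every loop pass runs the callback once instead of thousands of times.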
Getting back to my original question, I have yet to have any of my devices disconnect from the cell network again, so I don't have any more debug info. (I've noticed it happens every 3 days or so, but when it does, it stays down for extended periods of time.)
Nevertheless, I would like to run an idea past you. I currently have a 2 minute hardware watchdog that works well if anything goes wrong in my code. However, the watchdog is not able to reset the device if a cell connection is lost. I'm considering creating a cellular connection watchdog that resets the cell modem if Cellular.ready(); does not return true for more than 5 minutes, and then calling System.reset(); if no connection is made within 10 minutes.
Restarting my device is not desired as I mentioned in my original post, but if it won't connect, I don't know what else to do. Do you think the approach I just mentioned is a good idea and if so, how would you implement it? Specifically, how would you reset the cell modem?
Unrelated, but when using LOG_LEVEL_TRACE and switching from cell to WiFi, I encountered some TRACE logs that showed the WiFi interface starting up but failing to connect to WiFi. The device eventually crashed. After crashing, the device restarted and connected to my WiFi network without any issue. This seemed like an OS level bug to me that someone should look at.
Thank you again for all the help!
Good news and bad news. The good news is that over the weekend, I had 3 Particle M-SoMs using cellular connections on Device OS 6.3.3 and all 3 had zero drops in cell connectivity! The bad news is that because of that, I don't have any logging data to show you what happens when it does fail.
However, I would still love your input on my prior question. I'll restate it here:
"I'm considering creating a cellular connection watchdog that resets the cell modem if Cellular.ready(); does not return true for more than 5 minutes, and then calling System.reset(); if no connection is made within 10 minutes. Do you think this approach is a good idea and if so, how would you implement it? Specifically, how would you reset the cell modem?"
Thanks for helping me on this!
You should wait 12-15 minutes of failing to connect to cellular, not 5 minutes, to allow Device OS to do a full modem reset after 10 minutes. Also, you should go into hibernate sleep mode instead of System.reset(). System.reset() does not disconnect the cellular modem. Sleep for 15 to 30 seconds; hibernate mode will reset the device on wake.
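The timing part of that advice can be sketched as a small timer class in plain C++ that can be tested off-device. This is only an illustration, not code from Device OS: CellularWatchdog and WatchdogAction are made-up names, and the 15-minute value is one choice from the 12-15 minute range above. In firmware, the inputs would presumably come from Cellular.ready() and millis(), and the hibernate step would use SystemSleepConfiguration, as noted in the comments; that part is an untested assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names, for illustration only.
enum class WatchdogAction { None, HibernateReset };

class CellularWatchdog {
public:
    explicit CellularWatchdog(uint32_t timeoutMs) : timeoutMs_(timeoutMs) {
        assert(timeoutMs > 0);
    }

    // Call regularly, e.g. from loop(), with the current connection state
    // and a millisecond uptime counter.
    WatchdogAction update(bool connected, uint32_t nowMs) {
        if (connected) {
            timing_ = false;  // any successful connection clears the timer
            return WatchdogAction::None;
        }
        if (!timing_) {
            timing_ = true;   // first observation of the outage
            downSinceMs_ = nowMs;
            return WatchdogAction::None;
        }
        // Unsigned subtraction stays correct if the counter wraps around.
        if (nowMs - downSinceMs_ >= timeoutMs_) {
            return WatchdogAction::HibernateReset;
        }
        return WatchdogAction::None;
    }

private:
    uint32_t timeoutMs_;
    uint32_t downSinceMs_ = 0;
    bool timing_ = false;
};

// Assumed firmware usage (not compiled here):
//   static CellularWatchdog wd(15UL * 60UL * 1000UL);
//   if (wd.update(Cellular.ready(), millis()) == WatchdogAction::HibernateReset) {
//       SystemSleepConfiguration cfg;
//       cfg.mode(SystemSleepMode::HIBERNATE).duration(30s);
//       System.sleep(cfg);  // the device resets on wake from hibernate
//   }
```

Because the timeout is longer than 10 minutes, Device OS gets the chance to do its own full modem reset before the application steps in.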