Electron sample code for fault tolerance and problem debugging


#1

Most of the time the Electron just works, happily connecting and disconnecting from the cellular network and all is good. Except once in a while, some devices have a hard time getting reconnected. Disconnecting the battery and power usually solves the problem, but that can be a pain for remote devices.

This experiment is some code that might help recover from this situation automatically, and also log some data to better figure out what’s actually happening, so if it’s a bug, it can just be fixed so special code won’t be necessary in the future.

The code is not designed to be a drop-in replacement for anything, but a set of tools you can copy and paste into your code to help add fault tolerance to your app. You can also just flash [electronsample.cpp](https://github.com/rickkas7/electronsample/blob/master/electronsample.cpp) to an Electron and run it for fun.

The latest version of this project is here:
[https://github.com/rickkas7/electronsample](https://github.com/rickkas7/electronsample)

If you have suggestions or other ideas for things to check, feel free to add a comment below or a GitHub issue!

Using the Connection Event Log

This code sets aside a small portion (522 bytes) of retained memory for the connection event log. This allows various events and debugging information to be saved across restarts and uploaded via Particle.publish once a cloud connection has finally been made.

It shows when cellular and cloud connectivity was established or lost, along with timestamps and many other things of interest.

In the Event Log in the Particle Console you might see something like:

1470912226,1642,1,1;1470912228,3630,2,1;

The fields are:

  1. The date and time (Unix time, seconds since January 1, 1970).

  2. The value of millis().

  3. The event type. In the first record it’s 1, CONNECTION_EVENT_CELLULAR_READY.

  4. The event data, 1, which means cellular just went up. It would be 0 if cellular went down.

Each record is separated by a semicolon.
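If you want to decode that string somewhere other than the Console, here is a minimal sketch in plain C++. It only assumes the four-field, semicolon-separated layout described above; the names ConnectionEventRecord and parseEventLog are made up for illustration and are not part of electronsample.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

// One decoded record from the connection event log (hypothetical type).
struct ConnectionEventRecord {
    uint32_t unixTime;  // field 1: seconds since January 1, 1970
    uint32_t millisAt;  // field 2: value of millis() when the event was logged
    int eventType;      // field 3: event type code
    int eventData;      // field 4: event-specific data
};

// Split "time,millis,type,data;time,millis,type,data;" into records.
// Records that do not have four numeric fields are skipped.
std::vector<ConnectionEventRecord> parseEventLog(const std::string &log) {
    std::vector<ConnectionEventRecord> out;
    std::istringstream records(log);
    std::string rec;
    while (std::getline(records, rec, ';')) {
        if (rec.empty()) {
            continue;
        }
        unsigned long t = 0, ms = 0;
        int type = 0, data = 0;
        if (std::sscanf(rec.c_str(), "%lu,%lu,%d,%d", &t, &ms, &type, &data) == 4) {
            out.push_back({static_cast<uint32_t>(t), static_cast<uint32_t>(ms), type, data});
        }
    }
    return out;
}
```

Feeding it the example string from the Console gives two records, the first with event type 1 and the second with event type 2.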

Connection Debugging

If it takes too long to make a connection to the cloud the code will attempt a reset. The connection event log will indicate whether it was able to make a cellular connection or not. If there is cellular but not cloud, it will also do some additional tests. It currently will ping the Google DNS (8.8.8.8) and also the Particle API server (api.particle.io, which also tests DNS). These results are included in the connection event log.

It also has listening mode detection, and will reset the Electron if the Electron somehow ends up in listening mode, blinking dark blue, for more than 30 seconds.
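The timing part of the listening mode check can be sketched like this. It is a simplified illustration, not the code from the repository: ListeningModeTimer is a hypothetical name, and on a device you would presumably feed it Cellular.listening() and millis() from loop() and reset when it returns true.

```cpp
#include <cassert>
#include <cstdint>

// Tracks how long the device has been continuously in listening mode.
// Hypothetical helper for illustration only.
class ListeningModeTimer {
public:
    explicit ListeningModeTimer(uint32_t timeoutMs = 30000) : timeoutMs_(timeoutMs) {}

    // Call on every pass through loop(). Returns true once listening mode
    // has lasted for more than timeoutMs without interruption.
    bool update(bool listening, uint32_t nowMs) {
        if (!listening) {
            active_ = false;
            return false;
        }
        if (!active_) {
            active_ = true;
            startMs_ = nowMs;
        }
        // Unsigned subtraction stays correct when the millisecond counter wraps.
        return static_cast<uint32_t>(nowMs - startMs_) > timeoutMs_;
    }

private:
    uint32_t timeoutMs_;
    uint32_t startMs_ = 0;
    bool active_ = false;
};
```

Leaving listening mode restarts the timer, so a brief blink of dark blue never adds up to a reset.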

Smart Reboot

One of the things people have done is force a reset using System.reset(). This works great on the Photon, but on the Electron it has one flaw: it doesn’t reset the modem. This is normally advantageous because it saves data and makes reconnecting much faster, but it is not helpful if you’re trying to recover from an unknown connection problem.

One way to solve this is to enter SLEEP_MODE_DEEP for a short period of time, but in this case I used a technique of sending an AT+CFUN=16 command directly to the modem to reset both the modem and the SIM card. Maybe that will help.

Also, because it’s more efficient to leave the modem up on a reset, this code tries to do that first, and if that fails, then resets the modem. It also does certain checks first, so if the modem is really acting strangely, it will immediately do a modem reset.
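The decision itself boils down to something like the sketch below. This is only the shape of the logic as described above, with assumptions labeled: ResetKind, chooseReset, and the idea of counting failed soft resets (which would have to live in retained memory to survive the reset) are hypothetical, and the actual checks in electronsample.cpp may differ.

```cpp
#include <cassert>

// Which kind of reset to perform (hypothetical names).
enum class ResetKind {
    SoftReset,   // reset the processor only and leave the modem up
    ModemReset   // also reset the modem
};

// failedSoftResets: consecutive soft resets that did not restore the
//                   connection (assumed to be kept across resets).
// modemLooksUnhealthy: result of whatever pre-reset checks are run.
ResetKind chooseReset(int failedSoftResets, bool modemLooksUnhealthy) {
    if (modemLooksUnhealthy) {
        // The modem is acting strangely, so skip the cheap option.
        return ResetKind::ModemReset;
    }
    if (failedSoftResets >= 1) {
        // The cheap option was already tried and did not help.
        return ResetKind::ModemReset;
    }
    return ResetKind::SoftReset;
}
```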

Application Watchdog

This code uses the application watchdog so if you don’t return from loop for more than a minute, it resets the Electron, adding a connection event log entry so you know you have a blocking bug in your code.


#2

New version of electronsample!

  • The new version includes a library and all of the useful features are split out into classes so you can just include the ones you want instead of copying and pasting them into your code. All of the features are optional now.
  • Battery monitoring can sleep the Electron for a period of time when the battery is low to prevent flash corruption.
  • Modem reset does both AT+CFUN=16 and SLEEP_MODE_DEEP to maximize the chance of success.
  • New session monitoring can publish an event subscribed to on the device at a configurable frequency and reset the session if the events are lost multiple times.
  • New node.js script can monitor and decode the connection events.
  • ApplicationWatchdog uses an increased stack size to prevent problems on 0.7.0 and later.
  • Tester functions for periodic publish, entering safe mode, more sleep modes, and reset session.
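The session monitoring idea (publish an event, expect to see it come back, reset the session after several misses) can be sketched as a small counter. This is an illustration under assumptions, not the library class: SessionLossCounter and its method names are hypothetical, and the publish, subscribe, and timeout plumbing is left out.

```cpp
#include <cassert>

// Counts consecutive loopback events that never arrived (hypothetical helper).
class SessionLossCounter {
public:
    explicit SessionLossCounter(int maxLost) : maxLost_(maxLost) {}

    // The published event came back through the subscription: session is healthy.
    void eventReceived() { lost_ = 0; }

    // The published event did not come back in time.
    // Returns true when enough events in a row were lost to justify a session reset.
    bool eventLost() {
        lost_++;
        if (lost_ >= maxLost_) {
            lost_ = 0;  // start counting again after the reset
            return true;
        }
        return false;
    }

    int lostCount() const { return lost_; }

private:
    int maxLost_;
    int lost_ = 0;
};
```

Requiring several misses in a row keeps a single dropped event from triggering a reset.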

As before, the code can be configured to record various events (cellular connect and disconnect, cloud connect and disconnect, and many more). These are stored in retained memory and uploaded when there’s a working cloud connection. But new in this version is a live decoding and printing tool:

eseries1,2018-05-10T22:39:39.000Z,62,SETUP_STARTED
eseries1,2018-05-10T22:39:41.000Z,2576,CELLULAR_READY connected
eseries1,2018-05-10T22:39:45.000Z,6388,CLOUD_CONNECTED connected
electron1,2018-05-10T22:45:42.000Z,164,SETUP_STARTED
electron1,2018-05-10T22:45:45.000Z,2912,CELLULAR_READY connected
electron1,2018-05-10T22:45:57.000Z,15169,CLOUD_CONNECTED connected
electron2,2018-05-11T00:08:36.000Z,54,SETUP_STARTED
electron2,2018-05-11T00:08:36.000Z,54,RESET_REASON RESET_REASON_PIN_RESET
electron2,2018-05-11T00:08:48.000Z,11866,CELLULAR_READY connected
electron2,2018-05-11T00:08:50.000Z,13935,CLOUD_CONNECTED connected
electron1,2018-05-11T04:33:27.000Z,20865018,SESSION_EVENT_LOST
electron1,2018-05-11T04:34:13.000Z,20913003,CLOUD_CONNECTED disconnected
electron1,2018-05-11T04:34:23.000Z,20922675,CLOUD_CONNECTED connected

In the same location as before:
[https://github.com/rickkas7/electronsample](https://github.com/rickkas7/electronsample)


#3

Damn! That’s a hell of an update! :slight_smile:

This is some much-needed code for some of the most common Electron issues that pop up on the forum.

So does this apply to the E Series just as much as the original 2G & 3G Electrons?

@Vitesze You might want to check this out.

I’ll try these out on the 3G Electron and see how it goes.


#4

Yes, I tested it on the G350 (2G), U260 (3G Americas), and E Series E310 (U201). Also with 0.6.4, 0.7.0, and 0.8.0-rc.3. Works great!


#5

Thanks, I’m putting my Electrons in Deep Sleep atm for a couple of seconds and it seems to usually fix my issues, but this looks like a lot of cool new stuff to try out :wink:


#6

This seems to be a very interesting & useful library that can certainly give some insights into connectivity issues. Some of the weird little tricks suggested in these forums, which seem to be needed every now and again to give the cellular module a kick in the rear, have had me concerned about the Electron variant of our product, and here they are condensed into one library for me to peruse and extract.

Out of curiosity I added this library to a project and flashed it to a device in what appears to be a very poor signal area. In its current form you get much the same effect as without the library: a brick. The connection is always dropping and the library in its default state keeps trying to bring it back, although there is a very brief window where loop() manages to sneak a look in between attempts, which is a small step forward.

It’s remarkable how the eventual arrival of something resembling actual spring and some lush foliage can turn a just about OK signal into a totally useless one.


#7

I think that there is a bug in this library or in 0.8.0-rc.4.

I actually think that the bug is in 0.8.0-rc.4 but it presents here.

// This is called when timing out connecting to the cloud. It adds some debugging events to
// help log the current state for debugging purposes.
// It returns true to force a modem reset immediately, false to use the normal logic for whether to reset the modem.
bool ConnectionCheck::cloudConnectDebug() {
	int res = Cellular.command(pingTimeout, "AT+UPING=\"8.8.8.8\"\r\n");
	ConnectionEvents::addEvent(ConnectionEvents::CONNECTION_EVENT_PING_DNS, res);

	res = Cellular.command(pingTimeout, "AT+UPING=\"api.particle.io\"\r\n");
	ConnectionEvents::addEvent(ConnectionEvents::CONNECTION_EVENT_PING_API, res);

	// If pinging api.particle.io does not succeed, then reboot the modem right away
	return (res != RESP_OK);
}

Here is the log:
20.537 AT send 35 "AT+UPING=\"some_address.com\"\r\n"
20.548 AT read OK 6 "\r\nOK\r\n"
423 :loop > ping Some Address = -2
20.548 AT send 25 "AT+UPING=\"1.2.3.4\"\r\n"
20.559 AT read ERR 37 "\r\n+CME ERROR: operation not allowed\r\n"
426 :loop > ping IP Address = -3
21.794 AT read + 65 "\r\n+UUPING: 1,32,\"some_address.com\",\"1.2.3.4\",119,909\r\n"
22.236 AT read + 65 "\r\n+UUPING: 2,32,\"some_address.com\",\"1.2.3.4\",119,221\r\n"
22.678 AT read + 65 "\r\n+UUPING: 3,32,\"some_address.com\",\"1.2.3.4\",119,220\r\n"
23.230 AT read + 65 "\r\n+UUPING: 4,32,\"some_address.com\",\"1.2.3.4\",119,226\r\n"

You will note that I am trying to ping one address and then the next. The modem returns OK immediately, and then the Electron issues the next ping command, which fails with "operation not allowed" because the first ping is still in progress. Eventually the URCs for the first ping arrive.
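In other words, OK only means the ping was started, and the real result is in the +UUPING lines that follow. A minimal sketch of picking one of those lines apart, in plain C++ so it can be tried off-device: PingReply and parseUuping are hypothetical names, and the field meanings (sequence number, packet size, host name, IP address, TTL, round-trip time in milliseconds) are my reading of the log above rather than something confirmed here. How the lines get from the modem to this function is left out.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// One parsed +UUPING line (hypothetical type, field meanings assumed).
struct PingReply {
    int seq;           // reply number, 1..4 in the log above
    int size;          // packet size in bytes
    std::string host;  // host name that was pinged
    std::string ip;    // address that answered
    int ttl;           // time to live
    int rttMs;         // round-trip time in milliseconds
};

// Returns true and fills out if line is a +UUPING URC; false for anything
// else, such as OK or +CME ERROR responses.
bool parseUuping(const std::string &line, PingReply &out) {
    char host[64];
    char ip[48];
    int seq = 0, size = 0, ttl = 0, rtt = 0;
    // The leading space in the format skips the "\r\n" before the URC.
    int n = std::sscanf(line.c_str(),
                        " +UUPING: %d,%d,\"%63[^\"]\",\"%47[^\"]\",%d,%d",
                        &seq, &size, host, ip, &ttl, &rtt);
    if (n != 6) {
        return false;
    }
    out = PingReply{seq, size, host, ip, ttl, rtt};
    return true;
}
```

With something like this, the second AT+UPING would only be sent after the last reply for the first one has been seen (four in the log above) or a timeout has expired.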

The reason I think it is 0.8.0-rc.4 is that the following AT command, which I issue to see whether the modem is on, used to work in 0.8.0-rc.1 and definitely doesn’t work now. (I have posted an issue on the forum about this previously.)

   // Check that the modem is on and ready for AT commands
   SERIAL_DEBUG("Checking if modem is on");
   int response = Cellular.command(5000, "AT\r\n");

   // Retry up to 5 times, turning the modem on before each retry
   //while( response != RESP_OK ){
   for (int i = 0; i < 5; i++) {
      if (response == RESP_OK) {
         SERIAL_DEBUG("Modem is on!");
         break;
      }
      else {
         SERIAL_DEBUG("Modem not responding... must not be on! retry %d", i);
         Cellular.on();
         response = Cellular.command(5000, "AT\r\n");
      }
      Particle.process();
   }

With the while loop (rather than the for loop) it will sit there for a very long time before eventually coming online. Unfortunately I haven’t captured a log of this to see what is actually happening.

Altogether, something has changed in the way the modem handler works in 0.8.0-rc.4.


#8

Hey @rickkas7,
I wanted to confirm whether you would still recommend using this code on Electrons today, or whether there is something newer around.
Thanks!
Gustavo.