Way to automatically powercycle if can't connect to mesh network?

xenon

#1

I currently have ~50 Xenons in a mesh network. Everything is running smoothly. However, if there is an interruption to the gateway and it power cycles, all of the Xenons lose their connection. All 50 devices then try to reconnect to the mesh network at the same time. What I’m seeing is that 48 or 49 can reconnect after a minute or two. However, twice I’ve had 1 or 2 get stuck trying to reconnect (fast cyan flashes) and they never reconnect.

The solution

What I’ve found resolves the issue is to just unplug them and plug them back in. That resets the Xenons and they reconnect with no problem.

Question

Is there a way to write into the firmware ‘If not connected to mesh network after 5 minutes, power cycle’?

I am not sure if this is possible, but it seems like it would solve the issue of random edges not able to connect to the rest of the mesh network.


#2

@emile, 50 devices on a single gateway is great! Is it always the same 1 or 2 that get stuck? Are they directly connecting to the gateway or via hops through other Xenons? In terms of hops, are they the “furthest” away (most hops)? Are they all running the same DeviceOS and application?


#3

So far it seems to be different Xenons each time - however I’ll keep an eye on it.

All running 1.4.2 & same script.

I am not sure if they are the furthest in the mesh. I believe I’ve seen a method in Thread to see what devices are connected to a device, but I’m not sure if I’ve seen that exposed in the Particle docs. Is there a way to do that over the Particle API?


#4

Not that I know of currently. However, you can guess from location/adjacency to other nodes what the path might be. Nonetheless, nodes should always recover their connection IMO.


#5

+1

As far as proximity, they are all in the same room right now as I continue to test the network. Soon they will be spread throughout buildings which will be really interesting to monitor.


#6

@emile, can you test with say 45 devices and see if you get reconnection failures?


#7

Just ran 2 full cycles

  • Round 1: 45 Xenons - a different one didn’t connect
  • Round 2: 51 Xenons - all connected

It’s an intermittent issue. Fortunately there is a simple fix: just unplugging them.

@peekay123 is there a method to power cycle that you know of?


#8

Nope. That has to be done externally. I am sharing your results with the Particle folks because they really only recommend meshes of up to 10 devices, and you have 50 and are getting generally good results from what you have indicated. Pushing the boundaries is a good way to test stuff.


#9

@emile - first, props for fitting so many Xenons on a single Mesh! I don’t think I’ve seen this done before on such a scale. Can you establish a “control” Mesh of the supported 10 (or even 8, for my sanity) devices just so we can ascertain whether or not this condition is a matter of network size?


#10

I can, but I need to sort out some issues with the Raspberry Pi/Lubuntu first (the serial port is giving me trouble). I’ll soon have them broken down into smaller networks.


#11

YES, at least I’ve done it with Borons in the past.
A GPIO pin can pull the EN Pin low causing a complete Power Cycle (Shutdown) and immediate reboot.
Should work the same for a Xenon.


#12

@Rftop awesome! I’ll be testing this tomorrow


#13

I connected the EN Pin to D5 with a 47k Resistor.
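
For anyone wanting to tie the EN-pin trick to the original "not connected after 5 minutes" idea, here is a minimal sketch of the timing logic. `EnPinPowerCycler` is a hypothetical helper, not a Device OS API, and the pin action is injected so the logic can be tried off-device. The `pinMode`/`digitalWrite` calls on D5 in the comment are an assumption based on the wiring described above.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical helper (not part of Device OS): pulls the EN line low once the
// node has been offline for at least timeoutMs.
class EnPinPowerCycler {
public:
    // driveEnLow is injected by the caller. On a Xenon wired as described
    // above it might be (assumption):
    //   [] { pinMode(D5, OUTPUT); digitalWrite(D5, LOW); }
    EnPinPowerCycler(uint32_t timeoutMs, std::function<void()> driveEnLow)
        : timeoutMs_(timeoutMs), driveEnLow_(std::move(driveEnLow)) {
        assert(driveEnLow_);  // a pin action must be supplied
    }

    // Call once per loop() pass with the current millis() value and the
    // connection state. Returns true on the pass where EN was pulled low.
    bool update(uint32_t nowMs, bool connected) {
        if (connected) {
            offline_ = false;
            return false;
        }
        if (!offline_) {  // first pass after losing the connection
            offline_ = true;
            offlineSinceMs_ = nowMs;
            return false;
        }
        // Unsigned subtraction stays valid across millis() rollover.
        if (static_cast<uint32_t>(nowMs - offlineSinceMs_) >= timeoutMs_) {
            driveEnLow_();     // on real hardware the device loses power here
            offline_ = false;  // only reached in off-device tests
            return true;
        }
        return false;
    }

private:
    uint32_t timeoutMs_;
    std::function<void()> driveEnLow_;
    uint32_t offlineSinceMs_ = 0;
    bool offline_ = false;
};
```

On a device you would construct it with a 300000 ms timeout and call `update(millis(), Mesh.ready())` from `loop()`.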


#14

You could also use the System.reset() function for a more managed approach :slight_smile:

Here are some code snippets I use (with SYSTEM_THREAD(ENABLED)) to get a handle on resets etc. You could enable the system watchdog early and have a cloud connection test in setup() or loop() that calls the reset function below, or let the watchdog reset the device if the cloud function hangs.


enum enum_rebootReasons
{
    RESET_UNKNOWN,
    RESET_NETWORK_FAILED,
    RESET_TVIEW_FAILED,
    RESET_CLOUD_FAILED,
    RESET_COMMAND,
    RESET_GRACEFULL,
    RESET_REQUEST,   // reset requested manually or by app/firmware update
    RESET_ATTRIBUTE, // reset after attribute updates for brk or udp
    numOfResetReasons
};

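// Assumed value: the original snippet does not show this constant.
// 20 is large enough for the longest string below plus the terminator.
const int sizeOfResetReasonText = 20;

const char resetReasonsText[numOfResetReasons][sizeOfResetReasonText] = {
    "Unknown",
    "Network failure",
    "tView failure",
    "Cloud failure",
    "Remote command",
    "Graceful request",
    "Reset request",  // from button or cloud
    "Attribute reset" // reset after attribute updates for brk or udp
};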

// ------------------------------------------------------------------------------
void resetDevice(int resetReason) // minimal reset called from system event or reset button
// ------------------------------------------------------------------------------
{
    Serial.printlnf("Reset reason %s", resetReasonsText[resetReason]); // printlnf formats; print does not
    delay(200); // allow serial buffer to flush
    System.enableReset();
    System.reset(resetReason);
}

and then call this on startup so you know what broke …

// ------------------------------------------------------------------------------
void prevResetReason() //
// ------------------------------------------------------------------------------
{
    const int maxSizeOfResetText = 59;
    const int maxSizeOfResetMsg = 60;
    char systemResetMsg[maxSizeOfResetMsg] = "Reset:Non defined reason"; // reset reason not defined in DeviceOS
    uint32_t data = System.resetReasonData();

    switch (System.resetReason())
    {
    case RESET_REASON_PIN_RESET: // Reset button or reset pin
        snprintf(systemResetMsg, maxSizeOfResetText, "Reset:Button or reset pin");
        break;

    case RESET_REASON_POWER_MANAGEMENT: // Low-power management reset
        snprintf(systemResetMsg, maxSizeOfResetText, "Reset:Low-power management");
        break;

    case RESET_REASON_POWER_DOWN: // Power-down reset
        snprintf(systemResetMsg, maxSizeOfResetText, "Reset:Power-down");
        break;

    case RESET_REASON_POWER_BROWNOUT: // Brownout reset
        snprintf(systemResetMsg, maxSizeOfResetText, "Reset:Brownout");
        break;

    case RESET_REASON_WATCHDOG: // Hardware watchdog reset
        snprintf(systemResetMsg, maxSizeOfResetText, "Reset:Hardware watchdog");
        break;

    case RESET_REASON_UPDATE: // Successful firmware update
        snprintf(systemResetMsg, maxSizeOfResetText, "Reset:Successful firmware update");
        break;

    case RESET_REASON_UPDATE_TIMEOUT: // Firmware update timeout
        snprintf(systemResetMsg, maxSizeOfResetText, "Reset:Firmware update timeout");
        break;

    case RESET_REASON_FACTORY_RESET: // Factory reset requested
        snprintf(systemResetMsg, maxSizeOfResetText, "Reset:Factory reset requested");
        break;

    case RESET_REASON_SAFE_MODE: // Safe mode requested
        snprintf(systemResetMsg, maxSizeOfResetText, "Reset:Safe mode requested");
        break;

    case RESET_REASON_DFU_MODE: // DFU mode requested
        snprintf(systemResetMsg, maxSizeOfResetText, "Reset:DFU mode requested");
        break;

    case RESET_REASON_PANIC: // System panic
        snprintf(systemResetMsg, maxSizeOfResetText, "Reset:System panic %lu", data);
        break;

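    case RESET_REASON_USER: // User-requested reset
        // Guard the lookup: the stored data may not be one of our enum values
        snprintf(systemResetMsg, maxSizeOfResetText, "Reset:User %s",
                 data < numOfResetReasons ? resetReasonsText[data] : "Unknown");
        break;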

    case RESET_REASON_UNKNOWN: // Unspecified reset reason
        snprintf(systemResetMsg, maxSizeOfResetText, "Reset:Unspecified reason");
        break;

    case RESET_REASON_NONE: // Information is not available
        snprintf(systemResetMsg, maxSizeOfResetText, "Reset:Reason not available");
        break;
    default: // reset reason not defined in DeviceOS
        break;
    }
    Serial.println(systemResetMsg);
}

#15

I have tried using the application watchdog (AWD) and System.reset(), and this didn’t work. I think you will find the device is not actually stuck in the application.

I haven’t yet tried this as my mesh network is only 8 Xenons large! I have still noticed that there is usually one device that flashes green for a longer time than the others. But they do all connect. This is what I plan to test:

After a sleep or at startup, call Mesh.connect() and then waitFor(Mesh.ready, MESH_READY_WAIT_TIME). If Mesh.ready() is still false after waitFor() times out, either try System.reset() or toggle the EN pin. I currently call waitFor() and test Mesh.ready() before making any Mesh.publish() calls.

Question - do you have a Mesh.subscribe() called before the Mesh.connect()?
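
A rough, host-side model of the plan above, in case it helps reason about it before flashing anything. On a device the real waitFor(Mesh.ready, timeout) call would take the place of `waitForReady`. Everything named here (`waitForReady`, `Recovery`, `chooseRecovery`, the retry counts) is hypothetical, and escalating from a software reset to the EN pin is just one possible policy, not something confirmed in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Host-side stand-in for a "wait with timeout" check. isReady and nowMs are
// injected so the logic can be exercised without hardware.
inline bool waitForReady(const std::function<bool()>& isReady,
                         const std::function<uint32_t()>& nowMs,
                         uint32_t timeoutMs) {
    assert(isReady && nowMs);  // both callbacks must be set
    const uint32_t start = nowMs();
    while (!isReady()) {
        // Unsigned subtraction keeps the comparison valid across rollover.
        if (static_cast<uint32_t>(nowMs() - start) >= timeoutMs) {
            return false;  // timed out, caller decides how to recover
        }
    }
    return true;
}

// Hypothetical recovery choices, taken from the options discussed above.
enum class Recovery { None, SoftwareReset, TogglePowerPin };

// Picks a recovery step: nothing if the mesh came up, otherwise a software
// reset first, and the EN-pin power cycle once maxSoftResets have been tried.
// priorSoftResets would have to survive a reset (for example in retained
// memory), which is an assumption of this sketch.
inline Recovery chooseRecovery(bool meshReady, uint32_t priorSoftResets,
                               uint32_t maxSoftResets) {
    if (meshReady) {
        return Recovery::None;
    }
    return priorSoftResets < maxSoftResets ? Recovery::SoftwareReset
                                           : Recovery::TogglePowerPin;
}
```

The point of the split is that the timeout check only reports whether the mesh came up; what to do about a failure stays in one small function that is easy to change.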


#16

I figured out a way to get those unresponsive Xenons to reset and reconnect successfully after the Boron resets its network connection or is just power cycled.

Check the string of posts starting with the one linked below to see the code I used to reset the Xenons via software, using publish and subscribe functions that still seem to work even when the Xenons are flashing green.

Electron vs. Boron – “living on the edge” – what has changed with the cellular connection?


#17

This is pretty clever! Added, will test this.

One question, did your Xenons not naturally reconnect after the Gateway came back? So far, the vast majority of mine will reconnect after I push a Gateway update. It is only after a minute or two that 1 or 2 don’t seem to reconnect. Those are the only ones I really need to send the system reset. I’ll be circling back to this tomorrow and will share what I end up doing.


#18

Here’s what I have so far. Is there a more elegant way to handle the initial connection than a 10 second delay?

SYSTEM_THREAD(ENABLED)

int start_time;
int interval = 30; // # of seconds

void setup() {
    Serial.begin(9600);
    delay(10000); // Put in a delay so it has a chance to connect before getting into the loop
}

void loop() {
    delay(5000);
    if (!Particle.connected()) {
        // If not connected to Particle, set timer
        if (start_time == 0) {
            start_time = Time.now();
            Serial.println("Start");
        }
        if (start_time > 1) {
            // If timer too long, reset the device
            if (Time.now() > start_time + interval) {
                // Reset
                System.reset();
            } else {
                Serial.println("Offline");
            }
        }
    } else {
        if (start_time > 1) {
            // Must reset after reconnecting
            Serial.println("Reset");
            start_time = 0;
        }
    }
}

#19

Big delays are generally bad / indicate the wrong architecture of your code. Why do you need to give it a chance to connect before you hit loop? If you really do need to do that, simply track the time with a variable.

Why are you using Time.now()? If the device hasn’t connected to the internet since the power cycle, it will return an invalid value; the time is only set AFTER Particle is connected. You should use millis() instead to get milliseconds since reset.

Also, you never initialize start_time, so while in practice it probably is zero, that is not guaranteed. Set start_time = 0; in your setup. But also why are you only starting the timer in your loop to track the duration to connect? Just start it in setup and then increase your interval.

Here is what I would do based on what I think you are trying to do:

SYSTEM_THREAD(ENABLED);

const uint32_t connectivity_message_interval = 5000;  // 5 sec
uint32_t last_connectivity_message_time = 0;  // controls how often we send the offline message

const uint32_t connectivity_timeout = 180000UL;  // 3 min b/c 30sec is pretty short, but up to you.
uint32_t last_connectivity_change_time = 0;

bool is_particle_connected; // flag to handle the moment of connectivity change

void setup() {
    Serial.begin(9600);
    
    Mesh.on();  // potentially needed due to bug when mesh module is not already powered up.
    Mesh.connect();
    Particle.connect(); 
    
    is_particle_connected = Particle.connected();
    
    Serial.println("Finished setup");
}

void loop() {
  Particle.process();
  
  if (!Particle.connected()) {
    if (is_particle_connected) {
      // handle moment of disconnection
      is_particle_connected = false;
      last_connectivity_change_time = millis();
      Serial.println("Just Went Offline");
    }
    if ((millis() - last_connectivity_change_time) > connectivity_timeout) {
      // probably an issue connecting, we'll try to fix by resetting
      Serial.println("Resetting due to connectivity timeout...");
      delay(1000);  // allow serial message to get read out 
      System.reset();
    }
    if ((millis() - last_connectivity_message_time) > connectivity_message_interval) {
      last_connectivity_message_time = millis();
      Serial.println("Still Offline");
    }
  }
  else {
    if (!is_particle_connected) {
      // handle moment of connection
      is_particle_connected = true;
      last_connectivity_change_time = millis();
      Serial.println("Just Came Online");
    }
    // We are connected!  Time for normal connected stuff...
    if ((millis() - last_connectivity_message_time) > connectivity_message_interval) {
      last_connectivity_message_time = millis();
      Serial.println("Still Online");
    }
  }
}
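
A side note on why the `millis() - last_..._time` comparisons in the code above are written as a subtraction: with unsigned 32-bit values the difference stays correct even when the counter rolls over, whereas comparing against `start + timeout` can misfire near the rollover. A small sketch that can be compiled on a PC (the function names are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Elapsed milliseconds between two readings of a 32-bit millisecond counter.
// Unsigned arithmetic is modulo 2^32, so this is right across one rollover.
constexpr uint32_t elapsedMs(uint32_t nowMs, uint32_t startMs) {
    return static_cast<uint32_t>(nowMs - startMs);
}

// Rollover-safe timeout check, same shape as the comparisons in loop() above.
constexpr bool hasTimedOut(uint32_t nowMs, uint32_t startMs, uint32_t timeoutMs) {
    return elapsedMs(nowMs, startMs) > timeoutMs;
}

// Fragile form shown for comparison: startMs + timeoutMs can wrap to a small
// number, which makes the check fire far too early.
constexpr bool hasTimedOutNaive(uint32_t nowMs, uint32_t startMs, uint32_t timeoutMs) {
    return nowMs > static_cast<uint32_t>(startMs + timeoutMs);
}

// Timer started 256 ms before the counter wraps, checked 128 ms later with a
// 3 minute timeout: only the subtraction form gets it right.
static_assert(!hasTimedOut(0xFFFFFF80u, 0xFFFFFF00u, 180000u), "not timed out yet");
static_assert(hasTimedOutNaive(0xFFFFFF80u, 0xFFFFFF00u, 180000u), "naive form misfires");
```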

#20

Thanks for taking a look at my code. This looks much better than what I was doing. TBH I’m learning on the fly (Python eng tinkering in hardware), so every place you think ‘Huh, that looks odd,’ chalk it up to me just trying to get C++ working :slight_smile:

Yeah, this seemed odd to me too, so I figured I should ask if there’s a smarter way.

This was just a way to track how long an action has/hasn’t taken place. millis() definitely looks like the better solution. Thanks for the explanation.

This was a real headache to fight, because you are right: there were cases where it wasn’t 0 and I couldn’t figure out why.

Thanks again for the help!