Xenon requires Particle.keepAlive when gateway is using third-party SIM

I’m having a situation where I’m constantly losing the connection to my Xenon. Both the Boron gateway and the Xenon are breathing cyan and neither is reporting disconnect events. I can ping the Boron and invoke function calls, and it behaves as expected for hours or days.

The Xenon seems to think it’s connected (I have code in the Xenon to check Particle.connected() and it never detects a problem). Messages published by the Xenon also seem to come through at any time. In fact, whenever the Xenon publishes a message I can ping it and invoke function calls for a short time. But after two or three minutes, it becomes unpingable. If/when the Xenon publishes another message, I can ping it again.

I can somewhat work around it by having the Xenon send a “wake up” message at least once a minute, but that’s pretty clunky.

I was experiencing this issue with the 0.8.0-rc software, and was hoping that it would work better with 0.9.0. But even after putting 0.9.0 on both the Boron and the Xenon, it’s still behaving the same way.

That's not surprising due to the connectionless nature of UDP on which the mesh devices work.

The usual first suspect is the user code, hence we need to see that to either confirm or refute that working theory.

Here is my loop - pretty much vanilla

void loop()
{
    if (blink) {
        // To blink the LED, first we'll turn it on...
        digitalWrite(led1, HIGH^HL1);
        digitalWrite(led2, HIGH);

        delay(blinkRate);

        // Then we'll turn it off...
        digitalWrite(led1, LOW^HL1);
        digitalWrite(led2, LOW);

        delay(blinkRate);
    }

    if (Particle.connected()) {
        last_time_particle_connected = millis();     // we are connected, so reset the timer
    } else if ((millis() - last_time_particle_connected) > particle_connected_timeout) {
        // Try to report a log event - probably won't work
        Log.error("Particle not connected for 5 min - attempt reset");

        // we have been disconnected for too long, so let's reset everything!
#if Wiring_WiFi
        WiFi.off();
        delay(1000);
#endif
        System.reset();
    }

    // Periodic "wake up" message so the Xenon stays reachable from the cloud
    if (millis() - last_keepalive > particle_keepalive_timeout) {
        last_keepalive = millis();
        Log.info("Xenon keepalive");
    }

    // And repeat!
}

Is your Boron on a 3rd party SIM or Particle's original?

What code is running on the Boron?
Can you try SYSTEM_THREAD(ENABLED) on both devices?

BTW, what's the meaning of digitalWrite(led1, HIGH^HL1)?

I put SYSTEM_THREAD(ENABLED) on the Boron. It now does a fast green flash for about 5 seconds, then a fast cyan flash for about 1 second, then breathes cyan once, then shows a red SOS and repeats the process.
Fortunately, I was able to recover with safe mode.

I’m not quite sure what SYSTEM_THREAD(ENABLED) does. The docs at https://docs.particle.io/reference/device-os/firmware/photon/#system-threading-behavior are a little cryptic. Specifically, I’m not clear what changes might be needed in my setup() or loop() code to handle it. I did not make any changes to the Xenon and it seems to work OK.

  • My Boron is using a third party SIM
  • The Boron is also running very simple code. It has some function calls to read a DS18B20 temperature sensor or flash LEDs, but the problem occurs whether or not I invoke any of that code, so the loop is effectively doing nothing at all.
  • I put SYSTEM_THREAD(ENABLED) on the Xenon. I’ll let it run for a while and see if it makes a difference. When I added that line to the Boron, it broke it and would not boot.
  • The digitalWrite(led1, HIGH^HL1); is so that I can use the same code on devices where a HIGH turns on the system LED and on devices (namely the Raspberry Pi) where a HIGH turns it off. I have a run-time check that determines if I’m running on the Raspberry Pi and sets HL1=1 in that case.

SYSTEM_THREAD(ENABLED) on (only) the Xenon does not seem to help. My Xenon published a message at 13:25, but could not be pinged at 13:32. At 13:35 it published another message (I have it doing an un-solicited message every 10 minutes) and could be pinged again.

My code is all pretty simple test code, and I don’t mind sharing it, but it’s a lot to put into the forum. Is there some other way to share it directly with you?

Have you set Particle.keepAlive() then?

Yes, I’m using Particle.keepAlive(200) on the Boron. If I use a larger value, such as 300, it will run for a while (minutes to hours) but sometimes disconnect. With 200, it will run for many days with no disconnects. The Boron seems just fine: during all that time it never becomes unpingable and never has any trouble with function calls. But all the time the Boron is working so well, the Xenon is losing connectivity every few minutes. Again, it is cloud-to-Xenon connectivity that is failing (pings and function calls). Xenon-to-cloud publishing continues without problems and Particle.connected() always returns true.

This is interesting. I added Particle.keepAlive(200) to the Xenon code. It’s only been running for a short time, but it looks like it’s working much better. I wouldn’t have thought it would be necessary, and in fact I think that Particle.keepAlive() is not even documented for the Xenon.
So, that seems quite helpful. The only downside, I assume, is that with both the Boron and the Xenon sending keepAlive messages, it doubles the data usage.

That is indeed an interesting find.
Maybe something for @mstanley or @rickkas7 to look into.

This is certainly an interesting find indeed. I’ll be sure to reach out to our engineering team and relay the info. Thanks everyone.

Can you try this mesh test code and see how it works?

I have the Boron & 8 Xenons up and running for weeks now without any issues with the code above.

I suspect the reason is this:

When in UDP mode (Electron, E Series, and all Gen 3 Argon, Boron, Xenon), the cloud connection relies on a temporary port forwarded UDP back-channel to the device. This times out after a period of inactivity.

The cellular keep-alive basically sends a CoAP packet to the cloud, which goes out as a DTLS-encrypted UDP packet. This is necessary to make sure the mobile carrier does not remove the port forwarding. The Particle SIM has an unusually long keep-alive of 23 minutes. Some carriers, like AT&T and Google Fi, require a keep-alive of around 30 seconds.

What’s not documented is that this can also occur for mesh devices. This probably should be solved at the Device OS level, but I suspect it’s not.

When a Xenon connects to the cloud, it makes a UDP DTLS connection from the device to the cloud. It gets its own port-forwarded back-channel on the gateway for its cloud connection; it does not piggyback on the gateway’s cloud connection.

It appears that if you have a gateway with a short keep-alive (Boron with 3rd-party SIM), you may also need to set the same short keep-alive on Xenon nodes. The reason is that each of them has their own UDP port-forwarded back-channel that needs to be independently kept alive.
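
A minimal sketch of that fix for a Xenon, assuming the 200-second value that worked for the Boron earlier in this thread (a different carrier may need a different value). KEEP_ALIVE_SECONDS is a name made up for illustration, and the CloudStub branch is not part of Device OS; it is only a stand-in so the snippet can be compiled on a desktop.

```cpp
#if defined(PLATFORM_ID)
#include "Particle.h"
#else
// Off-device stand-in (NOT part of Device OS): records the requested
// interval so the sketch can be compiled and checked on a desktop.
struct CloudStub {
    unsigned keepAliveSeconds = 0;
    void keepAlive(unsigned seconds) { keepAliveSeconds = seconds; }
};
static CloudStub Particle;
#endif

// Hypothetical name. 200 s is the value reported to work in this thread;
// the right value depends on how quickly the carrier drops idle UDP mappings.
const unsigned KEEP_ALIVE_SECONDS = 200;

void setup() {
    // Each mesh node has its own UDP back-channel through the gateway,
    // so each node sets its own keep-alive, not just the Boron.
    Particle.keepAlive(KEEP_ALIVE_SECONDS);
}

void loop() {
    // Application code goes here.
}
```

The same call would go into the Boron's firmware. If the problem comes back after a reconnect, calling Particle.keepAlive() again once Particle.connected() returns true may be worth trying, since some Device OS versions may not retain the setting across cloud sessions.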

So even if this should happen to be remedied in Device OS some time, that’s something that should be documented in the “Xenon Cloud Functions” section, which currently doesn’t even feature Particle.keepAlive() as a command.

Yes, that is correct.

It also is possible that this could be required on the Argon. It also relies on a temporary UDP port-forwarded back-channel and if your Wi-Fi network's router/firewall removes idle back-channels unusually quickly, then Argons could also need a shorter keep-alive.

I can see a can of worms :flushed:

That really does sound like it explains my issues. Thanks, @rickkas7.

It really needs to be documented, however. If I understand correctly, this means that every Xenon in the mesh network has to have the shorter keepAlive time, and this might have a significant impact on data usage if you have a lot of devices.
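
A rough way to estimate that impact, as a sketch only. It assumes about 122 bytes per keep-alive round trip, which is the figure I recall from the Particle documentation; treat it as an assumption and check it against your own SIM's usage data. The function and constant names are made up for illustration.

```cpp
#include <cstdint>

// Assumption: roughly 122 bytes per keep-alive ping including its
// acknowledgement. Verify against real usage data before relying on it.
const std::uint64_t BYTES_PER_KEEP_ALIVE = 122;
const std::uint64_t SECONDS_PER_30_DAYS = 30ULL * 24 * 60 * 60;

// Approximate keep-alive traffic over 30 days for `devices` devices
// (gateway plus mesh nodes) that each ping every `intervalSeconds`.
std::uint64_t keepAliveBytesPer30Days(std::uint64_t intervalSeconds,
                                      std::uint64_t devices) {
    std::uint64_t pingsPerDevice = SECONDS_PER_30_DAYS / intervalSeconds;
    return pingsPerDevice * BYTES_PER_KEEP_ALIVE * devices;
}
```

Under these assumptions, keepAliveBytesPer30Days(200, 1) comes to 1,581,120 bytes (about 1.6 MB) for a single device, and a Boron plus 8 Xenons (9 devices) comes to 14,230,080 bytes (about 14 MB) per 30 days, all of it passing through the gateway's SIM.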

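Once you know about the issue, you may be able to circumvent it by sending cloud requests only to the gateway, which then passes them on to the Xenons over the local mesh and relays their responses back to the cloud. That way only the gateway's back-channel has to be kept alive.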