I’m having a situation where I’m constantly losing the connection to my Xenon. Both the Boron Gateway and the Xenon are breathing cyan and neither is reporting disconnect events. I can ping the Boron and invoke function calls, and it behaves as expected for hours or days.
The Xenon seems to think it’s connected (I have code in the Xenon to check Particle.connected(), and it never detects a problem). Messages published by the Xenon come through at any time. In fact, whenever the Xenon publishes a message, I can ping it and invoke function calls for a short time. But after two or three minutes, it becomes unpingable. If/when the Xenon publishes another message, I can ping it again.
I can somewhat work around it by having the Xenon send a “wake up” message at least once a minute, but that’s pretty clunky.
I was experiencing this issue with 0.8.0-rc software, and was hoping that it would work better with 0.9.0. But even after putting 0.9.0 on both the Boron and the Xenon, it’s still behaving the same way.
void loop()
{
    if ( blink ) {
        // To blink the LED, first we'll turn it on...
        digitalWrite(led1, HIGH^HL1);
        digitalWrite(led2, HIGH);
        delay(blinkRate);
        // Then we'll turn it off...
        digitalWrite(led1, LOW^HL1);
        digitalWrite(led2, LOW);
        delay(blinkRate);
    }
    if (Particle.connected()) {
        last_time_particle_connected = millis(); // we are connected, so reset the timer
    } else if ( (millis() - last_time_particle_connected) > particle_connected_timeout ) {
        // Try to report a log event - probably won't work
        Log.error("Particle not connected for 5 min - attempt reset");
        // we have been disconnected for too long, so let's reset everything!
#if Wiring_WiFi
        WiFi.off();
        delay(1000);
#endif
        System.reset();
    }
    if ( millis() - last_keepalive > particle_keepalive_timeout ) {
        last_keepalive = millis();
        Log.info("Xenon keepalive");
    }
    // And repeat!
}
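As a side note on the timers in that loop: the `millis() - last > timeout` pattern stays correct when millis() rolls over, as long as the values are unsigned 32-bit. Here is a minimal host-side sketch of that check. The helper name `hasElapsed` is made up for illustration and is not part of Device OS.

```cpp
#include <cstdint>

// Hypothetical helper (not a Device OS API): returns true once more than
// timeoutMs milliseconds have passed since lastMs.
// Unsigned subtraction wraps modulo 2^32, so the difference is still the
// true elapsed time even if the millisecond counter rolled over in between.
bool hasElapsed(uint32_t nowMs, uint32_t lastMs, uint32_t timeoutMs) {
    uint32_t elapsed = static_cast<uint32_t>(nowMs - lastMs);
    return elapsed > timeoutMs;
}
```

So `hasElapsed(millis(), last_time_particle_connected, particle_connected_timeout)` would behave the same as the inline comparison in the loop above, including across a rollover.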
With SYSTEM_THREAD(ENABLED), the Boron does a fast green flash for about 5 seconds, then a fast cyan flash for about 1 second, then breathes cyan once, then shows a red SOS and repeats the process.
Fortunately, I was able to recover with safe mode.
The Boron is also running very simple code. It has some function calls to read a DS18B20 temperature sensor or flash LEDs, but the problem occurs whether or not I invoke any of that code, so the loop is effectively doing nothing at all.
I put SYSTEM_THREAD(ENABLED) on the Xenon. I’ll let it run for a while and see if it makes a difference. When I added that line to the Boron, it broke it and would not boot.
The digitalWrite(led1, HIGH^HL1); is so that I can use the same code on devices where a HIGH turns on the system LED and devices (namely the Raspberry Pi) where a HIGH turns it off. I have a run-time check that determines if I’m running on the Raspberry Pi and sets HL1=1 in that case.
SYSTEM_THREAD(ENABLED) on (only) the Xenon does not seem to help. My Xenon published a message at 13:25, but could not be pinged at 13:32. At 13:35 it published another message (I have it sending an unsolicited message every 10 minutes) and could be pinged again.
My code is all pretty simple test code, and I don’t mind sharing it, but it’s a lot to put into the forum. Is there some other way to share it directly with you?
Yes, I’m using Particle.keepAlive(200) on the Boron. If I use a larger value, such as 300, it will run for a while (minutes to hours) but sometimes disconnect. With 200, it will run for many days with no disconnects. The Boron seems just fine: during all that time it never becomes unpingable and never has any trouble with function calls. But while the Boron is working so well, the Xenon is losing connectivity every few minutes. Again, it is cloud-to-Xenon connectivity that is failing (pings and function calls). Xenon-to-cloud publishing continues without problems and Particle.connected() always returns true.
This is interesting. I added Particle.keepAlive(200) to the Xenon code. It’s only been running for a short time, but it looks like it’s working much better. I wouldn’t have thought it would be necessary, and in fact I think that Particle.keepAlive() is not even documented for the Xenon.
So, that seems quite helpful. The only downside, I assume, is that with both the Boron and the Xenon sending keepAlive messages, the keep-alive data usage doubles.
When in UDP mode (Electron, E Series, and all Gen 3 devices: Argon, Boron, Xenon), the cloud connection relies on a temporary port-forwarded UDP back-channel to the device. This times out after a period of inactivity.
The cellular keep-alive basically sends a CoAP packet to the cloud, which goes out as a DTLS UDP packet. This is necessary to make sure the mobile carrier does not remove this port forwarding. The Particle SIM has an unusually long keep-alive of 23 minutes. Some carriers like AT&T and Google Fi require a keep-alive of around 30 seconds.
What’s not documented is that this can also occur for mesh devices. This probably should be solved at the Device OS level, but I suspect it’s not.
When a Xenon connects to the cloud, it makes a UDP DTLS connection from the device to the cloud. It gets its own port-forwarded back-channel on the gateway for its cloud connection; it does not piggyback on the gateway’s cloud connection.
It appears that if you have a gateway with a short keep-alive (Boron with 3rd-party SIM), you may also need to set the same short keep-alive on Xenon nodes. The reason is that each of them has their own UDP port-forwarded back-channel that needs to be independently kept alive.
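To make the back-channel behaviour concrete, here is a rough host-side model, not Device OS code. It assumes the carrier or router drops the port forwarding a fixed time after the device's last outbound packet. The function names and the 240-second timeout used in the note below are assumptions for illustration only; the real timeout on any given network is unknown.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model, not Device OS code: the back-channel stays open for
// natTimeoutSec after the device's last outbound packet.
// Returns true if a cloud-to-device request at querySec would get through.
bool reachableAt(uint32_t querySec,
                 const std::vector<uint32_t>& outboundSec,
                 uint32_t natTimeoutSec) {
    for (uint32_t sent : outboundSec) {
        if (sent <= querySec && querySec - sent <= natTimeoutSec) {
            return true;
        }
    }
    return false;
}

// Outbound packet times for a device that sends every periodSec,
// starting at t = 0 and running up to and including durationSec.
std::vector<uint32_t> periodicSends(uint32_t periodSec, uint32_t durationSec) {
    std::vector<uint32_t> times;
    if (periodSec == 0) {
        return times;  // avoid an endless loop on a zero period
    }
    for (uint32_t t = 0; t <= durationSec; t += periodSec) {
        times.push_back(t);
    }
    return times;
}
```

With an assumed 240-second timeout, a Xenon that only publishes every 10 minutes (`periodicSends(600, 3600)`) is unreachable 7 minutes after a publish but reachable again right after the next one, which matches the symptom described earlier in the thread. A Xenon sending something every 200 seconds (`periodicSends(200, 3600)`), which is what Particle.keepAlive(200) arranges, never leaves a gap longer than the timeout.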
So even if this should happen to be remedied in Device OS some time, that’s something that should be documented in the “Xenon Cloud Functions” section, which currently doesn’t even feature Particle.keepAlive() as a command.
Yes, that is correct.
It also is possible that this could be required on the Argon. It also relies on a temporary UDP port-forwarded back-channel and if your Wi-Fi network's router/firewall removes idle back-channels unusually quickly, then Argons could also need a shorter keep-alive.
That really does sound like it explains my issues. Thanks. @rickkas7
It really needs to be documented, however. If I understand correctly, this means that every Xenon in the mesh network has to have the shorter keepAlive time, and this might have a significant impact on data usage if you have a lot of devices.
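For a rough feel of the data-usage impact, here is a small back-of-the-envelope calculator. The bytes-per-keep-alive value is a parameter because the thread does not state it; the 122 used in the note below is only an assumed placeholder, and `keepAliveBytesPerDay` is a made-up name. Measure your own traffic before relying on the numbers.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kSecondsPerDay = 86400;

// Hypothetical estimate: total keep-alive bytes per day for deviceCount
// devices that each send one keep-alive every keepAliveSec seconds.
// bytesPerKeepAlive is an assumption supplied by the caller.
uint64_t keepAliveBytesPerDay(uint32_t deviceCount,
                              uint32_t keepAliveSec,
                              uint32_t bytesPerKeepAlive) {
    assert(keepAliveSec > 0);
    // Whole keep-alives that fit into one day for a single device.
    uint64_t pingsPerDevice = kSecondsPerDay / keepAliveSec;
    return pingsPerDevice * deviceCount * bytesPerKeepAlive;
}
```

With a 200-second keep-alive each device sends 432 keep-alives a day. At the assumed 122 bytes each, `keepAliveBytesPerDay(1, 200, 122)` gives 52704 bytes for the Boron alone and `keepAliveBytesPerDay(2, 200, 122)` gives 105408 bytes for the Boron plus one Xenon, so the total scales linearly with the number of mesh devices.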
Once you know about the issue you may be able to circumvent it by explicitly going via the gateway which then has to distribute the requests locally and vice versa.