Many unresponsive Xenons with Ethernet

@shanevanj Did you find any improvement with OS 1.2.1-rc.3? A few days ago I upgraded a mesh gateway that uses the Ethernet FeatherWing. It still appears to be dropping its connection 2-3 times a day, after which it flashes green until it is reset or power-cycled.

I am happier with it than with rc.2, but then I use no mesh functions. From reading elsewhere in the forums, there still seems to be a lot of discussion around stability and mesh functions. Do you have a watchdog defined? And SYSTEM_THREAD(ENABLED)? I got better results with both (my WD is set for 60s), and so far (5 days of continuous testing with a Serial message every minute along with a Particle.publish()) it has been super stable. A big contributor, in my view, is using the PublishQueueAsyncRK library.
I firmly believe that, due to the HAL abstraction and the RTOS internals, many things happen asynchronously, and there is no real way for you as the higher-level app developer to confidently know the state of the device (as you would in a single-threaded, non-RTOS, monolithic piece of code). Using this async publish library lets DeviceOS deal with requests in its own time. Granted, the "fire and forget" nature of PublishQueueAsyncRK works in my use case (YMMV). I don't know if this approach is relevant for Mesh.publish, though. However, I suspect that if DeviceOS has communications resources locked up due to multiple threads conflicting, there may be a case for trying PublishQueueAsyncRK. Perhaps the local Mesh.publish will dovetail better if some resource contention is being managed?

I can see from the device diagnostics report that there are daily disconnections - however the key thing is that the device is not locking up anymore and gracefully recovers each time - so the underlying comms may still be rocky in places - but functional uptime (in my use case) is not affected.

@shanevanj is the "watchdog" you refer to the function void handle_all_the_events(system_event_t event, int param)? Also, I'm not familiar with the PublishQueueAsyncRK library. Do you have a code example with the additions that helped your disconnects? I would like to implement them on my devices to see if it helps.

It's this function from here, used in my code (before setup()) like this:

const int wdResetTime = 60000;
ApplicationWatchdog wd(wdResetTime, System.reset);
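For readers new to the idea, here is a rough host-side sketch of what a timeout watchdog like the one declared above does conceptually. It is plain C++, not Device OS code: the class name SketchWatchdog and its methods are made up for illustration, and the real ApplicationWatchdog runs its check on its own thread and calls the supplied function (System.reset here) instead of returning a flag.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration only; this is NOT the Device OS ApplicationWatchdog.
// It models one idea: if nobody checks in within timeoutMs, the watchdog trips.
class SketchWatchdog {
public:
    explicit SketchWatchdog(std::uint32_t timeoutMs) : timeoutMs_(timeoutMs) {
        assert(timeoutMs > 0);
    }

    // The application calls this whenever it is demonstrably still making progress.
    void checkin(std::uint32_t nowMs) { lastCheckinMs_ = nowMs; }

    // True once timeoutMs has passed since the last check-in.
    // Unsigned subtraction keeps the comparison valid across millis() rollover.
    bool expired(std::uint32_t nowMs) const {
        return nowMs - lastCheckinMs_ >= timeoutMs_;
    }

private:
    std::uint32_t timeoutMs_;
    std::uint32_t lastCheckinMs_ = 0;
};
```

With a 60000 ms timeout, a check-in at t = 1000 ms keeps the watchdog quiet until t = 61000 ms. On a real device the "expired" branch is where the reset would happen, which is why a blocked loop() ends in a restart rather than a permanent hang.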

I have used the library in its simplest form

retained uint8_t publishQueueRetainedBuffer[2048];
PublishQueueAsync publishQueue(publishQueueRetainedBuffer, sizeof(publishQueueRetainedBuffer));

and then substituted all Particle.publish() calls with publishQueue.publish("Name", "data", PRIVATE, WITH_ACK).

This lets the publish events go out without exceeding the publish rate limit, and queued events (up to the maximum retained memory allocated) survive restarts and continue to de-queue after a successful reconnection.
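To make the queueing behaviour described above concrete, here is a minimal host-side sketch in plain C++. It is not the PublishQueueAsyncRK implementation: SketchPublishQueue, QueuedEvent, and the service() method are hypothetical names, the one-send-per-interval rule is an assumed stand-in for the cloud rate limit, and a std::deque is used where the library uses a retained buffer, so this version would not survive a restart.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>

// Hypothetical event record; not a type from PublishQueueAsyncRK.
struct QueuedEvent {
    std::string name;
    std::string data;
};

// Illustrative bounded FIFO: publish() never blocks, and service() drains
// at most one event per interval, and only while the connection is up.
class SketchPublishQueue {
public:
    SketchPublishQueue(std::size_t maxEvents, std::uint32_t minIntervalMs)
        : maxEvents_(maxEvents), minIntervalMs_(minIntervalMs) {
        assert(maxEvents > 0);
    }

    // "Fire and forget": returns immediately; false means the queue is full.
    bool publish(std::string name, std::string data) {
        if (queue_.size() >= maxEvents_) return false;
        queue_.push_back({std::move(name), std::move(data)});
        return true;
    }

    // Call regularly. Returns true if one event was handed to send().
    // A failed send leaves the event at the front for a later retry.
    template <typename SendFn>
    bool service(std::uint32_t nowMs, bool connected, SendFn send) {
        if (!connected || queue_.empty()) return false;
        if (hasSent_ && nowMs - lastSendMs_ < minIntervalMs_) return false;
        if (!send(queue_.front())) return false;
        queue_.pop_front();
        lastSendMs_ = nowMs;
        hasSent_ = true;
        return true;
    }

    std::size_t pending() const { return queue_.size(); }

private:
    std::size_t maxEvents_;
    std::uint32_t minIntervalMs_;
    std::deque<QueuedEvent> queue_;
    std::uint32_t lastSendMs_ = 0;
    bool hasSent_ = false;
};
```

The point of the design is that the application only ever touches publish(), which returns straight away whether or not the cloud connection is up. Everything that can stall (connection state, pacing, retries) is confined to service().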

I am publishing to the cloud and not just to the mesh. All my Particle.publish() calls are wrapped in if (Particle.connected()) {}. I am aware of the PublishQueueAsyncRK library but didn't think it would be required with a wired LAN connection. I have noticed that the device struggles to reconnect after it loses its cloud connection. Isn't it curious that the async publish mechanism helps with this?

My understanding ( @ScruffR jump in here anytime :slight_smile: ) is that when the connection is lost and DeviceOS is reconnecting, it doesn't reply promptly to app requests like Particle.connected() and thus builds up a 'service' queue. Some of the events block on access to shared resources, and this causes the issue. I could replicate this on demand before using the PublishQueueAsyncRK library. Now those disconnects still occur, however it's more like a duck on a pond: serene on the top but paddling like mad underwater :slight_smile: and nothing locks up (YMMV).

In essence I agree :wink:

Even with LAN you will be looking at some latency in the device recognising the lost connection and reestablishing it, so having one place that elaborately deals with such situations will improve convenience at least.

@ScruffR @shanevanj Thanks to you both for the explanation. I'd better go add that PublishQueueAsyncRK library to my gateway!

Hey everyone!

I wanted to reach out and thank all of you who contributed to this conversation. :slight_smile: We've been monitoring this conversation as well as a few other voiced concerns regarding Xenon + Ethernet connectivity.

Some improvements have been put into our upcoming device OS releases that should mitigate some of the bad behavior. However, it seems like some are still experiencing issues.

For anyone still experiencing issues, I encourage you to share your experiences here. Any details about your setup and network configuration would be great to have.

Good to know that this is something you are keeping an eye on!

@mstanley thanks for informing us, please keep us in the loop. Also, please check out my other post about a different kind of disconnect, "How to Diagnose Disconnects with Xenon on Ethernet".

@shanevanj I have implemented both the watchdog and publishQueue. I hope the watchdog helps, although I would love to know why it is dropping in the first place. I don't think the publishQueue will help me much since I have devices going offline that don't publish anything or have any code in the loop.

What are the devices doing that have nothing in the loop and no Particle.publish? Just Particle.function and/or Particle.variable? Otherwise, what is the point of them being internet-connected via Ethernet?

@armor To get logging information from one of the known networks that are causing problems, I deployed a Xenon with just SerialLogHandler traceLog(LOG_LEVEL_ALL); and an empty setup() and loop().

Understand now. Thanks

Hi Matthew,
I filed a bug here with my observations:
https://github.com/particle-iot/device-os/issues/1840

Thanks!
Gustavo.