Many unresponsive Xenons with Ethernet

armor · June 23, 2019, 1:52pm

@shanevanj Did you find any improvement with OS 1.2.1-rc.3? A few days back I upgraded a mesh gateway that uses the Ethernet FeatherWing. It still appears to drop its connection 2-3 times a day and will then flash green until it is reset or re-powered.

shanevanj · June 24, 2019, 7:32am

I am happier than with rc.2, but then I use no mesh functions - from reading elsewhere in the forums, there still seems to be a lot of discussion around stability and mesh functions. Do you have a watchdog defined? And SYSTEM_THREAD(ENABLED)? I got better results with both (my WD is set for 60s) and so far (5 days of continuous testing with a Serial message every minute with a Particle.publish()) it has been super stable. A big contributor in my view is using the PublishQueueAsyncRK library.
I firmly believe that due to the HAL abstraction and the RTOS internals, many things are happening asynchronously and there is no real way for you, as the higher-level app developer, to confidently know the state of the device (as you would in a single-threaded, non-RTOS, monolithic piece of code). Using this async publish library lets DeviceOS deal with requests in its own time. Granted, the “fire and forget” nature of PublishQueueAsyncRK works in my use case (YMMV). I don't know if this approach is relevant for Mesh.publish though. However, I suspect that if DeviceOS has communications resources locked up due to multiple threads conflicting, there may be a case for trying PublishQueueAsyncRK - perhaps the local Mesh.publish will dovetail better if some resource contention is being managed?

I can see from the device diagnostics report that there are daily disconnections - however the key thing is that the device is not locking up anymore and gracefully recovers each time - so the underlying comms may still be rocky in places - but functional uptime (in my use case) is not affected.

amillen · June 24, 2019, 2:33pm

@shanevanj is the "watchdog" you refer to the function void handle_all_the_events(system_event_t event, int param)? Also, I’m not familiar with the PublishQueueAsyncRK library. Do you have a code example with the additions that helped with your disconnects? I would like to implement them on my devices to see if it helps.

shanevanj · June 25, 2019, 5:23am

It's this function from here - used in my code (before setup()) like this:

// Reset the device if the application has not checked in for 60 s.
const int wdResetTime = 60000; // milliseconds
ApplicationWatchdog wd(wdResetTime, System.reset);
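
For anyone unsure what a watchdog like this does: the application has to check in regularly, and a callback (System.reset in the snippet above) fires if it stays silent for longer than the timeout. Below is a rough host-side C++ model of that idea only. It is not Device OS code, SoftWatchdog and poll are names made up for illustration, and the real ApplicationWatchdog does its supervision internally rather than through a poll call.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical host-side model of a software watchdog; not Device OS code.
class SoftWatchdog {
public:
    SoftWatchdog(std::uint32_t timeoutMs, std::function<void()> onExpire)
        : timeoutMs_(timeoutMs), onExpire_(std::move(onExpire)) {
        assert(timeoutMs > 0);
    }

    // The application calls this to prove it is still making progress.
    void checkin(std::uint32_t nowMs) { lastCheckinMs_ = nowMs; }

    // A supervisor calls this periodically. The callback fires once,
    // when no check-in has been seen for at least timeoutMs.
    void poll(std::uint32_t nowMs) {
        if (!expired_ && nowMs - lastCheckinMs_ >= timeoutMs_) {
            expired_ = true;
            onExpire_();
        }
    }

    bool expired() const { return expired_; }

private:
    std::uint32_t timeoutMs_;
    std::function<void()> onExpire_;
    std::uint32_t lastCheckinMs_ = 0;
    bool expired_ = false;
};
```

With a 60000 ms timeout, a check-in at 50 s followed by silence makes the callback fire on the first poll at or after 110 s. On a device the callback would be a reset, which is why a hung loop recovers instead of flashing green forever.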

I have used the library in its simplest form

// Queue storage lives in retained memory so queued events survive a restart.
retained uint8_t publishQueueRetainedBuffer[2048];
PublishQueueAsync publishQueue(publishQueueRetainedBuffer, sizeof(publishQueueRetainedBuffer));

and then substituted all Particle.publish() calls with publishQueue.publish("Name", "data", PRIVATE, WITH_ACK).

This lets the publish events go out without exceeding the publish rate limit. Queued events (up to the retained memory allocated) survive restarts and continue to de-queue after a successful reconnection.
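
To make the "fire and forget" behaviour described above more concrete, here is a minimal host-side C++ sketch of a bounded publish queue. It is only a model of the idea and not the PublishQueueAsyncRK implementation: SimplePublishQueue, Event, service and the reject-when-full policy are all assumptions made for illustration, and the sender callback stands in for the real cloud publish.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical model of a fire-and-forget publish queue; not the library's code.
struct Event {
    std::string name;
    std::string data;
};

class SimplePublishQueue {
public:
    // The sender returns true when the event was delivered.
    using Sender = std::function<bool(const Event&)>;

    SimplePublishQueue(std::size_t capacity, std::uint32_t minIntervalMs)
        : capacity_(capacity), minIntervalMs_(minIntervalMs) {
        assert(capacity > 0);
    }

    // Returns immediately and never touches the network.
    // Assumed policy: reject new events when the queue is full.
    bool publish(std::string name, std::string data) {
        if (queue_.size() >= capacity_) return false;
        queue_.push_back({std::move(name), std::move(data)});
        return true;
    }

    // Called from a single place. Attempts at most one send per
    // minIntervalMs, and only while the connection is up.
    void service(std::uint32_t nowMs, bool connected, const Sender& send) {
        if (!connected || queue_.empty()) return;
        if (hasSent_ && nowMs - lastSendMs_ < minIntervalMs_) return;
        lastSendMs_ = nowMs;
        hasSent_ = true;
        // A failed send leaves the event queued for a later retry.
        if (send(queue_.front())) queue_.pop_front();
    }

    std::size_t pending() const { return queue_.size(); }

private:
    std::size_t capacity_;
    std::uint32_t minIntervalMs_;
    std::deque<Event> queue_;
    std::uint32_t lastSendMs_ = 0;
    bool hasSent_ = false;
};
```

The point of the design is that application code only ever calls publish(), which cannot block, while the connection check, the rate limiting and the retry all sit in service(). This model does not cover surviving a restart; on the device that part comes from the retained buffer.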

armor · June 25, 2019, 7:48pm

I am publishing to the cloud and not just to the mesh. All my Particle.publish()'s are wrapped in if (Particle.connected()) {}. I am aware of PublishQueueAsyncRK library but didn’t think it would be required with a wired LAN connection. I have noticed that the device is struggling to reconnect after it loses its cloud connection. It is curious that the async publish mechanism helps this?

shanevanj · June 26, 2019, 7:50am

My understanding (@ScruffR jump in here anytime) is that when the connection is lost and DeviceOS is reconnecting, it doesn't reply promptly to app requests like Particle.connected() and thus builds up a 'service' queue. Some of those events block on access to shared resources, and this causes the issue. I could replicate this on demand before using the PublishQueueAsyncRK library. Now those disconnects still occur, however it's more like a duck on a pond - serene on the top but paddling like mad underwater, and nothing locks up (YMMV).

ScruffR · June 26, 2019, 7:56am

In essence I agree

Even with LAN you will be looking at some latency in the device recognising the lost connection and reestablishing it, so having one place that elaborately deals with such situations will improve convenience at least.

armor · June 26, 2019, 10:43pm

@ScruffR @shanevanj Thanks both for the explanation - better go add that PublishQueueAsyncRK library to my gateway !

mstanley · June 27, 2019, 5:12am

Hey everyone!

I wanted to reach out and thank all of you who contributed to this conversation. We’ve been monitoring it, as well as a few other voiced concerns regarding Xenon + Ethernet connectivity.

Some improvements have been put into our upcoming device OS releases that should mitigate some of the bad behavior. However, it seems like some are still experiencing issues.

For anyone still experiencing issues, I encourage you to share your experiences here. Any identifying information about your setup and network configuration would be great info to have.

armor · June 27, 2019, 9:18am

Good to know that this is something you are keeping an eye on!

amillen · June 27, 2019, 2:57pm

@mstanley thanks for informing us, please keep us in the loop. Also, please check out my other post about a different kind of disconnects, How to Diagnose Disconnects with Xenon on Ethernet

@shanevanj I have implemented both the watchdog and publishQueue. I hope the watchdog helps, although I would love to know why it is dropping in the first place. I don’t think the publishQueue will help me much since I have devices going offline that don’t publish anything or have any code in the loop.

armor · June 28, 2019, 10:31pm

What are the devices doing that have nothing in the loop and no Particle.publish - just a Particle function and/or variable? Otherwise, what is the point in them being internet connected via ethernet?

amillen · July 1, 2019, 6:19pm

@armor to get logging information from one of the known networks that are causing problems, I deployed a Xenon with just SerialLogHandler traceLog(LOG_LEVEL_ALL); and an empty setup and loop.

armor · July 1, 2019, 8:29pm

Understand now. Thanks

gusgonnet · July 2, 2019, 2:47am

Hi Matthew,
I filed a bug here with my observations:
https://github.com/particle-iot/device-os/issues/1840

Thanks!
Gustavo.
