Partial disconnect of mesh devices

Awesome, thanks @mstanley, that’s exactly what I was hoping to hear.

Like others here, I have experienced Xenons disconnecting from an Argon. This is certainly a bug, given that my units are close to each other.

But I experienced another defect that may be new to the group. I had an Argon become unreachable, while a Xenon that was using it as a gateway remained reachable. The Argon showed on the Particle App as “last heard” 2 days ago, while the Xenon showed 15 seconds. None of the Particle.variable data of the Argon was available, but the Particle.variable data of the Xenon was available. I reset the Argon but that did not fix it; I power cycled the Argon and the problem went away.

If you follow back up the thread you may see that I reported something similar… I found that some of my Xenon nodes which routed via an Argon to the Internet were “off line”.

I was unable to address them from the console / flash signal LED etc. While this was the case the “mesh” code continued to run, including sending traffic via the Argon, while several Xenons were declared “offline”. Ultimately a power cycle (battery and mains) at the gateway Argon allowed us to reach the Xenon modules again.

I don’t know what may be happening behind the scenes at Particle (saw link to private forum also…) - but subjectively it seems to me the recurrence of this issue is less in the last 7-8 days than prior, and all on 0.90 firmware with Particle sample code.

I hope to learn more here, share what I know - thanks to all for teamwork / support.

Best, _Todd

Things looking better now.

I don’t rely on the Particle console as a means of determining whether devices are visible - I don’t ever look at it anymore. I watch incoming events from the mesh network via a webhook integration and actively call functions on each device to log visibility, i.e. I generate a function call to each device every 30 seconds, log the responses in the db, and keep track of how that is looking. Every 15 minutes a script runs on my db server and sends me something over Slack if things start getting wonky. If visibility drops below ~95%, I’ve found that bouncing my gateway device gets things back in order pretty quickly - that gets pushed out to Slack also.
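For anyone wanting to reproduce the visibility check, here is a minimal sketch of the threshold logic in C++. It is not the script running on my db server; the function names `visibility` and `shouldBounceGateway` are made up for illustration, and it assumes each probe is stored simply as pass/fail.

```cpp
#include <cstddef>
#include <vector>

// Fraction of probes that succeeded, in the range [0, 1].
// An empty window returns 1.0 so that "no data yet" does not raise an alert
// (an assumption of this sketch, not something the thread specifies).
double visibility(const std::vector<bool>& probes) {
    if (probes.empty()) {
        return 1.0;
    }
    std::size_t ok = 0;
    for (bool answered : probes) {
        if (answered) {
            ++ok;
        }
    }
    return static_cast<double>(ok) / static_cast<double>(probes.size());
}

// True when visibility falls below the threshold.
// The default of 0.95 mirrors the ~95% figure mentioned above.
bool shouldBounceGateway(const std::vector<bool>& probes, double threshold = 0.95) {
    return visibility(probes) < threshold;
}
```

With one probe every 30 seconds, a 15-minute window holds 30 probes per device, so two missed responses (28/30, about 93%) would already trip the 95% threshold.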

Three of the devices gather temperature data every 60 seconds and publish it 1) to the mesh, then 2) to the cloud, where my database webhook is listening. The fourth device is a controller that subscribes to the mesh topics and controls a set of air handlers. Each time the controller receives temperature data from the mesh, it also publishes an event to the cloud telling me what it thinks and what it is doing about it. That is three sensor events plus three controller events per minute, so roughly speaking the webhook expects a call every 10 seconds; it picks up the data and stuffs it away in a transaction log.

Since finding the solution of bouncing my gateway Argon - I don’t have any significant gaps in the db log history running around the clock in the third week. I plan to deploy a mesh proof-of-concept shortly. It will be remote - at this point I’m pretty confident this won’t be a goat.

If anyone wants to watch my Slack channel - send me an email and you’ll get an invitation.

Charlie

p.s. I’m learning not to look at the pretty flashing breathing lights either!

Charlie,

We are doing something similar & appreciate your input. Will report on progress / lack of same.

Many Thanks, _Todd

Had an interesting situation develop over the weekend on my 4 Xenon + 1 Argon mesh network. It has been working very well - no gaps in data. There are a few missed pong() calls (pong() is my function which gets called from the cloud every minute), and a few over-the-cloud resets in the past week.
Friday I took one of the Xenons on a field trip for a little show and tell with the customer. I knew that it would appear to be unresponsive, result in numerous forced resets of the Argon, etc. All of that made for a better show.
When I returned, the Xenon happily rejoined the mesh network, but I could no longer pong() the Argon. In fact it had become unresponsive shortly after I took the Xenon out. All the while mesh traffic was passing smoothly and all remaining Xenons were able to Particle.publish() successfully.

I waited 18 hours before manually resetting the Argon, after which everything returned to normal.

So I’m adding a 30-minute software timer, which will be reset on each pong(). If the timer ever expires it will call System.reset(). So there are two ways now to reset, one external and one internal.

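void watchDogTimerBounce() {
    System.reset();
}

// 04-01-2019 30 minutes (1,800,000 ms)
// Note: a Timer does not run until start() or reset() is called, so the
// watchdog is only armed after the first pong() unless setup() calls
// watchDogTimer.start().
Timer watchDogTimer(1800000, watchDogTimerBounce);

/*  04-01-2019
 *  This is the function called by the Database Node to determine if the Node is online
 */
int pong(String extra) {
    // 04-01-2019 feed the watchdog, and nobody gets hurt!
    watchDogTimer.reset();
    return 0;
}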

@holobox, you mention “pong()”. Perhaps it is your code that is not recovering the “pongs”. Are you getting other mesh traffic between the Xenon and the Argon besides the fact that it reconnects to the Argon mesh?

@peekay123 pong is just a function that I…

Particle.function("pong", pong); in setup.

and call using the Particle cloud API. It is just another way of pinging the device. It’s rock solid. As far as I am able to tell (I’m moving temperature data around the mesh and logging through the cloud) both the local Mesh.publish() and Particle.publish() are working well.

@holobox, so the Argon itself was not cloud publishing but all nodes were? Could you “pong” any of the nodes (assuming you have that function on the nodes)? Did the watchdog reset fix the problem?

I don’t have the Argon publishing to the cloud.

The new watchdog was put in this morning, will try to induce the same issue sometime this week.

@holobox, can you post your “pong” code on the Argon? I am wondering why the nodes are still able to cloud publish but not the Argon in response to the “pong” function call. In my experience, when the Argon goes weird, the entire mesh loses cloud connectivity. Do you have Particle.variables() on any of the nodes you can query?

It’s up above in this thread, it’s a two-liner. Not sure what kind of weirdness you’re referring to. When my Xenons have temperature data, they publish that to the mesh for the other devices to see. The Argon picks that message up and sends an Ack back to the Xenon; the Xenon then publishes that same data (+ the acknowledgement time in msecs) out to the cloud.

  • as far as I can tell, when the Argon became non-responsive to the pong() function call – all other gateway functions remained working at 100%.

@holobox, I missed that bit of code! When I say weirdness, I have had times when the Argon and all its nodes become unresponsive to the cloud but continue their Mesh comms just fine. Since your “pong” code doesn’t publish a “ping”, I’m not sure why the Argon seemingly stalls. I will have to try the experiment on my own Argon mesh.

My experience with Argon and cloud isolation (while the local mesh still runs) is exactly as discussed / described by Holobox. I have gone back to “blink” code to rule out anything we are doing… After a while (hours to days) Xenon devices on a network with an Argon gateway become unreachable from the Particle.io cloud / API, yet the local Mesh is demonstrably active & correct. I expect that to test today I will need to manually reset the Argon gateway.

I think perhaps we hold off and await new firmware. 0.90 is very mature compared to 0.87 it seems & 0.9x+ priority list I think includes two items which may help us out here… If I recall the ‘feature’/‘request’ tally I hope we can expect a Particle watchdog item as well as the ability to add multiple gateways per Network.

The multiple gateway would likely help since we do have Ethernet at a test location, it would also be easy to place a Boron as a backup gateway.

Thank you for sharing your code (holobox) and to Particle for making this community resource available.

Just a progress report and a bit on my perspective regarding stability.

I am at 100% cloud availability as far as I’m concerned. I use the Particle API from my DB server to call a function on each device and log the response (pass/fail) to the DB. I keep minute-by-minute data and prune records older than 14 days. I don’t look at the device list at particle.io; I’m not sure I have ever trusted the accuracy of that.

In this roughly 2-month period I have learned to be very suspicious of my own code. If cloud availability is in the ditch, start investigating by reducing your loop() to Particle.process(); and see if that makes a difference. If it does, then experiment with some other approaches to doing the business you need in loop(). Oh, by the way - I’ve become particularly suspicious of software timers (Timer). I just don’t trust them anymore, although I know many are using them without issues.
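One possible alternative to a software Timer for the 30-minute pong() watchdog discussed earlier in the thread is to track the deadline yourself and check it from loop(). The sketch below is plain C++ so it can be tried off-device; the class name `PongDeadline` is hypothetical, and it assumes the caller passes in a millisecond tick count such as the value of millis().

```cpp
#include <cstdint>

// Minimal deadline tracker driven by a millisecond tick count.
// This is an illustrative sketch, not code from any of the devices in this thread.
class PongDeadline {
public:
    explicit PongDeadline(uint32_t timeoutMs)
        : timeoutMs_(timeoutMs), lastFedMs_(0) {}

    // Call from the pong() handler to record that the cloud reached the device.
    void feed(uint32_t nowMs) { lastFedMs_ = nowMs; }

    // Call from loop(). Unsigned subtraction keeps the elapsed time correct
    // across the 32-bit tick rollover (about every 49.7 days).
    bool expired(uint32_t nowMs) const {
        return static_cast<uint32_t>(nowMs - lastFedMs_) >= timeoutMs_;
    }

private:
    uint32_t timeoutMs_;
    uint32_t lastFedMs_;
};
```

On a device, pong() would call feed(millis()), and loop() would call System.reset() once expired(millis()) returns true. Whether this behaves better than Timer in practice is something to test rather than assume.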

At any rate, the code I’m writing for Xenon/Argon is significantly more streamlined and, I guess I’d have to say - simpler than many of my Electron projects.

On Monday around 1400 ET I started to see some ping failures. I was careful not to make any changes to my mesh network that could affect things. The failures were somewhere around 5% but they persisted for more than a day. They finally started clearing up around 1800 ET Tuesday. Everything within the mesh was unaffected.
