Xenon mesh unstable

I received my xenons and ethernet featherwing this past friday and have been unable to keep a mesh network up. It comes up fine and works for a while, but never more than 10 minutes. When it fails, all mesh connected nodes blink green. The gateway stays in breathing cyan state, and responds to pings (both ICMP pings from the command prompt and pings from the particle.io web gui).

Removing and reconnecting the network connection to the gateway brings the mesh network back up.

I have tried:

  • 3 different xenons on the featherwing
  • 5 different xenons as clients
  • deleting everything and starting from scratch
  • deleting everything and starting from scratch 4 days later (today), with 2 virgin xenons and a new network name
  • multiple USB power sources on both devices
  • plugging the featherwing into 2 different network devices
  • static ip reservation for the ethernet featherwing with all edge firewall rules disabled

Any ideas?

Tomorrow I will take these two xenons and the featherwing to work and try my work internet. It seems very strange that the xenon on the featherwing remains connected to particle.io and breathing cyan, but removing/reconnecting its network cable brings the mesh network back up for a few minutes.

What are the gateway xenon and edge xenons doing? I mean what is the user firmware doing? Pub/sub, delay etc…

Nothing. All i’ve done with them so far is open the boxes, power them up & config them with the app.

Ah I see. I assume that the edge xenons are placed very close to the gateway xenons?

Yep, they’re about 6 feet apart right now. I’ve had them both on my desk inches apart, and up to about 20 feet when I had the gateway connected to a different ethernet switch. No difference.

I just tried changing the DNS server handed out by my dhcpd to 8.8.8.8 instead of my internal DNS server. This had no effect.

@rickkas7, do we have sample code to get logs via USB serial?

You can flash any code which features SerialLogHandler logger(LOG_LEVEL_ALL); to get logs from the device via USB Serial.

I’ve got the two test devices with me at work today, and they work fine on the network here. I have comcast at home with an arris surfboard modem. I’ve seen the other threads where people are having problems with arris modems.

I’ve flashed code with the SerialLogHandler to both of these xenons. I’ll try them at home this evening and see what the logs say. Thanks @kennethlimcp and @ScruffR for your help so far!

3 Likes

Ok, I’ve got logs of the failure.
I don’t see a way to attach text files here, so this is a snippet of the log file from the xenon acting as the gateway right when the client loses it’s connection:

0000339098 [comm.protocol] INFO: message id 26 complete with code 0.00
0000339098 [comm.protocol] INFO: rcv'd message type=13
0000350012 [hal] INFO: DNS server list changed
0000350013 [system.nm] TRACE: br_enabled=0
0000350018 [hal] INFO: DNS server list changed
0000350108 [net.br] INFO: Stopping
0000350109 [net.br] TRACE: Disabling NAT64
0000350109 [net.br] TRACE: Stopping DNS64
0000350300 [net.br] TRACE: Disabling border routing
0000350302 [system.ot] TRACE: Thread network data changed
0000350302 [net.th] TRACE: OpenThread state changed: 200
0000350303 [net.th] TRACE: OT_CHANGED_THREAD_NETDATA
0000350304 [system.ot] TRACE: IPv6 address was removed
0000350305 [net.th] TRACE: OpenThread state changed: 2
0000350305 [net.ifapi] TRACE: Netif th1 ipv6 addr state changed
0000350307 [system.nm] INFO: State changed: IP_CONFIGURED -> IFACE_LINK_UP
0000350308 [net.ifapi] TRACE: Netif th1 ipv6 configuration changed
0000350309 [net.th] TRACE: OT_CHANGED_IP6_ADDRESS_REMOVED
0000350310 [system.nm] INFO: stopping disabled gateway
0000350711 [comm] INFO: Forcing a cloud ping
0000350813 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0000350814 [comm.protocol] INFO: message id 27 complete with code 0.00
0000350815 [comm.protocol] INFO: rcv'd message type=13

And here’s the client’s log:

0000338308 [comm.protocol] INFO: message id 17 complete with code 0.00
0000338309 [comm.protocol] INFO: rcv'd message type=13
0000343904 [system.ot] TRACE: Thread network data changed
0000343905 [net.th] TRACE: OpenThread state changed: 200
0000343906 [net.th] TRACE: OT_CHANGED_THREAD_NETDATA
0000343907 [system.ot] TRACE: IPv6 address was removed
0000343907 [net.th] TRACE: OpenThread state changed: 2
0000343908 [net.ifapi] TRACE: Netif th1 ipv6 addr state changed
0000343909 [system.nm] INFO: State changed: IP_CONFIGURED -> IFACE_LINK_UP
0000343910 [net.ifapi] TRACE: Netif th1 ipv6 configuration changed
0000343912 [net.th] TRACE: OT_CHANGED_IP6_ADDRESS_REMOVED
0000344115 [comm] INFO: Forcing a cloud ping
0000344116 [comm.dtls] WARN: mbedtls_ssl_write returned ffffffff
0000344118 [comm.protocol] ERROR: Event loop error 3
0000344118 [system] WARN: Communication loop error, closing cloud socket
0000344119 [system] INFO: Cloud: disconnecting
0000344120 [system] INFO: Cloud: disconnected

I believe you are experiencing the Arris modem issue - it causes frequent cloud disconnects from the cloud and is caused by a DNS issue.

A workaround has been found but there isn’t a build with the fix yet (0.8.0-rc.25 does not have the fix).

3 Likes

Interesting. I do have an arris modem, however I don’t use it for dns or dhcp or anything. I have a separate firewall that handles that stuff, and i have ipv6 disabled there. I wonder if i have anything on my internal network that is sending out nd6 packets. I’ll have a look tonight.

1 Like

I’ve solved this. I run openwrt on my access points, but don’t run dhcpd on them, they’re just bridges onto my wired network. odhcpd was still running on them, even tho there was nothing for it to do. but apparently it was still sending out nd6 advertisements. I didn’t notice before, because I have nothing actually using ipv6. Killed odhcpd and the mesh has been up for an hour. Now on to the fun stuff! Thanks @rickkas7, @ScruffR, and @kennethlimcp!

6 Likes

@sean13 can you share your code with the SerialLogHandler? It could come in handy with troubleshooting my mesh network.

You can just take the Tinker example code (or even an empty project) and add this line to the top of it

SerialLogHandler traceLog(LOG_LEVEL_ALL);
1 Like

Is there a delay I need to add? When adding this, the xenon restarts but when trying to open putty, I get an error saying it cannot communicate with that serial port. I tried a few things last night based on some other posts, if I add this to void setup() and add a delay after, it does give me time to open and gives me a few seconds of output, but then seems to disconnect from serial.

You can’t delay the system with a delay in your code.
The easiest way to capture the incoming data as soon as possible is by use of
particle serial monitor --follow

1 Like

Got it, I’ll give that a shot and report back.

Looks like it may still be something with IPV6 routing.

Serial connection closed.  Attempting to reconnect...
Serial monitor opened successfully:
0000000671 [system.ot] INFO: Role changed: router
0000000672 [system.ot] TRACE: RLOC was added
0000000672 [system.ot] TRACE: RLOC was removed
0000000674 [system.ot] TRACE: Partition ID changed
0000000676 [system.ot] TRACE: Subscribed to IPv6 multicast address
0000000679 [net.th] TRACE: OpenThread state changed: 10e4
0000000681 [net.ifapi] INFO: Netif th1 link UP
0000000683 [system.nm] INFO: State changed: IFACE_UP -> IFACE_LINK_UP
0000000686 [net.ifapi] TRACE: Netif th1 ipv6 addr state changed
0000000687 [net.ifapi] TRACE: Netif th1 ipv6 addr state changed
0000000688 [net.th] TRACE: Added FE80::F40F:D968:F4D9:3CA8 0
0000000689 [net.ifapi] TRACE: Netif th1 ipv6 addr state changed
0000000691 [net.ifapi] TRACE: Netif th1 ipv6 addr state changed
0000000693 [net.th] TRACE: Added FD21:76C1:2053:0:E95:CE81:FD24:F294 0
0000000694 [net.th] TRACE: OT_CHANGED_IP6_ADDRESS_ADDED
0000000695 [net.th] TRACE: OT_CHANGED_THREAD_ROLE
0000000696 [net.th] TRACE: OT_CHANGED_THREAD_RLOC_ADDED
0000000698 [net.th] TRACE: OT_CHANGED_THREAD_RLOC_REMOVED
0000000700 [net.th] TRACE: OT_CHANGED_THREAD_PARTITION_ID
0000000700 [net.th] TRACE: OT_CHANGED_IP6_MULTICAST_SUBSRCRIBED
0000000715 [system.ot] TRACE: Thread network data changed
0000000715 [net.th] TRACE: OpenThread state changed: 200
0000000716 [net.th] TRACE: OT_CHANGED_THREAD_NETDATA
0000000717 [net.th] TRACE: Candidate preferred prefix: FD0D:E520:66DC::/64, preference = Medium, RLOC16 = 9c00, preferred = 1, stable = 1
0000000720 [net.th] INFO: Switched over to a new preferred prefix: FD0D:E520:66DC::/64, preference = Medium, RLOC16 = 9c00, preferred = 1, stable = 1
0000000722 [hal] INFO: DNS server list changed
0000000723 [net.th] INFO: DNS server on mesh network: FD0D:E520:66DC::1
0000000725 [system.ot] TRACE: IPv6 address was added
0000000727 [net.th] TRACE: OpenThread state changed: 1
0000000729 [net.ifapi] TRACE: Netif th1 ipv6 addr state changed
0000000732 [net.ifapi] TRACE: Netif th1 ipv6 addr state changed
0000000735 [system.nm] INFO: State changed: IFACE_LINK_UP -> IP_CONFIGURED
0000000737 [net.th] TRACE: Added FD0D:E520:66DC:0:FCD9:7A96:92C:A557 0
0000000738 [net.th] TRACE: OT_CHANGED_IP6_ADDRESS_ADDED
0000000739 [system] INFO: Cloud: connecting
0000000742 [system] INFO: Read Server Address = type:1,domain:$id.udp-mesh.particle.io
0000000745 [system] INFO: Loaded cloud server address and port from session data
0000000747 [system] TRACE: Address type: 1
0000000749 [system] TRACE: Cloud socket=0, family=10, type=2, protocol=17
0000000752 [system] INFO: Cloud socket=0, connecting to 64:FF9B::36E3:ABAB#5684
0000000754 [system] TRACE: Cloud socket=0, connected to 64:FF9B::36E3:ABAB#5684
0000000757 [system] TRACE: Updating cloud keepalive for AF_INET6: 30000 -> 30000
0000000760 [system] TRACE: Applying new keepalive interval now
0000000763 [system] INFO: Cloud socket connected
0000000765 [system] INFO: Starting handshake: presense_announce=0
0000000768 [comm.protocol.handshake] INFO: Establish secure connection
0000001155 [comm.dtls] INFO: (CMPL,RENEG,NO_SESS,ERR) restoreStatus=0
0000001157 [comm.dtls] INFO: out_ctr 0,1,0,0,0,0,0,103, next_coap_id=41
0000001159 [comm.dtls] INFO: app state crc: cached: 371e1039, actual: 38e5db72
0000001160 [comm.dtls] INFO: restored session from persisted session data. next_msg_id=65
0000001163 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 2
0000001165 [comm.protocol.handshake] INFO: Sending HELLO message
0000001354 [comm.protocol.handshake] INFO: Handshake completed
0000001356 [system] INFO: Send spark/hardware/max_binary event
0000001361 [system] INFO: Send subscriptions
0000001362 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 4
0000001363 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 3
0000001365 [comm] INFO: Sending TIME request
0000001367 [comm.protocol] INFO: Sending 'M' describe message
0000001369 [comm.protocol] INFO: rcv'd message type=1
0000001370 [system] INFO: Cloud connected
0000001371 [system] TRACE: Updating cloud keepalive for AF_INET6: 30000 -> 30000
0000001372 [system] TRACE: Applying new keepalive interval now
0000001374 [comm] INFO: Forcing a cloud ping
0000001455 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0000001456 [comm.protocol] INFO: message id 67 complete with code 0.00
0000001458 [comm.protocol] INFO: rcv'd message type=13
0000001462 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0000001463 [comm.protocol] INFO: message id 68 complete with code 0.00
0000001464 [comm.protocol] INFO: rcv'd message type=13
0000001506 [comm.protocol] TRACE: Reply recieved: type=2, code=69
0000001507 [comm.protocol] INFO: message id 69 complete with code 2.05
0000001508 [comm.protocol] INFO: Received TIME response: 1542551736
0000001511 [comm.protocol] INFO: rcv'd message type=12
0000001512 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0000001514 [comm.protocol] INFO: message id 70 complete with code 0.00
0000001516 [comm.protocol] INFO: rcv'd message type=13
0000003082 [comm.protocol] INFO: Sending 'S' describe message
0000003086 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 4
0000003472 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 3
0000003472 [comm.protocol] INFO: rcv'd message type=1
0000003474 [comm.protocol] INFO: Sending 'A' describe message
0000003476 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 4
0000003477 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 3
0000003478 [comm.protocol] INFO: rcv'd message type=1

although I may also be reading this completely wrong.

@Mjones I have switched from PuTTY to TeraTerm for most things Particle, because even if the device disconnects (e.g. due to rebooting), TeraTerm will automagically reconnect when the device comes back up and its serial interface comes on line. I could not figure out how to make PuTTY do this.

1 Like