Degradation of the Mesh Network Reliability during OTA Flashing

I saw that RC27 was released last night, so this morning I immediately started flashing my mesh network, which runs my Marco-Polo mesh heartbeat test code, to keep the mesh stability testing going.

First, I updated my Argon gateway. As expected, the mesh dropped and all the nodes went to blinking green. After a lengthy magenta blinking session, the Argon came back up and started running user code again.

The mesh did not recover automatically. All the nodes stayed blinking green. Only after I reset one of the Xenons did the rest of the devices almost immediately give a quick cyan flash and then go to breathing cyan. It is strange that the mesh didn’t recover by itself, and strange that resetting only one Xenon recovered the whole mesh network.

Then I started updating the nodes. The first node I flashed caused the entire mesh network to drop. The Argon was running the heartbeat code and the nodes were responding (indicated by the “beating” D7), but the Argon wasn’t receiving the responses. I assume this was because the firmware being flashed to that node was saturating the limited bandwidth of the mesh network. On a subsequent flash, I accidentally flashed one of the nodes with RC26 (which it already had) because I forgot to switch the OS target in the Web IDE. During that RC26 flash, the response latency went up to about 400 ms, but the mesh stayed up and all node responses were received. Flashing RC27 on the remaining nodes didn’t seem to degrade the mesh network as much. During those flashes, 3 out of 4 nodes kept responding as expected while the single node being updated went through the flashing magenta sequence.
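
For anyone who wants to record the same kind of observation, here is a minimal sketch of how a gateway could tally one heartbeat round: who answered, who did not, and the worst latency. This is not the Marco-Polo code used in this thread; the function name `summarize_round`, the node names, and the timestamps are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def summarize_round(
    sent_at_ms: int,
    expected_nodes: Iterable[str],
    replies: Dict[str, int],
) -> Tuple[List[str], List[str], int]:
    """Summarize one heartbeat round as seen by the gateway.

    replies maps a node name to the time (ms) its reply reached the gateway.
    Returns (responders, missing, worst_latency_ms).
    """
    expected = list(expected_nodes)
    responders = [n for n in expected if n in replies]
    missing = [n for n in expected if n not in replies]
    # Worst round-trip latency among the nodes that did answer (0 if none did).
    worst = max((replies[n] - sent_at_ms for n in responders), default=0)
    return responders, missing, worst


# Hypothetical round: four nodes expected, one is busy being flashed.
nodes = ["xenon-1", "xenon-2", "xenon-3", "xenon-4"]
replies = {"xenon-1": 1040, "xenon-2": 1400, "xenon-4": 1055}
responders, missing, worst = summarize_round(1000, nodes, replies)
print(f"{len(responders)}/{len(nodes)} responded, missing={missing}, worst={worst} ms")
```

Logging one such summary per round makes it easier to compare how much a given flash disturbed the mesh.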

The point is, if you are running critical mesh communications, an OTA flash of new user code will cause some latency. However, an OTA flash of new user code that also requires a Device OS change may cause significant degradation of the mesh network while the safe mode healer updates the device.

This also poses an interesting problem for updating a mesh network that is part of a “product” (when that functionality comes around). It wouldn’t be advantageous for every mesh node to try to flash OTA at the same time. I am currently doing a “rolling upgrade” by manually flashing one node at a time. The product updates for mesh might have to use a similar approach to make sure every node gets its OTA without causing excessive failures due to bandwidth limitations.
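
The manual rolling upgrade described above could be automated roughly as follows. This is only a sketch under stated assumptions: `flash` and `is_healthy` are hypothetical callables standing in for whatever actually triggers an OTA on one node and checks that it is answering heartbeats again. They are not Particle APIs.

```python
import time
from typing import Callable, Iterable, List


def rolling_upgrade(
    nodes: Iterable[str],
    flash: Callable[[str], None],       # hypothetical: start an OTA on one node
    is_healthy: Callable[[str], bool],  # hypothetical: node answers heartbeats again
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Flash nodes one at a time and stop at the first node that does not recover.

    Returns the nodes that were flashed and came back healthy.
    """
    upgraded: List[str] = []
    for node in nodes:
        flash(node)
        waited = 0.0
        while not is_healthy(node):
            if waited >= timeout_s:
                # Halt the rollout so a bad update cannot take down every node.
                return upgraded
            sleep(poll_s)
            waited += poll_s
        upgraded.append(node)
    return upgraded


# Usage with stand-in callables; "xenon-3" never recovers in this made-up run.
done = rolling_upgrade(
    ["xenon-1", "xenon-2", "xenon-3", "xenon-4"],
    flash=lambda node: None,
    is_healthy=lambda node: node != "xenon-3",
    timeout_s=10,
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print("upgraded:", done)
```

Stopping at the first failure is a design choice: it keeps the remaining nodes on the old, known-good firmware until someone looks at the node that did not come back.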

This observation ties into some of the mesh network stress testing that @syrinxtech was doing.

I’ll just quickly comment that IMO, rc.27 is the first checkpoint where we can start using words like stability and reliability.

Repeating the same test, flashing rc27 system firmware to a mesh that is already fully running rc.27, might be a better benchmarking approach.

@ninjatill, my Argon/5 x Xenon mesh running rc26 showed similar behaviour during OTA updates but none of the Xenons crashed or brought down the mesh. Updating the Argon, which is running Marco, did drop the mesh and all but one Xenon came back up without needing a reboot. Three Xenons are running Polo, one is running Weather code (webhook, parsing, sending to Mesh) and the last is running a Waveshare display to show the weather broadcast by the other Xenon.

It is important to note that system_part1 is a 600 KB+ file, so at 250 kbps an OTA will absolutely affect the mesh traffic. However, it never failed on any of my Xenons. Once all the devices were updated, doing a typical user app OTA was MUCH faster and less disruptive on the mesh. I am finding rc27 VERY solid now based on the Marco-Polo results and the rock-solid stability of the Weather apps.
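
To put rough numbers on that, here is a back-of-the-envelope calculation. It assumes 600 KB means 600,000 bytes and 250 kbps means 250,000 bits per second of raw link rate. The `efficiency` parameter and the 25% figure are illustrative assumptions, not measurements; real throughput depends on protocol overhead, retries, and the number of hops.

```python
def min_transfer_seconds(size_bytes: int, link_bps: int, efficiency: float = 1.0) -> float:
    """Lower bound on transfer time: payload bits divided by usable link rate."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return (size_bytes * 8) / (link_bps * efficiency)


ideal = min_transfer_seconds(600_000, 250_000)
quarter = min_transfer_seconds(600_000, 250_000, efficiency=0.25)  # assumed 25% usable
print(f"ideal link: {ideal:.1f} s")    # ideal link: 19.2 s
print(f"25% usable: {quarter:.1f} s")  # 25% usable: 76.8 s
```

Even in the ideal case the radio is fully occupied for about 19 seconds per node, which fits the observation that a system firmware OTA visibly disturbs other mesh traffic.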

I concur. I'm highly optimistic.

Also to note, I did flash my home network with RC27 (single Argon gateway with single Xenon node, running Marco-Polo without system thread enabled). After the Argon update, the single Xenon node responded to the very first heartbeat and I was able to flash the Xenon to RC27 successfully as well. Good thing since I don't have physical access to that network at the moment.

I'm not so worried about what I experienced on the multi-node network update except to serve as a warning for network/product/code design. Not that I'm looking for full Quality of Service (QoS), but it might make sense to be able to flag certain mesh traffic as high priority. For example, if you have a sensor network, and receiving data from a particular sensor is of utmost importance, you could flag that traffic as high-priority. Then, when an OTA comes down, the sensor data will not be dropped/delayed in favor of the OTA. Something like a Mesh.publish(event, data, HIGH_PRIORITY). (Of course, as some network admins forget from time to time, if you flag every publish as high priority, then essentially all traffic has the same priority and therefore none of it is really high priority.) I know nothing about OpenThread, so that might not be feasible within the protocol framework.
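
As a sketch of the idea only, a sender-side outbox that drains high-priority messages first could look like this. `PriorityOutbox`, `HIGH_PRIORITY`, and `NORMAL_PRIORITY` are hypothetical names; nothing here is an existing Particle or OpenThread API, and it says nothing about what intermediate nodes or the gateway could do with traffic in flight.

```python
import heapq
import itertools
from typing import List, Optional, Tuple

HIGH_PRIORITY = 0    # lower number is sent first
NORMAL_PRIORITY = 1


class PriorityOutbox:
    """Queue outgoing messages; high priority first, FIFO within a priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str, str]] = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def publish(self, event: str, data: str, priority: int = NORMAL_PRIORITY) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), event, data))

    def next_message(self) -> Optional[Tuple[str, str]]:
        """Return the next (event, data) to transmit, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, event, data = heapq.heappop(self._heap)
        return event, data


outbox = PriorityOutbox()
outbox.publish("ota/chunk", "0")
outbox.publish("sensor/temp", "21.5", HIGH_PRIORITY)
outbox.publish("ota/chunk", "1")
print(outbox.next_message())  # ('sensor/temp', '21.5')
```

If every publish is flagged `HIGH_PRIORITY`, the queue degenerates to plain FIFO, which is exactly the caveat raised above.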

@ninjatill, it is still early days in regards to mesh performance, management, etc. Because of the ephemeral nature of Mesh traffic (UDP), managing “priority”, aka congestion, will become an area of focus for sure. However, without mesh metrics and (ideally) mesh monitoring tools, this will remain in the medium future IMO.

@kennethlimcp Friday is my IoT day so I will perform the same tests that I used on rc26 and will upgrade to rc27.

have fun :slight_smile:

I have no idea how Particle implements OTA, but I’m assuming it’s based on UDP. Just curious, does any of the inherent Particle architecture make use of multicast addresses? If so, would it be possible to use a different multicast address for OTA and, as someone above suggested, make it a lower priority? Just shooting spitballs…no flames please if the question is way out of left field.

@syrinxtech, since none of us are familiar with Mesh traffic and OTA-over-mesh functionality, I suggest not making any assumptions. @rickkas7, any comments?

It is UDP. However, OTA is done in CoAP over DTLS over UDP. Since each cloud CoAP connection is encrypted uniquely for each device, it’s not possible to multicast.

Since the gateway can’t even decode the CoAP packets destined for the mesh nodes it would be non-trivial to do QoS or packet prioritization at that layer.

@rickkas7 Thanks for the clarification.

One other off-the-wall question…much like the Microsoft world depends on SMS, Altiris or other software products to patch a large quantity of devices, would it be possible for Particle to build software that would allow one node to act as the patch repository on a local network? That way, the one device could be responsible for patching the end nodes and it wouldn’t have to be burdened with also acting as a gateway?

I know that would probably radically alter the way you’re doing it now, i.e., the cloud-based way, but it was just a thought. In the traditional network world there just isn’t a good way to patch hundreds to thousands of devices in real time without a dedicated platform that isn’t cloud based. Of course, if you wanted to get fancy and it was feasible, you could have multiple “patch servers” on each mesh network, similar to having multiple DNS or DHCP servers.

Like many things, of course it could be done. Where it fits in the priority of features and whether it’s worthwhile and can be done securely is another question.

The most obvious case to me would be the Boron gateway; in particular, for Device OS upgrades, if it could download the image once and then send it to all of the Xenon nodes, that could be really useful and save a lot of time and data usage. But because of the way updates are done it would require major changes, so don’t expect that any time soon.
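
A rough illustration of the data-usage difference, reusing the 600 KB system part size mentioned earlier in the thread and assuming five Xenon nodes behind one cellular gateway. The function name and the node count are illustrative assumptions, and the gateway's own update is not counted.

```python
def cellular_bytes(image_bytes: int, node_count: int, cache_on_gateway: bool) -> int:
    """Bytes pulled over the cellular link to update every node with the same image."""
    # Today each node gets its own download; a caching gateway would fetch once.
    return image_bytes if cache_on_gateway else image_bytes * node_count


per_node = cellular_bytes(600_000, 5, cache_on_gateway=False)
cached = cellular_bytes(600_000, 5, cache_on_gateway=True)
print(f"per-node downloads: {per_node / 1000:.0f} KB")  # per-node downloads: 3000 KB
print(f"gateway cache:      {cached / 1000:.0f} KB")    # gateway cache:      600 KB
```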
