[Mesh] Almost 100% of messages are being dropped

I am seeing a problem where two Xenons communicate via mesh and one is dropping almost 100% of the messages it receives. (It’s not a subscription problem.)

Xenon A:

  1. while (buffer.size() > 0) {Mesh.publish the packet} // Typically 5-100 messages queued up for sending
  2. If there’s a provisioning request, send the provisioning packet

Xenon B:

  1. receive packet
  2. if packet descriptor is unrecognized, request provisioning

Xenon B captures 100% of the packets sent by Xenon A. Xenon A, however, seems to receive almost none of Xenon B’s provisioning requests. Maybe one out of several hundred gets through.

I have this problem in both single-threaded and multi-threaded modes. My theory is that Xenon A is somehow not servicing the incoming mesh packets and they are being discarded. I don’t understand the mechanism for this, though. My guess is that it’s because Xenon A’s publishing loop runs without interruption, so Xenon B is almost certainly publishing its provisioning request while Xenon A is still dumping its buffer.

I call Particle.process() as well as os_thread_yield() between every Mesh.publish() on Xenon A, so if my theory that I’m not servicing something quickly enough is right, then I’m confused why neither of those commands clears the logjam.

Any ideas?

Generally SYSTEM_THREAD(ENABLED); is best.
It might be that your event handler on Xenon B is calling Mesh.publish() from within the handler, which you shouldn’t do. Perhaps you could share the code? You shouldn’t have to call Particle.process() or os_thread_yield(), IMHO.


Handlers are a stripped down “Copy the data out of the buffer, set a flag, and get outta there” kinda things.
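
For anyone following along, here is a minimal sketch of that "copy, flag, get out" pattern in plain C++. The names onMeshEvent, takePendingRequest, pendingTopic and requestPending are made up for illustration and are not taken from the Afloat sources. On a device, onMeshEvent would be the function registered with Mesh.subscribe(), and takePendingRequest would be called from loop(), which is where the actual Mesh.publish() belongs.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical single-slot mailbox shared between the handler and loop().
static char pendingTopic[64] = "";
static std::atomic<bool> requestPending{false};

// Handler side: copy the topic out of the system buffer, raise the flag, return.
// No publishing and no long-running work happens here.
void onMeshEvent(const char* event, const char* data) {
    (void)data;  // payload is not needed for this sketch
    assert(event != nullptr);
    std::strncpy(pendingTopic, event, sizeof(pendingTopic) - 1);
    pendingTopic[sizeof(pendingTopic) - 1] = '\0';
    requestPending.store(true);
}

// loop() side: returns true exactly once per raised flag and hands back
// a private copy of the topic that the caller can publish to.
bool takePendingRequest(char* out, std::size_t outSize) {
    if (outSize == 0 || !requestPending.exchange(false)) {
        return false;
    }
    std::strncpy(out, pendingTopic, outSize - 1);
    out[outSize - 1] = '\0';
    return true;
}
```

This only holds one pending request; a second event arriving before loop() runs overwrites the first, which may or may not be acceptable depending on the protocol.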

Xenon A message handler: https://gitlab.com/kubark42/afloat/-/blob/master/firmware/BilgeSensor/BilgeSensor.cpp#L142, which calls https://gitlab.com/kubark42/afloat/-/blob/master/firmware/Libraries/AfloatAbstractNode.cpp#L109

Xenon B message handler: https://gitlab.com/kubark42/afloat/-/blob/master/firmware/Nexus/Nexus.cpp#L75 which in turn triggers https://gitlab.com/kubark42/afloat/-/blob/master/firmware/Nexus/Nexus.cpp#L338

The message being sent is whoami, and you can see that all it does on Xenon A is set the mustSendProvisioning flag to true, and then returns. On Xenon B I see the console spew indicating that the handler successfully set the flag and the main loop is correctly requesting the provisioning. I can’t guarantee that the message is going out of Xenon B, though.

I’m also not seeing the serial console spew (unlike publish commands, there should be no problem with dumping to the serial port during a callback), so I’m very confident the data is never getting to the callback of Xenon A. Since an occasional message gets through, I know that the subscription is correct.

So somewhere between what looks like a correctly called Mesh.publish() and a correctly configured Mesh.subscribe(), the message is going missing. I don’t know how to get any more insight into the problem.

Can you provide details on the topology of your mesh, how many Xenons, etc.? Are the two devices in question adjacent or separated by other nodes?

The mesh has three units, a gateway Argon and two Xenons. All are located on the same desk, so presumably the two Xenons have direct access. AFAICT everything’s set up for success and in practice there’s absolutely no problem with Xenon B receiving messages from Xenon A.

Crazy thought. Is it possible that there’s a race condition driven by my clearing the requestProvisioning immediately after publishing the mesh packet? If the subsystem isn’t making a deep copy of the topic string, that could well be the problem.

Serial.printlnf("[MESH][PROVISIONING] request provisioning from `%s`", requestProvisioning.c_str());
Mesh.publish(requestProvisioning, "whoami");
requestProvisioning = "";

I’ve checked and it doesn’t look like that’s what’s going on. But it wouldn’t be the craziest thing I’ve ever seen happen.

Eh, it really looks like this isn't what's happening. https://github.com/particle-iot/device-os/blob/dbdc3763373d9d3c292a896898b76404804b6e6c/wiring/src/spark_wiring_mesh.cpp#L222 shows what's likely to be a deep copy. GitHub doesn't make it easy to dive into the code and figure out what happens inside _udp.write(), but based on the call signature I would be surprised if it isn't copying those bytes to a UDP buffer.

@kubark42, remember that ALL mesh messages are implemented via multicast UDP meaning “fire and forget” type messages. I don’t know at which rate you are sending messages but what happens if you slow things down, do messages get received by A?
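
To make the "slow things down" experiment concrete, here is a rough sketch of a paced sender that emits at most one queued message per interval instead of dumping the whole buffer in a tight loop. PacedSender, poll and the send callback are hypothetical names, and this is plain C++ rather than Device OS code. On a Xenon you would presumably pass millis() as nowMs and wrap Mesh.publish() in the callback.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <string>

// Hypothetical helper: drains a queue at no more than one message per interval.
struct PacedSender {
    std::deque<std::string> queue;
    uint32_t minIntervalMs;
    uint32_t lastSendMs = 0;
    bool sentAny = false;

    explicit PacedSender(uint32_t intervalMs) : minIntervalMs(intervalMs) {}

    // Call once per loop() pass. Sends at most one message and returns
    // true if a message was handed to the send callback.
    bool poll(uint32_t nowMs, const std::function<bool(const std::string&)>& send) {
        assert(static_cast<bool>(send));  // a send callback must be supplied
        if (queue.empty()) {
            return false;
        }
        // Unsigned subtraction stays correct when the millisecond counter wraps.
        if (sentAny && static_cast<uint32_t>(nowMs - lastSendMs) < minIntervalMs) {
            return false;
        }
        send(queue.front());
        queue.pop_front();
        lastSendMs = nowMs;
        sentAny = true;
        return true;
    }
};
```

Between polls the radio is idle from this node's point of view, which gives the other Xenon a window to get its request through. If the drop rate changes with the interval, that points at contention rather than at the subscription.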

Is the subscription callback on A being called? You may want to test using a blinking LED instead of Serial.print() or use a SerialLogHandler which may be safer to use. Your hunch on the buffer is interesting. We know that for the Cloud Publish and Subscribe functions, they share a common buffer so precautions must be taken.

Perhaps you need to strip down the code to a test case which simulates the messages going between A and B so you can post it for others to test.

BTW, you seem to use String throughout your code which is a known cause of heap fragmentation. You may want to convert to c-strings and associated functions.

The flag is digested in loop() only once every couple of seconds, and then it emits a single packet. However, the incoming data can be faster, so in theory it's possible that the outgoing request is getting lost in the overall bandwidth usage. That feels a little weird, though, because typically radios don't transmit unless they can get a slot. This would indicate to me that the message gets delivered but simply isn't handled on Xenon A. But as you point out, with UDP there are no guarantees.

If the callback were firing then the flag would be set and the provisioning would occur. The presence of Serial commands wouldn't/shouldn't interfere with that.

If anyone is inclined to try it, the full source is available online, along with make targets to build and flash it. For instance, make bilgesensor_program will compile/flash Xenon A and make nexus_program will compile/flash Xenon B. And if you're on macOS, I've also got a nice Python routine to pick out the serial port so that it's not always jumping around.

I'll be happy to walk anyone through it, and in fact would love to have people give it a shot. Despite having 6 people contributing to the open-source project, I'm the only one doing firmware and so my coverage for other computers/OSes is severely lacking.

I know this is getting offtopic, but I've given that a lot of thought over the years and I've concluded that String is just too useful to pass up. It has a deterministic and controllable behavior, but abusing it will result in non-deterministic results. In a similar way, C-style char arrays also suffer from a defect, and IMO buffer overflow errors are just as hard to diagnose as memory fragmentation.

My middle ground has become String.reserve() as a way to control the likelihood of memory fragmentation, while preventing segfaults because I miscounted by a single byte. Strings just relieve a ton of development effort all while greatly enhancing readability, and I'm prepared to pay the price of being exactingly careful with heap allocation.
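
As an illustration of the reserve-up-front idea, here is a small sketch that uses std::string as a stand-in for the Wiring String class, so this is an analogy and not the Device OS API. buildWithoutRealloc is a hypothetical helper. It reserves once, appends in place, and reports whether the underlying buffer stayed where it was.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds prefix + id inside a buffer that is reserved once up front.
// Returns true if the appends did not trigger a reallocation.
bool buildWithoutRealloc(std::string& buf, std::size_t cap,
                         const char* prefix, const char* id) {
    assert(prefix != nullptr && id != nullptr);
    buf.clear();
    buf.reserve(cap);                  // one allocation, sized for the worst case
    const char* before = buf.data();
    buf += prefix;                     // append in place; `prefix + id` would
    buf += id;                         // create a temporary string instead
    return buf.data() == before;       // same buffer means no reallocation
}
```

The point of appending with += rather than building the value with + is that the intermediate results never need their own heap blocks.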

@kubark42, the lack of mesh diagnostics (including a sniffer for example) was one of those indicators that told me mesh was in trouble. Once a message goes into the “pipe” there is no way to know where it is until it successfully arrives at its destination.

I believe there are other threads dealing with seemingly dropped messages though I don’t recall what their outcomes were.

As for using String (yup, still off topic), though using .reserve does help, you haven’t accounted for all the temporary string products that are generated during a set of operations. Unlike c-strings, which can be allocated globally (static storage) or locally (stack), Strings (and their temporary byproducts) always keep their character data on the heap.

In regards to c-string index overflow, most String operations can be replaced with the bounds-limiting snprintf() function which will never go out-of-bounds. Also, small c-strings can be created locally (on the stack), are ephemeral and won’t “dirty” the heap.
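
A minimal sketch of what that looks like, using the log line from earlier in the thread as the example. formatProvisioningLog is a hypothetical helper name. The return value of snprintf() is the length the full output would have had, so comparing it with the buffer size tells you whether the text was truncated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Formats the log line into a caller-supplied buffer (stack or static).
// Never writes past outSize bytes and always null-terminates.
// Returns true if the whole message fit, false if it was truncated.
bool formatProvisioningLog(char* out, std::size_t outSize, const char* topic) {
    assert(out != nullptr && outSize > 0);
    int n = std::snprintf(out, outSize,
                          "[MESH][PROVISIONING] request provisioning from `%s`",
                          topic);
    return n >= 0 && static_cast<std::size_t>(n) < outSize;
}
```

A too-small buffer gives you a shortened message instead of a memory overwrite, and the caller can detect it from the return value.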

The String object was created in the early Arduino UNO world to simplify things for learning programmers. As the embedded world has now evolved to use RTOS, OS and other tools, the String object has become a source of uncertainty. If you dig into Arduino and other forums, you will see the warnings. So for me, the price of reliability far outweighs the use of String.

Yeah, alas, you're spot on. The writing was on the wall if even after a year we still didn't know how to do low-level debugging. It doesn't help that the processor segfaults when you use hardware debugging!

I went hunting but never found forum posts which jumped out at me as describing a similar problem. If you come across any, would you mind updating this thread? I figure it's one of the last chances we have to fully document mesh's behavior before Particle closes the book on it.

(This would be an amazing conversation to continue in another thread, because I agree with certain points and disagree with others. After a lot of digging, and realizing that the warnings are derived from specific sets of predictable (and admittedly common) circumstances, I've concluded that using the String class is like putting metal in the microwave: it's absolutely 100% okay to do routinely and safely.

The problem is that describing the edge cases to people can be more effort than it's worth, and because there's a decent alternative it's easy just to tell people never to do it.)
