UDP Broadcast listeners stop listening over time

patrickcentral · December 28, 2017, 12:49am

Hi All, bear with me, this is a long one-

I have a couple of Core and Photon devices driving custom lighting fixtures I've made for my home.

For a long time, I've simply made multiple Particle API calls through either an Alexa skill or web-based controller UI, and it works pretty well. Color/mode changes happen at roughly the same time, and the code isn't insanely messy, though it's not pretty either.

I wanted to be able to do some more precision orchestration/transitions, as well as be able to demo this stuff in other locations, so I thought I'd build a "hub" of sorts that served as a single web connection, and controlled the rest of the devices locally.

So, I built a controller called "MCP" which sends UDP broadcast packets to a certain port, and then got all my existing devices (and some simple testers) to listen to that port.

Here's what works:
Photon receives web function call, sends UDP packet to 192.168.1.255:11111 (Broadcast)
UDP broadcast packet is observed in Wireshark.
Cores/Photons listen for the packet and obey correctly... for a while.

Here's what doesn't:
Somewhere between 2 and 12 hours after all devices are freshly reset and reconnected to the network (seriously, I've tried to narrow it down, but it has sometimes worked fine for 10+ hours), I start losing listeners, and it gets worse over time.

Example:
3 hours after resets: 4/4 devices obeying the broadcast packets, instantaneous switching, exactly what I want!
3-6 hours after resets: 3/4 devices obeying - the 1 that has dropped off still responds to direct web functions, hooks, subscribe events, and code flashes, so it's running just fine, it has just stopped listening to UDP
6-12 hours after resets: 2/4 devices obeying - same as the last one, the 2 devices that no longer follow the leader are working just fine in every other respect, they just stop hearing the UDP
12-14 hours after resets: 1/4 devices obeying.

I've not tested to see if the final one falls, because of the amount of time involved, and because I keep wanting to try fixes.

Some of the fixes I have tried:
Making sure the Controller and the Devices are bound to different UDP ports. This extended the dropoff from 15 minutes to 10+ hours.
"Kicking" the UDP server every time a command packet is received, i.e., running a function that calls Udp.stop(), waits a moment, then calls Udp.begin(), thus restarting the UDP listener. (adapted from UDP listening stop after some time - #2 by dermotos)

This last "kicking" part actually works quite well - as long as I am sending commands frequently, everything works perfectly for as long as I have tried it. Similarly, if one listener drops off, and I "kick" that particular device via an Event in the Console (no reset, no flashing, just have it run stop and begin) it rejoins the flock perfectly after that. So it's definitely something with prolonged listening to UDP, and can be remedied by destroying and restarting the UDP listener, without any harm to the other device functions. The kick process even seems pretty fast.

The hiccup comes when I go to bed, and then work, and no commands are sent for a while. It's annoying to have to kick all the devices manually in order to restore desired behavior.

Now, I could set up a timed watchdog that just periodically runs the Kick function, and I am actually very confident this would work, though it could create scenarios where it's kicking right as I am trying to give it a command, which I am not wild about, and it also just seems very sloppy, like a bucket on the floor instead of fixing the leak.
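
One way to reduce the chance of a kick colliding with a command is to kick only after the listener has been silent for a while, rather than on a fixed schedule. The following is a minimal sketch of that idea, not tested on a device. The struct name IdleKickTimer and its members are hypothetical, and the interval is whatever you choose.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the Particle API): decides when the
// UDP listener has been idle long enough to justify a stop()/begin() kick.
struct IdleKickTimer {
    uint32_t intervalMs;      // how long the listener may stay silent
    uint32_t lastActivityMs;  // millis() value at the last packet or kick

    // Call whenever a packet arrives or a kick has just been performed.
    void noteActivity(uint32_t nowMs) { lastActivityMs = nowMs; }

    // True once at least intervalMs has passed since the last activity.
    // Unsigned subtraction keeps this correct across millis() rollover.
    bool shouldKick(uint32_t nowMs) const {
        return static_cast<uint32_t>(nowMs - lastActivityMs) >= intervalMs;
    }
};
```

In loop() this would be used roughly as: if timer.shouldKick(millis()) is true, call kickUDP() and then timer.noteActivity(millis()); also call noteActivity from checkNetwork() whenever a packet is handled. A kick can still coincide with an incoming command, but only after a long quiet period.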

My hope is that one of you can tell me why UDP listeners stop listening after some amount of time, and I can fix it in a more elegant way. I've tried to do as much troubleshooting on my own to narrow the problem scope. I don't know if listening to UDP chews memory in some way that the Stop/Begin frees up, and if there is any way to stop that from happening. I notice that the UDP.flush() method in the docs says it currently does nothing - is this the issue? Would flushing clear the memory in a less aggressive way than stop/begin?

Thanks in advance!

EDIT: Forgot to mention, these devices are on 0.6.2. If this is all fixed up in 0.7.x release candidate, I can try that.

Here's some of my code in case it helps:

MCP code:

RGBClass RGBLED;
char szArgs[13];
long lastChange;
int r = 0;
int g = 0;
int b = 0;
unsigned int localPort = 8888;
UDP Udp;
IPAddress remoteIP(192,168,1,255);
int remotePort = 11111;

void setup() {
    Udp.begin(localPort);
    RGBLED.control(true);
    RGBLED.color(255,255,255);
    lastChange = millis();
    Particle.function("setcolor",setRGB);
}

int setRGB(String args) {
    //communicate to devices that it's a color
    args.toCharArray(szArgs, 12);
    sscanf(szArgs, "%d,%d,%d", &r, &g, &b);
    RGBLED.color(r,g,b);
    lastChange = millis();
    //broadcast a command
    Particle.publish("MCP-commands", NULL, 60, PRIVATE);
    Udp.sendPacket((const char *)szArgs, sizeof(szArgs), remoteIP, remotePort);
    return 200;
}

void loop() {
    // Elapsed-time subtraction stays correct when millis() rolls over
    if (millis() - (unsigned long)lastChange > 4000UL) {
        RGBLED.color(255,255,255);
    }
}

Listener code:

RGBClass RGBLED;
char szArgs[13];
long lastChange;
int r = 0;
int g = 0;
int b = 0;
unsigned int localPort = 11111;
UDP Udp;
IPAddress remoteIP(192,168,1,255);

void setup() {
    Particle.subscribe("udpKick", udpKick);
    Udp.begin(localPort);
    RGBLED.control(true);
    RGBLED.color(255,255,255);
    lastChange = millis();
}

void loop() {
    // Check if data has been received
    checkNetwork();
    // Elapsed-time subtraction stays correct when millis() rolls over
    if (millis() - (unsigned long)lastChange > 4000UL) {
        RGBLED.color(255,255,255);
    }
}

void checkNetwork() {
    int size = Udp.parsePacket();
     if (size > 0) {
        char data[size + 1];
        Udp.read(data, size);
        data[size] = '\0'; // terminate before treating the payload as a C string
        Particle.publish("Sark-obeys", String(data), 60, PRIVATE);
        sscanf(data, "%d,%d,%d", &r, &g, &b);
        RGBLED.color(r,g,b);
        lastChange = millis();
        Udp.flush();
        kickUDP();
    }
    delay(1);
}

void udpKick(const char *event, const char *data) {
    kickUDP();
}

void kickUDP() {
    Udp.stop();
    delay(100);
    Udp.begin(localPort);
    Particle.publish("listener-kicked", String("KICK IT"), 60, PRIVATE);
}
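
The listener above hands the raw packet straight to sscanf and applies whatever comes out. A slightly more defensive parse could look like the sketch below. It assumes the payload is the "r,g,b" text the MCP sends; the function name parseColor and the 0..255 range check are illustrative and not part of the original code.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper: parses an "r,g,b" payload of `len` bytes.
// Returns false and leaves r, g, b untouched unless exactly three
// values in the range 0..255 are found.
bool parseColor(const char* payload, std::size_t len, int& r, int& g, int& b) {
    char text[16];  // "255,255,255" plus terminator fits; longer input is truncated
    if (len >= sizeof(text)) len = sizeof(text) - 1;
    std::memcpy(text, payload, len);
    text[len] = '\0';  // UDP payloads are not guaranteed to be terminated

    int pr, pg, pb;
    if (std::sscanf(text, "%d,%d,%d", &pr, &pg, &pb) != 3) return false;
    if (pr < 0 || pr > 255 || pg < 0 || pg > 255 || pb < 0 || pb > 255) return false;

    r = pr; g = pg; b = pb;
    return true;
}
```

In checkNetwork() the call would be something like parseColor(data, size, r, g, b), updating the LED only when it returns true.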

rickkas7 · December 28, 2017, 5:08pm

There are a few things I would do:

First, switch to using UDP.receivePacket instead of parsePacket and read. receivePacket directly returns an error even when there’s no data, which makes it much better for detecting when a problem has occurred.

Second, if you use UDP listeners you basically have to use SYSTEM_THREAD(ENABLED) and take note of when WiFi becomes disconnected (WiFi.ready() returns false) from loop.

The reason is that when WiFi disconnects, all of the listeners are removed, and are not re-created for you. So when you detect that you disconnect, set a flag and then when you reconnect, do your UDP.begin() calls again.

I find if I do those two things it’s reliable.
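
The "set a flag on disconnect, call begin again on reconnect" idea can be sketched as a small state tracker. This is only an illustration of the logic described above, under the assumption that it is fed the value of WiFi.ready() once per pass through loop(). The class name ListenerGuard is hypothetical.

```cpp
// Hypothetical helper (not part of the Particle API): tracks WiFi.ready()
// transitions so the sketch knows when the UDP listener must be re-created.
class ListenerGuard {
public:
    // Feed the current WiFi.ready() value once per loop() pass.
    // Returns true exactly once per (re)connect: on the first pass where
    // Wi-Fi is ready after it was last seen not ready, including at boot.
    bool needsBegin(bool wifiReady) {
        if (!wifiReady) {
            lost_ = true;   // listeners are gone once Wi-Fi drops
            return false;
        }
        if (lost_) {
            lost_ = false;  // caller is expected to call Udp.begin() now
            return true;
        }
        return false;
    }

private:
    bool lost_ = true;      // treat startup as "listener not yet created"
};
```

With SYSTEM_THREAD(ENABLED), loop() would then do something like: if guard.needsBegin(WiFi.ready()) returns true, call Udp.begin(localPort). Because the flag starts set, the same path also covers the very first connection.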

patrickcentral · December 28, 2017, 10:30pm

Thanks, I’ll try those things and report back.

Couple follow up questions:

Do you have some example code for receivePacket? It’s mentioned in the docs, but doesn’t have its own section detailing what args it expects, etc. I’m working off an Arduino example I found right now, and hoping it’s the same basic syntax.

Are there WiFi disconnects that don’t show in the Console? I am assuming that if it hiccups and connects again within a short time, this would not show up in Console, but I have one device (not one of the ones related to this post) that is constantly disconnecting and reconnecting and it still shows in the Console.

I definitely want to add some debug logging to see if it is that kind of network hiccup that is causing the listeners to stop, this is a really good lead - thanks!

ScruffR · December 29, 2017, 6:38am

I can't even see it in the docs
There is a pull-request on GitHub you can look at tho'
Added documentation for UDP.receivePacket() by Walter-Stroebel · Pull Request #706 · particle-iot/docs · GitHub
(Update: I just confirmed the merge for this pull request, so it should show up in the docs soon)

Short answer: Yes.
Since a disconnect as such won't be monitored by the cloud, just the missing "regular" ping which would need some time to get noticed.
So if the outage is too short the cloud would never know.

patrickcentral · December 29, 2017, 7:44am

It's mentioned in the doc entries for .remoteIP() and .remotePort(), but it was a little perplexing because those entries explain how they are used with receivePacket, which had no entry of its own.

Thanks for the quick add! Makes implementing this change much easier for me.

The connection info makes sense, as I'm hoping that ping is not aggressive enough to notice a 20-30 second gap. While I have your attention, I would love a filter on the console so I can tell the connect/disconnect messages to STFU while I am working on debugging other devices. I think the device that is spamming my console must be just on the outskirts of a good wifi signal: it never fails to do anything I've asked, but it is connecting and disconnecting a ton.

Back to my original problem, for the time being, I am trying out the watchdog approach rather than try to work in a whole bunch more logic that manages the connection hiccups manually. So far, so good. If this proves stable I'll implement it on connection drops vs my own recurring timer.
