UDP problem 0.4.9

Hi,

0.4.9 is generally FAR better than its predecessors, faster and more stable.

I am however still experiencing UDP failures - which are NOT being fixed by a re-initialisation of my UDP.

My code sends UDP packets out (approx every 10 seconds at the moment), and I am using Serial1 as a debug port (when running locally).

I currently have 2 photons running an app now rebuilt on 0.4.9. One local and one a LONG way away. Both seem to exhibit the same issue (although I cannot see the remote serial port :-(.

The system runs for days or hours, then starts returning UDP error -26 (no socket). When I detect ANY failure - I force a UDP reset -

static UDP Udp;
void MyUdp::Reset()
{
  Udp.stop();
  delay(1000);
  Init();
}

void MyUdp::Init()
{
  Serial1.print("\r\nUdp.Init()\r\n");

//  if (localPort != 0)
//    return;

  localPort = 56100;
  destIP = -1;

  Udp.begin(localPort);
  udpMsgcount = 0;
}

The problem is that the fault persists, and the only way to get UDP working again seems to be a reboot :-((.

My send is quite simple

  sent = Udp.sendPacket(localbuf, strlen(localbuf), ipad, port);

NB I check packet len is not too big outside this code, and split it if it ever is.

sent is returning -26

NB I had seen this effect prior to 0.4.9 but was told (in earlier postings) that various UDP fixes were included in 0.4.9…

So - I only have one UDP object - defined statically, and accesses are controlled externally via a ‘state machine’ in the main loop.

Anyone like to comment on what (I guess) I am doing wrong please ???.

Thanks

Graham

Sorry forgot to say that everything else seems to be working just fine. Cloud is UP, publishes happening on a regular basis etc. I have a built-in reboot command I can send and that works perfectly well, reboots and the UDP issue ‘goes away’ again for quite some time (sometimes many hours, sometimes days :-O.

G

Hi @GrahamS

Have you checked your router logs for something that is logged around the same time as the failure? There are many potential failure points here that you can investigate.

It could be something simple like your Photon does not have its IP address after a DHCP hiccup. Have you tried increasing the one second delay after udp.stop() to be more like 60 seconds to allow for DHCP problems?

Hi @bko

As mentioned in my second post - everything else is running just fine. The cloud is connected (I can read variables, send commands, and it is ‘publishing’ my data) whilst UDP is reporting errors.

So yes - I am confident that this is not a network issue. It also happens on two distinctly different systems (5,000 or so miles apart ;-).

I ‘suspect’ (as I did some weeks back prior to 0.4.9) that the issue might be triggered by a loss of connectivity, but even when everything else has come back again - UDP stays reporting -26. When it happens, my code automatically does the Udp.Reset() call, and I can see it reported on the Serial port every time it tries to use UDP - until I issue a reboot (which is a hidden Cloud ‘function’).

Thanks

Graham

Does anyone know where the error codes are defined? I thought someone posted a list before Christmas but despite searching I cannot find the post, or the list of them. I’m running my Core on 0.4.7 at the moment and actually trying to get it to fail so that I can test how my code recovers from it. At the moment I close the socket after every few hundred sends, and reopen it. I’m sending a packet every second or so. My plan is to increment an error count when the response from udp.* isn’t what I expect and automatically reboot it when it exceeds a threshold.
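The error-count plan above could be sketched roughly as follows. This is a minimal host-side sketch, not anyone's actual firmware: the class name SendFailureCounter and its methods are hypothetical, and it assumes the count resets after a successful send (the post does not say whether it should).

```cpp
#include <cstdint>

// Hypothetical helper (not part of the Particle API): counts consecutive
// failed sends and reports when the count has gone past a threshold.
class SendFailureCounter {
public:
    explicit SendFailureCounter(uint32_t threshold)
        : threshold_(threshold), failures_(0) {}

    // Record the outcome of one send. `result` is whatever the send call
    // returned; `expected` is the number of bytes that should have gone out.
    // Returns true once the number of consecutive failures exceeds the
    // threshold, i.e. the point at which the caller may want to reboot.
    bool record(int result, int expected) {
        if (result == expected) {
            failures_ = 0;   // assumption: any success clears the count
        } else {
            ++failures_;
        }
        return failures_ > threshold_;
    }

    uint32_t failures() const { return failures_; }

private:
    uint32_t threshold_;
    uint32_t failures_;
};
```

On the device, a true return from record() would be the place to call System.reset(); whether a reboot is acceptable depends on the application.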

I assumed that ‘no socket’ meant the socket had closed. On my Core I get -1 from sendPacket if I’ve executed udp.stop() just prior to it. I’m wondering what -26 means, assuming that the response codes are the same on the Core and the Photon.

I found the post. It's here.

The response codes are not the same between the core and the photon since they are defined by the network stack, which is different on each device.

Hi @timx,

Glad you found the codes and my earlier posts on this issue ;-).

-26 is the only error code I have (yet) seen from my UDP socket. It was present in 0.4.7 and seems to still be there in 0.4.9. As mentioned, mine fails after some hours - of two operational systems, both rebooted yesterday about 22 hours ago. One is still running just fine - the other stopped (sending UDP) after 11 hours or so.

My code calls the udp begin function whenever it fails, so will have been calling that for 11 hours already :open_mouth: - every 10 seconds :-O.

Other cloud functionality is fine (so cloud is UP and not dropped since last rebooted!! - I created a Variable and set it with the date/time when the cloud comes up again after a failure) and is publishing to Librato every 10 seconds as well. So this is pretty much a UDP only issue as far as I can see…

Guess I might try using a dynamic UDP socket (it's a single static at the moment), then killing and re-creating it after a failure :open_mouth: - see if that works…well -26 says INVALID_SOCKET so something must be killing it :-O.

Difficult to track as it takes many hours between random failures with an unknown cause…

Graham

Hi,

Well, two systems have now been running for a while with the new code - which 'delete's then 'new's the UDP object if it fails. Seems to be more stable so far…Pity I am only monitoring the serial port and not actually logging, as I would expect to miss any UDP fail events :-(.

On another (might be related) issue. The same test system appears to be fully operational, UDP sending every 10 seconds, variables respond to ‘get’ requests - BUT - the particle cloud seems to think that the local device is off-line :-O.

particle list says it's off-line
build.io says it's off-line
particle dev environment says it's offline (well not breathing) in its dropdown list BUT if I select the device, it shows in the bottom line as breathing :-O.
If I send a variable request it responds as I would expect it to :-O.

I suspect that the cloud has become confused :-O.

BR

Graham

This sounds like something I’ll need to be able to do in the future so I’ll try to set up a test rig this weekend and then a little longer to gather data; hopefully I’ll find something useful for solving the problem.

Hi,
Further feedback…
‘Touch wood’ these two systems are now WAY more stable, been running since 19th with code which deletes the UDP object on a fail code and creates a new one…

A single static object really ‘should’ be re-usable but hey ho…

Hope this is useful to someone out there ;-))

Graham

I ran my test. I was sending 512 bytes of UDP data from the Photon every 15 seconds. Each packet started with a byte value that increased each time, so I could tell if a packet was lost or corrupted. The server kept track of the time each packet was expected. The server also had a serial connection to the Photon, so it could see the serial debug output in real time and keep track of whether the Photon rebooted (it never did). It checked a cloud variable every minute as well. A big log file was the result.
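For anyone building a similar rig, the server-side check described above might look something like this. It is only a sketch under assumptions: the names SequenceChecker and payloadIntact are made up here, and it assumes every byte of the payload counts up from the first byte, wrapping at 256, which is how the sender code later in this post fills its buffer.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if every byte continues the counting pattern started by the
// first byte (data[i] == data[0] + i, modulo 256).
bool payloadIntact(const uint8_t* data, size_t len) {
    for (size_t i = 1; i < len; ++i) {
        if (data[i] != static_cast<uint8_t>(data[0] + i)) {
            return false;
        }
    }
    return true;
}

// Hypothetical receiver-side tracker for the leading byte of each packet.
class SequenceChecker {
public:
    // Feed the first byte of each received packet. Returns how many packets
    // appear to have been skipped since the previous one (0 = in sequence).
    // Limitation: a duplicate, a reordered packet, or a gap of 256 or more
    // cannot be told apart with a one-byte counter.
    unsigned onPacket(uint8_t first) {
        unsigned lost = 0;
        if (haveLast_) {
            // The cast wraps modulo 256, matching the sender's counter.
            lost = static_cast<uint8_t>(first - last_ - 1);
        }
        last_ = first;
        haveLast_ = true;
        totalLost_ += lost;
        return lost;
    }

    unsigned totalLost() const { return totalLost_; }

private:
    bool haveLast_ = false;
    uint8_t last_ = 0;
    unsigned totalLost_ = 0;
};
```

Combining this with the expected arrival time (one packet every 15 seconds in this test) helps separate a genuinely lost packet from one that is simply late.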

I was able to reproduce this problem once after about 18 hours. I got error -26 on the Photon for every udp.sendPacket(). After the error, I called stop() and then begin() on the UDP object (code below).

There were zero errors of any kind until that point, but once I got the -26, stop/begin were not sufficient to clear the error condition. I left it in the error state for about an hour and it was stuck. Though the cloud was still connected. Not only was the status LED breathing cyan (mostly), but I was polling a variable and it was updating as expected, even though udp.sendPacket was failing.

From looking at the Particle source I wanted to see if udp.begin() was returning true, as I had previously been ignoring the result code. I modified my source to check that, and also Network.ready().

I loaded up the new code and started the test again. This time I got a failure after 6 minutes! And udp.begin() was returning true (success).

Curiously, this time it started to work again after about 45 minutes. There were a few more -26 errors, but then things stabilized. In fact, I got only 6 more errors, lasting only a single sendPacket each, in the next 48 hours.

I don’t know what this means. I was able to reproduce the problem, but not reliably enough that I would be particularly sure any fix really fixes the problem. Also, I reached a dead-end examining the source code because eventually it drops into the WICED code which I don’t have access to.

I didn’t try deleting the UDP object. But since I also notice the status LED sometimes blinking cyan instead of breathing when it fails, I think I would be inclined to actually reboot the Photon from software on a -26 error for now, at least with my code, which doesn’t mind the few seconds it takes to reboot.

SYSTEM_MODE(AUTOMATIC);
SYSTEM_THREAD(DISABLED);

UDP udp;
IPAddress remoteIP(192,168,2,6);
int remotePort = 7123;
const size_t bufferSize = 512;
unsigned char buffer[bufferSize];
unsigned char startValue = 0;
int cloudVal = 0;

void setup() {
    Serial.begin(9600);
    Particle.variable("val", cloudVal);
    uint8_t res = udp.begin(0);
    Serial.printlnf("udp.begin returned %d", (int)res);
}

void loop() {

    unsigned char value = startValue++;
    cloudVal = (int)startValue;

    for(int ii = 0; ii < bufferSize; ii++) {
        buffer[ii] = value++;
    }
    int result = udp.sendPacket(buffer, bufferSize, remoteIP, remotePort);
    if (result == bufferSize) {
        Serial.printlnf("sent %d", (int)buffer[0]);
    }
    else {
        Serial.printlnf("error %d", result);
        udp.stop();
        delay(1000);
        uint8_t res = udp.begin(0);
        Serial.printlnf("udp.begin returned %d Network.ready=%d", (int)res, (int)Network.ready());
    }
    delay(15000);
}

@rickkas7
Just to chime in, I’ve been seeing the exact same issues since 0.4.7 as well and it’s been very hard to debug.
I haven’t been able to reproduce the condition that causes the issue, other than leaving the photon/P1 running.
I’m currently testing code that does a stop on any UDP.sendPacket() error ( result < 0 ) and will call begin only when WiFi.ready() == true.
Also I randomly assign the UDP socket port, as I’ve seen cases where a port number gets blocked and UDP.begin() will return 0/false.

Things seem stable so far, but I’ve found the issue is worse in areas with flakey WiFi.

@GrahamS
Glad to hear that solved the issue for you. Keep us posted :+1:
It may be the solution we’re looking for.

I have also seen UDP issues with the new firmware. I send packets of around 200 bytes every 10 seconds; that stalls after a while, but it may be hours to days before it stops sending.
Once it stops sending the device is still online, and you can flash new/same firmware to it to have it resume.

Wanted to update everyone on my findings. I’m still on 0.4.7 but I have found the following socket management keeps things running stable.

When sending a packet, close the port if there is ever a socket error. Re-open the socket if you need to send again and it’s closed. For example, a send function might contain the following:

int32_t err ;
uint16_t port ;
UDP my_sock ;
bool my_sock_open = false ;

if( my_sock_open == false ){
    err = my_sock.begin( port ) ;
    if( err > 0 ){
        my_sock_open = true ;
    }
}

if( my_sock_open == true ){
    err = my_sock.sendPacket( packet_buffer , packet_length , target , port ) ;
    if( err < 0 ){
        my_sock.stop() ;
        my_sock_open = false; 
    }
}

I also increment my port number randomly every time I have to reopen it, and only call my_sock.begin() if WiFi.ready() == true. The only downside is that you can’t listen on the socket after you close it, so you may lose some incoming messages, but UDP protocols should be able to handle that.
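Putting the pieces above together (close on error, reopen on demand, change the local port on each reopen, only reopen when the network is ready), a host-testable sketch might look like this. The names ReopeningSender and FakeSocket are invented for illustration, the port is stepped by one through a small range instead of being randomised, and kNotSent is a made-up return code; on a device the template would be instantiated with the real UDP and IPAddress types and networkReady would come from WiFi.ready().

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for the device's UDP class so the sketch runs on a desktop.
// Hypothetical: only the three calls used below are imitated.
struct FakeSocket {
    int failNextSends = 0;   // how many upcoming sends should fail with -26
    int beginCalls = 0;
    uint16_t lastPort = 0;

    uint8_t begin(uint16_t port) { ++beginCalls; lastPort = port; return 1; }
    void stop() {}
    int sendPacket(const uint8_t*, size_t len, uint32_t, uint16_t) {
        if (failNextSends > 0) { --failNextSends; return -26; }
        return static_cast<int>(len);
    }
};

template <typename Socket, typename Address>
class ReopeningSender {
public:
    enum { kNotSent = -1 };  // hypothetical "nothing was attempted" code

    ReopeningSender(Socket& sock, uint16_t basePort, uint16_t portSpan)
        : sock_(sock), basePort_(basePort),
          portSpan_(portSpan == 0 ? 1 : portSpan), offset_(0), open_(false) {}

    // Returns the socket's own result, or kNotSent if the socket could not
    // be (re)opened. Caller keeps basePort + portSpan within 65535.
    int send(const uint8_t* buf, size_t len, Address target,
             uint16_t targetPort, bool networkReady) {
        if (!open_) {
            if (!networkReady) return kNotSent;   // do not begin() yet
            if (sock_.begin(static_cast<uint16_t>(basePort_ + offset_)) == 0) {
                nextPort();                       // this port may be blocked
                return kNotSent;
            }
            open_ = true;
        }
        int err = sock_.sendPacket(buf, len, target, targetPort);
        if (err < 0) {                            // any socket error: close
            sock_.stop();
            open_ = false;
            nextPort();                           // reopen on a different port
        }
        return err;
    }

private:
    void nextPort() {
        offset_ = static_cast<uint16_t>((offset_ + 1) % portSpan_);
    }

    Socket& sock_;
    uint16_t basePort_;
    uint16_t portSpan_;
    uint16_t offset_;
    bool open_;
};
```

The failed packet itself is not retried here; the next call to send() reopens the socket, which matches the "re-open if you need to send again" approach described above.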

Hope the paraphrased code above makes sense. I’ve had several test units running stable on this for a few days now. I’m testing on a dd-wrt router that resets itself every couple of minutes and has a few scheduled blocks of time where the wifi access gets cut off, and things always come back online and send udp messages to my remote server.

Hope this helps :smile:

Guys,
Sorry for no response from me - been away from my office for a few days, but in the meantime, my system out in Florida has been running just fine now since Feb 23rd (when I updated the code). If it IS experiencing UDP failures, then it is recovering gracefully, such that unless I add extra code to report failures (difficult with no UDP and being 5000+ miles away ;)) - I have no way of knowing.

To me - the fact that it’s running and probably cleaning itself up is good enough until the underlying bug - if there is one :wink: - is fixed.

      udpOK &= controller.sendRules();
      udpOK &= controller.sendDiags(cloud_reboot, cloud_ver);
      udpOK &= controller.sendNetworkData();
      if (udpOK != true)
      {
        Serial1.print("\r\nUDP Fail - resetting...");
        controller.udpReset();
      }

controller.sendXXX is a UDP send operation

static UDP* pUdp = NULL;

void MyUdp::Reset()
{
  if (pUdp != NULL)
  {
    pUdp->stop();
    delete pUdp;   // allocated with plain 'new', so plain 'delete' (not 'delete[]')
    pUdp = NULL;
  }
  delay(1000);
  Init();
}

void MyUdp::Init()
{
  Serial1.print("\r\nUdp.Init()\r\n");

  localPort = 56100;
  destIP = -1;
  pUdp = new UDP;
  if (pUdp != NULL)
  {
    pUdp->begin(localPort);
  }
  udpMsgcount = 0;
}
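For what it's worth, the same destroy-and-recreate idea can be written with std::unique_ptr so the old object is always released exactly once. This is only a host-side sketch under assumptions: RecreatingSocket and CountingSocket are invented names, CountingSocket is a stand-in for the real UDP class, the one second delay is left out, and I have not checked how well std::unique_ptr is supported by the toolchain for any particular firmware version.

```cpp
#include <cstdint>
#include <memory>

int g_liveSockets = 0;  // test aid: number of socket objects currently alive

// Stand-in for the device's UDP class (hypothetical, host-only).
struct CountingSocket {
    CountingSocket() { ++g_liveSockets; }
    ~CountingSocket() { --g_liveSockets; }
    uint8_t begin(uint16_t) { return 1; }
    void stop() {}
};

// Hypothetical wrapper: owns one socket and swaps in a brand-new instance
// whenever reset() is called.
template <typename Socket>
class RecreatingSocket {
public:
    explicit RecreatingSocket(uint16_t localPort) : localPort_(localPort) {
        reset();
    }

    // Stop and destroy the current object (if any), then construct a fresh
    // one and begin() it. Returns true if begin() reported success.
    bool reset() {
        if (sock_) {
            sock_->stop();
        }
        sock_.reset(new Socket());  // the previous object is deleted here
        return sock_->begin(localPort_) != 0;
    }

    Socket& get() { return *sock_; }

private:
    uint16_t localPort_;
    std::unique_ptr<Socket> sock_;
};
```

The send code would call get() for each packet and reset() after any error, much like controller.udpReset() in the snippet above.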

Hope this is useful…

BR

Graham

Was this problem solved in newer versions or is it still present? I don’t seem to read anything about this issue in the firmware bugfixes after 0.4.9.

Do you mind sharing your UDP code? I’m trying to implement a reliable UDP connection as well. I already did a flow for the WiFi connection, but I’d like to do something like this for UDP as well.

I have a stress test program for TCP (Sending large data off Photon using TCPClient), but I just realized I don’t have one for UDP. It will take a few days to set one up and run it, but it seems like a good thing to be able to test.
