[v0.4.8-rc1 / v0.4.9] WiFi Reconnection Issue

OK - I have added:

if (!WiFi.connecting()) { WiFi.connect(); }

Memory still drops… although it does appear to recover after some time… it’s a bit odd.

On my test photon with ONLY this code, the photon stays active despite the memory ups and downs. Not sure what else in my code on the other units could be causing the lock-up, but it appears to be linked in some way (maybe…)

There’s nothing too unusual in my full code.
There are some temperature readings from a DS18xxx, a ten-second log via httpclient to a flask instance (local networking) and a one-minute log to the Particle Cloud. There is also a watchdog to reset the photon if the main loop hasn’t run for 15 seconds.
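
For illustration, a rough sketch of that timing structure might look like the following; readTemperatures(), logToFlask() and the event name are hypothetical stand-ins, not my actual code:

unsigned long lastLocalLog = 0;
unsigned long lastCloudLog = 0;

float readTemperatures() { return 21.5; }           // stand-in for the DS18xxx read
void logToFlask(float tempC) { /* HTTP POST to the local flask instance */ }

void setup() {
}

void loop() {
    float tempC = readTemperatures();

    if (millis() - lastLocalLog >= 10000) {         // ten-second local log
        logToFlask(tempC);
        lastLocalLog = millis();
    }

    if (Particle.connected() && millis() - lastCloudLog >= 60000) {   // one-minute cloud log
        Particle.publish("temperature", String(tempC), 60, PRIVATE);
        lastCloudLog = millis();
    }

    // (The 15 s loop watchdog from the description would be kicked here.)
}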

This behaviour is not too surprising, since the "reclaiming" of previously freed space is done asynchronously whenever there is enough idle time, and since traversing the heap map to calculate free memory is time-consuming, it isn't done continuously either.


OK. Is there any way to trigger the reclaim (or encourage it at least)?
How long would it typically run for?

The asynchronous nature is because the memory for each function call request is allocated by the application thread and then disposed of by the system thread, once the system thread has pulled the request from the queue and executed it. While the system thread is blocked, e.g. waiting for WiFi to connect, it’s not servicing requests. There’s a bounded limit on the number of outstanding messages, so function call requests will not keep being pushed onto the system thread queue without limit.
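
To make the shape of that concrete, here’s a small standard-C++ sketch of the bounded producer/consumer pattern being described - an illustration only, not the actual Device OS code. The producer ("application") allocates each request and blocks once the queue is full; the consumer ("system") only frees them when it gets around to servicing the queue:

#include <chrono>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

struct Request { int id; };

std::queue<std::unique_ptr<Request>> q;
std::mutex m;
std::condition_variable cv;
const std::size_t kMaxOutstanding = 5;   // bounded limit on outstanding requests

void applicationThread() {
    for (int i = 0; i < 20; ++i) {
        auto req = std::unique_ptr<Request>(new Request{i});  // memory allocated by the "application"
        std::unique_lock<std::mutex> lock(m);
        // Blocks while the queue is full, i.e. while the "system thread" isn't servicing it.
        cv.wait(lock, [] { return q.size() < kMaxOutstanding; });
        q.push(std::move(req));
        cv.notify_all();
    }
}

void systemThread() {
    // Simulate the system thread being blocked for a while, e.g. waiting for WiFi.
    std::this_thread::sleep_for(std::chrono::seconds(2));
    for (int i = 0; i < 20; ++i) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !q.empty(); });
        std::unique_ptr<Request> req = std::move(q.front());
        q.pop();
        cv.notify_all();
        // req goes out of scope here: its memory is only freed once the
        // "system thread" gets around to servicing the request.
    }
}

int main() {
    std::thread app(applicationThread);
    std::thread sys(systemThread);
    app.join();
    sys.join();
    return 0;
}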

I’ve not looked into the details of the memory allocator, but I’m certain it operates synchronously - there is no background cleanup thread. https://github.com/32bitmicro/newlib-nano-2/blob/0c5e24765fb745dc7c59f00248680c22357ffd55/newlib/libc/stdlib/mallocr.c


That’s interesting… so, in theory, could overly hassling the photon to connect in fact block vital operations?

I’m trying to figure out how to maintain bomb-proof execution under bad conditions…

I don’t think it will block vital operations - the system thread is blocked once WiFi goes down, but the application thread will keep running, so long as you don’t keep pushing requests onto the queue. To have code run completely independently of the system, don’t call any system APIs.

It shouldn’t be necessary to call WiFi.connect() and similar functions if you call Particle.connect() in setup. The system will then endeavor to keep WiFi and the cloud connected without any prompting from the application.
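
In other words, something like this minimal sketch (illustrative only, not code from this thread):

SYSTEM_MODE(SEMI_AUTOMATIC);
SYSTEM_THREAD(ENABLED);

void setup() {
    Particle.connect();   // after this, the system thread manages WiFi/cloud reconnection itself
}

void loop() {
    // Do the application work; guard cloud-dependent calls instead of driving reconnection manually.
    if (Particle.connected()) {
        // Particle.publish(...), etc.
    }
}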

Hey guys, I tried this code, rebooted the router 10 times, and each time the photon recovered. I saw the free memory decrease from ca. 60,000 bytes to 58,000, and it eventually recovered that memory when the WiFi was restored. I left it disconnected for a longer period, and you can see the application thread slow down as it blocks waiting to push messages onto the queue, which aren’t delivered, but WiFi still recovered.

If anyone experiencing this issue could provide a small app and steps to reproduce it that would be a huge step towards us being able to address it.

Hey, I was reading what you said, @mdma:

It shouldn't be necessary to call WiFi.connect() and similar functions if you call Particle.connect() in setup.

So I came in this morning and ripped all the code out of my checkConnectionStatus() function and just left:

if (!Particle.connected()) {
    if (!cloudConnecting) {
        Serial.println("Connecting to cloud!");
        Status::SetDeviceStatus(DEVICE_CLOUD_CONNECTING);
        Serial.println("Particle.connect()");
        Particle.connect();
        cloudConnecting = true;
    }
} else {
    // We are connected; report the transition once and clear the flag.
    if (cloudConnecting) {
        Serial.println("Connected to cloud!");
    }
    cloudConnecting = false;
}

I just ran through 15 on/off cycles of the WiFi network (using v0.4.9) and everything seems to be working OK: no blocking of the system thread at all, and the memory seems to recover fine.

I am going to soak it overnight with a script to drop the network a few hundred times.


So, after the above, I have also stripped out loads of “stay alive” code.

My resulting code now appears to have been working brilliantly since last night, even with (purposely) rubbish signal strength. I will continue to monitor…

My basic pseudo-code now has:

System thread enabled & AUTOMATIC mode

  loop
  {
      do_stuff that doesn't require connectivity, whenever it's required
      if ( WiFi.ready() && time_to_do_something )
      {
          do_stuff that needs local networking
          if ( Particle.connected() )
          {
              do_stuff that needs the Particle Cloud
          } else {                              // Cloud not connected
              waitFor(Particle.connected, 8000)
          }
      } else {                                  // WiFi not ready
          if ( !WiFi.connecting() ) { WiFi.connect() }
      }
  }
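
As a concrete version of that pseudo-code, a hypothetical sketch could look like this; the do...Stuff() helpers and the 10 s interval are placeholders, not my real code:

SYSTEM_THREAD(ENABLED);   // AUTOMATIC system mode is the default

// Hypothetical placeholders for the real work in the pseudo-code above.
void doOfflineStuff()      { /* sensor reads, local state, etc. */ }
void doLocalNetworkStuff() { /* e.g. HTTP log to the local flask instance */ }
void doCloudStuff()        { /* e.g. Particle.publish(...) */ }

unsigned long lastRun = 0;
const unsigned long RUN_INTERVAL_MS = 10000;   // stand-in for "time_to_do_something"

void setup() {
}

void loop() {
    doOfflineStuff();                               // never needs connectivity

    if (WiFi.ready()) {
        if (millis() - lastRun > RUN_INTERVAL_MS) {
            doLocalNetworkStuff();                  // needs local networking only
            if (Particle.connected()) {
                doCloudStuff();                     // needs the Particle Cloud
            } else {
                waitFor(Particle.connected, 8000);  // give the cloud up to 8 s to come back
            }
            lastRun = millis();
        }
    } else {                                        // WiFi not ready
        if (!WiFi.connecting()) {
            WiFi.connect();
        }
    }
}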

I also use PhotonWdgs to ensure a hardware reset if things die… but I don’t think it’s being triggered much now.

As a side note, I am still trying to figure out exactly what issue causes/caused the lock-ups… if I can get a simple reproduction I’ll post it here.


So I ran this last night: I scripted a DD-WRT access point to take eth1 down (ifconfig eth1 down) for 60 seconds every 10 minutes.

What I found was that the photon reconnected successfully about 13 times, after which it failed to reconnect. The light was flashing green and my loop() code was ticking away, printing to the serial output, but there was no reconnection.

I am going to try and strip it down to a basic application.cpp and upload a replication.

It would be great to know if it does this in safe mode also.

Cool, I can run this as well now - I assume I can tell it’s connected in safe mode just via the breathing magenta?

Yep, that’s correct. Breathing magenta means it’s connected to the cloud.

So I stripped it down to a bare application.cpp and ran multiple re-connections as before. In true software bug fashion, it worked perfectly fine.

Colour me confused!

I am slowly putting bits of my application back in to see if I can pinpoint where it starts to show the behaviour again - anything glaringly obvious that might be causing this problem for me?

Nothing comes to mind. I’d like to try this test myself over the coming week. If you could post application code that definitely exhibits the issue, I will try to replicate it and then dive in. (I have the WICED sources, so I can see what’s going on in the networking stack.)

Cool - thanks for that! I should be able to share application code with you OK; I’ll play for another bit here to see if a pattern emerges. I’ll have something with you by the start of the week.

Thanks - no hurry from my side - please take the time you need!

OK… The following code does appear to eventually end up with a lock-up…

SYSTEM_THREAD(ENABLED);

bool debug_serial = true;
unsigned long last_reconnect = 0;
IPAddress remoteIP(192, 168, 2, 1);   // local gateway used as the ping target

void setup() {
    if (debug_serial) {
        Serial.begin(9600);
        delay(250);
        Serial.println("Starting.");
    }
    last_reconnect = millis();
}

void loop() {
    if (WiFi.ready())
    {
        // Connected: once a second, ping the gateway and report free memory and RSSI.
        if (millis() - last_reconnect > 1000)
        {
            int replies = WiFi.ping(remoteIP);
            Serial.print(Time.timeStr());
            Serial.print(" - Connected to WiFi. At least 1s since last check.");
            Serial.printf(" System Memory is: %d, RSSI is: %d. ", System.freeMemory(), WiFi.RSSI());
            if (replies == 5) { Serial.printlnf("Ping OK (%d).", replies); } else { Serial.printlnf("Ping Failed (%d).", replies); }
            last_reconnect = millis();
        }
    }
    else if (millis() - last_reconnect > 1000)
    {
        // Not connected: once a second, report free memory and request a reconnect.
        Serial.print(Time.timeStr());
        Serial.print(" - NOT connected to WiFi. At least 1s since last check.");
        Serial.printlnf(" System Memory is: %d", System.freeMemory());
        //if (!WiFi.connecting()) { WiFi.connect(); }
        WiFi.connect();
        last_reconnect = millis();
    }
}

@specialcircumstances - I have noticed lock-ups when calling WiFi.RSSI() - try your above example with it removed.
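
i.e. something like this in place of the existing status line, with just the RSSI call dropped (assuming that is the only change needed):

Serial.printf(" System Memory is: %d. ", System.freeMemory());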

OK can do.

Is there a bug open for the RSSI thing?