Algorithm for managing Wifi and Cloud connections

I’ve had mixed issues with trying to manage my wifi, cloud and mqtt connections manually. I’m hoping someone can share some thoughts or provide some advice. This code works fine under normal circumstances, but when the network gives out or I reboot the modem I often have to go to all my devices and force a reset. Managing the connection to handle outages has been frustrating so far.

The requirements for my device are that connections to the local Wifi and to my local MQTT broker are both required to run application code. The connection to the cloud is optional. I obviously want a cloud connection for flashing code and debugging tools, but if my home internet is down the devices should still operate on the LAN with no internet access needed.

To decouple the WiFi connection from the Cloud connection I’m going to use SYSTEM_MODE(MANUAL). To keep the Cloud connection from blocking the device I’m going to run SYSTEM_THREAD(ENABLED). Because my application code then runs in a separate thread, I’ll need to avoid long delays and instead use short delays in a loop while calling Particle.process().
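A minimal sketch of the "short delays in a loop" idea, written as plain C++ so it can be compiled off-device. pollUntil is a hypothetical helper, not part of the Particle API; on a Photon the service callback would wrap Particle.process() and the sleepMs callback would wrap delay().

```cpp
#include <cstdint>
#include <functional>

// Hypothetical helper (not a Particle API): poll `done` every `stepMs`
// until it returns true or roughly `timeoutMs` has been waited.
// `service` runs on every pass so background work keeps getting time.
// Assumption: on a device, `service` wraps Particle.process() and
// `sleepMs` wraps delay(); they are injected here for testability.
bool pollUntil(const std::function<bool()>& done,
               const std::function<void()>& service,
               const std::function<void(uint32_t)>& sleepMs,
               uint32_t timeoutMs,
               uint32_t stepMs = 50)
{
    if (stepMs == 0) stepMs = 1;               // guard against a zero-length step
    uint32_t waited = 0;
    while (!done()) {
        if (waited >= timeoutMs) return false; // gave up: timed out
        service();
        sleepMs(stepMs);
        waited += stepMs;
    }
    return true;                               // condition became true in time
}
```

On a device the WiFi wait might then read something like pollUntil([]{ return WiFi.ready(); }, []{ Particle.process(); }, [](uint32_t ms){ delay(ms); }, 60000), with the listening-mode check folded into the done condition.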

Algorithm Summary

  1. If the Wifi is in listening mode then just delay 50 ms, run Particle.process() and get out.
  2. If we don’t have a Wifi Connection then call connect and wait for up to 60 seconds.
  3. If Wifi failed to connect 60 consecutive times then reboot the device
  4. If we established a Wifi connection then call Particle.connect(), but only once every 30 seconds. And no penalty if it doesn’t work.
  5. Connect to the MQTT broker. If it fails then wait 60 seconds before trying again
  6. If MQTT failed to connect 60 consecutive times then reboot the device

Notes:
In all cases we only run the application code if we have Wifi & MQTT.
In any delay loop we run Particle.process() and also check if the device entered listening mode (hold setup for 3 seconds).
In one version of my code the connect call to the MQTT server blocked indefinitely. I think it had to do with running out of sockets. Any comments?
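The steps and notes above can be condensed into a few pure decision functions. This is only a sketch under assumptions: the names Step, LinkState, nextStep, cloudRetryDue and shouldReboot are made up for illustration, and the reboot check uses "60 or more" to match the summary wording rather than any particular implementation.

```cpp
#include <cstdint>

// What the main loop should do on this pass (hypothetical names).
enum class Step { ServiceOnly, ConnectWifi, ConnectMqtt, RunApp };

struct LinkState {
    bool listening;      // would come from WiFi.listening()
    bool wifiReady;      // would come from WiFi.ready()
    bool mqttConnected;  // would come from MQTT::isConnected()
};

// Steps 1, 2 and 5: listening mode wins, then WiFi, then MQTT.
// Application code only runs when WiFi and MQTT are both up.
Step nextStep(const LinkState& s)
{
    if (s.listening)      return Step::ServiceOnly;
    if (!s.wifiReady)     return Step::ConnectWifi;
    if (!s.mqttConnected) return Step::ConnectMqtt;
    return Step::RunApp;
}

// Step 4: the cloud is optional, so only ask for a new attempt when
// disconnected and the retry interval has passed. The unsigned
// subtraction stays correct when the millisecond counter wraps.
bool cloudRetryDue(bool cloudConnected, bool everTried,
                   uint32_t nowMs, uint32_t lastTryMs,
                   uint32_t intervalMs = 30000)
{
    if (cloudConnected) return false;
    if (!everTried)     return true;
    return static_cast<uint32_t>(nowMs - lastTryMs) > intervalMs;
}

// Steps 3 and 6: reboot after `limit` consecutive failures.
bool shouldReboot(unsigned consecutiveFailures, unsigned limit = 60)
{
    return consecutiveFailures >= limit;
}
```

Keeping the decisions free of hardware calls makes it possible to test the ordering on a desktop before flashing anything.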

The Code

unsigned long now = millis();
connectionState = (WiFi.ready() ? 1 : 0) + (MQTT::isConnected() ? 2 : 0) + (Particle.connected() ? 4 : 0);

// 5 seconds pass, log connection state
if(	lastConnectionLog == -1 || (now - lastConnectionLog) > 5000)
{
	lastConnectionLog = now;
	log("boot", "Connection State: ",  String(connectionState));
	if(WiFi.connecting())
		log("boot", "WiFi.connecting...");
	if(WiFi.listening())
		log("boot", "WiFi.listening...");
}

// Particle is in setup mode.  don't run app code.
if(WiFi.listening())
{
	Particle.process();
	delay(50);
	return;
}

// Connect to Wifi or restart
if (!WiFi.ready())
{
	log("boot", "Connecting to WiFi...", String(connectionState));

	lastParticleConnect = -1;
	lastParticleCloud = false;
	WiFi.connect();

	// Poll in 50 ms steps so Particle.process() keeps running
	int waitCounter = 60000; // ~60 seconds
	while(waitCounter > 0 && (!WiFi.ready()))
	{
		Particle.process();
		if(WiFi.listening()) return;
		delay(50);
		waitCounter -= 50;
	}

	if(WiFi.ready())
	{
		// Success: clear the counter so only consecutive failures count
		wifiFailConnections = 0;
		log("boot", "Connected to local WiFi...", String(connectionState));
		log("boot", "WiFi localIP", WiFi.localIP().toString());
		log("boot", "WiFi subnetMask", WiFi.subnetMask().toString());
		log("boot", "WiFi gatewayIP", WiFi.gatewayIP().toString());
		log("boot", "WiFi dnsServerIP", WiFi.dnsServerIP().toString());
		log("boot", "WiFi dhcpServerIP", WiFi.dhcpServerIP().toString());
		log("boot", "wifi signal online ", WiFi.SSID());
	}
	else
	{
		log("boot", "No Wifi Connection");
		wifiFailConnections++;
		if(wifiFailConnections > 60)
		{
			log("boot", "No Wifi in 60 minutes - restart");
			Particle.process();
			System.reset(); //Epic fail bro
		}
	}
	return;
}

// Try a Particle Connect every 30 seconds
// Ok to run without cloud, but keep trying
if ((!Particle.connected()) &&  (lastParticleConnect == -1 || ((now - lastParticleConnect) > 30000)))
{
	log("boot", "Connecting to Particle Cloud...", String(connectionState));
	lastParticleConnect = now;
	lastParticleCloud = false;
	Particle.connect();
	Particle.process();
}

if (Particle.connected() && (!lastParticleCloud))
{
	log("boot", "Connected to Particle Cloud...", String(connectionState));
	lastParticleCloud = true;
	lastParticleConnect = -1;
}


if (!MQTT::isConnected()) {

	log("boot", "Connecting to MQTT Server...", String(connectionState));

	if (MQTT::connect(devicename.c_str(), mqttusername.c_str(), mqttpassword.c_str(), statusTopic, QOS0, 0, "{\"value\": false}", 0))
	{
		log("boot", "Connected to ", this->mqttserver.c_str());
		mqttFailConnections = 0;
		// Subscribe to commands for all sensors for this device
		MQTT::subscribe(subscribeCmdTopic, QOS1);
		MQTT::publish(statusTopic, "{\"value\": true}");
		MQTT::loop();
	}
	else
	{

		log("boot", "No MQTT Connection");
		mqttFailConnections++;
		int waitCounter = 60000; // ~60 seconds
		while(waitCounter > 0)
		{
			Particle.process();
			if(WiFi.listening()) return;
			delay(50);
			waitCounter -= 50;
		}

		// Can't connect to MQTT Server.  reboot and try again
		if(mqttFailConnections > 60)
		{
			log("boot", "60 MQTT Failures - restarting...");
			System.reset(); //Epic fail bro
		}
		return;
	}
}

APPLICATION CODE HERE...

If you are in listening mode then you may not have any WiFi credentials. Prior to step 2, you may want to ensure you have some WiFi credentials that you can attempt connecting with. If no credentials are there, go back to listening mode. If, after 60 WiFi connection attempts it doesn't connect, you may want to reset your credentials if that makes sense (thus forcing listening mode again). It all depends on why the connection may have failed and how you want to deal with that. With invalid credentials your code will loop forever.

The rest of the steps look right.

Good call on forcing listening mode if there are no credentials stored. After 60 failed connects I reboot just to rule out any device-specific issues (e.g. memory, sockets, poor use of millis() etc.). The more likely scenario is that the device has power but the router it’s connected to does not. This is a likely case on my battery-operated devices during a power failure.

The intent of this algorithm is to survive those power outage scenarios.

@AndrewWeiss, I would put the WiFi.listening() test as the first thing. Not sure the delay and the Particle.process() are necessary. I have a Photon that uses SoftAP so any client can set up the local credentials. I don’t want my main code running while SoftAP is running. One thing you could consider after 60 tries is going to deep sleep for a fixed amount of time, which is equivalent to a low-power timed reset.
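As a rough illustration of the "deep sleep after 60 tries" idea, here is a small counter that tracks consecutive failures and says when to give up. Recovery, RetryPolicy and FailureCounter are hypothetical names; on a Photon the DeepSleep result would presumably map to something like System.sleep(SLEEP_MODE_DEEP, seconds), which is an assumption about how you would wire it up.

```cpp
// Hypothetical names throughout; none of this is Particle API.
enum class Recovery { Retry, DeepSleep };

struct RetryPolicy {
    unsigned maxFailures;   // e.g. 60 consecutive failed attempts
    unsigned sleepSeconds;  // how long to sleep before starting over
};

class FailureCounter {
public:
    explicit FailureCounter(RetryPolicy policy) : policy_(policy) {}

    // Any successful connect clears the run of failures.
    void onSuccess() { failures_ = 0; }

    // Record one failed attempt and report what to do next.
    // Assumption: on DeepSleep the caller puts the device to sleep for
    // policy().sleepSeconds, which restarts the firmware on wake.
    Recovery onFailure()
    {
        ++failures_;
        if (failures_ >= policy_.maxFailures) {
            failures_ = 0;              // start fresh after the sleep
            return Recovery::DeepSleep;
        }
        return Recovery::Retry;
    }

    unsigned failures() const { return failures_; }
    const RetryPolicy& policy() const { return policy_; }

private:
    RetryPolicy policy_;
    unsigned failures_ = 0;
};
```

Compared with an immediate System.reset(), sleeping between rounds of attempts should be kinder to a battery-powered device sitting next to a router that has no power.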

It’s happened again. I rebooted the router and all of my Particles are stuck blinking blue (connecting to cloud) and application code is blocked. I know that the app code is blocked because I connect to the serial port to monitor and there is nothing coming out. It should at a minimum drop a log every 5 seconds.

So it appears that my Particle.connect() call is blocking indefinitely.

This is the root of my problem.

Any idea on how Particle.connect operates with system thread enabled? It doesn’t seem to be working as expected.

Simplified: Particle.connect() only sets a flag that the system should connect to the cloud and returns. So that command won’t block your code at all.
The actual connection process won’t fully block your code either, although there will be phases where the controller will not service your own code quite as regularly as it should.

But we do know that user code can prevent Particle.connect() from actually connecting, and one of the most common mistakes is to call that function again while a previous connection attempt is still running, knocking the whole process back to the beginning.

Just for checks, try adding a waitFor(Particle.connected, 60000) after your connect call and toggle the D7 LED whenever this times out.
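One way to avoid re-issuing the connect request while an attempt is still in flight is a small guard object. This is a sketch with assumed names (CloudConnectGuard is not a Particle class), and the choice to allow a fresh request after a 60 second timeout is an assumption for illustration.

```cpp
#include <cstdint>

// Hypothetical guard: allows one connect request at a time.
class CloudConnectGuard {
public:
    explicit CloudConnectGuard(uint32_t timeoutMs = 60000)
        : timeoutMs_(timeoutMs) {}

    // Returns true when the caller should issue a connect request now
    // (on a device: call Particle.connect()). Returns false while
    // connected or while an earlier request is still within its timeout.
    bool shouldRequest(bool connected, uint32_t nowMs)
    {
        if (connected) {
            pending_ = false;           // attempt finished successfully
            return false;
        }
        if (pending_) {
            // Unsigned subtraction stays valid across counter wraparound.
            if (static_cast<uint32_t>(nowMs - startedMs_) < timeoutMs_)
                return false;           // still in progress, leave it alone
            ++timeouts_;                // on a device: toggle D7 here
        }
        pending_ = true;
        startedMs_ = nowMs;
        return true;
    }

    unsigned timeouts() const { return timeouts_; }

private:
    uint32_t timeoutMs_;
    uint32_t startedMs_ = 0;
    bool pending_ = false;
    unsigned timeouts_ = 0;
};
```

In the loop this would replace the plain 30 second timer: call Particle.connect() only when shouldRequest(Particle.connected(), millis()) returns true, and watch timeouts() to see how often an attempt ran out of time.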

I’ll give it a try tonight, ScruffR. I had originally removed waitFor because I wasn’t sure whether it would call Particle.process() in the background or not. Since you’re suggesting it I assume it’s ok.

That’s what it was created for :wink:

@ScruffR thank you for the help on this previously. If you have a moment I would appreciate some direction if you have any advice.

I'm revisiting this issue and giving it another go. Everything seemed perfectly logical in my attempts, but after a few router reboots some photons would go into an indefinite blinking blue. It seems that this was an issue for others, and I'm thinking that the culprit is an out-of-sockets issue related to the MQTT library. I haven't been able to prove this, but it seems all too similar to others (links below) who were using TCPClient and had photons "lock up" on them. If I use automatic mode then I have no issues, but in manual mode with threads enabled I will encounter this issue within a few days on most photons.

I'm using this fairly well-known MQTT library (GitHub - hirotakaster/MQTT: MQTT for Photon, Spark Core). Although this library seems to properly call client.stop() as needed, perhaps it's worth a review.

Any help would be appreciated.
-Andy

If things work fine in AUTOMATIC mode but not otherwise I’d assume that the application or library code does not deal well with asynchronous connection losses.
Calling some functions that require an active connection but don’t bail gracefully when it’s not there may leave the object or even the socket in an in-between-state which may be hard to get out of.

But for a definitive answer I’d have to dive into 3rd party code, for which I just haven’t got the time. But maybe @hirotakaster can assist you with his library.

Thank you @ScruffR.

Assuming this is the case and it’s an issue with the library or something else mishandling the socket, is there any preventative code that you could suggest to detect and avoid the lockup? For example, prior to making any connections, check whether there are any free sockets, and if one doesn’t free up after, say, 60 seconds, force a reboot.