WiFi but no Cloud causes SOS

DeviceOS 1.2.1.rc-1 on P1.

Am trying to make my system more robust to specific communications faults.

Has anyone tested the following scenario or have comments:

  • Am using modified @rickkas7 strategy (state machine following the state of the WiFi connection) as alluded to here Particle.connect() disconnects WiFi - can this be stopped? (to allow the device to work on an intranet)
  • Set up an Android 9 phone as a hotspot
  • Device connects with cloud ok (breathing cyan)
  • Turn off DATA on the mobile but keeping the hotspot active
  • Device is still breathing cyan (because it has not detected a drop in the Cloud connection as yet)
  • Call publish() a number of times (which is proceeded by a Particle.connected() check which would pass initially whilst the DeviceOS determines that it is in fact offline)
  • After some ten or so publish attempts, and whilst flashing cyan (expected), the logger outputs:
    {noformat}
    000039621 [hal.wlan] TRACE: connect cancel
    0000039628 [hal] TRACE: 20015b04 socket list: 0 active sockets closed
    0000039628 [hal] TRACE: 20015b0c socket list: 0 active sockets closed
    {noformat}

and then the Device goes into an immediate one flash SOS panic (bad).

Primary question : do the above log lines give some clue as to what is going on?

Recompiling to not use the modified @rickkas7 strategy did not give the SOS panic… but the same log lines were seen.

Needless to say, am finding it very difficult to chase this down… investigations are continuing…

Sorry to be “pedantic”, but could you try 1.2.1-rc.2 instead?
Although I couldn’t see any specific change between the two RCs that would fix that, it’s still less productive to test with some outdated RC where the newer version may not exhibit this issue any more.

If you needed to test with an official version (e.g. 1.1.0) I’d see the point since you may need to find a workaround for that version.
But being on RC already you should always opt for the most recent one as this will more likely become an official release than any predceeding RC and you don’t want to have a workaround but the issue addressed before the official release.

@ScruffR, agree 100% (I normally always work with the latest).

Compiled against DeviceOS 1.2.1.rc-2, but As you might expect, the problem is still evident...

Here is log output before the SOS:

0000292355 [app] WARN: => !P.conn
0000293359 [app] WARN: => !P.conn
0000294362 [app] WARN: => !P.conn
0000295366 [app] WARN: => !P.conn
0000296370 [app] WARN: => !P.conn
0000297374 [app] WARN: => !P.conn
0000298378 [app] WARN: => !P.conn
0000299382 [app] WARN: => !P.conn
0000300386 [app] WARN: => !P.conn
0000301390 [app] WARN: => !P.conn
0000302184 [hal.wlan] TRACE: connect cancel
0000302188 [hal] TRACE: 20015b04 socket list: 0 active sockets closed
0000302188 [hal] TRACE: 20015b0c socket list: 0 active sockets closed

I have made some progress this afternoon. Reading through an earlier post looking for clues: No WiFi blocks loop() - specifically Serial1 reception, @Elco's comment regarding "overflowing the system thread" caught my eye.

I was calling my "WiFiStateMachine()" routine every 200 ms. Am now calling it every 1000 ms.

This change has improved:

  • the time it takes for the device to pick up the change mobile's data status (enabled or disabled).
  • connect to the Wifi hotspot

But unfortunately the problem still occurs (but takes longer to reproduce...)

Here is the WiFiStateMachine() function call "as is" for the record (as stated, based on @Rickkas7 code):

#ifdef WIFI_RECONNECT_STRATEGY
///
/// v1.16.0 2016-08-04
/// Strategy overcomes issue where once WiFi is disconnected,
/// it does not reconnect
/// Refer to "TCPCLIENT intranet connection fails if no cloud connection
/// https://community.particle.io/t/tcpclient-intranet-connection-fails-if-no-cloud-connection/24684
///
///
/// Refer https://community.particle.io/t/no-wifi-blocks-loop-specifically-serial1-reception/43227
/// Might be overwhelming the system thread if performing the following
/// on every loop() iteration, so only checking every 200 mS
///
///	NOTE is called every 200 mS
///
void WiFiStateMachine(void)
{
		enum WifiState { WIFI_STATE_NOT_CONNECTED, WIFI_STATE_CONNECTING,
									 WIFI_STATE_CONNECTED, WIFI_STATE_RUNNING };

		static WifiState wifiState = WIFI_STATE_NOT_CONNECTED;
		static uint16_t nTickCountDownParticle = 0;
		static uint16_t particleConnectRetryMins = 0;
		static bool bLoggedWiFiDetails = false;		// Only want to report once per session

		switch(wifiState)
		{
			// ++++++++++++++++++++++++++
			case WIFI_STATE_NOT_CONNECTED:
			// ++++++++++++++++++++++++++
				Log.info("W connecting..");

				// NOTE: Following is not necessary because of prior calls to
				// WiFi.connect() and/or Particle.connect()

				// WiFi.connect();  // NOTE: No need to ever call again, it keeps on trying
				wifiState = WIFI_STATE_CONNECTING;

				break;	// WIFI_STATE_NOT_CONNECTED:

			// ++++++++++++++++++++++++++
			case WIFI_STATE_CONNECTING:
			// ++++++++++++++++++++++++++
				// Checking IP is a workaround for bug
				// where WiFi.ready() returns true with IP 0.0.0.0
				if (WiFi.ready() && WiFi.localIP())		// same as (WiFi.localIP()[0] != 0)
				{
						wifiState = WIFI_STATE_CONNECTED;
				}

				// The WiFi.connect() call never times out, it will keep trying forever
				// so there's no need to call WiFi.connect() again here.
				break;	// WIFI_STATE_CONNECTING:

			// ++++++++++++++++++++++++++
			case WIFI_STATE_CONNECTED:
			// ++++++++++++++++++++++++++
				// Do any one-time initialization here like calling udp.begin() or tcpServer.begin()
				Log.info("W conn");

				// Refer https://community.particle.io/t/udp-broadcast-listeners-stop-listening-over-time/38391/2
				// When WiFi disconnects, all listeners are removed, and not re-created.
				// So when a disconnect is detected and you reconnect,
				// call UDP.begin() and TCPServer.begin() again.


#ifdef TCP_SERVER_ENABLED
				if (configs.fMode2 & MODE_2_FLAG_TCP_SERVER)
				{
						// TCP Server - start listening for clients
						Log.info("tcpSrv.bgn");
						tcpServer.begin();
				}
#endif	// TCP_SERVER_ENABLED

				// 2018-07-21 v4.06.2 Removed as not required/may be bad practice
				// Also connect to the Particle cloud
				// Particle.connect();

				// Kick off a check for INET connectivity, if good,
				// kicks off Particle.connect()

				nTickCountDownParticle = 8 * WIFISTATEMACHINE_TICKS_PER_SEC;
				particleConnectRetryMins = 0;	// reset Nagle algorithm

				wifiState = WIFI_STATE_RUNNING;
				break;	// WIFI_STATE_CONNECTED:

			// ++++++++++++++++++++++++++
			case WIFI_STATE_RUNNING:
			// ++++++++++++++++++++++++++

				// Checking IP is a workaround for bug
				// where WiFi.ready() returns true with IP 0.0.0.0
				if (!(WiFi.ready() && WiFi.localIP()))
				{
						Log.warn("As W.rdy=%d localIP=%d, disc!",
											WiFi.ready(), (bool)(WiFi.localIP()));
 						WiFi.disconnect();											// <<<<<<-----------
						Log.info("W.conn");
						WiFi.connect(WIFI_CONNECT_SKIP_LISTEN);	// <<<<<<-----------


						wifiState = WIFI_STATE_CONNECTING;

						// May not be absolutely necessary, but perhaps good practice
						// to close TCP listeners

#ifdef TCP_SERVER_ENABLED
						if (configs.fMode2 & MODE_2_FLAG_TCP_SERVER)
						{
								tcpServer.stop();
						}
#endif	// TCP_SERVER_ENABLED

						if (configs.fMode2 & MODE_2_FLAG_INTRANET_ONLY)
						{
								// No need for a cloud connection
								// Connecting to other services on the *intranet* and
								// therefore don't want Particle.connect() dropping WiFi.
								Log.warn("P.disc");	// Does NOT drop WiFi
								Particle.disconnect();
						}

						// No need to call WiFi.connect(), it continually keeps retrying
				}
				else
				{
						// We have WiFi, therefore we can check for Particle connection
						// (at regular intervals)

						if ((configs.fMode2 & MODE_2_FLAG_INTRANET_ONLY) &&
								(nTickCountDownParticle == 0) &&
								!Particle.connected())
						{
								// Only call once for this outage
								Log.warn("W !cloud ->P.disc");
								Particle.disconnect();		// Does NOT drop WiFi

								// Try a Particle.connect() in 30 secs
								nTickCountDownParticle = 30 * WIFISTATEMACHINE_TICKS_PER_SEC;
								particleConnectRetryMins = 0;	// reset Nagle algorithm
						}
				}

				break;	// WIFI_STATE_RUNNING:

		} // switch(WiFiState)

		// -----------------------------------------------
		// Check for Particle connected every now and then
		// -----------------------------------------------

		if (nTickCountDownParticle > 0)
		{
				if (!Particle.connected())
				{
						Log.warn("=> !P.conn");

						if (!--nTickCountDownParticle)
						{
								// Give up trying to Particle.connect()!
								// Just check for internet connectivity from now on
									// NOTE: CheckForInternet() blocks for 1 sec
								//       if there is WiFi but no internet connection
								if (CheckForInternet())	// also checks WiFi.ready()
								{
										Log.info("INET->P.conn");
										Particle.connect();	// ****** NOTE ******

										// Need to do this so that Particle has time to come up
										// otherwise Particle.disconnect() will be immediately
										// called in WIFI_STATE_RUNNING
										nTickCountDownParticle = 8 * WIFISTATEMACHINE_TICKS_PER_SEC;
										particleConnectRetryMins = 0;	// reset Nagle algorithm
								}
								else	// No internet
								{
										// Implement simple "Nagle" back off algorithm
										// 1,2,3,4,5,5,5... mins
										particleConnectRetryMins =
													(particleConnectRetryMins < 5)
																					? particleConnectRetryMins + 1
																					: 5;
										Log.warn("No INET retry tst %dm",
										 						particleConnectRetryMins);

										nTickCountDownParticle =
																		particleConnectRetryMins * 60
																				* WIFISTATEMACHINE_TICKS_PER_SEC;
								}
						}
				}
				else	// Particle is connected
				{
						Log.info("P.conn OK");
						nTickCountDownParticle = 0;	// stop the count down
						particleConnectRetryMins = 0;			// reset Nagle algorithm

						if (!bLoggedWiFiDetails)
						{
								// Only want to report once per session
								bLoggedWiFiDetails = true;
								LogWiFiDetailsAndPublish();	// Calls Chariot.log()
						}
				}
		}

}	// WiFiStateMachine()
#endif  // WIFI_RECONNECT_STRATEGY


////////////////////////////////////////////////////////////
/// Returns true if WiFi is up and connected to the internet
////////////////////////////////////////////////////////////
bool CheckForInternet(void)
{
		bool rc = false;
		unsigned long time = millis();

		if (WiFi.ready() && WiFi.localIP())
		{
				rc = (WiFi.ping(ipInternetCheck, 1) == 1);
		}

		Log.info("ChkINET=%d %lums", rc, millis() - time);
		return(rc);
} // CheckForInternet()

Looking forward to any assistance that can be offered here....

1 Like

@rickkas7, have you any comment on this ticket? It is not inherently obvious what the problem is…

What I can say is that I can handle the WiFi but no Cloud condition by not calling WiFiStateMachine() …

I created an issue 20 days ago which might be related:

So far I have had no response from Particle…

1 Like

@elco, excellent pick up!

As you can see from my code in the earlier post, WiFi.ready() is being called all the time (because WiFi is ready):

if (WiFi.ready() && WiFi.localIP())

This points to a high probability that our issues are one and the same.

This is evidenced by when I don’t use my strategy (ie the line above is not called), the problem goes away…

Given your new insight, I will track this down from my end and report back.

@Elco have completely confirmed that my issue is most likely the same as yours. It is caused by calling WiFi.localIP().

By not calling WiFi.localIP(), the system no longer SOS panics when I turn off data on my mobile hotspot.

Unfortunately a side effect (which has yet to be felt) is that calling WiFi.localIP() was required for the WiFi.ready() issue where it sometimes returns true (ie ready) when it does not have an allocated IP address (ie it is not ready).

Hopefully your bug report will be resolved by the DeviceOS developers.

Thanks once again for the insight!

Let me know if you have a resolution to the WiFi.ready() bug…

I switched to using events to manage the IP address to display on the LCD.

Because these event handlers are executed on the system thread, there is less or no chance of accessing dangling pointers because the system thread changed the network config.

All of my WiFi handling is set up in this file:

The main handler is at line 157. I respond to particle system events to update connected status and ip to avoid thread safety bugs in device-os.

I still added a timeout on no connectivity, because in sometimes the events would not trigger and the system would be disconnected but not reconnecting (breathing blue), see line 122.

@avtolstoy fixed the network bugs I reported last time. I think he would be the guy to look into this one too. All the network handling code uses raw pointers, I think some shared and weak pointers could help with the thread safety. I’m glad some fixes to support newer versions of gcc are finally being implemented, because more modern STL usage and better compiler warnings would help the platform.

@Elco that is a very interesting point about being on the system thread vs the main thread.

You use of network events is very interesting… had not thought of using them.

I had implemented my code based on @rickkas7 code to get around “intranet only” issues (no cloud but wifi) to stop WiFi being automatically dropped because of cloud.

Hey Elco. Thanks so much for filing the issue! Things have been hectic for engineering in terms of travel and other housekeeping items recently. Your issue filing may have coincided with some of the hustle going on.

I'll go ahead and try to circle this internally and see if I can't catch some folks and get their input on this. :slight_smile: