WiFi channel switching stops WiFi connectivity and is only recoverable by reboot (BUG)

I ran into a situation where I wanted the TCPClient connect() call to be asynchronous so I wrote an alternate implementation. I haven’t used it in a long time, but I’m pretty sure it should still work.


System seems to be stable now that it is connected to the cloud. Even when the connection is lost, it is restored without problems. I have one test system with terrible reception.

This is what I have (it switches automatically between USB and TCP):

I have updated my findings in this issue:

The problem is still not resolved, 0.8.0-rc.2 doesn’t fix it.

It seems even worse than before: even though I have

SYSTEM_MODE(MANUAL);
SYSTEM_THREAD(ENABLED);

the application is completely blocked when WiFi is lost.
WiFi connectivity handling still seems to be absolutely terrible.

I am getting unhappy about my choice of Particle as a platform for my products.
Is it too much to ask from an IOT platform that:

  • Losing WiFi does not completely block the system
  • WiFi disconnects and channel switches are handled correctly, and when WiFi returns it always reconnects
  • A TCP server run by the device does not get into an unrecoverable error state when WiFi is bad or temporarily lost?

I have given you test code. I reported this bug over 6 months ago. I feel unheard and ignored.

Is it perhaps possible to add some system event hooks to handle WiFi disconnects?
I’m really trying to find a solution because I have a lot of customers affected by this. Why can’t someone at Particle provide some example code that:

  • Handles WiFi connecting/disconnecting without blocking the main thread
  • Handles running a TCP server on the Photon in a robust way

@Elco and anyone else

I too am having some trouble when my P0 devices go offline. We have 2 issues that seem related to what you are describing. The scenarios are: the router disappears (users switch routers or take the product to a different location) or the internet is intermittent or down but they are still connected to the router. In both cases, the P0 successfully determines that we are offline and we get into our “offline state” which provides some tips on our display of what to do. However, we cannot seem to get OUT of that state even if the router comes back or the internet comes back.

The P0 seems to behave differently in each case: with WiFi.ready() false and the IP address reporting 0.0.0.0 when the WiFi network is not available (which should be correct), while the previous IP address seems to stay if the network cable is unplugged from the router.

One reason we switched to thread-enabled mode was to allow our users to press buttons on the device and to update the screen for reassurance, ESPECIALLY when offline. However, this seems to have made it difficult for the hub to come BACK online. Do I need to call WiFi.connect() again? Will it block my system for large chunks of time, and if so, any suggestions on how to allow users to interact with the device to get more information?

We are trying not to rely too heavily on the Particle cloud, to reduce heavy loads or timeouts, so our conditions for getting back online include: is there internet, and is our server available?

other info: I’m using firmware 0.6.3 since our product is in production. Our shipping system is not using a threaded mode, but we would like to go there to enable more interaction for the user.

Have you tried Particle.connect() periodically when you’re offline (or triggering it from user input)?

This should not block application code because you’re in threaded mode.
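
A minimal sketch of the "retry periodically" idea, kept free of Device OS calls so it can be read and tested on its own. `ReconnectThrottle` is a hypothetical helper, not part of the Particle API, and the retry interval is whatever the application chooses.

```cpp
#include <cstdint>

// Hypothetical helper (not part of Device OS): decides when the next
// reconnect attempt is due, so loop() does not issue attempts back to back.
class ReconnectThrottle {
public:
    explicit ReconnectThrottle(uint32_t intervalMs) : intervalMs_(intervalMs) {}

    // Returns true when an attempt should be made at time nowMs.
    // Unsigned subtraction keeps the comparison correct across a
    // 32-bit millisecond counter rollover.
    bool due(uint32_t nowMs) {
        if (!started_ || (nowMs - lastAttemptMs_) >= intervalMs_) {
            started_ = true;
            lastAttemptMs_ = nowMs;
            return true;
        }
        return false;
    }

    // Call once the connection is back, so the next outage retries at once.
    void reset() { started_ = false; }

private:
    uint32_t intervalMs_;
    uint32_t lastAttemptMs_ = 0;
    bool started_ = false;
};
```

On the device, loop() could call Particle.connect() whenever Particle.connected() is false and due(millis()) returns true, and call reset() once the connection is restored. User input could bypass the throttle and trigger an attempt directly.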

@avtolstoy is working on fixes and made a few in this PR:
https://github.com/particle-iot/firmware/tree/fix/photon-tcp

This improved the situation, but we still see some photons not getting back on the network.
I was also able to trigger an SOS in certain conditions with the fixes.

I don’t know when Andrey scheduled time to work on it again, but it does not seem to be complete yet.
I have posted my findings in this thread:

But I’m waiting for Andrey for an actual fix. It seems to be deep in the system layer.

Once I have a combination of my own code for disconnect handling combined with fixes from particle that works, I’ll post it here.
But I don’t think you’ll get a reliable connection on 0.6.3. Just look at the bugs that were uncovered in that PR.


Just for the record, this seems like correct behavior to me. You are still connected to the router and could in theory access other hosts on the same subnet, you just can't get to the internet.

I know there are problems in this area and fortunately they are being addressed, but I don't think this is one of them.


I can try Particle.connect(), but we always feared it would slow down our application code, because significant resources would be spent trying to connect when no connection is available. This is why I was hoping to use WiFi.ready() to indicate when we are ready to attempt a Particle connection. Also, since we don’t expressly need Particle to be connected to run our application, we don’t want to check Particle if we don’t need to. We can try to resolve google.com first before attempting Particle, but will we need to turn the WiFi module off and on?

Would WiFi.connect() achieve the same thing and not be blocking in the same way?

It seems odd that the behavior of WiFi.ready() differs between the online-to-offline and offline-to-online transitions.

For example:
WiFi.ready() returns true when online but turns false when the Photon goes offline.
But when offline, WiFi.ready() also returns true if connected to a router?

Do we have to expressly turn off the wifi module if we lose connectivity and turn it back on expressly when we detect we are offline?

Sorry, I should have started here: Can you describe what you're doing to try to reconnect?

WiFi.ready() should return true if the photon is connected to a WiFi network: it can reach the router.
If that router doesn't provide a connection to the Internet, that's a different matter. That's what Particle.connected() is for. That's what @bko said and he is right.

You'll have to be clear what unplugging the router means: the power plug, or the WAN connection?


The IP address your device has after an online->offline transition is (usually) still valid as it’s (commonly) a local IP to your LAN.
But you can try to ping a remote address to check whether your LAN also has an internet connection, and only if that’s the case trigger a Particle.connect().
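
The check described above can be sketched as two small pure functions. The names `Link`, `classifyLink` and `shouldConnectCloud` are hypothetical. The inputs are assumed to come from WiFi.ready() and from a reply count such as the one WiFi.ping() returns on the Photon. The ping presumably has to wait for replies, so it may be better to run it only occasionally.

```cpp
// Hypothetical classification of connectivity (names are illustrative).
enum class Link { NoWiFi, LanOnly, Internet };

// wifiReady:   what WiFi.ready() returned on the device
// pingReplies: number of replies received from a remote (non-LAN) host
Link classifyLink(bool wifiReady, int pingReplies) {
    if (!wifiReady) {
        return Link::NoWiFi;   // not associated with a router at all
    }
    // A valid local IP alone says nothing about the WAN side.
    return pingReplies > 0 ? Link::Internet : Link::LanOnly;
}

// Only attempt a cloud connection when the internet is reachable
// and the cloud is not already connected.
bool shouldConnectCloud(Link link, bool cloudConnected) {
    return link == Link::Internet && !cloudConnected;
}
```

Keeping the decision separate from the hardware calls makes it easy to log the three states on a display and to add a check for your own server as a further condition.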


Bumping again for visibility!

WiFi + TCP is still buggy and crashy on the photon!

Copy of my reaction on the PR:

So… it looked like Particle finally noticed that WiFi + TCP was really unstable. Finally after 6 months, fixes were incoming!
But not everything is fixed yet. And I have not seen an update in 8 days.

Can someone at Particle tell me, is there a plan to fix the rest? Is someone assigned?
Is it scheduled on a sprint? Does Particle even acknowledge that there are still problems?

I have unhappy end users and I am totally dependent on Particle for a fix. I’m not the only one having these issues. There are plenty of other users posting about them on the forum. This is really getting on my nerves.

Please, keep us informed.

Do I need to bump up all the threads about WiFi and TCP bugs before action is taken? :angry:


I have come to this situation later than you and others - my issue is similar in that my devices might have bad credentials or more commonly operate in an environment where the WAPs are turned off outside school hours. The device should keep working and has an SD card based event message buffer to handle this but even more recent releases (0.8.0-RC.11) seem to be tenaciously trying to reconnect to a WAP that isn't there and are slowing the application to a crawl.

Did you get a response from Particle? I have several support tickets open and they all go to Andrey who is clearly maxed out on the Mesh development so I never had a reply.

I was able to get it stable in the latest version of our firmware.
This is the code I use:

SYSTEM_THREAD(ENABLED);
SYSTEM_MODE(SEMI_AUTOMATIC);

// in setup:
    System.disable(SYSTEM_FLAG_RESET_NETWORK_ON_CLOUD_ERRORS);

I think one of the crucial changes is that I am not dropping any clients until they are disconnected. Only then do I start looking for a new client.
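
A minimal sketch of that client policy, not the poster's actual firmware. `acceptIfIdle` is a hypothetical helper written as a template so it works with any server/client pair that offers `available()` and `connected()`, which TCPServer and TCPClient do.

```cpp
// Hypothetical helper: keep the current client for as long as it is
// connected, and only poll the server for a new one after it has gone.
// Returns true when the server was polled on this call.
template <typename Server, typename Client>
bool acceptIfIdle(Server& server, Client& client) {
    if (client.connected()) {
        return false;              // never replace a live client
    }
    client = server.available();   // may still be an unconnected client
    return true;
}
```

On the device this would be called once per loop() pass, for example `acceptIfIdle(server, client);`, followed by the normal handling of `client` when it is connected.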

+1 on this issue and solution…

Running 0.7.0 with my simple TCPServer app (below), it’s pretty easy (just a few trials of resetting WiFi or WAN on the router/AP) to get the Photon into a state where it responds to pings but will no longer connect with TCP clients.

Disabling SYSTEM_FLAG_RESET_NETWORK_ON_CLOUD_ERRORS seems to be the ticket to avoid this state.

On the other hand, Particle had a reason for resetting the network on cloud errors:


so disabling this flag may create other problems. I’m hoping that the improvements/bug fixes in 0.7.0/WiFi will be enough to move forward with something stable.

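SYSTEM_MODE(MANUAL);
SYSTEM_THREAD(ENABLED);

// The declarations below were not shown in the original post. The port
// number and the LOG_CONN macro are assumptions added so the sketch compiles.
#define PORTNUM 8080
#define LOG_CONN(who, addr, state) \
    LOG(INFO, "%s %s %s", who, (addr).toString().c_str(), state)

TCPServer server = TCPServer(PORTNUM);
TCPClient client;
IPAddress serverAddr;
IPAddress clientAddr;
bool particleConnected;
bool serverOnline;
bool clientConnected;

void setup() {
    particleConnected = false;
    serverOnline = false;
    clientConnected = false;
    Particle.connect();
    // Do not reset network connection on cloud errors
    System.disable(SYSTEM_FLAG_RESET_NETWORK_ON_CLOUD_ERRORS);
}

void loop() {
    if (Particle.connected()) {
        if (!particleConnected) {
            LOG(TRACE, "particle connected");
        }
        particleConnected = true;
        Particle.process();
        // (Re)start the TCP server once the cloud connection is up
        if (!serverOnline) {
            server = TCPServer(PORTNUM);
            if (server.begin()) {
                serverAddr = WiFi.localIP();
                LOG_CONN("server", serverAddr, "online");
                serverOnline = true;
            } else {
                LOG(ERROR, "server.begin() failed");
            }
        }
    } else {
        if (particleConnected) {
            LOG(TRACE, "particle disconnected");
        }
        particleConnected = false;
        // Tear down client and server while offline, then retry the cloud
        if (serverOnline) {
            client.stop();
            server.stop();
            LOG_CONN("server", serverAddr, "offline");
            serverOnline = false;
        }
        Particle.connect();
    }
    if (client.connected()) {
        if (!clientConnected) {
            clientAddr = client.remoteIP();
            LOG_CONN("client", clientAddr, "connected");
        }
        clientConnected = true;
        // handle client socket
    } else {
        if (clientConnected) {
            LOG_CONN("client", clientAddr, "disconnected");
        }
        clientConnected = false;
        // Only look for a new client once the previous one is gone
        if (serverOnline) {
            client = server.available();
        }
    }
}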

I would recommend using 0.8.0-RC.11; it seems to have solved the issues I had. Devices now just sail through disconnections and reconnect using SYSTEM_MODE(SEMI_AUTOMATIC). See the thread “Handling WiFi out of Range” for my findings on why the application was slow when the WiFi network was out of range.


I am wondering if I am suffering from the same issue as reported by @elco in this ticket.

At random and rare times, a P1 based board loses connectivity (flashes green), then eventually quickly flashes cyan, but is never able to recover from this state and so must be rebooted.

Logger reveals:

0001336641 [hal.wlan] TRACE: connect cancel
0001336642 [hal] TRACE: 20015b28 socket list: 0 active sockets closed
0001336642 [hal] TRACE: 20015b30 socket list: 0 active sockets closed
0001346731 [app] WARN: W.rdy=0 IP=1 DISC
0001346731 [app] INFO: W.con

My debug line “W.rdy=0 IP=1 DISC” means “WiFi.ready() has returned false and it has an IP address”. The “WiFiStateMachine” strategy being used acts upon this condition by issuing a WiFi.disconnect() and then WiFi.connect(). As stated, it never connects again.

This routine is being called approximately every second.
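
For readers following along, the decision described above can be sketched as a pure function. This is not the actual “WiFiStateMachine” code, which is not shown in this thread. `WifiAction` and `decideWifiAction` are hypothetical names, and the branch for "not ready, no IP" is an assumption.

```cpp
// Hypothetical names; the real state machine is not shown in the thread.
enum class WifiAction { None, Connect, DisconnectThenConnect };

// wifiReady: what WiFi.ready() returned
// hasIp:     whether the device still holds a non-zero local IP address
WifiAction decideWifiAction(bool wifiReady, bool hasIp) {
    if (wifiReady) {
        return WifiAction::None;                   // link is fine
    }
    if (hasIp) {
        // The "W.rdy=0 IP=1 DISC" case from the log above.
        return WifiAction::DisconnectThenConnect;
    }
    return WifiAction::Connect;                    // assumption, see lead-in
}
```

The caller would map DisconnectThenConnect to WiFi.disconnect() followed by WiFi.connect(), and Connect to WiFi.connect() alone.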

DeviceOS v1.4.0-rc.1, SYSTEM_MODE(SEMI_AUTOMATIC).

I think there has been a lot of noise on this issue over the years, and I may be repeating myself here as well.

@mterrill, do you have any thoughts on this problem? It seems that it is not the 20 MHz vs 40 MHz WiFi channel bandwidth issue, as our Access Points are not configured that way.

No specific thoughts on the issue, though we seem to be getting customer reports very similar to your symptoms. We’ve even had orange/red flashes that were solved when the customer went through wifi setup again with the device. Which is bizarre. Same network.

Thanks @mterrill. It is a real issue and the requirement to reboot is really bad.

The logger line:

0001336641 [hal.wlan] TRACE: connect cancel

might give a clue to someone who knows the internals....
