Long Term Photon Connection Stability

Hmm… I double checked the interrupt code.
No returns, loops, or anything that could branch out or take a long time
between disable and enable.
A 4 case switch statement, a few counter decrements.
I doubt it is that.

Hardware peripherals are a possibility.
I have two tinyAVR devices as slaves on an I2C line.
I did not program or develop those, so short on exact details.
There is a checksum on communication though, and that has never
reported an error, so fairly reliable under normal cases.
But if one of those hung, I2C may fail badly.
I would hope an I2C failure would time out, but perhaps not?

I also have some simple pin outputs -> darlington pair -> relays.

Very good power supply. Pretty good grounding. Doubt it is electrical.

Also: I switched to static IP a week or so ago.
No major difference from DHCP, but I “feel” timeouts were faster.
But not measured in any meaningful way.

My ISP seems to go down in general quite a bit. A few minutes here and there
several times per day. But the wireless router (airport extreme) is good.

Firmware is 0.4.9

@mdma I have a set of steps that can reliably get the devices stuck in the flashing cyan state. Here’s my setup in case you would like to try to reproduce it yourself.

Setup:

  • I have 20 photons set up to make the problem more likely to happen. There are about 3 that never seem to show the problem, but most devices show it most of the time.

  • All the photons have static IP addresses.

  • The wifi is WPA2 and comes from a home router. I think the router is set up with factory default settings aside from the network name/password and admin password.

  • The photons are running the following code:

    SYSTEM_THREAD(ENABLED);

    void setup() { }

    void sendMessage() {
      TCPClient client;
      if (client.connect("google.com", 80)) {
        // this call to client.print() hangs when the devices are stuck in the flashing cyan state
        client.print("GET / HTTP/1.0\nHost: google.com\n\n");
      }
      client.stop();
    }

    void loop() {
      sendMessage();
      delay(1000);
    }

Steps to reproduce:

  1. Power on all the photons and wait for them to get to the breathing cyan state.
  2. Disconnect the router from the Internet, without powering off the router. I did this by unplugging the ethernet cable from the modem. All the photons will go to the flashing cyan state, as they should.
  3. Leave the cable unplugged for about 65 seconds, then plug it back in. In my setup, usually about 5-10 devices will go back to the breathing cyan state, as they should. The rest will get stuck in the flashing cyan state indefinitely. Once, I left them for several hours and none of them recovered.
  4. Power cycle the devices. After a power cycle, all of them go back to breathing cyan, as they should.

As per my previous post, I reproduced a situation where the user thread locked up due to a wifi connection problem, even though I had SYSTEM_THREAD(ENABLED). In this situation, my call to client.print() hung indefinitely while the system thread failed to reconnect to wifi. I wrote some code that uses software timers to reset the device when this problem occurs, which I will share here in case anyone else is having the same problem.

SYSTEM_THREAD(ENABLED);

// Timestamp of the last successful TCP call
// Will be updated frequently so should always be recent
// If this timestamp isn't recent, the application thread is probably hung
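// volatile because it is written in the application thread and read from the timer callback
volatile unsigned long lastTCPSend = 0;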

// Call the watchdog() function every second
Timer timer(1000, watchdog);

void setup() {
  lastTCPSend = millis();
  timer.start();
}

void sendMessage() {
  TCPClient client;
  if(client.connect("google.com", 80)) {
    // this call to client.print() hangs when the devices are stuck in the flashing cyan state
    client.print("GET / HTTP/1.0\nHost: google.com\n\n");
    // just called client.print() without hanging. record the time.
    lastTCPSend = millis();
  }
  client.stop();
}

void watchdog() {
    // Reset the device if the last successful call to client.print() was more than 2 minutes ago
    if(millis() - lastTCPSend > 120000) {
        // Reset the photon
        System.reset();
    }
}

void loop() {
  sendMessage();
  delay(1000);
}
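
One note on the comparison in watchdog(): it relies on unsigned subtraction, so it should keep working when millis() rolls over (roughly every 49.7 days, assuming a 32-bit millisecond counter). Here is a minimal host-side sketch of that check in plain C++. The helper name isStale is hypothetical, not a Particle API.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the Particle API.
// Returns true when more than timeoutMs milliseconds have passed since lastMs.
// Unsigned subtraction wraps modulo 2^32, so the result stays correct
// even after the millisecond counter rolls over to zero.
bool isStale(uint32_t nowMs, uint32_t lastMs, uint32_t timeoutMs) {
    return static_cast<uint32_t>(nowMs - lastMs) > timeoutMs;
}
```

Writing the test the other way round (comparing millis() against lastTCPSend + 120000) would misbehave near the rollover, which is why the subtraction form is the safer one.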

I tried the watchdog timer (IWDG) with some interesting results.

When connected, watchdog works fine. Runs fine until I stop “tickling” it, then resets.
And I am tickling it with a separate thread, not the loop()

As soon as the wifi goes down, it immediately hits the watchdog and resets.

As above, I am using:
SYSTEM_MODE(SEMI_AUTOMATIC)
SYSTEM_THREAD(ENABLED)
With a separate thread for key code.
Also, I removed all my disable-interrupt sections.

My wild guess: Somewhere in the core code, it is disabling interrupts,
and when wifi goes down, there is a long delay in that section.

Is anyone from Particle or the elite picking up on @ehart01’s experimentation here? It seems to me that @ehart01 has definitively shown that some photons can be put into an unrecoverable hung state by merely losing connectivity with the internet. And that the state can only be “recovered” from by a watchdog that restarts the device.

If this can be demonstrated with such a straightforward experiment involving a handful of photons and a simple few lines of code, will this be investigated and hopefully fixed in a future firmware update? Or is there some way of avoiding this state?

Thank you for the very good steps to reproduce the problems you are seeing. I’m going to be focusing on these issues as my work at Particle. I will use the programs shared here as test cases to find the root cause of these Wi-Fi issues.


Just an FYI tidbit from further experimenting. May or may not be useful.

If I do a lot of I2C accesses from a different thread at the same time as network activity,
the system hangs pretty quickly.
Hang = lock up: neither my thread nor the main loop runs. Cyan LED stays at
whatever it was last set to.

I can see how, if the main OS accesses hardware IO and I access hardware IO
at the same time on different threads, that could conflict rather badly.
But I would have thought I2C was separate enough for that to work.

This sounds a little like resource locking / deadlock behavior I have seen in other venues. These kinds of bugs can be very hard to find and fix.

I haven’t looked, but was wondering if anyone has checked the locking scheme carefully?

-Steve
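
For anyone who wants to rule out their own threads colliding on a shared bus, here is a minimal sketch of one approach: wrap every bus access in a lock with a timeout, so a stuck holder shows up as a failed call instead of a silent deadlock. This is plain standard C++ for illustration. BusGuard and withBus are hypothetical names, and it says nothing about whatever locking the system firmware does internally.

```cpp
#include <chrono>
#include <mutex>

// Hypothetical guard for a shared bus (for example I2C); not a Particle API.
class BusGuard {
public:
    // Runs op while holding the bus lock.
    // Returns false, without running op, if the lock cannot be taken in time.
    template <typename Op>
    bool withBus(std::chrono::milliseconds timeout, Op op) {
        std::unique_lock<std::timed_mutex> lock(mutex_, timeout);
        if (!lock.owns_lock()) {
            return false;  // another thread held the bus for too long
        }
        op();
        return true;       // lock is released when `lock` goes out of scope
    }

private:
    std::timed_mutex mutex_;
};
```

A caller that gets false back can log it or bump a counter, which at least tells you which side was holding the bus when things went wrong.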

@jvanier Any news of progress on this issue? I am held up on releasing a product, as I am currently experiencing lockups once a day on each device! :(

Another FYI for those interested.

I decided to shut down the cloud+wifi when my critical code ran.
Every hour, it shuts off at minute 0, and turns back on at minute 10.
That worked well of course. No more conflicts with key code.

But it worked even better than expected!!
I have been running solid for almost a week now, with no drops, timeouts, hangs,
or anything.
Even loop() has been running rock solid. (I check dropped seconds per day)

So turning it off and on periodically seems to clear out the random issues.
(Hopefully. One week is not that long yet)

So @rygh, are you doing the following? Or what is your sequence (source appreciated)?

  1. Particle.disconnect() followed by WiFi.off()
  2. WiFi.on() followed by Particle.connect()

So first the bad news, it did hang again even with this code.
(total-lock up hang)
But it is still a lot better.

I don’t call Particle.disconnect() before WiFi.off(). Perhaps I should.

// Run semi automatic mode
SYSTEM_MODE(SEMI_AUTOMATIC);
SYSTEM_THREAD(ENABLED);

// Flags so that each transition below runs only once
bool        wifi_down = false;
bool        wifi_up = false;

This function is called once every second:

void    control_enet(void) {
    // Daytime window: hours 9 through 20 inclusive
    bool daytime = (mtime.curr_hour >= 9) && (mtime.curr_hour <= 20);

    // At minute 0 of each daytime hour, shut Wi-Fi (and with it the cloud) off
    if (daytime && (mtime.curr_minute == 0) && (!wifi_down)) {
        WiFi.off();
        wifi_down = true;
        wifi_up = false;
    }
    // At minute 10, power-cycle the Wi-Fi module and reconnect to the cloud
    if (daytime && (mtime.curr_minute == 10) && (!wifi_up)) {
        WiFi.off();
        WiFi.on();
        Particle.connect();
        Particle.process();
        wifi_up = true;
        wifi_down = false;
    }
    // Outside the daytime window, bring the connection up once and leave it up
    if (!daytime && (!wifi_up)) {
        WiFi.off();
        WiFi.on();
        Particle.connect();
        Particle.process();
        Particle.process();
        wifi_up = true;
        wifi_down = false;
    }
}
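
The schedule above can also be sketched as a pure function of the clock, which avoids tracking the two flags and gives the right answer if the device resets mid-hour. This is a minimal sketch, assuming the same 9-to-20 window and ten-minute outage as the code above. The name wifiShouldBeOn is hypothetical, not a Particle API.

```cpp
// Hypothetical pure function, not part of the Particle API.
// Mirrors the schedule in the post: between 09:00 and 20:59 Wi-Fi is off
// for the first ten minutes of every hour; otherwise it is on.
bool wifiShouldBeOn(int hour, int minute) {
    const bool inQuietHours = (hour >= 9) && (hour <= 20);
    const bool inQuietWindow = (minute >= 0) && (minute < 10);
    return !(inQuietHours && inQuietWindow);
}
```

The once-per-second caller would then compare this result with the current radio state and call WiFi.off() or WiFi.on() plus Particle.connect() only when the two differ.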

Dang! Sorry to hear it hung, I was hoping this was the silver bullet - I really need my photons to just work…

Yeah. I have tried many things.
Static IP, putting code on a different thread. Manual/Auto/system thread.
Watchdog resets, and so on.
Nothing gets it to 100%.

I am really down to 3 options:

  1. Wait and hope some firmware update fixes things.
  2. External hardware watchdog reset.
  3. Go back to Arduino. I would be so sad though…

Have you tested 0.5.0-rc.1 already?
It should have some WiFi tweaks on-board.

No, I have not tried it.
I did notice it, but was nervous about the Pre-Release status.
But little to lose, so sounds like a plan.


I haven't gotten to the root cause of this. I'm hoping to patch the issue this week.

I’ll be testing with 0.5.0-rc2 this week using my aforementioned code. I’ll report back with what I find.

I really like the Particle platform, so someone please tell me I’m wrong about the following…

I came across a posting about Core stability and noticed that the network issues called out are startlingly similar to those called out in this thread. Given the similarities, I’m concerned there may be a fundamental flaw in the Particle products’ design, where its reliance on the cloud may be its undoing, especially in settings where network connectivity is less than reliable.

To me, the inability to keep a Photon running reliably over time is a major issue and one which may require me to abandon the platform. Particle, it’s my sincere hope that this issue can be resolved - please reach out if help is needed to debug.

@jvanier
With the update to 0.5.0, I appreciate the 60 second timeout feature that was added to WiFi.connect. I’ve had issues with Photon connectivity due to poor RF reception of the WiFi signal. An improvement to this would be the following code:

// Declarations assumed for this snippet; the buffer size of 20 is an arbitrary choice
const size_t Max_number_of_APs_scanned = 20;
WiFiAccessPoint ap[Max_number_of_APs_scanned];
WiFiAccessPoint stored_WiFi_creds[5];
bool OK_to_connect_WiFi = false;

int found = WiFi.scan(ap, Max_number_of_APs_scanned);           // scan all available SSIDs in range
int stored_SSIDs = WiFi.getCredentials(stored_WiFi_creds, 5);   // get all the stored SSIDs

// Two loops are needed: each scanned AP is compared against each stored SSID.
// The outer loop stops as soon as a match has been found.
for (int i = 0; i < found && !OK_to_connect_WiFi; i++) {
    for (int x = 0; x < stored_SSIDs; x++) {
        if (strcmp(ap[i].ssid, stored_WiFi_creds[x].ssid) == 0) {
            OK_to_connect_WiFi = true;   // both strings are the same
            break;
        }
    }
}

// Only connect if a saved SSID matches one of the SSIDs in range
if (OK_to_connect_WiFi) {
    WiFi.connect();
}

With code like this, the Photon wouldn’t sit through a 60 second timeout when none of the stored WiFi SSIDs are in range. In the real world, where spotty WiFi connectivity is a real issue, a 60 second blocking call can be debilitating.
Sound like a good addition?
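
The matching step on its own can be sketched without any hardware, which makes it easy to check on a desktop. This is a minimal sketch in standard C++ that uses plain strings in place of the scan and credential results. The name hasKnownNetwork is hypothetical, not a Particle API.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of the Particle API.
// Returns true if at least one scanned SSID matches a stored SSID.
bool hasKnownNetwork(const std::vector<std::string>& scanned,
                     const std::vector<std::string>& stored) {
    for (const std::string& seen : scanned) {
        for (const std::string& known : stored) {
            if (seen == known) {
                return true;  // early exit on the first match
            }
        }
    }
    return false;
}
```

Returning from inside the loops gives the same early exit as the goto, and an empty scan or an empty credential list simply yields false.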