Network hangs with slow cyan blink

photon
Tags: #<Tag:0x00007fe21ffd0968>

#1

I have 8 Photons that are running 24x7 in a wake/sleep cycle (awake 60s sleep 60s) sending telemetry. About once every three to four weeks they all end up (simultaneously) in a network hang state where the status LED blinks cyan slowly at about 0.25s rate - this is not the rapid blink normally seen when connecting to the cloud. When in this state, the code continues to run and the wake/sleep cycle continues but there is never a cloud connection. They always wake in this state and the code to disconnect from cloud or turn WiFi off seems to have no effect. Here is a video (GIF) of some stalled Photons:

So first of all I cannot find a definitive reference that tells me what this slower cyan blink actually means. Anyone know for sure?

And of course the most important question is how do we get there? In this state, the only way to get the device back is a hard reset. Here is the framework of the code:

SYSTEM_MODE(SEMI_AUTOMATIC);
SYSTEM_THREAD(ENABLED);

void setup() {
  // ... SNIP
  WiFi.on();
  WiFi.selectAntenna(ANT_INTERNAL);
  WiFiAccessPoint ap[5];
  int iFound = WiFi.getCredentials(ap, 5);
  // ... code to set default credentials
  // ... SNIP
}

void loop() {
  bool bCloudConnected = false; 
  unsigned long int uLoopStartTime, uRunningLoopTime, uWaitTime, uConnectStart;
/*
 * TIME CRITICAL ITEMS
 */
  System.disableUpdates();
  // ... SNIP-time critical measurements
/*
 * INITIATE CLOUD CONNECTION
 */
  uConnectStart = millis();               // so we can time-out from a connection
  if (Particle.connected() == false) {
    WiFi.on();                            // does not block
    Particle.connect();                   // this does not block
  }
  // ... SNIP-less time critical measurements
  // wait for the connection to the cloud if not already connected
  while (Particle.connected() == false) {
    if (millis() - uConnectStart > MAX_WAIT_FOR_CLOUD) break; // check if we've been waiting too long
  }
  if (Particle.connected()) {
      bCloudConnected = true;
  } else {   // not connected
      // if wifi is ready but not connected, cloud failure, not wifi failure
      if (!WiFi.ready()) diagnosticData.uWifiConnectFailures++;
      Particle.disconnect();  // stop trying to connect
      delay(20);
      WiFi.off();             // abort this attempt and save power
      bCloudConnected = false;
  }
/*
 * PUBLISH TELEMETRY 
 */
  if (bCloudConnected) {
    bRetVal = pushTelemetry();
  }
  else {
    // *** not connected so ...
  }
/*
 * WAKE WAITING
 */
  // ... SNIP-wait until wake time has expired while doing some variable monitoring
  
/*
 * START of SLEEP TIME
 */
  if (Particle.connected() == true) { // need to shutdown the cloud so when returning from sleep we don't wait for cloud connect
    Particle.disconnect();
    delay(20);  // make sure there is time to disconnect
  }
  WiFi.off();
  delay(100);   // to make sure there is enough time for actually to turn off
  System.sleep(WAKE_UP, RISING, uWaitTime);
}

Using Device OS 1.4.0.


#2

…sorry, can’t seem to upload the 20MB GIF file of the devices in this state…

Here’s a link to the video -> http://www.bigwindow.net/wp-content/uploads/IMG_0306_Trim.mp4


#3

Iiiiinteresting! Can you PM me a device ID and a timestamp or two at which this occurred? The simultaneity of this issue bespeaks some network condition, but it could be something funkier. I’m assuming you reset all of these devices at the same time as well (so their sleep/wake times are synced-ish)?

Would be great to add some logging to one of these devices to try and catch it in action.


#4

Is this blink not the normal speed before the rapid cyan handshake follows?
Usually that’s only very short so you won’t see it for a long time.


#5

I assume probably so, but for us that never lasts any more than like a tenth of a second. It is the same frequency as the green blink while the device is attaching to the WiFi. You can see it in the video link I posted above.


#6

I agree that the simultaneity points to some network condition, perhaps the inability to connect to the cloud after attaching to WiFi. We purposely keep them out of sync with wake/sleep cycles because that random sort of operation will be typical when we hit field trial in a month or so. I’ll try to find a timestamp for the occurrence - and I’ve left one Photon in this hung state so I should be able to go back to its last report in time/date.


#7

@marekparticle - not sure how to PM you; seeing if this is how to accomplish it…nope, that’s not it :frowning:


#8

When you click on Marek’s avatar you should see this

Then hit the Message button


#9

Yep - looks exactly like that EXCEPT there is no [Message] button. Even tried a different browser (Chrome and Firefox) and get the same thing. And did the inspect element to see if it was hidden in there; but it’s not there at all. Geez; everything I’m working on is breaking :wink: but at least I’m not crazy after looking for an hour how to do a PM :slight_smile: because it seems I really can’t!


#10

Did you receive my message yesterday??


#11

Can you try again?


#12

Sorry, I did not, only your post above. I have the time stamp when all this occurred and the IDs ready to send to you but no private way to do it. For some reason, the [Message] option does not show up in your pop-up card.

Haha - just now it does - magic! I’ll send the info over…


#13

Hi, thanks for your helpful and detailed message! I checked your devices and I really see no unusual behavior in our backend. The device still in this state is not (visibly) interacting with the Cloud, so I again defer to a network condition here. Logs would be amazing, the sooner the better just in case. Grab a test device and add a loghandler. I assume DFU-flashing to the device will just resurrect it, right? (Which means that instituting a new session solves the issue…)

(Something may be going on with Sleep - though I’m just working on two Sleep-related connectivity issues and that’s where my head’s at today. :slight_smile: I suspect the disconnect protocol prior to calling Sleep might be potentially insufficient, hm.)


#14

Yes, loading new firmware will certainly resurrect the remaining Photon stuck in this mode. I’ll add a loghandler and fire everything up again and hopefully whatever the event is that triggers this will show up again - sooner rather than later. And, yes, strangely, the device that continues in this hung state never makes it to the cloud.

So the slower blinking of cyan is just the precursor to the rapid blinking which indicates connecting to the cloud which @ScruffR mentioned? Is there something in DeviceOS that cloud possibly keep it from moving from this slower blink to the rapid blink?

Also agree about the assessment about tearing down the network before sleep. Obviously, in this hung mode, that teardown code does not work because the code executes and the hang mode stays in place when the device returns from sleep. Is there a better teardown sequence that will force a disconnect and WiFi off?


#15

Some more information on the network hang mode:

  1. During the 4 days of being stuck in this mode, I can verify that the device never connected to the WiFi router while going through over 2000 wake/sleep cycles. Even though the slower cyan blink would seem to indicate that the WiFi connect part has already occurred.
  2. After approximately 26s of wake time, the cyan blink goes back to the green blink at the same frequency. It stays blinking green for about 6s and returns to the cyan blink for the remainder of the wake time. During the green blink, the device does not connect to WiFi. And, the 26s does not correlate to our abort cloud connect which is set to 30s.

#16

Have you tried Tinker at all?
Starting to debug such an issue would be easier with knowing whether a known to be good application can connect. This would help tick off any external factors.
As a next step posting a minimal test case - which can be posted here in its entirety - that exhibits the erronous behaviour would help us to test the same code (not a maybe close enough version) on our own devices.

Also

This would better be done via

  if (waitFor(Particle.connected, MAX_WAIT_FOR_CLOUD)) {
    ...
  }
  else {
    ...
  }

Your bRetVal = pushTelemetry(); and the respective else part could also fit into the two waitFor() branches.

You are using a flag bCloudConnected but even after setting it true you are still using Particle.connected() for checking, why? That seems somewhat inconsistent.


#17

At this point we are way beyond Tinker. We’ve had 10+ devices running 24/7 since July. Early on, we pushed out OTA FW updates 1 or more times a week and thus never saw this issue. As we get close to beta, the devices run for 2+ weeks and now we just start to see the issue (only twice since middle of September.) One device makes 720 WiFi/cloud connections a day so we see 50,000+ successful connections in a week. If we see this once in 3 weeks then we’re looking for a bad attempt out 150,000+ or the proverbial needle in a haystack.

To tell you the honest truth, we don’t need to find out the disruption that causes this. (Although it sure would be nice to know what DeviceOS is doing during the 0.25s cyan blink.) We know the field is going to be full of disruptions. We just have to operate through them. I totally agree with your observations on the code and using waitFor(Particle.connected, MAX_WAIT_FOR_CLOUD), but I’m not sure that will keep this hung condition from happening.

To operate through this disruption, we are concentrating on the code in two areas; immediately before sleep that currently does not tear down the network making it so that we wake able to make a fresh attempt (a guarantee that the network is down and off):

/*
 * START of SLEEP TIME
 */
  if (bCloudConnected) { 
    Particle.disconnect();
    delay(20);  // make sure there is time to disconnect
  }
  WiFi.off();
  delay(100);   // to make sure there is enough time for actually to turn off
  System.sleep(WAKE_UP, RISING, uWaitTime);
}
  1. Do we need to call Particle.disconnect() before turning off WiFi? Maybe there is no point because WiFi.off() will do this.
  2. There doesn’t seem to be a way to check that WiFi is actually off - is that true? That precludes us from hanging out here in a loop waiting for it to occur and if it doesn’t give ourselves a hard reset.
  3. Is there a known time it takes for WiFi.off() that we should wait that will guarantee it is actually off before sleeping?

The second area is waiting for the cloud connection after the time critical code has executed. Thinking we could check if we haven’t made it to the cloud for some number of cycles and if not, give ourselves a hard reset.

  1. Would Particle.timeSyncedLast() give us that last time we were connected to the cloud? Assuming that time sync happens during each fresh connection. Then Time.now() - Particle.timeSyncedLast() gives us how long since last being connected to the cloud.

Thanks for your attention, we really appreciate the insight.


#18
  1. normally not, WiFi.off() should suffice - but there have been releases where the shutdown process wasn’t executing quite as clean as intended and regressions are possible (although that would be a candidate for a default test case).
    However, in case of a regression or other causes for the radio not being torn down properly, I’d have to defer to support (i.e. @marekparticle) to investigate together with you.

  2. & 3. nope, there isn’t but 1. should also not require it either

About Particle.timeSyncedLast(): Your reasoning sounds right, but I can’t promise how relyable it is (e.g. when calling Time.setTime() or reusing a previous cloud session with only short disconnects in between).


#19

This turns out to be repeatable for me when I remove the internet connection on the Sierra Wireless AirLink MP70. So WiFi is still intact but there is no internet. (I simulate removal of internet by removing the cellular antenna.) When the Photons come out of wake, take measurements and then try to connect to the cloud, they can’t and stay in this slower blinking cyan state right up until sleep time and go to sleep in this mode. So the Particle.disconnect() and WiFi.off() lines don’t do the job.

So I built code that strips everything out except the cloud connect and wake sleep cycle and this goes into the same network hang state and is also repeatable for me - mostly - as maybe 1 out of 6 or 8 tries it manages to recover on its own. Here is the code:

#include "particle.h"

SYSTEM_MODE(SEMI_AUTOMATIC);
SYSTEM_THREAD(ENABLED);

#define MAX_WAIT_FOR_CLOUD 30000
#define HARDCODED_WAIT_TIME 60000
#define HARDCODED_SLEEP_TIME 60

void setup() {
  // ... SNIP
  pinMode(D7,OUTPUT);
  WiFi.on();
  WiFi.selectAntenna(ANT_INTERNAL);
  //WiFiAccessPoint ap[5];
  //int iFound = WiFi.getCredentials(ap, 5);
  // ... SNIP-code to set default credentials
  delay(10);
}

void loop() {
  bool bCloudConnected = false; 
  unsigned long int uLoopStartTime, uRunningLoopTime, uWaitTime, uConnectStart;
/* TIME CRITICAL ITEMS */
  digitalWrite(D7,HIGH);
  System.disableUpdates();
  uLoopStartTime = millis();
  // ... SNIP-time critical measurements
  delay(100);
  digitalWrite(D7,LOW);
/* INITIATE CLOUD CONNECTION */
  uConnectStart = millis();
  if (Particle.connected() == false) {
    WiFi.on();                            // does not block
    Particle.connect();                   // this does not block
  }
  // ... SNIP-less time critical measurements
  delay(750);
  while (Particle.connected() == false) {
    if (millis() - uConnectStart > MAX_WAIT_FOR_CLOUD) break; // check if we've been waiting too long
  }
  if (Particle.connected()) {
      bCloudConnected = true;
  } else {
      // not connected
      Particle.disconnect();  // stop trying to connect
      delay(20);
      WiFi.off();             // abort this attempt and save power
      bCloudConnected = false;
  }
/* WAKE WAITING */
  uWaitTime = HARDCODED_WAIT_TIME - (millis() - uLoopStartTime);
  uRunningLoopTime = millis() - uLoopStartTime;
  while (uRunningLoopTime < uWaitTime) {
    uRunningLoopTime = millis() - uLoopStartTime;
    // SNIP-just hanging out doing simple variable monitoring
    delay(50);
  } 
/* START of SLEEP TIME */
  if (Particle.connected()) { // need to shutdown the cloud so when returning from sleep we don't wait for cloud connect
    Particle.disconnect();
    delay(20);  // make sure there is time to disconnect
  }
  WiFi.off();
  delay(100);   // to make sure there is enough time for actually to turn off
  System.sleep(WKP, RISING, HARDCODED_SLEEP_TIME);
}

I suppose you could simulate the same thing on any wireless router by removing the WAN connection. I removed the connection while the device was asleep to force it into this network hang mode. Interesting that the FIRST time after waking, there is a short yellow blink after several seconds of trying to reach the cloud. But once it goes to sleep in the network hung state, that yellow blink doesn’t re-occur.

I’m curious if you can get the same hung state. I’m using DeviceOS 1.4.0.


#20

This sounds very much like an issue I’ve seen in the past

Although this was solved and closed does the description there resemble what you are seeing?
I thought there was a test for this introduced after that.