0.8.0-rc.11 hard fault on WiFi reconnect with unstable network (P1)


#1

Hi,

I was wondering if anyone has encountered this issue before or has any insight into why it is happening.

Version: 0.8.0-rc.11 on a P1 module

Test case code: test.ino

/*
 * Project test
 * Description:
 * Author:
 * Date:
 */

SYSTEM_THREAD(ENABLED);
SYSTEM_MODE(SEMI_AUTOMATIC);

#define disconnectDelay 13000 // After 13 seconds of no WiFi, disconnect
#define checkInterval 45500 // How often to check for WiFi once disconnected

unsigned long lastAttempt, lastDisconnect;
bool isDCOnce = true;
bool isWifiOn = true;

// setup() runs once, when the device is first turned on.
void setup() {
  // Put initialization like pinMode and begin functions here.
  Serial.begin(9600);
  Particle.connect();

}

// loop() runs over and over again, as quickly as it can execute.
void loop() {
  // The core of your code will likely live here.

  if (Particle.connected())
  {
    isDCOnce = true;
  }

  if (!WiFi.listening() && !Particle.connected())
  {
    delay(100);  // <--- commenting this out causes a hard fault

    checkConnection();
  }

}

void checkConnection()
{
  if (Particle.connected() == false && millis() - lastDisconnect >= disconnectDelay)
  {
    if (isDCOnce)
    {
      WiFi.off();
      Serial.printlnf("System version: %s",System.version().c_str());
      Serial.printlnf("(checkConnection) wifi disconnected lastDisconnect %lu", lastDisconnect);
      isDCOnce = false;
      isWifiOn = false;
    }
    lastDisconnect = millis();
  }

  if (Particle.connected() == false && millis() - lastAttempt >= checkInterval)
  {
    if (!isWifiOn)
    {
      WiFi.on();
      Serial.println("(checkConnection) wifi on");
      isWifiOn = true;
    }
    if (WiFi.ready())
    {
      Particle.connect();
      Serial.println("(checkConnection) particle connect");
    }

    Serial.printlnf("(checkConnection) check wifi lastAttempt: %lu", lastAttempt);
    lastAttempt = millis();
  }
}

Test procedure:

  1. Set up the P1 with WiFi credentials
  2. Let the P1 connect to WiFi
  3. Disable the WiFi network (the P1 disconnects from the WiFi network)
  4. Re-enable the WiFi network (a hard fault SOS happens when delay(100); is commented out)

Can anyone reproduce this issue?

My questions:

  1. Why do I need a delay(100); to prevent the SOS hard fault?
  2. Does adding delay(100); prevent the SOS hard fault 100% of the time in this case, or does the required delay vary?

Thanks


#2

I was able to reproduce the SOS but it’s not obvious why it happens. Since it’s so easily worked around and it’s a fairly uncommon use case (manually managing the connection in that way), it’s probably not a high priority to be fixed at this time, so I’d just add the delay.


#3

@rickkas7, glad you can reproduce it. What delay value would you suggest? Is 100 a good delay for production use? Do you see any problem releasing a product with pre-release 0.8.0-rc.11?

I just don’t want any more product returns due to hard fault SOS on unstable WiFi networks. It happened with WiFi.RSSI() in 0.7.0 and now this with 0.8.0-rc.11.

Thank You.


#4

0.8.0-rc.11 should be fine. Since the exact value is not known, and the fault would only occur when the Wi-Fi went away, I’d probably make the delay 500 milliseconds just to be safe. The extra 400 ms won’t be noticeable and you’d have some extra safety buffer just in case.
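
For reference, the workaround could look roughly like the sketch below. It keeps the condition from the test case in the first post and raises the delay to 500 ms. kGuardDelayMs and needsGuardedCheck are hypothetical names introduced only for this illustration, and 500 ms is a safety margin, not a measured threshold.

```cpp
// Guard delay suggested in this thread. The exact safe value is not known,
// so this is a margin rather than a measured limit.
constexpr unsigned long kGuardDelayMs = 500;

// Hypothetical helper: true when loop() should pause and then run
// checkConnection(), i.e. the device is neither in listening mode
// nor connected to the cloud (same condition as the test case).
inline bool needsGuardedCheck(bool listening, bool cloudConnected) {
    return !listening && !cloudConnected;
}

// In the firmware loop() this would be used roughly as:
//
//   if (needsGuardedCheck(WiFi.listening(), Particle.connected())) {
//       delay(kGuardDelayMs);   // pause before touching WiFi.off()/WiFi.on()
//       checkConnection();      // function from the test case in post #1
//   }
```

The condition is split out of loop() only so it can be checked without a device; the behaviour on the P1 is the same as the original if statement with a longer delay.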


#5

@sheng @rickkas7 Hey guys, I am trying to solve a similar issue (I think). I have production units going offline after a few days and becoming really laggy. However, I have noticed that the only thing still working in those cases is the interrupts. Is this a case of “hard faulting”? I am running 0.7.0 on my products and using RSSI.


#6

Hard faulting would be when the LED starts blinking red with SOS error codes (1 for hard fault, I believe).


#7

In my product, I don’t have access to the RGB LED without powering it off. I am finding it hard to recreate the bug in the office.


#8

I’m in a similar situation where I can’t seem to reproduce the problem reliably in my office, but a nice way to recover was suggested to me. Check out:


#9

@Div_IoT, in 0.7.0, WiFi.RSSI() will cause a hard fault on an unstable WiFi network. It should be easy to reproduce with a simple test case. I did that before: if you call WiFi.RSSI() without first checking WiFi.ready() and Particle.connected(), the fault will happen. Even a system event handler will not catch this error. To minimize it, check WiFi.ready() and Particle.connected() first; that is what I used, and it prevents the fault 99% of the time. That is why I was moving to 0.8.0-rc.11. Below is more information on 0.7.0. I hope this helps; this is in SEMI_AUTOMATIC mode.


#10

Thanks for your input. I managed to recreate the issue with RSSI on a dev Photon. Is there a possible situation where the hard fault would get stuck and not automatically reset? I am experiencing that with my products: they require a physical reset (my products need to run reliably for 2-3 weeks at a time). The only thing still working when they are in hard fault is the interrupts. (Side note: I can’t tell if they are in fault mode because I do not have access to the RGB LED, but it sounds like that is the issue.) My products are P1s.

What final workaround did you use? Did you upgrade to 0.8.0, or did you implement RSSI differently?

Thanks for the help, everyone.


#11

@Div_IoT, from my debugging of the WiFi.RSSI() hard fault in 0.7.0, the device will automatically reset. But if you have an unstable network, it may go through this loop continuously.

I think it would be a good idea, if you have a product at hand, to solder the RGB LED on, reproduce the problem, and observe the RGB SOS pattern to make sure it is a hard fault. It could be some other issue.

The test procedure is above.

For 0.7.0, this works 99% of the time:

if (WiFi.ready() && Particle.connected())
{
      WiFi.RSSI();
}

You may want to try 0.8.0-rc.11: WiFi.RSSI() will not cause a hard fault there, but you may need the delay described above. Also, if WiFi.off(); is not called, you will not need the delay. It depends on what your product needs.
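
The 0.7.0 guard above can also be wrapped in a small helper so every RSSI read goes through the same check. This is only a sketch: readRssiIfConnected and its out-parameter are hypothetical names, and it assumes the result of RSSI() converts to int. It is written as a template so it can be tried off-device with stand-in objects. Like the plain if statement, it only narrows the window for the fault; the connection can presumably still drop between the check and the call.

```cpp
// Hypothetical helper: reads the RSSI only when both the WiFi link and the
// cloud connection report as up. Returns false (and leaves rssiOut
// untouched) when the read was skipped.
//
// On a device this would be called as:
//   int rssi = 0;
//   if (readRssiIfConnected(WiFi, Particle, rssi)) { /* use rssi */ }
template <typename WiFiT, typename CloudT>
bool readRssiIfConnected(WiFiT& wifi, CloudT& cloud, int& rssiOut) {
    if (wifi.ready() && cloud.connected()) {
        rssiOut = wifi.RSSI();  // assumption: the result converts to int
        return true;
    }
    return false;  // skipped: RSSI() was not called
}
```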

Hope this helps.


#12

Thanks for reporting this issue. It is likely caused by a bug in our sockets implementation and is not directly related to delays in a user application. Could you please try this branch and check whether it helps? We will include the fix in the upcoming release.