Electron Triggers Watchdog on TCP client.connect in Poor Cell Service

TLDR
Sometimes, when my device tries to establish a TCP client connection where cell service is spotty or unavailable, calling _client.connect(…) appears to block forever, triggering the ApplicationWatchdog after a minute. In a subset of those cases, this repeats in a seemingly infinite loop that requires a manual reset. I’m trying to figure out what conditions might cause this and how to mitigate it.

Relevant Configuration
SYSTEM_THREAD(ENABLED)
SYSTEM_MODE(AUTOMATIC)
All network-related functionality is called within the context of loop()

General Device Functionality
The basic functionality of the device is that it accepts data via a serial input, and then sends packets of data regularly (every few seconds) to our servers. Essentially a data logging device. The serial input is handled by a dedicated thread to ensure responsiveness, with a ring buffer to pass data to the main thread. The main thread essentially does 6 things:

  0. Perform one-time network initialization (keepAlive, cloud functions, last reset notes)
  1. Connect to our MQTT broker if not already connected
  2. Service the MQTT connection (acknowledgement msgs, detect disconnection)
  3. Attempt to publish a packet of real time data if MQTT connected
  4. If unsuccessful or not connected, save to a larger buffer
  5. Attempt to publish a packet of data from the buffer if MQTT connected
    (then go back to 1)
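
For anyone who wants to reason about steps 3 to 5 without the hardware, here is a minimal desktop C++ sketch of the publish-or-buffer flow. It is not my firmware: PacketSender, PublishFn, and the drop-oldest policy when the backlog is full are all assumptions made for illustration, and the publish callback is a hypothetical stand-in for the MQTT client.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for the MQTT publish call: returns true on success.
using PublishFn = std::function<bool(const std::string&)>;

class PacketSender {
public:
    PacketSender(PublishFn publish, std::size_t maxBacklog)
        : publish_(std::move(publish)), maxBacklog_(maxBacklog) {
        assert(maxBacklog_ > 0);
    }

    // Steps 3 and 4: try to publish live data; if that fails, keep it for later.
    void sendRealTime(const std::string& packet, bool connected) {
        if (connected && publish_(packet)) return;
        if (backlog_.size() == maxBacklog_) backlog_.pop_front();  // assumed policy: drop oldest
        backlog_.push_back(packet);
    }

    // Step 5: send at most one buffered packet per pass so each pass stays short.
    void sendBuffered(bool connected) {
        if (!connected || backlog_.empty()) return;
        if (publish_(backlog_.front())) backlog_.pop_front();
    }

    std::size_t backlogSize() const { return backlog_.size(); }

private:
    PublishFn publish_;
    std::size_t maxBacklog_;
    std::deque<std::string> backlog_;
};
```

Draining only one buffered packet per pass is a design choice in this sketch: it bounds the time spent in a single pass of loop(), at the cost of clearing a large backlog more slowly.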

How the Problem Arises

  1. Turn on device
  2. Allow it to connect to the cloud/MQTT as normal
  3. Put the antenna in a Faraday cage to attenuate the signal to the point where it loses the cellular connection after a few minutes.
  4. Allow it to reconnect again, but with the antenna only partially in the Faraday cage (i.e. not as strongly attenuated)
  5. Repeat 3-4 every once in a while until the error condition happens in this block of code, which is called by loop():

Error condition details

void checkConnection() {
    // Update connectivity LED and reconnect to MQTT if necessary
#if Wiring_WiFi
    if (WiFi.ready() == false)
#elif Wiring_Cellular
    if (Cellular.ready() == false)
#endif
    {
        client.isConnected();
        #if Wiring_Cellular
        led_cell.setColor(black.withBrightness(default_brightness));
        #endif
    }
    else
    {
        updateCellularLED();

        // Check Particle Cloud
        if (Particle.connected())
        {
            lastTimeParticleCloud = millis();
        }
        else
        {
            debugPrint(MSG_TYPE_DEBUG, "Reconnecting to Particle Cloud");
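            Particle.connect();
            // Note: with SYSTEM_THREAD(ENABLED), Particle.connect() is expected to return
            // before the connection completes, so this immediate check will usually
            // report failure even if the reconnect succeeds shortly afterwards.
            if (Particle.connected()) debugPrint(MSG_TYPE_DEBUG, "Successfully Reconnected to Particle Cloud");
            else debugPrint(MSG_TYPE_DEBUG, "Reconnection to Particle Cloud Failed");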
        }


        // Check MQTT
        if (MQTT_enabled)
        {
            // Only attempt to reconnect to MQTT if we want to be doing that.
            // Used with the cloud function to turn MQTT connectivity on and off.

            if (client.isConnected())
            {
                updateMQTT();
            }
            else
            {
                debugPrint(MSG_TYPE_DEBUG, "Connection regained, attempting to reconnect to MQTT");
                client.connect(MQTT_ID);
                if (client.isConnected()) debugPrint(MSG_TYPE_DEBUG, "MQTT Successfully Reconnected");
                else debugPrint(MSG_TYPE_DEBUG, "MQTT Reconnection Unsuccessful");
                debugPrint(MSG_TYPE_DEBUG, "MQTT looping");
                updateMQTT();
                debugPrint(MSG_TYPE_DEBUG, "MQTT loop complete");
            }
        }
        else
        {
            // Don't want to be connecting to MQTT, disconnect if connected
            if (client.isConnected())
            {
                debugPrint(MSG_TYPE_DEBUG, "Turning off MQTT...");
                client.disconnect();
                debugPrint(MSG_TYPE_DEBUG, "MQTT looping");
                updateMQTT();
                debugPrint(MSG_TYPE_DEBUG, "MQTT loop complete");
            }
        }

    }
}

At this point the device is blinking cyan.
When this function is called, Cellular.ready() == true and Particle.connected() == false.
After the Particle.connect() attempt, Particle.connected() is still false (so nothing has changed).
Moving down into the MQTT code, the client isn’t connected, so we move into the else block.
My serial terminal outputs: Connection regained, attempting to reconnect to MQTT
The device then executes the line client.connect(MQTT_ID);
Then, 60 seconds later, the ApplicationWatchdog triggers and resets the device.
During that 60-second window the device continues to blink cyan.

Here is the first bit of the code called by client.connect(MQTT_ID):

bool MQTT::connect(const char *id, const char *user, const char *pass, const char* willTopic, EMQTT_QOS willQos, uint8_t willRetain, const char* willMessage, bool cleanSession) {
    if (!isConnected()) {
        int result = 0;
        // debug_print("Starting MQTT connect code - TCP _client.connect()...\n");
        setCB(2001);
        if (ip == NULL)
            result = _client.connect(this->domain.c_str(), this->port);
        else
            result = _client.connect(this->ip, this->port);
        // debug_print("finished TCP _client.connect()\n");
        setCB(2002);
        if (result) {
            // debug_print("TCP _client.connect() successful\n");
            setCB(2003);
            nextMsgId = 1;
            uint8_t d[9] = {0x00,0x06,'M','Q','I','s','d','p',MQTTPROTOCOLVERSION};
            // Leave room in the buffer for header and variable length field
            uint16_t length = 5;
            .
            .
            .

The code specifically hangs on _client.connect(this->ip, this->port);

_client is a TCPClient object.

I’ve had this scenario trigger after a couple hours, after 20 minutes, and even once within 2 minutes after flashing new firmware (but with no power cycle).

Concluding Questions
It seems like some edge case is causing the TCPClient::connect(domain, port) function to hang for more than 60 seconds, maybe forever. This seems to happen when the device thinks it is connected to the cellular network and is seemingly attempting to connect to the Particle Cloud (the flashing cyan). It only occurs when the cell signal is poor (~-110 dB or lower) or nonexistent. The condition sometimes persists through a software reset (without a modem reset).

Has anyone seen anything like this before? Am I making any bad assumptions in the way I’m structuring my network management? I have some suspicions about calling Particle.connect() in multi-threaded mode, but I was seeing similar behavior previously when I did not have that line of code, though I never explicitly tested it as I have above.

Update:

It appears from some other posts that the maximum timeout for the TCP Client connect function is 60 seconds, the same as my watchdog, which explains the reset. I’d put it on a wishlist that these timeouts for blocking functions would be documented, but my expectations are low.
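
To make the arithmetic explicit, here is a small sketch of the check I should have done up front: the watchdog timeout has to exceed the worst-case blocking time of one pass through loop(), plus some margin. The 60-second connect figure comes from those other posts and is not something I have verified; the helper names, the 2-second extra call, and the 5-second margin are assumptions for illustration only.

```cpp
#include <cstdint>
#include <initializer_list>

// Sum of the worst-case blocking times (in ms) of the calls made in one loop() pass.
inline uint64_t worstCasePassMs(std::initializer_list<uint32_t> blockingCallsMs) {
    uint64_t total = 0;
    for (uint32_t ms : blockingCallsMs) total += ms;
    return total;
}

// True if the watchdog cannot fire during a worst-case pass, with marginMs to spare.
inline bool watchdogHasMargin(uint32_t watchdogMs, uint64_t passMs, uint32_t marginMs) {
    return watchdogMs >= passMs + marginMs;
}
```

With these assumed numbers, watchdogHasMargin(60000, worstCasePassMs({60000}), 5000) is false, which matches the reset I am seeing, while a 90-second watchdog with worstCasePassMs({60000, 2000}) would pass the check.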

New question:
That being the case, is there any possible way to make the TCP client connection in a non-blocking (or minimally blocking) manner? Is there a more surefire way of checking connectivity ahead of time so that I don’t attempt the connection when it’s likely to fail? I already check for Cellular.ready() == true, but I have a Particle.connect() attempt before it. Would a second Cellular.ready() check be warranted immediately before the TCP client connect?
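
One mitigation I am considering in the meantime is sketched below. It does not make the connect call non-blocking; it only limits how often the blocking call is attempted when it is likely to fail, by requiring the link to have been ready for a settle period and by backing off after each failure. ReconnectGate and all of its parameters are hypothetical names, and whether Cellular.ready() is a good enough "link ready" signal is exactly the open question above.

```cpp
#include <cstdint>

class ReconnectGate {
public:
    // Assumes minBackoffMs <= maxBackoffMs.
    ReconnectGate(uint32_t settleMs, uint32_t minBackoffMs, uint32_t maxBackoffMs)
        : settleMs_(settleMs), minBackoffMs_(minBackoffMs),
          maxBackoffMs_(maxBackoffMs), waitMs_(minBackoffMs) {}

    // Call once per loop() pass; returns true when a connect attempt is allowed.
    bool shouldAttempt(uint32_t nowMs, bool linkReady) {
        if (!linkReady) { readySeen_ = false; return false; }
        if (!readySeen_) { readySeen_ = true; readySinceMs_ = nowMs; }
        // Unsigned subtraction stays correct across millis() wraparound.
        if (nowMs - readySinceMs_ < settleMs_) return false;
        if (failures_ > 0 && nowMs - lastFailureMs_ < waitMs_) return false;
        return true;
    }

    // Report the outcome of an attempt that finished at nowMs.
    void onResult(uint32_t nowMs, bool success) {
        if (success) { failures_ = 0; return; }
        lastFailureMs_ = nowMs;
        if (failures_ == 0) waitMs_ = minBackoffMs_;
        else waitMs_ = (waitMs_ >= maxBackoffMs_ / 2) ? maxBackoffMs_ : waitMs_ * 2;
        ++failures_;
    }

private:
    uint32_t settleMs_, minBackoffMs_, maxBackoffMs_, waitMs_;
    uint32_t readySinceMs_ = 0, lastFailureMs_ = 0, failures_ = 0;
    bool readySeen_ = false;
};
```

In checkConnection(), the idea would be to call shouldAttempt(millis(), Cellular.ready()) right before client.connect(MQTT_ID) and to report the result with onResult() afterwards. The watchdog would still need to be longer than a single worst-case connect attempt.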

I heard @ScruffR say Particle just hired a guy specifically for keeping the documents updated so that may help raise your expectations :wink:
