Particle continuously loops trying to connect: "Event loop error 21"

In one particular situation our Electrons just go around in a loop when we call Particle.connect().

Context:
Electron (3 different devices)
Firmware 0.8.0-rc.11
SYSTEM_THREAD(ENABLED);
SYSTEM_MODE(MANUAL);

Here’s the log which shows it happening:

0000352734 [system] INFO: Cloud: connecting
0000353005 [app.deferredUpdate] ERROR: Couldn't clear session
0000353006 [system] INFO: Read Server Address = type:1,domain:$id.udp.particle.io
0000354005 [app.deferredUpdate] ERROR: Update timed out...
0000354006 [app.deferredUpdate] ERROR: Will force handshake and try with a fresh boot
0000354199 [system] INFO: Resolved host 320033001047373334363431.udp.particle.io to 18.206.150.141
0000354219 [system] INFO: Cloud socket connected
0000354220 [system] INFO: Starting handshake: presense_announce=0
0000354220 [comm.protocol.handshake] INFO: Establish secure connection
0000354231 [comm.dtls] INFO: (CMPL,RENEG,NO_SESS,ERR) restoreStatus=2
0000361673 [comm.protocol.handshake] INFO: Sending HELLO message
0000362961 [comm.protocol.handshake] INFO: Handshake completed
0000362962 [system] INFO: Send spark/device/claim/code event
0000363187 [system] INFO: Send subscriptions
0000363398 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 4
0000363398 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 3
0000363399 [comm] INFO: Sending TIME request
0000363692 [comm.protocol] INFO: Sending 'M' describe message
0000363937 [comm.protocol] INFO: rcv'd message type=1
0000363937 [system] INFO: Cloud connected
0000364222 [comm.protocol] INFO: rcv'd message type=13
0000364404 [comm.protocol] INFO: Sending 'S' describe message
0000364704 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 4
0000364722 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 3
0000364724 [comm.protocol] INFO: rcv'd message type=1
0000364888 [comm.protocol] INFO: rcv'd message type=13
0000365050 [comm.protocol] INFO: Received TIME response: 1541804046
0000365052 [comm.protocol] INFO: rcv'd message type=12
0000365964 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 1
0000365966 [comm.protocol] INFO: rcv'd message type=9
0000365966 [comm.protocol] ERROR: Event loop error 21
0000365966 [system] WARN: Communication loop error, closing cloud socket
0000365968 [system] INFO: Cloud: disconnecting
0000365968 [system] INFO: Cloud: disconnected
0000366068 [system] INFO: Cloud: connecting
0000366341 [app.deferredUpdate] ERROR: Couldn't clear session
0000366342 [system] INFO: Read Server Address = type:1,domain:$id.udp.particle.io
0000366375 [system] INFO: Resolved host 320033001047373334363431.udp.particle.io to 18.206.150.141
0000366395 [system] INFO: Cloud socket connected
0000366396 [system] INFO: Starting handshake: presense_announce=0
0000366396 [comm.protocol.handshake] INFO: Establish secure connection
0000366407 [comm.dtls] INFO: (CMPL,RENEG,NO_SESS,ERR) restoreStatus=2
0000367341 [app.deferredUpdate] ERROR: Update timed out...
0000367342 [app.deferredUpdate] ERROR: Will force handshake and try with a fresh boot
0000372792 [comm.protocol.handshake] INFO: Sending HELLO message
0000373479 [comm.protocol.handshake] INFO: Handshake completed
0000373480 [system] INFO: Send spark/device/claim/code event
0000373695 [system] INFO: Send subscriptions
0000373906 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 4
0000373906 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 3
0000373907 [comm] INFO: Sending TIME request
0000374201 [comm.protocol] INFO: Sending 'M' describe message
0000374425 [comm.protocol] INFO: rcv'd message type=1
0000374425 [system] INFO: Cloud connected
0000374487 [comm.protocol] INFO: rcv'd message type=13
0000374763 [comm.protocol] INFO: rcv'd message type=13
0000374905 [comm.protocol] INFO: Received TIME response: 1541804056
0000374907 [comm.protocol] INFO: rcv'd message type=12
0000375087 [comm.protocol] INFO: Sending 'S' describe message
0000375389 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 4
0000375407 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 3
0000375409 [comm.protocol] INFO: rcv'd message type=1
0000375573 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 1
0000375575 [comm.protocol] INFO: rcv'd message type=9
0000375575 [comm.protocol] ERROR: Event loop error 21
0000375575 [system] WARN: Communication loop error, closing cloud socket
0000375577 [system] INFO: Cloud: disconnecting
0000375577 [system] INFO: Cloud: disconnected
0000375677 [system] INFO: Cloud: connecting

And that same loop just carries on forever (or it would if we didn’t realize and System.reset). The proximate cause seems to be the message type=9 (9 = KEY_CHANGE) as it’s always followed by Event loop error 21 (21 = IO_ERROR_LIGHTSSL_BLOCKING_SEND).

As this doesn’t happen when I connect at other times I presume it’s the particular loop our app is in at this point. We are intentionally in our own loop and not in the main loop(). Reason being we can’t have a reset happen at any random time (like when valves are open) so we System.disableUpdates(). When we see a pending update and get to an appropriate time where a reset would be OK we System.enableUpdates() and head off to this particular loop:

void waitForUpdate() {
  log.info("Expecting update to start...");
  if (! statusUpdateTs) {
    statusUpdateTs = millis();
  }
  int publishAttempts = 0;
  while (true) {
    Particle.process();
    timestamp expectedFinishTs = statusUpdateTs + (isComplete ? COMPELTE_TIMEOUT : IN_PROGRESS_TIMEOUT);
    if (millis() > expectedFinishTs) {
      log.error("Update timed out...");
      if (isFreshBootAttempt()) {
        log.error("Already tried fresh boot, will resume normal operations");
        delay(500);
        System.reset();
      } else {
        log.error("Will force handshake and try with a fresh boot");
        bool publishSuccess = Particle.publish("spark/device/session/end", PRIVATE | WITH_ACK);
        if (publishSuccess) {
          log.info("Cleared session to force handshake next boot");
          delay(500);
          System.reset(FRESH_BOOT_RESET_MAGIC);
        } else {
          log.error("Couldn't clear session");
          if (publishAttempts++ >= 3) {
            log.error("Giving up on clearing seession");
            delay(500);
            System.reset(FRESH_BOOT_RESET_MAGIC);
          }
        }
      }
    }
    delay(1000);
  }
}

statusUpdateTs and isComplete are being updated whenever a system firmware event happens (through System.on(firmware_update_pending + firmware_update, firmwareEventHandler)). isFreshBootAttempt() is false when this issue is happening.

The point to this loop is to wait for the update to apply but if it doesn’t within a timeout (which happens all the time for us) we publish an event to end our session and reboot to pick the update up next boot (next boot we won’t disable updates because of FRESH_BOOT_RESET_MAGIC so the firmware update won’t fail – which it does for us most times on the first attempt when updates are re-enabled).

In that loop you can see it’s log messages sprinkled through the log output above. As per the log output because the Particle.publish is WITH_ACK it takes about 10 seconds to return with a failure result, and then another 1000ms delay at the end of the loop means Particle.process() is only being called every 10s or so. Could that be the problem and causing the KEY_CHANGE to fail?

I can’t see the Particle.connect() call, but if you call it once (in multithreaded, manual mode) you need to make sure not to call it again till the connection is made, you abort the attempt or wait 500sec.
Also calling Particle.publish() with no connection or during an ongoing connection attempt can cause troubles.
Additionally, during an ongoing update (in multithreaded mode) the connection can be dropped by the device OS for the update process, if your still running code does not account for that case, you may see issues too.

In multithreaded, manual mode that should not play too much role.

Thanks @ScruffR, I’m never quire sure how much code to post and lose the important detail.

This is the code that calls the waitForUpdate function I posted above:

  if (Particle.connected() && isPending) {
    log.info("Re-enabling firmware updates");
    System.enableUpdates();
    waitForUpdate();
  }

And that waitForUpdate never returns so nothing else should be calling connect or `similar.

How long are your time constants?
AFAICT you are forcing a handshake within 15 seconds. A full handshake should only be triggered once. Having it trigger repeatedly would only result in extra data consumption and potentially interfere with ongoing connections.

BTW, you may have seen millis() timing done as if (millis() - prevTime > timeout) instead of if (millis() > nextTime). The reason behind that is explained here

const system_tick_t TIME_PER_SECOND = 1000;
const system_tick_t TIME_PER_MINUTE = TIME_PER_SECOND * 60;
...
static const system_tick_t IN_PROGRESS_TIMEOUT = 5 * TIME_PER_MINUTE;
static const system_tick_t COMPELTE_TIMEOUT = 10 * TIME_PER_SECOND;

The timeout that is always triggered is the IN_PROGRESS_TIMEOUT, as per this that I posted a while ago: How to notify application incase of an incoming OTA update?

So it’s waited 5 minutes before sending the spark/device/session/end event. But I see what you’re saying / that’s made me think: sending the spark/device/session/end event may be causing the session to be broken during the processing of that same event? Which causes the error response. Because even tho every Particle.publish("spark/device/session/end", PRIVATE | WITH_ACK) returns false it does in fact work (the next connect after reboot has to do a full handshake and correctly pulls down the firmware).

So maybe the solution to my problem is to remove the WITH_ACK. It’s not the end of the world if that event is lost.

Thanks for the heads up. I’ll definitely rearrange things.

1 Like