Electron Freezing During Connection Process

I’ve got an Electron that is running v1.0.1 firmware, thread enabled, semi-automatic mode. Sometimes when connecting to the cloud it freezes during the connection process. Here is a sample of my logs (it freezes on the last word in the log):

0000001355 [temp] INFO: Temp C: 18.00
0000002356 [temp] INFO: Sensor attached and available
0000002356 [temp] INFO: Configure probe to correct temperature resolution
0000002381 [temp] INFO: Probe is configured to correct resolution: 0x40
0000002382 [temp] INFO: Writing Values Th = 0x0, Tl = 0x0, cfg = 0x5f
0000002428 [temp] INFO: Waiting for 1000 ms
0000003232 [system] INFO: Sim Ready
0000003232 [system] INFO: ARM_WLAN_WD 1
0000003342 [system] INFO: ARM_WLAN_WD 2
0000003342 [system] INFO: CLR_WLAN_WD 1, DHCP success
0000003342 [system] INFO: Cloud: connecting
0000003343 [app] INFO: connectionEvent event=1 data=1
0000003343 [app] INFO: cellular up
0000003346 [system] INFO: Read Server Address = type:1,domain:$id.udp.particle.io
0000003454 [temp] INFO: Temp C: 18.00
0000003675 [system] INFO: Cloud socket connected
0000003675 [system] INFO: Starting handshake: presense_announce=0
0000003675 [comm.protocol.handshake] INFO: Establish secure connection
0000003701 [comm.dtls] INFO: (CMPL,RENEG,NO_SESS,ERR) restoreStatus=0
0000003701 [comm.dtls] INFO: out_ctr 0,1,0,0,0,0,0,12, next_coap_id=46
0000003701 [comm.dtls] INFO: app state crc: cached: e1e4706e, actual: def6369c
0000003703 [comm.dtls] INFO: restored session from persisted session data. next_msg_id=70
0000003703 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 2
0000003705 [comm.protocol.handshake] INFO: Sending HELLO message
0000004455 [temp] INFO: Sensor attached and available
0000004455 [temp] INFO: Configure probe to correct temperature resolution
0000004481 [temp] INFO: Probe is configured to correct resolution: 0x40
0000004481 [temp] INFO: Writing Values Th = 0x0, Tl = 0x0, cfg = 0x5f
0000004527 [temp] INFO: Waiting for 1000 ms
0000005554 [temp] INFO: Temp C: 18.00
0000006001 [comm.protocol.handshake] INFO: Handshake completed
0000006001 [system] INFO: Send spark/device/claim/code event
0000006150 [system] INFO: Send spark/hardware/max_binary event
0000006293 [system] INFO: Send subscriptions
0000006555 [temp] INFO: Sensor attached and available
0000006555 [temp] INFO: Configure probe to correct temperature resolution
0000006580 [temp] INFO: Probe is configured to correct resolution: 0x40
0000006581 [temp] INFO: Writing Values Th = 0x0, Tl = 0x0, cfg = 0x5f
0000006581 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 4
0000006593 [comm.dtls] INFO: session 

It seems to be freezing when there is a collision between the connection process and my 1-Wire library and Serial printing to console. The freeze doesn’t seem to happen when I remove Serial printing OR when I remove the 1-Wire library. Also, I can’t seem to reproduce the issue when I am careful to place Serial.flush() around the code that freezes. Finally, the freeze only occurs once in a while, not every time reset, and only when Electron is making connection to Particle cloud, after successfully connecting to tower.

The function that seems to freeze the electron is:

int TemperatureDS18::functionWrite(byte th, byte tl, byte config)
#ifdef DEBUG_LOG
  logTemp.info("Writing Values Th = 0x%x, Tl = 0x%x, cfg = 0x%x", th, tl, config);
    present = 0;
    present = ds.reset(); // first clear the 1-wire bus
    if (present == 0)
    { // no response from device, so don't bother with configuring
#ifdef DEBUG_LOG
      logTemp.warn("Present Write = %x. Cancelling this configuration, as no device present.", present);
      return 2;
    ds.write(0x4E, PARASITE_POWER_OFF);   // Write Scratchpad
    ds.write(th, PARASITE_POWER_OFF);     // Write Scratchpad Th
    ds.write(tl, PARASITE_POWER_OFF);     // Write Scratchpad Tl
    ds.write(config, PARASITE_POWER_OFF); // Write Scratchpad cfg
  return 0;

Once frozen the Electron will breathe or flash Cyan indefinitely. The only way out of the issue is by pressing the reset button. The ApplicationWatchdog does not fire and reset the Electron even if set.

I’ve been trying to debug this issue using the Particle debugger, but I can’t seem to find the exact source of the issue. When I catch one of these with the debugger, the location provided seems to be looping inside the xQueueGenericReceive within queues.c within the device-os. The issue is similar to what is discussed here: https://github.com/particle-iot/device-os/issues/877.

I’m hoping someone here can point me in the right direction for debugging this. My main questions are:

  • Where should I be looking for a bug like this?
  • The 1-wire library is a variation of the Particle 1-wire library, using ATOMIC_BLOCKs instead of enabling and disabling interrupts. I have a feeling the issue is in here, but I’m having a hard time finding it.
  • In general, it seems like this type of freeze should not happen. Is there some way to avoid this? Do I need to avoid any sort of interrupt manipulation while device is connecting to cloud?

@ParticleD, or @mstanley are you able to assist?

Please test again but move the call to logTemp.warn() outside of the SINGLE_THREADED_BLOCK(). Calling that within a SINGLE_THREADED_BLOCK() section is dangerous. If the warn needs to allocate memory or acquire resource currently held by the system thread very likely the cause of your lockup. I suspect your issue is not with cellular connect specifically but just that the system thread is unusually busy during that time frame. Increases chance for application thread preempting and taking control while the system thread is holding a critical resource.

Thanks for the response!

I’ll make the change and attempt to test for this bug. Its super sporadic.

If I call return inside a SINGLE_THREADED_BLOCK, am I exiting single threaded mode? For example, in my code if I return 2; before exiting SINGLE_THREADED_BLOCK, am I stuck in single threaded mode?

Another question I have on this is: what is the point of calling SINGLE_THREADED_BLOCK here? Inside this block the Electron seems to be doing other stuff with comm.dtls, unless the lock up is later and I am looking for the bug in the wrong function

SINGLE_THREADED_BLOCK implements a RAII-style guard and will cleanup properly even on return 2;.

SINGLE_THREADED_BLOCK is often used to make sure a context switch doesn’t happen in the middle of a critical multipart operation. A similar ATOMIC_BLOCK goes one step further and disables all interrupts.

My guess is the library implementer is trying to guard against another task preempting and trying to take over the 1-Wire bus, causing both operations to fail. SINGLE_THREADED_BLOCK is one way to protect against that. Just have to be careful as certain operations aren’t safe from such blocks.

There is no hard and fast rule on what is safe/unsafe from SINGLE_THREADED_BLOCK. Just consider what might be using a resource that could also be in use from or depend on another task. Such as an AT command if the System thread might already have a lock on the cellular modem interface, serial writes that might require the system thread to process a buffer to make room for more data, dynamic memory allocation that might fail because another thread was in the middle of an allocation, etc.