Periodically (every 3 days) disconnect from the cloud

Hello there :slight_smile:

We have a few Argons and Borons scattered around different locations.
We use these devices to control some sensors for which it is essential that the Particle device is responding to the requests we make via cloud functions

Lately we have been seeing an issues where the device is unresponsive for a few seconds. I checked the vitals in the dashboard and noticed that every 3 days the device disconnects from the cloud. Immediately after te disconnect it reconnects (about 10 seconds average), which works great. The only issue here being that we are not able to control the Particle device which is a problem when this disconnect happens and need to control a connected sensor.
Since the devices are placed on locations that we haven’t always acces to it is hard to get the logs out of the device. So I have made a setup in my office with an Argon which had this issue as well on a location it was placed on, to be able to log what is happening with the device. From the logs I noticed that when the disconnect happens it is because of an error on the system:

0055442972 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0055442973 [comm.protocol] TRACE: message id 41252 complete with code 0.00
0055442973 [comm.protocol] TRACE: rcv'd message type=13
0055468079 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0055468080 [comm.protocol] TRACE: message id 41253 complete with code 0.00
0055468080 [comm.protocol] TRACE: rcv'd message type=13
0055493186 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0055493186 [comm.protocol] TRACE: message id 41254 complete with code 0.00
0055493187 [comm.protocol] TRACE: rcv'd message type=13
0055505693 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 1
0055505694 [comm.protocol] TRACE: rcv'd message type=9
0055505695 [comm.protocol] ERROR: Event loop error 21
0055505695 [system] WARN: Communication loop error, closing cloud socket
0055505696 [system] INFO: Cloud: disconnecting
0055505696 [system] INFO: Cloud: disconnected
0055505697 [system] INFO: Cloud: connecting
0055505698 [system] INFO: Read Server Address = type:1,domain:$id.udp-mesh.particle.io
0055505699 [system] WARN: Failed to load session data from persistent storage
0055505699 [system] INFO: Discarding session data
0055505701 [system] TRACE: Resolving xxxxxxx.v2.udp-mesh.particle.io#5684
0055505749 [system] TRACE: Address type: 3
0055505749 [system] TRACE: Cloud socket=0, family=2, type=2, protocol=17
0055505750 [system] INFO: Cloud socket=0, connecting to xxxxxx#5684
0055505750 [system] TRACE: Cloud socket=0, connected to xxxxxx#5684
0055505751 [system] TRACE: Updating cloud keepalive for AF_INET: 25000 -> 25000
0055505751 [system] TRACE: Applying new keepalive interval now
0055505752 [system] INFO: Cloud socket connected
0055505752 [system] INFO: Starting handshake: presense_announce=0
0055505753 [comm.protocol.handshake] INFO: Establish secure connection
0055505756 [comm.dtls] INFO: (CMPL,RENEG,NO_SESS,ERR) restoreStatus=2
0055506323 [comm.protocol.handshake] INFO: Sending HELLO message
0055506446 [comm.protocol.handshake] INFO: Handshake completed
0055506447 [comm.protocol.handshake] TRACE: Updating protocol flags
0055506447 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 4
0055506448 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 3
0055506614 [hal] TRACE: Cached ESP32 NCP firmware version: 5
0055506616 [comm.protocol] INFO: Posting 'S' describe message
0055506621 [system] INFO: Send spark/hardware/max_binary event
0055506624 [system] INFO: Send spark/device/last_reset event
0055506631 [comm] INFO: Sending TIME request
0055506634 [system] INFO: Sending application DESCRIBE
0055506634 [comm.protocol] INFO: Posting 'A' describe message
0055506637 [system] INFO: Sending subscriptions
0055506637 [comm.protocol] INFO: Sending subscriptions
0055506642 [system] TRACE: Waiting until all handshake messages are processed by the protocol layer
0055506759 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0055506759 [comm.protocol] TRACE: message id 41256 complete with code 0.00
0055506760 [comm.protocol] TRACE: Updating system DESCRIBE checksum
0055506760 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 4
0055506926 [hal] TRACE: Cached ESP32 NCP firmware version: 5
0055506928 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 3
0055506929 [comm.protocol] TRACE: rcv'd message type=13
0055506929 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0055506930 [comm.protocol] TRACE: message id 41257 complete with code 0.00
0055506930 [comm.protocol] TRACE: rcv'd message type=13
0055506931 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0055506932 [comm.protocol] TRACE: message id 41258 complete with code 0.00
0055506932 [comm.protocol] TRACE: rcv'd message type=13

I can’t figure out why that is. I have done an extensive code review with the Particle documentation next to me to see if there are important notes we missed for some logic that causes memory leaks or threading issues and I don’t think that’s the case. The code is also fairly simple, all the Particle connection logic is handled automatically and we have one cloud function that controls max 3 pins of the device. Good to add here is that the device is reset at midnight to make sure the device doesn’t suffer from any memory issues or something that makes the device not function or function incorrectly throughout the day. That is also backtracked in the logs as well, so no issues there that could cause something to happen after 3 days. And it runs on a fairly old DeviceOs version -> 2.3.0. The test device (same as on location and Borons over cellular) is connected via Ethernet and when logging I checked if there was an outage with the internet, but that wasn't the case.

Particle is fairly new to me, although I have some experience with other hardware like Arduino, ESP32, ESP8266 and developing software in general. Would be great to get some new insights and solve this issue with help of the community. I looked around the forum already but couldn’t find find a issue that is similar (unfortunately).

Huge thanks in advance for the help!

This is the expected behavior. The cloud connection is DTLS and the session key expires every 3 days and needs to be refreshed. The device should only be offline for a few seconds, but there is no way to avoid this.

3 Likes

Thank you for your reaction! Out of curiosity, so if I understand correctly a call of System.reset() doesn't trigger a refresh of this session key? And is the behavior of the cloud connection and the expiration of the session key documented somewhere?

System.reset() and OTA reset reuse the session key. The physical reset button usually reuses the session key, but not always on Gen 4 devices.

This behavior is mentioned in clearSession which is also how you can force it to occur earlier. This is not recommended in general, but if you wanted to schedule it to happen at a specific time of the day, you could, starting after 2 days from the last session reset, clear the session at that time.

1 Like

I will have a look at the documentation, thank you very much!