Originally inspired by this post (FYI @Vitesze) but this was a little off topic.
So during OTA updates, my loop always hangs on my MQTT client.isConnected() method, which is just a wrapper for the TCPClient connected() function. It hangs for about 30 seconds and blocks the OTA for that time. It seems to be waiting for something to happen while checking the socket. Weird. So I tried wrapping it in a SINGLE_THREADED_BLOCK, and then it just hangs indefinitely.
The call to TCPClient::connected() ultimately ends up calling the below:
So the OTA update is triggering the LOCK mutex condition (which makes sense). What doesn’t make sense is the fact that the OTA update and the loop BOTH get delayed by 30 seconds. This is all before any system events trigger. This always causes the OTA update to appear to time out on the Particle console / CLI, though the updates ultimately complete successfully.
Is there any way to make sure that the OTA stops getting delayed and/or to check for whatever condition is triggering the lock so as to not call anything on my TCPClient?
Within handle_update_begin, the following things happen:
It sends a brief ack that it got the update request.
It calls prepare_for_firmware_update in system_update.cpp, which then does:
Sets the flag SYSTEM_FLAG_OTA_UPDATE_PENDING, which is the value returned by System.updatesPending().
Notifies the Application Thread with the event firmware_update_pending and waits for the Application Thread to handle the event with a timeout of 30 seconds.
Sets the flag SYSTEM_FLAG_OTA_UPDATE_PENDING back to false and then, if updates are enabled, it notifies of the event firmware_update - firmware_update_begin.
The update starts.
The device sends an update_ready message back to the server to begin the update.
So, the problem appears to be that the System Thread somehow locks the modem during the message handler, which keeps TCPClient.connected() blocked until the 30-second wait for the Application Thread to respond times out.
Unfortunately, checking System.updatesPending() does not return true when called immediately before TCPClient.connected(). Thus, the lock on the modem appears to be set before the flag is set, and held until some point in time later in the update start process.
This means that there doesn’t appear to be a way to avoid this condition with the standard System flags and events. What I can’t figure out yet is what is holding the LOCK on the modem. I’ll try to investigate that next. It’s a lock_guard type mutex, so in theory, once execution is outside the scope of mdm_hal.cpp, it shouldn’t still be locked, but I’m clearly missing or misunderstanding something.
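To make the suspected interaction concrete, here is a toy model in standard C++ rather than Device OS code. It assumes that some code path keeps the modem lock held while the System Thread waits for the Application Thread's acknowledgement, which is only the hypothesis described above. The names simulateStall, StallResult, modemMutex and ackTimeout are made up for illustration, and the timeout is shortened from 30 seconds to whatever the caller passes in.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <future>
#include <mutex>
#include <thread>

// Outcome of one simulated update handshake (hypothetical type).
struct StallResult {
    bool ackedInTime;                      // did the "application" ack before the timeout?
    std::chrono::milliseconds appBlocked;  // how long the "application" waited for the lock
};

// Model: the "system" side takes the modem lock, then waits up to ackTimeout
// for an ack that the "application" side can only send after it has taken
// the same lock (as a connected() call would).
StallResult simulateStall(std::chrono::milliseconds ackTimeout) {
    using Clock = std::chrono::steady_clock;
    assert(ackTimeout.count() > 0);

    std::mutex modemMutex;  // stands in for the modem LOCK
    std::mutex ackMutex;    // protects the ack flag
    std::condition_variable ackCv;
    bool acked = false;
    std::chrono::milliseconds appBlocked{0};
    std::promise<void> lockTaken;
    std::future<void> lockTakenFuture = lockTaken.get_future();

    // "Application Thread": waits until the lock is held, then calls something
    // that needs the modem lock, and only afterwards acknowledges the event.
    std::thread app([&] {
        lockTakenFuture.wait();
        const auto start = Clock::now();
        {
            std::lock_guard<std::mutex> guard(modemMutex);  // blocks here
        }
        appBlocked = std::chrono::duration_cast<std::chrono::milliseconds>(
            Clock::now() - start);
        {
            std::lock_guard<std::mutex> ackGuard(ackMutex);
            acked = true;
        }
        ackCv.notify_one();
    });

    // "System Thread": holds the modem lock for the whole timed wait.
    bool ackedInTime = false;
    {
        std::lock_guard<std::mutex> guard(modemMutex);
        lockTaken.set_value();
        std::unique_lock<std::mutex> ackLock(ackMutex);
        ackedInTime = ackCv.wait_for(ackLock, ackTimeout, [&] { return acked; });
    }  // both locks are released here, only after the wait has ended

    app.join();
    return {ackedInTime, appBlocked};
}
```

In this model, calling simulateStall(std::chrono::milliseconds(200)) reports ackedInTime as false, and appBlocked should come out close to the full timeout: both sides lose the whole wait, which matches the symptom of the loop and the OTA each being delayed by the same 30 seconds. It says nothing about which Device OS code path actually holds the lock.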
@mdma not sure if this is something you might have any insight into - any thoughts would be great!
I think if you care about this timeout you are supposed to:
Run with System.disableUpdates()
Listen for the firmware_update_pending event which happens even when disabled.
When you get said event, you should shut down all your connections (set a flag in the handler and do the work in loop()) and then call System.enableUpdates() when you are ready.
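The three steps above could be structured roughly like this. It is a minimal sketch, not a tested Device OS application: DeferredUpdateGate and its method names are hypothetical, and the two callbacks are placeholders for whatever closes your connections (for example client.stop()) and for System.enableUpdates(). The class itself is plain C++ so that the flag logic can be checked off-device; the Particle wiring is only indicated in comments.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper: records that an update is pending (set from the event
// handler) and does the actual shutdown work later, from loop().
class DeferredUpdateGate {
public:
    DeferredUpdateGate(std::function<void()> closeConnections,
                       std::function<void()> enableUpdates)
        : closeConnections_(std::move(closeConnections)),
          enableUpdates_(std::move(enableUpdates)) {
        assert(closeConnections_ && enableUpdates_);
    }

    // Call from the firmware_update_pending handler: only sets a flag.
    void onUpdatePending() { pending_.store(true); }

    // Call from loop(): returns true if it closed connections and
    // re-enabled updates on this pass, false if nothing was pending.
    bool service() {
        if (!pending_.exchange(false)) {
            return false;
        }
        closeConnections_();  // e.g. stop the MQTT/TCP client
        enableUpdates_();     // e.g. System.enableUpdates()
        return true;
    }

private:
    std::function<void()> closeConnections_;
    std::function<void()> enableUpdates_;
    std::atomic<bool> pending_{false};
};

// Assumed wiring on the device (not compiled here):
//   setup():  System.disableUpdates();
//             System.on(firmware_update_pending, handler);
//   handler(system_event_t, int):  gate.onUpdatePending();
//   loop():   gate.service();
```

Keeping the handler down to a single flag write follows the advice in the post to do the real work in loop(). Whether this avoids the 30-second stall depends on when the event is actually delivered relative to the modem lock.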
It is the same problem either way. The problem is that the modem mutex is locked BEFORE either the firmware_update_pending event is raised OR the System updates_pending flag is set.
My firmware has already gotten “locked up” by the time either of those System indicators is provided to the Application Thread.
EDIT: to be specific, the System firmware doesn’t actually check the System.updatesEnabled flag until AFTER this timeout has passed, which is what enables you to catch the event and enable updates in the event handler if you want to still update.