Unreliable OTA firmware updates

Hi All,

I am noticing that my devices are having trouble updating the firmware from a cloud flash while running my application code. Here is a Trace log from a B524SOM (OS v4.2.0) where firmware updates have failed. Sometimes it takes multiple flash attempts to be successful, sometimes it can take 10mins plus for the update to occur.

Initially I thought that it may be a signal issue, but the flash always works when the device is in safemode.

0001157446 [wiring.ble] TRACE: Disconnected
0001225436 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0001225437 [comm.protocol] TRACE: message id 12 complete with code 0.00
0001225437 [comm.protocol] TRACE: rcv'd message type=13
0001593936 [comm.ota] INFO: Received UpdateStart request
0001593937 [comm.ota] INFO: File size: 44583
0001593937 [comm.ota] INFO: Chunk size: 512
0001593937 [comm.ota] INFO: File checksum:
7bd64459a481aa805165df94606df6285e441632527f3b292f42fa9c3256db5b
0001593938 [comm.ota] INFO: Starting firmware update
0001594445 [comm.ota] INFO: Start offset: 0
0001594445 [comm.ota] INFO: Chunk count: 88
0001594445 [comm.ota] TRACE: Window size (chunks): 256
0001594447 [comm.protocol] TRACE: rcv'd message type=17
0001594479 [comm.ota] WARN: Received UpdateStart request but another update is already in progress
0001594480 [comm.ota] INFO: File size: 44583
0001594480 [comm.ota] INFO: Chunk size: 512
0001594480 [comm.ota] INFO: File checksum:
7bd64459a481aa805165df94606df6285e441632527f3b292f42fa9c3256db5b
0001594481 [comm.ota] INFO: Starting firmware update
0001594958 [comm.ota] INFO: Start offset: 0
0001594959 [comm.ota] INFO: Chunk count: 88
0001594959 [comm.ota] TRACE: Window size (chunks): 256
0001594960 [comm.protocol] TRACE: rcv'd message type=17
0001595130 [comm.ota] WARN: Received UpdateStart request but another update is already in progress
0001595130 [comm.ota] INFO: File size: 44583
0001595131 [comm.ota] INFO: Chunk size: 512
0001595131 [comm.ota] INFO: File checksum:
7bd64459a481aa805165df94606df6285e441632527f3b292f42fa9c3256db5b
0001595132 [comm.ota] INFO: Starting firmware update
0001595597 [comm.ota] INFO: Start offset: 0
0001595597 [comm.ota] INFO: Chunk count: 88
0001595597 [comm.ota] TRACE: Window size (chunks): 256
0001595599 [comm.protocol] TRACE: rcv'd message type=17
0001595879 [comm.ota] WARN: Received UpdateStart request but another update is already in progress
0001595879 [comm.ota] INFO: File size: 44583
0001595880 [comm.ota] INFO: Chunk size: 512
0001595880 [comm.ota] INFO: File checksum:
7bd64459a481aa805165df94606df6285e441632527f3b292f42fa9c3256db5b
0001595881 [comm.ota] INFO: Starting firmware update
0001596356 [comm.ota] INFO: Start offset: 0
0001596356 [comm.ota] INFO: Chunk count: 88
0001596356 [comm.ota] TRACE: Window size (chunks): 256
0001596358 [comm.protocol] TRACE: rcv'd message type=17
0001596655 [comm.ota] WARN: Received UpdateStart request but another update is already in progress
0001596655 [comm.ota] INFO: File size: 44583
0001596656 [comm.ota] INFO: Chunk size: 512
0001596656 [comm.ota] INFO: File checksum:
7bd64459a481aa805165df94606df6285e441632527f3b292f42fa9c3256db5b
0001596657 [comm.ota] INFO: Starting firmware update
0001716866 [comm.ota] INFO: Start offset: 0
0001716866 [comm.ota] INFO: Chunk count: 88
0001716867 [comm.ota] TRACE: Window size (chunks): 256
0001716868 [comm.protocol] TRACE: rcv'd message type=17
0001717072 [comm.ota] WARN: Received UpdateStart request but another update is already in progress
0001717072 [comm.ota] INFO: File size: 44583
0001717078 [comm.ota] INFO: Chunk size: 512
0001717078 [comm.ota] INFO: File checksum:
7bd64459a481aa805165df94606df6285e441632527f3b292f42fa9c3256db5b
0001717078 [comm.ota] INFO: Starting firmware update
0001717553 [comm.ota] INFO: Start offset: 0
0001717560 [comm.ota] INFO: Chunk count: 88
0001717560 [comm.ota] TRACE: Window size (chunks): 256
0001717561 [comm.protocol] TRACE: rcv'd message type=17
0001717563 [comm.ota] WARN: Received UpdateStart request but another update is already in progress
0001717564 [comm.ota] INFO: File size: 44583
0001717564 [comm.ota] INFO: Chunk size: 512
0001717564 [comm.ota] INFO: File checksum:
7bd64459a481aa805165df94606df6285e441632527f3b292f42fa9c3256db5b
0001717565 [comm.ota] INFO: Starting firmware update
0001718026 [comm.ota] INFO: Start offset: 0
0001718027 [comm.ota] INFO: Chunk count: 88
0001718027 [comm.ota] TRACE: Window size (chunks): 256
0001718028 [comm.protocol] TRACE: rcv'd message type=17
0001718196 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0001718197 [comm.protocol] TRACE: message id 13 complete with code 0.00
0001718197 [comm.protocol] TRACE: rcv'd message type=13
0001718298 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0001718298 [comm.protocol] TRACE: message id 14 complete with code 0.00
0001718298 [comm.protocol] TRACE: rcv'd message type=13
0001718399 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0001718399 [comm.protocol] TRACE: message id 15 complete with code 0.00
0001718399 [comm.protocol] TRACE: rcv'd message type=13
0001718500 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0001718500 [comm.protocol] TRACE: message id 16 complete with code 0.00
0001718501 [comm.protocol] TRACE: rcv'd message type=13
0001718605 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0001718606 [comm.protocol] TRACE: message id 17 complete with code 0.00
0001718606 [comm.protocol] TRACE: rcv'd message type=13
0001719107 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0001719107 [comm.protocol] TRACE: message id 18 complete with code 0.00
0001719107 [comm.protocol] TRACE: rcv'd message type=13
0001719212 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0001719212 [comm.protocol] TRACE: message id 19 complete with code 0.00
0001719213 [comm.protocol] TRACE: rcv'd message type=13
0002085512 [comm.ota] ERROR: Transfer timeout
0002085512 [comm.protocol] ERROR: Event loop error 10
0002085513 [system] WARN: Communication loop error, closing cloud socket
0002085513 [system] INFO: Cloud: disconnecting
0002085514 [system] INFO: Cloud: disconnected
0002085514 [system] INFO: Cloud: connecting
0002085517 [system] INFO: Cloud socket connected
0002085517 [comm.protocol.handshake] INFO: Establish secure connection
0002085520 [comm.dtls] INFO: session has 0 uses
0002085533 [comm.dtls] INFO: (CMPL,RENEG,NO_SESS,ERR) restoreStatus=0
0002085533 [comm.dtls] INFO: out_ctr 0,1,0,0,0,0,0,44, next_coap_id=13
0002085534 [comm.dtls] INFO: restored session from persisted session data. next_msg_id=19
0002085535 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 2
0002085851 [ncp.at] TRACE: > AT+CFUN?
0002085854 [ncp.at] TRACE: < +CFUN: 1
0002085855 [ncp.at] TRACE: < OK
0002085856 [ncp.at] TRACE: > AT+CCID
0002085858 [ncp.at] TRACE: < +CCID: 89883070000023239831
0002085858 [ncp.at] TRACE: < OK
0002085859 [ncp.at] TRACE: > AT+CGSN
0002085863 [ncp.at] TRACE: < 868986060851202
0002085864 [ncp.at] TRACE: < OK
0002085865 [ncp.at] TRACE: > AT+CGMR
0002085867 [ncp.at] TRACE: < EG91EFBR06A07M4G
0002085867 [ncp.at] TRACE: < OK
0002085868 [comm.protocol.handshake] INFO: Skipping HELLO message
0002085871 [comm.protocol] INFO: Checksum has not changed; not sending application DESCRIBE
0002085871 [comm.protocol] INFO: Checksum has not changed; not sending subscriptions
0002086113 [comm.dtls] INFO: session cmd (CLS,DIS,MOV,LOD,SAV): 4
0002086114 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0002086114 [comm.protocol] TRACE: message id 20 complete with code 0.00
0002086115 [comm.protocol] TRACE: rcv'd message type=13
0002086139 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0002086140 [comm.protocol] TRACE: message id 21 complete with code 0.00
0002086140 [comm.protocol] TRACE: rcv'd message type=13
0002086241 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0002086241 [comm.protocol] TRACE: message id 22 complete with code 0.00
0002086242 [comm.protocol] TRACE: rcv'd message type=13
0002086341 [system] INFO: Cloud connected
0002087301 [comm.protocol] TRACE: rcv'd message type=8
0002088338 [app.pubq] TRACE: reset or disconnect event, save files to queue
0002331164 [comm.ota] INFO: Received UpdateStart request
0002331164 [comm.ota] INFO: File size: 44583
0002331165 [comm.ota] INFO: Chunk size: 512
0002331165 [comm.ota] INFO: File checksum:
7bd64459a481aa805165df94606df6285e441632527f3b292f42fa9c3256db5b
0002331166 [comm.ota] INFO: Starting firmware update
0002331629 [comm.ota] INFO: Start offset: 0
0002331630 [comm.ota] INFO: Chunk count: 88
0002331630 [comm.ota] TRACE: Window size (chunks): 256
0002331631 [comm.protocol] TRACE: rcv'd message type=17
0002331852 [comm.ota] WARN: Received UpdateStart request but another update is already in progress
0002331853 [comm.ota] INFO: File size: 44583
0002331853 [comm.ota] INFO: Chunk size: 512
0002331853 [comm.ota] INFO: File checksum:
7bd64459a481aa805165df94606df6285e441632527f3b292f42fa9c3256db5b
0002331854 [comm.ota] INFO: Starting firmware update
0002332317 [comm.ota] INFO: Start offset: 0
0002332317 [comm.ota] INFO: Chunk count: 88
0002332317 [comm.ota] TRACE: Window size (chunks): 256
0002332318 [comm.protocol] TRACE: rcv'd message type=17
0002332474 [comm.ota] WARN: Received UpdateStart request but another update is already in progress
0002332475 [comm.ota] INFO: File size: 44583
0002332475 [comm.ota] INFO: Chunk size: 512
0002332476 [comm.ota] INFO: File checksum:
7bd64459a481aa805165df94606df6285e441632527f3b292f42fa9c3256db5b
0002332476 [comm.ota] INFO: Starting firmware update
0002332947 [comm.ota] INFO: Start offset: 0
0002332947 [comm.ota] INFO: Chunk count: 88
0002332947 [comm.ota] TRACE: Window size (chunks): 256
0002332948 [comm.protocol] TRACE: rcv'd message type=17
0002333213 [comm.ota] WARN: Received UpdateStart request but another update is already in progress
0002333213 [comm.ota] INFO: File size: 44583
0002333214 [comm.ota] INFO: Chunk size: 512
0002333214 [comm.ota] INFO: File checksum:
7bd64459a481aa805165df94606df6285e441632527f3b292f42fa9c3256db5b
0002333214 [comm.ota] INFO: Starting firmware update
0002333680 [comm.ota] INFO: Start offset: 0
0002333680 [comm.ota] INFO: Chunk count: 88
0002333680 [comm.ota] TRACE: Window size (chunks): 256
0002333681 [comm.protocol] TRACE: rcv'd message type=17
0002333959 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0002333959 [comm.protocol] TRACE: message id 23 complete with code 0.00
0002333959 [comm.protocol] TRACE: rcv'd message type=13
0002402832 [comm.ota] TRACE: Received 512 of 44583 bytes
0002402833 [comm.protocol] TRACE: rcv'd message type=19
0002402841 [comm.protocol] TRACE: rcv'd message type=19
0002402950 [comm.protocol] TRACE: rcv'd message type=19
0002403060 [comm.protocol] TRACE: rcv'd message type=19
0002403143 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0002403143 [comm.protocol] TRACE: message id 24 complete with code 0.00
0002403144 [comm.protocol] TRACE: rcv'd message type=13
0002403277 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0002403277 [comm.protocol] TRACE: message id 25 complete with code 0.00
0002403278 [comm.protocol] TRACE: rcv'd message type=13
0002403279 [comm.protocol] TRACE: Reply recieved: type=2, code=0
0002403280 [comm.protocol] TRACE: message id 26 complete with code 0.00
0002403280 [comm.protocol] TRACE: rcv'd message type=13
0002403391 [comm.protocol] TRACE: rcv'd message type=19
0002403493 [comm.protocol] TRACE: rcv'd message type=19
0002403553 [comm.protocol] TRACE: rcv'd message type=19
0002403655 [comm.protocol] TRACE: rcv'd message type=19
0002403864 [comm.protocol] TRACE: rcv'd message type=19
0002403966 [comm.protocol] TRACE: rcv'd message type=19
0002404273 [comm.protocol] TRACE: rcv'd message type=19
0002404383 [comm.protocol] TRACE: rcv'd message type=19
0002404496 [comm.protocol] TRACE: rcv'd message type=19
0002404558 [comm.protocol] TRACE: rcv'd message type=19
0002404667 [comm.protocol] TRACE: rcv'd message type=19
0002404775 [comm.protocol] TRACE: rcv'd message type=19
0002404911 [comm.protocol] TRACE: rcv'd message type=19
0002405015 [comm.protocol] TRACE: rcv'd message type=19
0002405150 [comm.protocol] TRACE: rcv'd message type=19
0002405162 [comm.protocol] TRACE: rcv'd message type=19
0002405273 [comm.protocol] TRACE: rcv'd message type=19
0002405344 [comm.protocol] TRACE: rcv'd message type=19
0002405444 [comm.protocol] TRACE: rcv'd message type=19
0002405555 [comm.protocol] TRACE: rcv'd message type=19
0002405612 [comm.protocol] TRACE: rcv'd message type=19
0002405715 [comm.protocol] TRACE: rcv'd message type=19
0002454128 [comm.ota] TRACE: Received 10240 of 44583 bytes
0002454129 [comm.protocol] TRACE: rcv'd message type=19
0002454200 [comm.protocol] TRACE: rcv'd message type=19
0002454305 [comm.protocol] TRACE: rcv'd message type=19
0002454392 [comm.protocol] TRACE: rcv'd message type=19
0002454491 [comm.protocol] TRACE: rcv'd message type=19
0002454603 [comm.protocol] TRACE: rcv'd message type=19
0002454672 [comm.protocol] TRACE: rcv'd message type=19
0002454780 [comm.protocol] TRACE: rcv'd message type=19
0002454860 [comm.protocol] TRACE: rcv'd message type=19
0002454962 [comm.protocol] TRACE: rcv'd message type=19
0002455065 [comm.protocol] TRACE: rcv'd message type=19
0002455173 [comm.protocol] TRACE: rcv'd message type=19
0002455276 [comm.protocol] TRACE: rcv'd message type=19
0002455385 [comm.protocol] TRACE: rcv'd message type=19
0002455487 [comm.protocol] TRACE: rcv'd message type=19
0002455597 [comm.protocol] TRACE: rcv'd message type=19
0002455699 [comm.protocol] TRACE: rcv'd message type=19
0002455810 [comm.protocol] TRACE: rcv'd message type=19
0002455939 [comm.protocol] TRACE: rcv'd message type=19
0002455944 [comm.protocol] TRACE: rcv'd message type=19
0002456054 [comm.protocol] TRACE: rcv'd message type=19
0002456118 [comm.protocol] TRACE: rcv'd message type=19
0002456238 [comm.protocol] TRACE: rcv'd message type=19
0002456342 [comm.protocol] TRACE: rcv'd message type=19
0002456444 [comm.protocol] TRACE: rcv'd message type=19
0002456548 [comm.protocol] TRACE: rcv'd message type=19
0002456615 [comm.protocol] TRACE: rcv'd message type=19
0002456712 [comm.protocol] TRACE: rcv'd message type=19
0002456773 [comm.protocol] TRACE: rcv'd message type=19

1 Like

Are you using SYSTEM_THREAD(ENABLED)? That is recommended.

Do you have things that block loop() from returning? That's another possibility as some things only occur between calls to loop. Ideally you want loop to be returning within a millisecond.

Do you have large delay() calls? That can also cause issues with threading disabled, because delay only services the cloud connection once per second instead of 1000 times or more per second, and that's not enough to reliably OTA.

Do you have any code that uses SINGLE_THREADED_BLOCK or turns off interrupts? That can cause problems with or without threading and it can prevent the cloud connection from being serviced.

1 Like

Thanks Rick,

I am using SYSTEM_THREAD(ENABLED). It is not every loop but I do have functions that can take seconds to complete. I will rewrite these to be non blocking, I suspected that this might be the issue.

With SYSTEM_THREAD(ENABLED), does the particle not service the cloud connection during large (multi second) delay() calls?

I am not using any code for SINGLE_THREADED_BLOCK or turning off interrupts.

I suspect that I am not returning to loop() frequently enough. I will rectify this issue and report back. Thanks for the quick response.

Can I ask from the messages in the trace log, what should I be keeping an eye out for? Do any of those trace logs tip you off to the above, or are you just responding from experience? I'm just interested to learn better troubleshooting methods.

Where returning to loop within a millisecond is ideal. Do you have any idea at what length of time trouble might start to arise? After some refactoring I am down to around the 30ms mark, but this may rise slightly as I start to add in more actions.

That should be fine. And it's OK if sometimes it's longer. You just want to avoid situations like never returning from loop, or only returning every few seconds constantly. I know if affects functions and variables, but I'm less sure it's the cause of your OTA issue.

Thanks Rick.

I have also been seeing intermittent issues with particle functions. So the delay with returning to loop is most likely the cause of that. I will revert when the rest of the code is refactored to report if the issue has been resolved with the new lower loop return interval.

thanks for sharing

A reduced loop return time seems to have eliminated the OTA and cloud related problems. Thanks for the help

2 Likes

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.