We have been using several hundred Photon 1 devices, operating at a variety of locations, for about 4-5 years, with high overall reliability. Lately we have been making several firmware improvements, updating the OS and revising the code, so there have been a lot of OTA updates in the past few months. After every update, some devices fail to come back online. It is usually under 1% of the devices, which is still quite problematic. We have also had trouble diagnosing the problem after the fact: all the devices are in customers' factories, and the issue is usually fixed by a manual power reset before we can collect enough information to debug it.
In an attempt to lower that 1% to 0%, and assuming the issue comes from our firmware, we tried putting the devices in Safe Mode first and then performing the update.
In this case, though, around 23% of the devices did not come back online, and we later confirmed that all of them were unresponsive and showing a solid magenta LED. They all came back online after a power cycle, and they had all been online and in Safe Mode before the OTA update. We use our own commercial routers on location; we don't use the customer's Wi-Fi. The devices were running Device OS 2.3.1.
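For reference, the "did not come back online" figure above comes from comparing which devices were online before and after the rollout. The snippet below is only a minimal sketch of that comparison, not our actual tooling: `DeviceStatus`, `stalled_after_update` and `failure_rate` are hypothetical names, and the two snapshots are assumed to be built from whatever source reports each device's online state.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DeviceStatus:
    """Hypothetical snapshot record: one device and whether it was online."""
    device_id: str
    online: bool


def stalled_after_update(before: Iterable[DeviceStatus],
                         after: Iterable[DeviceStatus]) -> List[str]:
    """Return sorted IDs of devices online before the update but not after it.

    A device missing from the 'after' snapshot is treated as not online.
    """
    online_after = {d.device_id for d in after if d.online}
    return sorted(d.device_id for d in before
                  if d.online and d.device_id not in online_after)


def failure_rate(before: Iterable[DeviceStatus],
                 after: Iterable[DeviceStatus]) -> float:
    """Fraction of previously online devices that did not come back."""
    before = list(before)
    online_before = sum(1 for d in before if d.online)
    if online_before == 0:
        return 0.0  # nothing was online, so nothing could fail to return
    return len(stalled_after_update(before, after)) / online_before
```

Devices that were already offline before the update are excluded from both the list and the rate, so only units that were lost during the update itself are counted.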
What could cause a freeze while in Safe Mode that could be triggered by the OTA update process? Where should I look to make OTA updates more reliable? In our code, we connect to an MQTT broker in addition to the Particle Cloud, though as I understand it none of our code runs in Safe Mode.
Any help or pointers in the right direction would be appreciated!