We have a few products live, and the product that uses Electrons is suddenly offline most of the time.
There are a few things going on, and I've verified the following:
The devices report as online, but I cannot ping them or use their functions. Sometimes they send a spark/status online event, but they are still not responsive.
2G sunset: the provider our devices connect to is KPN, which will sunset 2G on 1 December 2027.
I did not deploy anything recently, so a deployment could not have caused this issue.
The SIM card deactivation process should not affect the devices, because they were online recently.
I see more failed cloud calls in the fleet health dashboard (see image).
What can I do? I can only think of issues that are out of my reach to fix.
I do not know if KPN is doing this, however when 2G was sunset in the United States, T-Mobile deactivated some 2G towers to convert them to 5G and reuse the frequencies. 2G was still technically active, but with redundant towers removed, there were more gaps in coverage.
Sorry, I should have explained that better. What is the signal strength and quality reported in the console for these devices? It's possible that the signal is enough to get connected, but not enough to reliably communicate.
The best way to debug this, if you can connect a computer by USB when the problem occurs, is to enable logging:
SerialLogHandler logHandler(LOG_LEVEL_TRACE);
and check the logs. You want to see if the packets from the cloud are making it to the device or not when it's in this state.
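As a rough sketch of what such a diagnostic build could look like (untested against your fleet, and assuming Device OS 0.8.0 or later for getStrength() and getQuality()): the cloud function name "probe" and the one-minute report interval are made up for illustration. Calling the function from the console while watching the USB log shows whether a request from the cloud actually reaches the device, and the periodic line records the signal strength and quality at that moment.

```cpp
#include "Particle.h"

// Trace-level logging over USB serial, as suggested above.
SerialLogHandler logHandler(LOG_LEVEL_TRACE);

// Hypothetical cloud function, for diagnostics only. If this line shows up
// in the log, the request from the cloud made it to the device.
int probe(String arg) {
    Log.info("probe called from cloud, arg=%s", arg.c_str());
    return 0;
}

unsigned long lastReport = 0;

void setup() {
    Particle.function("probe", probe);
}

void loop() {
    // Once a minute (arbitrary interval), log cloud state and cellular signal.
    if (millis() - lastReport >= 60000) {
        lastReport = millis();
        CellularSignal sig = Cellular.RSSI();
        Log.info("cloud connected=%d strength=%d%% quality=%d%%",
                 (int)Particle.connected(),
                 (int)sig.getStrength(),
                 (int)sig.getQuality());
    }
}
```

This only builds against Device OS, so it has to be compiled and flashed as firmware. Any serial terminal on the USB port will show the output.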
Another thing you can try is to temporarily reduce the Particle.keepAlive. It defaults to 23 minutes but you could try reducing it to 5 minutes or 2 minutes to see if the behavior changes. This should not be done permanently, because it will use a lot of cellular data, but this can rule out an issue with the port forwarding being removed by the carrier prematurely.
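A minimal sketch of the temporary keep-alive change, assuming it is set once in setup(). The 2-minute value is just one of the test values mentioned above, and the argument is in seconds.

```cpp
#include "Particle.h"

void setup() {
    // Temporary diagnostic only: send a keep-alive every 2 minutes instead
    // of the default 23 minutes. Revert this after testing, because the
    // extra pings add up in cellular data.
    Particle.keepAlive(2 * 60);
}

void loop() {
}
```

If the devices stay responsive with the shorter interval, that points toward the carrier dropping the port forwarding early rather than a coverage problem.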
Just in case you feel crazy like I do, I am having this exact issue as of January 21st on several of my Electron devices (Ontario, Canada). One in Windsor came back on its own and has been operating since the 23rd, but the others in different geographies are still down. Many periodic "status/online" messages, but nothing else. If I catch an "online" event I can sometimes GET cloud variables at that moment.
I tried calling KPN, but they cannot help me; that has to be handled by Particle.
I see that the devices with the connection problems do report back every 23 minutes. I think this indicates that the "UDP hole punching" is working as documented. However, the Particle API treats each report as a new connection; you would not see these messages if the connection were still established.
Joining in here: I have very similar symptoms with devices in my fleet. Devices are listed as online, but in general neither pings nor vitals requests succeed.
Additionally, in the historical vitals records I see error -100 "unknown system error".
After a few attempts at pinging or fetching vitals, the device starts to respond, and in one case even flashing was possible.
Same as reported by @tompesman: the devices used to be operational, no recent deployments were made, and no mechanical/electrical maintenance was carried out nearby.
I'm jumping in here to acknowledge that we have confirmation from our provider that a change was made on the 21st, without that change being shared with us. It's clear that this change led to sessions on Electron devices being closed prematurely, resulting in the behaviour that you've observed.
Work is ongoing to confirm the root cause, and we believe a solution will be implemented during the course of next week.