Electron devices offline (while they report being online)

We have a few products live, and the product that uses Electrons is suddenly offline most of the time.

There are a few things going on, and I've verified the following:

  • The devices report being online, but I cannot ping them or call their functions. Sometimes they send a spark/status online event, but they are still unresponsive.
  • 2G sunset: our devices connect through KPN, which will sunset 2G on 1 December 2027.
  • I have not deployed anything recently, so a firmware change cannot be the cause.
  • The SIM card deactivation process should not affect the devices, because they were online recently.
  • I see more failed cloud calls in the fleet health dashboard (see image).

What can I do? I can only think of issues that are out of my reach to fix.

I do not know if KPN is doing this, but when 2G was sunset in the United States, T-Mobile deactivated some 2G towers to convert them to 5G and reuse the frequencies. 2G was still technically active, but with redundant towers removed, there were more gaps in coverage.

That could be possible, but the devices do report being online (very briefly).

According to KPN's 2G sunset page (in Dutch), "Ga mee naar 4G en 5G | KPN", the 2G network can still be used until 1 December 2027.

What steps can I take to see what the problem is?

Sorry, I should have explained that better. What is the signal strength and quality reported in the console for these devices? It's possible that the signal is enough to get connected, but not enough to reliably communicate.

Something I've just noticed: a device came back online, I could interact with it, and after a few minutes it was unreachable again.

If 2G were disabled, this should be impossible. I ran a vitals check and most values are fine, but the number of cloud disconnect events is a bit high: 651.

Signal strength for most devices is around 70%. These 20 devices have been operating in the field for years.

The best way to debug this, if you can connect a computer by USB when the problem occurs, is to enable logging:

SerialLogHandler logHandler(LOG_LEVEL_TRACE);

and check the logs. You want to see if the packets from the cloud are making it to the device or not when it's in this state.

Another thing you can try is to temporarily reduce the Particle.keepAlive. It defaults to 23 minutes but you could try reducing it to 5 minutes or 2 minutes to see if the behavior changes. This should not be done permanently, because it will use a lot of cellular data, but this can rule out an issue with the port forwarding being removed by the carrier prematurely.

I do not have physical access to the devices right now, but I can update them, so I'll try lowering Particle.keepAlive.

It seems that setting Particle.keepAlive to 2 minutes solves the disconnection. This suggests that something changed at the carrier.

How do we proceed from here? Because my data usage will be through the roof! :wink:

Just in case you feel crazy like I do, I am having this exact issue as of January 21st on several of my Electron devices (Ontario, Canada). One in Windsor came back on its own and has been operating since the 23rd, but the others in different geographies are still down. Many periodic "status/online" messages, but nothing else. If I catch an "online" event I can sometimes GET cloud variables at that moment.

Yes it is very similar!

The device and date are similar, but I think it looks more like a carrier issue.

@rickkas7 What steps can we take from here?

I tried calling KPN, but they cannot help me; they say this has to go through Particle.

I see that the devices with the connection problems do report back every 23 minutes. I think this indicates that the "UDP hole punching" is working as documented. However, the Particle API treats each report as a new connection; you would not see these messages if the connection were still established.

@rickkas7 How can we escalate this issue?

@rickkas7 or someone at Particle, we really need help on this.

I asked support to take a look at this.

Hi folks,

We've already escalated this issue with our MVNO - they are investigating.
Let me get back to you about mitigation.

Thank you for your response!

Joining in here: I have very similar symptoms with devices in my fleet. Devices are listed as online, but in general neither pings nor vitals requests succeed.
Additionally, in the historical vitals records I see error -100 "unknown system error".

After a few ping or vitals attempts, the device starts to respond, and in one case even flashing was possible.

Same as reported by @tompesman: the devices used to be operational, no recent deployments were made, and no mechanical/electrical maintenance was carried out nearby.

Looking forward to a response, @no1089.

Hi folks,

I'm jumping in here to acknowledge that we have confirmation from our provider that a change was made on the 21st, without that change being shared with us. It's clear that this change led to sessions on Electron devices being closed prematurely, resulting in the behaviour that you've observed.

Work is ongoing to confirm the root cause, and we believe a solution will be implemented during the course of next week.

Thank you for the update!