Apparent cellular connectivity issues across the fleet for the fourth day

Hi,

This is our 3rd year using Particle as a paying customer with ~100 units fielded and a handful used for testing purposes.
It started on May 18th: the units (Boron 404X) began having trouble connecting to the cell network. They flash the green LED for minutes before eventually connecting to the cloud, or sometimes go through more unusual patterns we have never seen before.

This behavior correlates with a bump in the event traffic across the fleet which in the past has been consistent with cellular connectivity degradation. I have attached the screen capture of the "fleet health" page for the last 30 days.
I cannot be sure what is going on with the units out in the field, but there is one 404X which we use for testing that exhibits this behavior: e00fce68f42925354b743022
We were hoping that the issue would be picked up and resolved, but since it has persisted for the fourth day straight and the Particle status page shows that everything is fine, we decided to bring this issue up.

Just went from bad to quite a bit worse. Had >60 devices online, now 49... in combination with traffic spiking even higher

There is currently a cellular issue, but the last couple of days were normal.

An increase in events can be triggered by many things. However, it's impossible to say more without knowing what in your application is responsible for the publishes.

If you look at your integration and its errors, you'll see this.

That happens when your server does not respond within 15 seconds; the Webhook will then be tried again, up to three times in total. If we keep seeing failures, messages will be skipped.
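For anyone following along, the retry rule described above can be sketched in plain C++. This is only a model of the stated behavior (15-second timeout, three attempts in total), not Particle's implementation; `DeliveryResult` and `deliver` are hypothetical names, and the "messages will be skipped" throttling is not modeled because its thresholds are not given here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the webhook retry rule; not Particle's code.
struct DeliveryResult {
    bool delivered;  // true if some attempt was answered before the timeout
    int attempts;    // number of requests the server received
};

// responseSeconds[i] is how long the server takes to answer attempt i.
DeliveryResult deliver(const std::vector<double>& responseSeconds,
                       double timeoutSeconds = 15.0,
                       int maxAttempts = 3) {
    assert(timeoutSeconds > 0.0 && maxAttempts > 0);
    DeliveryResult result{false, 0};
    for (std::size_t i = 0;
         i < responseSeconds.size() && result.attempts < maxAttempts; ++i) {
        ++result.attempts;
        if (responseSeconds[i] < timeoutSeconds) {
            result.delivered = true;  // answered in time, no further retries
            break;
        }
    }
    return result;
}
```

Under this model, a server that answers in 0.2 seconds is only ever hit once per event; a single event can reach the server up to three times only when responses exceed the timeout.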

I'm guessing that if your devices don't receive a response from the server, they try again? That could explain why it's spiking so badly. Cellular issues will exacerbate this, but it's most likely being caused by the server being overwhelmed.

Well, nothing in either hardware or software has changed in over 2 years. The spike in traffic has always been associated with degraded connectivity. Our application attempts to publish new data at intervals of 15 minutes or longer and tries to republish unless confirmation of successful transmission is received.
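The republish-until-confirmed behavior described above can be illustrated with a small, hardware-independent C++ sketch. This is not the poster's firmware: `AckTracker`, its method names, and the timeout passed in are all hypothetical, and on a real device the timestamps would come from a millisecond tick counter while `onConfirmation()` would be called from the hook-response handler.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: tracks one outstanding data point and decides
// when it should be published again.
class AckTracker {
public:
    explicit AckTracker(std::uint32_t ackTimeoutMs)
        : ackTimeoutMs_(ackTimeoutMs) {
        assert(ackTimeoutMs > 0);
    }

    // Call right after publishing the data point.
    void onPublish(std::uint32_t nowMs) {
        waiting_ = true;
        lastPublishMs_ = nowMs;
        ++attempts_;
    }

    // Call when the server's confirmation reaches the device.
    void onConfirmation() { waiting_ = false; }

    // True when a confirmation is still missing and the timeout has elapsed.
    bool shouldRepublish(std::uint32_t nowMs) const {
        // Unsigned subtraction stays correct across tick-counter rollover.
        return waiting_ && (nowMs - lastPublishMs_ >= ackTimeoutMs_);
    }

    bool confirmed() const { return attempts_ > 0 && !waiting_; }
    int attempts() const { return attempts_; }

private:
    std::uint32_t ackTimeoutMs_;
    std::uint32_t lastPublishMs_ = 0;
    bool waiting_ = false;
    int attempts_ = 0;
};
```

With logic of this shape, every confirmation that is delayed past the timeout or lost on the downlink produces one extra publish, even though the server handled the original request promptly.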

See my message above please.

We typed our messages at the same time.
Just went through our server logs for today - average response time is <0.2 seconds (less than 200 milliseconds).
The only possible reason for this issue would be a degraded internet connection between our server and yours. I do not see anything abnormal or unusual on our side. Our servers are in the US.

Of course if confirmations are delayed or lost due to the cellular connectivity issues, the messages would be resent.

I can attest to the fact that we did see connectivity issues starting at least May 19 with 2-3 units we have been testing prior to deployment. Did not raise an alarm because the sample was limited and there was hope that the issue would be resolved.

Our infrastructure (Webhooks) is operating as expected, but there is a widespread cellular outage today. status.particle.io has the details.

You could have been experiencing local network issues, but in general the cellular side was OK this past week.

I cannot prove it, as there are many factors involved, but what I have seen today with the unit sitting in front of me is consistent with the webhook responses not being received by the unit.

It looks like something was done in response to my problem report that caused a collapse in connectivity. Correlation does not establish causality, but...

Our fleet is deployed over a very large area, primarily on the West Coast, from Washington to Southern California.

Identified - We are seeing initial signs of recovery as connectivity is being restored for a portion of affected devices. Our teams continue to closely monitor the situation and work with carrier partners to ensure full service restoration.
May 21, 2026 - 08:44 PDT

Today we had an outage, which would have negatively affected this, and explains today’s issues.

The errors on your webhook indicate your server is not always responding as quickly as needed, leading to retries. That’s all I can offer for what happened previously.

If your device relies on the ack, it will retry if that’s dropped. I have not looked at your signal metrics.

I checked - our Webhook infrastructure is holding strong.

Thank you. I guess I will just have to wait for someone with more weight to complain.

Will just have to wait and see if this issue throws us over the data allowance limit this month so we would have an opportunity to pay for one more block of data.

No other complaints of this nature have been received. As far as I can tell it's unique to your fleet.
Device side logs would help us understand what's going on here.

It started on May 19 and it is getting worse. It is not related to your webhook infrastructure. The units simply cannot connect to the cellular network! Our devices are solar powered and sleep with the cellular modem powered off.
They simply have great difficulty reconnecting when coming out of sleep. The time it takes to reconnect to the cell network is longer than what we normally see during the first activation: minutes and minutes!
This is also random; sometimes the connection is established in a few seconds, as it normally would be.
There are no issues with our server, as evidenced by the virtual absence of integration errors.

I believe we would all agree that one or a handful of errors per day out of several thousand operations is not much.

The traffic jumped on May 19 and is climbing. Now it stands at 2x the normal value.
I can see our server responses in the event stream, but they do not appear to be received by the units.

Apparently

Particle.subscribe(System.deviceID() + "/hook-response/data", CloudResponceHandler, MY_DEVICES);

is not triggered in many cases.
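As a rough sanity check, the retry traffic can be related to the share of confirmations that actually reach the device. The sketch below assumes, purely for illustration, that each attempt is confirmed independently with probability `pConfirm` and that the device gives up after `maxAttempts` tries; `expectedPublishes` is a hypothetical helper and neither parameter comes from the fleet data.

```cpp
#include <cassert>
#include <cmath>

// Expected publishes per data point when each attempt is confirmed
// independently with probability pConfirm and the device gives up after
// maxAttempts tries. Both parameters are illustrative assumptions.
double expectedPublishes(double pConfirm, int maxAttempts) {
    assert(pConfirm > 0.0 && pConfirm <= 1.0 && maxAttempts >= 1);
    // Mean of a truncated geometric distribution: (1 - (1 - p)^n) / p.
    return (1.0 - std::pow(1.0 - pConfirm, maxAttempts)) / pConfirm;
}
```

Under these assumptions, `expectedPublishes(1.0, n)` is 1 for any `n`, and with a generous retry limit the value approaches `1 / pConfirm`. Traffic at about 2x normal would then be consistent with roughly half of the confirmations not arriving in time, although other sources of events could contribute as well.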

Everything points to some serious issues with the cellular communication. We have conducted tests and the issue is resolved if the network interface is put in inactive standby during sleep: (NETWORK_INTERFACE_CELLULAR, SystemSleepNetworkFlag::INACTIVE_STANDBY);

In this case the cell network connection is apparently working fine: the device reconnects to the cloud almost instantaneously when coming out of sleep, and transmissions are not delayed or lost.

Our system heavily relies on the real-time data transmission and all the points which fail to be transmitted in real time are flagged. Here is the representative plot:

I do not know what happened on May 19, but whatever it was it had a great negative effect on the reliability of the cellular connectivity. It does not appear to be carrier or location specific either. Our fleet consists of ~100 units deployed mostly in CA, OR and Washington.
The basic connectivity issue can be easily replicated by having a Boron sleep with the cell modem powered off, wake up every few minutes and try to reconnect. Lots of green LED blinking!

Can you please send me some logs of the units failing to connect?
You can use https://docs.particle.io/reference/device-os/api/logging/ to enable logging in your app directly - ALL logs should be enabled please.

We have conducted tests and the issue is resolved if the network interface is put in inactive standby during sleep: (NETWORK_INTERFACE_CELLULAR, SystemSleepNetworkFlag::INACTIVE_STANDBY);

The above appears to indicate that your device is not waiting for the ACK: it goes to sleep and turns off the modem before the message is received. If it has to keep the cellular connection alive to receive the message, that would mean the wake cycle is too short.

I am out of office doing field work. For obvious reasons I can only send you logs from the test units I have in the office. This will not happen before Friday.

Just so that we are on the same page: I have personally designed all of the electronics and written all of the firmware, backend server code and frontend web user interface for our solution. And I have been doing electronics design and writing software professionally for over 30 years.
The solution in question has been deployed in the field with minor design changes for ~7 years, originally with modems from another vendor. It has also been tested very extensively with Digi XBee modems in combination with AT&T, Verizon, Hologram and other SIMs.
I am saying this to explain that you should not try looking for simple algorithmic bugs in our firmware. The units do wait for a response before going to sleep. I can send you the whole code if you like, but not for public consumption.
Everything has been working predictably and as expected for 2+ years. Our server response time almost never exceeds 0.2 seconds.
I only surface to ask questions when there is an issue which I have double-checked and cross-referenced. This is the first time in 2+ years I expressed concerns regarding connectivity.