Boron Inconsistent Runaway Data Usage

I have 4 Borons deployed as prototypes. Earlier today I noticed that 2 of them are using about an order of magnitude more cellular data than the other two (~1.5 MB per day vs ~2 MB per month). The devices are all running the same firmware and are all in the same location. There does not seem to be a correlation with signal quality; the high data devices also do not seem to disconnect from the cloud more or less frequently than the others. All devices usually have over 50% signal quality.

Furthermore, after looking through the cloud usage history, I couldn't find anything that would have used this much data. The devices averaged about 50 data operations per day and never exceeded 150. They are set to publish vitals every 5 minutes, which at first seemed frequent enough to be a potential source, but it doesn't explain using 20 MB in a few weeks, and all the devices are doing this, not just the high-data ones.

I have no idea what would cause one device to use ~10x the data of another identical device running the same firmware in the same location. Is there something going on behind the scenes that could cause this, independent of vitals frequency and data operations?

If the cellular data usage and data operations diverge, there are only a few possibilities, assuming you are not bypassing the Particle cloud by using a service that makes a direct TCP or UDP connection to an external server or service. However, your symptoms don't really match any of them.

The most common cause occurs when the device is able to connect to cellular but not the cloud, leaving it stuck in blinking cyan or fast blinking cyan. This is more common in areas of poor cellular connectivity. Poor connectivity can also incrementally increase cellular usage due to retries, but this would only be a 2-3x increase, not an order of magnitude, and you don't appear to have poor cellular connectivity.

Manually managing the cloud connection by forcing a session disconnect from the cloud or device side will dramatically increase data usage, because every full handshake uses 5-6K of data, whereas a session resume uses only around 100 bytes.
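To put those figures in perspective, here is a rough back-of-the-envelope sketch. The constants are simply the approximate values quoted above (5.5K is an assumed midpoint of "5-6K"), and the helper name is hypothetical, not part of any Particle API.

```cpp
#include <cstdint>

// Approximate per-session costs taken from the figures quoted above.
// These are rough estimates, not measured values.
constexpr std::uint32_t kFullHandshakeBytes = 5500; // assumed midpoint of "5-6K"
constexpr std::uint32_t kSessionResumeBytes = 100;  // "around 100 bytes"

// Hypothetical helper: estimated bytes per day spent on session setup alone,
// given how many full handshakes and how many session resumes occurred.
constexpr std::uint64_t sessionOverheadPerDay(std::uint32_t fullHandshakes,
                                              std::uint32_t resumes) {
    return static_cast<std::uint64_t>(fullHandshakes) * kFullHandshakeBytes +
           static_cast<std::uint64_t>(resumes) * kSessionResumeBytes;
}
```

Under these assumptions, it would take roughly 270 full handshakes per day (about one every five minutes) to account for ~1.5 MB per day from handshakes alone, while the same number of session resumes would cost only about 27 KB.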

I'm only sending data through publishes, and while I am using semi_automatic because some of the devices are solar-powered (and need to check the battery level before attempting to connect), the session is never forcibly disconnected programmatically. One other device has started doing this in a separate product, and its firmware is significantly simpler and does not use semi_automatic mode, so I don't think the firmware is the determining factor.

Is there a way to see a breakdown of what the data usage actually is? In the console I'm just seeing the raw total. It doesn't seem like the device is ever disconnecting from the cloud for any extended period, because vitals are being published at regular 5 minute intervals.

The one thing the affected devices have in common is that they are situated closer to the ground amid some potential obstructions, whereas the unaffected devices are deployed on equipment much higher up in the air. That would make poor cellular connectivity a plausible explanation, but I would expect it to be reflected in a much weaker signal quality, and there is no correlation with signal strength or quality.

Is there a way to mitigate this from within firmware, for example by detecting if cellular is connected but not the cloud for more than a timeout, then sleeping for a while before trying again? The prototypes these are on have worked wonderfully so far other than this, and I would like to start scaling them up potentially to ~1000 devices, but I can't if this is an issue. The 3 affected Borons are burning through well over 90% of all cellular data usage in my account. It's going to come right down to the wire as to whether the account will run out of data before the month rolls over. Particle's platform has always been a joy to work with, and it is a perfect fit for this use case, so I would love it if there is a fix for this.

The cellular usage tool may be helpful. It may also help to download the device vitals from the console and check whether any other parameter is unusual on the affected devices.

I wouldn't modify the firmware until you figure out what the actual cause is. While you can't measure the cellular usage on-device, you can detect the case where you're cellular connected and not cloud connected (Cellular.ready() && !Particle.connected()) and do something if you seem to be stuck in that state for a while. Since the data usage is continuous in that state, going into a sleep cycle might be appropriate.
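As a minimal sketch of that detection, the timing logic can be kept separate from the Device OS calls so it can be tested off-device. The class name and the timeout parameter are hypothetical; in firmware, the three arguments would come from Cellular.ready(), Particle.connected(), and System.millis().

```cpp
#include <cstdint>

// Tracks how long the device has been cellular-connected but not
// cloud-connected, and reports when a caller-chosen timeout has elapsed.
// The timeout value is an assumption; the thread does not recommend one.
class StuckConnectionWatchdog {
public:
    explicit StuckConnectionWatchdog(std::uint64_t timeoutMs)
        : timeoutMs_(timeoutMs) {}

    // Call once per loop() iteration.
    // Returns true once the stuck state has lasted at least timeoutMs.
    bool update(bool cellularReady, bool cloudConnected, std::uint64_t nowMs) {
        const bool stuck = cellularReady && !cloudConnected;
        if (!stuck) {
            // Either fully connected or cellular is down: reset the timer.
            inStuckState_ = false;
            return false;
        }
        if (!inStuckState_) {
            // Just entered the stuck state: remember when.
            inStuckState_ = true;
            stuckSinceMs_ = nowMs;
        }
        return (nowMs - stuckSinceMs_) >= timeoutMs_;
    }

private:
    std::uint64_t timeoutMs_;
    std::uint64_t stuckSinceMs_ = 0;
    bool inStuckState_ = false;
};
```

When update() returns true, the firmware could go into a sleep cycle as suggested above. The timeout presumably needs to be longer than a normal cloud connection takes, otherwise every ordinary connect would trip it.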

I've looked at the device vitals and there doesn't seem to be anything out of the ordinary or that would correlate with this. I'm not understanding how the device could be disconnected from the cloud (although that seems the most plausible) and still be publishing vitals regularly.

How long should the device stay connected to cellular but not the cloud before it times out, sleeps, and retries? Would about 10 seconds work? What is the typical rate of data usage in this state?

I marked the two affected devices in this product as development devices and updated them to keep a cumulative count of how many seconds they spend in this fast blinking cyan mode. The firmware records a timestamp from System.millis() whenever the device enters this mode (based on that condition), then adds the elapsed time to a cloud variable whenever the cloud reconnects. Because I don't have physical access to the deployed devices right now, I verified that it works by testing it on another Boron I had lying around, intentionally moving it to and from spots with poor signal quality to force reconnects.
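The bookkeeping described above might look roughly like the following sketch. It is not the poster's actual firmware: the class and method names are hypothetical, and the inputs stand in for Cellular.ready(), Particle.connected(), and System.millis().

```cpp
#include <cstdint>

// Accumulates total time spent cellular-connected but not cloud-connected.
// A timestamp is taken on entering that state, and the elapsed time is
// added to the running total when the cloud connection comes back.
class StuckTimeAccumulator {
public:
    // Call once per loop() iteration.
    void update(bool cellularReady, bool cloudConnected, std::uint64_t nowMs) {
        const bool stuck = cellularReady && !cloudConnected;
        if (stuck && !timing_) {
            // Entered the stuck state: start timing.
            timing_ = true;
            enteredMs_ = nowMs;
        } else if (cloudConnected && timing_) {
            // Cloud reconnected: add the elapsed time to the total.
            timing_ = false;
            totalMs_ += nowMs - enteredMs_;
        }
        // If cellular drops while timing, the timer keeps running until
        // the cloud reconnects, matching the description above.
    }

    // Whole seconds accumulated so far; this is the value that would be
    // exposed as a cloud variable.
    std::uint64_t totalSeconds() const { return totalMs_ / 1000; }

private:
    std::uint64_t totalMs_ = 0;
    std::uint64_t enteredMs_ = 0;
    bool timing_ = false;
};
```

Accumulating milliseconds and converting only when the value is read avoids losing the fractional second on every reconnect.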

So far, this has been running for several hours, and the affected devices have not entered fast blinking cyan mode once after connecting initially after the update. I will continue monitoring for a full 24 hours, but this seems increasingly bizarre.

Is there any way to look up what actual data is being sent to or from those devices to determine where this is coming from? They aren't publishing more data or vitals and they aren't reconnecting frequently, or getting hung while connected to cellular but not the cloud.

No, there is no way to see what data is being sent from the device before the device is cloud connected.

If the devices get close to the account's data limit before the data usage rolls over, is there an easily reversible way to stop their data usage temporarily? Normally I wrap publishes in a conditional that checks a 'quarantine' bool set by a cloud function to shut down excessive publishing remotely, but because this doesn't seem to be coming from publishes or vitals or anything configured on my end, what is the best way to do this? Can I deactivate the SIM and reactivate it later, so that the device comes back online upon reactivation without having to physically access it?
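For reference, the 'quarantine' pattern mentioned above could be sketched as follows. The class name and the "1"/"0" argument convention are assumptions for illustration; the handler is only shaped like a cloud function handler (string in, int out) and is written with std::string so it can run off-device.

```cpp
#include <string>

// Gate that publish code checks before sending anything.
class PublishGate {
public:
    // Handler in the style of a cloud function: "1" enables quarantine,
    // "0" clears it. Returns the new state, or -1 for an unrecognized argument.
    int setQuarantine(const std::string& arg) {
        if (arg == "1") {
            quarantined_ = true;
            return 1;
        }
        if (arg == "0") {
            quarantined_ = false;
            return 0;
        }
        return -1; // leave the current state unchanged
    }

    // Publishing code would be wrapped in: if (gate.mayPublish()) { ... }
    bool mayPublish() const { return !quarantined_; }

private:
    bool quarantined_ = false;
};
```

As noted above, a gate like this only limits traffic that the firmware itself generates through publishes, so it would not help if the data is being consumed elsewhere.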

Yes, deactivating the SIM can stop runaway data usage and reactivating it will typically allow it to reconnect without intervention at the device. It might take several minutes for the reactivation to propagate to the nearest tower, and in some cases reconnection may not occur until the next modem reset, which occurs after 10 minutes of attempting to reconnect.

How do inbound requests to a Particle device count against data usage? Could there be a scenario here where you're trying to communicate with the device via the API from an application in the cloud or some other location?

Cellular data is measured at the carrier level in both directions at the IP layer (technically, the PDP session).

The data is measured even if not received, as long as it transits the carrier. When retransmits occur at the CoAP layer to the Particle cloud, the cellular data could be counted multiple times.

If the device is known by the Particle cloud to be offline, however, doing a Particle variable or function request to it will not use cellular data, as the request will be rejected at the API layer instead of actually sending data and timing out.

Update: I deactivated the affected SIMs to avoid going over the data limit; when I reactivated them after the billing period rolled over, the issue was gone. I'm not sure if there is a causal link between deactivating and reactivating the SIMs and the data usage rate returning to normal, but everything seems to be fine now.
