Looking for text compression for published data

In an effort to reduce the amount of data we publish I’m looking for text compression that could be called just before the Publish. I found this when researching:

http://www.unit-conversion.info/texttools/compress/#data

It seems to work well for our data. And it appears to still be 7-bit ASCII safe. Does anyone recognize the compression being used here?

@sensorcheck - Others I'm sure will chime in here, but I think your options are Base64 or Base85 if you are transferring binary/numeric data. I personally use Base64 to convert a byte array to a string before publishing. It works well. It's on my list to also look at Base85 to pack the data a bit tighter, but it isn't a huge priority right now. For me, the marginal savings isn't yet worth the headache of converting to Base85. Here is a good reference post asking a similar question that links to several examples with libraries for Base64 and Base85.

I personally use this: Base64RK
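
Roughly, the usage looks like this. This is a minimal sketch, not my production code: the packed struct and event name are made up, and it assumes the library exposes a Base64::encodeToString() helper as shown in its README, so check the library header for the exact signature.

#include "Particle.h"
#include "Base64RK.h"

// Hypothetical packed sensor record - replace with your own layout.
struct __attribute__((packed)) SensorRecord {
    uint32_t timestamp;
    int16_t  temperature;   // tenths of a degree C
    uint16_t humidity;      // tenths of a percent RH
};

void publishRecord(const SensorRecord &rec) {
    // Encode the raw bytes as Base64 so the published payload stays ASCII-safe.
    String encoded = Base64::encodeToString((const uint8_t *)&rec, sizeof(rec));

    if (Particle.connected()) {
        Particle.publish("sensorData", encoded, PRIVATE, NO_ACK);
    }
}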

From my experience, you rarely hit the data (MBs/month) limit; rather, you may hit the number of data operations/month. So the name of the game is to squeeze as much data as possible into a single publish event. On a related topic, I also use the technique of assembling a JSON array of data to reduce data operations. This is explained here. Each member of the array is a new reading from a different period of time. I only publish once the max number of characters would be exceeded OR every 20 minutes, whichever comes first. That works for my use case; it may or may not work for yours, depending on how real-time the data needs to be.
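
In rough outline, the batching looks something like this - a sketch with illustrative names and limits, not the exact code from my project, and it assumes each individual reading already fits within the publish limit:

#include "Particle.h"

const size_t MAX_PUBLISH_LEN = 1024;                // Gen 3 publish limit
const unsigned long MAX_HOLD_MS = 20 * 60 * 1000;   // publish at least every 20 minutes

char batchBuf[MAX_PUBLISH_LEN + 1] = "[";
unsigned long lastPublishMs = 0;

void publishBatch() {
    if (strlen(batchBuf) <= 1) {
        return;                                     // nothing buffered yet
    }
    strcat(batchBuf, "]");                          // close the JSON array
    if (Particle.connected()) {
        Particle.publish("sensorBatch", batchBuf, PRIVATE, NO_ACK);
    }
    strcpy(batchBuf, "[");                          // reset for the next batch
    lastPublishMs = millis();
}

// Append one reading (already formatted as a JSON object) to the batch,
// publishing first if the new reading would not fit in a single publish.
void addReading(const char *readingJson) {
    size_t needed = strlen(readingJson) + 2;        // comma + closing bracket
    if (strlen(batchBuf) + needed > MAX_PUBLISH_LEN) {
        publishBatch();
    }
    if (strlen(batchBuf) > 1) {
        strcat(batchBuf, ",");
    }
    strcat(batchBuf, readingJson);
}

void loop() {
    // ... take readings and call addReading() ...
    if (millis() - lastPublishMs >= MAX_HOLD_MS) {
        publishBatch();
    }
}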

In this context “compression” doesn’t appear to be quite fitting.
AFAICT the linked page produces output that is longer than the input - hence it doesn’t compress :wink:
Although the blurb on that page states that it uses the Unix compress algorithm - which would be Lempel-Ziv-Welch (and LZW can “inflate” short input strings).

However, I guess you rather want to encode binary data for transport - as @jgskarda pointed out.
If space is a concern Base85 would be the way to go. Otherwise Base64 would be the next best option - for one, because there already is a ready made library.

BTW, Gen3+ devices have a publish limit of 1024 bytes.
Unless you have one of these

Based on the text on that page, it uses the LZW algorithm (Lempel–Ziv–Welch), which is what Unix compress uses. The output of that algorithm is binary, so they used some other binary-to-text encoding to turn it back into ASCII. It wasn’t immediately obvious to me which one, because it doesn’t appear to be Base64 or uuencode, which would have been my first two guesses.

The problem with compressing publishes is that in most cases a dictionary algorithm (like LZW) isn’t all that effective, because the strings are too short for removing duplicate sequences to help. Plus, the output will be binary, which requires that it be encoded back to ASCII, and that removes some of the benefit of the compression. For example, even if LZW shaved a 600-byte payload down to 450 bytes, Base64-encoding the binary result would bring it right back to 600 bytes (450 × 4/3). There are cases where it’s helpful, of course, but generally it doesn’t help that much.

If the data is JSON, the biggest help will be using short key names. One handy trick is to use single-character key names in the published JSON data which you can then expand to longer key names in a webhook using a mustache template.
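
For example (the key names, webhook field names, and values here are just illustrative):

#include "Particle.h"

// Publish with single-character keys to keep the payload small.
void publishReading(float tempC, float humidity, int battPct) {
    char buf[128];
    snprintf(buf, sizeof(buf), "{\"t\":%.1f,\"h\":%.1f,\"b\":%d}", tempC, humidity, battPct);

    if (Particle.connected()) {
        Particle.publish("reading", buf, PRIVATE, NO_ACK);
    }
}

// In the webhook's custom JSON body, mustache variables expand the short keys
// back into descriptive names for the receiving service, for example:
//
// {
//   "temperature": "{{{t}}}",
//   "humidity":    "{{{h}}}",
//   "battery":     "{{{b}}}"
// }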


Thank you for the Unix “compress” reference and the note about the underlying LZW compression algorithm. Also, I should have mentioned that our data is already printable ASCII. We send up to 20 logical records with each Publish operation. There are still opportunities to shorten those records, but compression came to mind as a potential next step. Glad to see others are also trying to pack more data into each physical Publish.

This looks interesting: https://pieroxy.net/blog/pages/lz-string/index.html. It would need to be ported.

I am realizing that the pressure for compression is coming from actual cumulative data usage per device that is WAY out of line with my expected totals. Our devices publish every 10 minutes and carry a maximum published string length of 700 bytes. Even if I round this up to the max of 1024 (Gen 3 Boron), the math comes out as 1024 * 144 (publishes per day) * 30 (days/month) = 4.2 MB/month. One of our devices is already over 18 MB for the month.

There were no OTA updates this month.
We publish with this syntax: Particle.publish("datapayload",vardata,PRIVATE,NO_ACK);
We stay connected and set keepalive: Particle.keepAlive(60);
We set publish vitals: Particle.publishVitals(43200);
Most of our time is spent in BLE.scan.

What else could be contributing to MBS_USED_CUMULATIVE per device?
Thanks,
Tim

Why are you using a 60 second keep-alive ping? The default is 23 minutes and if you are using the Particle SIM you should not have to change it. If you are using a 3rd-party SIM it could be necessary.

Each keep-alive uses 122 bytes of data but no data operations. At a 60-second interval that's 1,440 keep-alives per day, which adds up to about 5.2 MB per device per month, plus your other usage.

Are the devices that are using a lot of data also dropping their connection? Reconnecting also uses data but no data operations, and the amount can be significant for devices that are unable to do a session resume and instead have to do a full handshake. The most common cause of that is poor connectivity: the session resume fails partway through, so the device has to start over with a full handshake, which can take around 5K of data per reconnection. A session resume is only a couple hundred bytes.

Whew! This is great info and questions. Follow-up thoughts and questions:

keepAlive - I think you are correct in that this was left over from before we started using the Particle SIMs. All of our 3rd-party SIMs start with 8901170 - easy enough to check in the code and decide whether to set the keepAlive or leave the default.

dropping connections - I know from the console that the signal strength for the "worst" SIM is 16% and quality 25%. These are not great numbers. Dropped connections are a metric that I have for my 3rd-party SIMs but not for Particle SIMs (that I know of). Can Support provide this?

10-minute BLE scan - during the 10 minutes that my loop is scanning for BLE, we suspect that the cellular connection (on a different thread) could be dropping and reconnecting with a full handshake. Is there a passive way to tell that thread to "care less" during the BLE scan? I always perform a Particle connection check before publishing anyway.
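
For reference, that connection check before publishing is just a guard along these lines - a sketch, with the timeout value as a placeholder:

// Make sure the cloud connection is up before publishing; if it dropped
// during the BLE scan, give the system thread a chance to reconnect first.
void safePublish(const char *data) {
    if (!Particle.connected()) {
        Particle.connect();
        waitFor(Particle.connected, 30000);   // 30 s timeout is arbitrary
    }
    if (Particle.connected()) {
        Particle.publish("datapayload", data, PRIVATE, NO_ACK);
    }
}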

Thanks,
Tim

Found dropped connections in View Device, Last Vitals.


Doing some keep-alive tests, I stopped setting keepAlive and noticed the following "offline / online" behavior with an external SIM just before a published event:


According to JASPER (the AT&T SIM Control Center), the SIM is staying in session with the tower.

Questions: Are these cloud disconnects? Are they less expensive than keepAlive?

Oh, I didn't realize you were using a 3rd-party SIM card. You will probably need a 30 second keep-alive with an actual AT&T SIM.

Most cellular devices are basically on a huge NAT port forwarding network.

Your device is not assigned a public IP, and instead one is shared when you're sending data. This includes your Particle device. When it sends data to the Particle cloud by UDP, a temporary port forwarding is set up so the response UDP packet can be forwarded to the Particle device.

With the Particle SIM, the amount of time this port forwarding is kept alive is unusually long at 23 minutes. For 3rd-party SIMs, the timeout is anywhere from 30 seconds to a couple of minutes. For Wi-Fi and Ethernet (home and business networks with NAT), we assume 30-ish seconds.

What's happening is that AT&T sees that no data is being sent or received by your device and removes the port forwarding so the port can be used by another customer. The PDP (IP data) session is still up, but that port will no longer work for sending data to the Particle device.

This means the cloud can no longer initiate communication from the cloud side, so functions, variables, events subscribed to on-device, and OTA won't work from the time the port forwarding stopped.

When the device goes to send data again, on the next keep-alive or publish, it will be assigned a new port, and possibly a new IP address, by your carrier. The Particle cloud will recognize that the port/IP changed and issue an offline event (because the device was offline, but we only just realized it), then an online event (because it is now online). This is a lighter-weight synchronization, less than a full handshake but more than a keep-alive.

The reason the cloud doesn't know the device fell offline when the port forwarding was removed by the carrier is that there is no way to know this. Neither side is notified, so the only way we know is when we get a port/IP change, a full reconnection, or we've missed two 23-minute keep-alive packets. If we get no cellular data after around 46 minutes, we mark the device as offline. For Wi-Fi devices, which have a 25-30 second keep-alive, the offline marking happens after around a minute. Setting a different keep-alive doesn't affect the offline indicator timeout.

Ah, sorry, yes, we have both. The product started with AT&T SIMs (3rd party), with Particle SIMs used in test. We recently started deploying with Particle SIMs and just moved the product out of the Sandbox. Now most of our devices are using Particle SIMs.

Based on this discussion, for keepAlive I now have the following check in setup():

// SIM Type
SimType simtype;

void setup() {    
    .
    .
    .
    // Get SIM Type
    simtype = Cellular.getActiveSim();
    
    // Set keepAlive
    // Type 2 (external / 3rd-party ATT SIM): set to every 2 minutes
    // Type 1 (internal Particle SIM): run with the Particle default
    if (simtype==2) {
        Log.info("===> PARTICLE KEEP ALIVE");
        Particle.keepAlive(60 * 2);
    }
}

New testing this weekend has shown that a 2-minute keepAlive for the ATT 3rd-party SIMs appears to avoid the cloud disconnections, at least from what I can see on the console.

With this change to keepAlive, a slight reduction of our publish payload, and the daily System.reset() (see System.reset() thread), I'm pushing the new firmware out today and will monitor over the next few weeks to see if daily usage comes down across the board.

Then we have "the big 5": 5 devices out of 24 that use 3-4X the daily data compared to the rest. Publish data volumes for these are greater than for the others, but that should only account for 2-3 MB/month of extra data. The number of data operations per month is the same for all. A Particle SIM is used for all. Device OS is 4.0.2 for all. Firmware is the same for all. This leaves location, location, location as the variable that matters? :slight_smile:


We will continue to track this down and share findings here.
