Designing systems for critical message reliability

zachary · April 2, 2018, 5:29pm

Continuing the discussion from March 20 incident postmortem:

Thanks for the transparency. What can be done from the device firmware perspective to work around this when it happens? I had built a few precautions into my devices for whenever the device senses there is no cloud or cellular connectivity (i.e., storing some data locally and waiting until the connection is restored before transmitting). However, these measures did not work in this event.

For context, we have a connected-machines service. Our customers rely on us to sense when their machines are operating and providing a service; they in turn bill their customers based on this info. Hence, the data we get from the IoT devices is not real-time sensitive, but we need to ensure it gets to its destination. We could not do that during the downtime, and our local storage solution did not work because the Particle devices thought they had connectivity. We use the Particle cloud and webhooks to send the messages from Particle to our final dashboard/data analysis solutions. We liked this approach (as opposed to sending the data from the device directly to our cloud) because of the built-in security and data efficiency of the Particle.publish methods.

Thanks in advance for the suggestions.

Outstanding question. The solution will vary from product to product, but let's talk about some of the constraints.

Distributed systems are hard. Perfectly reliable message delivery is impossible to achieve without infinite memory and time. However, there are lots of things we can do to make any particular system reliable more of the time.

First, I'd recommend defining precisely what you need to know reliably. In the scenario you describe, it seems (please correct me if I'm wrong) like you don't need to hear every single message, but instead you need to know whether a machine was active in some time window related to billing. Do your customers bill by the day? By the hour? By the minute? The more fine-grained the billing, the harder this problem is, and the more you have to accept some small level of uncertainty in the results. If they bill by the day, you only need to know whether a machine sent any messages in a given day, and you don't need to store every single one of those messages.
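To make the "was it active in this window?" idea concrete, here is a minimal sketch of the server-side bookkeeping, written in Python purely for illustration. The names `ActivityLog` and `window_key` are hypothetical, and the sketch assumes each incoming message carries a device ID and a Unix timestamp.

```python
def window_key(ts: float, window_seconds: int) -> int:
    """Map a Unix timestamp to the index of the billing window that contains it."""
    return int(ts // window_seconds)


class ActivityLog:
    """Remembers only which (device, window) pairs saw at least one message."""

    def __init__(self, window_seconds: int):
        self.window_seconds = window_seconds
        self.active = set()  # set of (device_id, window index) pairs

    def record(self, device_id: str, ts: float) -> None:
        # Repeated messages in the same window collapse into one entry.
        self.active.add((device_id, window_key(ts, self.window_seconds)))

    def was_active(self, device_id: str, ts: float) -> bool:
        return (device_id, window_key(ts, self.window_seconds)) in self.active
```

With a daily window (86400 seconds), any number of messages from one machine on one day costs a single entry, so losing a few of them doesn't change the billing answer.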

Your on-device caching—storing messages when you know there's no connection to the cloud—is exactly what you should be doing. Great job.

However, if you want to do the engineering work to handle the (rare) cases where messages get into the cloud but webhooks don't reach your server, then you always need to cache messages on the device until you hear a confirmation from the end system that will receive the message. Luckily, Particle makes one form of this pretty easy: the device can receive the webhook HTTP response given by your server. You can use these responses as a signal that it's OK to delete the locally stored messages.
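Here is a rough sketch of that cache-until-acknowledged logic, in Python for readability rather than as firmware; the same structure would need to be ported to the device. `Outbox` and its methods are hypothetical names, and the sketch assumes your server's response echoes back the ID of the message it received.

```python
from collections import OrderedDict


class Outbox:
    """Holds every message until the destination server confirms receipt."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pending = OrderedDict()  # message_id -> payload, oldest first

    def enqueue(self, message_id: str, payload: str) -> bool:
        """Store a message before sending it. Returns False if the cache is full."""
        if message_id in self.pending:
            return True  # already cached, nothing to do
        if len(self.pending) >= self.capacity:
            return False
        self.pending[message_id] = payload
        return True

    def on_ack(self, message_id: str) -> None:
        """Call when the server's response names this message as received."""
        self.pending.pop(message_id, None)  # unknown or repeated acks are ignored

    def unacknowledged(self):
        """Messages that still need to be sent or resent, oldest first."""
        return list(self.pending.items())
```

The key difference from caching only while offline is that a message is deleted on acknowledgement, not on a successful publish.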

If for some reason this solution doesn't work for you, you can also code a separate acknowledgement mechanism by having your server hit the API to publish a message or call a device function upon webhook receipt. Just recognize the overhead in time and communication, and gut check whether your solution will scale as your customer base grows.

Yet what happens when your server sends an acknowledging response but the device doesn't receive it? How long should the device wait before retrying? This will be constrained by the memory you have on the device, the chattiness of the device, and how big the messages are.
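A back-of-the-envelope way to reason about those constraints, sketched in Python. Both helpers are hypothetical, and the numbers in the usage note are made up for illustration, not measurements from any real device.

```python
def buffer_duration_seconds(buffer_bytes: int, message_bytes: int,
                            messages_per_hour: float) -> float:
    """How long unacknowledged messages can pile up before the cache is full."""
    capacity = buffer_bytes // message_bytes  # whole messages that fit
    return capacity * 3600 / messages_per_hour


def retry_delays(base_seconds: int, max_seconds: int, attempts: int) -> list:
    """Exponential backoff: double the wait after each failed attempt, up to a cap."""
    return [min(base_seconds * 2 ** n, max_seconds) for n in range(attempts)]
```

For example, a 2048-byte cache of 64-byte messages produced twice an hour holds 32 messages, i.e. 16 hours of outage. The retry schedule should fit comfortably inside whatever that figure is for your device.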

Any retry attempts you make, though, could result in duplicate messages received by your server, depending on the exact nature of the problem. Maybe the webhook was sent fine, but the response event got lost on the way back to the device. You can mitigate this by attaching a message ID to each message, which makes it easier to de-duplicate messages at the end of the pipeline. This could be as simple as the device ID and the timestamp, but since these constrained devices have limited knowledge of wall clock time, you may want to add a small random component to the ID.
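A minimal sketch of that ID scheme and the matching de-duplication step, in Python for illustration. `make_message_id` and `Deduplicator` are hypothetical names; the ID must be generated once when the message is created and reused unchanged on every retry.

```python
import secrets


def make_message_id(device_id: str, timestamp: int) -> str:
    """Device ID + device timestamp + a few random hex digits."""
    return f"{device_id}-{timestamp}-{secrets.token_hex(2)}"


class Deduplicator:
    """Server-side filter that lets each message ID through only once."""

    def __init__(self):
        self.seen = set()

    def accept(self, message_id: str) -> bool:
        """True the first time an ID is seen, False for any repeat."""
        if message_id in self.seen:
            return False
        self.seen.add(message_id)
        return True
```

In a real deployment the `seen` set would need to live in durable storage and be expired after some period, since it otherwise grows without bound.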

Then we come back to the problem of how much data this means you actually need to store on the device. Take a fresh look at the problem.

Can the devices simply communicate less, e.g., in whatever period they send 30 messages today, could they instead send stats (avg, min, max, stddev) for that same window?
Can the individual messages be more compact?
Maybe some messages are critical, requiring this whole caching+acknowledgement scheme, whereas some other messages are not and don't need to be remembered. You can save a lot of memory by simply acknowledging to yourself that certain message types can be dropped with minimal adverse effects, allowing more memory for the truly critical messages.
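As one illustration of the first idea, summary stats can be accumulated without storing the individual readings. This Python sketch uses Welford's online algorithm for the variance; `RunningStats` is a hypothetical name, and the logic would need porting to firmware.

```python
import math


class RunningStats:
    """Accumulates count, mean, min, max and stddev in constant memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def add(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    def stddev(self) -> float:
        """Sample standard deviation; 0.0 when fewer than two readings."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))

    def summary(self) -> dict:
        """The single payload to send in place of the individual readings."""
        return {"n": self.count, "avg": self.mean, "min": self.min,
                "max": self.max, "stddev": self.stddev()}
```

One `summary()` payload per window replaces however many individual messages the window would otherwise have produced, which also shrinks what has to be cached while waiting for an acknowledgement.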

Who else out there has designed IoT products with high reliability requirements? Let's share some best practices!

fenriquez · April 2, 2018, 7:10pm

Thanks Zachary for the recommendations. Look forward to hearing best practices as a response to this post.

We looked at the webhook HTTP response option when designing the product but initially discarded it in order to save data over the cellular network. We use the Electron rather than Wi-Fi because Wi-Fi is often unavailable where the machines are located. It may be time to revisit it, as the Electron plans now include 3MB, which gives us more room to maneuver.

FYI, our initial customers bill by “service”, i.e., the machine is connected and performing a task during a predefined period of time (normally 30 or 40 min). We sense this activity in the machine and send a count to our servers. Normally we send this in real time as soon as each service is finished, or store it locally when the network is down. We also send alarm activity in real time whenever the machine has a local alarm activated during the “service”, but we don’t store alarm info locally if the network is down. The critical data is the count of services. Alarm data may be useful to store for future releases when we start offering predictive maintenance analysis, but for now it is discarded if the cloud or network is down. Future customers we want to serve may have “by the hour”, “by the day”, or “by the minute” billing models.

BulldogLowell · April 2, 2018, 7:52pm

Given your interesting and unique requirement, I would think that local data storage would be a good backup. Storing the service events on some type of persistent memory device (e.g., SD card or FRAM) would give you a hardware backup of events, and you could develop a monthly process to check that your Electron didn’t lose cellular for some extended time during a qualified billing event. A simple event counter on the device can let you know if you received every event, by comparing it to the count on your server.
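The counter comparison could look something like this on the server side, sketched in Python. `missing_events` is a hypothetical helper, and it assumes the device attaches a counter that increases by exactly 1 per service event and survives reboots (for example, by keeping it in persistent memory).

```python
def missing_events(received_counters, last_known=0):
    """Return counter values the server never saw.

    received_counters: counter values attached to the events that arrived.
    last_known: highest counter value already reconciled in a previous check.
    """
    seen = set(received_counters)
    if not seen:
        return []
    highest = max(seen)
    return [n for n in range(last_known + 1, highest + 1) if n not in seen]
```

Any values returned are events to look up in the device's local backup. Note that a gap at the very end only shows up once a later event arrives, so a periodic check still makes sense.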

Given your comment re: cheaper data, creating a function that dumps a billing period’s worth of data once a month would be an option. You would still have to determine how to store it compactly.

FYI, my experience is that cellular connectivity (even in areas with good coverage) is the real enemy, and the hour of service outage was the rare anomaly.
