Designing systems for critical message reliability

Continuing the discussion from March 20 incident postmortem:

Outstanding question. The solution will vary from product to product, but let's talk about some of the constraints.

Distributed systems are hard. Perfectly reliable message delivery is impossible to achieve without infinite memory and time. However, there are lots of things we can do to make any particular system more likely to be more reliable more of the time.

First, I'd recommend defining precisely what you need to know reliably. In the scenario you describe, it seems—please correct me if I'm wrong—like you don't need to hear every single message, but instead you need to know whether a machine was active in some time window related to billing. Do your customers bill by the day? By the hour? By the minute? The more fine-grained the billing, the harder this problem is, and the more you have to accept some small level of uncertainty in the results. If they bill by the day, you only need to know whether a machine sent any messages in a given day, and you don't need to store every single one of those messages.
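For example, with daily billing the device-side state can collapse to a single activity flag. A minimal sketch in Wiring-style C++, assuming cloud-synced time; the event name and variable names here are illustrative, not from your product:

```cpp
#include "Particle.h"

// For daily billing, track one "was the machine active today?" flag
// instead of a log of every message.
bool activeToday = false;
int  currentDay  = -1;

void recordActivity() {
    if (Time.day() != currentDay) {        // relies on cloud-synced time
        // A new billing day started: report yesterday's flag, then reset.
        // (A real implementation would also publish at rollover even if
        // recordActivity() isn't called that day.)
        if (currentDay != -1) {
            Particle.publish("daily-activity", activeToday ? "1" : "0", PRIVATE);
        }
        currentDay  = Time.day();
        activeToday = false;
    }
    activeToday = true;                    // any message marks the day active
}
```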

Your on-device caching—storing messages when you know there's no connection to the cloud—is exactly what you should be doing. Great job.

However, if you want to do the engineering work to handle the (rare) cases where messages get into the cloud but webhooks don't reach your server, then you need to keep each message cached on the device until you hear a confirmation from the end system that will receive it. Luckily :particle: makes one form of this pretty easy—the device can receive the webhook HTTP response given by your server. You can use these responses as a signal that it's OK to delete the locally stored messages.
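A minimal sketch of that pattern; the `machine-status` event name and the `cache*` helpers are hypothetical placeholders for your own storage layer:

```cpp
#include "Particle.h"

// Hypothetical helpers you'd implement over your own storage layer.
bool        cacheHasPending();
const char *cacheOldest();
void        cacheRemoveOldest();

// The cloud republishes your server's HTTP response back to the device as
// an event named "hook-response/<original-event-name>/<index>".
void hookResponseHandler(const char *event, const char *data) {
    // The server's response arrived, so the cached copy can be deleted.
    cacheRemoveOldest();
}

void setup() {
    // (Older Device OS releases may need the MY_DEVICES scope argument.)
    Particle.subscribe("hook-response/machine-status", hookResponseHandler);
}

void loop() {
    // Publish the oldest cached message, but keep it cached until the
    // hook-response confirms receipt. This simple scheme keeps at most
    // one message in flight at a time.
    if (cacheHasPending()) {
        Particle.publish("machine-status", cacheOldest(), PRIVATE);
    }
    delay(60000);
}
```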

If for some reason this solution doesn't work for you, you can also code a separate acknowledgement mechanism by having your server hit the :particle: API to publish a message or call a device function upon webhook receipt. Just recognize the overhead in time and communication, and gut check whether your solution will scale as your customer base grows.
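A sketch of the device side of that scheme; `ackMessage` and `cacheRemove` are hypothetical names, and your server would invoke the function through the Particle Cloud API's device-function endpoint:

```cpp
#include "Particle.h"

void cacheRemove(String msgId);   // hypothetical helper over your storage layer

// A cloud-callable function the server can use as an explicit ACK. After
// your server receives the webhook, it calls
//   POST https://api.particle.io/v1/devices/<deviceID>/ackMessage
// with the acknowledged message ID in the `arg` parameter.
int ackMessage(String msgId) {
    cacheRemove(msgId);
    return 0;   // any non-negative value signals success to the caller
}

void setup() {
    Particle.function("ackMessage", ackMessage);
}
```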

Yet what happens when your server sends an acknowledging response but the device doesn't receive it? How long should the device wait before retrying? This will be constrained by the memory you have on the device, the chattiness of the device, and how big the messages are.
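One common answer is a capped exponential backoff. A sketch, with illustrative constants and the same hypothetical cache helpers as in the sketches above; tune the numbers against your data plan, message size, and memory budget:

```cpp
#include "Particle.h"

bool        cacheHasPending();   // hypothetical helpers, as above
const char *cacheOldest();

const unsigned long RETRY_BASE_MS = 30000UL;     // first retry after 30 s
const unsigned long RETRY_MAX_MS  = 3600000UL;   // never wait more than 1 h

unsigned long retryDelayMs = RETRY_BASE_MS;
unsigned long lastSentAt   = 0;

void maybeRetry() {
    // millis() subtraction is rollover-safe with unsigned arithmetic.
    if (cacheHasPending() && millis() - lastSentAt >= retryDelayMs) {
        Particle.publish("machine-status", cacheOldest(), PRIVATE);
        lastSentAt = millis();
        // Double the wait after every unacknowledged attempt, up to the cap.
        retryDelayMs = min(retryDelayMs * 2, RETRY_MAX_MS);
    }
}

// Call this from your acknowledgement handler to reset the backoff.
void onAcknowledged() {
    retryDelayMs = RETRY_BASE_MS;
}
```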

Any retry attempts you make, though, could result in duplicate messages received by your server, depending on the exact nature of the problem. Maybe the webhook was sent fine, but the response event got lost on the way back to the device. You can mitigate this by assigning each message an ID, making it easy to de-duplicate at the end of the pipeline. The ID could be as simple as the device ID plus the timestamp, but since these constrained devices have limited knowledge of wall-clock time, you may want to add a small random component.
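A minimal sketch of such an ID; the separator and the width of the random suffix are arbitrary choices:

```cpp
#include "Particle.h"

// Build a de-duplication ID from the device ID, the timestamp, and a small
// random component. The random suffix guards against timestamp collisions,
// since the device's clock may repeat values after a reset. Call
// randomSeed() once at boot for better dispersion.
String makeMessageId() {
    return System.deviceID() + "-"
         + String((unsigned long)Time.now()) + "-"
         + String(random(0x10000), HEX);
}
```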

Then we come back to the problem of: OK, how much data does this mean you actually need to store on the device? Take a fresh look at the problem.

  • Can the devices simply communicate less, e.g., in whatever period they send 30 messages today, could they instead send stats (avg, min, max, stddev) for that same window? (See the sketch after this list.)
  • Can the individual messages be more compact?
  • Maybe some messages are critical, requiring this whole caching+acknowledgement scheme, whereas some other messages are not and don't need to be remembered. You can save a lot of memory by simply acknowledging to yourself that certain message types can be dropped with minimal adverse effects, allowing more memory for the truly critical messages.
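On the first point, a running-stats accumulator means no per-sample history has to be kept at all. A sketch using Welford's online algorithm; the event name and JSON layout are illustrative:

```cpp
#include "Particle.h"
#include <math.h>

// Accumulate summary stats for a window without storing the samples.
struct WindowStats {
    int    n = 0;
    double mean = 0, m2 = 0;                  // Welford running mean/variance
    double minV = INFINITY, maxV = -INFINITY;

    void add(double x) {
        n++;
        double d = x - mean;
        mean += d / n;
        m2   += d * (x - mean);
        minV = fmin(minV, x);
        maxV = fmax(maxV, x);
    }
    double stddev() const { return n > 1 ? sqrt(m2 / (n - 1)) : 0; }
};

void publishWindow(WindowStats &s) {
    if (s.n == 0) return;
    // One compact publish instead of n separate messages.
    Particle.publish("stats",
        String::format("{\"avg\":%.2f,\"min\":%.2f,\"max\":%.2f,\"sd\":%.2f}",
                       s.mean, s.minV, s.maxV, s.stddev()),
        PRIVATE);
    s = WindowStats();   // reset for the next window
}
```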

Who else out there has designed IoT products on :particle: with high reliability requirements? Let's share some best practices!


Thanks, Zachary, for the recommendations. I look forward to hearing best practices in response to this post.

We looked at the webhook HTTP response option when designing the product but initially discarded it to save data over the cellular network. We use the Electron rather than WiFi because of the lack of WiFi availability where the machines are located. It may be time to revisit that option, as the Electron plans now include 3 MB, which gives us more room to maneuver.

FYI, our initial customers bill by “service”, i.e., the machine is connected and performing a task during a predefined period of time (normally 30 or 40 minutes). We sense this activity in the machine and send a count to our servers. Normally we send this in real time as soon as each service finishes, or store it locally when the network is down. We also send alarm activity in real time whenever the machine has a local alarm activated during the “service”, but we don't store alarm info locally if the network is down. The critical data is the count of services. Alarm data may be useful to store for future releases, when we start offering predictive maintenance analysis, but for now it is discarded if the cloud or network is down. Future customers we want to serve may have “by the hour”, “by the day”, or “by the minute” types of models.


Given your interesting and unique requirement, I would think that local data storage would be a good backup. Storing the service events on some type of persistent memory (e.g., SD card, FRAM, etc.) would give you a hardware backup of events, and you could develop a monthly process to check that your Electron didn't lose cellular connectivity for some extended time during a qualified billing event. A simple event counter on the device, compared against your server's count, can tell you whether you received every event.
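As one possible shape for that counter, here's a sketch using Device OS's emulated-EEPROM API; the address and function names are illustrative:

```cpp
#include "Particle.h"

// A service counter persisted in the Electron's emulated EEPROM, so a
// reboot or power loss can't wipe the running total. The server compares
// its own count against this one at month end.
const int COUNTER_ADDR = 0;

uint32_t loadCounter() {
    uint32_t count = 0;
    EEPROM.get(COUNTER_ADDR, count);
    return count == 0xFFFFFFFF ? 0 : count;   // unwritten flash reads as 0xFF
}

void recordService() {
    uint32_t count = loadCounter() + 1;
    EEPROM.put(COUNTER_ADDR, count);          // survives reboot and power loss
}
```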

Given your comment re: cheaper data, creating a function that dumps a billing period's worth of data once a month would be an option. You'd still have to determine how to store it compactly.
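One way to keep it compact is a fixed-size packed record per service event; the fields below are guesses based on the description upthread:

```cpp
#include <stdint.h>

// One packed record per service event keeps a month's worth of data small
// enough to dump in a single publish or a few chunks.
#pragma pack(push, 1)
struct ServiceRecord {
    uint32_t startTime;     // Unix timestamp (seconds)
    uint16_t durationMin;   // 30-40 min typical
    uint8_t  alarmFlags;    // bitmask of alarms seen during the service
};                          // 7 bytes per service
#pragma pack(pop)
```

At 7 bytes per record, even a thousand services a month is about 7 KB, a tiny fraction of a 3 MB plan.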

FYI, my experience is that flaky cellular connectivity (even in areas with good coverage) is the real enemy; the hour-long cloud service outage was the rare anomaly.
