Continuing the discussion from March 20 incident postmortem:
Outstanding question. The solution will vary from product to product, but let's talk about some of the constraints.
Distributed systems are hard. Perfectly reliable message delivery is impossible to achieve without infinite memory and time. However, there are lots of things we can do to make any particular system more likely to be more reliable more of the time.
First, I'd recommend defining precisely what you need to know reliably. In the scenario you describe, it seems (please correct me if I'm wrong) like you don't need to hear every single message, but instead you need to know whether a machine was active in some time window related to billing. Do your customers bill by the day? By the hour? By the minute? The more fine-grained the billing, the harder this problem is, and the more you have to accept some small level of uncertainty in the results. If they bill by the day, you only need to know whether a machine sent any messages in a given day, and you don't need to store every single one of those messages.
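To make that concrete, here is a minimal server-side sketch, assuming daily billing and UTC timestamps. The names `ActivityLog` and `billing_window` are made up for illustration and are not part of any platform API. It remembers only which (device, window) pairs saw at least one message, rather than every message.

```python
from datetime import datetime, timezone


def billing_window(ts: datetime, window_seconds: int = 86400) -> int:
    """Index of the billing window containing ts, counted from the Unix epoch."""
    return int(ts.timestamp()) // window_seconds


class ActivityLog:
    """Hypothetical store: one entry per (device, window), not one per message."""

    def __init__(self, window_seconds: int = 86400):
        self.window_seconds = window_seconds
        self._active = set()

    def record(self, device_id: str, ts: datetime) -> None:
        # Recording a second message in the same window is a no-op.
        self._active.add((device_id, billing_window(ts, self.window_seconds)))

    def was_active(self, device_id: str, ts: datetime) -> bool:
        return (device_id, billing_window(ts, self.window_seconds)) in self._active


log = ActivityLog()
log.record("dev-1", datetime(2024, 3, 20, 1, 0, tzinfo=timezone.utc))
log.record("dev-1", datetime(2024, 3, 20, 23, 0, tzinfo=timezone.utc))
```

With hourly billing you would pass `window_seconds=3600`; storage then grows with the number of active windows, not with the number of messages.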
Your on-device caching—storing messages when you know there's no connection to the cloud—is exactly what you should be doing. Great job.
However, if you want to do the engineering work to handle the (rare) cases where messages get into the cloud but webhooks don't reach your server, then you always need to cache messages on the device until you hear a confirmation from the end system that will receive the message. Luckily, the platform makes one form of this pretty easy: the device can receive the webhook HTTP response given by your server. You can use these responses as a signal that it's OK to delete the locally stored messages.
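The cache-until-confirmed idea can be sketched as below. This is Python for readability only; real firmware would look different. The `send` callback is a stand-in I've assumed: it should return `True` only once the device has seen your server's webhook response for that message.

```python
from collections import OrderedDict
from typing import Callable, List


class Outbox:
    """Hypothetical on-device outbox: a message is deleted only after it is acknowledged."""

    def __init__(self, send: Callable[[str, bytes], bool]):
        self._send = send  # assumed: True means the server's response was received
        self._pending = OrderedDict()

    def publish(self, msg_id: str, payload: bytes) -> None:
        # Store first, then try to send, so a failed send never loses the message.
        self._pending[msg_id] = payload
        self.flush()

    def flush(self) -> None:
        # Iterate over a copy because acknowledged entries are deleted as we go.
        for msg_id, payload in list(self._pending.items()):
            if self._send(msg_id, payload):
                del self._pending[msg_id]

    def pending(self) -> List[str]:
        return list(self._pending)
```

The device would call `flush()` again later (on reconnect, or on a timer) to retry whatever is still pending.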
If for some reason this solution doesn't work for you, you can also code a separate acknowledgement mechanism by having your server hit the API to publish a message or call a device function upon webhook receipt. Just recognize the overhead in time and communication, and gut check whether your solution will scale as your customer base grows.
Yet what happens when your server sends an acknowledging response but the device doesn't receive it? How long should the device wait before retrying? This will be constrained by the memory you have on the device, the chattiness of the device, and how big the messages are.
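A quick back-of-the-envelope calculation helps here. The sketch below estimates how long a device can keep unacknowledged messages before its buffer fills, which puts an upper bound on how long retries can go on. The function name and all of the numbers are invented for illustration.

```python
def max_buffer_seconds(buffer_bytes: int, message_bytes: int, messages_per_hour: int) -> float:
    """Seconds of unacknowledged traffic that fit in the buffer at a steady send rate."""
    capacity = buffer_bytes // message_bytes  # whole messages that fit
    return capacity * 3600 / messages_per_hour


# Illustrative numbers only: a 4 KiB buffer, 64-byte messages, 30 messages per hour.
horizon = max_buffer_seconds(4096, 64, 30)
print(horizon)  # 7680.0 seconds, a little over two hours
```

If that window is shorter than the outages you expect to ride out, something has to give: a bigger buffer, smaller messages, or fewer of them.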
Any retry attempts you make, though, could result in duplicate messages received by your server, depending on the exact nature of the problem. Maybe the webhook was sent fine, but the response event got lost on the way back to the device. You can mitigate this by attaching a message ID to each message, which makes it easier to de-duplicate at the end of the pipeline. This could be as simple as the device ID and the timestamp, but since these constrained devices have limited knowledge of wall clock time, you may want to add a small random component to the ID.
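Here is a rough sketch of that ID scheme and of server-side de-duplication. The ID format and the `Deduplicator` class are my own assumptions, not an existing API. The important detail is that the ID is generated once, when the message is created, and reused unchanged on every retry.

```python
import secrets


def make_message_id(device_id: str, timestamp: int) -> str:
    """Device ID + device timestamp + 2 random bytes (4 hex characters)."""
    return f"{device_id}-{timestamp}-{secrets.token_hex(2)}"


class Deduplicator:
    """Hypothetical server-side filter that accepts each message ID only once."""

    def __init__(self):
        self._seen = set()

    def accept(self, msg_id: str) -> bool:
        if msg_id in self._seen:
            return False  # a retry of something already processed
        self._seen.add(msg_id)
        return True


dedup = Deduplicator()
msg_id = make_message_id("dev-1", 1710892800)
first = dedup.accept(msg_id)   # True: new message
second = dedup.accept(msg_id)  # False: same ID arriving again after a retry
```

In a real service the set of seen IDs would need to live in a database and expire old entries; this in-memory version only shows the shape of the check.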
Then we come back to the problem of how much data this means you actually need to store on the device. Take a fresh look at the problem.
- Can the devices simply communicate less, e.g., in whatever period they send 30 messages today, could they instead send stats (avg, min, max, stddev) for that same window?
- Can the individual messages be more compact?
- Maybe some messages are critical, requiring this whole caching+acknowledgement scheme, whereas some other messages are not and don't need to be remembered. You can save a lot of memory by simply acknowledging to yourself that certain message types can be dropped with minimal adverse effects, allowing more memory for the truly critical messages.
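Two of these ideas can be sketched briefly. `summarize` collapses a window of readings into the stats mentioned above, and `PriorityBuffer` is a hypothetical bounded buffer that sacrifices non-critical messages first. The eviction policy (drop the oldest non-critical entry; if everything is critical, drop the oldest critical one) is just one possible choice, not a recommendation for every product.

```python
import statistics
from typing import Dict, List


def summarize(readings: List[float]) -> Dict[str, float]:
    """Replace a window of raw readings with a single summary message."""
    return {
        "count": len(readings),
        "avg": statistics.fmean(readings),
        "min": min(readings),
        "max": max(readings),
        "stddev": statistics.pstdev(readings),  # population standard deviation
    }


class PriorityBuffer:
    """Hypothetical fixed-size buffer that keeps critical messages preferentially."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = []  # (critical, payload) tuples, oldest first

    def add(self, payload: str, critical: bool) -> bool:
        """Store payload; return False if it was dropped instead."""
        if len(self.items) >= self.capacity:
            for i, (is_critical, _) in enumerate(self.items):
                if not is_critical:
                    del self.items[i]  # evict the oldest non-critical message
                    break
            else:
                # Buffer holds only critical messages.
                if not critical:
                    return False  # drop the incoming non-critical message
                del self.items[0]  # assumed policy: lose the oldest critical one
        self.items.append((critical, payload))
        return True
```

Sending one `summarize` result in place of 30 raw messages cuts both the traffic and the amount you need to cache while waiting for an acknowledgement.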
Who else out there has designed IoT products with high reliability requirements? Let's share some best practices!