This is a little off-topic but I believe still related enough to Particle to make sense here. The tl;dr is: how to store time-series data in a relational database. I don't have a solution, but I suspect others have gone, or will go, down similar lines of reasoning…
I’m trying to get started using postgres for IoT sensor time-series storage and grafana for visualization, plus lots of special queries to extract information. I know about time-series-specific databases, as well as timescaledb (a postgres extension), but would prefer to go with plain postgres if possible for a variety of reasons. I’m writing up my plans in the hope that it spurs some discussion that helps me find a better design, or that helps others reading this in the future.
Requirements
I have relatively modest requirements: ~1000 time-series with a data point every minute, more specifically:
- ~1000 time-series mostly with a data point every minute
- a typical message from a device contains about 10 metrics; put differently, groups of about 10 metrics share timestamps and much of the meta-data
- quad-core ARM w/2GB memory, eMMC storage (i.e. not a big x64 box)
- at least a 10-year lifetime for the DB (not necessarily the HW)
- support lots of experimentation, i.e., deployment of new devices and replacement of existing ones
Option 1
My first option is to use a narrow data table for the time-series with just three columns: timestamp, tags, and value, where tags is a foreign key to a record that holds all the meta-data, e.g., device, sensor, metric, name, scale, unit, location, etc. This provides great flexibility because a device can report whatever metrics it wants and they can be represented individually.
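Something along these lines is what I have in mind for the narrow layout (table and column names are just placeholders, nothing is decided):

```sql
-- Rough sketch of the narrow layout.
CREATE TABLE metric_meta (
    id       serial PRIMARY KEY,
    device   text,
    sensor   text,
    metric   text,      -- e.g. 'temperature'
    name     text,
    scale    real,
    unit     text,
    location text
);

CREATE TABLE raw_data (
    ts    timestamptz NOT NULL,
    tags  integer     NOT NULL REFERENCES metric_meta(id),
    value real        NOT NULL
);
```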
I believe this means that, assuming timestamp, integer, and real columns, a data table row uses 23 (tuple header) + 1 (alignment padding) + 8 + 4 + 4 = 40 bytes. Thus 1000 metrics, one point per minute, for one year accumulate to about 21GB of data (1000 × 525,600 minutes × 40 bytes ≈ 21GB). That’s not impossibly large but not small either, given the “low end” system I’m contemplating.
One thing that is clear is that I need to partition the table by timestamp, probably monthly, so I can deal with old data, whether that’s deleting or archiving.
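For example, re-declaring the narrow table with declarative range partitioning (postgres 10+) and one partition per month; again just a sketch with placeholder names:

```sql
-- Same narrow table, but partitioned by timestamp so a whole month can be
-- detached and dropped (or archived) cheaply.
CREATE TABLE raw_data (
    ts    timestamptz NOT NULL,
    tags  integer     NOT NULL,
    value real        NOT NULL
) PARTITION BY RANGE (ts);

CREATE TABLE raw_data_2024_01 PARTITION OF raw_data
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- To retire a month later on:
-- ALTER TABLE raw_data DETACH PARTITION raw_data_2024_01;
-- DROP TABLE raw_data_2024_01;   -- or dump it first for archiving
```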
I’m pretty sure I will need to produce aggregated tables, e.g. hourly, daily or whatnot. Otherwise I expect that producing the data for a dashboard full of yearly graphs will take forever. What is not clear to me is how to automatically switch between the raw and the various aggregated tables on the grafana end; I wonder whether a view can do that so I don’t have to have mile-long SQL in each graph’s definition.
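One way I could imagine doing the roll-ups is a materialized view refreshed periodically; it doesn’t solve the automatic raw/aggregate switching, it just keeps the per-graph SQL short (names are placeholders):

```sql
-- Hourly roll-up of the narrow table, refreshed e.g. once an hour from cron.
CREATE MATERIALIZED VIEW hourly_data AS
SELECT date_trunc('hour', ts) AS ts,
       tags,
       avg(value) AS avg_value,
       min(value) AS min_value,
       max(value) AS max_value,
       count(*)   AS n_samples
FROM raw_data
GROUP BY 1, 2;

-- REFRESH MATERIALIZED VIEW hourly_data;
```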
Option 2
An alternative design would be to use a wider table for the time-series: e.g., a row could hold the timestamp, a foreign key to the meta-data, and an array of values (all the values reported by a device in one message). Assuming an average of 10 values per row, pg_column_size tells me that I would end up with 97 bytes per row, which sums to about 5GB per year.
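A sketch of what such a wide table could look like (placeholder names again; “values” itself is a reserved word, so I’d call the column something like vals):

```sql
-- One row per device message; vals holds all metrics reported in that message.
CREATE TABLE raw_data_wide (
    ts   timestamptz NOT NULL,
    meta integer     NOT NULL,   -- FK to the per-message meta-data
    vals real[]      NOT NULL
);

-- Per-row size check along the lines of the 97-byte figure above:
-- SELECT avg(pg_column_size(t.*)) FROM raw_data_wide t;
```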
The 4x size reduction over option 1 is certainly attractive. The downside is that the SQL becomes more complex because the meta-data has to specify which array element to extract the value from.
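For illustration, with a hypothetical metric_slot table recording which array element a given metric occupies, a query for a single metric would look roughly like this:

```sql
-- Which slot of vals a metric lives in, per message type.
CREATE TABLE metric_slot (
    meta   integer NOT NULL,   -- same key raw_data_wide.meta refers to
    metric text    NOT NULL,   -- e.g. 'temperature'
    slot   integer NOT NULL    -- 1-based index into vals
);

SELECT d.ts, d.vals[s.slot] AS value
FROM raw_data_wide d
JOIN metric_slot s ON s.meta = d.meta
WHERE s.metric = 'temperature'
ORDER BY d.ts;
```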
Location vs physical device
One of the design decisions I struggle with is the replacement and repurposing of devices. Due to the rather experimental nature of my deployments it happens frequently that devices get replaced by newer ones or get redeployed in different locations. This means that if I’m querying a specific metric (say “temperature in the attic”) it may map to values reported by different HW devices over time. In the wide-table option it may even map to different elements of the values array over time.
One solution to the location/device mapping is to deal with it on ingress, i.e., when data comes in it gets tagged with the “logical attributes” based on information about where a specific HW device is currently deployed. The downside is that this assumes an orderly workflow where the mapping is updated at the right time before/after a device replacement or relocation.
The alternative is to record the data tagged with the device hardware ID and to keep a separate table that holds the device<->location mapping by time period. That way the mapping can be fixed up after changes have been made. However, querying the data becomes more complicated because of the join involved.
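A sketch of what that mapping table and join could look like, reusing the placeholder names from option 1 and a tstzrange for the deployment period:

```sql
-- Which device was at which location during which period.
CREATE TABLE deployment (
    device_id text      NOT NULL,
    location  text      NOT NULL,   -- e.g. 'attic'
    period    tstzrange NOT NULL
);

-- "temperature in the attic", regardless of which HW device reported it:
SELECT r.ts, r.value
FROM raw_data r
JOIN metric_meta m ON m.id = r.tags
JOIN deployment d  ON d.device_id = m.device
                  AND r.ts <@ d.period
WHERE d.location = 'attic'
  AND m.metric   = 'temperature'
ORDER BY r.ts;
```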
Thoughts? I’m sure others have gone down this or a very similar path, but I haven’t found a whole lot of write-ups or experiences…