Postgres schema for IoT sensors

This is a little off-topic, but I believe it’s still related enough to Particle to make sense here. The tl;dr is: how to store time-series data in a relational database. I don’t have a solution, but I suspect others have gone, or will go, down similar reasoning…

I’m trying to get started using Postgres for IoT sensor time-series storage and Grafana for visualization, plus lots of special queries to extract information. I know about time-series-specific databases and such, as well as TimescaleDB (a Postgres extension), but would prefer to go with plain Postgres if possible, for a variety of reasons. I’m writing up my plans in the hope that it spurs some discussion that may help me find a better design, or that it helps others reading this in the future.

Requirements

I have relatively modest requirements, more specifically:

  • ~1000 time-series mostly with a data point every minute
  • a typical message from a device contains about 10 metrics; put differently, groups of about 10 metrics share timestamps and much of the meta-data
  • quad-core ARM w/2GB memory, eMMC storage (i.e. not a big x64 box)
  • at least a 10-year lifetime for the DB (not necessarily the HW)
  • support lots of experimentation, i.e., deployment of new devices and replacement of existing ones

Option 1

My first option is to use a narrow data table for the time-series with just three columns: timestamp, tags, and value, where tags is a foreign key to a record that has all the meta-data, e.g., device, sensor, metric, name, scale, unit, location, etc. This provides great flexibility because a device can report whatever metrics it wants to and they can be represented individually.
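A minimal sketch of what I have in mind, with all table and column names just placeholders:

```sql
-- Narrow layout: one row per metric reading (names are illustrative only)
CREATE TABLE series (
    id       serial PRIMARY KEY,
    device   text NOT NULL,     -- HW device identifier
    sensor   text NOT NULL,
    metric   text NOT NULL,     -- e.g. 'temperature'
    unit     text,
    scale    real,
    location text
);

CREATE TABLE raw_data (
    ts    timestamptz NOT NULL,
    tags  integer NOT NULL REFERENCES series(id),
    value real NOT NULL
);
```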

I believe this means that, assuming timestamp, integer, and real columns, a data table row uses 23+1+8+4+4 = 40 bytes (tuple header, padding, and the three columns). Thus 1000 metrics, one point per minute, for one year (about 525,600 minutes) accumulate to roughly 21GB of data. That’s not impossibly large but not small either, given the “low end” system I’m contemplating.

One thing that is clear is that I need to partition the table by timestamp, probably monthly, so I can deal with old data, whether that’s deleting or archiving.
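I’d expect plain declarative partitioning (Postgres 10+) to cover this; a rough sketch, re-declaring the raw_data table from above as a partitioned table:

```sql
-- Same narrow table as above, but range-partitioned by timestamp
CREATE TABLE raw_data (
    ts    timestamptz NOT NULL,
    tags  integer NOT NULL,    -- FK to series(id)
    value real NOT NULL
) PARTITION BY RANGE (ts);

-- One partition per month (created ahead of time, e.g. by a cron job)
CREATE TABLE raw_data_2024_01 PARTITION OF raw_data
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Deleting or archiving a month is then a cheap operation on one partition:
-- DROP TABLE raw_data_2024_01;
```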

I’m pretty sure I will need to produce aggregated tables, e.g., hourly and daily. Otherwise I expect that producing the data for a dashboard full of yearly graphs will take forever. What is not clear to me is how to automatically switch between the raw and the various aggregated tables on the Grafana end; I wonder whether a view can do that, so I don’t have to have mile-long SQL in each graph’s definition.
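One way I can imagine doing the rollups, assuming the raw_data table sketched above, is a materialized view (or a cron-filled summary table) per resolution; whether Grafana can be made to pick the right one automatically is exactly the open question:

```sql
-- Hourly rollup as a materialized view (could equally be a plain table
-- filled by a periodic job)
CREATE MATERIALIZED VIEW data_hourly AS
SELECT date_trunc('hour', ts) AS ts,
       tags,
       avg(value) AS avg_value,
       min(value) AS min_value,
       max(value) AS max_value,
       count(*)   AS n
FROM raw_data
GROUP BY 1, 2;

-- Refresh periodically, e.g. from cron:
-- REFRESH MATERIALIZED VIEW data_hourly;
```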

Option 2

An alternative design would be to use a wider table for the time-series, e.g., a row could hold a timestamp, a foreign key to the meta-data, and an array of values (all the values reported by a device in one message). Assuming an average of 10 values per row, pg_column_size tells me that I would end up with 97 bytes per row. With 100 rows per minute instead of 1000, that sums to about 5GB per year.
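Roughly what I mean, again with placeholder names:

```sql
-- Wide layout: one row per device message, values packed into an array
CREATE TABLE msg_meta (
    id      serial PRIMARY KEY,
    device  text NOT NULL,
    metrics text[] NOT NULL,   -- which metric sits at which array position
    units   text[]
);

CREATE TABLE msg_data (
    ts   timestamptz NOT NULL,
    meta integer NOT NULL REFERENCES msg_meta(id),
    vals real[] NOT NULL       -- ~10 values per message
);
```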

The 4x size reduction over option 1 is certainly attractive. The downside is that the SQL becomes more complex because the meta-data has to specify which array element to extract the value from.
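For example, to get one metric back out, the query first has to look up the metric’s array position in the meta-data, something along these lines (device name and metric are made up):

```sql
-- Pull a single metric out of the value array: find its (1-based) position
-- in the meta-data, then index into vals
SELECT d.ts,
       d.vals[array_position(m.metrics, 'temperature')] AS temperature
FROM   msg_data d
JOIN   msg_meta m ON m.id = d.meta
WHERE  m.device = 'dev42'
  AND  d.ts >= now() - interval '1 day';
```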

Location vs physical device

One of the design decisions I struggle with is the replacement and repurposing of devices. Due to the rather experimental nature of my deployments, it happens frequently that devices get replaced by newer ones or get redeployed in different locations. This means that if I’m querying a specific metric (say “temperature in the attic”) it may map to values reported by different HW devices over time. In the wide-table option it may even map to different elements of the values array over time.

One solution to the location/device mapping is to deal with it on ingress, i.e., when data comes in it gets tagged with the “logical attributes” based on information about where a specific HW device is deployed. The downside is that this assumes an orderly workflow where the mapping is updated at the right time before/after device replacement or relocation.

The alternative is to record the data tagged with the device hardware ID and to have a separate table that keeps the device<->location mapping by time period. This way the mapping can be fixed up after changes have been made. However, querying the data becomes more complicated because of the join involved.
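A sketch of what such a mapping table and the resulting query could look like, using a tstzrange so the valid period can be corrected after the fact (building on the narrow-table sketch from option 1):

```sql
-- Device <-> location mapping, valid for a time period that can be
-- fixed up after a device has been replaced or moved
CREATE TABLE deployment (
    device   text NOT NULL,       -- HW device ID as recorded on ingress
    location text NOT NULL,       -- logical name, e.g. 'attic'
    metric   text NOT NULL,       -- e.g. 'temperature'
    during   tstzrange NOT NULL   -- when this mapping was in effect
);

-- "temperature in the attic", regardless of which device reported it
SELECT r.ts, r.value
FROM   raw_data r
JOIN   series s     ON s.id = r.tags
JOIN   deployment d ON d.device = s.device
                   AND d.metric = s.metric
                   AND r.ts <@ d.during
WHERE  d.location = 'attic'
  AND  d.metric   = 'temperature';
```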

Thoughts? I’m sure others have gone down this or a very similar path, but I haven’t found a whole lot of write-ups or experiences…

I was in charge of developing a rotating-machine condition monitoring system for industrial applications. Data was collected by fixed multi-channel devices as well as mobile 1-, 2-, or 4-channel devices. A typical database might have 1000 to 5000 measurement locations. At each location we would take a waveform of at least 1024 samples, but in many cases 2 or 4 times longer than that. Sample frequencies were often in the range of 5-10 kHz.

Each measurement was described by:

  • dts (always GMT, no daylight saving correction)
  • Location id
  • Device id
  • Operator id (for mobile measurements)
  • Measurement angle (vertical, horizontal)
  • Environment temp (when available)
  • Machine id

There were some more esoteric descriptors but the above are the most important ones.

The system’s mission was to find ongoing machine failures as early as possible. Sometimes this is a needle-in-a-haystack business. Oftentimes we were tracking the wrong thing; trends sometimes followed the monitoring device, sometimes the location, sometimes the operator (!). I myself was baffled for a while when monitoring a gas turbine which displayed a most curious signal in one of its bearings only for a portion of the day, only to find out that it was tracking the sun (it got hot on one side).

So my recommendation is to not be cheap and cheerful with your DB design, but to include everything that might help you understand your data later on. Separate your diagnostic needs from your DB design needs. If you know what you need to do your diagnostics, then put your DB design together. There are many ways to organize your data and optimize it for storage, speed, or both.

Hope this helps…
