Best practices to manage high volumes of data

dale · April 9, 2018, 4:26pm

I’m looking for discussions on best practices for managing high volumes of time series sensor data. To be more specific;

I have one proof-of-concept Photon deployed with a bunch of temperature sensors that report every 3 minutes.
This generates approx 2.5M data points per year per device.
I need native resolution data for anomaly detection and alerts. I plan to keep this data ~4 weeks to support post anomaly/event diagnostics.
I am considering summarizing data hourly for data that is older than 4 weeks. This would include; count, maximum, minimum, sum, and sum of squares.

There are lots of ways to approach this and they all have some trade offs.If anyone has worked through the trade offs associated with these decisions I welcome your input.
Thanks

rickkas7 · April 9, 2018, 7:11pm

I think I have around 16 temperature sensors reporting every 3 minutes. The data is stored a local MySQL database. I do roll the data up to hourly min/mean/max buckets and discard the minute samples after 7 days, but really for any real sql or nosql database, it’s not that much data even keeping every sample for years.

dale · April 10, 2018, 3:00pm

Do you have a local server running your MySQL db? I was considering setting up a Google Cloud SQL db but haven’t done so yet. How do you get the data into your db? webhook?

Another questions I’ve been wrestling with;
Where should I put the anomaly detection capability? in the photon or in the cloud/db? or more likely some combination of the two.
If there is anomaly detection is in the photon how can I generate alerts to end users? Is the only option;

publish to particle cloud plus an integration plus …???

Thanks in advance for everyone’s help

rickkas7 · April 10, 2018, 4:10pm

My situation is a bit unusual, so I’m not necessarily recommending it as best-practices, but one option available.

I have a home server running Apache Tomcat (Java) only because I’ve done a lot of Java programming. It could be node.js, Python, whatever you prefer. I also have a local MySQL database, though if I were using node.js I’d use a NoSQL database like CouchDB instead.

All of the storing, filtering, bucketing, and looking for anomalies is done on the server. It does Apple push notifications and SMS, as it’s much easier to do this from a server than on-device.

You can easily do the device-to-server communication using Particle publish, and that’s how I initially did it. I used the Server Sent Events stream to get events in real-time from the Particle cloud. it works fine and it’s easy.

I eventually switched to doing direct TCP, because it allowed me to do thing like handle data larger and faster than one 255 byte message per second per device. It also handles rerouting system logger messages into a database. And I wanted it to continue to function if my Internet was down. I still use the Particle cloud for OTA software updates.

All of the local server stuff could easily be done on a cloud provider like Google or AWS, and that’s certainly another option.

Rftop · April 10, 2018, 5:06pm

Take a look at ThingSpeak.com as another option.
You can use the Library, or a webhook & Publish.

ThingSpeak incorporates MATLAB for data analysis.
At only 2.5 million data points per sensor, there wouldn’t be any need to delete data.

Also, I’m not recommending this as a Best Practice, it’s just the one I happen to use
There are better options if you are also wanting swanky looking custom dashboards.

davidgs · April 10, 2018, 5:50pm

Check out InfluxDB. There’s an integration between InfluxDB and Particle as well, so you can just connect your Photons to Telegraf (the InfluxDB data ingestion engine) and save your time series data in a native Time Series Database.

InfluxDB is highly efficient at data ingestion and storage, so 2.5M data points/sensor can easily be handled by something as small as a Raspberry Pi.

dale · April 10, 2018, 10:15pm

I currently publish to the particle cloud and have a webhook that puts the data into a google spreadsheet. I have a google apps script that I wrote to parse the comma delimited data into columns. I use Tableau for analysis and visualizations. This has been fine for my proof-of-concept but I want to change the process before deploying more devices.

tgmcnaughton · September 2, 2019, 8:33pm

This sounds very interesting. I'd like to do this with a Raspberry Pi to collect time based data from the 10 or so Photons I have in my house doing temperature and power line monitoring. I'm more interested in doing larger and faster data sends than off-net functioning.

I've read your TCPServer example using SYSTEM_MODE(MANUAL) which looks very interesting but alerts me to the fact that things can get pretty complicated when you are start communicating with two servers simultaneously.

Do you have some example code which does show best-practice approaches for logging data locally via TCP - to - MySQL local database? That would be very useful and I think of pretty wide interest.
Thanks,
Tim McNaughton

ScruffR · September 2, 2019, 8:48pm

You may want to look at this repo

Which *two" servers are you thinking of?
Why would that be complicated? It's a matter of organisation.

Topic		Replies	Views
Logging data while offline, delayed cloud dump Getting Started	5	2360	January 27, 2016
Best framework for data logging Cloud	4	7780	March 17, 2017
TempoDB storage for sensor data General	1	1286	February 5, 2014
Limitation of logging Cloud	22	4945	January 14, 2016
Logging Data Every 10ms - Best way to retrieve it over the web? Getting Started	4	2060	July 17, 2016

Best practices to manage high volumes of data

Related topics