How big is your data, really?

I was inspired to write this post by multiple discussions in various data analytics Facebook groups around the world, as well as by conversations with our clients. There is a repeating pattern in them: the question “How big is your data?”

Usually the answer is something like: “We have over 985 GB a month.” There is a time range and some value in bytes – but what does it really mean in terms of the information in the data? What, where, how, and in which format is it stored?

To put the question in context, we need to talk mainly about data formats and compression. The basic data formats we use almost daily (even without knowing it) are CSV, JSON, XML and binary. The same goes for compression – almost everybody knows ZIP and RAR; developers also know GZIP and maybe LZO. But even databases like MySQL, PostgreSQL, Redshift or Snowflake use compression, and it plays a huge role in processing the data. An even bigger one than you may think.

I have given you a few hints that measuring data volume is not that easy, and you probably have a hunch it might be a bit complicated. But what if I told you that exactly the same table, with the exact same structure and number of rows, can range in size from 241 KB to 107.3 MB – one being 445x bigger than the other? The same data, just in a different format, with compression. What is happening here?!

Data formats and their importance

To demonstrate the importance of data format and compression, I took the very famous Iris dataset (150 rows in its standard form) and inflated it a bit to 500,000 rows.
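
The inflation was done simply by copying the existing rows over and over – a detail that will matter later when we get to compression. A minimal sketch of that, using only the Python standard library (file names are illustrative):

```python
import csv
import itertools

def inflate_csv(src_path: str, dst_path: str, target_rows: int) -> None:
    """Grow a small CSV to target_rows data rows by cycling its rows."""
    with open(src_path, newline="") as src:
        header, *data = list(csv.reader(src))
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        # cycle() repeats the original data rows until islice() has taken enough
        writer.writerows(itertools.islice(itertools.cycle(data), target_rows))
```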

The original format of this dataset is a simple table, so I stored it as CSV. It looks like this:

sepal_width_w,petal_width,sepal_length,petal_length,sepal_width,sepal_length_w,petal_length_w,species
0.625,0.041666667,5.1,1.4,3.5,0.222222222,0.06779661,setosa
0.416666667,0.041666667,4.9,1.4,3,0.166666667,0.06779661,setosa
0.5,0.041666667,4.7,1.3,3.2,0.111111111,0.050847458,setosa

… You get the idea. Standard CSV. With 500,000 rows it takes up 34.2 MB. Let’s say we want this dataset represented as JSON – a valid requirement if we use a NoSQL database like ElasticSearch. I used a simple piece of Python code to turn the CSV into JSON; the result looks like this:

{"sepal_width_w": "sepal_width_w", "petal_width": "petal_width", "sepal_length": "sepal_length", "petal_length": "petal_length", "sepal_width": "sepal_width", "sepal_length_w": "sepal_length_w", "petal_length_w": "petal_length_w", "species": "species"}
{"sepal_width_w": "0.625", "petal_width": "0.041666667", "sepal_length": "5.1", "petal_length": "1.4", "sepal_width": "3.5", "sepal_length_w": "0.222222222", "petal_length_w": "0.06779661", "species": "setosa"}
{"sepal_width_w": "0.416666667", "petal_width": "0.041666667", "sepal_length": "4.9", "petal_length": "1.4", "sepal_width": "3", "sepal_length_w": "0.166666667", "petal_length_w": "0.06779661", "species": "setosa"}
{"sepal_width_w": "0.5", "petal_width": "0.041666667", "sepal_length": "4.7", "petal_length": "1.3", "sepal_width": "3.2", "sepal_length_w": "0.111111111", "petal_length_w": "0.050847458", "species": "setosa"}
{"sepal_width_w": "0.458333333", "petal_width": "0.041666667", "sepal_length": "4.6", "petal_length": "1.5", "sepal_width": "3.1", "sepal_length_w": "0.083333333", "petal_length_w": "0.084745763", "species": "setosa"}

This structure is actually not one pure JSON object like you would find in an AJAX response or the response to a REST API call. These are newline-delimited JSON objects – the kind that, for example, Kafka or AWS Firehose use to store the packets of data they process. With a clean conversion of our Iris dataset, keeping the same structure and number of rows, we went from 34.2 MB to 107.3 MB – more than 3x the volume of data. This is caused mainly by repeating the keys – the attribute names – over and over again, for each row.
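
The conversion itself takes only a few lines of Python. Here is a minimal sketch: it emits one JSON object per data row and keeps every value as a string, just like the sample above (file names are illustrative):

```python
import csv
import json

def csv_to_ndjson(csv_path: str, ndjson_path: str) -> None:
    """Convert a CSV file to newline-delimited JSON, one object per row."""
    with open(csv_path, newline="") as src, open(ndjson_path, "w") as dst:
        for row in csv.DictReader(src):
            # the keys are written out again for every single row --
            # this is exactly where the 3x blow-up in volume comes from
            dst.write(json.dumps(row) + "\n")
```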

Data compression to save your hard disk

Now we know that the format we use to store our data really matters. What about compression? I took both the CSV and JSON versions of the dataset and applied the GZIP compression algorithm. The CSV went from 34.2 MB to 241 KB (141x less) and the JSON from 107.3 MB to 1.6 MB (67x less). To be fair, in real life you won’t get such a great compression ratio. When I artificially inflated the Iris dataset, I simply copied the existing rows – so the data is extremely repetitive, which makes the compression algorithm super-effective. Real-life data has more variation. The usual ratio for GZIP compression is somewhere between 4x and 6x: compress a real-life 10 MB file and you’ll get somewhere between 1.7 MB and 2.5 MB.
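
You can see GZIP’s sensitivity to repetition yourself with Python’s standard gzip module. A quick sketch (the sample data is made up for illustration):

```python
import gzip
import os

def gzip_ratio(data: bytes) -> float:
    """How many times smaller the data gets after GZIP compression."""
    return len(data) / len(gzip.compress(data))

# Highly repetitive data -- copied rows, like the inflated Iris dataset
repetitive = b"5.1,3.5,1.4,0.2,setosa\n" * 100_000

# Data with maximum variation barely compresses at all
varied = os.urandom(len(repetitive))

print(gzip_ratio(repetitive))  # hundreds of times smaller
print(gzip_ratio(varied))      # roughly 1, i.e. no gain
```

The gap between the two numbers is exactly why my artificial 141x ratio says little about what your real data will do.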

Databases and special cases of storages

Last but not least, we have databases and their storage. Even though it may not look like it, each database has its own specifics in how it stores data. Let’s use the databases we at Keboola use the most as an example: MySQL, Redshift and Snowflake.

MySQL stores the 500,000-row Iris dataset in a single table of 43.6 MB, Redshift in 26 MB and Snowflake in 1.88 MB. This is caused by the different compression algorithms they use and by the general principles of how they store data. The most important thing is to keep this in mind while looking at the pricing of different solutions: Redshift can be cheaper than Snowflake for 1 TB of data, but Snowflake can store much more _real data_ in 1 TB of space – just because of compression!

How to wrap it all up? Next time someone asks you “How big is your data?” – or you ask someone the same – always demand and offer more details. For example: “We have a client with 1.5 TB of data a day, stored as newline-delimited JSON packets, compressed with LZO and stored in S3.” If you want to impress someone, you could translate that as either “9 TB of data a day stored as newline-delimited JSON packets in S3, without compression” or “250 GB of data a day stored as CSV, compressed with GZIP”. All three claims are true, and the ratios come from a real-life example. Strange, isn’t it?

So, how big is your data, really?