Data Lake - Comparing Performance of Known Big Data Formats

Original article was published by Manoj Kukreja on Artificial Intelligence on Medium


Performance Comparison of well-known Big Data Formats — CSV, JSON, Avro, Parquet & ORC

Photo by Mika Baumeister on Unsplash

For the past several years I have been using all kinds of data formats in Big Data projects. During this time I have strongly favored some formats over others — my failures have taught me a few lessons. During my lectures I keep stressing the importance of using the correct data format for the correct purpose — it makes a world of difference.

All this time I have wondered whether I am delivering the right knowledge to my customers and students. Can I support my claims with data? To find out, I decided to do this performance comparison.

Before I start the comparison, let me briefly describe the various data formats that were considered.

Row Formats

CSV, JSON and Avro (binary)

Columnar Formats

Parquet and ORC (both binary)

I am sure you are wondering what the difference between row and columnar formats is. How data is stored on disk makes all the difference. While a row format is stored as Row 1 > Row 2 > Row 3, a columnar format is stored to disk as Col 1 > Col 2 > Col 3.

All good, but how does that matter… data is data, right? Well, not quite. What if I ask you to average a Salary column? In a row format, each row needs to be read from disk, the salary column extracted, and then averaged. This consumes a lot of IO because a large section of the disk needs to be read.

In a columnar format, a contiguous block of the Salary column is read from disk in one shot and then averaged. Less IO means higher performance. Mathematical operations like averaging and summation comprise a large portion of data analytics processing — naturally, the columnar format is favored in OLAP.
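The Salary example above can be sketched in a few lines of plain Python. The data and names here are made up for illustration; the point is only the access pattern, not real disk IO:

```python
# Toy illustration of the row vs. columnar trade-off described above.
# Row layout: one record per employee; columnar layout: one list per field.
rows = [
    ("alice", "sales", 70000),
    ("bob", "eng", 95000),
    ("carol", "eng", 88000),
]
columns = {
    "name": ["alice", "bob", "carol"],
    "dept": ["sales", "eng", "eng"],
    "salary": [70000, 95000, 88000],
}

# Row format: every record must be visited just to pull out one field.
avg_from_rows = sum(salary for _, _, salary in rows) / len(rows)

# Columnar format: the salary column is already one contiguous block.
avg_from_columns = sum(columns["salary"]) / len(columns["salary"])

assert avg_from_rows == avg_from_columns  # same answer, very different IO pattern
```

On disk, the row version forces a scan over every record to touch one field, while the columnar version reads a single contiguous block — exactly the difference described above.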

Image by Author

If the row format is that bad, then why is it still used? Well, the row format is a clear winner if you are performing OLTP. In web/mobile applications you care less about mathematical processing over a range of rows, but you do want fast access to a single row. That is where the row format is an automatic choice. For this reason, all well-known relational databases store data in this fashion.

Performance Comparison — The Platform

Storage Layer: Hadoop 3.2.1 — Hadoop File System (HDFS)

Compute Layer: Spark 3 — 4 workers — 32 cores — 58 GB

Data Sample Size: 500 million rows stored in HDFS

Sample Data:

+-----+---------+----------+
|stock|    price| timestamp|
+-----+---------+----------+
| AMZN| 121.1792|1600952178|
| SBUX|132.03702|1600952178|
| MCD| 407.2828|1600952178|
| MCD| 43.12032|1600952178|
| MSFT|115.79154|1600952178|
+-----+---------+----------+
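The actual 500-million-row dataset was generated for the benchmark (the code is linked at the end of the article). A minimal generator producing rows of the same shape might look like this — the ticker names and the timestamp come from the sample above, while the price range and function name are my own assumptions for illustration:

```python
import random

# Tickers and the timestamp are taken from the sample data above;
# the price range is an illustrative assumption, not the article's generator.
TICKERS = ["AMZN", "SBUX", "MCD", "MSFT"]

def make_rows(n, seed=42):
    """Generate n (stock, price, timestamp) tuples shaped like the sample."""
    rng = random.Random(seed)
    return [
        (rng.choice(TICKERS), round(rng.uniform(40.0, 450.0), 5), 1600952178)
        for _ in range(n)
    ]

sample = make_rows(5)
```

A fixed seed keeps the runs reproducible, which matters when comparing formats: every format should be written from the identical set of rows.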

Processing Operations

  1. Sort the 500 million rows by [stock, price, timestamp] and save the results to HDFS
  2. Group the data by stock, find max(price), and save the results to HDFS
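The article's runs executed these two operations with Spark on the cluster described above; the Spark equivalents are noted in the comments, and the HDFS writes are omitted. In plain Python the logic of both operations looks like this:

```python
# The five sample rows from above, as (stock, price, timestamp) tuples.
rows = [
    ("AMZN", 121.1792, 1600952178),
    ("SBUX", 132.03702, 1600952178),
    ("MCD", 407.2828, 1600952178),
    ("MCD", 43.12032, 1600952178),
    ("MSFT", 115.79154, 1600952178),
]

# Operation 1: sort by (stock, price, timestamp).
# In Spark this would be df.orderBy("stock", "price", "timestamp").
sorted_rows = sorted(rows)  # tuples already compare in that field order

# Operation 2: group by stock and take max(price).
# In Spark: df.groupBy("stock").agg(F.max("price")).
max_price = {}
for stock, price, _ in rows:
    max_price[stock] = max(max_price.get(stock, float("-inf")), price)
```

Both operations reward columnar storage: the sort keys and the price column can be read without dragging the remaining fields off disk.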

Performance Comparison Results

Following are the file sizes across the different formats. Notice the vast variation in file sizes for 500 million rows: JSON has the largest footprint, whereas Parquet has the smallest.

Image by Author

JSON has the largest footprint because it stores the schema attributes (the field names) with every row. For this reason, I rarely store the JSON or CSV formats in the curated and transformed zones of a data lake. JSON and CSV are still the most widely used data interchange formats, so I make it a point to convert them to binary formats after ingestion from the source. My data lake usually looks like:

Raw Zone: JSON or CSV

Curated and Transformed Zone: Avro, ORC or Parquet (or a combination thereof)
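The per-row footprint gap between JSON and CSV is easy to see in miniature. Using one record shaped like the sample data (field names and values taken from the table above):

```python
import csv
import io
import json

record = {"stock": "AMZN", "price": 121.1792, "timestamp": 1600952178}

# JSON repeats every field name in every single row...
json_line = json.dumps(record)

# ...while CSV writes only the values (a header line appears once per file).
buf = io.StringIO()
csv.writer(buf).writerow(record.values())
csv_line = buf.getvalue().rstrip("\r\n")

print(len(json_line), len(csv_line))  # JSON line is noticeably longer
```

Multiply that per-row overhead by 500 million rows and the size chart above is no surprise; the binary formats go further still by encoding the schema once and compressing the values.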

Following are the results of the sort operation across the different formats. As expected, JSON performed the slowest while Parquet and ORC were the fastest. I must admit I was a little surprised to see Avro perform quite well.

Image by Author

Following are the results of the grouping operation and finding the maximum value across the different formats. Note that this operation is purely mathematical in nature and closely resembles a data analytics use case. I am happy to confirm that all binary formats (Avro, Parquet, ORC) performed reasonably well.

Parquet and ORC performed almost the same; however, ORC had a slightly smaller file footprint. I am a little surprised to see that the adoption of ORC has declined so much in recent years — I hardly encounter any projects using the ORC format any more.

Image by Author

To sum up, I can now safely infer that the Parquet and ORC formats are indeed well suited for data analytics operations. Another very important takeaway from this article: immediately discontinue using CSV and JSON in your curated and transformed zones. That may seem obvious, but you would be surprised how many times I run into deployments that do not follow this rule.

All the code for this comparison can be found at the link below:

I hope this article was helpful. The topic of Big Data formats is covered in greater detail as part of the Big Data Hadoop, Spark & Kafka course offered by Datafence Cloud Academy. The course is taught online by me on weekends.