Original article was published by Ajay Gupta on Artificial Intelligence on Medium
Guide to Spark Partitioning
Partitioning is one of the basic building blocks on which the Apache Spark framework is built. Just by setting the right partitioning across the various stages of a job, many Spark programs can be optimized right away.
I first encountered Apache Spark around four years ago, and since then I have been architecting Spark applications that execute complex data processing flows on multiple, massive data sets.
Over these years of architecting numerous Spark jobs and working with a large team of Spark developers, I have noticed that a comprehensive understanding of the various aspects of Spark partitioning is generally lacking among Spark users. Because of this, they miss out on the substantial optimization opportunities that exist for building reliable, efficient, and scalable Spark jobs that process large data sets.
Therefore, based on our experience, knowledge, and research, my colleague Naushad and I decided to write a book focusing on just this one important aspect of Apache Spark: partitioning. The book's title, "Guide to Spark Partitioning", reflects this single objective.
Chapter 1 of the book introduces the concept of partitioning and its importance. Chapter 2 explains in depth the partitioning rules that apply when reading ingested data files. Chapter 3 does the same for the Spark transformations that affect the partitioning structure. Chapter 4 focuses on the explicit partitioning APIs, including the various repartition APIs and the coalesce API. Chapter 5, the last chapter, details how partitions are written to a permanent storage medium.
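To give a flavor of the Chapter 4 material, here is a minimal sketch of the usual distinction between coalescing and repartitioning. It is a toy model in plain Python, not Spark code and not Spark's actual implementation: the functions `coalesce` and `repartition` below are hypothetical stand-ins that only borrow the names of the Spark APIs, and partitions are modeled as simple lists of rows. The assumption being illustrated is that coalescing merges existing partitions without redistributing individual rows, while repartitioning reshuffles every row.

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model: merge neighbouring partitions into n groups.

    Rows stay together with the rest of their original partition,
    so any skew in the input carries over to the output.
    """
    if n >= len(partitions):
        # This toy version can only reduce the partition count.
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Map contiguous input partitions onto the same output partition.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model: redistribute every row round-robin (a full shuffle)."""
    out: List[Partition] = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


# Four partitions with heavily skewed sizes (hypothetical sample data).
skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]

coalesced = coalesce(skewed, 2)
rebalanced = repartition(skewed, 2)

print([len(p) for p in coalesced])   # sizes after merging
print([len(p) for p in rebalanced])  # sizes after a full reshuffle
```

In this toy model the merged partitions keep the skew of the input (sizes 7 and 3), while the reshuffled ones come out balanced (sizes 5 and 5). That trade-off between cheaper merging and even distribution is the kind of decision the explicit partitioning APIs ask you to make.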
Further, the book focuses primarily on the RDD and Dataset representations of data, since the earlier DataFrame representation has been merged with the Dataset in recent versions of Spark. To aid in understanding the concepts, we have provided many examples in every chapter.
The book is available on Kindle.
Hopefully, the book will solidify your understanding of the various aspects of Spark partitioning. Armed with the knowledge gained from it, you should be able to set the right partitioning in your Spark jobs for large data sets.