Apache Spark 3.0 Review–What the Spark is all about

Source: Deep Learning on Medium

Version 3.0 of spark is a major release and introduces major and important features:

Language support

Spark 3.0 will move to Python3 and Scala version is upgraded to version 2.12. In addition it will fully support JDK 11. Python 2.x is heavily deprecated .

Adaptive execution of Spark SQL

This feature helps in where statistics on the data source do not exists or are in accurate. So far Spark had some optimizations which could be set only in the planning phase and according to the statistics of data (e.g. the ones captured by the ANALYZE command when deciding weather to perform a Broadcast-hash join over an expensive Sort-merge join. In cases in which these statistics are missing or not accurate BHJ might not kick in. with adaptive execution in Spark 3.0 spark can examine that data at runtime once he had loaded it and opt-in to BHJ at runtime even it could not detect it on the planning phase.

Dynamic Partition Pruning (DPP)

Spark 3.0 introduces Dynamic Partition Pruning which is a major performance improvement for SQL analytics workloads that in term can make integration with BI tools much better. The idea behind DPP is to apply the filter set on the dimension table — mostly small and used in a broadcast hash join — directly on the fact table so it could skip scanning unneeded partitions.

from Databricks session in Spark AI

DPP’s Optimisation is implemented both on the logical plan optimization and the physical planning. It showed speedup in many TCPDS queries and works well with star-schemas without the need to denormalize the tables.

Running TPCDS Query #98 with DPP. From Databricks session on DPP

Databricks’s session on DPP.

Enhanced Support for Deep Learning

Deep Learning on Spark was already possible so far. However Spark MLlib was not focused on Deep Learning and did not offer deep learning algorithms and in particular didn’t offer much for image processing. Existing projects like TensorFlowOnSpark, MMLSpark and some others made it possible somehow but presented significant challenges. For example — given that Spark resiliency is very good and knows to recompute tasks over partitions on failure — in deep learning for if you loose a partition in the middle of a training job and you recompute this individual partition Tensorflow or others will not work well. It requires to train on all partitions in the same time.

Spark 3.0 handles the above challenges much better. In addition it adds support for different GPUs like Nvidia, AMD, Intel and can use multiple types at the same time. In addition Vectorized UDFs can use GPUs for acceleration. For Kubernetes it offers GPU support in a flexible manner when running on Kubernetes.

Better Kubernetes Integration

Spark support for Kubernetes is relatively not matured in the 2.x version and difficult to use in production and performance was lacking in compare with the YARN cluster manager. Spark 3.0 introduces new shuffle service for Spark on Kubernetes that will allow dynamic scale up and down (more precisely out and in)

Spark 3.0 also supports GPU support with pod level isolation for executors which makes scheduling more flexible on a cluster with GPUs. Spark authentication on Kubernetes also has some goodies.

Graph features

Graph processing can be used in data science for several application including recommendation engine and fraud detections.

Spark 3.0 introduces a whole new module named SparkGraph with major features for Graph processing. These features include the popular Cypher query language developed by Neo4J which is a SQL like for graphs, the Property Graph Model processed by this language and Graph algorithms. This integration is something Neo4J worked on for several years and it’s named Morpheus (formerly named Cypher for Spark) but as said will be named SparkGraph inside the spark components.

Morpheus extends the Cypher language with multiple-graph feature, Graph catalog and Property graph data sources for integration with Neo4j engine, RDMS and more. It allows usage of the cypher language on graphs in a similar way SparkSQL operates over tabular data and it will have its own catalyst optimiser. In addition it would be possible to interoperate between SparkSQL and SparkGraph which can be very useful.

For a deep dive: Check out this session

ACID Transactions with Delta Lake

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark 3.0 and through easy implementation and upgrading of existing Spark applications, it brings reliability to Data Lakes. It announced to join the Linux foundation to grow its community.

It solves issues presented when data in the data lake is modified simultaneously by multiple modifiers and allows you to focus on logic and not worry from inconsistencies. Its very valuable for streaming applications but also very relevant for batch scenarios. Over 3500 organizations already use Delta Lake.

Spark 3.0 supports data lake out of the box and can be used just as it is used for example with parquet. sometimes replacing the read class to the deltalake’s one is enough to start using Delta Lake.

Quick Start: https://docs.databricks.com/delta/quick-start.html

Growing integration with Apache Arrow data format

Apache Arrow is an in-memory columnar data structure for efficient analytical operations. Its has benefits like being cross-language platform, performing zero-copy streaming messaging and interprocess communications without serialization costs which often occur with other systems.

In Spark 3.0 Usage in Apache Arrow takes bigger place and its used to improve the interchange between the Java and Python VMs. This usage enables new features like Arrow accelerated UDFs, TensorFlow being able to read error data in CUDA and more features in the Deep Learning section in Spark 3.0

Binary files data source

Spark 3.0 supports binary file data source. You can use it like this:

val df = spark.read.format(“binaryFile”)

The above will read binary files and converts each one to a single row that contains the raw content and metadata of the file. The DataFrame will contain the following columns and possibly partition columns:

  • path: StringType
  • modificationTime: TimestampType
  • length: LongType
  • content: BinaryType

writing back a binary DataFrame/RDD is currently not supported.

DataSource V2 Improvements

Few improvements for the DataSource API are included with Spark 3.0:

  • Pluggable catalog integration
  • Improved predicate push down for faster queries via reduced data loading

In addition there are many JIRAs to solve many issues existing with the current DataSource API.

YARN Features

  • Spark 3.0 can auto discover GPUs on a YARN cluster and schedule tasks specifically on nodes with GPUs.

More Features:

The above features are somehow the major and more influencing one but Spark 3.0 ships more enhancements and features with it.

Mostly it is clear that Spark 3.0 is a big step up for data scientists and enables them to run Deep Learning with distributed training and serving.