A Tale of Three Apache Spark APIs: RDD, DataFrame, and Dataset

Jules S. Damji

Translator: with a single step

Posted on September 29, 2017.

This article is translated from "A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets" with the permission of the original author,
Jules S. Damji.

Developers are happiest with APIs that make them productive and that are easy to use, intuitive, and expressive. One important reason
Apache Spark appeals to developers is its easy-to-use APIs for operating on large datasets in several languages: Scala, Java, Python, and R.

In this article, I explore the three sets of APIs available in Apache Spark 2.2 and beyond (RDD, DataFrame, and Dataset): when and why to choose each one, an
outline of their performance and optimization benefits, and the scenarios in which you should use DataFrame and Dataset instead of RDD. I focus mostly on
DataFrame and Dataset, because these two APIs were unified in Apache Spark 2.0.

The motivation behind this unification is to simplify Spark by reducing the number of concepts you need to
learn and by offering ways to process structured data. Through structure, Spark can offer higher-level
abstractions and APIs in the form of domain-specific language constructs.

Resilient Distributed Dataset (RDD)

From the very beginning, RDD was the main user-facing API in Spark. At its core, an RDD is an immutable distributed collection of the elements of your data,
partitioned across the nodes of a cluster, that can be operated on in parallel through a low-level API offering transformations and actions.
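
As a rough illustration of transformations and actions on an RDD, here is a minimal sketch. It assumes a local SparkSession; the object name RddBasics, the input range, and the choice of four partitions are made up for this example and do not come from the original article.

```scala
import org.apache.spark.sql.SparkSession

object RddBasics {
  // Sums the squares of the even numbers in 1..10 using the low-level RDD API.
  def sumOfEvenSquares(spark: SparkSession): Int = {
    // parallelize() distributes a local collection across partitions
    // (4 partitions is an arbitrary choice for this sketch)
    val numbers = spark.sparkContext.parallelize(1 to 10, numSlices = 4)

    // Transformations are lazy: each returns a new immutable RDD, nothing runs yet
    val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)

    // An action triggers the parallel computation and returns a value to the driver
    evenSquares.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rdd-basics").getOrCreate()
    println(sumOfEvenSquares(spark)) // 4 + 16 + 36 + 64 + 100 = 220
    spark.stop()
  }
}
```

Note that filter() and map() only describe the computation; the work happens when reduce() is called.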

When should you use RDD?

The common scenarios and use cases for RDD are:

You want low-level transformations and actions, and fine-grained control over your dataset;
Your data is unstructured, such as media streams or streams of text;
You want to manipulate your data with functional programming constructs rather than domain-specific expressions;
You do not care about imposing a schema, such as a columnar format, when processing data or accessing data attributes by name or column;
You can forgo some of the optimization and performance benefits that DataFrame and Dataset provide for structured and semi-structured
data.

What happens to RDD in Apache Spark 2.0?

You may ask: are RDDs being relegated to second-class citizens? Are they being deprecated?

The answer is a resounding no!

What's more, as you will see below, you can move seamlessly between a DataFrame or Dataset and an RDD with simple API method calls. In fact, DataFrame and
Dataset are themselves built on top of RDDs.

DataFrame

Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, the data is organized into named columns, like a table in a relational
database. DataFrame was designed to make processing large datasets easier: it lets developers impose a structure on a distributed collection of data, which allows
a higher level of abstraction. It provides a domain-specific language API for manipulating your distributed data and makes Spark accessible to a wider audience,
not just specialized data engineers.

Dataset

In our Apache Spark 2.0 webinar and the follow-up blog post, we mentioned that in Spark 2.0 the DataFrame and Dataset APIs would be merged, unifying data
processing capabilities across libraries. With this unification, developers have fewer concepts to learn and remember, and they work with a single high-level,
type-safe API called Dataset.

As the following table shows, starting with Spark 2.0, Dataset takes on two distinct API characteristics: a strongly typed API and an untyped API. Conceptually,
you can think of a DataFrame as an alias for Dataset[Row], a collection of generic objects, where a Row is a generic untyped JVM object. A Dataset, by contrast,
is a collection of strongly typed JVM objects, with the type dictated by a case class you define in Scala or a class you define in Java.

Typed and untyped APIs

Language    Main abstraction
Scala       Dataset[T] & DataFrame (alias for Dataset[Row])
Java        Dataset[T]
Python      DataFrame
R           DataFrame

Note: Since Python and R have no compile-time type safety, they only have the untyped API, DataFrame.
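
To make the alias relationship concrete, here is a small sketch in Scala. The Person case class and the sample rows are hypothetical and exist only for this illustration; it assumes a SparkSession is passed in.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object TypedVsUntyped {
  // Hypothetical record type, used only for this illustration
  case class Person(name: String, age: Long)

  def build(spark: SparkSession): (DataFrame, Dataset[Person]) = {
    import spark.implicits._

    // Untyped API: columns are known by name only at runtime
    val df: DataFrame = Seq(("Ann", 34L), ("Bob", 19L)).toDF("name", "age")

    // In Scala, DataFrame is a type alias for Dataset[Row], so no conversion is needed
    val rows: Dataset[Row] = df

    // Typed API: each Row is bound to a Person, so fields are checked by the compiler
    val people: Dataset[Person] = rows.as[Person]

    (df, people)
  }
}
```

With the typed view, an expression such as people.map(_.age + 1) is checked by the Scala compiler, whereas df.select("age") is only checked when Spark analyzes the query.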

Dataset API advantages

In Spark 2.0, the unified APIs for DataFrame and Dataset provide many benefits to Spark developers.

1. Static typing and runtime type safety

Think of static typing and runtime safety as a spectrum, with SQL as the least restrictive end and Dataset as the most restrictive. For example, in a Spark SQL
string query you will not find out about a syntax error until runtime (which can be costly), whereas with DataFrame and Dataset you can catch such errors at
compile time (which saves developer time and cost). That is, if you call a function on a DataFrame that is not part of the API, the compiler catches it. However,
if you refer to a column name that does not exist, the error is not detected until runtime.

At the far end of the spectrum is Dataset, the most restrictive. Because the Dataset APIs are all expressed with lambda functions and typed JVM objects, any
mismatch of typed parameters is detected at compile time. Analysis errors are also detected at compile time when you use Dataset, again saving developer time and cost.

All of this translates into a spectrum of type safety with respect to the syntax and analysis errors in your Spark code. On that spectrum Dataset is the most
restrictive end, but it is also the most productive for developers.
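
The spectrum can be sketched roughly as follows. The Reading case class, the view name, and the deliberately broken query are all invented for this example, and it assumes a classic local SparkSession in which queries are parsed and analyzed as soon as they are built.

```scala
import scala.util.Try
import org.apache.spark.sql.{AnalysisException, SparkSession}

object ErrorSpectrum {
  // Hypothetical record type, used only for this illustration
  case class Reading(device: String, temp: Long)

  // Returns (SQL typo failed at runtime, unknown column failed at runtime, typed result)
  def probe(spark: SparkSession): (Boolean, Boolean, List[String]) = {
    import spark.implicits._
    val ds = Seq(Reading("a", 30L), Reading("b", 20L)).toDS()
    ds.createOrReplaceTempView("readings")

    // SQL string: the typo "SELEC" compiles and is only rejected when Spark parses it
    val sqlFails = Try(spark.sql("SELEC device FROM readings")).isFailure

    // DataFrame: select() exists, so this compiles; the unknown column is
    // rejected by the analyzer at runtime with an AnalysisException
    val columnFails = Try(ds.toDF().select("no_such_column"))
      .failed.toOption.exists(_.isInstanceOf[AnalysisException])

    // Dataset: a misspelled field such as `_.tmp` would not compile at all
    val hot = ds.filter(_.temp > 25).map(_.device).collect().toList

    (sqlFails, columnFails, hot)
  }
}
```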

2. High-level abstraction and custom views of structured and semi-structured data

A DataFrame, as a collection of Dataset[Row], gives you a structured, custom view of your semi-structured data. For example, suppose you have a very large
set of IoT device event data in JSON format. Because JSON is a semi-structured format, it lends itself well to being represented as a strongly typed
Dataset[DeviceIoTData]. A single record looks like this:

{"device_id": 198164, "device_name": "sensor-pad-198164owomcJZ", "ip": …, "cca2": "PL", "cca3": "POL", "cn": …, "latitude": 53.080000, "longitude": 18.620000, "scale": …

You can use a Scala case class to represent each JSON record as a DeviceIoTData, a custom object.

case class DeviceIoTData(battery_level: Long, c02_level: Long, cca2: String, cca3: String, cn: String,
  device_id: Long, device_name: String, humidity: Long, ip: String, latitude: Double, lcd: String,
  longitude: Double, scale: String, temp: Long, timestamp: Long)

Next, we can read data from a JSON file.

// read the json file and create the dataset from the
// case class DeviceIoTData
// ds is now a collection of JVM Scala objects DeviceIoTData
val ds = spark.read.json("/databricks-public-datasets/data/iot/iot_devices.json").as[DeviceIoTData]

Three things happen under the hood in the code above:

1. Spark reads the JSON, infers the schema, and creates a DataFrame;
2. At this point, Spark treats your data as "DataFrame = Dataset[Row]", a collection of generic Row objects, since it does not yet know the
exact type;
3. Spark then converts "Dataset[Row] -> Dataset[DeviceIoTData]", a collection of type-specific Scala JVM objects, as dictated by the class DeviceIoTData.

Most of us who work with structured data are used to viewing and processing data in a columnar manner, or to accessing specific attributes of an object. With
Dataset as a collection of typed Dataset[ElementType] objects, you seamlessly get both compile-time safety and a custom view of strongly typed JVM
objects. And the strongly typed Dataset[T] produced by the code above can easily be displayed or processed with high-level methods.
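
The three steps above can be sketched in a self-contained form. This is only an approximation of the article's example: DeviceReading is a hypothetical, reduced version of DeviceIoTData, and the two JSON records are made up so the code does not depend on the iot_devices.json file.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object JsonToDataset {
  // Hypothetical, reduced stand-in for DeviceIoTData with only four fields
  case class DeviceReading(device_id: Long, cca3: String, temp: Long, latitude: Double)

  def load(spark: SparkSession): Dataset[DeviceReading] = {
    import spark.implicits._

    // Made-up records standing in for the JSON file used in the article
    val json: Dataset[String] = Seq(
      """{"device_id": 1, "cca3": "POL", "temp": 21, "latitude": 53.08}""",
      """{"device_id": 2, "cca3": "USA", "temp": 30, "latitude": 37.77}"""
    ).toDS()

    // Steps 1 and 2: Spark infers the schema and returns a DataFrame, i.e. Dataset[Row]
    val df: DataFrame = spark.read.json(json)

    // Step 3: each Row is bound to the case class by column name
    df.as[DeviceReading]
  }
}
```

The schema inferred from JSON lists columns alphabetically, which is why the binding in step 3 relies on column names rather than positions.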

3. Ease of use of the structured APIs

Although structure may limit how much control your Spark program has over the data, it brings rich semantics and an easy-to-use set of domain-specific operations
that can be expressed as high-level constructs. In practice, most computations can be done with Dataset's high-level APIs. For example, it is much simpler to
perform operations such as agg, select, sum, avg, map, filter, or groupBy on the fields of a typed DeviceIoTData object in a Dataset than on the data fields of
rows in an RDD.

Expressing your computation in a domain-specific API is much simpler than writing relational-algebra-style expressions against RDDs. For example, the following
code uses filter() and map() to create another immutable Dataset.

// Use filter(), map(), groupBy() country, and compute avg()
// for temperatures and humidity. This operation results in
// another immutable Dataset. The query is simpler to read,
// and expressive
val dsAvgTmp = ds.filter(d => d.temp > 25).map(d => (d.temp, d.humidity, d.cca3)).groupBy($"_3").avg()
// display the resulting dataset (display() is a Databricks notebook helper)
display(dsAvgTmp)
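
For comparison, here is a sketch of the same kind of aggregation written both ways. The Reading case class and its three sample rows are invented for this example; the point is only to contrast the hand-written bookkeeping in the RDD version with the declarative Dataset version.

```scala
import org.apache.spark.sql.SparkSession

object AvgByCountry {
  // Hypothetical record type and sample data, used only for this comparison
  case class Reading(cca3: String, temp: Long)

  private val sample = Seq(Reading("POL", 20L), Reading("POL", 30L), Reading("USA", 40L))

  // RDD version: sums and counts are tracked by hand before dividing
  def withRdd(spark: SparkSession): Map[String, Double] =
    spark.sparkContext.parallelize(sample)
      .map(r => (r.cca3, (r.temp, 1L)))
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum.toDouble / count }
      .collect().toMap

  // Dataset version: the intent is stated directly with groupBy() and avg()
  def withDataset(spark: SparkSession): Map[String, Double] = {
    import spark.implicits._
    sample.toDS()
      .groupBy($"cca3").avg("temp")
      .as[(String, Double)] // tuple types are mapped to columns by position
      .collect().toMap
  }
}
```

Both functions return the same averages (25.0 for POL and 40.0 for USA with this sample), but only the Dataset version hands Spark a query it can optimize as a whole.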

4. Performance and optimization

In addition to the benefits above, you also get space efficiency and performance gains from using the DataFrame and Dataset APIs. There
are two reasons for this:

First, because the DataFrame and Dataset APIs are built on top of the Spark SQL engine, they use Catalyst to generate optimized logical and physical query plans.
Across the R, Java, Scala, and Python DataFrame/Dataset APIs, all relational queries go through the same code optimizer, which provides the space and speed
efficiency. While the typed Dataset[T] API is optimized for data engineering tasks, the untyped Dataset[Row] (an alias of DataFrame) is even faster and is
suitable for interactive analysis.
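
One way to see what Catalyst does with a query is to ask Spark for its plans. The sketch below uses a tiny made-up DataFrame; explain(true) prints the logical and physical plans, and queryExecution is a developer-facing handle onto the same information. The exact plan text varies by Spark version, so no output is shown.

```scala
import org.apache.spark.sql.SparkSession

object PlanInspection {
  def optimizedPlan(spark: SparkSession): String = {
    import spark.implicits._

    // Made-up data, used only to have something to plan a query over
    val df = Seq(("POL", 21L), ("USA", 30L)).toDF("cca3", "temp")
    val query = df.filter($"temp" > 25).select($"cca3")

    // Prints the parsed, analyzed, and optimized logical plans plus the physical plan
    query.explain(true)

    // The optimized logical plan produced by Catalyst, rendered as text
    query.queryExecution.optimizedPlan.toString
  }
}
```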

Second, because Spark, acting as a compiler, understands the JVM object type of your Dataset, it uses encoders to map your type-specific JVM objects to Tungsten's
internal memory representation. As a result, Tungsten encoders can serialize and deserialize JVM objects efficiently and generate compact bytecode that executes quickly.

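
As a small illustration of the encoder idea, the sketch below derives an encoder for a case class and inspects the schema Spark will use for it. DeviceReading is a hypothetical three-field type; this shows only the public Encoders API, not Tungsten's internal binary layout.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

object EncoderSchema {
  // Hypothetical record type, used only for this illustration
  case class DeviceReading(device_id: Long, device_name: String, temp: Long)

  // Encoders.product derives an encoder for a case class
  val encoder: Encoder[DeviceReading] = Encoders.product[DeviceReading]

  // The encoder carries the column names and types derived from the class fields
  def fieldNames: Seq[String] = encoder.schema.fieldNames.toSeq
}
```

When you write import spark.implicits._ and call .as[T] or .toDS(), Spark picks up an encoder like this one implicitly.
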
When should I use DataFrame or Dataset?

If you need rich semantics, high-level abstractions, and domain-specific APIs, use DataFrame or Dataset;
If your processing of semi-structured data calls for high-level expressions such as filters, maps, aggregations, averages, sums, SQL
queries, columnar access, or lambda functions, use DataFrame or Dataset;
If you want a higher degree of type safety at compile time, want typed JVM objects, and want to benefit from Catalyst optimization and Tungsten's efficient
code generation, use Dataset;
If you want a unified and simplified API across Spark libraries, use DataFrame or Dataset;
If you are an R user, use DataFrame;
If you are a Python user, use DataFrame, and fall back to RDD when you need more fine-grained control.

Note that you can always convert a DataFrame or Dataset to an RDD seamlessly by simply calling .rdd. For example:

// select specific fields from the Dataset, apply a predicate
// using the where() method, convert to an RDD, and show first 10
// RDD rows
val deviceEventsDS = ds.select($"device_name", $"cca3", $"c02_level").where($"c02_level" > 1300)
// convert to RDDs and take the first 10 rows
val eventsRDD = deviceEventsDS.rdd.take(10)
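
A self-contained variant of the same conversion might look like the sketch below. The Event case class and its two sample rows are hypothetical; the sketch also shows that the element type of the resulting RDD depends on whether you start from a typed Dataset or from a DataFrame.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object ToRdd {
  // Hypothetical record type and sample rows, used only for this illustration
  case class Event(device_name: String, cca3: String, c02_level: Long)

  def highCo2(spark: SparkSession): List[String] = {
    import spark.implicits._
    val ds = Seq(Event("pad-1", "POL", 1408L), Event("pad-2", "USA", 900L)).toDS()

    // A typed Dataset converts to an RDD of the same case class
    val typed: RDD[Event] = ds.rdd

    // select() returns a DataFrame, so .rdd yields an RDD of generic Row objects
    val rows: RDD[Row] =
      ds.select($"device_name", $"c02_level").where($"c02_level" > 1300).rdd

    // Back in RDD land: low-level map() plus the take() action
    rows.map(_.getString(0)).take(10).toList
  }
}
```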

Summary

In short, the choice of when to use RDD, DataFrame, or Dataset is fairly clear. RDD offers low-level functionality and control, while DataFrame and Dataset
support custom views and structure, offer high-level and domain-specific operations, save space, and run faster.

As we reviewed the lessons learned from earlier versions of Spark, we asked ourselves how to simplify Spark for developers and how to optimize it for better
performance. We decided to elevate the low-level RDD APIs into the high-level DataFrame and Dataset abstractions, and to build this unified data
abstraction across libraries on top of the Catalyst optimizer and Tungsten.

Pick whichever meets your needs and use case, DataFrame and Dataset or the RDD API, but I would not be surprised if, like most developers, you find yourself
working with structured or semi-structured data.

Jules S. Damji is the Apache Spark community evangelist at Databricks. He is also a hands-on developer with more than 15 years of experience building
large distributed systems at industry-leading companies. Prior to joining Databricks, he was a Developer Advocate at Hortonworks.

Thanks to Cai Fangfang for reviewing this article.

