The Three Musketeers of the Apache Spark API: RDD, DataFrame and Dataset

By Jules S. Damji

Posted on September 29, 2017. Estimated reading time: 18 minutes.
This article is translated from A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets, with the permission of the original author, Jules S. Damji.

Few things make developers happier than a set of APIs that boosts their productivity and is easy to use, intuitive, and expressive. One important reason Apache Spark has been so widely embraced by developers is its easy-to-use APIs for operating on large datasets in multiple languages, including Scala, Java, Python, and R.

In this article, I take a deeper look at the three APIs available in Apache Spark 2.2 and later, namely RDD, DataFrame and Dataset: when and why to choose each one, their performance and optimization characteristics, and the scenarios in which you should use DataFrame and Dataset instead of RDD. I focus mostly on DataFrame and Dataset, because these two APIs were unified in Apache Spark 2.0.


The motivation behind this unification is our desire to make Spark simpler to use, by reducing the number of concepts you have to master and by offering a way to process structured data. When handling structured data, Spark can provide high-level abstractions and APIs that read like a domain-specific language.

Resilient Distributed Dataset (RDD)


RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable, distributed collection of elements of your data, partitioned across the nodes of a cluster, that can be operated on in parallel through low-level APIs that offer transformations and actions.
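
As a minimal sketch of that low-level style (assuming a SparkContext named sc is already in scope, as in spark-shell; the sample data and names are purely illustrative):

// hypothetical low-level RDD usage
val lines = sc.parallelize(Seq("a,1", "b,2", "a,3"))

// transformations are lazy and each returns a new immutable RDD
val pairs = lines
  .map(_.split(","))
  .map(parts => (parts(0), parts(1).toInt))

// actions such as collect() trigger the actual distributed computation
val totals = pairs.reduceByKey(_ + _).collect()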

When should you use RDDs?

Consider these scenarios or common use cases for using RDDs:

You want low-level transformations, actions and control over your dataset;
Your data is unstructured, for example media streams or streams of text;
You want to manipulate your data with functional programming constructs rather than domain-specific expressions;
You don't need to impose a schema (such as a columnar format) while processing or accessing data attributes by name or column;
You can forgo some of the optimization and performance benefits that DataFrame and Dataset offer for structured and semi-structured data.

What happens to RDDs in Apache Spark 2.0?

You may be asking: are RDDs being relegated to second-class citizens? Are they about to be retired from the stage?

The answer is a resounding NO!

Moreover, as you will learn below, you can move seamlessly between a DataFrame or Dataset and an RDD with simple API method calls. In fact, DataFrames and Datasets are built on top of RDDs.
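
As a small illustrative sketch of that interoperability (assuming a SparkSession named spark; the data and column names are hypothetical):

// hypothetical round trip between the APIs
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(("sensor-1", 21), ("sensor-2", 25)))

// RDD -> DataFrame with named columns
val df = rdd.toDF("device_name", "temp")
// DataFrame -> RDD[Row]
val backToRdd = df.rdd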

DataFrame
Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, however, the data is organized into named columns, like a table in a relational database. DataFrames are designed to make processing large datasets easier: they let developers impose a structure on a distributed collection of data, enabling a higher level of abstraction. They provide a domain-specific API for manipulating your distributed data, and they open Spark up to a much wider audience than specialized data engineers.
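
A brief sketch of that column-oriented style might look like the following (the SparkSession spark and the JSON path are hypothetical):

// hypothetical relational, column-oriented DataFrame usage
import spark.implicits._

// the schema is inferred from the data
val df = spark.read.json("/path/to/devices.json")

df.where($"temp" > 25)
  .groupBy($"cca3")
  .avg("temp", "humidity")
  .show()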

In our Apache Spark 2.0 webinar and the follow-up blog post, we mentioned that in Spark 2.0 the DataFrame and Dataset APIs are unified, completing the integration of data-processing capabilities across libraries. With this unification, developers have fewer concepts to learn and remember, and they can work through a single set of high-level, type-safe APIs called Dataset.



Dataset
As the table below shows, starting with Spark 2.0 the Dataset takes on two distinct API characteristics: a strongly typed API and an untyped API. Conceptually, you can think of a DataFrame as an alias for Dataset[Row], a collection of generic, untyped JVM objects of type Row. A Dataset, in contrast, is a collection of strongly typed JVM objects, whose type is specified by the case class you define in Scala or the class you define in Java.

Typed and untyped APIs

Language | Main abstraction
Scala    | Dataset[T] & DataFrame (alias for Dataset[Row])
Java     | Dataset[T]
Python   | DataFrame
R        | DataFrame

Note: since Python and R have no compile-time type safety, they only offer the untyped API, namely DataFrame.

Dataset API advantages


In Spark 2.0, the unified APIs for DataFrame and Dataset provide many benefits to Spark developers.

1. Static typing and runtime type safety

Think of static typing and runtime safety as a spectrum, with SQL string queries being the least restrictive and Dataset the most restrictive. For instance, with a Spark SQL string query you won't know about a syntax error until runtime (which can be costly), whereas with DataFrame and Dataset such errors are caught at compile time (which saves developer time and money). That is, if you call a function on a DataFrame that is not part of the API, the compiler will catch it. However, if you reference a column name that doesn't exist, the error won't be detected until runtime.

At the far end of the spectrum is Dataset, the most restrictive. Because Dataset APIs are all expressed with lambda functions and typed JVM objects, any mismatch of typed parameters is detected at compile time. When using Dataset, analysis errors are also caught at compile time, saving developers time and money.
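
As a small illustration of where each kind of error surfaces, consider this sketch with a tiny hypothetical case class, as you might type it into spark-shell (assumes a SparkSession named spark):

// hypothetical illustration of compile-time versus runtime errors
import spark.implicits._

case class Device(device_name: String, temp: Long)
val devices = Seq(Device("sensor-1", 21), Device("sensor-2", 27)).toDS()
val devicesDF = devices.toDF()

// DataFrame: a column name is only a string, so this typo compiles fine
// and fails only at runtime with an AnalysisException
// devicesDF.select("divice_name")

// Dataset: fields are checked against the case class, so the same typo
// is rejected by the compiler
// devices.map(d => d.divice_name)

// typed access is verified end to end at compile time
val names = devices.map(_.device_name)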

All of this translates into a spectrum of type safety covering the syntax and analysis errors in your Spark code, with Dataset at the most restrictive end, yet also the most productive for developers.


2. High-level abstraction and custom views of structured and semi-structured data

Viewing a DataFrame as a collection of Dataset[Row] rows gives you a structured, custom view of your semi-structured data. For example, suppose you have a very large dataset of IoT device events in JSON format. Since JSON is a semi-structured format, it lends itself well to being used as a strongly typed collection, Dataset[DeviceIoTData]. An individual JSON record looks like this:
{"device_id": 198164, "device_name": "sensor-pad-198164owomcJZ", "ip": "80.55.20.25", "cca2": "PL", "cca3": "POL", "latitude": 53.080000, "longitude": 18.620000, ...}

You can express each JSON record as a custom object, DeviceIoTData, using a Scala case class.
case class DeviceIoTData (battery_level: Long, c02_level: Long, cca2: String,
  cca3: String, cn: String, device_id: Long, device_name: String, humidity: Long,
  ip: String, latitude: Double, lcd: String, longitude: Double, scale: String,
  temp: Long, timestamp: Long)

Next, we can read data from a JSON file.


// read the json file and create the dataset from the
// case class DeviceIoTData
// ds is now a collection of JVM Scala objects DeviceIoTData
val ds = spark.read.json("/databricks-public-datasets/data/iot/iot_devices.json").as[DeviceIoTData]

The code above breaks down into three steps:

1. Spark reads the JSON, infers the schema, and creates a DataFrame;
2. At this point, Spark converts your data into DataFrame = Dataset[Row], a collection of generic Row objects, since it does not yet know the exact type;
3. Spark then converts Dataset[Row] -> Dataset[DeviceIoTData], a collection of type-specific Scala JVM objects, as dictated by the class DeviceIoTData.

Many people who work with structured data are used to viewing and processing data in a columnar fashion, or accessing specific attributes within an object. With a Dataset as a collection of typed Dataset[ElementType] objects, you get both compile-time safety and a custom view of strongly typed JVM objects. The strongly typed Dataset[T] obtained from the code above can also be easily displayed or manipulated with high-level methods.
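
For instance, a short sketch of both views over the ds created above (show() and printSchema() are standard Dataset methods):

// the columnar, schema-level view of the same typed Dataset
ds.printSchema()
// column access by name
ds.select("device_name", "temp").show(5)
// typed access: first() returns a DeviceIoTData object
val first = ds.first()
println(first.device_name + " " + first.temp)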


3. Easy-to-use structured APIs

Although structure may limit how much control your Spark program has over the data, it introduces rich semantics and an easy-to-use set of domain-specific operations that can be expressed as high-level constructs. In fact, most computations can be accomplished with Dataset's high-level APIs. For example, it is much simpler to perform agg, select, sum, avg, map, filter or groupBy operations on Dataset-typed DeviceIoTData objects than on the data fields of RDD rows.

Expressing your computation in a domain-specific API is far simpler than performing the equivalent relational-algebra-style operations with an RDD. For example, the code below uses filter() and map() to create another immutable Dataset.
// Use filter(), map(), groupBy() country, and compute avg()
// for temperatures and humidity. This operation results in
// another immutable Dataset. The query is simpler to read,
// and expressive
val dsAvgTmp = ds.filter(d => {d.temp > 25}).map(d => (d.temp, d.humidity, d.cca3)).groupBy($"_3").avg()
// display the resulting dataset
display(dsAvgTmp)
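
For comparison, a rough sketch of the same computation written directly against the underlying RDD (purely illustrative, not from the original article) has to carry tuples and positional fields through every step:

// hypothetical RDD-style equivalent: average temperature and humidity per country
val rddAvgTmp = ds.rdd
  .filter(d => d.temp > 25)
  .map(d => (d.cca3, (d.temp, d.humidity, 1L)))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
  .mapValues(t => (t._1.toDouble / t._3, t._2.toDouble / t._3))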


4. Performance and optimization

In addition to the above benefits, you'll also see the space efficiency and performance improvements that come with using the DataFrame and Dataset APIs. There
are two reasons for this:

First, because the DataFrame and Dataset APIs are built on top of the Spark SQL engine, they use Catalyst to generate optimized logical and physical query plans. Across the R, Java, Scala and Python DataFrame/Dataset APIs, all relational queries go through the same code optimizer, and therefore gain the same space and speed efficiency. While the typed Dataset[T] API is optimized for data engineering tasks, the untyped Dataset[Row] (alias DataFrame) is even faster and well suited to interactive analysis.
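
If you are curious what Catalyst actually produces for a given query, one simple way to look (an illustrative aside, using the dsAvgTmp query defined earlier) is to ask the query for its plans:

// print the parsed, analyzed and optimized logical plans plus the physical plan
dsAvgTmp.explain(true)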


Second, because Spark, acting as a compiler, understands your Dataset's JVM object type, it maps your type-specific JVM objects to Tungsten's internal memory representation using encoders. As a result, Tungsten encoders can serialize and deserialize JVM objects very efficiently and also generate compact bytecode that executes at superior speed.
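
As a small sketch of that encoder machinery (illustrative only), you can inspect the schema an encoder derives for the DeviceIoTData case class defined earlier:

import org.apache.spark.sql.Encoders

// the encoder for a case class carries the schema that Tungsten uses for
// its compact internal binary representation
val deviceEncoder = Encoders.product[DeviceIoTData]
println(deviceEncoder.schema.treeString)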
When should I use DataFrame or Dataset?

If you need rich semantics, high-level abstractions and domain-specific APIs, use DataFrame or Dataset;
If your processing requires high-level operations on semi-structured data, such as filter, map, aggregation, average, sum, SQL queries, columnar access or lambda functions, use DataFrame or Dataset;
If you want a high degree of type safety at compile time, typed JVM objects, the benefit of Catalyst optimization and the efficient code generated by Tungsten, use Dataset;
If you want a unified and simplified API that is consistent across the various Spark libraries, use DataFrame or Dataset;
If you are an R user, use DataFrame;
If you are a Python user, use DataFrame, and drop down to RDD when you need more fine-grained control.

Note that you can always seamlessly convert a DataFrame or Dataset to an RDD simply by calling .rdd. For example:

// select specific fields from the Dataset, apply a predicate
// using the where() method, convert to an RDD, and show first 10
// RDD rows
val deviceEventsDS = ds.select($"device_name", $"cca3", $"c02_level").where($"c02_level" > 1300)
// convert to RDDs and take the first 10 rows
val eventsRDD = deviceEventsDS.rdd.take(10)


Summary
In summary, the choice of when to use RDD versus DataFrame or Dataset should now seem fairly obvious. The former offers low-level functionality and control; the latter two allow custom views and structure, provide high-level and domain-specific operations, save space, and execute at superior speed.

When we reviewed the lessons learned from early releases of Spark, we asked ourselves how to simplify Spark for developers and how to optimize it for better performance. We decided to elevate the low-level RDD APIs to the high-level abstractions of DataFrame and Dataset, and to build this unified data abstraction across libraries on top of the Catalyst optimizer and Tungsten.

Choose whichever fits your needs and use cases, DataFrame and Dataset or the RDD API; but I would not be surprised if, like most developers working with structured and semi-structured data, you end up in the DataFrame and Dataset camp.

About the author


Jules S. Damji is an Apache Spark community evangelist at Databricks. He is also a hands-on developer with more than 15 years of experience building large-scale distributed systems at industry-leading companies. Before joining Databricks, he was a Developer Advocate at Hortonworks.

Thanks to Cai Fangfang for reviewing this article.
