A Few Words on the Three Musketeers of the Apache Spark API: RDD, DataFrame, and Dataset
Posted on September 29, 2017. Estimated reading time: 18 minutes.
This article is translated from A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets, with the permission of the original author, Jules S. Damji.
One of the things that makes developers happy is a set of APIs that makes them more productive: APIs that are easy to use, intuitive, and expressive. One important reason Apache Spark is so widely welcomed by developers is its easy-to-use APIs for operating on large data sets across multiple languages: Scala, Java, Python, and R.
In this article, I'll take a closer look at the three APIs available in Apache Spark 2.2 and beyond - RDD, DataFrame, and Dataset - when and why you should choose each one, outline their performance and optimization characteristics, and list the scenarios where DataFrame and Dataset should be used instead of RDD. I'll focus mostly on DataFrame and Dataset, because these two APIs were unified in Apache Spark 2.0.
The motivation behind this unification is to make Spark simpler to use by reducing the number of concepts you need to master and by providing a single way to process structured data. When working with structured data, Spark offers the same kind of high-level abstractions and APIs that domain-specific languages provide.
The following are scenarios and common use cases in which using RDDs makes sense:
You want low-level transformations, actions, and control over your data set;
Your data is unstructured, such as media streams or streams of text;
You want to manipulate your data with functional programming constructs rather than domain-specific expressions;
You don't want to impose a schema, such as a columnar format, when processing or accessing data attributes by name or column;
You can forgo some of the optimization and performance benefits that DataFrame and Dataset offer for structured and semi-structured data.
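To make the points above concrete, here is a minimal sketch of low-level, functional RDD manipulation (the SparkContext named `sc` and the sample lines are assumptions for illustration):

```scala
// Low-level, functional control: each step is a lambda over raw
// elements, which Spark treats as opaque and cannot optimize.
val lines = sc.parallelize(Seq("spark makes big data simple", "big data"))

val wordCounts = lines
  .flatMap(_.split(" "))   // split each line into words
  .map(word => (word, 1))  // pair each word with a count of 1
  .reduceByKey(_ + _)      // sum the counts per word
```

Note that there is no schema here: the data is just strings and tuples, and any structure lives only inside the lambdas.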
You may ask: doesn't this relegate RDDs to second-class citizenship? Aren't they about to exit the stage of history?
The answer is no. Moreover, as you'll learn below, you can move seamlessly between a DataFrame or Dataset and an RDD with simple API method calls; in fact, DataFrames and Datasets are built on top of RDDs.
DataFrame
Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, however, the data is organized into named columns, like a table in a relational database. DataFrames were designed to make large data sets easier to process: they let developers impose a structure on a distributed collection of data, allowing a higher level of abstraction. They provide a domain-specific API for manipulating your distributed data, and they make Spark accessible to a wider audience beyond professional data engineers.
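As a sketch of that named-column view (the `spark.implicits._` import and the sample rows are assumptions for illustration):

```scala
import spark.implicits._

// Data organized into named columns, like a relational table.
val df = Seq(
  ("sensor-1", "PL", 21L),
  ("sensor-2", "US", 27L)
).toDF("device_name", "cca2", "temp")

df.printSchema()               // shows the named, typed columns
df.filter($"temp" > 25).show() // declarative, domain-specific query
```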
In our Apache Spark 2.0 webinar and the follow-up blog post, we mentioned that in Spark 2.0 the DataFrame and Dataset APIs were merged to unify data-processing capabilities across libraries. With this unification, developers have fewer concepts to learn and remember, and they can get their work done through a single high-level, type-safe API called Dataset.
Note: since Python and R have no compile-time type safety, they offer only the untyped API, called DataFrame.
Think of static typing and runtime safety as a spectrum, from the loosest constraints of SQL to the strictest constraints of Dataset. For instance, in a Spark SQL string query you won't discover a syntax error until runtime (which can be costly), whereas with DataFrame and Dataset you catch errors at compile time (which saves developer time and cost). That is, if you invoke a function in DataFrame that is not part of the API, the compiler will catch the mistake. However, if you use a field name that doesn't exist, the error won't be detected until runtime.
At the far end of the spectrum is Dataset, the strictest of all. Because Dataset APIs are all expressed as lambda functions over typed JVM objects, any mismatched typed parameter is detected at compile time. With Dataset, even your analysis errors are discovered at compile time, again saving developers time and cost.
All of this translates into a spectrum of type safety for the syntax and analysis errors in your Spark code. Dataset sits at the strictest end of that spectrum, yet it is also the most productive for developers.
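A small sketch of the spectrum described above (the DataFrame `df`, Dataset `ds`, and the misspelled column name are made up for illustration):

```scala
// DataFrame: a misspelled column name compiles fine, but fails
// only at runtime with an AnalysisException.
df.select("device_nmae")

// Dataset: the same typo on a typed object is a compile-time error,
// because the lambda operates on a JVM object with known fields.
// ds.map(d => d.device_nmae)   // does not compile: no such field
```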
Since a DataFrame is a collection of Dataset[Row], you get a structured, custom view into your semi-structured data. For example, suppose you have a very large set of IoT device event data in JSON format. Because JSON is a semi-structured format, it lends itself well to a strongly typed Dataset[DeviceIoTData] collection. Consider this sample JSON record:
{"device_id": 198164, "device_name": "sensor-pad-198164owomcJZ", "ip": "80.55.20.25", "cca2": "PL", "cca3": "POL", "cn": "Poland", "latitude": 53.080000, "longitude": 18.620000, "scale": ...}
You can use a Scala case class to define each JSON record as a custom object of type DeviceIoTData:
case class DeviceIoTData(battery_level: Long, c02_level: Long, cca2: String,
  cca3: String, cn: String, device_id: Long, device_name: String, humidity: Long,
  ip: String, latitude: Double, lcd: String, longitude: Double, scale: String,
  temp: Long, timestamp: Long)
1. Spark reads the JSON and creates a DataFrame collection according to the schema;
2. At this point, Spark holds your data as "DataFrame = Dataset[Row]", a collection of generic Row objects, since it does not yet know the exact type;
3. Spark can then convert "Dataset[Row] -> Dataset[DeviceIoTData]", a collection of type-specific Scala JVM objects, as defined by the class DeviceIoTData.
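The three steps above can be sketched as follows (the file path is hypothetical, and a SparkSession named `spark` is assumed):

```scala
import spark.implicits._

// Steps 1-2: read the JSON into a DataFrame, i.e. Dataset[Row]
val df = spark.read.json("/path/to/iot_devices.json")

// Step 3: convert Dataset[Row] into a typed Dataset[DeviceIoTData]
val ds = df.as[DeviceIoTData]
```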
Many people who work with structured data are used to viewing and processing data in a columnar fashion, or accessing particular attributes within an object. With Dataset as a collection of typed Dataset[ElementType] objects, you get both natural compile-time safety and a custom view over strongly typed JVM objects. The strongly typed Dataset[T] obtained from the code above can also be displayed or manipulated easily with high-level methods.
Although structure limits how much control your Spark program has over the data, it brings rich semantics and an easy-to-use set of domain-specific operations that can be expressed as high-level constructs. In fact, most computations can be accomplished with Dataset's high-level APIs. For example, operations such as agg, select, sum, avg, map, filter, or groupBy are much simpler to perform on Dataset-typed DeviceIoTData objects than on the data fields of raw RDD rows.
Expressing your computation with a domain-specific API is far simpler than performing relational-algebra-style operations with RDDs. For example, the following code uses filter() and map() to create another immutable Dataset.
// Use filter(), map(), groupBy() country, and compute avg()
// for temperatures and humidity. This operation results in
// another immutable Dataset. The query is simpler to read,
// and expressive.
val dsAvgTmp = ds.filter(d => d.temp > 25)
  .map(d => (d.temp, d.humidity, d.cca3))
  .groupBy($"_3")
  .avg()

// display the resulting dataset
display(dsAvgTmp)
Beyond the benefits above, you also gain space efficiency and performance from using the DataFrame and Dataset APIs, for two reasons.
First, because the DataFrame and Dataset APIs are built on top of the Spark SQL engine, they use Catalyst to generate optimized logical and physical query plans. Across the R, Java, Scala, and Python DataFrame/Dataset APIs, all relational queries go through the same underlying code optimizer, gaining space and speed efficiency. While the typed Dataset[T] API is optimized for data-processing tasks, the untyped Dataset[Row] (an alias for DataFrame) is even faster and well suited for interactive analysis.
Second, since Spark, acting as a compiler, understands your Dataset-typed JVM objects, it uses Encoders to map type-specific JVM objects to Tungsten's internal memory representation. As a result, Tungsten's Encoders can serialize and deserialize JVM objects extremely efficiently, generating compact bytecode that runs fast.
When should I use DataFrame or Dataset?
If you need rich semantics, high-level abstractions, and domain-specific APIs, use DataFrame or Dataset;
If your processing calls for high-level operations on semi-structured data, such as filters, maps, aggregations, averages, sums, SQL queries, columnar access, or lambda functions, use DataFrame or Dataset;
If you want a high degree of compile-time type safety, typed JVM objects, the benefit of Catalyst optimization, and the efficient code generated by Tungsten, use Dataset;
If you want a unified and simplified API across the different Spark libraries, use DataFrame or Dataset;
If you are an R user, use DataFrame;
If you are a Python user, use DataFrame, and drop down to RDD when you need finer-grained control.
Note that you can always convert a DataFrame or Dataset to an RDD seamlessly by simply calling .rdd. For example:
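A minimal sketch of dropping down to an RDD (the Dataset `ds` of DeviceIoTData from earlier is assumed):

```scala
// Run a high-level, typed query first...
val events = ds.filter(d => d.c02_level > 1300)

// ...then call .rdd to get the underlying RDD[DeviceIoTData]
val eventsRDD = events.rdd
eventsRDD.take(10).foreach(println)
```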
Summary
In short, when to use an RDD, DataFrame, or Dataset seems fairly obvious. The former offers low-level functionality and control; the latter two support custom views and structure, provide high-level and domain-specific operations, save space, and execute with speed.
When we reviewed the lessons learned from earlier releases of Spark, we asked ourselves how to simplify Spark for developers, and how to optimize it for higher performance. We decided to abstract the low-level RDD APIs into the high-level DataFrame and Dataset, and to build this consistent data abstraction across libraries on top of the Catalyst optimizer and Tungsten.
Whether you choose the DataFrame, Dataset, or RDD API depends on your needs and scenarios, but if you work with structured or semi-structured data like most developers do, I would not be surprised to find you in the DataFrame and Dataset camp.