
Big Data Huawei Course

Spark2x
NOTICE
This document was generated from Huawei study material.
Consider the information in this document as supporting material.

Centro de Inovação EDGE - Big Data Course


Table of Contents
1. Spark Overview
2. Spark Highlights
3. Spark Ecosystem
4. Spark vs MapReduce
5. Spark Principles and Architecture
   5.1. Spark System Architecture
   5.2. Core Concepts of Spark - RDD
   5.3. RDD Dependencies
   5.4. Stage Division of RDD
   5.5. RDD Operators
   5.6. Spark on YARN - Client Operation Process
   5.7. Spark on YARN - Cluster Operation Process
   5.8. Differences Between YARN client mode and cluster mode
   5.9. Typical Case - WordCount
6. Spark SQL Overview
7. Introduction to Dataset
8. Introduction to DataFrame
9. RDD, DataFrame, and Datasets
10. Spark SQL and Hive
11. Structured Streaming Overview
12. Overview of Spark Streaming
13. Micro Batch Processing of Spark Streaming
14. Fault Tolerance Mechanism of Spark Streaming
15. Spark and Other Components

Spark2x – Huawei Course
1. Spark Overview

Spark is a distributed computing engine based on memory. It stores intermediate processing data in memory, which is the feature that distinguishes it from traditional computing engines such as MapReduce. What does Spark provide? Spark provides a one-stop data analysis capability and supports stream processing in small batches, offline batch processing, SQL query, data mining, graph computing, and machine learning. Users can use these functions together in a single application.

In practical use, for example, batch processing can be used for data ETL (extraction, transformation, and loading). Machine learning can be used by shopping websites to judge whether customer reviews are positive or negative. SQL query can be used to query data in Hive. Stream processing is applicable to real-time services such as click-stream analysis, recommendation systems, and public opinion analysis.

2. Spark Highlights

Let's take a look at some features of Spark. First, light: Spark is written in Scala, which is much more concise and expressive. Besides, it reuses the infrastructure of Hadoop and Mesos (also an open-source framework for managing computer clusters), so Spark has only about thirty thousand lines of core code. Second, fast: Spark has sub-second latency for small datasets. Specifically, Spark is faster than MapReduce or Hive for large-dataset applications such as iterative machine learning, ad hoc query, and graph computing. Besides, Spark features in-memory computing, data locality, transmission optimization, and scheduling optimization. Third, smart: it reuses existing big data components. To be more specific, Spark integrates seamlessly with Hadoop, and its graph computing uses the Pregel and PowerGraph APIs as well as the point-division approach of PowerGraph. Pregel and PowerGraph are both distributed graph-parallel computing frameworks. The last one is flexible: Spark provides flexibility at different levels. For example, Spark supports new data operators, new data sources, and new language bindings. Spark also supports a variety of functions, such as in-memory computing, multi-iteration batch processing, ad hoc query, streaming, and graph computing.

3. Spark Ecosystem

Based on this figure we can see that Spark is capable of interacting with multiple applications,
environments and data sources.

4. Spark vs MapReduce

We have learned about MapReduce before: it is also a computing framework used for parallel computing of massive datasets. In MapReduce, the intermediate data generated by Map tasks is written to local disks as MOF files, waiting to be fetched by Reduce tasks. In contrast, the intermediate data of Spark is stored in memory, so Spark improves computing efficiency and reduces the latency of iterative operations and batch processing. Spark is therefore well suited to iterative computation: as repeated operations increase, more data needs to be read, and Spark brings greater benefit because its intermediate data is kept in memory. The performance of Spark can be over one hundred times higher than that of MapReduce in specific scenarios with many iterations; for scenarios with few iterations, Spark does not have many advantages over MapReduce. Spark also provides a more flexible programming model and higher development efficiency, because it offers more dataset operation types, and Spark has higher fault tolerance thanks to its lineage mechanism.
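As a hedged illustration of this point (not code from the course material), the following Scala sketch caches a parsed dataset so that repeated passes reuse the in-memory copy instead of re-reading from disk. The file path and the computation are illustrative, and sc is assumed to be an existing SparkContext.

    // Iterative work benefits from keeping intermediate data in memory.
    val points = sc.textFile("hdfs:///tmp/points.txt")
      .map(_.split(",").map(_.toDouble))
      .cache()                                // keep the parsed data in memory

    var sum = 0.0
    for (_ <- 1 to 10) {                      // each iteration reuses the cached RDD
      sum = points.map(_.sum).reduce(_ + _)   // instead of re-reading and re-parsing from disk
    }
    println(sum)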

5. Spark Principles and Architecture

5.1. Spark System Architecture

This is the system architecture of Spark. At the bottom is the Standalone scheduler, the resource management framework that comes with Spark; Spark also supports the YARN and Mesos resource management systems, but FusionInsight integrates the Spark on YARN mode by default and currently does not support other modes. In the middle is Spark Core, a distributed computing framework similar to MapReduce. The intermediate computation results are stored directly in memory, which improves computation performance. On top are the functional modules. Spark SQL is a Spark component mainly for processing structured data and running SQL-like queries on it. Using Spark SQL, ETL operations can be executed on different data formats such as JSON, Parquet, and ORC, and on data sources such as HDFS and databases, to complete specific query operations. Structured Streaming is an engine built on Spark SQL to process streaming data; it is programmed using Scala and is fault tolerant. Spark Streaming is a streaming engine based on micro-batch processing: stream data is sliced and then processed in the Spark Core computation engine. Compared with Storm, it provides weaker real-time performance but better throughput. MLlib is Spark's machine learning library; its goal is to make practical machine learning scalable and easy. GraphX is a Spark component for graphs and graph-parallel computing; it includes a growing collection of graph algorithms and builders to simplify graph analytics tasks. SparkR is an R package that provides a lightweight front end to use Apache Spark from R. SparkR provides a distributed data frame implementation that supports operations such as selection, filtering, and aggregation on large datasets, and it also supports distributed machine learning using MLlib.

5.2. Core Concepts of Spark - RDD

Spark revolves around the concept of the RDD (Resilient Distributed Dataset), which is the fundamental data structure of Spark. An RDD is a read-only, partitioned collection of records. RDDs are stored in memory by default and spill to disk when memory is insufficient. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster; this improves performance through data locality. There are two ways to create RDDs: an RDD can be created from a Hadoop file system such as HDFS (or any storage system compatible with Hadoop), or it can be derived from a parent RDD. Besides, an RDD supports a lineage mechanism, which records the dependency chain: an RDD remembers how it evolved from another RDD, so data can be recovered quickly when data loss occurs.
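The following is a minimal Scala sketch of the two creation paths described above. The application name and HDFS path are illustrative assumptions, not taken from the course material.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddCreationExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("RddCreationExample"))

        // Way 1: create an RDD from a Hadoop-compatible storage system such as HDFS.
        val lines = sc.textFile("hdfs:///tmp/input.txt")

        // Way 2: derive a child RDD from a parent RDD; the lineage records this dependency,
        // so a lost partition of "upper" can be recomputed from "lines".
        val upper = lines.map(_.toUpperCase)

        println(upper.count())
        sc.stop()
      }
    }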

5.3. RDD Dependencies

There are two types of dependencies between RDDs: narrow dependencies and wide dependencies. A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD, while a wide dependency means that each partition of the parent RDD may be used by multiple child RDD partitions; wide dependencies are the basis of stage division. Narrow dependencies have some advantages. First, a narrow dependency allows multiple commands to be executed in pipeline mode on the same cluster node; for example, after the map operation is performed, the filter operation can be performed immediately. Second, failure recovery under a narrow dependency is more efficient, because only the lost parent partitions need to be recomputed, and the recomputation can be performed concurrently on different nodes.
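A small sketch of the pipelining idea, assuming an existing SparkContext sc; the numbers and operations are illustrative.

    // map and filter each depend narrowly on their parent, so Spark runs them back to back
    // on the same partition within a single stage, with no shuffle in between.
    val nums     = sc.parallelize(1 to 100, 4)        // 4 partitions
    val pipeline = nums.map(_ * 2).filter(_ % 3 == 0) // narrow -> narrow: one pipelined stage
    println(pipeline.count())                         // the action triggers that single stage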

5.4. Stage Division of RDD

It is difficult to divide a submitted job into stages because of the complex RDD dependencies. Spark therefore uses the narrow and wide dependencies mentioned above. The Spark scheduler traverses the whole dependency chain backwards from the end of the DAG (Directed Acyclic Graph): when a wide dependency is encountered, the dependency chain is broken and a new stage begins; when a narrow dependency is encountered, the RDD partition is added to the current stage. The number of tasks in a stage is determined by the number of partitions of the RDD at the end of the stage. RDD conversion is partition-based, coarse-grained computing, and the result of a stage's execution consists of the RDDs of these partitions.
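A sketch of how a wide dependency creates a stage boundary, assuming an existing SparkContext sc and an illustrative HDFS path. toDebugString prints the lineage, where an indentation change marks a shuffle, and therefore a stage boundary.

    // reduceByKey introduces a wide dependency, so the scheduler breaks the chain there.
    val counts = sc.textFile("hdfs:///tmp/input.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)      // wide dependency: a new stage starts here

    println(counts.toDebugString)   // the printed lineage shows the shuffle boundary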

5.5. RDD Operators

RDDs support two types of operations: transformations and actions. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program. All transformations in Spark are lazy, in that they do not compute their results right away; instead, they just remember the transformations applied to some dataset or file. The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently.
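A minimal sketch of lazy evaluation, assuming an existing SparkContext sc:

    // Declaring the transformations below runs nothing; the job is only executed
    // when the action (reduce) needs a result.
    val nums    = sc.parallelize(1L to 1000000L)
    val squares = nums.map(n => n * n)       // transformation: recorded in the lineage only
    val total   = squares.reduce(_ + _)      // action: triggers the actual computation
    println(total)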

Major Roles of Spark

The Driver is responsible for the application's business logic and operation planning. The ApplicationMaster manages the application's resources and applies for resources on behalf of the application. The Client submits applications.

The ResourceManager is responsible for scheduling and allocating resources in the whole cluster, and the NodeManager is responsible for managing the resources of its own node. An Executor is the process that actually runs the tasks; an application is split across multiple Executors for computation.

5.6. Spark on YARN - Client Operation Process

Spark computing and scheduling can be implemented using YARN: Spark uses the computing resources provided by YARN clusters and runs tasks in a distributed way. Spark on YARN involves two modes, YARN client and YARN cluster. In YARN client mode, the Driver is deployed and runs on the client. During the operation process, the client first sends the Spark application request to the ResourceManager and packages all the information required to start the ApplicationMaster, sending that information to the ResourceManager. The ResourceManager then returns the results, which include information such as the application ID and the upper and lower limits of available resources. After receiving the request, the ResourceManager finds a proper node for the ApplicationMaster and starts it on that node. ApplicationMaster is a role in YARN, and the corresponding process name in Spark is ExecutorLauncher. Based on the resource requirements of each task, the ApplicationMaster applies to the ResourceManager for a series of containers to run the tasks. After receiving the newly allocated container list from the ResourceManager, the ApplicationMaster sends information to the related NodeManagers to start the containers. Once the ResourceManager has allocated containers to the ApplicationMaster, the ApplicationMaster communicates with the related NodeManagers and starts Executors on the obtained containers. After an Executor starts, it registers with the Driver and applies for tasks. The Driver then allocates tasks to the Executors, and the Executors run the tasks and report their operating status to the Driver.

5.7. Spark on YARN - Cluster Operation Process

In YARN cluster mode, the client first generates the application information and then sends it to the ResourceManager. The ResourceManager allocates a container (for the ApplicationMaster) to the Spark application and starts the Driver on that container's node. After that, the ApplicationMaster applies to the ResourceManager for resources to run Executors. The ResourceManager allocates containers to the ApplicationMaster, which communicates with the related NodeManagers and starts Executors on the obtained containers. After an Executor starts, it registers with the Driver and applies for tasks. The Driver then allocates tasks to the Executors, and the Executors run them and report their status to the Driver.

5.8. Differences Between YARN client mode and cluster mode

The first difference is the ApplicationMaster. In YARN cluster mode, the Driver runs inside the ApplicationMaster, which is responsible for applying for resources from YARN and monitoring the running status of the job. After a user submits a job, the client can be closed and the job continues running on YARN; however, YARN cluster mode is not suitable for running interactive jobs. In YARN client mode, the ApplicationMaster only applies for Executors from YARN, and the Client communicates with the obtained containers to schedule tasks; therefore, the Client cannot be closed. YARN cluster mode is suitable for production because application output can be generated quickly, while YARN client mode is suitable for testing. Besides, if the task submission node goes down in YARN client mode, the entire job fails, whereas in YARN cluster mode such a failure does not affect the job.

5.9. Typical Case - WordCount
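The course slide for this case is not reproduced here; the following is a minimal WordCount sketch in Scala with illustrative input and output paths.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

        sc.textFile("hdfs:///tmp/input.txt")            // read the input file into an RDD of lines
          .flatMap(_.split("\\s+"))                     // split each line into words
          .map(word => (word, 1))                       // narrow: build (word, 1) pairs
          .reduceByKey(_ + _)                           // wide: shuffle and sum counts per word
          .saveAsTextFile("hdfs:///tmp/wordcount-out")  // action: triggers the job, writes output

        sc.stop()
      }
    }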

6. Spark SQL Overview

Spark SQL is a module for processing structured data. In a Spark application, SQL statements and DataFrame APIs can be used together to query structured data. Simply speaking, Spark SQL is a module that parses SQL statements into RDD operations and then uses Spark Core to execute them. Here, a DataFrame is a distributed collection in which data is organized into named columns. Spark SQL and DataFrame also provide a universal method for accessing multiple data sources such as Hive, CSV, ORC, and JSON, and these data sources also allow data interaction. Spark SQL reuses the Hive front-end processing logic and metadata processing module, so you can directly query existing Hive data with Spark SQL. In addition, Spark SQL provides API, CLI, and JDBC interfaces, allowing diverse client access.
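A hedged sketch of the two query styles, an SQL statement and the DataFrame API, over the same data; the file path, view name, and column names are illustrative assumptions.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("SparkSqlExample")
      .enableHiveSupport()        // reuse Hive metadata so existing Hive tables are visible
      .getOrCreate()

    val people = spark.read.json("hdfs:///tmp/people.json")
    people.createOrReplaceTempView("people")

    spark.sql("SELECT name, age FROM people WHERE age > 18").show()   // SQL statement
    people.select("name", "age").where("age > 18").show()             // equivalent DataFrame API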

7. Introduction to Dataset

Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize objects for processing or transmission over the network. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are generated dynamically as code and use a format that allows Spark to perform many operations, such as filtering, sorting, and hashing, without deserializing the bytes back into objects.
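A minimal sketch, assuming an existing SparkSession spark; the Person case class is illustrative. The Encoder generated for it lets Spark filter on the encoded form.

    case class Person(name: String, age: Int)

    import spark.implicits._        // brings the Encoder for Person into scope
    val people = Seq(Person("Ana", 23), Person("Bruno", 31)).toDS()
    people.filter(_.age > 25).show()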

8. Introduction to DataFrame

Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row. A DataFrame is a structured, distributed dataset composed of several named columns, similar to a table in a relational database or a data frame in R or Python. DataFrame is a basic concept in Spark SQL and can be created in multiple ways, for example from a structured dataset, a Hive table, an external database, or an RDD. A DataFrame carries structural information about its data, namely the schema.
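A short sketch of some of these creation paths, assuming an existing SparkSession spark; the table and file names are illustrative.

    import spark.implicits._

    val fromJson = spark.read.json("hdfs:///tmp/people.json")          // from a structured data source
    val fromHive = spark.table("default.people")                       // from an existing Hive table
    val fromSeq  = Seq(("Ana", 23), ("Bruno", 31)).toDF("name", "age") // from a local collection / RDD

    fromJson.printSchema()   // the schema is the structural information carried by a DataFrame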

9. RDD, DataFrame, and Datasets

Let's compare RDDs, Datasets, and DataFrames. For RDDs, both cluster communication and I/O operations require serialization and deserialization of the data and of the data structures. Datasets and DataFrames have exactly the same functions, but the type of the data in each row differs. For a DataFrame, each row is of type Row: the fields and their types in a row are unknown, so you can only use getAs or pattern matching on the columns to obtain a specific field.

In a Dataset, the type of each row is not fixed to Row: after defining a case class, you can directly obtain the information in each row. In other words, the column information of a DataFrame is clear, but its row type information is not.
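A sketch of the difference in field access, reusing the illustrative Person case class from above and an assumed SparkSession spark:

    import spark.implicits._

    val df = Seq(Person("Ana", 23)).toDF()
    val firstRow = df.first()                          // an untyped org.apache.spark.sql.Row
    val name: String = firstRow.getAs[String]("name")  // the field type must be supplied by the caller

    val ds = Seq(Person("Ana", 23)).toDS()
    val typedName: String = ds.first().name            // the field type is known at compile time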

10. Spark SQL and Hive

Spark SQL uses Spark Core as its execution engine, while Hive uses MapReduce, and the execution speed of Spark SQL is 10 to 100 times faster than that of Hive. Spark SQL syntax and Hive syntax are basically the same, except for bucket operations. Besides, Spark SQL depends on the metadata of Hive, is compatible with most Hive syntax and functions, and can also use user-defined functions written for Hive.

11. Structured Streaming Overview

Structured Streaming is an engine built on Spark SQL to process streaming data. It is programmed using Scala and is fault tolerant. A streaming computation is written in the same way as an application on static RDD data: as streaming data is incrementally and continuously produced, Spark SQL keeps processing the data and synchronizing the results to the result set.

Similar to the block data processing model, the streaming data processing model applies query operations on a static database table to streaming computing. Spark uses standard SQL statements to query and obtain data from the incrementally growing, unbounded table. Consider the input data stream as the input table: every data item arriving on the stream is like a new row being appended to the input table.

Each query operation generates a result table. At each trigger interval, updated data is synchronized to the result table, and whenever the result table is updated, the updated results are written to an external storage system. There are three output modes in Structured Streaming at the output phase: Complete Mode, Append Mode, and Update Mode. In Complete Mode, a connector of the external system writes the entire updated result set to the external storage system. In Append Mode, when a trigger interval fires, only the rows newly added to the result table are written to the external system; this mode is applicable only when existing rows in the result set will not be updated. In Update Mode, when a trigger interval fires, only the rows updated in the result table are written to the external system, which is the difference between Append Mode and Update Mode.
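A hedged sketch of a Structured Streaming word count; the socket source, host, and port are illustrative, and outputMode can be set to "complete", "append", or "update" to match the modes described above.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("StructuredWordCount").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()                           // the unbounded input table

    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")                 // "value" is the implicit column of a Dataset[String]
      .count()                          // the result table

    val query = counts.writeStream
      .outputMode("complete")           // rewrite the whole result table at every trigger
      .format("console")
      .start()

    query.awaitTermination()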

12. Overview of Spark Streaming

Spark Streaming is a real-time computing framework built on top of Spark that extends Spark's capability for processing massive streaming data. Data can be ingested from many sources such as Kafka and HDFS, and can be processed using complex algorithms expressed with high-level functions such as map, reduce, join, and window. Finally, the processed data can be pushed out to file systems and databases.

13. Micro Batch Processing of Spark Streaming

The basic principle of Spark Streaming is to segment the input data by seconds or milliseconds and to periodically submit the segmented data, which decomposes stream processing into a series of short batch jobs. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka and Flume, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
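A minimal DStream sketch of this micro-batch model; the socket source, host, port, and the 1-second batch interval are illustrative assumptions.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StreamingWordCount")
    val ssc  = new StreamingContext(conf, Seconds(1))     // slice the stream into 1-second batches

    val lines  = ssc.socketTextStream("localhost", 9999)  // DStream of input lines
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                                 // applied to each micro batch (RDD)

    counts.print()
    ssc.start()
    ssc.awaitTermination()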

14. Fault Tolerance Mechanism of Spark Streaming

Spark Streaming performs its computation based on RDDs, so any partition that encounters an error can be regenerated from its parent RDD using the RDD lineage mechanism. If the parent RDD is also lost, Spark keeps looking up the chain of parent RDDs until the original data on disk is found.

15. Spark and Other Components
