Spark2x: Big Data Huawei Course
Spark2x
NOTICE
This document was generated from Huawei study material. Treat the information in this document as supporting material.
Spark is a distributed, memory-based computing engine: it stores intermediate processing data in memory. This feature distinguishes it from traditional computing engines such as MapReduce. What does Spark provide? Spark provides one-stop data analysis capabilities and supports stream processing in small batches, offline batch processing, SQL query, data mining, graph computing, and machine learning. Users can use these functions together in a single application.
In practical use, for example, batch processing can be used for data ETL (extract, transform, and load). Machine learning can be used on shopping websites to judge whether customer reviews are positive or negative. SQL queries can be used to query data in Hive, and stream processing is applicable to real-time services such as clickstream analysis, recommendation systems, and public opinion analysis.
Take a look at some features of Spark. First, lightweight: Spark is written in Scala, which is much more concise and expressive. Besides, it reuses the infrastructure of Hadoop and Mesos (an open-source framework for managing computer clusters), so Spark has only about thirty thousand lines of core code. Second, fast: Spark has sub-second latency for small datasets. Compared with MapReduce or Hive, Spark is much faster for large-dataset applications such as iterative machine learning, ad hoc queries, and graph computing. Spark also features in-memory computing, data locality, transmission optimization, and scheduling optimization. Third, smart: Spark reuses existing big data components. To be more specific, Spark integrates seamlessly with Hadoop, and its graph computing borrows the Pregel and PowerGraph APIs as well as PowerGraph's point-division (vertex-cut) approach. Pregel and PowerGraph are both distributed graph-parallel computing frameworks. The last one is flexible: Spark provides flexibility at different levels. For example, Spark supports new data operators, new data sources, and new language bindings. Spark also supports a variety of functions, such as in-memory computing, multi-iteration batch processing, ad hoc queries, streaming, and graph computing.
3. Spark Ecosystem
4. Spark vs MapReduce
We've learned about MapReduce before: it is also a computing framework, used for parallel computing over massive datasets. In MapReduce, intermediate data generated by Map tasks is written to local disk as MOF files, waiting to be fetched by Reduce tasks. The intermediate data of Spark, by contrast, is stored in memory, so Spark improves computing efficiency and reduces the latency of iterative operations and batch processing. Spark is therefore well suited to iterative computation: as repeated operations increase, more data needs to be read, and Spark brings greater benefit because its intermediate data stays in memory. The performance of Spark can be over one hundred times higher than that of MapReduce in specific scenarios with many iterations; for scenarios with few iterations, Spark does not have many advantages over MapReduce. Spark also provides a more flexible programming model and higher development efficiency, because it offers more dataset operation types. Finally, Spark has a higher fault tolerance capability thanks to its lineage mechanism.
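To make this concrete, here is a minimal Scala sketch (the HDFS path is hypothetical) of the caching behavior that gives Spark its edge in multi-pass jobs:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("CacheExample")
    val sc = new SparkContext(conf)

    // Parse once and keep the result in memory (hypothetical input path).
    val nums = sc.textFile("hdfs:///data/numbers.txt")
      .map(_.toLong)
      .cache()

    // Both passes below reuse the in-memory copy; in MapReduce each pass
    // would re-read the input and shuffle through local disk (MOF files).
    val total = nums.sum()
    val maximum = nums.max()
    println(s"sum=$total max=$maximum")
    sc.stop()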
Spark revolves around the concept of the RDD (Resilient Distributed Dataset). The RDD is the fundamental data structure of Spark: a read-only, partitioned collection of records. RDDs are stored in memory by default and spill to disk when memory is insufficient. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster; this improves performance through data locality. There are two ways to create RDDs: an RDD can be created from a Hadoop file system such as HDFS (or any storage system compatible with Hadoop), or it can be derived from a parent RDD through a transformation. Besides, RDDs support a lineage mechanism, which records the dependency chain: an RDD remembers how it evolved from another RDD through its lineage, so data can be recovered quickly when data loss occurs.
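A minimal sketch of the two creation paths, assuming a hypothetical HDFS path; the filter transformation produces a child RDD whose lineage points back to the file-based parent:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("RddCreation")
    val sc = new SparkContext(conf)

    // Path 1: create an RDD from a Hadoop-compatible storage system.
    val lines = sc.textFile("hdfs:///data/logs/app.log")

    // Path 2: derive a child RDD from a parent RDD via a transformation.
    // The child records its lineage (a filter over `lines`), so a lost
    // partition can be recomputed from the parent instead of replicated.
    val errors = lines.filter(_.contains("ERROR"))

    println(errors.count())
    sc.stop()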
The Driver is responsible for the application's business logic and operation planning. The ApplicationMaster manages application resources and applies for resources based on the application's needs. The Client submits applications.
Spark computing and scheduling can be implemented in YARN mode: Spark uses the computing resources provided by YARN clusters and runs tasks in a distributed way. Spark on YARN involves two modes, YARN client and YARN cluster. In YARN client mode, the Driver is deployed and runs on the client. In YARN cluster mode, the Client first generates the application information and then sends it to the ResourceManager. The ResourceManager allocates a Container (for the ApplicationMaster) to the Spark application and starts the Driver on that Container node. After that, the ApplicationMaster applies to the ResourceManager for resources to run Executors. The ResourceManager allocates Containers to the ApplicationMaster, which communicates with the related NodeManagers and starts Executors on the obtained Containers. After an Executor starts, it registers with the Driver.
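For reference, the standard spark-submit command selects between the two modes with its --deploy-mode flag; the application class and JAR names below are hypothetical:

    # YARN client mode: the Driver runs in the local spark-submit process.
    spark-submit --master yarn --deploy-mode client \
      --class com.example.MyApp myapp.jar

    # YARN cluster mode: the Driver runs inside the ApplicationMaster's Container.
    spark-submit --master yarn --deploy-mode cluster \
      --class com.example.MyApp myapp.jar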
Spark SQL is a module for processing structured data. In a Spark application, SQL statements and DataFrame APIs can be used interchangeably to query structured data. Simply speaking, Spark SQL is a module that parses SQL into RDD operations and then uses Spark Core to execute them. Here, a DataFrame is a distributed collection in which data is organized into named columns. Spark SQL and DataFrames also provide a universal method for accessing multiple data sources, such as Hive, CSV, ORC, and JSON, and these data sources allow data interaction. Spark SQL reuses the Hive front-end processing logic and metadata processing module, so with Spark SQL you can directly query existing Hive data. In addition, Spark SQL provides API, CLI, and JDBC interfaces, allowing diverse inputs from clients.
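A minimal sketch, assuming a hypothetical JSON file, of using the DataFrame API and an SQL statement over the same data:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("SparkSqlExample")
      .enableHiveSupport() // lets Spark SQL query existing Hive tables
      .getOrCreate()

    // Load a structured data source (hypothetical path) into a DataFrame.
    val people = spark.read.json("hdfs:///data/people.json")

    // DataFrame API and SQL are interchangeable views on the same engine.
    people.filter(people("age") > 21).show()

    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()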
7. Introduction to Dataset
8. Introduction to DataFrame
Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row. A DataFrame is a structured, distributed dataset composed of several named columns, similar to a table in a relational database or a data frame in R or Python. DataFrame is a basic concept in Spark SQL and can be created in multiple ways, for example from a structured dataset, a Hive table, an external database, or an RDD. A DataFrame carries data structure information, that is, a schema.
In a Dataset, the row type is not fixed to Row: after defining a case class, you can obtain the typed information in each row. In a DataFrame, the information in the columns is clear, but the information in the rows is not.
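A short sketch of the contrast; the Person case class is a hypothetical example type:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DatasetExample").getOrCreate()
    import spark.implicits._

    // Hypothetical case class giving each row a fixed, compile-time type.
    case class Person(name: String, age: Long)

    // Dataset[Person]: fields are accessed in a type-safe way.
    val ds = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()
    ds.map(p => p.name).show() // the compiler knows p is a Person

    // DataFrame = Dataset[Row]: columns are referenced by name, untyped.
    val df = ds.toDF()
    df.select("name").show()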
Structured Streaming is an engine built on Spark SQL to process streaming data. It is programmed using Scala and provides fault-tolerance capability. Streaming computations are written in the same way as applications over static RDD data: as streaming data is incrementally and continuously produced, Spark SQL continues to process the data and synchronizes the results to the result set.
Each query operation generates a result table. At each trigger interval, updated data is synchronized to the result table, and whenever the result table is updated, the updated results are written to the external sink.
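A minimal sketch, assuming a hypothetical socket source on localhost:9999, of a streaming word count whose result table is continuously updated:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("StructuredWordCount").getOrCreate()
    import spark.implicits._

    // Read an unbounded stream of lines (hypothetical socket source).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The query is written as if `lines` were a static DataFrame.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Every trigger updates the result table; updated rows go to the sink.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()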
The basic principle of Spark Streaming is to segment the input data by second or millisecond and periodically submit the segmented data, which decomposes stream processing into a series of short batch jobs. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka and Flume, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
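A minimal DStream sketch, assuming a hypothetical socket source and a 1-second batch interval; each batch becomes one RDD in the DStream:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("DStreamWordCount")
    // Segment the input into 1-second batches (each batch is one RDD).
    val ssc = new StreamingContext(conf, Seconds(1))

    // Hypothetical live source; Kafka or Flume receivers work the same way.
    val lines = ssc.socketTextStream("localhost", 9999)

    // High-level operations on a DStream build new DStreams.
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print() // print each batch's result

    ssc.start()
    ssc.awaitTermination()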