Spark Seminar Report
A
Seminar/Major Project/IT ACT Seminar
submitted
in partial fulfilment
for the award of the Degree of
Bachelor of Technology
in the Department of Computer Science and Engineering
Hritik Singh
Enrolment No.: 1613310089
Counter Signed by
Name(s) of Supervisor(s)
.....................................
.....................................
CERTIFICATE
This is to certify that Hritik Singh of VIII Semester, B.Tech (Computer Science & Engineering),
has presented a major project/seminar/IT ACT seminar titled “Big Data Analytics using Apache
Spark” in partial fulfilment for the award of the degree of Bachelor of Technology under NIET.
Date:
ACKNOWLEDGEMENT
I take this opportunity to express my gratitude to all those people who have been directly and
indirectly with me during the completion of this project/seminar/IT ACT seminar.
I thank Mr. Surya Prakash Sharma, who gave me guidance and direction during this major
project. His versatile knowledge of “Big Data Analytics using Apache Spark” eased the critical
times during the span of this major project/seminar/IT ACT Seminar.
I acknowledge here our debt to those who contributed significantly to one or more steps. I take
full responsibility for any remaining sins of omission and commission.
Hritik Singh
B.Tech IV Year
(Computer Science & Engineering)
ABSTRACT
In today’s world, data is being generated at a faster rate than ever before. The velocity of data is
now as much of a headache for industries as its volume once was; conventional batch processing
systems and frameworks are no longer sufficient, and the need of the hour is stream data
processing systems. Apache Spark is one such framework, and it provides various advantages.
Apache Spark can act as both a batch processing and a stream processing engine, with a strong
API ecosystem unified into it. Spark is a general-purpose distributed data processing engine that
is suitable for use in a wide range of circumstances. On top of the Spark core data processing
engine, there are libraries for SQL, machine learning, graph computation, and stream processing,
which can be used together in an application. Programming languages supported by Spark
include Java, Python, Scala, and R. Application developers and data scientists incorporate Spark
into their applications to rapidly query, analyze, and transform data at scale. Tasks most
frequently associated with Spark include ETL and SQL batch jobs across large data sets;
processing of streaming data from sensors, IoT devices, or financial systems; and machine
learning tasks.
CONTENTS
Certificate .................................................................................................................................. i
Acknowledgement ..................................................................................................................... ii
Abstract ..................................................................................................................................... iii
List of Figures ........................................................................................................................... iv
Chapter 1: Introduction to Big Data .......................................................................................... 1
    1.1 What is Big Data ............................................................................................................. 1
    1.2 Characteristics of Big Data ............................................................................................. 1
    1.3 Volume of data ................................................................................................................ 1
    1.4 Velocity of data ............................................................................................................... 1
    1.5 Variety of data ................................................................................................................ 2
Chapter 2: Processing of Big Data ............................................................................................ 6
    2.1 Types of processing systems ........................................................................................... 6
    2.2 Batch Processing Model .................................................................................................. 7
    2.3 Stream Processing Model ................................................................................................ 9
    2.4 Advantages and Limitations ............................................................................................ 9
Chapter 3: Apache Spark Framework ....................................................................................... 14
    3.1 What is Apache Spark ..................................................................................................... 14
    3.2 History of Spark .............................................................................................................. 15
    3.3 MapReduce Word Count v/s Spark Word Count ............................................................ 16
    3.4 How a Spark Application runs on a cluster ..................................................................... 18
    3.5 What sets Spark apart? .................................................................................................... 21
    3.6 The power of data pipelines ............................................................................................ 22
References ................................................................................................................................. 23
LIST OF FIGURES
Fig 1.1 - Exponential increase in volume of data collected/generated
➢ Velocity - Velocity refers to the speed with which data is being generated. In the current
situation, velocity has become a bigger concern than volume. The sources responsible for
this high velocity include IoT applications that generate data through sensors every second,
as well as flights, satellites, and similar systems.
Sources of high-velocity data include mobile devices, scientific applications, and social media
(tracking all objects all the time).
➢ Variety – Variety refers to the different types of data, such as images, videos, audio, and
text.
There are four types of data:
i) Structured Data
ii) Semi Structured Data
iii) Quasi structured Data
iv) Unstructured Data
Fig 1.3 - Types of Data
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured'
data. Over time, the field of computer science has achieved great success in developing
techniques for working with this kind of data (where the format is well known in advance) and
in deriving value from it. However, we now foresee issues as the size of such data grows to a
huge extent, with typical sizes being in the range of multiple zettabytes.
Table 1.1
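As a brief illustration of why a well-known format is easy to work with, here is a minimal
Python sketch (the records and field names are invented for the example, not taken from this
report): because every record follows the same fixed schema, standard tooling can read and
query it directly.

    import csv
    import io

    # A tiny 'structured' dataset: every record follows the same fixed, known format.
    raw = ("id,name,department,salary\n"
           "1,Asha,Engineering,75000\n"
           "2,Ravi,Marketing,52000\n"
           "3,Meena,Engineering,81000\n")

    # Because the format is known in advance, standard tooling reads it directly.
    rows = list(csv.DictReader(io.StringIO(raw)))

    # Deriving value is then straightforward, e.g. average salary in one department.
    salaries = [int(r["salary"]) for r in rows if r["department"] == "Engineering"]
    print(sum(salaries) / len(salaries))  # -> 78000.0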
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to its
huge size, unstructured data poses multiple challenges in terms of processing it to derive value.
A typical example of unstructured data is a heterogeneous data source containing a combination
of simple text files, images, videos, etc. Organizations today have a wealth of data available to
them but, unfortunately, they don't know how to derive value from it, since this data is in its
raw, unstructured format.
Examples of Unstructured Data
Chapter 2: Processing of Big Data
This chapter deals with big data processing frameworks. Processing frameworks compute over
the data in the system, either by reading it from non-volatile storage or as it is ingested into the
system. Computing over data is the process of extracting information and insight from large
quantities of individual data points.
Some of the big data processing frameworks are:
1. Batch-only frameworks
a. Apache Hadoop
2. Stream-only frameworks
a. Apache Storm
b. Apache Samza
3. Hybrid frameworks
a. Apache Spark
b. Apache Flink
What Are Big Data Processing Frameworks?
Processing frameworks and processing engines are responsible for computing over
data in a data system. While there is no authoritative definition setting apart "engines"
from "frameworks", it is sometimes useful to define the former as the actual component
responsible for operating on data and the latter as a set of components designed to do the
same.
For instance, Apache Hadoop can be considered a processing framework with MapReduce
as its default processing engine. Engines and frameworks can often be swapped
out or used in tandem. For instance, Apache Spark, another framework, can hook into
Hadoop to replace MapReduce. This interoperability between components is one reason
that big data systems have great flexibility.
While the systems which handle this stage of the data life cycle can be complex, the
goals on a broad level are very similar: operate over data in order to increase
understanding, surface patterns, and gain insight into complex interactions.
These processing frameworks are grouped by the state of the data they are designed to
handle. Some systems handle data in batches, while others process data in a continuous
stream as it flows into the system. Still others can handle data in either of these ways.
2.2 Batch Processing Systems
Batch processing has a long history within the data world. Batch processing
involves operating over a large, static dataset and returning the result at a later time
when the computation is complete.
The datasets in batch processing are typically:
➢ Bounded: batch datasets represent a finite collection of data
➢ Persistent: data is almost always backed by some type of permanent storage
➢ Large: batch operations are often the only option for processing extremely large sets of data
Batch processing is well-suited for calculations where access to a complete set of
records is required. For instance, when calculating totals and averages, datasets must be
treated holistically instead of as a collection of individual records. These operations
require that state be maintained for the duration of the calculations.
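For instance, a minimal plain-Python sketch (the numbers are invented for the example) of why
an average requires the complete dataset: the running state must cover every record before the
result is valid.

    # An average cannot be finalized from individual records alone; state
    # (a running sum and count) must be maintained across the whole dataset.
    def batch_average(records):
        total, count = 0, 0
        for value in records:  # iterate over the complete, bounded dataset
            total += value
            count += 1
        return total / count

    print(batch_average([4, 8, 15, 16, 23, 42]))  # -> 18.0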
Tasks that require very large volumes of data are often best handled by batch
operations. Whether the datasets are processed directly from permanent storage or
loaded into memory, batch systems are built with large quantities in mind and have the
resources to handle them. Because batch processing excels at handling large volumes of
persistent data, it is frequently used with historical data.
The trade-off for handling large quantities of data is longer computation time. Because
of this, batch processing is not appropriate in situations where processing time is
especially significant.
Apache Hadoop
Apache Hadoop is a processing framework that exclusively provides batch processing.
Hadoop was the first big data framework to gain significant traction in the open-source
community. Based on several papers and presentations by Google about how they were
dealing with tremendous amounts of data at the time, Hadoop re-implemented the
algorithms and component stack to make large scale batch processing more accessible.
Modern versions of Hadoop are composed of several components or layers that work
together to process batch data:
HDFS: HDFS is the distributed filesystem layer that coordinates storage and
replication across the cluster nodes. HDFS ensures that data remains available
in spite of inevitable host failures. It is used as the source of data, to store
intermediate processing results, and to persist the final calculated results.
YARN: YARN, which stands for Yet Another Resource Negotiator, is the cluster
coordinating component of the Hadoop stack. It is responsible for coordinating
and managing the underlying resources and scheduling jobs to be run. YARN
makes it possible to run much more diverse workloads on a Hadoop cluster than
was possible in earlier iterations by acting as an interface to the cluster resources.
MapReduce: MapReduce is Hadoop's native batch processing engine.
Batch Processing Model
The processing functionality of Hadoop comes from the MapReduce engine.
MapReduce's processing technique follows the map, shuffle, reduce algorithm using
key-value pairs. The basic procedure involves:
1. Reading the dataset from the HDFS filesystem
2. Dividing the dataset into chunks and distributing them among the available nodes
3. Applying the computation on each node to its subset of the data (the intermediate
results are written back to HDFS)
4. Redistributing the intermediate results to group them by key
5. "Reducing" the values of each key by summarizing and combining the results
calculated by the individual nodes
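The following is a minimal single-machine Python sketch of this map, shuffle, reduce flow,
using word count as the classic example. It uses no Hadoop APIs and only mirrors the phases
described above; the input documents are invented for the example.

    from collections import defaultdict

    documents = ["spark makes big data simple", "big data needs big tools"]

    # Map: emit (key, value) pairs from each input chunk.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: redistribute the intermediate pairs, grouping values by key.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: summarize and combine the values calculated for each key.
    counts = {word: sum(values) for word, values in groups.items()}
    print(counts)  # e.g. {'spark': 1, 'big': 3, 'data': 2, ...}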
The idea behind Storm is to define small, discrete operations using the above
components and then compose them into a topology. By default, Storm offers at-least-once
processing guarantees, meaning that it can guarantee that each message is processed
at least once, but there may be duplicates in some failure scenarios. Storm does not
guarantee that messages will be processed in order.
Advantages and Limitations
Storm is probably the best solution currently available for near real-time processing. It is
able to handle data with extremely low latency for workloads that must be processed
with minimal delay. Storm is often a good choice when processing time directly affects
user experience, for example when feedback from the processing is fed directly back to a
visitor's page on a website.
Storm with Trident gives you the option to use micro-batches instead of pure stream
processing. While this gives users greater flexibility to shape the tool to an intended
use, it also tends to negate some of the software's biggest advantages over other
solutions. That being said, having a choice for the stream processing style is still
helpful.
Core Storm does not offer ordering guarantees for messages. Core Storm offers at-least-once
processing guarantees, meaning that processing of each message can be guaranteed,
but duplicates may occur. Trident offers exactly-once guarantees and can offer ordering
between batches, but not within them.
1. Distribute data: when a data file is uploaded into the cluster, it is split into
chunks, called data blocks, and distributed amongst the data nodes and replicated
across the cluster.
o The reducer process executes on its assigned node and works only on
its subset of the data (its sequence file). The output from the reducer
process is written to an output file.
3. Tolerate faults: both data and computation can tolerate failures by failing over
to another node for data or processing.
Chapter 3: Apache Spark Framework
Apache Spark™ began life in 2009 as a project within the AMPLab at the University of
California, Berkeley. Spark became an incubated project of the Apache Software
Foundation in 2013, and it was promoted early in 2014 to become one of the
Foundation’s top-level projects. Spark is currently one of the most active projects
managed by the Foundation, and the community that has grown up around the project
includes both prolific individual contributors and well-funded corporate backers, such as
Databricks, IBM, and China’s Huawei.
The goal of the Spark project was to keep the benefits of MapReduce’s scalable,
distributed, fault-tolerant processing framework, while making it more efficient
and easier to use.
The advantages of Spark over MapReduce are:
➢ Spark executes much faster by caching data in memory across multiple parallel
operations, whereas MapReduce involves more reading from and writing to disk.
➢ Spark runs multi-threaded tasks inside of JVM processes, whereas MapReduce runs
as heavier-weight JVM processes. This gives Spark faster startup, better parallelism,
and better CPU utilization.
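A hedged PySpark sketch of the caching advantage (it assumes a pyspark installation; the file
name and column name are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    # Hypothetical input file; any DataFrame source would do.
    df = spark.read.csv("events.csv", header=True, inferSchema=True)
    df.cache()  # keep the data in memory across the parallel operations below

    # Both actions below reuse the cached data instead of re-reading it from
    # disk, which is where much of Spark's advantage over MapReduce comes from.
    print(df.count())
    df.groupBy("event_type").count().show()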
Results are sent back to the driver application or can be saved to disk. Spark can run on
several cluster managers:
➢ Apache Mesos – a general cluster manager that can also run Hadoop applications
➢ Apache Hadoop YARN – the resource manager in Hadoop 2
➢ Kubernetes – an open-source system for automating deployment, scaling, and
management of containerized applications
Spark also has a local mode, where the driver and executors run as threads on your
computer instead of a cluster, which is useful for developing your applications from
a personal computer.
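For example, a minimal sketch of starting Spark in local mode (assuming pyspark is installed;
"local[*]" runs executors as threads on all available cores):

    from pyspark.sql import SparkSession

    # "local[*]" runs the driver and executors as threads on this machine,
    # so applications can be developed without access to a cluster.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("local-dev")
             .getOrCreate())

    print(spark.sparkContext.parallelize(range(100)).sum())  # -> 4950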
Machine learning: As data volumes grow, machine learning approaches become more
feasible and increasingly accurate. Software can be trained to identify and act upon
triggers within well-understood data sets before applying the same solutions to new and
unknown data. Spark’s ability to store data in memory and rapidly run repeated queries
makes it a good choice for training machine learning algorithms. Running broadly similar
queries again and again, at scale, significantly reduces the time required to go through
a set of possible solutions in order to find the most efficient algorithms.
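A hedged sketch of this training pattern with Spark's MLlib (the toy dataset and feature columns
are invented for the example): the training DataFrame is cached so that each of the algorithm's
repeated passes reads it from memory.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

    # A tiny labeled dataset; real training sets would be far larger.
    data = spark.createDataFrame(
        [(0.0, 1.2, 0.7), (1.0, 3.4, 2.1), (0.0, 0.9, 0.4), (1.0, 4.0, 2.8)],
        ["label", "f1", "f2"],
    )
    assembled = VectorAssembler(inputCols=["f1", "f2"],
                                outputCol="features").transform(data)
    train = assembled.cache()  # each training iteration rescans this in memory

    model = LogisticRegression(maxIter=10).fit(train)
    print(model.coefficients)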
Data integration: Data produced by different systems across a business is rarely clean
or consistent enough to simply and easily be combined for reporting or analysis. Extract,
transform, and load (ETL) processes are often used to pull data from different systems,
clean and standardize it, and then load it into a separate system for analysis. Spark (and
Hadoop) are increasingly being used to reduce the cost and time required for this ETL
process.
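A hedged sketch of such an ETL step in PySpark (the file paths and column names are
hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-demo").getOrCreate()

    # Extract: pull raw data produced by another system (hypothetical source).
    raw = spark.read.json("raw_orders.json")

    # Transform: clean and standardize it.
    clean = (raw
             .dropDuplicates(["order_id"])
             .withColumn("amount", F.col("amount").cast("double"))
             .filter(F.col("amount") > 0))

    # Load: write the result where the analysis system expects it.
    clean.write.mode("overwrite").parquet("clean_orders.parquet")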
A wide range of technology vendors have been quick to support Spark, recognizing the
opportunity to extend their existing big data products into areas where Spark delivers real
value, such as interactive querying and machine learning. Well-known companies such as
IBM and Huawei have invested significant sums in the technology, and a growing
number of startups are building businesses that depend in whole or in part upon Spark.
For example, in 2013 the Berkeley team responsible for creating Spark founded
Databricks, which provides a hosted end-to-end data platform powered by Spark. The
company is well-funded, having received $247 million across four rounds of investment
in 2013, 2014, 2016 and 2017, and Databricks employees continue to play a prominent
role in improving and extending the open source code of the Apache Spark project.
The major Hadoop vendors, including MapR, Cloudera, and Hortonworks, have all
moved to support YARN-based Spark alongside their existing products, and each
vendor is working to add value for its customers. Elsewhere, IBM, Huawei, and others
have all made significant investments in Apache Spark, integrating it into their own
products and contributing enhancements and extensions back to the Apache project.
Web-based companies, like Chinese search engine Baidu, e-commerce operation
Taobao, and social networking company Tencent, all run Spark-based operations at
scale.
3.5 What Sets Spark Apart?
There are many reasons to choose Spark, but the following three are key:
Simplicity: Spark’s capabilities are accessible via a set of rich APIs, all designed
specifically for interacting quickly and easily with data at scale. These APIs are well-
documented and structured in a way that makes it straightforward for data scientists
and application developers to quickly put Spark to work.
Speed: Spark is designed for speed, operating both in memory and on disk. Using Spark,
a team from Databricks tied for first place with a team from the University of California,
San Diego, in the 2014 Daytona GraySort benchmarking challenge
(https://spark.apache.org/news/spark-wins-daytona-gray-sort-100tb-benchmark.html).
The challenge involves processing a static data set; the Databricks team was able to
process 100 terabytes of data stored on solid-state drives in just 23 minutes, and the
previous winner took 72 minutes by using Hadoop and a different cluster configuration.
Spark can perform even better when supporting interactive queries of data stored in
memory. In those situations, there are claims that Spark can be 100 times faster than
Hadoop’s MapReduce.
3.6 The Power of Data Pipelines
Spark jobs perform multiple operations consecutively, in memory, spilling to disk only
when required by memory limitations. Spark simplifies the management of these
disparate processes, offering an integrated whole – a data pipeline that is easier to
configure, easier to run, and easier to maintain. In use cases such as ETL, these pipelines
can become extremely rich and complex, combining large numbers of inputs and a wide
range of processing steps into a unified whole that consistently delivers the desired
result.
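As an illustration, a minimal sketch of such a pipeline (the input file and column names are
hypothetical): several steps are chained in memory, and only the final result is written out.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

    # Each step feeds the next in memory; Spark spills to disk only if it must.
    result = (spark.read.csv("visits.csv", header=True, inferSchema=True)
              .filter(F.col("country") == "IN")
              .groupBy("page")
              .agg(F.count("*").alias("visits"))
              .orderBy(F.desc("visits")))

    # One pipeline to configure, run, and maintain, ending in a single write.
    result.write.mode("overwrite").csv("top_pages")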
SUMMARY
1. This chapter introduces Apache Spark and its history, and explores some of the
areas in which its particular set of capabilities shows the most promise.