Spark Seminar Report


“Big Data Analytics Using Apache Spark”

A
Seminar/Major Project/IT ACT Seminar

submitted
in partial fulfilment
for the award of the Degree of
Bachelor of Technology
in the Department of Computer Science and Engineering

Supervisor:                              Submitted By:
Surya Prakash Sharma                     Hritik Singh
Assistant Professor, CSE                 1613310089

Department of Computer Science and Engineering


Noida Institute of Engineering and Technology
AKTU LUCKNOW
2020
Candidate’s Declaration
I hereby declare that the work, which is being presented in the Major Project/Seminar/IT ACT
Seminar, entitled “Big Data Analytics Using Apache Spark” in partial fulfilment for the award of
the Degree of “Bachelor of Technology” in the Department of Computer Science and Engineering,
and submitted to the Department of Computer Science and Engineering, NIET Gr. Noida, Abdul
Kalam Technical University, is a record of my own investigations carried out under the guidance
of Mr. Surya Prakash Sharma, Department of Computer Science and Engineering, Noida Institute
of Engineering and Technology.
I have not submitted the matter presented in this report anywhere for the award of any other Degree.

Hritik Singh
Enrolment No.: 1613310089

Noida Institute of Engineering and Technology

Counter Signed by
Name(s) of Supervisor(s)
.....................................
.....................................
CERTIFICATE

This is to certify that Hritik Singh of VIII Semester, B.Tech (Computer Science & Engineering),
has presented a major project/seminar/IT ACT seminar titled “Big Data Analytics using Apache
Spark” in partial fulfilment for the award of the degree of Bachelor of Technology under NIET.

Date:

<Name>                       Surya Prakash Sharma         Dr. CS Yadav
Project Co-ordinator         Supervisor                   H.O.D
ACKNOWLEDGMENET

I take this opportunity to express my gratitude to all those people who have been directly and
indirectly with me during the completion of this project/seminar/IT ACT seminar.
I thank Mr. Surya Prakash Sharma, who has given guidance and direction to me during this
major project. His versatile knowledge of “Big Data Analytics Using Apache Spark” has eased
the critical times during the span of this major project/seminar/IT ACT Seminar.
I acknowledge here our debt to those who contributed significantly to one or more steps. I take
full responsibility for any remaining errors of omission and commission.

Hritik Singh
B.Tech IV Year
(Computer Science & Engineering)
ABSTRACT

In today’s world, data is being generated at a faster rate than ever before. The velocity of data
has become as big a challenge for industries as its volume once was, and conventional batch
processing systems and frameworks are no longer sufficient. The need of the hour is stream data
processing systems, and Apache Spark is a framework that provides this capability along with
various other advantages.

Apache Spark can act as both a batch processing and a stream processing engine, with a strong,
unified API ecosystem. Spark is a general-purpose distributed data processing engine that is
suitable for use in a wide range of circumstances. On top of the Spark core data processing
engine, there are libraries for SQL, machine learning, graph computation, and stream processing,
which can be used together in an application. Programming languages supported by Spark
include Java, Python, Scala, and R. Application developers and data scientists incorporate Spark
into their applications to rapidly query, analyze, and transform data at scale. Tasks most
frequently associated with Spark include ETL and SQL batch jobs across large data sets,
processing of streaming data from sensors, IoT, or financial systems, and machine learning tasks.
CONTENTS
Certificate ..................................................................................................................................i
Acknowledgement.................................................................................................................... ii
Abstract.................................................................................................................................... iii
List of Figures …......................................................................................................................iv
Chapter 1: Introduction to Big Data……..................................................................................1
1.1 What is Big Data……......................................................................................................1
1.2 Characteristics of Big Data……………………………………………………………...1
1.3 Volume of data ………………………………………………………………………….1
1.4 Velocity of data ……………………………………………………………………........1
1.5 Variety of data ………………………………………………………………….……….2
Chapter 2: Processing of Big Data ............................................................................................6
2.1 What Are Big Data Processing Frameworks…...................................................................6
2.2 Batch Processing Systems………………………………………………………………...7
2.3 Advantages and Limitations……………………………………………………………….9
2.4 Stream Processing Systems………………………………………………………………..9
Chapter 3: Apache Spark Framework…………………………………………………….....14
3.1 What is Apache Spark…………………………………………………………………14
3.2 History of Spark……………………………………………………………………….15
3.3 Map Reduce Word Count v/s Spark Word Count……………………………………..16
3.4 How a Spark Application runs on a cluster……………………………………………18
3.5 What sets Spark apart? ...................................................................................................21
3.6 The power of data pipelines……………………………………………………………22
LIST OF FIGURES

Fig. 1.1 Exponential Increase in Volume of data.................................................................................1


Fig. 1.2 Sources of High Velocity data …...........................................................................................2
Fig. 1.3 Types of Data… …................................................................................................................15
Chapter 1
INTRODUCTION TO BIG DATA

1.1 What is Big Data?


Big Data is the term for a collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or traditional data processing
applications.
Another definition, as per Wikipedia: “Big data is a field that treats ways to analyze,
systematically extract information from, or otherwise deal with data sets that are too large or
complex to be dealt with by traditional data-processing application software.”
1.2 Characteristics of Big Data
Big data is mainly characterised by 3 V’s:
i) Volume
ii) Velocity
iii) Variety
➢ Volume - Volume refers to the amount of data that is being generated through websites,
portals, sensors, etc.
Usually the data that falls under the big data category ranges in volume from hundreds of
terabytes (TB) to zettabytes (ZB).
Data has been increasing exponentially over the past few years: from 2009 to 2020 it grew by
about 44 times, from roughly 0.8 ZB of data in the entire world in 2009 to about 35 ZB in 2020.
The following graph illustrates this exponential growth.

Fig 1.1 - Exponential increase in volume of data collected/generated
➢ Velocity - Velocity refers to the speed with which data is being generated. In the current
situation, velocity has become a bigger concern than volume. Sources responsible for this high
velocity include IoT applications that generate data every second through sensors, as well as
flights, satellites, etc.

Fig 1.2 – Sources of high velocity data: mobile devices, scientific applications (tracking all
objects all the time), social media, and sensor technology and networks

➢ Variety – Variety refers to the different types of data like – images, videos, audio, text
etc.
There are 4 types of data:
i) Structured Data
ii) Semi Structured Data
iii) Quasi structured Data
iv) Unstructured Data
Fig 1.3 - Types of Data

Structured
Any data that can be stored, accessed and processed in the form of fixed format is termed
as a 'structured' data. Over the period of time, talent in computer science has achieved
greater success in developing techniques for working with such kind of data (where the
format is well known in advance) and also deriving value out of it. However, nowadays,
we are foreseeing issues when a size of such data grows to a huge extent, typical sizes
are being in the rage of multiple zettabytes.

Examples of Structured Data

An 'Employee' table in a database is an example of Structured Data

Employee_ID Employee_Name Department

2365 Rajesh Kulkarni Finance

3398 Pratibha Joshi Admin


7465 Shushil Roy Admin

7500 Shubhojit Das Finance

7699 Priya Sane Finance

Table 1.1

Unstructured
Any data with unknown form or the structure is classified as unstructured data. In addition to the
size being huge, un-structured data poses multiple challenges in terms of its processing for
deriving value out of it. A typical example of unstructured data is a heterogeneous data source
containing a combination of simple text files, images, videos etc. Now day organizations have
wealth of data available with them but unfortunately, they don't know how to derive value out of
it since this data is in its raw form or unstructured format
Examples of Un-structured Data

An Image is unstructured data


Semi-structured
Semi-structured data can contain both the forms of data. We can see semi-structured data
as a structured in form but it is actually not defined with e.g. a table definition in
relational DBMS. Example of semi-structured data is a data represented in an XML file.
Examples of Semi-structured Data
Personal data stored in an XML file-
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
Chapter 2
PROCESSING OF BIG DATA

This chapter deals with the big data processing frameworks. Processing frameworks
compute over the data in the system, either by reading from non-volatile storage or as it is
ingested into the system. Computing over data is the process of extracting information
and insight from large quantities of individual data points.
Some of the big data processing frameworks are:
1. Batch-only frameworks
a. Apache Hadoop
2. Stream-only frameworks
a. Apache Storm
b. Apache Samza
3. Hybrid frameworks
a. Apache Spark
b. Apache Flink
2.1 What Are Big Data Processing Frameworks?
Processing frameworks and processing engines are responsible for computing over
data in a data system. While there is no authoritative definition setting apart "engines"
from "frameworks", it is sometimes useful to define the former as the actual component
responsible for operating on data and the latter as a set of components designed to do the
same.
For instance, Apache Hadoop can be considered a processing framework with MapReduce as its
default processing engine. Engines and frameworks can often be swapped out or used in tandem.
For instance, Apache Spark, another framework, can hook into Hadoop to replace MapReduce.
This interoperability between components is one reason that big data systems have great
flexibility.
While the systems which handle this stage of the data life cycle can be complex, the
goals on a broad level are very similar: operate over data in order to increase
understanding, surface patterns, and gain insight into complex interactions.
These processing frameworks are grouped by the state of the data they are designed to
handle. Some systems handle data in batches, while others process data in a continuous
stream as it flows into the system. Still others can handle data in either of these ways.
2.2 Batch Processing Systems
Batch processing has a long history within the data world. Batch processing
involves operating over a large, static dataset and returning the result at a later time
when the computation is complete.
The datasets in batch processing are typically:
bounded: batch datasets represent a finite collection of data
persistent: data is almost always backed by some type of permanent storage
large: batch operations are often the only option for processing extremely large sets of data
Batch processing is well-suited for calculations where access to a complete set of
records is required. For instance, when calculating totals and averages, datasets must be
treated holistically instead of as a collection of individual records. These operations
require that state be maintained for the duration of the calculations.
Tasks that require very large volumes of data are often best handled by batch
operations. Whether the datasets are processed directly from permanent storage or
loaded into memory, batch systems are built with large quantities in mind and have the
resources to handle them. Because batch processing excels at handling large volumes of
persistent data, it frequently is used with historical data.
The trade-off for handling large quantities of data is longer computation time. Because
of this, batch processing is not appropriate in situations where processing time is
especially significant.
Apache Hadoop
Apache Hadoop is a processing framework that exclusively provides batch processing.
Hadoop was the first big data framework to gain significant traction in the open-source
community. Based on several papers and presentations by Google about how they were
dealing with tremendous amounts of data at the time, Hadoop re-implemented the
algorithms and component stack to make large scale batch processing more accessible.
Modern versions of Hadoop are composed of several components or layers that work
together to process batch data:
HDFS: HDFS is the distributed filesystem layer that coordinates storage and
replication across the cluster nodes. HDFS ensures that data remains available
in spite of inevitable host failures. It is used as the source of data, to store
intermediate processing results, and to persist the final calculated results.

YARN: YARN, which stands for Yet Another Resource Negotiator, is the cluster
coordinating component of the Hadoop stack. It is responsible for coordinating
and managing the underlying resources and scheduling jobs to be run. YARN
makes it possible to run much more diverse workloads on a Hadoop cluster than
was possible in earlier iterations by acting as an interface to the cluster resources.
MapReduce: MapReduce is Hadoop's native batch processing engine.
Batch Processing Model
The processing functionality of Hadoop comes from the MapReduce engine.
MapReduce's processing technique follows the map, shuffle, reduce algorithm using
key-value pairs. The basic procedure involves:
Reading the dataset from the HDFS filesystem
Dividing the dataset into chunks and distributing them among the available nodes
Applying the computation on each node to its subset of data (the intermediate
results are written back to HDFS)
Redistributing the intermediate results to group them by key
"Reducing" the values of each key by summarizing and combining the
results calculated by the individual nodes
Writing the calculated final results back to HDFS (a word-count sketch of this flow follows below)
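
To make the map–shuffle–reduce flow above concrete, the following is a minimal word-count sketch in
the Hadoop Streaming style (an assumption made here for illustration, since Hadoop's native MapReduce
engine is written in Java; Hadoop Streaming lets any executable act as mapper or reducer via standard
input and output). The mapper emits (word, 1) pairs, the framework's shuffle-and-sort phase groups them
by key, and the reducer sums the counts per word.

#!/usr/bin/env python3
"""Illustrative word-count mapper and reducer in the Hadoop Streaming style.

Run locally to simulate the flow (here `sort` stands in for the shuffle phase):
    cat input.txt | python3 wordcount_streaming.py map \
        | sort | python3 wordcount_streaming.py reduce
"""
import sys


def mapper():
    # Map step: emit one (word, 1) pair per word; input lines arrive on stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    # Reduce step: input arrives sorted by key, so counts for a word are contiguous.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()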


2.3 Advantages and Limitations
Because this methodology heavily leverages permanent storage, reading and writing
multiple times per task, it tends to be fairly slow. On the other hand, since disk space is
typically one of the most abundant server resources, it means that MapReduce can handle
enormous datasets. This also means that Hadoop's MapReduce can typically run on less
expensive hardware than some alternatives since it does not attempt to store everything in
memory. MapReduce has incredible scalability potential and has been used in production
on tens of thousands of nodes.
As a target for development, MapReduce is known for having a rather steep learning
curve. Other additions to the Hadoop ecosystem can reduce the impact of this to
varying degrees, but it can still be a factor in quickly implementing an idea on a
Hadoop cluster.
Hadoop has an extensive ecosystem, with the Hadoop cluster itself frequently used as a
building block for other software. Many other processing frameworks and engines
have Hadoop integrations to utilize HDFS and the YARN resource manager.
2.4 Stream Processing Systems
Stream processing systems compute over data as it enters the system. This requires a
different processing model than the batch paradigm. Instead of defining operations to
apply to an entire dataset, stream processors define operations that will be applied to
each individual data item as it passes through the system.
The datasets in stream processing are considered "unbounded". This has a few
important implications:
The total dataset is only defined as the amount of data that has entered the
system so far.
The working dataset is perhaps more relevant, and is limited to a single item at a
time.
Processing is event-based and does not "end" until explicitly stopped. Results are
immediately available and will be continually updated as new data arrives.
Stream processing systems can handle a nearly unlimited amount of data, but they only
process one item (true stream processing) or very few items (micro-batch processing) at a time,
with minimal state being maintained in between records. While most systems provide
methods of maintaining some state, stream processing is highly optimized for more functional
processing with few side effects.
Functional operations focus on discrete steps that have limited state or side-effects. Performing
the same operation on the same piece of data will produce the same output independent of other
factors. This kind of processing fits well with streams because state between items is usually
some combination of difficult, limited, and sometimes undesirable. So while some type of state
management is usually possible, these frameworks are much simpler and more efficient in their
absence.
This type of processing lends itself to certain types of workloads. Processing with near real-
time requirements is well served by the streaming model. Analytics, server or application error
logging, and other time-based metrics are a natural fit because reacting to changes in these
areas can be critical to business functions. Stream processing is a good fit for data where you
must respond to changes or spikes and where you're interested in trends over time.
Apache Storm
Apache Storm is a stream processing framework that focuses on extremely low latency and is
perhaps the best option for workloads that require near real-time processing. It can handle very
large quantities of data and deliver results with less latency than other solutions.
Stream Processing Model
Storm stream processing works by orchestrating DAGs (Directed Acyclic Graphs) in a
framework it calls topologies. These topologies describe the various transformations or steps that
will be taken on each incoming piece of data as it enters the system.
The topologies are composed of:
Streams: Conventional data streams. This is unbounded data that is continuously arriving at the
system.
Spouts: Sources of data streams at the edge of the topology. These can be APIs, queues, etc.
that produce data to be operated on.
Bolts: Bolts represent a processing step that consumes streams, applies an operation to them,
and outputs the result as a stream. Bolts are connected to each of the spouts, and then connect
to each other to arrange all of the necessary processing. At the end of the topology, final bolt
output may be used as an input for a connected system.

The idea behind Storm is to define small, discrete operations using the above
components and then compose them into a topology. By default, Storm offers at-least-once
processing guarantees, meaning that it can guarantee that each message is processed
at least once, but there may be duplicates in some failure scenarios. Storm does not
guarantee that messages will be processed in order.
Advantages and Limitations

Storm is probably the best solution currently available for near real-time processing. It is
able to handle data with extremely low latency for workloads that must be processed
with minimal delay. Storm is often a good choice when processing time directly affects
user experience, for example when feedback from the processing is fed directly back to a
visitor's page on a website.

Storm with Trident gives you the option to use micro-batches instead of pure stream
processing. While this gives users greater flexibility to shape the tool to an intended
use, it also tends to negate some of the software's biggest advantages over other
solutions. That being said, having a choice for the stream processing style is still
helpful.

Core Storm does not offer ordering guarantees for messages. Core Storm offers at-least-once
processing guarantees, meaning that processing of each message can be guaranteed but
duplicates may occur. Trident offers exactly-once guarantees and can offer ordering between
batches, but not within them.

In terms of interoperability, Storm can integrate with Hadoop's YARN resource negotiator,
making it easy to hook up to an existing Hadoop deployment. More than most processing
frameworks, Storm has very wide language support, giving users many options for defining
topologies.
Chapter 3
APACHE SPARK FRAMEWORK

3.1 What Is Apache Spark?


Spark is a general-purpose distributed data processing engine that is suitable for use in
a wide range of circumstances. On top of the Spark core data processing engine, there
are libraries for SQL, machine learning, graph computation, and stream processing,
which can be used together in an application. Programming languages supported by
Spark include: Java, Python, Scala, and R. Application developers and data scientists
incorporate Spark into their applications to rapidly query, analyze, and transform data
at scale. Tasks most frequently associated with Spark include ETL and SQL batch jobs
across large data sets, processing of streaming data from sensors, IoT, or financial
systems, and machine learning tasks.

Fig 3.1 - Spark Architecture


3.2 History of Spark
In order to understand Spark, it helps to understand its history. Before Spark, there was
MapReduce, a resilient distributed processing framework, which enabled Google to
index the exploding volume of content on the web, across large clusters of commodity
servers.

Fig 3.2 Spark Node


There were 3 core concepts to the Google strategy:

1. Distribute data: when a data file is uploaded into the cluster, it is split into
chunks, called data blocks, and distributed amongst the data nodes and replicated
across the cluster.

2. Distribute computation: users specify a map function that processes a key/value
pair to generate a set of intermediate key/value pairs and a reduce function that
merges all intermediate values associated with the same intermediate key.
Programs written in this functional style are automatically parallelized and
executed on a large cluster of commodity machines in the following way:
o The mapping process runs on each assigned data node, working only on its
block of data from a distributed file.
o The results from the mapping processes are sent to the reducers in a
process called "shuffle and sort": key/value pairs from the mappers are
sorted by key, partitioned by the number of reducers, and then sent
across the network and written to key sorted "sequence files" on the
reducer nodes.

o The reducer process executes on its assigned node and works only on
its subset of the data (its sequence file). The output from the reducer
process is written to an output file.

3. Tolerate faults: both data and computation can tolerate failures by failing over
to another node for data or processing.

3.3 MapReduce Word Count vs. Spark Word Count


Some iterative algorithms, like PageRank, which Google used to rank websites in their
search engine results, require chaining multiple MapReduce jobs together, which causes
a lot of reading and writing to disk. When multiple MapReduce jobs are chained
together, for each MapReduce job, data is read from a distributed file block into a map
process, written to and read from a SequenceFile in between, and then written to an
output file from a reducer process.
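
For comparison, the same word count expressed with Spark's RDD API keeps the intermediate data in
memory and chains the map and reduce steps inside a single job, avoiding the repeated disk round trips
described above. This is only an illustrative sketch; the input and output paths are placeholders.

# Minimal PySpark word count, for contrast with the chained MapReduce flow.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("hdfs:///data/input.txt")        # read the distributed file (placeholder path)
      .flatMap(lambda line: line.split())         # map: emit individual words
      .map(lambda word: (word, 1))                # pair each word with a count of 1
      .reduceByKey(lambda a, b: a + b)            # shuffle + reduce: sum the counts per word
)
counts.saveAsTextFile("hdfs:///data/wordcount-output")   # placeholder output path
spark.stop()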

A year after Google published a white paper describing the MapReduce
framework (2004), Doug Cutting and Mike Cafarella created Apache Hadoop.

Apache Spark™ began life in 2009 as a project within the AMPLab at the University of
California, Berkeley. Spark became an incubated project of the Apache Software
Foundation in 2013, and it was promoted early in 2014 to become one of the
Foundation’s top-level projects. Spark is currently one of the most active projects
managed by the Foundation, and the community that has grown up around the project
includes both prolific individual contributors and well-funded corporate backers, such as
Databricks, IBM, and China’s Huawei.
The goal of the Spark project was to keep the benefits of MapReduce’s scalable,
distributed, fault-tolerant processing framework, while making it more efficient
and easier to use.
The advantages of Spark over MapReduce are:
Spark executes much faster by caching data in memory across multiple parallel
operations, whereas MapReduce involves more reading and writing from disk (see
the caching sketch after this list).
Spark runs multi-threaded tasks inside of JVM processes, whereas MapReduce runs
as heavier-weight JVM processes. This gives Spark faster startup, better parallelism,
and better CPU utilization.
Spark provides a richer functional programming model than MapReduce. Spark is
especially useful for parallel processing of distributed data with iterative algorithms.
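
As a brief illustration of the first advantage, the following sketch caches a dataset in memory and
reuses it across several passes, which is the access pattern iterative algorithms rely on. The file path
and column name are assumptions made only for illustration.

# Sketch of in-memory caching: the dataset is materialised once and reused on
# every pass, instead of being re-read from disk as chained MapReduce jobs would.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachingDemo").getOrCreate()

ratings = spark.read.parquet("hdfs:///data/ratings.parquet").cache()   # placeholder path

for threshold in [1.0, 2.0, 3.0, 4.0]:
    # Each iteration reuses the cached data rather than rereading it from storage.
    print(threshold, ratings.filter(ratings["rating"] >= threshold).count())

spark.stop()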

3.4 How a Spark Application Runs on a Cluster:


The figure below shows a Spark application running on a cluster.
A Spark application runs as independent processes, coordinated by the Spark
Session object in the driver program.
The resource or cluster manager assigns tasks to workers, one task per partition.
A task applies its unit of work to the dataset in its partition and outputs a new
partition dataset. Because iterative algorithms apply operations repeatedly to
data, they benefit from caching datasets across iterations.

Results are sent back to the driver application or can be saved to disk.

Spark supports the following resource/cluster managers:

Spark Standalone – a simple cluster manager included with Spark
Apache Mesos – a general cluster manager that can also run Hadoop applications
Apache Hadoop YARN – the resource manager in Hadoop 2
Kubernetes – an open source system for automating deployment,
scaling, and management of containerized applications

Spark also has a local mode, where the driver and executors run as threads on your
computer instead of a cluster, which is useful for developing your applications from
a personal computer.
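
A minimal sketch of creating the SparkSession described above is shown below. The master URL selects
the resource manager: "local[*]" corresponds to the local mode just mentioned, while values such as
"yarn", "spark://host:7077" (standalone) or a Kubernetes URL would hand the application to a cluster
manager. The application name and the memory setting are illustrative assumptions.

# Create the driver's SparkSession; the master URL picks the cluster manager.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ClusterDemo")
    .master("local[*]")                       # local mode: driver and executors as threads
    .config("spark.executor.memory", "2g")    # illustrative resource setting
    .getOrCreate()
)

print(spark.sparkContext.master)
spark.stop()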

What Does Spark Do?

Spark is capable of handling several petabytes of data at a time, distributed across a
cluster of thousands of cooperating physical or virtual servers. It has an extensive set of
developer libraries and APIs and supports languages such as Java, Python, R, and Scala;
its flexibility makes it well-suited for a range of use cases. Spark is often used with
distributed data stores such as MapR XD, Hadoop’s HDFS, and Amazon’s S3, with
popular NoSQL databases such as MapR-DB, Apache HBase, Apache Cassandra, and
MongoDB, and with distributed messaging stores such as MapR-ES and Apache Kafka.
Typical use cases include:
Stream processing: From log files to sensor data, application developers are increasingly
having to cope with "streams" of data. This data arrives in a steady stream, often from
multiple sources simultaneously. While it is certainly feasible to store these data streams
on disk and analyze them retrospectively, it can sometimes be sensible or important to
process and act upon the data as it arrives. Streams of data related to financial
transactions, for example, can be processed in real time to identify– and refuse–
potentially fraudulent transactions.
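
As an illustrative sketch of this use case, the following uses Spark's Structured Streaming API to
maintain a running word count over text arriving on a network socket. The host and port are placeholders
(a test source can be started with a tool such as netcat: nc -lk 9999).

# Running word count over a socket stream with Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

lines = (
    spark.readStream.format("socket")    # unbounded source
    .option("host", "localhost")         # placeholder host
    .option("port", 9999)                # placeholder port
    .load()
)

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Results are updated continuously as new data arrives.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()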

Machine learning: As data volumes grow, machine learning approaches become more
feasible and increasingly accurate. Software can be trained to identify and act upon
triggers within well-understood data sets before applying the same solutions to new and
unknown data. Spark’s ability to store data in memory and rapidly run repeated queries
makes it a good choice for training machine learning algorithms. Running broadly similar
queries again and again, at scale, significantly reduces the time required to go through
a set of possible solutions in order to find the most efficient algorithms.
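
A sketch of this use case with Spark MLlib is shown below: the training data is cached so that the
iterative optimizer can scan it repeatedly in memory. The input path, feature columns, and label column
are assumptions, not taken from the report.

# Train a simple classifier with Spark MLlib on cached, in-memory data.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("MLDemo").getOrCreate()

df = spark.read.parquet("hdfs:///data/transactions.parquet")   # placeholder path

# Assemble the (assumed) numeric columns into a single feature vector.
assembler = VectorAssembler(inputCols=["amount", "hour", "merchant_risk"],
                            outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

# The iterative optimizer repeatedly scans the cached training data.
model = LogisticRegression(labelCol="is_fraud", featuresCol="features").fit(train.cache())
print(model.evaluate(test).areaUnderROC)

spark.stop()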

Interactive analytics: Rather than running pre-defined queries to create static
dashboards of sales or production line productivity or stock prices, business analysts and
data scientists want to explore their data by asking a question, viewing the result, and
then either altering the initial question slightly or drilling deeper into results. This
interactive query process requires systems such as Spark that are able to respond and
adapt quickly.
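
The following sketch illustrates this interactive style with Spark SQL: the data is registered as a
temporary view, and each refinement of the analyst's question is simply another query over the same
data. The table and column names are illustrative.

# Ad-hoc, interactive querying with Spark SQL over a temporary view.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("InteractiveSQL").getOrCreate()

sales = spark.read.parquet("hdfs:///data/sales.parquet")   # placeholder path
sales.createOrReplaceTempView("sales")

# Each new question is just another SQL query over the same view.
spark.sql("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY region
    ORDER BY total_revenue DESC
""").show()

spark.stop()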

Data integration: Data produced by different systems across a business is rarely clean
or consistent enough to simply and easily be combined for reporting or analysis. Extract,
transform, and load (ETL) processes are often used to pull data from different systems,
clean and standardize it, and then load it into a separate system for analysis. Spark (and
Hadoop) are increasingly being used to reduce the cost and time required for this ETL
process.
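
A minimal ETL sketch with Spark follows: extract raw CSV, standardize and clean it, then load it as
Parquet for analysis. The paths, column names, and cleaning rules are assumptions made for illustration.

# Extract, transform, and load with Spark DataFrames.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim, to_date

spark = SparkSession.builder.appName("ETLDemo").getOrCreate()

# Extract: read raw data from a landing area (placeholder path).
raw = spark.read.csv("hdfs:///landing/orders.csv", header=True, inferSchema=True)

# Transform: deduplicate, standardize text and dates, drop invalid rows.
cleaned = (
    raw.dropDuplicates(["order_id"])
       .withColumn("customer_name", trim(col("customer_name")))
       .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
       .filter(col("amount") > 0)
)

# Load: write the cleaned data where the analysis system expects it.
cleaned.write.mode("overwrite").parquet("hdfs:///warehouse/orders")
spark.stop()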

Who Uses Spark?

A wide range of technology vendors have been quick to support Spark, recognizing the
opportunity to extend their existing big data products into areas where Spark delivers real
value, such as interactive querying and machine learning. Well-known companies such as
IBM and Huawei have invested significant sums in the technology, and a growing
number of startups are building businesses that depend in whole or in part upon Spark.
For example, in 2013 the Berkeley team responsible for creating Spark founded
Databricks, which provides a hosted end-to-end data platform powered by Spark. The
company is well-funded, having received $247 million across four rounds of investment
in 2013, 2014, 2016 and 2017, and Databricks employees continue to play a prominent
role in improving and extending the open source code of the Apache Spark project.

The major Hadoop vendors, including MapR, Cloudera, and Hortonworks, have all
moved to support YARN-based Spark alongside their existing products, and each
vendor is working to add value for its customers. Elsewhere, IBM, Huawei, and others
have all made significant investments in Apache Spark, integrating it into their own
products and contributing enhancements and extensions back to the Apache project.
Web-based companies, like Chinese search engine Baidu, e-commerce operation
Taobao, and social networking company Tencent, all run Spark-based operations at
scale.
3.5 What Sets Spark Apart?
There are many reasons to choose Spark, but the following three are key:
Simplicity: Spark’s capabilities are accessible via a set of rich APIs, all designed
specifically for interacting quickly and easily with data at scale. These APIs are well-
documented and structured in a way that makes it straightforward for data scientists
and application developers to quickly put Spark to work.

Speed: Spark is designed for speed, operating both in memory and on disk. Using Spark,
a team from Databricks tied for first place with a team from the University of California,
San Diego, in the 2014 Daytona GraySort benchmarking challenge
(https://spark.apache.org/news/spark-wins-daytona-gray-sort-100tb-benchmark.html).
The challenge involves processing a static data set; the Databricks team was able to
process 100 terabytes of data stored on solid-state drives in just 23 minutes, and the
previous winner took 72 minutes by using Hadoop and a different cluster configuration.
Spark can perform even better when supporting interactive queries of data stored in
memory. In those situations, there are claims that Spark can be 100 times faster than
Hadoop’s MapReduce.

Support: Spark supports a range of programming languages, including Java, Python,
R, and Scala. Spark includes support for tight integration with a number of leading
storage solutions in the Hadoop ecosystem and beyond, including MapR (file system,
database, and event store), Apache Hadoop (HDFS), Apache HBase, and Apache
Cassandra.

Furthermore, the Apache Spark community is large, active, and international. A
growing set of commercial providers, including Databricks, IBM, and all of the main
Hadoop vendors, deliver comprehensive support for Spark-based solutions.
3.6 The Power of Data Pipelines
Much of Spark's power lies in its ability to combine very different techniques and
processes together into a single, coherent whole. Outside Spark, the discrete tasks of
selecting data, transforming that data in various ways, and analyzing the transformed
results might easily require a series of separate processing frameworks, such as
Apache Oozie. Spark, on the other hand, offers the ability to combine these together,
crossing boundaries between batch, streaming, and interactive workflows in ways that
make the user more productive.

Spark jobs perform multiple operations consecutively, in memory, spilling to
disk only when required by memory limitations. Spark simplifies the management of these
disparate processes, offering an integrated whole – a data pipeline that is easier to
configure, easier to run, and easier to maintain. In use cases such as ETL, these pipelines
can become extremely rich and complex, combining large numbers of inputs and a wide
range of processing steps into a unified whole that consistently delivers the desired
result.
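
The following sketch shows such a pipeline expressed as a single Spark job: it ingests raw events,
filters and transforms them, aggregates the result, and persists it, with no hand-off to a separate
framework in between. Paths and column names are illustrative assumptions.

# A small end-to-end pipeline in one Spark job: select, transform, analyze, persist.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, count

spark = SparkSession.builder.appName("PipelineDemo").getOrCreate()

events = spark.read.json("hdfs:///logs/events")          # ingest (placeholder path)

sessions = (
    events.filter(events["status"] == "OK")               # transform: keep successful events
          .groupBy("user_id")                              # analyze: per-user aggregates
          .agg(count("*").alias("events"),
               avg("latency_ms").alias("avg_latency"))
)

sessions.write.mode("overwrite").parquet("hdfs:///analytics/sessions")   # persist
spark.stop()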

SUMMARY
1. This chapter introduces Apache Spark and its history and explores some of the
areas in which its particular set of capabilities shows the most promise.

2. Shows the MapReduce word count execution.

3. Describes the areas that Spark covers in terms of its applications.

4. Explains the uses of the Spark framework.

5. Explains how Spark sets itself apart from other frameworks.


