
SRI RAMAKRISHNA ENGINEERING COLLEGE

[Educational Service: SNR Sons Charitable Trust]


[Autonomous Institution, Accredited by NAAC with ‘A’ Grade]
[Approved by AICTE and Permanently Affiliated to Anna University, Chennai]
[ISO 9001: 2015 Certified and all eligible programmes Accredited by NBA]
VATTAMALAIPALAYAM, N.G.G.O. COLONY POST, COIMBATORE – 641 022

DEPARTMENT OF INFORMATION TECHNOLOGY

16CS217- DATA SCIENCE

SPARK

Team Members:

Ananthi P(1805009)
Deva Dharishini V(1805020)
Darshini J(1805023)
Gokulavalli A L(1805033)
AGENDA

• INTRODUCTION TO SPARK
• HISTORY
• WHY SPARK?
• SPARK ARCHITECTURE
• SPARK COMPONENTS
• SPARK RDD
• SPARK STREAMING
• FEATURES OF APACHE SPARK
• LIMITATIONS
• USE CASES
• REFERENCES
INTRODUCTION TO SPARK

• Apache Spark is an open-source cluster computing framework.

• One of its primary purposes is to handle real-time generated data.

• Spark was built on top of Hadoop MapReduce.

• It is optimized to run in memory, whereas alternative approaches like Hadoop's MapReduce write data to
and from computer hard drives.

• As a result, Spark processes data much more quickly than these alternatives.

• Apache Spark offers high-level APIs in Java, Scala, Python, and R.


HISTORY

• Apache Spark was introduced in 2009 at the UC Berkeley R&D Lab, now known as the AMP
Lab.

• Afterward, in 2010, it became open source under a BSD license.

• Spark was then donated to the Apache Software Foundation in 2013.

• In 2014, it became a top-level Apache project.


WHY SPARK?

• This is where Apache Spark comes in: it is a powerful open-source engine.

• It offers:

 Real-time stream processing

 Interactive processing

 Graph processing

 In-memory processing

 Batch processing

• All of this comes with very high speed, ease of use, and a standard interface.
SPARK ARCHITECTURE

• Spark follows a master-slave architecture. Its cluster consists of a single master and multiple slaves.

• The Spark architecture depends upon two abstractions:

 Resilient Distributed Dataset (RDD)

 Directed Acyclic Graph (DAG)

• Resilient Distributed Datasets are groups of data items that can be stored in-memory on worker nodes.

• A Directed Acyclic Graph is a finite directed graph that represents a sequence of computations on data. Each node
is an RDD partition, and each edge is a transformation on top of the data.
SPARK COMPONENTS

• The Spark project consists of tightly integrated components. At its core, Spark is a
computational engine that can schedule, distribute, and monitor multiple applications.
SPARK RDD

• The RDD (Resilient Distributed Dataset) is Spark's core abstraction.

• It is a collection of elements, partitioned across the nodes of the cluster so that we can execute various
parallel operations on it.
• Ways to create Spark RDD
 Parallelized collections
 External datasets
 Existing RDDs
• Spark RDDs operations
 Transformation Operations
 Action Operations
CONTD…

• Sparkling Features of Spark RDD


 In-memory computation
 Lazy Evaluation
 Fault Tolerance
 Immutability
 Persistence
 Partitioning
 Parallel
 Location-Stickiness
 Coarse-grained Operation
 Typed
SPARK STREAMING

• Apache Spark Streaming is a scalable, fault-tolerant stream processing system that natively supports both batch and
streaming workloads.

• Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time
data from various sources including (but not limited to) Kafka, Flume, and Amazon Kinesis. 

• Four Major Aspects of Spark Streaming

 Fast recovery from failures and stragglers

 Better load balancing and resource usage

 Combining of streaming data with static datasets and interactive queries

 Native integration with advanced processing libraries (SQL, machine learning, graph processing)
FEATURES OF SPARK

• Swift Processing 
• Dynamic in Nature
• In-Memory Computation in Spark
• Reusability
• Spark Fault Tolerance 
• Real-Time Stream Processing
• Lazy Evaluation in Spark
• Support Multiple Languages
• Support for Sophisticated Analysis
• Integrated with Hadoop
LIMITATIONS

• No Support for True Real-time Processing (micro-batching only)

• Problem with Small Files

• No File Management System

• Expensive

• Fewer Algorithms

• Manual Optimization

• Iterative Processing

• Latency
USE CASES

• Finance Industry
• E-Commerce Industry
• Media & Entertainment Industry
• Travel Industry

REFERENCES
• https://data-flair.training/blogs/spark-tutorial/
• https://www.javatpoint.com/apache-spark-rdd
• https://www.tutorialspoint.com/apache_spark/index.htm
• https://spark.apache.org/docs/latest/quick-start.html
THANK YOU!
