
SRI RAMAKRISHNA ENGINEERING COLLEGE

[Educational Service: SNR Sons Charitable Trust]


[Autonomous Institution, Accredited by NAAC with ‘A’ Grade]
[Approved by AICTE and Permanently Affiliated to Anna University, Chennai]
[ISO 9001: 2015 Certified and all eligible programmes Accredited by NBA]
VATTAMALAIPALAYAM, N.G.G.O. COLONY POST, COIMBATORE – 641 022

DEPARTMENT OF INFORMATION TECHNOLOGY

16CS217- DATA SCIENCE

SPARK

Team Members:

Ananthi P(1805009)
Deva Dharishini V(1805020)
Darshini J(1805023)
Gokulavalli A L(1805033)
AGENDA

• INTRODUCTION TO SPARK
• HISTORY
• WHY SPARK?
• SPARK ARCHITECTURE
• SPARK COMPONENTS
• SPARK RDD
• SPARK STREAMING
• FEATURES OF APACHE SPARK
• LIMITATIONS
• USE CASES
• REFERENCES
INTRODUCTION TO SPARK

• Apache Spark is an open-source cluster computing framework.

• One of its primary purposes is to handle real-time generated data.

• Spark was built on top of Hadoop MapReduce.

• It is optimized to run in memory, whereas alternative approaches like Hadoop's MapReduce write data to
and from computer hard drives.

• As a result, Spark processes data much more quickly than these alternatives.

• Apache Spark offers high-level APIs in Java, Scala, Python, and R.


HISTORY

• Apache Spark was introduced in 2009 at the UC Berkeley R&D Lab, now known as the AMP
Lab.

• Afterward, in 2010, it became open source under a BSD license.

• Spark was then donated to the Apache Software Foundation in 2013.

• In 2014, it became a top-level Apache project.


WHY SPARK?

• This is where Apache Spark comes in: it is a powerful open-source engine.

• It offers:

 Real-time stream processing

 Interactive processing

 Graph processing

 In-memory processing

 Batch processing

• All of this comes with very high speed, ease of use, and a standard interface.
SPARK ARCHITECTURE

• Spark follows a master-slave architecture. Its cluster consists of a single master and multiple slaves.

• The Spark architecture depends upon two abstractions:

 Resilient Distributed Dataset (RDD)

 Directed Acyclic Graph (DAG)

• Resilient Distributed Datasets are groups of data items that can be stored in-memory on worker nodes.

• A Directed Acyclic Graph is a finite directed graph that represents a sequence of computations on data. Each node
is an RDD partition, and each edge is a transformation on top of the data.
SPARK COMPONENTS

• The Spark project consists of tightly integrated components. At its core, Spark is a
computational engine that can schedule, distribute, and monitor multiple applications.
SPARK RDD

• The RDD (Resilient Distributed Dataset) is Spark's core abstraction.

• It is a collection of elements, partitioned across the nodes of the cluster so that we can execute various
parallel operations on it.
• Ways to create Spark RDD
 Parallelized collections
 External datasets
 Existing RDDs
• Spark RDDs operations
 Transformation Operations
 Action Operations
CONTD…

• Sparkling Features of Spark RDD


 In-memory computation
 Lazy Evaluation
 Fault Tolerance
 Immutability
 Persistence
 Partitioning
 Parallel
 Location-Stickiness
 Coarse-grained Operation
 Typed
SPARK STREAMING

• Apache Spark Streaming is a scalable, fault-tolerant stream processing system that natively supports both batch and
streaming workloads.

• Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time
data from various sources including (but not limited to) Kafka, Flume, and Amazon Kinesis. 

• Four Major Aspects of Spark Streaming

 Fast recovery from failures and stragglers

 Better load balancing and resource usage

 Combining of streaming data with static datasets and interactive queries

 Native integration with advanced processing libraries (SQL, machine learning, graph processing)
FEATURES OF SPARK

• Swift Processing 
• Dynamic in Nature
• In-Memory Computation in Spark
• Reusability
• Spark Fault Tolerance 
• Real-Time Stream Processing
• Lazy Evaluation in Spark
• Support Multiple Languages
• Support for Sophisticated Analysis
• Integrated with Hadoop
LIMITATIONS

• No Support for True Real-time Processing (micro-batching only)

• Problem with Small Files

• No File Management System

• Expensive

• Fewer Algorithms

• Manual Optimization

• Iterative Processing

• Latency
USE CASES

• Finance Industry
• E-Commerce Industry
• Media & Entertainment Industry
• Travel Industry

REFERENCES
• https://data-flair.training/blogs/spark-tutorial/
• https://www.javatpoint.com/apache-spark-rdd
• https://www.tutorialspoint.com/apache_spark/index.htm
• https://spark.apache.org/docs/latest/quick-start.html
THANK YOU!
