16Cs217-Data Science Spark: Sri Ramakrishna Engineering College
SPARK
Team Members:
Ananthi P(1805009)
Deva Dharishini V(1805020)
Darshini J(1805023)
Gokulavalli A L(1805033)
AGENDA
• INTRODUCTION TO SPARK
• HISTORY
• WHY SPARK?
• SPARK ARCHITECTURE
• SPARK COMPONENTS
• SPARK RDD
• SPARK STREAMING
• FEATURES OF APACHE SPARK
• LIMITATIONS
• USE CASES
• REFERENCES
INTRODUCTION TO SPARK
• Apache Spark is an open-source framework for large-scale distributed data processing.
• It is optimized to run in memory, whereas alternatives such as Hadoop's MapReduce write intermediate data to
and from disk.
• As a result, Spark processes data much faster than these alternatives.
• Apache Spark was first introduced in 2009 at the UC Berkeley R&D Lab, which is now known as the AMPLab.
• It offers both:
Interactive processing
Batch processing
• and combines high speed with ease of use and a standard interface.
SPARK ARCHITECTURE
• Spark follows a master-slave architecture: a cluster consists of a single master and multiple slave (worker) nodes.
• Resilient Distributed Datasets (RDDs) are groups of data items that can be stored in memory on the worker nodes.
• A Directed Acyclic Graph (DAG) is a finite directed graph that represents a sequence of computations on data: each
node is an RDD partition, and each edge is a transformation applied on top of the data.
SPARK COMPONENTS
• The Spark project consists of multiple tightly integrated components. At its core, Spark is a computational
engine that schedules, distributes, and monitors applications running across a cluster.
SPARK RDD
• An RDD is an immutable collection of elements, partitioned across the nodes of the cluster so that various
parallel operations can be executed on it.
• Ways to create Spark RDD
Parallelized collections
External datasets
Existing RDDs
• Spark RDDs operations
Transformation Operations
Action Operations
SPARK STREAMING
• Apache Spark Streaming is a scalable, fault-tolerant stream processing system that natively supports both batch and
streaming workloads.
• Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time
data from various sources including (but not limited to) Kafka, Flume, and Amazon Kinesis.
• It integrates natively with Spark's advanced processing libraries (SQL, machine learning, graph processing).
FEATURES OF SPARK
• Swift Processing
• Dynamic in Nature
• In-Memory Computation in Spark
• Reusability
• Spark Fault Tolerance
• Real-Time Stream Processing
• Lazy Evaluation in Spark
• Support for Multiple Languages
• Support for Sophisticated Analysis
• Integrated with Hadoop
LIMITATIONS
• Expensive
• Manual Optimization
• Iterative Processing
• Latency
USE CASES
• Finance Industry
• E-Commerce Industry
• Media & Entertainment Industry
• Travel Industry
REFERENCES
• https://data-flair.training/blogs/spark-tutorial/
• https://www.javatpoint.com/apache-spark-rdd
• https://www.tutorialspoint.com/apache_spark/index.htm
• https://spark.apache.org/docs/latest/quick-start.html
THANK YOU!