Apache Spark: Dhineshkumar S K

Agenda

 SPARK
What is Spark?

Criteria           Spark                                   Hadoop MapReduce
Speed              Up to 100 times faster than MapReduce   Baseline (MapReduce speed)
Interactive mode   Yes                                     No
Processing type    Batch and stream processing             Batch processing
Latency            Low, due to in-memory processing        High, due to disk-oriented processing
Apache Hadoop: Purpose

 “Framework that allows distributed processing of large data sets across clusters of
computers…
 using simple programming models.
 It is designed to scale up from single servers to thousands of machines, each offering local
computation and storage.
 Rather than rely on hardware to deliver high-availability, the library itself is designed to
detect and handle failures at the application layer, so delivering a highly-available service
on top of a cluster of computers, each of which may be prone to failures.”
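Job example

val log = sc.textFile("hdfs://...")           // base RDD, loaded lazily
val errors = log.filter(_.contains("ERROR"))  // transformation: nothing runs yet
errors.cache()                                // keep the filtered lines in memory once computed

// Action! count() triggers the actual computation on the workers
errors.filter(_.contains("I/O")).count()
errors.filter(_.contains("timeout")).count()

[Diagram: a Driver coordinating three Workers; each Worker holds a cache and one data block (Block1, Block2, Block3).]
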


Survey results

 “Why should companies use an in-memory computing framework like Apache Spark?”
 91% use Apache Spark because of its performance gains.
 77% use Apache Spark as it is easy to use.
 71% use Apache Spark due to the ease of deployment.
 64% use Apache Spark to leverage advanced analytics.
 52% use Apache Spark for real-time streaming.
Features

 Fast processing
 Flexibility
 In-memory computing
 Real-time processing 
 Better analytics 
 Compatible with Hadoop
Use cases
Applications of Spark

 Spark is a highly versatile big data processing engine. Some of its top applications across
industry verticals:
 Providing holistic customer service by analyzing data from multiple customer touchpoints
 Building an e-commerce recommender engine based on customers' past buying habits
 Creating customized ad targeting on websites based on customer profiles
 Analyzing text to identify customer sentiment on social media channels like Twitter
 Building machine learning applications that support AI initiatives using Spark MLlib
Finance Industry

 Banks are using Spark as an alternative to Hadoop.
 Financial institutions can detect fraudulent transactions in real time, based on previous
fraud footprints.
E-commerce Industry

 Information about real-time transactions can be passed to streaming clustering algorithms,
or to collaborative filtering algorithms such as alternating least squares (ALS).
 Companies using Spark in the e-commerce industry:
 Apache Spark at Alibaba
 Apache Spark at eBay
Healthcare

 Hospitals can prevent re-admissions by deploying home healthcare services to the patients
identified as at risk, saving costs for both hospitals and patients.

 Apache Spark at MyFitnessPal


Media & Entertainment

 The gaming industry uses Spark to identify patterns in real-time in-game events and respond to
them to capture business opportunities such as targeted advertising, automatic adjustment
of game levels based on complexity, and player retention.
 Apache Spark at Yahoo and MSN for news personalization
 Apache Spark at Netflix – online recommendations
 Apache Spark at Pinterest
Travel Industry

 Apache Spark at TripAdvisor


 Apache Spark at OpenTable
Real time analysis
Spark Components
 Spark Core
 Spark Streaming
 Spark SQL
 GraphX
 MLlib (Machine Learning)
Spark Core

 Spark Core is the base engine for large-scale parallel and distributed data processing. 
 It is responsible for:
 Memory management and fault recovery
 Scheduling, distributing and monitoring jobs on a cluster
 Interacting with storage systems
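
The responsibilities listed above can be seen in a small local-mode sketch. It assumes Spark is on the classpath; the object name CoreSketch, the helper countErrors, and the sample log lines are hypothetical and only stand in for the HDFS file used in the job example.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object CoreSketch {
  // Hypothetical helper: counts ERROR lines that also mention `keyword`.
  def countErrors(sc: SparkContext, lines: Seq[String], keyword: String): Long = {
    // Spark Core splits the data into partitions and schedules one task per partition.
    val log = sc.parallelize(lines, numSlices = 2)
    val errors = log.filter(_.contains("ERROR")) // transformation, evaluated lazily
    errors.cache()                               // ask Spark to keep this RDD in memory
    errors.filter(_.contains(keyword)).count()   // action, runs the job
  }

  def main(args: Array[String]): Unit = {
    // local[2] runs driver and executors in one JVM with two threads (an assumption for the demo).
    val spark = SparkSession.builder().appName("core-sketch").master("local[2]").getOrCreate()
    val sample = Seq(
      "ERROR disk I/O failure",
      "INFO job started",
      "ERROR connection timeout",
      "ERROR I/O retry"
    )
    println(countErrors(spark.sparkContext, sample, "I/O"))
    println(countErrors(spark.sparkContext, sample, "timeout"))
    spark.stop()
  }
}
```

On a real cluster the same code would run unchanged apart from the master setting and the data source; scheduling, caching, and recovery of lost partitions are handled by Spark Core rather than by the application.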
Spark Streaming

 Spark Streaming is the component of Spark which is used to process real-time streaming
data. 
Spark SQL

 Spark SQL is the Spark module that integrates relational processing with Spark’s
functional programming API.
 The following are the four libraries of Spark SQL.
 Data Source API
 DataFrame API
 Interpreter & Optimizer
 SQL Service
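
As a rough illustration of how relational queries and the DataFrame API meet, the sketch below builds a small DataFrame and queries it with SQL. It assumes Spark SQL is on the classpath; the object SqlSketch, the view name orders, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlSketch {
  // Hypothetical helper: total order amount per customer, computed with a SQL query.
  def totalsByCustomer(spark: SparkSession): DataFrame = {
    import spark.implicits._ // enables .toDF on local collections

    // DataFrame API: build a table-like dataset from in-memory rows.
    val orders = Seq(("alice", 30.0), ("bob", 12.5), ("alice", 7.5))
      .toDF("customer", "amount")

    // Register the DataFrame so it can be queried with plain SQL.
    orders.createOrReplaceTempView("orders")
    spark.sql(
      "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer ORDER BY customer"
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-sketch").master("local[2]").getOrCreate()
    totalsByCustomer(spark).show()
    spark.stop()
  }
}
```

The same aggregation could be written as orders.groupBy("customer").sum("amount"); both forms go through the same optimizer, which is the point of integrating the two styles.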
GraphX

 GraphX is the Spark API for graphs and graph-parallel computation.
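
A minimal sketch of what a GraphX computation can look like, assuming the GraphX module is on the classpath. The object GraphSketch and the tiny "follows" graph are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object GraphSketch {
  // Hypothetical helper: number of incoming "follows" edges per user id.
  def inDegrees(sc: SparkContext): Map[Long, Int] = {
    // Vertices are (id, attribute) pairs; edges carry source id, destination id, and an attribute.
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"),
      Edge(3L, 2L, "follows"),
      Edge(2L, 1L, "follows")
    ))
    val graph = Graph(users, follows)
    // inDegrees omits vertices that have no incoming edges.
    graph.inDegrees.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("graph-sketch").master("local[2]").getOrCreate()
    println(inDegrees(spark.sparkContext))
    spark.stop()
  }
}
```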


Machine learning

 MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine
learning in Apache Spark. 
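
To make this concrete, here is a small clustering sketch using the DataFrame-based k-means estimator that ships with Spark MLlib. It assumes the MLlib module is on the classpath; the object MllibSketch, the sample points, and the choice of k = 2 are hypothetical.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MllibSketch {
  // Hypothetical helper: fits k-means with k = 2 and returns the sorted cluster sizes.
  def clusterSizes(spark: SparkSession): Seq[Long] = {
    import spark.implicits._

    // Two well-separated groups of 2-D points, wrapped in a "features" column.
    val points = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2), Vectors.dense(0.2, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9), Vectors.dense(8.8, 9.2)
    ).map(Tuple1.apply).toDF("features")

    // A fixed seed keeps the run repeatable.
    val model = new KMeans().setK(2).setSeed(1L).fit(points)

    // transform() adds a "prediction" column holding each point's cluster index.
    model.transform(points)
      .groupBy("prediction").count()
      .collect().map(_.getLong(1)).sorted.toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mllib-sketch").master("local[2]").getOrCreate()
    println(clusterSizes(spark))
    spark.stop()
  }
}
```

Other MLlib estimators follow the same pattern of configuring an estimator, calling fit on a DataFrame, and applying the resulting model with transform.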
Spark Applications
Popular apps

 Uber – Uses Kafka, Spark Streaming, and HDFS for building a continuous ETL pipeline.
 Pinterest – Uses Spark Streaming to gain deep insight into customer engagement.
 Conviva – The video company Conviva deploys Spark to optimize videos and handle
live traffic.
