Apache Spark: Dhineshkumar S K
DHINESHKUMAR S K
Agenda
SPARK
What is Spark?
“Framework that allows distributed processing of large data sets across clusters of
computers… using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local
computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is designed to
detect and handle failures at the application layer, so delivering a highly-available service
on top of a cluster of computers, each of which may be prone to failures.”
Job example
errors.filter(_.contains("I/O")).count()
errors.filter(_.contains("timeout")).count()
[Figure: a Driver and three Workers, each Worker with its own cache.]
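The two counts in the job example can be put into a complete program. This is a minimal sketch, not the presenter's code: the object name `JobExample`, the helper `errorCounts`, the sample log lines, and the `local[*]` master are all assumptions made for illustration. It shows why the cache matters: `errors` is computed once by the first action and reused by the second.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object JobExample {
  // Hypothetical helper: returns (I/O error count, timeout error count).
  def errorCounts(sc: SparkContext, logLines: Seq[String]): (Long, Long) = {
    val lines = sc.parallelize(logLines)

    // Keep only error lines and ask Spark to keep them in worker memory.
    val errors = lines.filter(_.startsWith("ERROR")).cache()

    // The first action computes and caches `errors`;
    // the second action reads the cached partitions instead of recomputing.
    val ioCount = errors.filter(_.contains("I/O")).count()
    val timeoutCount = errors.filter(_.contains("timeout")).count()
    (ioCount, timeoutCount)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption; on a cluster it is usually set by spark-submit.
    val spark = SparkSession.builder()
      .appName("JobExample")
      .master("local[*]")
      .getOrCreate()

    // Made-up log lines standing in for a real log file.
    val sample = Seq(
      "ERROR I/O failure on disk 3",
      "INFO job started",
      "ERROR timeout waiting for worker",
      "ERROR I/O failure on disk 7"
    )

    val (io, timeouts) = errorCounts(spark.sparkContext, sample)
    println(s"I/O errors: $io, timeouts: $timeouts")
    spark.stop()
  }
}
```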
“Why should companies use an in-memory computing framework like Apache Spark?”
91% use Apache Spark because of its performance gains.
77% use Apache Spark as it is easy to use.
71% use Apache Spark due to the ease of deployment.
64% use Apache Spark to leverage advanced analytics.
52% use Apache Spark for real-time streaming.
Features
Fast processing
Flexibility
In-memory computing
Real-time processing
Better analytics
Compatible with Hadoop
Use cases
Applications of Spark
Spark is a highly versatile big data processing engine. Here we list some of the top applications of Spark cutting
across industry verticals.
Providing holistic customer service by analyzing data from multiple customer touchpoints
Building an e-commerce recommender engine based on customers' past buying habits
Creating customized ad targeting on websites based on customer profiles
Text analysis to identify customer sentiment on social media channels like Twitter
Machine learning applications for supporting AI initiatives using Spark MLlib
Use cases
Finance Industry
Information about real-time transactions can be passed to streaming clustering algorithms
such as k-means, or to collaborative filtering algorithms such as alternating least squares.
Healthcare Industry
Spark helps hospitals prevent re-admittance: they can deploy home healthcare services to
the identified patients, saving costs for both the hospitals and the patients.
Gaming Industry
Spark is used in the gaming industry to identify patterns in real-time in-game events and
respond to them to capture business opportunities such as targeted advertising, automatic
adjustment of game levels based on complexity, and player retention.
Apache Spark at Yahoo, MSN for news personalization
Apache Spark at Netflix – online recommendations
Apache Spark at Pinterest
Travel Industry
Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing.
It is responsible for:
Memory management and fault recovery
Scheduling, distributing and monitoring jobs on a cluster
Interacting with storage systems
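A small RDD pipeline shows these responsibilities at work. This is a sketch under assumptions: `CoreSketch`, `wordCounts`, and the sample lines are hypothetical, and the data comes from an in-memory sequence rather than a storage system. Spark Core splits the input into partitions, schedules the tasks, and records the lineage of each RDD, which it can replay to rebuild a lost partition.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object CoreSketch {
  // Hypothetical helper: counts words across the given lines.
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] = {
    val counts = sc.parallelize(lines, numSlices = 2) // two partitions, processed in parallel
      .flatMap(_.split("\\s+"))                       // split each line into words
      .map(word => (word, 1))                         // pair each word with a count of 1
      .reduceByKey(_ + _)                             // shuffle and sum counts per word

    // Print the lineage that Spark would replay to recover a lost partition.
    println(counts.toDebugString)

    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption made so the sketch runs on one machine.
    val spark = SparkSession.builder()
      .appName("CoreSketch")
      .master("local[*]")
      .getOrCreate()

    val result = wordCounts(spark.sparkContext, Seq("spark core", "spark sql"))
    println(result)
    spark.stop()
  }
}
```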
Spark Streaming
Spark Streaming is the component of Spark which is used to process real-time streaming
data.
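The sketch below uses the classic Spark Streaming (DStream) API to count events in one-second micro-batches. Several things are assumptions: the names `StreamingSketch` and `totalCounts`, the made-up events, and the use of `queueStream`, which feeds pre-built RDDs into the stream in place of a live source such as Kafka or a socket. Recent Spark releases treat the DStream API as legacy and recommend Structured Streaming, so read this as an illustration of the idea only.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  // Hypothetical helper: streams each inner Seq as one micro-batch and
  // returns the event counts accumulated before the timeout expires.
  def totalCounts(batches: Seq[Seq[String]], timeoutMs: Long): Map[String, Int] = {
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // one-second batch interval

    // queueStream hands out one queued RDD per batch interval.
    val queue = mutable.Queue[RDD[String]]()
    batches.foreach(lines => queue += ssc.sparkContext.parallelize(lines))

    val counts = ssc.queueStream(queue)
      .flatMap(_.split(" "))
      .map(event => (event, 1))
      .reduceByKey(_ + _) // counts within a single micro-batch

    // Merge each batch's counts into a running total kept on the driver.
    val totals = mutable.Map.empty[String, Int]
    counts.foreachRDD { rdd =>
      val batch = rdd.collect()
      totals.synchronized {
        batch.foreach { case (event, n) => totals(event) = totals.getOrElse(event, 0) + n }
      }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(timeoutMs)
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    totals.synchronized(totals.toMap)
  }

  def main(args: Array[String]): Unit = {
    val result = totalCounts(Seq(Seq("login click click"), Seq("click logout")), 5000L)
    println(result)
  }
}
```

A real deployment would run until stopped and write results to a sink; the timeout here exists only so the sketch terminates.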
Spark SQL
Spark SQL is the Spark module that integrates relational processing with Spark’s
functional programming API.
The following are the four libraries of Spark SQL:
Data Source API
DataFrame API
Interpreter & Optimizer
SQL Service
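Two of these pieces, the DataFrame API and SQL queries, can be shown side by side. This is a sketch with hypothetical names (`SqlSketch`, `Purchase`, the `purchases` view) and made-up data. Both forms pass through the same optimizer, and `explain()` prints the plan it produced.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object SqlSketch {
  // Hypothetical record type for the example data.
  case class Purchase(customer: String, amount: Double)

  // Total spend per customer, expressed as a SQL query.
  def totalsViaSql(spark: SparkSession): Map[String, Double] = {
    import spark.implicits._
    val purchases = Seq(
      Purchase("alice", 20.0), Purchase("bob", 5.0), Purchase("alice", 7.5)
    ).toDF()
    purchases.createOrReplaceTempView("purchases")

    val result = spark.sql(
      "SELECT customer, SUM(amount) AS total FROM purchases GROUP BY customer")
    result.explain() // show the physical plan chosen by the optimizer
    result.collect().map(row => row.getString(0) -> row.getDouble(1)).toMap
  }

  // The same aggregation, expressed with the DataFrame API.
  def totalsViaDataFrame(spark: SparkSession): Map[String, Double] = {
    import spark.implicits._
    val purchases = Seq(
      Purchase("alice", 20.0), Purchase("bob", 5.0), Purchase("alice", 7.5)
    ).toDF()

    purchases.groupBy("customer").agg(sum("amount").as("total"))
      .collect().map(row => row.getString(0) -> row.getDouble(1)).toMap
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption made so the sketch runs on one machine.
    val spark = SparkSession.builder().appName("SqlSketch").master("local[*]").getOrCreate()
    println(totalsViaSql(spark))
    println(totalsViaDataFrame(spark))
    spark.stop()
  }
}
```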
GraphX
GraphX is the Spark API for graphs and graph-parallel computation.
MLlib
MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine
learning in Apache Spark.
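As an illustration, the sketch below clusters a handful of two-dimensional points with k-means, using the DataFrame-based `spark.ml` package that ships as part of MLlib. The object name `MllibSketch`, the helper `clusterSizes`, the sample points, and the choice of `k = 2` are assumptions made for the example.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MllibSketch {
  // Hypothetical helper: fits k-means with k = 2 and returns the sorted cluster sizes.
  def clusterSizes(spark: SparkSession): Seq[Long] = {
    import spark.implicits._

    // Made-up points forming two well-separated groups.
    val points = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2), Vectors.dense(0.2, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9), Vectors.dense(8.8, 9.2)
    ).map(Tuple1.apply).toDF("features") // KMeans reads the "features" column by default

    val model = new KMeans().setK(2).setSeed(1L).fit(points)

    // transform() adds a "prediction" column holding each point's cluster index.
    val assigned = model.transform(points)
    assigned.groupBy("prediction").count()
      .collect().map(_.getLong(1)).sorted.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption made so the sketch runs on one machine.
    val spark = SparkSession.builder().appName("MllibSketch").master("local[*]").getOrCreate()
    println(clusterSizes(spark))
    spark.stop()
  }
}
```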
Spark Applications
Popular apps
Uber – Uses Kafka, Spark Streaming, and HDFS for building a continuous ETL pipeline.
Pinterest – Uses Spark Streaming to gain deep insight into customer engagement
details.
Conviva – A leading video company, Conviva deploys Spark for optimizing videos
and handling live traffic.