UNIT 4 Part 2


Subject: BIG DATA (KCS 061)

Faculty Name: Miss Farheen Siddiqui

UNIT 4: Spark, SCALA


Spark:
Spark was developed under the Apache Software Foundation to speed up Hadoop's computational processing. Spark includes its own cluster management, so Hadoop is only one of the ways in which Spark can be deployed.

Spark can use Hadoop in two ways: for storage and for processing. Since Spark includes its own computation and cluster management, it typically uses Hadoop only for storage (HDFS).

Apache Spark is a distributed, open-source processing system used for 'Big Data' workloads. Spark uses optimized query execution and in-memory caching for rapid queries across data of any size. In short, it is a general-purpose, fast engine for large-scale data processing.

It is much faster than earlier approaches to Big Data such as classical MapReduce, because Spark executes in RAM/memory, which makes processing faster than working from disk.

Spark is also simple to use, because the same engine covers more than one task: working with data streams or graphs, running Machine Learning algorithms, ingesting data into a database, building data pipelines, executing distributed SQL, and more.

Apache Spark is a lightning-fast unified analytics engine for cluster computing on large data sets, such as Big Data stored in Hadoop, with the aim of running programs in parallel across multiple nodes. It combines multiple stack libraries such as SQL and DataFrames, GraphX, MLlib, and Spark Streaming.

Spark operates in 4 different modes:

 Standalone Mode: Here all processes run within the same JVM process.
 Standalone Cluster Mode: In this mode, it uses the job-scheduling framework built into Spark.
 Apache Mesos: In this mode, the worker nodes run on various machines, but the driver runs only on the master node.
 Hadoop YARN: In this mode, the driver runs inside the application master process and is managed by YARN on the cluster.
The mode is selected through the master URL passed when the application is created or submitted, as sketched below.
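
A minimal sketch of how the master URL chosen when building a SparkSession selects the execution mode; the object name and the host names in the commented-out alternatives are placeholders, not values from these notes.

    import org.apache.spark.sql.SparkSession

    object MasterUrlDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("MasterUrlDemo")
          .master("local[*]")                      // everything in a single JVM (standalone mode above)
          // .master("spark://master-host:7077")   // standalone cluster mode
          // .master("mesos://mesos-host:5050")    // Apache Mesos
          // .master("yarn")                       // Hadoop YARN (requires HADOOP_CONF_DIR to be set)
          .getOrCreate()

        println(spark.sparkContext.master)         // prints the master URL actually in use
        spark.stop()
      }
    }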

Spark Installation:
There are several ways to install and use Spark. We can install Spark on our machine as a stand-alone framework, or use Spark VM (Virtual Machine) images available from vendors such as MapR, Hortonworks, and Cloudera. We can also use Spark already configured and installed in the cloud (for example, Databricks Cloud).

Spark Application:
Spark applications are programs written using the Apache Spark framework, which is designed
for large-scale data processing. These applications can perform various tasks on big data sets,
such as data transformation, analysis, machine learning, and graph processing. Here are some
common types of Spark applications used in big data:

1. Data Processing: Spark can efficiently process large volumes of data in parallel.
Applications might involve filtering, aggregating, joining, and transforming data.
2. Machine Learning: Spark's MLlib library provides scalable machine learning algorithms
for classification, regression, clustering, collaborative filtering, and dimensionality
reduction.
3. Graph Processing: Spark GraphX enables the processing of graph data structures and
implements graph-parallel algorithms for tasks like page rank, community detection, and
graph coloring.
4. Streaming Analytics: Spark Streaming allows real-time processing of streaming data.
Applications might involve processing continuous streams of data from various sources
like Kafka, Flume, Twitter, etc., for tasks such as anomaly detection, sentiment analysis,
and real-time recommendations.
5. SQL and Data Warehousing: Spark SQL provides a DataFrame API and SQL interface
for working with structured data, enabling users to run SQL queries and perform analytics
on large-scale datasets.
6. ETL (Extract, Transform, Load): Spark is often used for ETL tasks, where data is
extracted from various sources, transformed into a suitable format, and loaded into a data
warehouse or data lake for further analysis.
7. Data Exploration and Visualization: Spark can be used for exploratory data analysis and
visualization tasks, allowing users to gain insights from large datasets using tools like
Spark SQL, DataFrame operations, and visualization libraries like Matplotlib or Seaborn.
8. Natural Language Processing (NLP): Spark's ecosystem includes libraries like Spark NLP,
which provides scalable natural language processing capabilities for tasks such as text
classification, entity recognition, sentiment analysis, and topic modeling.
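
To make these application types concrete, here is a minimal sketch of a batch Spark application in Scala that combines data processing (point 1) and Spark SQL (point 5). The input path users.csv and the column names country and age are hypothetical.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.avg

    object SimpleBatchApp {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("SimpleBatchApp")
          .master("local[*]")
          .getOrCreate()

        // Data processing: read, filter and aggregate structured data
        val users = spark.read.option("header", "true").csv("hdfs:///data/users.csv")
        val avgAgeByCountry = users
          .filter(users("age").isNotNull)
          .groupBy("country")
          .agg(avg("age").alias("avg_age"))

        // Spark SQL: the same data can also be queried with SQL
        users.createOrReplaceTempView("users")
        spark.sql("SELECT country, COUNT(*) AS n FROM users GROUP BY country").show()

        avgAgeByCountry.show()
        spark.stop()
      }
    }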

Concept of Jobs, Stages and Tasks:


In the data processing landscape, Apache Spark stands as one of the most popular and efficient
frameworks that handles big data analytics. Spark’s unique feature lies in its ability to process
large datasets with lightning speed, thanks to its in-memory computing capabilities. As a
programmer working with Spark and Scala, it is important to understand its internal workings, particularly the core concepts of Jobs, Stages, and Tasks. This section looks at each of these concepts in turn.
Concept of Job in Spark
A job in Spark refers to a sequence of transformations on data. Whenever an action such as count(), first(), collect(), or save() is called on an RDD (Resilient Distributed Dataset), a job is created. A job can be thought of as the total work that your Spark application needs to perform, broken down into a series of steps.
Consider a scenario where you’re executing a Spark program, and you call the action count() to
get the number of elements. This will create a Spark job. If further in your program, you
call collect(), another job will be created. So, a Spark application could have multiple jobs,
depending upon the number of actions.
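
A small sketch (with illustrative names) of how each action launches its own job:

    import org.apache.spark.sql.SparkSession

    object JobsDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("JobsDemo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val rdd = sc.parallelize(1 to 1000).map(_ * 2)   // transformation only: no job yet

        val n   = rdd.count()     // action 1 -> job 1
        val all = rdd.collect()   // action 2 -> job 2

        println(s"count = $n, first collected element = ${all.head}")
        // Both jobs appear as separate entries in the Spark UI (http://localhost:4040).
        spark.stop()
      }
    }
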
Concept of Stage in Spark
A stage in Spark represents a sequence of transformations that can be executed in a single pass,
i.e., without any shuffling of data. When a job is divided, it is split into stages. Each stage
comprises tasks, and all the tasks within a stage perform the same computation.
The boundary between two stages is drawn when transformations cause data shuffling across
partitions. Transformations in Spark are categorized into two types: narrow and wide. Narrow
transformations, like map(), filter(), and union(), can be done within a single partition. But for
wide transformations like groupByKey(), reduceByKey(), or join(), data from all partitions may
need to be combined, thus necessitating shuffling and marking the start of a new stage.
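
The sketch below (names are illustrative) shows a job in which the wide transformation reduceByKey forces a shuffle and therefore a stage boundary, while the surrounding map calls are narrow:

    import org.apache.spark.sql.SparkSession

    object StagesDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("StagesDemo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

        val counts = words
          .map(w => (w, 1))                   // narrow: stays within each partition (stage 1)
          .reduceByKey(_ + _)                 // wide: shuffles data across partitions (stage boundary)
          .map { case (w, c) => s"$w=$c" }    // narrow: runs in the post-shuffle stage (stage 2)

        counts.collect().foreach(println)     // the action that triggers the job
        spark.stop()
      }
    }
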
Concept of Task in Spark
A task in Spark is the smallest unit of work that can be scheduled. Each stage is divided into tasks.
A task is a unit of execution that runs on a single machine. When a stage comprises
transformations on an RDD, those transformations are packaged into a task to be executed on a
single executor.
For example, if you have a Spark job that is divided into two stages and you’re running it on a
cluster with two executors, each stage could be divided into two tasks. Each executor would then
run a task in parallel, performing the transformations defined in that task on its subset of the data.
In summary, a Spark job is split into multiple stages at the points where data shuffling is needed,
and each stage is split into tasks that run the same code on different data partitions.
Resilient Distributed Dataset (RDD):
RDD was the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable distributed collection of elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel with a low-level API offering transformations and actions.

Five reasons to use RDDs:

 You want low-level transformations, actions, and control over your dataset;
 Your data is unstructured, such as media streams or streams of text;
 You want to manipulate your data with functional programming constructs rather than domain-specific expressions;
 You don't care about imposing a schema, such as a columnar format, while processing or accessing data attributes by name or column; and
 You can forgo some of the optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
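
A short sketch of this low-level RDD style, processing unstructured text with functional constructs and no imposed schema; the log path is a placeholder.

    import org.apache.spark.sql.SparkSession

    object RddDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("RddDemo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val lines  = sc.textFile("hdfs:///logs/app.log")              // unstructured lines of text
        val errors = lines.filter(_.contains("ERROR"))                // transformation (lazy)
        val firstWords = errors.map(_.split(" ").headOption.getOrElse(""))

        println(s"error lines: ${errors.count()}")                    // action: triggers execution
        firstWords.take(5).foreach(println)
        spark.stop()
      }
    }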

Anatomy of Spark Job Run:

At the highest level, a Spark job run involves two entities:

 Driver: hosts the application (SparkContext).
 Executors: execute the application's tasks.

Job Submission: A Spark job is submitted automatically when an action is performed on an RDD. This internally calls runJob() on the SparkContext, which passes the call to the scheduler that runs as part of the driver.
The scheduler is made up of two components:
 DAG Scheduler - breaks the job down into stages.
 Task Scheduler - submits the tasks from each stage to the cluster.
DAG Construction: A Spark job is split into multiple stages, and each stage runs a specific set of tasks. There are mainly two types of tasks: shuffle map tasks and result tasks.
 Shuffle map tasks: Each shuffle map task runs a computation on one RDD partition and writes its output to a new set of partitions, which are fetched in a later stage. Shuffle map tasks run in all stages except the final stage.
 Result tasks: These run in the final stage and return results to the user's program. Each result task runs the computation on its RDD partition and then sends the result back to the driver, which assembles the results from all partitions into a final result. Note that each task is given a placement preference by the DAG scheduler so that the task scheduler can take advantage of data locality. Once the DAG scheduler completes construction of the DAG of stages, it submits each stage's set of tasks to the task scheduler. Child stages are only submitted once their parents have completed successfully.
Task Scheduling: When the task scheduler receives a set of tasks, it uses its list of executors that are running for the application and constructs a mapping of tasks to executors on the basis of placement preference. For a given executor, the scheduler first assigns process-local tasks, then node-local tasks, then rack-local tasks, before assigning any arbitrary (non-local) task. Executors send status updates to the driver when a task is completed or has failed. In case of task failure, the task scheduler resubmits the task on another executor. If speculative execution is enabled, it also launches speculative tasks for tasks that are running slowly; speculative tasks are duplicates of existing tasks, which the scheduler may run as a backup if a task is running more slowly than expected.
Task Execution: The executor first makes sure that the JAR and file dependencies are up to date, keeping a local cache of dependencies from previous tasks. It then deserializes the task code (which consists of the user's functions) from the serialized bytes that were sent as part of the launch-task message. Finally, the task code is executed. A task can return a result to the driver; the result is serialized and sent to the executor backend, and finally to the driver as a status update message.
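
A small sketch of inspecting the lineage that the DAG scheduler turns into stages: RDD.toDebugString prints the dependency graph, and the shuffle introduced by reduceByKey appears as a stage boundary. The names used here are illustrative.

    import org.apache.spark.sql.SparkSession

    object LineageDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("LineageDemo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val counts = sc.parallelize(Seq("x", "y", "x"))
          .map(w => (w, 1))
          .reduceByKey(_ + _)          // introduces a shuffle dependency

        println(counts.toDebugString)  // shows the chain of RDDs and the shuffle boundary
        counts.collect()               // the action submits the job via runJob() on the SparkContext
        spark.stop()
      }
    }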

Spark on YARN:

On YARN, an Apache Spark application is either a single job (where "job" refers to a Spark job, a Hive query, or anything similar) or a DAG (Directed Acyclic Graph) of jobs. YARN divides the functionality of resource management between a global ResourceManager and a per-application ApplicationMaster; the unit of scheduling on a YARN cluster is the application. YARN itself is a generic resource-management framework for distributed workloads and is part of the Hadoop system; it supports many compute frameworks, such as Spark and Tez, as well as MapReduce.

SCALA:
Scala is a programming language that can be used in conjunction with Hadoop, a distributed computing framework, to build scalable and high-performance data processing applications. Scala is a versatile language that runs on the Java Virtual Machine (JVM) and is compatible with the Hadoop ecosystem.

Here’s how Scala can be used with Hadoop:


1. Hadoop MapReduce: Scala can be used to write Hadoop MapReduce applications.
MapReduce is a programming model and processing framework used to process large
datasets in a distributed manner. You can write both map and reduce functions in Scala
to process data stored in HDFS (Hadoop Distributed File System).

2. Hadoop Streaming: Hadoop Streaming is a feature that allows you to write MapReduce
jobs in any programming language, including Scala. You can write Scala scripts to
process data using Hadoop Streaming and submit them to a Hadoop cluster for execution.

3. Hadoop Ecosystem Integration: Scala can be integrated with various components of
the Hadoop ecosystem. For example:

 Apache Hive: Hive provides a SQL-like query language for querying and
analyzing data in Hadoop. You can write Hive queries in Scala using the Hive
JDBC or ODBC driver.
 Apache Pig: Pig is a high-level platform for creating MapReduce programs
using a scripting language called Pig Latin. You can write Pig scripts in Scala
for data transformation tasks.
 Apache Spark: Apache Spark, a fast and general-purpose cluster computing
framework, provides native support for Scala. You can write Spark
applications in Scala to process large-scale data in-memory and perform batch
processing, stream processing, machine learning, and graph processing tasks.

4. Hadoop Libraries: Scala can leverage various Hadoop libraries, such as Hadoop
Common, Hadoop HDFS, and Hadoop YARN, to interact with Hadoop clusters, manage
files in HDFS, and submit jobs for execution.

5. Scala and Functional Programming: Scala’s functional programming capabilities,
such as immutability, pattern matching, and higher-order functions, make it well-suited
for writing data processing code that can be parallelized and distributed effectively in a
Hadoop cluster.
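
As a small illustration of point 3 above (Spark's native Scala support over Hadoop data), here is a sketch of the classic word count reading from and writing to HDFS. The input and output paths are placeholders, and the master URL is assumed to be supplied by spark-submit or the cluster manager.

    import org.apache.spark.sql.SparkSession

    object HdfsWordCount {
      def main(args: Array[String]): Unit = {
        // No master is hard-coded here; it is provided at submission time.
        val spark = SparkSession.builder().appName("HdfsWordCount").getOrCreate()
        val sc = spark.sparkContext

        sc.textFile("hdfs:///input/books")               // read text files from HDFS
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .saveAsTextFile("hdfs:///output/wordcount")    // write the counts back to HDFS

        spark.stop()
      }
    }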

Classes & Object of SCALA:

Generally speaking in OOP, it’s perfect to say that objects are instances of a class. However,
Scala has an object keyword that we can use when defining a singleton object.

When we say singleton, we mean an object that can only be instantiated once. Creating an
object requires just the object keyword and an identifier.

Classes are blueprints for creating objects. When we define a class, we can then create new
objects (instances) from the class.

We define a class using the class keyword followed by whatever name we give for that class.
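
A minimal sketch of a class and a singleton object; the names Student, Registry and their members are purely illustrative.

    // A class is a blueprint; each 'new Student(...)' creates a separate instance.
    class Student(val name: String, val marks: Int) {
      def isPassing: Boolean = marks >= 40
    }

    // 'object' defines a singleton: exactly one instance, created lazily on first use.
    object Registry {
      private var students: List[Student] = Nil
      def register(s: Student): Unit = { students = s :: students }
      def count: Int = students.length
    }

    object ClassObjectDemo {
      def main(args: Array[String]): Unit = {
        val s = new Student("Asha", 72)   // instance created from the class blueprint
        Registry.register(s)              // the singleton is referred to simply by its name
        println(s"${s.name} passing: ${s.isPassing}, registered: ${Registry.count}")
      }
    }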
Basic Types & Operators of SCALA:

A data type is a categorization of data that tells the compiler which type of value a
variable holds. For example, if a variable has an Int data type, it holds a numeric value. In
Scala, the data types are similar to Java in terms of length and storage. In Scala, data types
are treated as objects, so the first letter of each data type is capitalized. The data types
available in Scala are shown in the table below.

DataType    Default value    Description
Boolean     false            true or false
Byte        0                8-bit signed value. Range: -128 to 127
Short       0                16-bit signed value. Range: -2^15 to 2^15-1
Char        '\u0000'         16-bit unsigned Unicode character. Range: 0 to 2^16-1
Int         0                32-bit signed value. Range: -2^31 to 2^31-1
Long        0L               64-bit signed value. Range: -2^63 to 2^63-1
Float       0.0F             32-bit IEEE 754 single-precision float
Double      0.0D             64-bit IEEE 754 double-precision float
String      null             A sequence of characters
Unit        –                Corresponds to no value
Nothing     –                It is a subtype of every other type and contains no value
Any         –                It is the supertype of all other types
AnyVal      –                It serves as the supertype of value types
AnyRef      –                It serves as the supertype of reference types
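
A short sketch declaring variables of the types listed in the table above:

    object TypesDemo {
      def main(args: Array[String]): Unit = {
        val flag: Boolean = true
        val b: Byte      = 100
        val s: Short     = 30000
        val c: Char      = 'A'
        val i: Int       = 42
        val l: Long      = 9000000000L
        val f: Float     = 3.14F
        val d: Double    = 2.718281828
        val str: String  = "Scala"
        val u: Unit      = ()            // Unit has exactly one value, written ()

        val anything: Any = i            // Any is the supertype of all types
        println(s"$flag $b $s $c $i $l $f $d $str $u $anything")
      }
    }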

An operator is a symbol that represents an operation to be performed on one or more operands.
Operators are the foundation of any programming language and allow us to perform different
kinds of operations on operands. The different types of operators used in Scala are as follows:

Arithmetic Operators
These are used to perform arithmetic/mathematical operations on operands.
 Addition(+) operator adds two operands. For example, x+y.
 Subtraction(-) operator subtracts two operands. For example, x-y.
 Multiplication(*) operator multiplies two operands. For example, x*y.
 Division(/) operator divides the first operand by the second. For example, x/y.
 Modulus(%) operator returns the remainder when the first operand is divided by the
second. For example, x%y.
Note that Scala has no built-in exponent (**) operator; exponentiation (power) is computed with math.pow, for example math.pow(x, y).
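
A quick sketch of the arithmetic operators in action:

    object ArithmeticDemo {
      def main(args: Array[String]): Unit = {
        val x = 7
        val y = 3
        println(x + y)           // 10
        println(x - y)           // 4
        println(x * y)           // 21
        println(x / y)           // 2   (integer division)
        println(x % y)           // 1
        println(math.pow(x, y))  // 343.0 (exponentiation via math.pow, returns Double)
      }
    }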

Relational Operators
Relational operators (comparison operators) are used to compare two values. Let's see them one by one:
 Equal To(==) operator checks whether the two given operands are equal or not. If so, it
returns true. Otherwise it returns false. For example, 5==5 will return true.
 Not Equal To(!=) operator checks whether the two given operands are equal or not. If not,
it returns true. Otherwise it returns false. It is the exact boolean complement of the ‘==’
operator. For example, 5!=5 will return false.
 Greater Than(>) operator checks whether the first operand is greater than the second
operand. If so, it returns true. Otherwise it returns false. For example, 6>5 will return true.
 Less than(<) operator checks whether the first operand is lesser than the second operand.
If so, it returns true. Otherwise it returns false. For example, 6<5 will return false.
 Greater Than Equal To(>=) operator checks whether the first operand is greater than or
equal to the second operand. If so, it returns true. Otherwise it returns false. For example,
5>=5 will return true.
 Less Than Equal To(<=) operator checks whether the first operand is lesser than or equal
to the second operand. If so, it returns true. Otherwise it returns false. For example, 5<=5
will also return true.

Logical Operators
They are used to combine two or more conditions/constraints or to complement the evaluation
of the original condition in consideration. They are described below:
 Logical AND (&&) operator returns true when both the conditions in consideration are satisfied; otherwise it returns false. For example, a && b returns true only when both a and b are true.
 Logical OR (||) operator returns true when one (or both) of the conditions in consideration is satisfied; otherwise it returns false. For example, a || b returns true if at least one of a or b is true, including when both are true.
 Logical NOT (!) operator returns true when the condition in consideration is not satisfied; otherwise it returns false. For example, !true returns false.
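
A combined sketch of the relational and logical operators on simple conditions:

    object CompareLogicDemo {
      def main(args: Array[String]): Unit = {
        val a = 5
        val b = 6
        println(a == b)            // false
        println(a != b)            // true
        println(b > a && a >= 5)   // true: both conditions hold
        println(a > b || a <= b)   // true: the second condition holds
        println(!(a == b))         // true: negation of false
      }
    }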

Assignment Operators
Assignment operators are used to assign a value to a variable. The left-side operand of the
assignment operator is a variable, and the right-side operand is a value. The value on the right
side must be of the same data type as the variable on the left side; otherwise the compiler will
raise an error. The different types of assignment operators are shown below:
 Simple Assignment (=) operator is the simplest assignment operator. This operator is used
to assign the value on the right to the variable on the left.
 Add AND Assignment (+=) operator is used for adding left operand with right operand and
then assigning it to variable on the left.
 Subtract AND Assignment (-=) operator is used for subtracting left operand with right
operand and then assigning it to variable on the left.
 Multiply AND Assignment (*=) operator is used for multiplying the left operand with right
operand and then assigning it to the variable on the left.
 Divide AND Assignment (/=) operator is used for dividing left operand with right operand
and then assigning it to variable on the left.
 Modulus AND Assignment (%=) operator is used for assigning modulo of left operand
with right operand and then assigning it to the variable on the left.
 Exponent AND Assignment (**=) does not exist in Scala; raising a value to a power and assigning it back is written explicitly, for example x = math.pow(x, y) for Double values.
 Left shift AND Assignment(<<=)operator is used to perform binary left shift of the left
operand with the right operand and assigning it to the variable on the left.
 Right shift AND Assignment(>>=)operator is used to perform binary right shift of the left
operand with the right operand and assigning it to the variable on the left.
 Bitwise AND Assignment(&=)operator is used to perform Bitwise And of the left operand
with the right operand and assigning it to the variable on the left.
 Bitwise exclusive OR and Assignment(^=)operator is used to perform Bitwise exclusive
OR of the left operand with the right operand and assigning it to the variable on the left.
 Bitwise inclusive OR and Assignment(|=)operator is used to perform Bitwise inclusive OR
of the left operand with the right operand and assigning it to the variable on the left.

Bitwise Operators
In Scala, there are 7 bitwise operators, which work at the bit level and are used to perform
bit-by-bit operations. The bitwise operators are as follows:
 Bitwise AND (&): Takes two numbers as operands and does AND on every bit of two
numbers. The result of AND is 1 only if both bits are 1.
 Bitwise OR (|): Takes two numbers as operands and does OR on every bit of the two numbers.
The result of OR is 1 if either of the two bits is 1.
 Bitwise XOR (^): Takes two numbers as operands and does XOR on every bit of two
numbers. The result of XOR is 1 if the two bits are different.
 Bitwise left Shift (<<): Takes two numbers, left shifts the bits of the first operand, the
second operand decides the number of places to shift.
 Bitwise right Shift (>>): Takes two numbers, right shifts the bits of the first operand, the
second operand decides the number of places to shift.
 Bitwise ones Complement (~): Takes a single number and inverts all of its bits (one's complement).
 Bitwise shift right zero fill(>>>): In shift right zero fill operator, left operand is shifted
right by the number of bits specified by the right operand, and the shifted values are filled
up with zeros.
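
A sketch of the bitwise operators applied to 32-bit Int values:

    object BitwiseDemo {
      def main(args: Array[String]): Unit = {
        val x = 12    // binary 1100
        val y = 10    // binary 1010
        println(x & y)     // 8   (1000)
        println(x | y)     // 14  (1110)
        println(x ^ y)     // 6   (0110)
        println(x << 2)    // 48  (110000)
        println(x >> 2)    // 3   (11)
        println(~x)        // -13 (all 32 bits inverted)
        println(-8 >>> 1)  // 2147483644 (right shift with zero fill)
      }
    }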

Built-in Control Structures


Control structures in Scala are the building blocks that dictate the flow of control through a
program. Scala's control structures include loops, conditionals, and functions, among others.
Let's start by examining the basic control structures in Scala:
 If-Else Statement
 Looping Structures
 For Loop
 While Loop
 Functions
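
A small sketch showing these control structures together:

    object ControlDemo {
      def main(args: Array[String]): Unit = {
        val n = 7

        // if-else is an expression in Scala: it yields a value
        val parity = if (n % 2 == 0) "even" else "odd"
        println(s"$n is $parity")

        // for loop over a range
        for (i <- 1 to 3) println(s"for iteration $i")

        // while loop
        var count = 0
        while (count < 3) {
          println(s"while iteration $count")
          count += 1
        }
      }
    }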

Functions & Closures:


A function is a collection of statements that perform a certain task. One can divide up the code
into separate functions, keeping in mind that each function must perform a specific task.
Functions are used to put common and repeated tasks into a single function, so instead
of writing the same code again and again for different inputs, we can simply call the function.
Scala is considered a functional programming language, so functions play an important role in it.
They make the code easier to debug and modify. Scala functions are first-class values.
In general, a function declaration and definition has six components:
 def keyword: the “def” keyword is used to declare a function in Scala.
 function_name: it should be a valid name in lower camel case. Function names in Scala can contain characters like +, ~, &, –, ++, \, / etc.
 parameter_list: a comma-separated list of the input parameters, each followed by its data type, within the enclosing parentheses.
 return_type: the data types of the parameters must always be given, while the return type of the function itself is optional. If you don’t specify a return type, the default return type is Unit, which is equivalent to void in Java.
 =: a user can create a function with or without the = (equals) operator. If it is used, the function returns the computed value; if it is omitted, the function does not return a value and works like a subroutine.
 Method body: the method body is enclosed between braces { }. It contains the code that needs to be executed to perform the intended operations (see the sketch after this list).
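
A sketch illustrating the six components listed above; the function names are illustrative.

    object FunctionDemo {
      def main(args: Array[String]): Unit = {
        // def keyword, name, parameter list with types, explicit return type, '=', body in braces
        def add(x: Int, y: Int): Int = {
          x + y
        }

        // return type inferred by the compiler; a single-expression body needs no braces
        def square(x: Int) = x * x

        // without '=': the older "procedure syntax" (deprecated in recent Scala versions);
        // the function returns Unit and behaves like a subroutine
        def greet(name: String) { println(s"Hello, $name") }

        println(add(2, 3))    // 5
        println(square(4))    // 16
        greet("Scala")
      }
    }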

SCALA Closures:
Scala closures are functions that use one or more free variables, and the return value of
such a function depends on those variables. The free variables are defined outside of the
closure and are not included as parameters of the function; this is the difference between
a closure and a normal function. In other words, a free variable is any variable that is
neither defined within the function nor passed as a parameter of the function: the function
itself does not bind the free variable to a value.
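
A minimal sketch of a closure; here bonus is a free variable, defined outside the function and not passed as a parameter, yet the function's result depends on it. The names are illustrative.

    object ClosureDemo {
      def main(args: Array[String]): Unit = {
        var bonus = 10                                 // free variable, defined outside the function

        val addBonus = (marks: Int) => marks + bonus   // closure: 'bonus' is not a parameter

        println(addBonus(50))   // 60
        bonus = 20              // the closure sees the updated free variable
        println(addBonus(50))   // 70
      }
    }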

Inheritance in SCALA:
Inheritance is an important pillar of OOP (Object Oriented Programming). It is the
mechanism in Scala by which one class is allowed to inherit the features(fields and
methods) of another class.
Important terminology:
Super Class: The class whose features are inherited is known as superclass (or a base class
or a parent class).
Sub Class: The class that inherits the other class is known as subclass (or a derived class,
extended class, or child class). The subclass can add its own fields and methods in addition
to the superclass fields and methods.
Reusability: Inheritance supports the concept of “reusability”, i.e. when we want to create
a new class and there is already a class that includes some of the code that we want, we
can derive our new class from the existing class. By doing this, we are reusing the fields
and methods of the existing class.
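
A minimal sketch of inheritance using the extends keyword; the class names are illustrative. Employee (the subclass) reuses the fields and methods of Person (the superclass) and adds its own.

    class Person(val name: String) {                  // superclass (base / parent class)
      def describe(): String = s"Person: $name"
    }

    class Employee(name: String, val company: String) extends Person(name) {   // subclass
      override def describe(): String = s"Employee: $name works at $company"
    }

    object InheritanceDemo {
      def main(args: Array[String]): Unit = {
        val p = new Person("Ravi")
        val e = new Employee("Meera", "Acme")
        println(p.describe())   // Person: Ravi
        println(e.describe())   // Employee: Meera works at Acme
      }
    }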
