BIG_DATA_Unit 4
(KCS061)
Unit-4
3rd YEAR (6th Sem)
(2023-24)
HDFS High Availability & Federation
High Availability
HDFS Federation
• HDFS federation enhances the existing HDFS architecture. In Hadoop 1, the entire cluster used a single namespace, and a single NameNode managed that namespace. If that NameNode failed, the whole Hadoop cluster went down.
• The Hadoop cluster remained unavailable until that NameNode came back, causing a loss of resources and time.
• HDFS federation in Hadoop 2 overcomes this limitation by allowing the use of more than one NameNode and thus more than one namespace.
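A federated cluster is configured by declaring each nameservice in hdfs-site.xml; a minimal sketch (the nameservice IDs ns1/ns2 and the hostnames below are placeholder examples, not from the slides):

```xml
<!-- hdfs-site.xml: minimal federation sketch with two NameNodes,
     each owning its own namespace (IDs and hosts are placeholders) -->
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>namenode1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>namenode2.example.com:8020</value>
  </property>
</configuration>
```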
HDFS Federation Benefits
• Isolation
• Namespace Scalability
• Enhanced Performance
8
Introduction to NoSQL
What is RDBMS
(Figure: a relation shown as a table, with columns as attributes and rows as tuples)
Issues with RDBMS - Scalability
Sharding
Scaling RDBMS

Master-Slave:
• All writes are written to the master. All reads are performed against the replicated slave databases.
• Critical reads may be incorrect as writes may not have been propagated down.
• Large data sets can pose problems as the master needs to duplicate data to the slaves.

Sharding:
• Scales well for both reads and writes.
• Not transparent; the application needs to be partition-aware.
• Can no longer have relationships or joins across partitions.
• Loss of referential integrity across shards.
12
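The "partition-aware application" cost of sharding can be seen in a small sketch: the client code, not the database, routes each key to its shard. This is plain Python; the hash-modulo scheme and shard count are illustrative assumptions, not from the slides.

```python
# Minimal sketch of application-level sharding: the application,
# not the database, decides which shard holds each key.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one database

def shard_for(key: str) -> int:
    # Hash-based routing: every read and write must go through this.
    return hash(key) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Asha"})
print(get("user:42"))  # comes back from whichever shard owns the key

# A join across users on different shards is no longer a single query:
# the application would have to fetch from each shard and combine results itself.
```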
What is NoSQL
5
• Stands for Not Only SQL. The term was coined by Carlo Strozzi and later redefined by Eric Evans.
• Class of non-relational data storage systems.
• Do not require a fixed table schema, nor do they use the concept of joins.
• Relax one or more of the ACID properties (Atomicity, Consistency, Isolation, Durability), as explained by the CAP theorem.
13
Need of NoSQL
• Explosion of social media sites (Facebook, Twitter, Google, etc.) with large data needs.
15
Key Value Pair Based
8
16
Column based
It stores data as column families
19
CAP Theorem
According to Eric Brewer, a distributed system has three properties:
• Consistency
• Availability
• Partition tolerance
We can have at most two of these three properties for any shared-data system. NoSQL systems typically relax consistency; as a result they are easy to distribute and do not require a strict schema.
21
What is not provided by NoSQL
14
Joins
Group by
ACID transactions
SQL
Integration with applications that are based on SQL
22
Where to use NoSQL
15
23
MongoDB
What is MongoDB?
• MongoDB is a document-oriented NoSQL
database used for high-volume data storage.
• Its document data model allows you to represent hierarchical relationships.
• It uses JSON-like documents with optional
schema instead of using tables and rows in
traditional relational databases.
• Documents containing key-value pairs are the
basic units of data in MongoDB.
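A JSON-like document with embedded (hierarchical) data can be sketched in plain Python; the field names below are illustrative, not from the slides.

```python
import json

# A MongoDB-style document: key-value pairs, with an embedded
# document ("address") and an embedded array ("langs") -- the
# hierarchy lives inside one record instead of joined tables.
comment = {
    "name": "Harry",
    "lang": "JavaScript",
    "member_since": 5,
    "address": {"city": "Delhi", "pin": "110001"},  # embedded document
    "langs": ["JavaScript", "Python"],              # embedded array
}

print(json.dumps(comment, indent=2))
```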
25
MongoDB Characteristics
• Document Oriented
• Support for ad hoc queries
• Replication
• Indexing
• Load balancing
• Schema-less database
• Sharding
• High performance
• GridFS
26
MongoDB Applications
27
MongoDB v/s RDBMS
MongoDB:
• Document-oriented and non-relational database
• Document based
• Field based
• Collection based and key–value pair
• Embedded documents
• Has dynamic schema and ideal for hierarchical data storage
• 100 times faster and horizontally scalable through sharding

RDBMS:
• Relational database
• Row based
• Column based
• Table based
• Supports joins
• Has predefined schema and not good for hierarchical data storage
• By increasing RAM, vertical scaling can happen
28
Database Commands
• Show Collections
> show collections
• Create a collection named 'comments'
> db.createCollection('comments')
• Delete a collection named 'comments'
> db.comments.drop()
30
Row(Document) Commands
• Show all Rows in a Collection
> db.comments.find()
• Show all Rows in a Collection (Prettified)
> db.comments.find().pretty()
• Find the first row matching the object
> db.comments.findOne({name: 'Harry'})
31
Row(Document) Commands
• db.list.find() prints out all the documents in the list collection.
32
Row(Document) Commands
• If you call the pretty() method, the result is printed in a formatted, indented form.
33
Row(Document) Commands
• Insert One Row
• db.comments.insert({ 'name': 'Harry', 'lang': 'JavaScript',
'member_since': 5 })
• Insert many Rows
• db.comments.insertMany([
{ 'name': 'Harry', 'lang': 'JavaScript', 'member_since': 5 },
{'name': 'Rohan', 'lang': 'Python', 'member_since': 3 },
{'name': 'Lovish', 'lang': 'Java', 'member_since': 4 }
])
34
Row(Document) Commands
• Search in a MongoDb Database
> db.comments.find({lang:'Python'})
• Limit the number of rows in output
> db.comments.find().limit(2)
• Count the number of rows in the output
> db.comments.find().count()
35
Row(Document) Commands
• Update a row
• db.comments.updateOne( {name: 'John'},
{$set: {'name': 'Harry', 'lang': 'JavaScript',
'member_since': 51 } } )
36
Row(Document) Commands
• Mongodb Rename Operator
• db.comments.update({name: 'Rohan'}, {$rename:
{ member_since: 'member' }})
• Delete Row
• db.comments.remove({name: 'Harry'})
• db.comments.remove({name: 'Harry'},
{justOne: true})
• db.comments.deleteOne()
• db.comments.deleteMany()
37
Advantages of MongoDB
• Flexible Database
• Sharding
• High Speed
• High Availability
• Scalability
• Ad-hoc Query Support
• Easy Environment Setup
40
Disadvantages of MongoDB
• Joins not Supported
• High Memory Usage
• Limited Data Size
• Limited Nesting
41
Spark
Apache Spark Overview
43
DAG
• Stands for Directed Acyclic Graph
• For every Spark job, a DAG of tasks is created
44
Spark APIs
• APIs for
• Java
• Python
• Scala
• R
• Spark itself is written in Scala
45
46
Spark Libraries
• Spark SQL
• For working with structured data. Allows you to
seamlessly mix SQL queries with Spark programs
• Spark Streaming
• Allows you to build scalable fault-tolerant streaming
applications
• MLlib
• Implements common machine learning algorithms
• GraphX
• For graphs and graph-parallel computation
47
Spark Architecture
48
What is Spark Used For?
• Stream processing
• log files
• sensor data
• financial transactions
• Machine learning
• store data in memory and rapidly run repeated queries
• Interactive analytics
• business analysts and data scientists increasingly want to
explore their data by asking a question, viewing the result, and
then either altering the initial question slightly or
drilling deeper into results.
This interactive query process requires systems such as
Spark that are able to respond and adapt quickly
• Data integration
• Extract, transform, and load (ETL) processes
49
Reasons to choose Spark
• Simplicity
• All capabilities are accessible via a set of rich APIs
• designed for interacting quickly and easily with data at scale
• well documented
• Speed
• designed for speed, operating both in memory and on disk
• Support
• supports a range of programming languages
• includes native support for tight integration with a number of leading
storage solutions in the Hadoop ecosystem and beyond
• large, active, and international community
• A growing set of commercial providers
• including Databricks, IBM, and all of the main Hadoop vendors deliver
comprehensive support for Spark-based solutions
50
Spark execution model
• Application
• Driver
• Executer
• Job
• Stage
51
Spark execution model
52
Spark execution model
• At runtime, a Spark application maps to a single driver process and a set of executor processes distributed across the hosts in a cluster.
• The driver process manages the job flow and schedules tasks, and is available the entire time the application is running.
• Typically, this driver process is the same as the client process used to initiate the job.
• In interactive mode, the shell itself is the driver process.
• The executors are responsible for executing work, in the form of tasks, as well as for storing any data that you cache.
• Invoking an action inside a Spark application triggers the launch of a job to fulfill it.
• Spark examines the dataset on which that action depends and formulates an execution plan.
• The execution plan assembles the dataset transformations into stages. A stage is a collection of tasks that run the same code, each on a different subset of the data.
53
Cluster Managers
• Yarn
• Spark Standalone
• Mesos
54
Basic Programming Model
• Spark’s basic data model is called a Resilient
Distributed Dataset (RDD)
• It is designed to support in-memory data storage,
distributed across a cluster
• fault-tolerant
• tracking the lineage of transformations applied to data
• Efficient
• parallelization of processing across multiple nodes in the cluster
• Immutable
• Partitioned
55
RDDs
• Two basic types of operations on RDDs
• Transformations
• Transform an RDD into another RDD, such as mapping,
filtering, and more
• Actions
• Process an RDD into a result, such as count, collect, save, …
• The original RDD remains unchanged throughout
• The chain of transformations from RDD1 to RDDn
are logged
• and can be repeated in the event of data loss or the
failure of a cluster node
56
RDDs
• Data does not have to fit on a single machine
• Data is separated into partitions
• RDDs remain in memory
• greatly increasing the performance of the cluster,
particularly in use cases with a requirement for
iterative queries or processes
57
RDDs
58
Transformations
• Transformations are lazily processed, only upon an action
• Transformations create a new RDD from an existing one
• Transformations might trigger an RDD repartitioning, called a
shuffle
• Intermediate results can be manually cached in memory/on
disk
• Spill to disk can be handled automatically
59
RDDs
• Transformation
• There are two kinds of transformations:
a. Narrow Transformations
It is the result of map, filter and similar operations where the data comes from a single partition only, i.e. it is self-sufficient. An output RDD has partitions with records that originate from a single partition in the parent RDD. Only a limited subset of partitions is used to calculate the result.
60
RDDs
b. Wide Transformations
It is the result of groupByKey() and reduceByKey() like functions. The data required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are also known as shuffle transformations because they trigger a shuffle of data across partitions.
61
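The narrow/wide distinction can be mimicked in plain Python (a conceptual sketch, not Spark code; the partition contents below are illustrative):

```python
from collections import defaultdict

# Pretend each inner list is one partition of an RDD.
partitions = [[1, 2, 3], [4, 5], [6]]

# Narrow transformation (like map): each output partition is computed
# from exactly one parent partition -- no data movement needed.
mapped = [[x * 10 for x in part] for part in partitions]

# Wide transformation (like reduceByKey): records with the same key may
# live in different partitions, so values must be shuffled together first.
pair_partitions = [[("even", 2), ("odd", 1)], [("even", 4)], [("odd", 3)]]
shuffled = defaultdict(list)
for part in pair_partitions:
    for key, value in part:
        shuffled[key].append(value)          # the "shuffle" step
reduced = {key: sum(vals) for key, vals in shuffled.items()}

print(mapped)   # [[10, 20, 30], [40, 50], [60]]
print(reduced)  # {'even': 6, 'odd': 4}
```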
Spark transformations
62
Spark transformations
63
Spark actions
64
Spark actions
65
RDD and cache
• Spark can persist (or cache) a dataset in memory
across operations
• Each node stores in memory any slices of it that it
computes and reuses them in other actions on that
dataset – often making future actions more than 10x
faster
• The cache is fault-tolerant: if any partition of an RDD
is lost, it will automatically be recomputed using
the transformations that originally created it
• You can mark an RDD to be persisted using the
persist() or cache() methods on it
66
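Lineage-based caching and recovery can be sketched with a toy object in plain Python (a conceptual illustration only; ToyRDD and its methods are invented for this sketch, not a Spark API):

```python
# A toy RDD-like object: it remembers its lineage (the function that
# produces it) and only caches the result after persist() + first use.
class ToyRDD:
    def __init__(self, compute):
        self.compute = compute      # lineage: how to (re)build the data
        self.persisted = False
        self._cache = None

    def persist(self):
        self.persisted = True
        return self

    def collect(self):
        if self._cache is not None:
            return self._cache              # served from memory
        data = self.compute()               # computed from lineage
        if self.persisted:
            self._cache = data
        return data

    def drop_partition(self):
        # Simulate losing the cached copy; lineage allows recomputation.
        self._cache = None

rdd = ToyRDD(lambda: [x * x for x in range(5)]).persist()
print(rdd.collect())      # computed once: [0, 1, 4, 9, 16]
rdd.drop_partition()
print(rdd.collect())      # recomputed transparently from lineage
```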
Installing Spark
• Step 1: Installing Java
• We need Java 8 or later
• Step 2: Install Scala
• We will use Spark Scala Shell
• Step 3: Install Spark
67
For Cluster installations
• Each machine will need Spark in the same folder, and key-based passwordless SSH access from the master for the user running Spark
• Slave machines will need to be listed in the slaves file
• See spark/conf/
68
Spark features
• Fast processing
• In memory computing
• Flexible
• Fault tolerance
• Better analytics
69
Advantages of Spark
• Runs Programs up to 100x faster than Hadoop MapReduce
• Does the processing in the main memory of the worker
nodes
• Prevents unnecessary I/O operations with the disks
• Ability to chain the tasks at an application programming
level
• Minimizes the number of writes to the disks
• Uses Directed Acyclic Graph (DAG) data processing engine
• MapReduce is just one set of supported constructs
70
WordCount in Java
71
WordCount In Scala
• Map
• In this step, using Spark context variable, sc, we read a text file.
var map = sc.textFile("/path/to/text/file")
• then we split each line using space " " as separator.
var split = map.flatMap(line => line.split(" "))
• and we map each word to a tuple (word, 1), 1 being the number of occurrences of the word.
var mapf = split.map(word => (word, 1))
• We use the tuple (word,1) as (key, value) in reduce stage.
• Reduce
• We reduce all the words based on key
var counts = mapf.reduceByKey(_ + _);
• Save counts to local file
• The counts could be saved to a local file.
var reducef = counts.saveAsTextFile("/path/to/output/")
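The same map/flatMap/reduceByKey pipeline can be traced in plain Python (a conceptual mirror of the Scala job above, not PySpark; the sample lines are made up):

```python
from collections import Counter

# Stand-in for sc.textFile(...): a few lines of text.
lines = ["to be or not to be", "to see or not to see"]

# flatMap(line => line.split(" ")): one flat list of words
words = [word for line in lines for word in line.split(" ")]

# map(word => (word, 1)) followed by reduceByKey(_ + _):
# summing the 1s per key is exactly what Counter does.
counts = Counter()
for word in words:
    counts[word] += 1

print(dict(counts))
```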
Scala
What is Scala?
74
Features of Scala
76
Operators
• Arithmetic operators
• Assignment operators
• Logical operators
• Bitwise operators
• Relational operators
77
Data Types
78
Casting
• asInstanceOf
79
Operations
scala> var a = 955
a: Int = 955
// Import all definitions from scala.math. This only means you don't have to call math.sin(x), you can call sin(x) instead. To do this, begin your session with
scala> import scala.math._
import scala.math._
scala> print(4 - 3 / 5.0)
3.4
// at compile time these will change to primitives, not objects, to run faster
scala> println(math.sin(4*Pi/3).abs)
// 1) you can import classes  2) => math.abs(math.sin(4*Pi/3))
0.8660254037844385
80
If Expression
scala> var m="wood"
m: String = wood
81
Arrays
scala> var arr = Array(5, 4, 47, 7, 8, 7)
arr: Array[Int] = Array(5, 4, 47, 7, 8, 7)
scala> println(arr(1)); println(arr(2)); println(arr(3));
4
47
7
scala> var array = new Array[String](3);
array: Array[String] = Array(null, null, null)
82
Tuples
scala> print(tuples._1)
10
scala> print(tuples._2)
ten
scala> print(tuples._6._2) // 2nd element of the 6th element in the tuple
NINE
83
Lists
84
While
• while (Boolean Expression) { Expression }
• do { Expression } while (Boolean
Expression)
scala> var i = 0 ;
i : Int = 0
scala> while(i<10){
i+=1;
print(i+" ")
}
1 2 3 4 5 6 7 8 9 10
85
Functions
• Syntax
def 'method_name'('parameter': 'parameter_type'): 'return_type_of_method' = {
  'method_body'
  return 'value'
}
• Note: if you use "return" you must specify the return type; otherwise you are not obligated to, and a one-line function needs no braces.
86
Functions
object FuncEx {
def main(args: Array[String]): Unit = {
// Calling the function
println("Sum is: " + functionToAdd(5,3));
}
// declaration and definition of function
def functionToAdd(a:Int, b:Int) : Int =
{
var sum:Int = 0
sum = a + b
// returning the value of sum
return sum
}
}
Output:
Sum is: 8
87
Lazy val & Eager Evaluation
• lazy val vs. val
The difference between them is that a val is evaluated when it is defined, whereas a lazy val is evaluated when it is accessed for the first time.
In contrast to a method (defined with def), a lazy val is evaluated once and then never again. This can be useful when an operation takes a long time to complete and when it is not certain whether it will be used later. Languages like Scala are strict (eager) by default, but lazy if explicitly specified for given variables or parameters.
E.g.
scala> val x = 15
x: Int = 15
scala> lazy val y = 13
y: Int = <lazy>
scala> x
res0: Int = 15
scala> y
res1: Int = 13
88
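The val vs. lazy val behaviour can be mimicked in Python, which is eager by default (a conceptual sketch; the Lazy class is an illustrative stand-in, not a library API):

```python
class Lazy:
    """Compute a value on first access, then reuse it (like Scala's lazy val)."""
    def __init__(self, thunk):
        self.thunk = thunk
        self.evaluated = False
        self.value = None

    def get(self):
        if not self.evaluated:          # the body runs at most once
            self.value = self.thunk()
            self.evaluated = True
        return self.value

log = []

x = (log.append("eager"), 15)[1]        # plain val: evaluated immediately
y = Lazy(lambda: (log.append("lazy"), 13)[1])

print(log)        # ['eager']  -- y's body has not run yet
print(y.get())    # 13         -- first access triggers evaluation
print(y.get())    # 13         -- cached, not re-executed
print(log)        # ['eager', 'lazy']  (appended only once)
```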
Closures
89
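A closure is a function that captures variables from its enclosing scope; a plain-Python sketch (the counter example is illustrative, not from the slides):

```python
def make_counter(start=0):
    count = start                      # free variable captured by the closure

    def increment():
        nonlocal count                 # the inner function owns no copy;
        count += 1                     # it updates the captured variable
        return count

    return increment

counter = make_counter(10)
print(counter())  # 11
print(counter())  # 12  -- state survives between calls via the closure
```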
Classes
• Scala classes are blueprints or templates for creating objects. They contain information about fields, methods, constructors, super classes, etc. With the help of the class keyword we can define a class.
• To access the members of a class we need to create an object of the class. With the help of the new keyword we can create an object of the class.
90
Classes
class Smartphone
{
// Class variables
var number: Int = 16
var company: String = "ABC"
// Class method
def Display()
{
println("Name of the company : " + company);
println("Total number of Smartphone generation: " + number);
}
}
91
Objects
• We call the elements of a class type objects.
• We create an object by prefixing an application of the constructor of
the class with the operator new.
• It is a basic unit of Object Oriented Programming and represents the
real-life entities.
• In Scala use the object keyword, in place of class, to define an object.
92
Objects
object Main
{
// Main method
def main(args: Array[String]) :Unit = {
// Class object
var obj = new Smartphone();
obj.Display();
}
}
93
Scala Inheritance
class Employee {
  var salary: Float = 10000
}
// Programmer subclass (reconstructed from the output below)
class Programmer extends Employee {
  var bonus: Int = 5000
  println("Salary = " + salary)
  println("Bonus = " + bonus)
}
object MainObject {
  def main(args: Array[String]) {
    new Programmer()
  }
}
Output:
Salary = 10000.0
Bonus = 5000
95
Scala Inheritance
// Vehicle and Bike classes (reconstructed from the output below)
class Vehicle {
  def run() {
    println("Vehicle is running")
  }
}
class Bike extends Vehicle {
  override def run() {
    println("Bike is running")
  }
}
object MainObject {
  def main(args: Array[String]) {
    var b = new Bike()
    b.run()
  }
}
Output:
Bike is running
97
Thank You!
99