
Big Data

(KCS061)
Unit-4
3rd YEAR (6th Sem)
(2023-24)
HDFS High Availability &
Federation
High Availability
(diagram slides)
HDFS Federation
• HDFS Federation enhances the existing HDFS architecture. In Hadoop 1, the entire cluster used a single namespace managed by a single NameNode; if that NameNode failed, the whole Hadoop cluster went down.
• The cluster remained unavailable until the NameNode rejoined, costing both resources and time.
• HDFS Federation in Hadoop 2 overcomes this limitation by allowing more than one NameNode, and therefore more than one namespace.
HDFS Federation

HDFS Federation Benefits
• Isolation

• Namespace Scalability

• Enhanced Performance

Introduction to
NoSQL
What is RDBMS
 RDBMS: the relational database management system.
 Relation: a relation is a 2D table which has the following features:
 Name
 Attributes
 Tuples
Issues with RDBMS - Scalability
 Issues with scaling up when the dataset is just too big, e.g. Big Data.
 Not designed to be distributed.
 Looking at multi-node database solutions, known as ‘horizontal scaling’.
 Different approaches include:
 Master-slave
 Sharding
Scaling RDBMS
Master-Slave
 All writes are written to the master. All reads are performed against the replicated slave databases.
 Critical reads may be incorrect, as writes may not have been propagated down.
 Large data sets can pose problems, as the master needs to duplicate data to the slaves.
Sharding
 Scales well for both reads and writes.
 Not transparent; the application needs to be partition-aware.
 Can no longer have relationships or joins across partitions.
 Loss of referential integrity across shards.
What is NoSQL
• Stands for Not Only SQL. The term was redefined by Eric Evans after Carlo Strozzi.
• A class of non-relational data storage systems.
• Do not require a fixed table schema, nor do they use the concept of joins.
• Relax one or more of the ACID properties (Atomicity, Consistency, Isolation, Durability), guided by the CAP theorem.
Need of NoSQL
 Explosion of social media sites (Facebook, Twitter, Google, etc.) with large data needs.
 Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service).
 Just as with the move to dynamically-typed languages (Ruby/Groovy), a shift to dynamically-typed data with frequent schema changes.
 Expansion of the open-source community.
 A NoSQL solution is more acceptable to a client now than a year ago.
NoSQL Types
NoSQL databases are classified into four types:
• Key-value pair based
• Column based
• Document based
• Graph based
Key Value Pair Based
• Designed around the dictionary data structure: dictionaries contain a collection of records, each with fields containing data.
• Records are stored and retrieved using a key that uniquely identifies the record and is used to quickly find the data within the database.
• Example: CouchDB, Oracle NoSQL Database, Riak, etc.
 We use it for storing session information, user profiles, preferences, and shopping cart data.
 We would avoid it when we need to query data having relationships between entities.
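
A minimal Scala sketch of this key-value access pattern, using an in-memory Map as a stand-in for a real store such as Riak (the SessionStoreSketch name and the session fields are illustrative, not from the slides):

// Key-value access pattern: records are stored and fetched only by their unique key.
object SessionStoreSketch {
  // key (session id) -> record (a flat map of fields)
  private var store = Map.empty[String, Map[String, String]]

  def put(sessionId: String, record: Map[String, String]): Unit =
    store += (sessionId -> record)

  def get(sessionId: String): Option[Map[String, String]] =
    store.get(sessionId)

  def main(args: Array[String]): Unit = {
    put("sess-42", Map("user" -> "Harry", "cartItems" -> "3"))
    println(get("sess-42"))   // Some(Map(user -> Harry, cartItems -> 3))
    println(get("sess-99"))   // None: lookups are by key only, no joins or relationship queries
  }
}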
Column Based
 Column-based stores keep data as column families containing rows that have many columns associated with a row key. Each row can have different columns.
 Column families are groups of related data that is accessed together.
 Example: Cassandra, HBase, Hypertable, and Amazon DynamoDB.
 We use it for content management systems, blogging platforms, and log aggregation.
 We would avoid it for systems that are in early development or have changing query patterns.
Document Based
 The database stores and retrieves documents. It stores documents in the value part of a key-value store.
 Documents are self-describing, hierarchical tree data structures consisting of maps, collections, and scalar values.
 Example: Lotus Notes, MongoDB, CouchDB, OrientDB, RavenDB.
 We use it for content management systems, blogging platforms, web analytics, real-time analytics, and e-commerce applications.
 We would avoid it for systems that need complex transactions.
Graph Based
 Stores entities and the relationships between them as nodes and edges of a graph, respectively. Entities have properties.
 Traversing relationships is very fast, as the relationship between nodes is not calculated at query time but is persisted as a relationship.
 Example: Neo4J, Infinite Graph, OrientDB, FlockDB.
 It is well suited for connected data, such as social networks, spatial data, and routing information for goods and supply.
CAP Theorem
 According to Eric Brewer, a distributed system has three properties:
 Consistency
 Availability
 Partition tolerance
 We can have at most two of these three properties for any shared-data system.
 To scale out, we have to partition. This leaves a choice between consistency and availability (in almost all cases, we would choose availability over consistency).
 Everyone who builds big applications builds them on CAP: Google, Yahoo, Facebook, Amazon, eBay, etc.
Advantages of NoSQL
 Cheap and easy to implement (open source)
 Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
 When data is written, the latest version is on at least one node and is then replicated to other nodes
 No single point of failure
 Easy to distribute
 Doesn't require a strict schema
What is not provided by NoSQL
 Joins
 Group by
 ACID transactions
 SQL
 Integration with applications that are based on SQL
Where to use NoSQL
 NoSQL data storage systems make sense for applications that process very large semi-structured data, like log analysis, social networking feeds, and time-based data.
 To improve programmer productivity by using a database that better matches an application's needs.
 To improve data access performance via some combination of handling larger data volumes, reducing latency, and improving throughput.
MongoDB
What is MongoDB?
• MongoDB is a document-oriented NoSQL database used for high-volume data storage.
• Its data model allows you to represent hierarchical relationships.
• It uses JSON-like documents with optional schemas instead of the tables and rows of traditional relational databases.
• Documents containing key-value pairs are the basic units of data in MongoDB.
MongoDB Characteristics

• Document Oriented
• Support for ad hoc queries
• Replication
• Indexing
• Load balancing
• Schema-less database
• Sharding
• High performance
• GridFS
MongoDB Applications
• E-commerce product catalogues
• Big data
• Content management
• Real-time analytics
• Maintaining geolocations, e.g.
  location: { type: "Polygon", coordinates: [ [ [ -73.564453125, 41.178653972331674 ], [ -71.69128417968749, 41.178653972331674 ], [ -71.69128417968749, 42.114523952464246 ], [ -73.564453125, 42.114523952464246 ], [ -73.564453125, 41.178653972331674 ] ] ] }
• Maintaining data from social websites
MongoDB v/s RDBMS
MongoDB
• Document-oriented and non-relational database
• Document based
• Field based
• Collection based and key-value pair
• Embedded documents
• Has dynamic schema and ideal for hierarchical data storage
• 100 times faster and horizontally scalable through sharding
RDBMS
• Relational database
• Row based
• Column based
• Table based
• Supports joins
• Has predefined schema and not good for hierarchical data storage
• Vertical scaling by increasing RAM
Database Commands

• View all databases


> show dbs
• Create a new or switch databases
> use dbName
• View current Database
> db
• Delete Database
> db.dropDatabase()
Collection Commands

• Show Collections
> show collections
• Create a collection named 'comments'
> db.createCollection('comments')
• Delete a collection named 'comments'
> db.comments.drop()

Row(Document) Commands
• Show all Rows in a Collection
> db.comments.find()
• Show all Rows in a Collection (Prettified)
> db.comments.find().pretty()
• Find the first row matching the object
> db.comments.findOne({name: 'Harry'})

Row(Document) Commands
• db.list.find() prints out all documents in the list collection.
• Calling db.list.find().pretty() prints each document in an indented, multi-line format.
(sample output screenshots omitted)
Row(Document) Commands
• Insert One Row
• db.comments.insert({ 'name': 'Harry', 'lang': 'JavaScript',
'member_since': 5 })
• Insert many Rows
• db.comments.insertMany([
{ 'name': 'Harry', 'lang': 'JavaScript', 'member_since': 5 },
{'name': 'Rohan', 'lang': 'Python', 'member_since': 3 },
{'name': 'Lovish', 'lang': 'Java', 'member_since': 4 }
])

Row(Document) Commands
• Search in a MongoDB database
> db.comments.find({lang:'Python'})
• Limit the number of rows in output
> db.comments.find().limit(2)
• Count the number of rows in the output
> db.comments.find().count()

Row(Document) Commands
• Update a row
• db.comments.updateOne( {name: 'John'}, {$set: {'name': 'Harry', 'lang': 'JavaScript', 'member_since': 51 } } )

*The above example uses the db.collection.updateOne() method on the comments collection to update the first document where name equals “John”.
Row(Document) Commands
• MongoDB rename operator
• db.comments.update({name: 'Rohan'}, {$rename: { member_since: 'member' }})
• Delete rows
• db.comments.remove({name: 'Harry'})
• db.comments.remove({name: 'Harry'}, {justOne: true})
• db.comments.deleteOne({name: 'Harry'})    (deletes the first document matching the filter)
• db.comments.deleteMany({lang: 'Python'})  (deletes every document matching the filter)
Advantages of MongoDB
• Flexible Database
• Sharding
• High Speed
• High Availability
• Scalability
• Ad-hoc Query Support
• Easy Environment Setup

Disadvantages of MongoDB
• Joins not Supported
• High Memory Usage
• Limited Data Size
• Limited Nesting

Spark
Apache Spark Overview
• An in-memory big data platform that performs especially well with iterative algorithms
• Originally developed at UC Berkeley starting in 2009; moved to an Apache project in 2013
• 10-100x speedup over Hadoop with some algorithms, especially iterative ones as found in machine learning
DAG
• Stands for Directed Acyclic Graph
• For every Spark job, a DAG of tasks is created
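
As a rough illustration of the DAG idea, the lineage Spark builds for a computation can be inspected from the Spark shell with toDebugString (the file path below is illustrative):

// spark-shell sketch: build a small chain of transformations and print its lineage.
val lines  = sc.textFile("/path/to/text/file")
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

// The printed lineage reflects the DAG; the indentation marks the stage
// boundary introduced by the shuffle that reduceByKey needs.
println(counts.toDebugString)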
Spark APIs
• APIs for
• Java
• Python
• Scala
• R
• Spark itself is written in Scala

Spark Libraries
• Spark SQL
• For working with structured data. Allows you to
seamlessly mix SQL queries with Spark programs
• Spark Streaming
• Allows you to build scalable fault-tolerant streaming
applications
• MLlib
• Implements common machine learning algorithms
• GraphX
• For graphs and graph-parallel computation

Spark Architecture

What is Spark Used For?
• Stream processing
• log files
• sensor data
• financial transactions
• Machine learning
• store data in memory and rapidly run repeated queries
• Interactive analytics
• business analysts and data scientists increasingly want to
explore their data by asking a question, viewing the result, and
then either altering the initial question slightly or
drilling deeper into results.
This interactive query process requires systems such as
Spark that are able to respond and adapt quickly
• Data integration
• Extract, transform, and load (ETL) processes
Reasons to choose Spark
• Simplicity
• All capabilities are accessible via a set of rich APIs
• designed for interacting quickly and easily with data at scale
• well documented
• Speed
• designed for speed, operating both in memory and on disk
• Support
• supports a range of programming languages
• includes native support for tight integration with a number of leading
storage solutions in the Hadoop ecosystem and beyond
• large, active, and international community
• A growing set of commercial providers
• including Databricks, IBM, and all of the main Hadoop vendors deliver
comprehensive support for Spark-based solutions

Spark execution model
• Application
• Driver
• Executor
• Job
• Stage

Spark execution model

Spark execution model
• At runtime, a Spark application maps to a single driver process and a set of executor processes distributed across the hosts in a cluster.
• The driver process manages the job flow, schedules tasks, and is available the entire time the application is running.
• Typically, this driver process is the same as the client process used to initiate the job.
• In interactive mode, the shell itself is the driver process.
• The executors are responsible for executing work, in the form of tasks, as well as for storing any data that you cache.
• Invoking an action inside a Spark application triggers the launch of a job to fulfill it.
• Spark examines the dataset on which that action depends and formulates an execution plan.
• The execution plan assembles the dataset transformations into stages. A stage is a collection of tasks that all run the same code, each on a different subset of the data.
Cluster Managers
• YARN
• Spark Standalone
• Mesos
Basic Programming Model
• Spark’s basic data model is called a Resilient
Distributed Dataset (RDD)
• It is designed to support in-memory data storage,
distributed across a cluster
• fault-tolerant
• tracking the lineage of transformations applied to data
• Efficient
• parallelization of processing across multiple nodes in the cluster
• Immutable
• Partitioned

RDDs
• Two basic types of operations on RDDs
• Transformations
• Transform an RDD into another RDD, such as mapping, filtering, and more
• Actions
• Process an RDD into a result, such as count, collect, save, …
• The original RDD remains unchanged throughout
• The chain of transformations from RDD1 to RDDn is logged
• and can be repeated in the event of data loss or the failure of a cluster node
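
A small spark-shell sketch of the idea above (the numbers are illustrative): transformations only describe new RDDs, an action triggers the actual computation, and the original RDD is left unchanged.

val numbers = sc.parallelize(1 to 10)          // original RDD
val evens   = numbers.filter(_ % 2 == 0)       // transformation: nothing runs yet
val squared = evens.map(n => n * n)            // transformation: still nothing runs

println(squared.count())                       // action: triggers execution -> 5
println(squared.collect().mkString(", "))      // action -> 4, 16, 36, 64, 100
println(numbers.count())                       // original RDD is intact -> 10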
RDDs
• Data does not have to fit on a single machine
• Data is separated into partitions
• RDDs remain in memory, greatly increasing the performance of the cluster, particularly in use cases with a requirement for iterative queries or processes
RDDs

Transformations
• Transformations are lazily processed, only upon an action
• Transformations create a new RDD from an existing one
• Transformations might trigger an RDD repartitioning, called a
shuffle
• Intermediate results can be manually cached in memory/on
disk
• Spill to disk can be handled automatically

RDDs
• Transformation
• There are two kinds of transformations:
a. Narrow Transformations
These result from operations such as map and filter, where the data needed is in a single partition only, i.e. each partition is self-sufficient. An output RDD has partitions with records that originate from a single partition in the parent RDD. Only a limited subset of partitions is used to calculate the result.
RDDs
b. Wide Transformations
These result from functions such as groupByKey() and reduceByKey(). The data required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are also known as shuffle transformations because they may require a shuffle to bring the records for a key together (the shuffle can be skipped when the parent RDD is already partitioned appropriately).
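
A short sketch contrasting the two kinds of transformations (the data values are illustrative):

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

// Narrow: map and filter only need the partition they already operate on.
val pairs    = words.map(word => (word, 1))
val nonEmpty = pairs.filter { case (w, _) => w.nonEmpty }

// Wide: reduceByKey must gather every record with the same key, which may
// live in different partitions of the parent RDD, so it generally shuffles.
val counts = nonEmpty.reduceByKey(_ + _)
println(counts.collect().toMap)                // Map(a -> 3, b -> 2, c -> 1)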
Spark transformations
(table slides listing common transformations)
Spark actions
(table slides listing common actions)
RDD and cache
• Spark can persist (or cache) a dataset in memory
across operations
• Each node stores in memory any slices of it that it
computes and reuses them in other actions on that
dataset – often making future actions more than 10x
faster
• The cache is fault-tolerant: if any partition of an RDD
is lost, it will automatically be recomputed using
the transformations that originally created it
• You can mark an RDD to be persisted using the
persist() or cache() methods on it
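
A minimal sketch of caching an RDD that several actions reuse (the log file path and its contents are illustrative):

import org.apache.spark.storage.StorageLevel

val logs   = sc.textFile("/path/to/log/file")
val errors = logs.filter(_.contains("ERROR")).cache()    // cache() = persist(MEMORY_ONLY)

println(errors.count())                                   // first action: computes the filter and caches the partitions
println(errors.filter(_.contains("timeout")).count())     // later actions reuse the cached partitions

// persist() also accepts other storage levels, e.g. spilling to disk:
// errors.persist(StorageLevel.MEMORY_AND_DISK)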
Installing Spark
• Step 1: Install Java
• We need Java 8 or later
• Step 2: Install Scala
• We will use the Spark Scala shell
• Step 3: Install Spark
• Step 4: Within the “spark” directory, run:
• ./bin/spark-shell
For cluster installations
• Each machine will need Spark in the same folder, and key-based passwordless SSH access from the master for the user running Spark
• Slave machines will need to be listed in the slaves file
• See spark/conf/
Spark features
• Fast processing
• In-memory computing
• Flexible
• Fault tolerance
• Better analytics
Advantages of Spark
• Runs Programs up to 100x faster than Hadoop MapReduce
• Does the processing in the main memory of the worker
nodes
• Prevents unnecessary I/O operations with the disks
• Ability to chain the tasks at an application programming
level
• Minimizes the number of writes to the disks
• Uses Directed Acyclic Graph (DAG) data processing engine
• MapReduce is just one set of supported constructs

WordCount in Java
(code slide)
WordCount In Scala
• Map
• In this step, using the Spark context variable sc, we read a text file.
  var map = sc.textFile("/path/to/text/file")
• Then we split each line using a space " " as the separator.
  var split = map.flatMap(line => line.split(" "))
• And we map each word to a tuple (word, 1), 1 being the number of occurrences of the word.
  var mapf = split.map(word => (word, 1))
• We use the tuple (word, 1) as (key, value) in the reduce stage.
• Reduce
• We reduce all the words based on the key.
  var counts = mapf.reduceByKey(_ + _)
• Save counts to a local file
• The counts can be saved to a local file.
  var reducef = counts.saveAsTextFile("/path/to/output/")
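
The same steps gathered into one snippet that can be pasted into the spark-shell (the input and output paths are illustrative):

val lines  = sc.textFile("/path/to/text/file")
val words  = lines.flatMap(line => line.split(" "))
val pairs  = words.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)      // the reduce is applied to the (word, 1) pairs
counts.saveAsTextFile("/path/to/output/")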
Scala
What is Scala?
• The Scala programming language was created in 2001 by Martin Odersky to combine functional programming and object-oriented programming into one language.
• Scala is a general-purpose, high-level, multi-paradigm programming language.
• It is a pure object-oriented programming language which also provides support for the functional programming approach.
• Scala programs compile to bytecode and run on the JVM (Java Virtual Machine).
• Scala stands for SCAlable LAnguage.
• Scala is highly influenced by Java and some other programming languages like Lisp, Haskell, Pizza, etc.
Features of Scala
• There are many features which make it different from other languages.
• Object-Oriented: Every value in Scala is an object, so it is a purely object-oriented programming language.
• Functional: It provides support for higher-order functions, nested functions, etc.
• Statically Typed: The process of verifying and enforcing the constraints of types is done at compile time in Scala.
• Runs on the JVM & Can Execute Java Code: The Scala compiler compiles the program into a .class file containing bytecode that can be executed by the JVM.
The Basics of Language
• Data Types
• Casting
• Operations
• If Expression
• Arrays & Tuples & Lists
• For & While Comprehension
• Functions
• Call By ...
Operators
• Arithmetic operators
• Assignment operators
• Logical operators
• Bitwise operators
• Relational operators
Data Types
• Byte
• Short
• Int
• Long
• Float
• Double
• Char
• Boolean
• Unit

scala> var a:Int=2
a: Int = 2

scala> var a:Double=2.2
a: Double = 2.2

scala> var a:Float=2.2f
a: Float = 2.2

scala> var a:Byte=4
a: Byte = 4

scala> var a:Boolean=false
a: Boolean = false
Casting
• asInstanceOf

scala> var a:Int=955
a: Int = 955

scala> var b=a.asInstanceOf[Char]
b: Char = λ
Operations
scala> var a=955
a: Int = 955

// Import all definitions from scala.math. This only means you don't have to call math.sin(x), you can call sin(x) instead. To do this, begin your session with
scala> import scala.math._
import scala.math._

scala> print(4 - 3/5.0)   // at compile time these change to primitives, not objects, to run faster
3.4

scala> print(a+3)         // a.+(3), i.e. + is a method on a
958

scala> println(math.sin(4*Pi/3).abs)   // 1) you can import classes  2) equivalent to math.abs(math.sin(4*Pi/3))
0.8660254037844385

scala> if(a==955)
     |   print("lambda")
lambda

scala> println("HOLY" >= ("EVIL"))
true
If Expression
scala> var m="wood"
m: String = wood

scala> println(m.length())
4

scala> println(s"$m is brown")
wood is brown

scala> if (m=="wood") {   // == is equivalent to equals
         print("brown")
       }
       else if (m.equals("grass")) {
         println("green")
       }
       else {
         print("i don't know")
       }
brown
Arrays
scala> var arr = Array(5, 4, 47, 7, 8, 7)
arr: Array[Int] = Array(5, 4, 47, 7, 8, 7)

scala> println(arr(1)); println(arr(2)); println(arr(3));
4
47
7

scala> var array = new Array[String](3)
array: Array[String] = Array(null, null, null)

scala> var at = Array(4, 4.7, "Gasai Yuno")
at: Array[Any] = Array(4, 4.7, Gasai Yuno)
Tuples
scala> var tuples = (2*5, "ten", 10.0d, 10.0f, 10.0, (9, "NINE"))
tuples: (Int, String, Double, Float, Double, (Int, String)) = (10,ten,10.0,10.0,10.0,(9,NINE))

scala> print(tuples._1)
10

scala> print(tuples._2)
ten

scala> print(tuples._6._2)   // 2nd element of the 6th element in the tuple
NINE
Lists
scala> var lst = List("b", "c", "d")
lst: List[String] = List(b, c, d)

scala> print(lst.head)
b

scala> print(lst.tail)
List(c, d)

scala> var lst2 = "a" :: lst
lst2: List[String] = List(a, b, c, d)
While
• while (Boolean Expression) { Expression }
• do { Expression } while (Boolean Expression)

scala> var i = 0;
i: Int = 0

scala> while(i<10){
         i+=1;
         print(i+" ")
       }
1 2 3 4 5 6 7 8 9 10

Note: In functional style it is preferred not to use while loops, or imperative style in general.
Functions
• Syntax
def method_name(parameter: parameter_type, ...): return_type_of_method = {
  // method_body
  return value
}

• Note: if you use "return" you need to specify the return type, otherwise it is optional; and a one-line function needs no brackets.
Functions
object FuncEx {
def main(args: Array[String]): Unit = {
// Calling the function
println("Sum is: " + functionToAdd(5,3));
}
// declaration and definition of function
def functionToAdd(a:Int, b:Int) : Int =
{
var sum:Int = 0
sum = a + b
// returning the value of sum
return sum
}
}
Output:
Sum is: 8

Lazy val & Eager Evaluation
• lazy val vs. val
 The difference between them is that a val is executed when it is defined, whereas a lazy val is executed when it is accessed the first time.
 In contrast to a method (defined with def), a lazy val is executed once and then never again. This can be useful when an operation takes a long time to complete and it is not certain whether the result will be used later. Languages like Scala are strict by default, but lazy evaluation can be specified explicitly for given variables or parameters.
 E.g.
scala> val x = 15
x: Int = 15
scala> lazy val y = 13
y: Int = <lazy>
scala> x
res0: Int = 15
scala> y
res1: Int = 13
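
A small sketch contrasting val, lazy val, and def evaluation (the printed messages are illustrative):

object LazyValSketch {
  def main(args: Array[String]): Unit = {
    val eager   = { println("val: evaluated immediately"); 1 }        // runs once, at definition
    lazy val la = { println("lazy val: evaluated on first use"); 2 }  // runs once, on first access
    def method  = { println("def: evaluated on every call"); 3 }      // runs on each call

    println(la + la)          // lazy val message printed once; prints 4
    println(method + method)  // def message printed twice; prints 6
    println(eager)            // already evaluated above; prints 1
  }
}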
Closures
 A closure is a function whose return value depends on the value of one or more variables declared outside this function.

scala> var votingAge = 18
votingAge: Int = 18

scala> val isVotingAge = (age: Int) => age >= votingAge
isVotingAge: Int => Boolean = <function1>

scala> isVotingAge(16)
res2: Boolean = false
Classes
• Scala classes are blueprints or templates for creating objects. They contain information about fields, methods, constructors, super classes, etc. A class is defined with the class keyword.
• To access the members of a class we need to create an object of the class, which is done with the new keyword.
Classes
class Smartphone
{

// Class variables
var number: Int = 16
var company: String = "ABC"

// Class method
def Display()
{
println("Name of the company : " + company);
println("Total number of Smartphone generation: " + number);
}
}

Objects
• We call the elements of a class type objects.
• We create an object by prefixing an application of the constructor of
the class with the operator new.
• It is a basic unit of Object Oriented Programming and represents the
real-life entities.
• In Scala use the object keyword, in place of class, to define an object.

Objects
object Main
{

// Main method
def main(args: Array[String]) :Unit = {
// Class object
var obj = new Smartphone();
obj.Display();
}
}
Scala Inheritance
• Inheritance is an object-oriented concept which is used for reusability of code.
• You can achieve inheritance by using the extends keyword.
• To achieve inheritance, a class must extend another class.
• A class which is extended is called the super or parent class.
• A class which extends another class is called the derived or child class.
Scala Inheritance

class Employee {
var salary:Float = 10000
}

class Programmer extends Employee {


var bonus:Int = 5000
println("Salary = "+salary)
println("Bonus = "+bonus)
}

object MainObject {
def main(args:Array[String]) {
new Programmer()
}
}
Output:
Salary = 10000.0
Bonus = 5000
Scala Inheritance

//Scala Method Overriding Example


class Vehicle{
def run(){
println("vehicle is running")
}
}

class Bike extends Vehicle{


override def run(){
println("Bike is running")
}
}

Scala Inheritance

object MainObject {
def main(args:Array[String]) {
var b = new Bike()
b.run()
}
}
Output:
Bike is running

Scala Inheritance

Thank You!
