Unit 2 Hadoop
Examples:
Key-Value: Dynamo, Redis, Riak, etc.
Document: MongoDB, Apache CouchDB, MarkLogic, etc.
Column: Cassandra, HBase, etc.
Graph: Neo4j, HyperGraphDB, etc.
Hierarchical: IBM IMS, RDM Mobile, etc.
Object-Oriented: Realm, etc.
RDF or Triplestore: 3store, Apache Jena, Apache Rya, etc.
Advantages
1. HDFS
❖ Storage Component
❖ Distributes data across several nodes
❖ Natively redundant
2. MapReduce
❖ Computational Framework
❖ Splits a task across multiple nodes
❖ Processes data in parallel
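The split-and-process-in-parallel idea behind MapReduce can be sketched with the classic word-count example, simulated here in plain Python (the function names are illustrative, not Hadoop's actual API):

```python
from collections import defaultdict

# Map phase: each input line is split into (word, 1) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle phase: group all values by key, as the framework would between map and reduce.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine the grouped values for each key.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data", "hadoop processes data"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In a real cluster the map and reduce calls run on different nodes over HDFS blocks; this sketch only shows the data flow.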
Hadoop architecture
1. NameNode
2. DataNode
3. Secondary NameNode
4. Resource Manager
5. Node Manager
HDFS
NameNode features:
► It never stores the actual data present in the files; it holds only metadata.
► As the NameNode runs on the master system, the master should have good
processing power and more RAM than the slaves.
► It stores information about the DataNodes, such as their block IDs and number of
blocks.
RACK AWARENESS
► The NameNode can find the closest DataNode to serve a request faster;
for this, the NameNode holds the IDs of all the racks present in the
Hadoop cluster. This concept of choosing the closest DataNode to
serve a request is called Rack Awareness.
Rack Awareness Policies:
❖ There should not be more than one replica on the same DataNode.
❖ More than two replicas of a single block are not allowed on the same
rack.
❖ The number of racks used inside a Hadoop cluster must be smaller
than the number of replicas.
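The three policies above can be checked mechanically for a proposed block placement. A hedged sketch in Python (the data structures are illustrative, not HDFS internals):

```python
# A placement is a list of (rack_id, datanode_id) pairs, one per replica of a block.
def placement_ok(replicas):
    nodes = [node for _, node in replicas]
    racks = [rack for rack, _ in replicas]
    # Policy 1: no more than one replica on the same DataNode.
    if len(nodes) != len(set(nodes)):
        return False
    # Policy 2: no more than two replicas of the block on the same rack.
    if any(racks.count(r) > 2 for r in set(racks)):
        return False
    # Policy 3: number of racks used must be smaller than the number of replicas.
    if len(set(racks)) >= len(replicas):
        return False
    return True

# The typical 3-replica layout: one replica on one rack, two on another.
print(placement_ok([("r1", "n1"), ("r2", "n2"), ("r2", "n3")]))  # True
print(placement_ok([("r1", "n1"), ("r1", "n1"), ("r2", "n2")]))  # False (same node twice)
```

Note how spreading the three replicas across three racks would also fail, by policy 3.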
Responsibilities of Namenode
Namespace management.
► The namenode is responsible for maintaining the file namespace, which includes metadata, directory
structure, file to block mapping, location of blocks, and access permissions. These data are held in
memory for fast access and all mutations are persistently logged.
Coordinating file operations.
► The namenode directs application clients to datanodes for read operations, and allocates blocks on
suitable datanodes for write operations. All data transfers occur directly between clients and datanodes.
When a file is deleted, HDFS does not immediately reclaim the available physical storage; rather, blocks
are lazily garbage collected.
Maintaining overall health of the file system.
► The namenode is in periodic contact with the datanodes via heartbeat messages to ensure the integrity of
the system. If the namenode observes that a data block is under-replicated (fewer copies are stored on
datanodes than the desired replication factor), it will direct the creation of new replicas. Finally, the
namenode is also responsible for rebalancing the file system. During the course of normal operations,
certain datanodes may end up holding more blocks than others; rebalancing involves moving blocks from
datanodes with more blocks to datanodes with fewer blocks. This leads to better load balancing and more
even disk utilization.
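The heartbeat-driven under-replication check described above can be sketched as a simplified model (names and data structures are illustrative, not the NameNode's actual implementation):

```python
REPLICATION_FACTOR = 3  # desired number of copies per block (assumed default)

# block_locations maps a block id to the set of DataNodes holding a replica.
# live_nodes is the set of DataNodes that are still sending heartbeats.
def under_replicated_blocks(block_locations, live_nodes):
    needs_replicas = {}
    for block, holders in block_locations.items():
        live_holders = holders & live_nodes  # drop nodes that missed heartbeats
        if len(live_holders) < REPLICATION_FACTOR:
            # record how many new replicas the NameNode should direct to be created
            needs_replicas[block] = REPLICATION_FACTOR - len(live_holders)
    return needs_replicas

blocks = {"blk_1": {"n1", "n2", "n3"}, "blk_2": {"n1", "n4"}}
live = {"n1", "n2", "n3"}  # n4 has stopped sending heartbeats
print(under_replicated_blocks(blocks, live))  # {'blk_2': 2}
```

blk_2 had only two replicas to begin with and lost one to the dead node, so two new replicas are needed; blk_1 is fully replicated on live nodes.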
DataNode
► DataNodes run on the slave systems; they store the actual blocks of data and serve read and write requests from clients.
► HDFS shell command reference: https://data-flair.training/blogs/hadoop-hdfs-commands/
YARN - Resource Manager
► The Resource Manager, also known as the global master daemon, runs
on the master system. The Resource Manager manages the
resources for the applications that are running in a Hadoop cluster.
► The Resource Manager mainly consists of two components:
1. ApplicationsManager
2. Scheduler
► The ApplicationsManager is responsible for accepting job submissions
from clients and for negotiating a container on a slave node in the
Hadoop cluster to host the application's ApplicationMaster.
► The Scheduler is responsible purely for allocating resources to the
applications running in the Hadoop cluster; it performs no monitoring
or tracking of application status.
Node Manager
► The Node Manager runs on the slave systems and manages the
resources (memory, CPU, and disk) available on its node.
► Each slave node in a Hadoop cluster runs a single NodeManager
daemon. The NodeManager monitors resource usage on its node and
sends this monitoring information to the Resource Manager.
Master – Slave Structure
Functional Programming
► Haskell
► SML
► Clojure
► Scala
► Erlang
► Clean
► F#
► ML / OCaml
► Lisp / Scheme
► XSLT
► SQL
► Mathematica
Characteristics of Functional Programming
► Pure functions
► Recursion
► Referential transparency
► Functions are First-Class
► Functions can be Higher-Order
► Variables are Immutable
Pure Function
► A pure function is one whose result depends only upon its input
parameters, and whose execution causes no side effects, that is, makes no external
impact besides the return value. The pure function's only result is the value it
returns. Pure functions are deterministic.
Example:
def pure(a, b):
    return a + b
These functions have two main properties.
► First, they always produce the same output for the same arguments, irrespective of
anything else.
► Secondly, they have no side effects, i.e. they do not modify any arguments,
local/global variables, or input/output streams. Not modifying data in place is called immutability.
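To contrast, here is the same addition written as a pure function and as an impure one that mutates external state (Python; the names are illustrative):

```python
total = 0  # external state

def pure_add(a, b):
    # Depends only on its arguments; no side effects; deterministic.
    return a + b

def impure_add(a, b):
    # Mutates a global variable: a side effect, so this function is not pure.
    global total
    total += a + b
    return total

print(pure_add(2, 3))    # 5, every time
print(impure_add(2, 3))  # 5 on the first call...
print(impure_add(2, 3))  # ...but 10 on the second: same inputs, different output
```

The impure version violates both properties above: its output depends on hidden state, and it modifies that state.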
Recursion
► There are no "for" or "while" loops in functional languages. Iteration
in functional languages is implemented through recursion. A recursive
function repeatedly calls itself until it reaches the base case.
Example:
def fib(n):
    if n <= 1:
        return 1
    return fib(n - 1) + fib(n - 2)
Referential transparency
► A functional program should behave as if every expression is evaluated
for the first time each time it appears: an expression can always be
replaced by its value without changing the program's behaviour, because
there are no hidden side effects. In FP terms this is called referential
transparency.
► pure functions + immutable data = referential transparency
► In functional programs, variables once defined do not change their
value throughout the program. Functional programs have no
assignment statements; if we have to store some value, we define a
new variable instead. This eliminates any chance of side effects,
because any variable can be replaced with its actual value at any
point of execution. The state of any variable is constant at any instant.
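A small illustration: because a pure function's call can be replaced by its value, the two expressions below are interchangeable (Python; the names are illustrative):

```python
def square(x):
    return x * x

# Referentially transparent: square(4) and 16 are interchangeable everywhere.
a = square(4) + square(4)
b = 16 + 16          # each call replaced by its value
assert a == b        # the substitution did not change the result

# With mutation this breaks: record(1) cannot be replaced by a fixed value,
# because its result depends on how many times it has been called before.
items = []
def record(x):
    items.append(x)  # side effect on external state
    return len(items)
```

Calling `record(1)` twice returns 1 and then 2, so it is not referentially transparent.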
Functions are First-Class
► First-class functions are treated like any other value: they can
be passed to functions as parameters, returned from functions, or stored in
data structures. Treating functions as data, i.e. using them as variables, arguments, and
return values, makes for elegant code.
► Functions as first-class entities can be:
❖ referred to from constants and variables
❖ passed as parameters to other functions
❖ returned as results from other functions
Example:
def show_output(f):    # show_output takes another function f as its argument
    f()                # calling the passed function

func = show_output     # a function assigned to a variable
Functions can be Higher-Order
► A higher-order function is a function that does at least one of the
following:
❖ takes one or more functions as arguments, or
❖ returns a function as its result
Example:
def print_gfg():           # declaring another function
    print("hello gfg")

show_output(print_gfg)     # passing a function to another function
► Currying is a related technique: it applies a function to its arguments one at a
time, with each application returning a new function that accepts the next argument.
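The one-argument-at-a-time technique described above is currying. A sketch in Python using nested functions (the names are illustrative):

```python
# A curried three-argument add: each application consumes one argument
# and returns a new function that accepts the next one.
def add(a):
    def add_b(b):
        def add_c(c):
            return a + b + c
        return add_c
    return add_b

add_5 = add(2)(3)     # partially applied: a function still waiting for c
print(add_5(10))      # 15
print(add(1)(2)(3))   # 6
```

Partial application like `add(2)(3)` is only possible because functions are first-class and higher-order, tying together the concepts above.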
Variables are Immutable