
UNIT II

OPEN SOURCE DISTRIBUTED FILE SYSTEMS


SYLLABUS

OPEN SOURCE DISTRIBUTED FILE SYSTEMS


► Hadoop - Hadoop distributions - Hadoop components - Architecture - HDFS - Basics of functional programming - MapReduce fundamentals - Data flow (architecture) - Real-world problems - Scalability goal - Fault tolerance - Optimization and data locality - Parallel efficiency of MapReduce.
NoSQL???
Not Only SQL -> Non-relational, open source, distributed
databases
Features
► Non-relational
-- either key-value pairs or document-oriented or graph-based databases
► Distributed
--Data distributed across several nodes in cluster
► No support for strict ACID properties
--CAP theorem
► No fixed table schema
--support for flexibility to the schema
Types of NoSQL Databases:
► Key-value (the big hash table)
► Schema-less (document, column, graph and other models)

Examples:
Key-Value: Dynamo, Redis, Riak, etc.
Document: MongoDB, Apache CouchDB, MarkLogic, etc.
Column: Cassandra, HBase, etc.
Graph: Neo4j, HyperGraphDB, etc.
Hierarchical: IBM IMS, RDM Mobile, etc.
Object-Oriented: Realm, etc.
RDF or Triplestore: 3store, Apache Jena, Apache Rya, etc.
Advantages

❖ Can easily scale up and down – elasticity
❖ Doesn't require a pre-defined schema – flexibility
❖ Cheap and easy to implement – lower operational cost
❖ Relaxes the data consistency requirement – favors Availability and Partition tolerance, settling for eventual Consistency
❖ Data can be replicated to multiple nodes and can be partitioned – sharding and replication
Hadoop

❖ Hadoop meets the criteria for a distributed Big Data operating system: it is a data management system that behaves as expected while storing and processing analytical data.
❖ Hadoop has mainly been used to store and compute
massive, heterogeneous datasets stored in data lakes rather
than warehouses, as well as for rapid data processing and
prototyping.
❖ Basic knowledge of distributed computing and storage is
needed to fully understand the working of Hadoop and
how to build data processing algorithms and workflows.
❖ Hadoop distributes the computational processing of a large
dataset to several machines that each run on their own
chunk of data in parallel to perform computation at scale.
Features
❖ Optimized to handle massive quantities of structured, semi-structured and unstructured data using commodity hardware
❖ Has a shared-nothing architecture
❖ Replicates data across multiple computers
❖ High throughput rather than low latency – batch processing
❖ Complements OLAP and OLTP
❖ Not suited to non-parallel, dependent data
❖ Not suited to small files
Basic Principles in Hadoop
► Clusters - working out how to manage data storage and distributed
computing in a cluster.
► Data distribution - As data is loaded into the cluster, it is distributed immediately and stored on several nodes. To reduce network traffic, each node processes the data stored locally on it.
► Data Storage - Data is held in typically 128 MB fixed-size blocks,
and copies of each block are made several times for achieving
redundancy and data protection.
► Jobs - In Hadoop, a job is any computation performed; jobs may be
divided into several tasks, with each node performing the work on a
single block of data.
► Programming Language - Jobs written in a high-level language allow us to ignore low-level details, allowing developers to concentrate their attention only on data and computation.
► Fault tolerance - When task replication is used, jobs are fault
tolerant, ensuring that the final computation is not incorrect or
incomplete if a single node or task fails.
► Communication - The amount of communication occurring between nodes should be kept to a minimum and should be handled transparently by the system. To avoid inter-process dependencies that can lead to deadlock, every task should execute independently and nodes should not communicate with each other during processing.
► Work Allocation - Master programmes divide work among worker
nodes so that they can all run in parallel on their own slice of the
larger dataset.
Key Advantages

❖ Stores data in native formats


❖ Scalable
❖ Cost-effective
❖ Resilient to failure
❖ Flexibility
❖ Fast
Hadoop Distribution
Hadoop Components

1. HDFS
❖ Storage Component
❖ Distributes data across several nodes
❖ Natively redundant
2. MapReduce
❖ Computational Framework
❖ Splits a task across multiple nodes
❖ Processes data in parallel
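To make the split-and-process idea concrete, here is a minimal sketch of the MapReduce programming model in plain Python (word count). It only illustrates the map, shuffle and reduce phases in memory; it is not the Hadoop Java API or a Hadoop Streaming job, and the sample input lines are illustrative.

from collections import defaultdict

def map_phase(line):
    # emit (word, 1) pairs for every word in one line of the input split
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # sum all the counts emitted for the same key
    return word, sum(counts)

lines = ["hadoop stores data", "hadoop processes data in parallel"]

# map: in a real cluster each line/split could be handled by a different node
intermediate = [pair for line in lines for pair in map_phase(line)]

# shuffle: group the emitted values by key
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# reduce: aggregate each group into a final count
result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)   # e.g. {'hadoop': 2, 'stores': 1, 'data': 2, ...}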
Hadoop architecture

► Hadoop is a distributed Master-Slave Architecture.


► The Hadoop architecture is a package of the file system, MapReduce
engine and the HDFS (Hadoop Distributed File System). The
MapReduce engine can be MapReduce/MR1 or YARN/MR2.
► A Hadoop cluster consists of a single master and multiple slave nodes. The master node runs the JobTracker and NameNode, whereas each slave node runs a TaskTracker and DataNode.
Hadoop Architecture
Layers in Hadoop Architecture
1. Distributed Storage Layer – HDFS (Hadoop Distributed File System)
❖ Each node in a Hadoop cluster has its own disk space, memory, bandwidth, and
processing.
❖ The incoming data is split into individual data blocks, which are then stored within
the HDFS distributed storage layer.
❖ HDFS assumes that every disk drive and slave node within the cluster is unreliable. As a precaution, HDFS stores three copies of each data block throughout the cluster.
❖ The HDFS master node (NameNode) keeps the metadata for the individual data
block and all its replicas.
2. Cluster Resource Management

YARN – Yet Another Resource Negotiator

► YARN is the resource management tool for Hadoop. YARN is able to allocate resources to the different frameworks (Apache Hive, Pig, Sqoop, etc.) written for Hadoop.
3. Processing Framework Layer

► The processing layer consists of frameworks that analyze and


process datasets coming into the cluster.
► The structured and unstructured datasets are mapped, shuffled,
sorted, merged, and reduced into smaller manageable data blocks.
► These operations are spread across multiple nodes as close as
possible to the servers where the data is located.
► Computation frameworks such as Spark, Storm, Tez now enable
real-time processing, interactive query processing and other
programming options that help the MapReduce engine and utilize
HDFS much more efficiently.
4. Application Programming Interface
► Big data continues to expand and the variety of tools needs to follow that growth. Projects that focus on search platforms, data streaming, user-friendly interfaces, programming languages, messaging, failovers, and security are all an integral part of a comprehensive Hadoop ecosystem.
► HBase – supports structured data storage for large tables
► Hive – enables analysis of large data sets using a language similar to SQL
► Apache Pig – data analysis without requiring MapReduce proficiency
► Zookeeper – coordination service for distributed applications
► Sqoop – transfers bulk data between Hadoop and structured data stores
Rack server

► A rack server, also called a rack-mounted server, is a computer


dedicated to use as a server and designed to be installed in a
framework called a rack.
► The rack contains multiple mounting slots called bays, each
designed to hold a hardware unit secured in place with screws.
Hadoop Daemons

1. NameNode
2. DataNode
3. Secondary Name Node
4. Resource Manager
5. Node Manager
HDFS

► The Hadoop Distributed File System (HDFS) is fault-tolerant by


design.
► Data is stored in individual data blocks in three separate copies
across multiple nodes and server racks.
► If a node or even an entire rack fails, the impact on the broader
system is negligible.
► DataNodes process and store data blocks, while NameNodes manage
the many DataNodes, maintain data block metadata, and control
client access.
NameNode
► NameNode works on the master system. The primary purpose of the NameNode is to manage all the metadata.
► Data is stored in the form of blocks in a Hadoop cluster, and the metadata records the DataNode and the location at which each block of a file is stored.
► All information regarding the logs of the transactions happening in a Hadoop cluster (when or who read/wrote the data) is stored in the metadata.
► MetaData is stored in the memory.
► The NameNode is also responsible to take care of the replication factor of all the blocks
► In case of the DataNode failure, the NameNode chooses new DataNodes for new
replicas, balance disk usage and manages the communication traffic to the DataNodes.
► There are two files associated with the metadata:
❖ FsImage: contains the complete state of the file system namespace since the start of the NameNode.
❖ EditLogs: contains all the recent modifications made to the file system with respect to the most recent FsImage.
Namenode

Features:
► It never stores the data that is present in the file.
► As Namenode works on the Master System, the Master system should have good
processing power and more RAM than Slaves.
► It stores information about the DataNodes, such as their block IDs and number of blocks
RACK AWARENESS

► The NameNode can find the closest DataNode for faster performance; for this, the NameNode holds the IDs of all the racks present in the Hadoop cluster. This concept of choosing the closest DataNode for serving a request is Rack Awareness.
Rack Awareness Policies (a small sketch of how they could be checked follows this list):
❖ There should not be more than one replica on the same DataNode.
❖ More than two replicas of a single block are not allowed on the same rack.
❖ The number of racks used inside a Hadoop cluster must be smaller than the number of replicas.
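A minimal Python sketch of those three policies, checking one block's proposed placement. The function and the node/rack names are illustrative only; this is not how HDFS implements its placement logic.

def is_valid_placement(replicas):
    """replicas: list of (datanode, rack) pairs chosen for one block."""
    nodes = [node for node, _ in replicas]
    racks = [rack for _, rack in replicas]
    # Policy 1: not more than one replica on the same DataNode
    if len(nodes) != len(set(nodes)):
        return False
    # Policy 2: not more than two replicas of the block on the same rack
    if any(racks.count(r) > 2 for r in set(racks)):
        return False
    # Policy 3: the number of racks used must be smaller than the number of replicas
    if len(replicas) > 1 and len(set(racks)) >= len(replicas):
        return False
    return True

# Typical placement for replication factor 3: one replica on one rack and
# two on a second rack -> 2 racks < 3 replicas, so the placement is valid.
print(is_valid_placement([("dn1", "rack1"), ("dn2", "rack2"), ("dn3", "rack2")]))  # True
print(is_valid_placement([("dn1", "rack1"), ("dn1", "rack1"), ("dn2", "rack2")]))  # False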
Responsibilities of Namenode
Namespace management.
► The namenode is responsible for maintaining the file namespace, which includes metadata, directory
structure, file to block mapping, location of blocks, and access permissions. These data are held in
memory for fast access and all mutations are persistently logged.
Coordinating file operations.
► The namenode directs application clients to datanodes for read operations, and allocates blocks on
suitable datanodes for write operations. All data transfers occur directly between clients and datanodes.
When a file is deleted, HDFS does not immediately reclaim the available physical storage; rather, blocks
are lazily garbage collected.
Maintaining overall health of the file system.
► The namenode is in periodic contact with the datanodes via heartbeat messages to ensure the integrity of
the system. If the namenode observes that a data block is under-replicated (fewer copies are stored on
datanodes than the desired replication factor), it will direct the creation of new replicas. Finally, the
namenode is also responsible for rebalancing the file system. During the course of normal operations,
certain datanodes may end up holding more blocks than others; rebalancing involves moving blocks from
datanodes with more blocks to datanodes with fewer blocks. This leads to better load balancing and more
even disk utilization.
DataNode

► DataNode works on the slave system. The NameNode always instructs the DataNode to store the data.
► DataNode is a program that runs on the slave system and serves the low-level read/write requests from the client.
► As the data is stored on these DataNodes, they should possess large storage capacity to store more data.
► They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.
Data blocks

► Files in HDFS are broken into block-sized chunks called data blocks. These blocks are stored as independent units.
► The size of these HDFS data blocks is 128 MB by default. We can configure the block size as per our requirement by changing the dfs.block.size property in hdfs-site.xml (see the sample configuration below).
► Hadoop distributes these blocks on different slave machines, and the master machine stores the metadata about block locations.
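A minimal hdfs-site.xml fragment setting this property is sketched below as an example; the value is given in bytes (134217728 bytes = 128 MB), and newer Hadoop releases also accept the property name dfs.blocksize.

<configuration>
  <property>
    <name>dfs.block.size</name>   <!-- HDFS block size in bytes; 134217728 = 128 MB -->
    <value>134217728</value>
  </property>
</configuration>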
Advantages of Data Blocks

1. No limitation on the file size


► A file can be larger than any single disk in the network.
2. Simplicity of storage subsystem
► Since blocks are of a fixed size, we can easily calculate the number of blocks that can be stored on a given disk. This provides simplicity to the storage subsystem.
3. Fit well with replication for providing Fault Tolerance and High Availability
► Blocks are easy to replicate between DataNodes thus, provide fault tolerance and high
availability.
4. Eliminating metadata concerns
► Since blocks are just chunks of data to be stored, we don't need to store file metadata (such as permission information) with the blocks; another system can handle metadata separately.
Secondary NameNode – check point node
► The Secondary NameNode is used for taking an hourly backup (checkpoint) of the metadata.
► In case the Hadoop cluster fails or crashes, the hourly checkpoint stored in a file named FsImage is transferred to a new system; a new master is created with this metadata, and the cluster is made to run again correctly.
Major Functions of the Secondary NameNode:
► It merges the EditLogs and the FsImage from the NameNode.
► It continuously reads the metadata from the RAM of the NameNode and writes it to the hard disk.
Basic Hadoop Commands

► https://data-flair.training/blogs/hadoop-hdfs-commands/
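A few frequently used HDFS shell commands are listed here as a quick reference; the directory and file names are illustrative only.

hdfs dfs -mkdir -p /user/hadoop/input                     # create a directory in HDFS
hdfs dfs -put localfile.txt /user/hadoop/input            # copy a local file into HDFS
hdfs dfs -ls /user/hadoop/input                           # list the contents of an HDFS directory
hdfs dfs -cat /user/hadoop/input/localfile.txt            # print the contents of an HDFS file
hdfs dfs -get /user/hadoop/input/localfile.txt copy.txt   # copy a file from HDFS to the local disk
hdfs dfs -rm /user/hadoop/input/localfile.txt             # delete a file from HDFS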
YARN - Resource Manager
► Resource Manager is also known as the Global Master Daemon that
works on the Master System. The Resource Manager Manages the
resources for the applications that are running in a Hadoop Cluster.
► The Resource Manager Mainly consists of 2 things.
1. ApplicationsManager
2. Scheduler
► The ApplicationsManager is responsible for accepting requests from a client and for allocating memory resources on the slaves in a Hadoop cluster to host the ApplicationMaster.
► The Scheduler is utilized for providing resources to applications in a Hadoop cluster and for monitoring these applications.
Node Manager

► The NodeManager works on the slave system and manages the memory and disk resources within its node.
► Each slave node in a Hadoop cluster has a single NodeManager daemon running on it. It monitors resource usage and sends this monitoring information to the Resource Manager.
Master – Slave Structure
Functional Programming

► is a programming paradigm — a style of building the structure and


elements of computer programs — that treats computation as the
evaluation of mathematical functions and avoids changing-state and
mutable data
► is a style of programming that emphasizes the evaluation of
expressions, rather than execution of commands. The expressions in
these languages are formed by using functions to combine basic
values. A functional language is a language that supports and
encourages programming in a functional style.
Characteristics of Functional Programming

► The functional programming method focuses on results, not the process
► Emphasis is on what is to be computed
► Data is immutable
► Functional programming decomposes the problem into functions
► It is built on the concept of mathematical functions, which use conditional expressions and recursion to perform calculations
► It does not support iteration with loop statements or conditional statements like if-else
History of Functional programming

► The foundation for Functional Programming is Lambda Calculus. It


was developed in the 1930s for the functional application, definition,
and recursion
► LISP was the first functional programming language. McCarthy
designed it in 1960
► In the late 70s, researchers at the University of Edinburgh defined ML (Meta Language)
► In the early 80s, the Hope language added algebraic data types for recursion and equational reasoning
► In 2004, the functional language Scala was introduced
Functional Programming Languages

► Haskell
► SML
► Clojure
► Scala
► Erlang
► Clean
► F#
► ML / OCaml
► Lisp / Scheme
► XSLT
► SQL
► Mathematica
Imperative & Declarative languages

► Imperative means you command the computer to do something by


explicitly stating each step it needs to perform in order to
calculate the result.
The object-oriented paradigm is based on creating abstractions for
data. It allows the programmer to hide the inner representation inside
an object and provide only a view of it to the rest of the world via the
object’s API.
► Declarative means you state what should be done, and the
programming language has the task of figuring out how to do it.
The FP style creates abstractions on the functions.
Concepts of functional programming

► Pure functions
► Recursion
► Referential transparency
► Functions are First-Class
► Functions can be Higher-Order
► Variables are Immutable
Pure Function
► A pure function is one whose results are dependent only upon the input
parameters, and whose operation initiates no side effect, that is, makes no external
impact besides the return value. The pure function’s only result is the value it
returns. They are deterministic.
Example (a Python version of the same idea):
def pure_add(a, b):     # result depends only on the inputs a and b
    return a + b        # no side effects; the return value is the only result
These functions have two main properties.
► First, they always produce the same output for same arguments irrespective of
anything else.
► Secondly, they have no side-effects i.e. they do not modify any arguments or
local/global variables or input/output streams. This is called immutability.
Recursion
► There are no "for" or "while" loops in functional languages. Iteration in functional languages is implemented through recursion. Recursive functions repeatedly call themselves until the base case is reached.
Example (in Python):
def fib(n):
    if n <= 1:
        return 1
    else:
        return fib(n - 1) + fib(n - 2)
Referential transparency
► Functional programs should perform operations as if they were being run for the first time, every time. So you will always know what may or may not have happened during the program's execution, and what its side effects are. In FP terms this is called referential transparency.
► pure functions + immutable data = referential transparency
► In functional programs variables once defined do not change their
value throughout the program. Functional programs do not have
assignment statements. If we have to store some value, we define
new variables instead. This eliminates any chances of side effects
because any variable can be replaced with its actual value at any
point of execution. State of any variable is constant at any instant.
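A minimal Python sketch of the idea (the function and variable names are illustrative): because square is pure and its inputs never change, the expression square(3) can be replaced by its value 9 anywhere without changing the program's behaviour.

def square(x):
    return x * x        # pure: no side effects

total_a = square(3) + square(3)   # evaluate the expression twice
total_b = 9 + 9                   # substitute the known value instead
assert total_a == total_b == 18   # both forms give the same result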
Functions are First-Class
► First-class functions are treated as first-class variables. First-class variables can be passed to functions as parameters, returned from functions, or stored in data structures. Functions as data, i.e. using functions as variables, arguments, and return values, helps create elegant code.
► Functions as first-class entities can:
❖ refer to it from constants and variables
❖ pass it as a parameter to other functions
❖ return it as result from other functions
Example (in Python):
def greet():              # an ordinary function
    print("hello")

def show_output(f):       # show_output takes another function f as an argument
    f()                   # call the passed-in function

func = greet              # a function assigned to a variable
show_output(func)         # pass the function to another function; prints "hello"
Functions can be Higher-Order
► Higher order functions are the functions that take other functions as arguments and
they can also return functions.
❖ takes one or more functions as arguments, or
❖ returns a function as its result
Example (in Python, reusing show_output from above):
def print_gfg():              # declare another function
    print("hello gfg")

show_output(print_gfg)        # pass a function into another function; prints "hello gfg"
A related technique, currying, applies a function to its arguments one at a time, with each application returning a new function that accepts the next argument (see the sketch below).
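A small Python sketch of currying (the names add and add_two are illustrative): each call takes one argument and returns a function that waits for the next one.

def add(a):
    def add_a(b):          # the inner function remembers a (a closure)
        return a + b
    return add_a

add_two = add(2)           # the first application returns a new function
print(add_two(3))          # supplying the next argument gives 5
print(add(2)(3))           # or apply one argument at a time in a chain -> 5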
Variables are Immutable

► Another tenet of functional programming philosophy is not to modify data


outside the function. In practice, this means to avoid modifying the input
arguments to a function. Instead, the return value of the function should
reflect the work done. This is a way of avoiding side effects. It makes it
easier to reason about the effects of the function as it operates within the
larger system.
► Immutable data means that you should easily be able to create new data structures instead of modifying ones that already exist.
► In functional programming, we can’t modify a variable after it’s been
initialized. We can create new variables – but we can’t modify existing
variables, and this really helps to maintain state throughout the runtime of
a program. Once we create a variable and set its value, we can have full
confidence knowing that the value of that variable will never change.
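A short Python illustration of this convention (a sketch; the helper add_item is illustrative): instead of mutating the input in place, the function builds and returns a new value, leaving the original untouched.

def add_item(items, new_item):
    # return a new tuple rather than modifying the argument
    return items + (new_item,)

original = (1, 2, 3)            # tuples are immutable in Python
updated = add_item(original, 4)
print(original)                 # (1, 2, 3)  -- unchanged
print(updated)                  # (1, 2, 3, 4)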
Benefits of functional programming

► Allows you to avoid confusing problems and errors in the code


► Easier to unit test and debug FP code
► Parallel processing and concurrency
► Hot code deployment and fault tolerance
► Offers better modularity with a shorter code
► Increased productivity of the developer
► Supports Nested Functions
► Functional Constructs like Lazy Map & Lists, etc.
► Allows effective use of Lambda Calculus
Limitations of Functional Programming

► The functional programming paradigm is not easy, so it is difficult for beginners to understand
► Hard to maintain as many objects evolve during coding
► Needs lots of mocking and extensive environmental setup
► Re-use is very complicated and needs constant refactoring
► Objects may not represent the problem correctly
