
UNIT - II
Cloud Services & Platforms :

In this chapter you will learn about various types of cloud computing services, including compute, storage, database, application, content delivery, analytics, and deployment & management services. For each category of cloud services, examples of services provided by various cloud service providers, including Amazon, Google and Microsoft, are described.
This chapter also covers the cloud computing reference model along with the various cloud service models (IaaS, PaaS and SaaS). Infrastructure-as-a-Service (IaaS) provides dynamically scalable virtualized resources using a virtualized infrastructure.
Platform-as-a-Service (PaaS) simplifies application development by providing development tools, application programming interfaces (APIs) and software libraries that can be used for a wide range of applications. Software-as-a-Service (SaaS) provides multi-tenant applications hosted in the cloud.
The bottommost layer in the cloud reference model is the infrastructure and facilities layer, which includes the physical infrastructure such as data center facilities, electrical and mechanical equipment, etc. On top of the infrastructure layer is the hardware layer, which includes physical compute, network and storage hardware. On top of the hardware layer, the virtualization layer partitions the physical hardware resources into multiple virtual resources, enabling pooling of resources. Chapter 2 described various types of virtualization approaches such as full virtualization, para-virtualization and hardware virtualization. The computing services are delivered in the form of Virtual Machines (VMs) along with the storage and network resources.

The platform and middleware layer builds on the IaaS layers below and provides standardized stacks of services such as database services, queuing services, application frameworks and run-time environments, messaging services, monitoring services, analytics services, etc. The service management layer provides APIs for requesting, managing and monitoring cloud resources. The topmost layer is the applications layer, which includes SaaS applications such as email, cloud storage applications, productivity applications, management portals, customer self-service portals, etc.
The following sections describe various types of cloud services and the layers of the cloud reference model with which they are associated.
Compute Services:

Compute services provide dynamically scalable compute capacity in the cloud. Compute resources can be provisioned on demand in the form of virtual machines. Virtual machines can be created from standard images provided by the cloud service provider (e.g. an Ubuntu image, a Windows Server image, etc.) or from custom images created by the users. A machine image is a template that contains a software configuration (operating system, application server and applications). Compute services can be accessed from the web consoles of these services, which provide graphical user interfaces for provisioning, managing and monitoring these services. Cloud service providers also provide APIs for various programming languages (such as Java, Python, etc.) that allow developers to access and manage these services programmatically.

EX:

Amazon Elastic Compute Cloud :

Amazon Elastic Compute Cloud (EC2) is a compute service provided by Amazon. To launch a new instance, click on the Launch Instance button in the EC2 console. This opens a wizard where you can select the Amazon Machine Image (AMI) with which you want to launch the instance. You can also create your own AMIs with custom applications, libraries and data. Instances can be launched with a variety of operating systems. When you launch an instance you specify the instance type (micro, small, medium, large, extra-large, etc.), the number of instances to launch based on the selected AMI, and the availability zones for the instances. The instance launch wizard also allows you to specify metadata tags for the instance that simplify the administration of EC2 instances. When launching a new instance, the user selects a key pair from existing key pairs or creates a new key pair for the instance. Key pairs are used to securely connect to an instance after it launches. The security groups to be associated with the instance can be selected from the instance launch wizard. Security groups are used to open or block a specific network port for the launched instances.
When the instance is launched, its status can be viewed in the EC2 console. Upon launching a new instance, its state is pending. It takes a couple of minutes for the instance to come into the running state. When the instance comes into the running state, it is assigned a public DNS, private DNS, public IP and private IP. The public DNS can be used to securely connect to the instance using SSH.
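The same launch workflow can be driven through the EC2 API. The following is a minimal sketch using the boto3 Python SDK; the AMI ID, key pair name, security group ID and region are placeholders that would need to be replaced with values from your own account.

import boto3

# Create an EC2 resource object for a region (region chosen here is an assumption).
ec2 = boto3.resource("ec2", region_name="us-east-1")

# Launch one t2.micro instance from an AMI; all IDs below are placeholders.
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",            # AMI selected in the launch wizard
    InstanceType="t2.micro",                     # instance type
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",                       # existing key pair used for SSH access
    SecurityGroupIds=["sg-0123456789abcdef0"],   # security group controlling open ports
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "demo-instance"}],
    }],
)

instance = instances[0]
instance.wait_until_running()    # instance moves from pending to running
instance.reload()
print(instance.public_dns_name)  # public DNS used to SSH into the instance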

Storage Services:

Cloud storage services allow storage and retrieval of any amount of data, at any time, from anywhere on the web. Most cloud storage services organize data into buckets or containers. Buckets or containers store objects, which are individual pieces of data.
Features
• Scalability: Cloud storage services provide high capacity and scalability.
• Replication: When an object is uploaded, it is replicated at multiple facilities and/or on multiple devices within each facility.
• Access Policies: Cloud storage services provide several security features such as Access Control Lists (ACLs), bucket/container-level policies, etc. ACLs can be used to selectively grant access permissions on individual objects. Bucket/container-level policies can be defined to grant permissions across some or all of the objects within a single bucket/container.
• Encryption: Cloud storage services provide a Server Side Encryption (SSE) option to encrypt all data stored in the cloud storage.
• Consistency: Cloud storage services provide data consistency; any object that is uploaded can be immediately downloaded after the upload is complete.
Amazon Simple Storage Service:

Amazon Simple Storage Service (S3) is an online cloud-based data storage infrastructure for storing and retrieving any amount of data. S3 provides a highly reliable, scalable and fast storage infrastructure. Data stored on S3 is organized in the form of buckets. You must create a bucket before you can store data on S3. The S3 console provides simple wizards for creating a new bucket and uploading files.
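The same bucket-and-object workflow can also be scripted against the S3 API. Here is a minimal sketch using the boto3 Python SDK; the bucket name and file paths are placeholders (bucket names must be globally unique).

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Create a bucket (required before any objects can be stored).
# In regions other than us-east-1 a CreateBucketConfiguration must also be given.
s3.create_bucket(Bucket="my-example-bucket-12345")

# Upload a local file as an object under a key inside the bucket.
s3.upload_file("report.csv", "my-example-bucket-12345", "reports/report.csv")

# Retrieve the object and inspect its size.
obj = s3.get_object(Bucket="my-example-bucket-12345", Key="reports/report.csv")
print(obj["ContentLength"])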

Database Services:

Cloud database services allow you to set up and operate relational or non-relational databases in the cloud. The benefit of using cloud database services is that they relieve application developers from time-consuming database administration tasks. Popular relational databases provided by various cloud service providers include MySQL, Oracle, SQL Server, etc. The non-relational (NoSQL) databases provided by cloud service providers are mostly proprietary solutions. NoSQL databases are usually fully managed and deliver high scalability. The characteristics of relational and non-relational database services are described below.

Features

• Scalability: Cloud database services allow provisioning as much compute and storage resources as required to meet the application workload levels. Provisioned capacity can be scaled up or down. For read-heavy workloads, read replicas can be created.
• Reliability: Cloud database services are reliable and provide automated backup and snapshot options.
• Performance: Cloud database services provide guaranteed performance with options such as guaranteed input/output operations per second (IOPS), which can be provisioned up front.
• Security: Cloud database services provide several security features to restrict access to the database instances and stored data, such as network firewalls and authentication mechanisms.

Windows Azure SQL Database :

Windows Azure SQL Database is the relational database service from Microsoft. Azure SQL Database is based on SQL Server, but it does not give each customer a dedicated instance of SQL Server. Instead, SQL Database is a multi-tenant service, with a logical SQL Database server for each customer.
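Applications connect to Azure SQL Database with standard SQL Server client libraries. A minimal connectivity sketch in Python using pyodbc is shown below; the server, database, credentials and ODBC driver version are placeholders.

import pyodbc

# Placeholder connection details for a logical SQL Database server.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=tcp:myserver.database.windows.net,1433;"
    "DATABASE=mydb;UID=myuser;PWD=mypassword;Encrypt=yes;"
)

cursor = conn.cursor()
cursor.execute("SELECT @@VERSION")   # simple query to verify the connection
print(cursor.fetchone()[0])
conn.close()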
Windows Azure Table Service :

Windows Azure Table Service is a non-relational (NoSQL) database service from Microsoft. The Azure Table Service data model consists of tables having multiple entities. Tables are divided into a number of partitions, each of which can be stored on a separate machine. Each partition in a table holds a specified number of entities, each containing as many as 255 properties. Each property can be one of several supported data types, such as integers and strings. Tables do not have a fixed schema, and different entities in a table can have different properties.
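The entity model described above can be exercised with Microsoft's Python SDK. The sketch below uses the azure-data-tables package (a newer client library than the one from the Windows Azure era, so treat the exact API as an assumption); the connection string, table name and entity fields are placeholders.

from azure.data.tables import TableServiceClient

# Connect to the storage account (connection string is a placeholder).
service = TableServiceClient.from_connection_string(conn_str="<connection-string>")
table = service.create_table_if_not_exists(table_name="Customers")

# Entities are schemaless dictionaries; PartitionKey and RowKey identify them.
table.create_entity(entity={
    "PartitionKey": "Europe",     # determines which partition stores the entity
    "RowKey": "cust-001",         # unique within the partition
    "Name": "Alice",
    "Credit": 42,                 # different entities may use different properties
})

entity = table.get_entity(partition_key="Europe", row_key="cust-001")
print(entity["Name"], entity["Credit"])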

Application Services :

In this section you will learn about various cloud application services such as application runtimes and frameworks, queuing services, email services, notification services and media services.

Application Runtimes & Frameworks


Cloud-based application runtimes and frameworks allow developers to develop and host applications in the cloud. Application runtimes provide support for programming languages (e.g., Java, Python, or Ruby). Application runtimes automatically allocate resources for applications and handle the application scaling, without the need to run and maintain servers.
Google App Engine

Google App Engine is the Platform-as-a-Service (PaaS) offering from Google, which includes both an application runtime and web frameworks. Figure 3.13 shows a screenshot of the Google App Engine console.
App Engine features include:
• Runtimes: App Engine supports applications developed in the Java, Python, PHP and Go programming languages, and provides runtime environments for each of these languages.
• Sandbox: Applications run in a secure sandbox environment isolated from other applications. The sandbox environment provides limited access to the underlying operating system. App Engine can only execute application code called from HTTP requests. The sandbox environment allows App Engine to distribute web requests for the application across multiple servers.
• Web Frameworks: App Engine provides a simple Python web application framework called webapp2. App Engine also supports any framework written in pure Python that speaks WSGI, including Django, CherryPy, Pylons, web.py, and web2py (a minimal webapp2 handler sketch follows this list).
• Datastore: App Engine provides a NoSQL data storage service.
• Authentication: App Engine applications can be integrated with Google Accounts for user authentication.
• URL Fetch service: The URL Fetch service allows applications to access resources on the Internet, such as web services or other data.
• Email service: Email service allows applications to send email messages.
• Image Manipulation service: The Image Manipulation service allows applications to resize and crop images.


• Memcache: The Memcache service is a high-performance in-memory key-value cache service that applications can use for caching data items that do not need persistent storage.
• Task Queues: Task queues allow applications to do work in the background by breaking up work into small, discrete units, called tasks, which are enqueued in task queues.
• Scheduled Tasks service: App Engine provides a Cron service for scheduled tasks that trigger events at specified times or regular intervals.
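As a quick illustration of the webapp2 framework mentioned in the list above, here is a minimal request handler sketch for the classic Python 2.7 App Engine runtime; the handler class name and route are hypothetical.

import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        # Respond to HTTP GET requests on the mapped route.
        self.response.headers["Content-Type"] = "text/plain"
        self.response.write("Hello from App Engine")

# WSGI application mapping URL routes to handlers; referenced from app.yaml.
app = webapp2.WSGIApplication([("/", MainPage)], debug=True)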

Content Delivery Services :
Cloud-based content delivery services include content delivery networks (CDNs). A CDN is a distributed system of servers located across multiple geographic locations to serve content to end-users with high availability and high performance. CDNs are useful for serving static content such as text, images, etc., and streaming media. CDNs have a number of edge locations deployed in multiple locations, often over multiple backbones. Requests for static or streaming media content that is served by a CDN are directed to the nearest edge location. CDNs cache popular content on the edge servers, which helps in reducing bandwidth costs and improving response times.
EX:
Windows Azure Content Delivery Network
Windows Azure Content Delivery Network (CDN) is the content delivery service from Microsoft. Azure CDN caches Windows Azure blobs and static content at the edge locations to improve the performance of websites. Azure CDN can be enabled on a Windows Azure storage account.
Analytics Services
Cloud-based analytics services allow analyzing massive data sets stored in the cloud, either in cloud storage or in cloud databases, using programming models such as MapReduce. Using cloud analytics services, applications can perform data-intensive tasks such as data mining, log file analysis, machine learning, web indexing, etc.


EX :
Google MapReduce Service :
Google MapReduce Service is a part of the App Engine platform. App Engine MapReduce is optimized for the App Engine environment and provides capabilities such as automatic sharding for faster execution, standard data input readers for iterating over blob and datastore data, standard output writers, etc. The MapReduce service can be accessed using the Google MapReduce API. To execute a MapReduce job, a MapReduce pipeline object is instantiated within the App Engine application. The MapReduce pipeline specifies the mapper, reducer, data input reader and output writer.

Deployment & Management Services :


Cloud-based deployment & management services allow you to easily deploy and
manage applications in the cloud. These services automatically handle deployment
tasks such as capacity provisioning, load balancing, auto-scaling, and application
health monitoring.
EX:
Amazon CloudFormation :
Amazon CloudFormation is a deployment service from Amazon. With CloudFormation you can create deployments from a collection of AWS resources such as Amazon Elastic Compute Cloud, Amazon Elastic Block Store, Amazon Simple Notification Service, Elastic Load Balancing and Auto Scaling. A collection of AWS resources that you want to manage together is organized into a stack. CloudFormation stacks are created from CloudFormation templates. You can create your own templates or use the predefined templates. The AWS infrastructure requirements for the stack are specified in the template.
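A stack can also be created programmatically from a template. The sketch below, using the boto3 Python SDK, defines a minimal template containing a single EC2 instance and creates a stack from it; the stack name, AMI ID and region are placeholders.

import json
import boto3

# Minimal CloudFormation template describing one EC2 instance (AMI ID is a placeholder).
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "DemoInstance": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-0123456789abcdef0",
                "InstanceType": "t2.micro",
            },
        }
    },
}

cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.create_stack(StackName="demo-stack", TemplateBody=json.dumps(template))

# Wait until all resources in the stack have been provisioned.
cfn.get_waiter("stack_create_complete").wait(StackName="demo-stack")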
Identity & Access Management Services :
Identity and Access Management (IDAM) services allow managing the authentication and authorization of users to provide secure access to cloud resources. IDAM services are useful for organizations which have multiple users who access the cloud resources. Using IDAM services you can manage user identifiers, user permissions, security credentials and access keys.
EX:
Amazon Identity & Access Management (IAM):
AWS Identity and Access Management (IAM) allows you to manage users and user permissions for an AWS account. With IAM you can manage users, security credentials such as access keys, and permissions that control which AWS resources users can access. Using IAM you can control what data users can access and what resources users can create. IAM also allows you to control the creation, rotation, and revocation of users' security credentials.
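A typical IAM workflow, creating a user, attaching a policy and issuing an access key, can be scripted as in the following boto3 sketch; the user name is a placeholder and the attached policy is an AWS managed read-only policy chosen only for illustration.

import boto3

iam = boto3.client("iam")

# Create a user within the AWS account (user name is a placeholder).
iam.create_user(UserName="analyst1")

# Grant read-only access to S3 using an AWS managed policy.
iam.attach_user_policy(
    UserName="analyst1",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)

# Issue an access key for programmatic access; keys can later be rotated or revoked.
key = iam.create_access_key(UserName="analyst1")
print(key["AccessKey"]["AccessKeyId"])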

Open Source Cloud Software:


This section covers open source cloud software that can be used to build private clouds.
EX:
OpenStack :
OpenStack is a cloud operating system comprising a collection of interacting services that control computing, storage, and networking resources. The OpenStack compute service (called nova-compute) manages networks of virtual machines running on nodes, providing virtual servers on demand. The network service (nova-network) provides connectivity between the interfaces of other OpenStack services. The volume service (Cinder) manages storage volumes for virtual machines. The object storage service (Swift) allows users to store and retrieve files. The identity service (Keystone) provides authentication and authorization. The image registry (Glance) acts as a catalog and repository for virtual machine images. The OpenStack scheduler (nova-scheduler) maps the nova-API calls to the appropriate OpenStack components; the scheduler takes the virtual machine requests from the queue and determines where they should run. The messaging service (RabbitMQ) acts as a central node for message passing between daemons. Orchestration activities such as running an instance are performed by nova-api, which accepts and responds to end-user compute API calls. The OpenStack dashboard (called Horizon) provides a web-based interface for managing OpenStack services.
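OpenStack services can also be driven through their APIs; for example, the openstacksdk Python library can request a virtual server from the compute service. The sketch below is only an illustration: the cloud profile name in clouds.yaml and the image, flavor and network IDs are all placeholders.

import openstack

# Connect using credentials defined under the "mycloud" profile in clouds.yaml.
conn = openstack.connect(cloud="mycloud")

# Ask the compute service (nova) for a new virtual server; all IDs are placeholders.
server = conn.compute.create_server(
    name="demo-server",
    image_id="<image-uuid>",
    flavor_id="<flavor-id>",
    networks=[{"uuid": "<network-uuid>"}],
)

# Block until the scheduler has placed the server and it becomes ACTIVE.
server = conn.compute.wait_for_server(server)
print(server.status)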
Apache Hadoop : Apache Hadoop is an open source framework for distributed batch processing of big data. MapReduce has also been proposed as a parallel programming model suitable for the cloud; the MapReduce model allows large-scale computations to be parallelized across a large cluster of servers. The Hadoop ecosystem consists of the following projects.
Hadoop Common : Hadoop Common consists of common utilities that support the other Hadoop modules. Hadoop Common has utilities and scripts for starting Hadoop components and interfaces to access the file systems supported by Hadoop.
Hadoop Distributed File System : HDFS is a distributed file system that runs on large clusters and provides high-throughput access to data. HDFS was built to reliably store very large files across machines in a large cluster built of commodity hardware. HDFS stores each file as a sequence of blocks, all of which are of the same size except the last block. The blocks of each file are replicated on multiple machines in a cluster, with a default replication factor of 3, to provide fault tolerance.
Map Reduce – A framework that helps programs do the parallel computation on data. The map
task takes input data and converts it into a dataset that can be computed in key value pairs. The
output of the map task is consumed by reduce tasks to aggregate output and provide the desired
result.
YARN : YARN (Yet Another Resource Negotiator) manages cluster resources and schedules jobs; it is the resource management framework of Hadoop.
HBase – An open source, non-relational, versioned database that runs on top of Amazon S3
(using EMRFS) or the Hadoop Distributed File System (HDFS). HBase is a massively scalable,
distributed big data store built for random, strictly consistent, real-time access for tables with
billions of rows and millions of columns.
Apache ZooKeeper : ZooKeeper is a distributed coordination service; in Hadoop it is used to automate failover when the active NameNode fails.


Apache Pig : Pig is a high-level platform for creating data analysis programs that run on Hadoop. Programs are written in the Pig Latin language.
Hive – Allows users to leverage Hadoop Map Reduce using a SQL interface, enabling analytics
at a massive scale, in addition to distributed and fault-tolerant data warehousing.
Chukwa : Chukwa is a data collection system for monitoring large distributed systems. Chukwa is built on top of HDFS and Hadoop MapReduce and allows collecting and analyzing data.
Mahout : Mahout is a scalable machine learning library with algorithms for clustering, classification and collaborative filtering, implemented on top of Hadoop using the MapReduce parallel programming model.
Cassandra : Cassandra is a scalable multi-master database that provides a highly available service with no single point of failure. Cassandra is a NoSQL solution that provides a structured key-value store.
Avro : Avro is a data serialization system that provides rich data structures, a compact and fast binary data format, a container file to store persistent data, cross-language support, and simple integration with dynamic languages.
Apache Oozie : Oozie is a workflow scheduler that manages and schedules Hadoop jobs.
Flume : Flume is a distributed, reliable and available service for collecting, aggregating and moving large amounts of data from applications to HDFS.
Apache Sqoop : Sqoop is a command-line tool for transferring bulk data between Hadoop and relational databases.

Hadoop Map Reduce Job Execution:


Hadoop MapReduce is the data processing layer. It processes the huge amounts of structured and unstructured data stored in HDFS. MapReduce processes data in parallel by dividing the job into a set of independent tasks, so parallel processing improves speed and reliability.
Hadoop MapReduce data processing takes place in two phases: the Map phase and the Reduce phase.
 Map phase- It is the first phase of data processing. In this phase, we specify all the complex logic/business rules/costly code.
 Reduce phase- It is the second phase of processing. In this phase, we specify lightweight processing such as aggregation/summation. (A minimal word-count sketch of both phases is given below.)
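To make the two phases concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets Python scripts act as the mapper and reducer. This is an illustrative example rather than part of the original text; the file names mapper.py and reducer.py are arbitrary.

# mapper.py -- Map phase: emit one (word, 1) pair per word, tab-separated.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py -- Reduce phase: input arrives grouped and sorted by key,
# so counts for each word can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

The two scripts are submitted with the hadoop-streaming JAR, passing them as the -mapper and -reducer options along with HDFS -input and -output paths.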
Steps of Map Reduce Job Execution flow

Map Reduce processes the data in various phases with the help of different components. Let’s
discuss the steps of job execution in Hadoop.

1. Input Files

The data for a MapReduce job is stored in input files, which reside in HDFS. The input file format is arbitrary; line-based log files and binary formats can also be used.
2. Input Format

The InputFormat defines how to split and read these input files. It selects the files or other objects used for input and creates the InputSplits.

3. Input Splits

An InputSplit represents the data which will be processed by an individual Mapper. For each split, one map task is created, so the number of map tasks is equal to the number of InputSplits. The framework divides each split into records, which the mapper processes.
4. Record Reader

The RecordReader communicates with the InputSplit and converts the data into key-value pairs suitable for reading by the Mapper. By default, Hadoop uses TextInputFormat, whose RecordReader converts the data into key-value pairs. The RecordReader communicates with the InputSplit until the file reading is complete. It assigns a byte offset (as the key) to each line present in the file. These key-value pairs are then sent to the mapper for further processing.

5. Mapper

The Mapper processes each input record produced by the RecordReader and generates intermediate key-value pairs. The intermediate output can be completely different from the input pair. The output of the mapper is the full collection of key-value pairs.
The Hadoop framework doesn't store the output of the mapper on HDFS, as the data is temporary and writing it to HDFS would create unnecessary multiple copies. The mapper then passes its output to the combiner for further processing.

6. Combiner

The Combiner is a mini-reducer which performs local aggregation on the mapper's output. It minimizes the data transfer between the mapper and the reducer. When the combiner functionality completes, the framework passes its output to the partitioner for further processing.

7. Partitioner

The Partitioner comes into play when we are working with more than one reducer. It takes the output of the combiner and performs partitioning.
Partitioning of the output takes place on the basis of the key. A hash function of the key (or a subset of the key) determines the partition, so records having the same key value go into the same partition. Each partition is then sent to a reducer.
Partitioning allows even distribution of the map output over the reducers (a minimal sketch of hash partitioning follows).

8. Shuffling and Sorting

After partitioning, the output is shuffled to the reduce nodes. Shuffling is the physical movement of the data, which is done over the network. Once all the mappers finish, their output is shuffled onto the reducer nodes. The framework then merges and sorts this intermediate output, which is provided as the input to the reduce phase.

9. Reducer

The Reducer takes the set of intermediate key-value pairs produced by the mappers as its input and runs a reducer function on each of them to generate the output.
The output of the reducer is the final output, which the framework stores on HDFS.

10. Record Writer

It writes the output key-value pairs from the Reducer phase to the output files.

11. Output Format

The OutputFormat defines the way the RecordWriter writes these output key-value pairs to the output files. The OutputFormat instances provided by Hadoop write files to HDFS; thus the OutputFormat instances write the final output of the reducer on HDFS.

Hadoop Schedulers:

Introduction to Hadoop Schedulers :

Prior to Hadoop 2, Hadoop MapReduce was a software framework for writing applications that process huge amounts of data (terabytes to petabytes) in parallel on large Hadoop clusters. This framework was also responsible for scheduling tasks, monitoring them, and re-executing failed tasks.
In Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced. The basic idea behind YARN is to split the functionalities of resource management and job scheduling/monitoring into separate daemons: the ResourceManager, the ApplicationMaster, and the NodeManager.
Resource Manager is the master daemon that arbitrates resources among all the applications in
the system. Node Manager is the slave daemon responsible for containers, monitoring their
resource usage, and reporting the same to Resource Manager or Schedulers. ApplicationMaster
negotiates resources from the ResourceManager and works with NodeManager in order to
execute and monitor the task.

The ResourceManager has two main components: the Scheduler and the ApplicationsManager. The Scheduler in the YARN ResourceManager is a pure scheduler, responsible only for allocating resources to the various running applications. It is not responsible for monitoring or tracking the status of an application. Also, the scheduler does not guarantee restarting of tasks that fail, either due to hardware failure or application failure.

There are three main schedulers:

1. FIFO Scheduler,
2. Capacity Scheduler,
3. Fair Scheduler

1. FIFO Scheduler

First In First Out is the default scheduling policy used in Hadoop. FIFO Scheduler gives more
preferences to the application coming first than those coming later. It places the applications in a
queue and executes them in the order of their submission (first in, first out).
Here, irrespective of size and priority, the requests of the first application in the queue are allocated first. Only once the first application's request is satisfied is the next application in the queue served.

Advantage:
 It is simple to understand and doesn’t need any configuration.
 Jobs are executed in the order of their submission.
Disadvantage:
 It is not suitable for shared clusters. If the large application comes before the shorter one, then
the large application will use all the resources in the cluster, and the shorter application has to
wait for its turn. This leads to starvation.

 It does not take into account the balance of resource allocation between the long applications
and short applications.

2. Capacity Scheduler

The CapacityScheduler allows multiple tenants to securely share a large Hadoop cluster. It is designed to run Hadoop applications in a shared, multi-tenant cluster while maximizing the throughput and the utilization of the cluster.
It supports hierarchical queues to reflect the structure of the organizations or groups that utilize the cluster resources. A queue hierarchy contains three types of queues: root, parent, and leaf.

The root queue represents the cluster itself, parent queue represents organization/group or sub-
organization/sub-group, and the leaf accepts application submission.

The Capacity Scheduler allows the sharing of the large cluster while giving capacity guarantees
to each organization by allocating a fraction of cluster resources to each queue.

Also, when queues running below capacity demand free resources that are available on a queue which has completed its tasks, these resources are assigned to the applications on the queues running below capacity. This provides elasticity for the organization in a cost-effective manner.

Apart from it, the CapacityScheduler provides a comprehensive set of limits to ensure that a
single application/user/queue cannot use a disproportionate amount of resources in the cluster.

To ensure fairness and stability, it also provides limits on initialized and pending apps from a
single user and queue.

Advantages:
 It maximizes the utilization of resources and throughput in the Hadoop cluster.

 Provides elasticity for groups or organizations in a cost-effective manner.


 It also gives capacity guarantees and safeguards to the organization utilizing cluster.
Disadvantage:
 It is the most complex among the three schedulers.
3. Fair Scheduler

FairScheduler allows YARN applications to fairly share resources in large Hadoop clusters. With
FairScheduler, there is no need for reserving a set amount of capacity because it will dynamically
balance resources between all running applications.

It assigns resources to applications in such a way that all applications get, on average, an equal
amount of resources over time.

The FairScheduler, by default, takes scheduling fairness decisions only on the basis of memory.
We can configure it to schedule with both memory and CPU.

When a single application is running, that app uses the entire cluster resources. When other applications are submitted, freed-up resources are assigned to the new apps so that every app eventually gets roughly the same amount of resources. The FairScheduler enables short apps to finish in a reasonable time without starving the long-lived apps.

Similar to the CapacityScheduler, the FairScheduler supports hierarchical queues to reflect the structure of the shared cluster.

Apart from fair scheduling, the FairScheduler allows for assigning minimum shares to queues for
ensuring that certain users, production, or group applications always get sufficient resources.
When an app is present in the queue, then the app gets its minimum share, but when the queue
doesn’t need its full guaranteed share, then the excess share is split between other running
applications.

Advantages:
 It provides a reasonable way to share the Hadoop Cluster between the number of users.
 Also, the FairScheduler can work with app priorities where the priorities are used as weights
in determining the fraction of the total resources that each application should get.
Disadvantage:
 It requires configuration.
Hadoop Cluster:
A Hadoop cluster is nothing but a group of computers connected together via a LAN. We use it for storing and processing large data sets. Hadoop clusters consist of a number of commodity machines connected together, which communicate with a high-end machine that acts as the master. The master and slaves implement distributed computing over distributed data storage, and the cluster runs open source software to provide this distributed functionality.
A Hadoop cluster has a master-slave architecture.
i. Master in Hadoop Cluster
It is a machine with a good configuration of memory and CPU. There are two daemons running
on the master and they are NameNode and Resource Manager.
a. Functions of NameNode

 Manages file system namespace


 Regulates access to files by clients
 Stores metadata of the actual data, for example the file path, number of blocks, block IDs, the location of blocks, etc.

 Executes file system namespace operations like opening, closing, renaming files and
directories
The NameNode stores the metadata in the memory for fast retrieval. Hence we should configure
it on a high-end machine.
b. Functions of Resource Manager

 It arbitrates resources among competing nodes


 Keeps track of live and dead nodes
ii. Slaves in the Hadoop Cluster
It is a machine with a normal configuration. There are two daemons running on Slave machines
and they are – DataNode and Node Manager
a. Functions of DataNode
 It stores the business data
 It does read, write and data processing operations
 Upon instruction from a master, it does creation, deletion, and replication of data blocks.
b. Functions of NodeManager
 It runs services on the node to check its health and reports the same to ResourceManager.
We can easily scale Hadoop cluster by adding more nodes to it. Hence we call it a linearly scaled
cluster. Each node added increases the throughput of the cluster.
Client nodes in Hadoop cluster – We install Hadoop and configure it on client nodes.
c. Functions of the client node
 To load the data on the Hadoop cluster.

 Tells how to process the data by submitting MapReduce job.


 Collects the output from a specified location.
3. Single Node Cluster VS Multi-Node Cluster

As the name suggests, a single-node cluster gets deployed on a single machine, and a multi-node cluster gets deployed on several machines.
In a single-node Hadoop cluster, all the daemons like NameNode and DataNode run on the same machine, and all the processes run on one JVM instance. The user need not make any configuration settings apart from setting the JAVA_HOME variable. The default replication factor for a single-node Hadoop cluster is one.
In multi-node Hadoop clusters, the daemons run on separate hosts or machines. A multi-node Hadoop cluster has a master-slave architecture, in which the NameNode daemon runs on the master machine and the DataNode daemons run on the slave machines. In a multi-node Hadoop cluster, the slave daemons like DataNode and NodeManager run on cheap machines, while the master daemons like NameNode and ResourceManager run on powerful servers. In a multi-node Hadoop cluster, slave machines can be present in any location irrespective of the physical location of the master server.
4. Communication Protocols Used in Hadoop Clusters

The HDFS communication protocols work on top of the TCP/IP protocol. The client establishes a connection with the NameNode using a configurable TCP port and talks to the NameNode using the Client Protocol. The DataNode talks to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. The NameNode does not initiate any RPCs; instead it only responds to RPC requests from DataNodes and clients.
5. How to Build a Cluster in Hadoop

Building a Hadoop cluster is a non-trivial job. Ultimately the performance of our system will depend upon how we have configured our cluster. In this section, we will discuss various parameters one should take into consideration while setting up a Hadoop cluster.
For choosing the right hardware one must consider the following points:
 Understand the kind of workloads the cluster will be dealing with, the volume of data the cluster needs to handle, and the kind of processing required (CPU bound, I/O bound, etc.).

 Data storage methodology like data compression technique used if any.


 Data retention policy like how frequently we need to flush.
Sizing the Hadoop Cluster
For determining the size of Hadoop clusters we need to look at how much data is in hand. We
should also examine the daily data generation. Based on these factors we can decide the
requirements of a number of machines and their configuration. There should be a balance
between performance and cost of the hardware approved.
Configuring Hadoop Cluster
For deciding the configuration of Hadoop cluster, run typical Hadoop jobs on the default
configuration to get the baseline. We can analyze job history log files to check if a job takes
more time than expected. If so then change the configuration. After that repeat the same process
to fine tune the Hadoop cluster configuration so that it meets the business requirement.
Performance of the cluster greatly depends upon resources allocated to the daemons. The
Hadoop cluster allocates one CPU core for small to medium data volume to each DataNode. And
for large data sets, it allocates two CPU cores to the HDFS daemons.
6. Hadoop Cluster Management

When you deploy your Hadoop cluster in production it is apparent that it would scale along all
dimensions. They are volume, velocity, and variety. Various features that it should have to
become production-ready are – robust, round the clock availability, performance and
manageability. Hadoop cluster management is the main aspect of your big data initiative.
A good cluster management tool should have the following features:-
 It should provide diverse work-load management, security, resource provisioning,
performance optimization, health monitoring. Also, it needs to provide policy management,
job scheduling, back up and recovery across one or more nodes.
 Implement NameNode high availability with load balancing, auto-failover, and hot standbys
 Enabling policy-based controls that prevent any application from gulping more resources than
others.
 Managing the deployment of any layers of software over Hadoop clusters by performing
regression testing. This is to make sure that any jobs or data won’t crash or encounter any
bottlenecks in daily operations.
7. Benefits of Hadoop Clusters

Here is a list of benefits provided by Clusters in Hadoop –

 Robustness
 Data disks failures, heartbeats and re-replication
 Cluster Rebalancing
 Data integrity
 Metadata disk failure
 Snapshot
i. Robustness

The main objective of Hadoop is to store data reliably even in the event of failures. The various kinds of failures are NameNode failure, DataNode failure, and network partition. Each DataNode periodically sends a heartbeat signal to the NameNode. In a network partition, a set of DataNodes gets disconnected from the NameNode, so the NameNode does not receive any heartbeat from these DataNodes. It marks these DataNodes as dead and does not forward any I/O requests to them. The replication factor of the blocks stored in these DataNodes then falls below the specified value, so the NameNode initiates replication of these blocks. In this way, the cluster recovers from the failure.
ii. Data Disks Failure, Heartbeats, and Re-replication

NameNode receives a heartbeat from each DataNode. NameNode may fail to receive heartbeat
because of certain reasons like network partition. In this case, it marks these nodes as dead. This
decreases the replication factor of the data present in the dead nodes. Hence NameNode initiates
replication for these blocks thereby making the cluster fault tolerant.
iii. Cluster Rebalancing

The HDFS architecture automatically does cluster rebalancing. Suppose the free space in a
DataNode falls below a threshold level. Then it automatically moves some data to another
DataNode where enough space is available.
iv. Data Integrity

The Hadoop cluster computes a checksum on each block of a file, to detect corruption due to buggy software, faults in the storage device, etc. If it finds a block corrupted, it fetches the block from another DataNode that has a replica (a conceptual checksum sketch follows).
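Conceptually, the check works like the short sketch below: a checksum stored when the block is written is recomputed on read, and a mismatch signals corruption. HDFS actually stores CRC checksums per chunk of each block; this Python fragment is only an illustration of the idea.

import zlib

def block_checksum(data):
    # CRC32 over the block contents, as recorded at write time.
    return zlib.crc32(data)

block = b"contents of one HDFS block"
stored = block_checksum(block)          # persisted alongside the block

# On read: recompute and compare; a mismatch means this replica is corrupt
# and the client should fetch the block from another DataNode.
if block_checksum(block) != stored:
    print("block corrupted, read another replica")
else:
    print("block verified")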

v. Metadata Disk Failure

FSImage and Editlog are the central data structures of HDFS. Corruption of these files can stop
the functioning of HDFS. For this reason, we can configure NameNode to maintain multiple
copies of FSImage and EditLog. Updating multiple copies of FSImage and EditLog can degrade the performance of namespace operations, but this is acceptable because Hadoop applications are more data-intensive than metadata-intensive.
vi. Snapshot

A snapshot is nothing but a copy of the data stored at a particular instant of time. One of the uses of snapshots is to roll back a failed HDFS instance to a good point in time. We can take snapshots of a sub-tree of the file system or of the entire file system. Some of the uses of snapshots are disaster recovery, data backup, and protection against user error. We can take snapshots of any directory, but only after that directory has been set as snapshottable. Administrators can set any directory as snapshottable. We cannot rename or delete a snapshottable directory if there are snapshots in it; after removing all the snapshots from the directory, we can rename or delete it.
