Cloud Computing
UNIT - II
Cloud Services & Platforms:
In this chapter you will learn about various types of cloud computing services including
compute, storage, database, application, content delivery, analytics, deployment and management. For
each category of cloud services, examples of services provided by various cloud service
providers including Amazon, Google and Microsoft are described.
The cloud computing reference model comprises the various cloud service models
(IaaS, PaaS and SaaS). Infrastructure-as-a-Service (IaaS) provides dynamically
scalable resources using a virtualized infrastructure.
Platform-as-a-Service (PaaS) simplifies application development by providing development
tools, application programming interfaces (APIs) and software libraries that can be used for a
wide range of applications. Software-as-a-Service (SaaS) provides multi-tenant applications
hosted in the cloud.
The bottom-most layer in the cloud reference model is the infrastructure and facilities layer, which
includes the physical infrastructure such as data center facilities, electrical and mechanical
equipment, etc. On top of the infrastructure layer is the hardware layer, which includes physical
compute, network and storage hardware. On top of the hardware layer, the virtualization layer
partitions the physical hardware resources into multiple virtual resources, enabling pooling
of resources. Chapter 2 described various types of virtualization approaches such as full
virtualization, para-virtualization and hardware virtualization. The computing services are
delivered in the form of Virtual Machines (VMs) along with the storage and network resources.
The platform and middleware layer builds upon the IaaS layers below and provides
standardized stacks of services such as database services, queuing services, application frameworks
and run-time environments, messaging services, monitoring services, analytics services, etc.
The service management layer provides APIs for requesting, managing and monitoring cloud
resources. The topmost layer is the applications layer, which includes SaaS applications such as email,
cloud storage applications, productivity apps, management portals, customer self-service portals,
etc.
The sections below describe various types of cloud services and the associated layers in the cloud reference model.
Compute Services:
Compute services provide dynamically scalable compute capacity in the cloud. Compute
resources can be provisioned on-demand in the form of virtual machines. Virtual machines can
be created from standard images provided by the cloud service provider
(e.g., an Ubuntu image, a Windows Server image, etc.) or from custom images created by the users.
EX:
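As a sketch of how compute services provision VMs from images, the toy Python model below registers machine images and launches virtual machines on demand; all image names, instance types and VM IDs are illustrative, not real provider identifiers.

```python
import itertools

class ComputeService:
    """Toy model of an IaaS compute service: VMs are provisioned
    on demand from standard or custom machine images."""

    _ids = itertools.count(1)

    def __init__(self):
        self.images = {}   # image name -> base configuration
        self.vms = {}      # vm id -> vm record

    def register_image(self, name, os, disk_gb):
        # A standard provider image or a user-created custom image.
        self.images[name] = {"os": os, "disk_gb": disk_gb}

    def launch(self, image_name, instance_type):
        # Provision a new VM on demand from the named image.
        image = self.images[image_name]
        vm_id = f"vm-{next(self._ids)}"
        self.vms[vm_id] = {"image": image_name, "type": instance_type,
                           "os": image["os"], "state": "running"}
        return vm_id

cloud = ComputeService()
cloud.register_image("ubuntu-20.04", os="linux", disk_gb=8)
vm = cloud.launch("ubuntu-20.04", "small")
print(cloud.vms[vm]["state"])   # running
```

A real provider exposes the same idea through an API call (for example, a "run instances" request naming an image and an instance type) rather than an in-process object.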
Storage Services:
Cloud storage services allow storage and retrieval of any amount of data, at any time
and from anywhere on the web. Most cloud storage services organize data into buckets or
containers. Buckets or containers store objects, which are individual pieces of data.
Features
• Scalability: Cloud storage services provide high capacity and scalability.
• Replication: When an object is uploaded it is replicated at multiple facilities
and/or on multiple devices within each facility.
• Access Policies: Cloud storage services provide several security features such
as Access Control Lists (ACLs), bucket/container-level policies, etc. ACLs can
be used to selectively grant access permissions on individual objects.
Bucket/container-level policies can also be defined to grant permissions across some
of the objects within a single bucket/container.
• Encryption: Cloud storage services provide a Server-Side Encryption (SSE)
option to encrypt all data stored in the cloud storage.
• Consistency: Cloud storage services provide read-after-write consistency, so any object
that is uploaded can be downloaded immediately after the upload is complete.
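The bucket/object/ACL concepts above can be sketched as a toy in-memory object store; the bucket names, keys and user names are hypothetical, and real services enforce far richer policies.

```python
class ObjectStore:
    """Toy object storage service: buckets hold objects, and a
    per-object ACL selectively grants read access to users."""

    def __init__(self):
        self.buckets = {}   # bucket -> {key: {"data": bytes, "acl": set}}

    def create_bucket(self, bucket):
        self.buckets[bucket] = {}

    def put_object(self, bucket, key, data, readers=()):
        # Each object carries its own access-control list.
        self.buckets[bucket][key] = {"data": data, "acl": set(readers)}

    def get_object(self, bucket, key, user):
        obj = self.buckets[bucket][key]
        if user not in obj["acl"]:       # ACL check on the individual object
            raise PermissionError(f"{user} may not read {bucket}/{key}")
        return obj["data"]

store = ObjectStore()
store.create_bucket("photos")
store.put_object("photos", "cat.png", b"\x89PNG...", readers={"alice"})
print(store.get_object("photos", "cat.png", "alice"))
```

A production service would additionally replicate each object across facilities and optionally encrypt it at rest, as described in the features above.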
Amazon Simple Storage Service:
Amazon Simple Storage Service (S3) is the cloud storage service from Amazon, which organizes data as objects stored in buckets.
Database Services:
Cloud database services allow you to set up and operate relational or non-relational
databases in the cloud. The benefit of using cloud database services is that they relieve
the application developers from time-consuming database administration tasks.
Popular relational databases provided by various cloud service providers include
MySQL, Oracle, SQL Server, etc. The non-relational (NoSQL) databases provided by
cloud service providers are mostly proprietary solutions. NoSQL databases are
usually fully managed and deliver high availability and scalability. The characteristics of relational
and non-relational databases are described below.
Features
• Scalability: Cloud database services allow scaling up or down the compute and
storage resources as required to meet the application workload levels.
• Provisioned capacity: Some cloud database services allow read/write capacity to be
provisioned in advance so that the required throughput is available.
Windows Azure SQL Database is the relational database service from Microsoft.
Azure SQL Database is based on SQL Server, but it does not give each customer a
dedicated instance of SQL Server. Instead, SQL Database is a multi-tenant service, with a
logical SQL Database server for each customer.
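The multi-tenant arrangement, in which each customer gets a logical database on shared physical servers, can be pictured with the sketch below; the server names are hypothetical and the hashing rule is only illustrative, not how Azure actually places tenants.

```python
import hashlib

# Several logical databases share a small pool of physical servers
# (multi-tenancy); these server names are made up for illustration.
physical_servers = ["sql-node-1", "sql-node-2"]

def assign_logical_db(customer):
    """Map a customer's logical database onto a shared physical server,
    using a hash that is stable across runs."""
    digest = int(hashlib.md5(customer.encode()).hexdigest(), 16)
    server = physical_servers[digest % len(physical_servers)]
    return {"logical_db": f"db-{customer}", "server": server}

a = assign_logical_db("contoso")
b = assign_logical_db("contoso")
print(a == b)   # True: each customer always maps to the same server
```

The point of the sketch is that the customer sees only a logical database, while the service decides which shared physical server hosts it.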
Windows Azure Table Service:
Application Services:
In this section you will learn about various cloud application services such as
application runtimes and frameworks, queuing services, email services, notification
services and media services.
Cloud-based application runtimes and frameworks allow developers to develop and host
applications in the cloud. Application runtimes provide support for programming
languages (e.g., Java, Python, or Ruby). Application runtimes automatically allocate
resources for applications and handle the application scaling, without the need to run and
maintain servers.
Google App Engine
Google App Engine is the Platform-as-a-Service (PaaS) from Google, which includes both
an application runtime and web frameworks. Figure 3.13 shows a screenshot of the
Google App Engine console.
App Engine features include:
• Runtimes: App Engine supports applications developed in the Java, Python, PHP and
Go programming languages, and provides runtime environments for each of them.
• Sandbox: Applications run in a secure sandbox environment isolated from other
applications. The sandbox environment provides limited access to the underlying
operating system. App Engine can only execute application code called from HTTP
requests. The sandbox environment allows App Engine to distribute web requests
for the application across multiple servers.
• Web Frameworks: App Engine provides a simple Python web application
framework called webapp2. App Engine also supports any framework written in
pure Python that speaks WSGI, including Django, CherryPy, Pylons, web.py, and
web2py.
• Datastore: App Engine provides a NoSQL data storage service.
• Authentication: App Engine applications can be integrated with Google
Accounts for user authentication.
• URL Fetch service: URL Fetch service allows applications to access resources
on the Internet, such as web services or other data.
• Email service: Email service allows applications to send email messages.
• Image Manipulation service: Image Manipulation service allows applications
to manipulate images (e.g., resize, crop and rotate).
Content Delivery Services:
Cloud-based content delivery services include content delivery networks (CDNs). A CDN is a
distributed system of servers located across multiple geographic locations to serve content to end-users
with high availability and high performance. CDNs are useful for serving static content
such as text, images, etc., and streaming media. CDNs have a number of edge locations deployed
in multiple locations, over multiple backbones. Requests for static or streaming media content
that is served by a CDN are directed to the nearest edge location. CDNs cache the popular
content on the edge servers, which helps in reducing bandwidth costs and improving response
times.
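The nearest-edge routing and edge caching described above can be sketched as follows; the regions, edge locations and latency figures are made-up numbers for illustration.

```python
# Toy latency map (milliseconds) from each user region to each edge
# location; all names and values are illustrative.
latency_ms = {
    "eu-user": {"frankfurt": 12, "virginia": 95, "singapore": 180},
    "us-user": {"frankfurt": 90, "virginia": 8,  "singapore": 210},
}

def nearest_edge(user_region):
    """Direct the request to the edge location with the lowest latency."""
    edges = latency_ms[user_region]
    return min(edges, key=edges.get)

cache = {"frankfurt": {}, "virginia": {}, "singapore": {}}

def serve(user_region, url, origin_fetch):
    edge = nearest_edge(user_region)
    if url not in cache[edge]:            # cache miss: fetch from the origin
        cache[edge][url] = origin_fetch(url)
    return edge, cache[edge][url]         # later requests hit the edge cache

edge, body = serve("eu-user", "/logo.png", lambda url: b"image-bytes")
print(edge)   # frankfurt
```

Serving repeated requests from the edge cache is exactly what reduces bandwidth costs and response times in the description above.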
EX:
Windows Azure Content Delivery Network
Windows Azure Content Delivery Network (CDN) is the content delivery service from
Microsoft. Azure CDN caches Windows Azure blobs and static content at the edge
locations to improve the performance of websites. Azure CDN can be enabled on a
Windows Azure storage account.
Analytics Services:
Cloud-based analytics services allow analyzing massive data sets stored in the cloud, either in
cloud storage or in cloud databases, using programming models such as MapReduce. Using
cloud analytics services, applications can perform various data-intensive tasks.
OpenStack is a cloud operating system comprising a collection of interacting services that
control computing, storage, and networking resources. The OpenStack compute service (called
nova-compute) manages networks of virtual machines running on nodes, providing virtual
servers on demand. The network service (called nova-networking) provides connectivity
between the interfaces of other OpenStack services. The volume service (Cinder) manages
storage volumes for virtual machines. The object storage service (Swift) allows users to store and
retrieve files. The identity service (Keystone) provides authentication and authorization. The
image registry (Glance) acts as a catalog and repository for virtual machine images. The OpenStack
scheduler (nova-scheduler) maps the nova-API calls to the appropriate OpenStack
components. The scheduler takes the virtual machine requests from the queue and determines
where they should run. The messaging service (RabbitMQ) acts as a central node for message
passing between daemons. Orchestration activities such as running an instance are performed by
the nova-api, which accepts and responds to end user compute API calls. The OpenStack
dashboard (called Horizon) provides a web-based interface for managing OpenStack services.
Apache Hadoop: Apache Hadoop is an open source framework for distributed batch
processing of big data. MapReduce has also been proposed as a parallel programming
model suitable for the cloud. The MapReduce model allows large-scale computations to
be parallelized across a large cluster of servers. The Hadoop ecosystem consists of several projects.
Hadoop Common: Hadoop Common consists of common utilities that support other Hadoop
modules. Hadoop Common has utilities and scripts for starting Hadoop components and
interfaces to access the file systems supported by Hadoop.
Hadoop Distributed File System: HDFS is a distributed file system that runs on large
clusters and provides high-throughput access to data. HDFS was built to reliably store very
large files across machines in a large cluster built of commodity hardware. HDFS stores
each file as a sequence of blocks, all of which are of the same size except the last block. The
blocks of each file are replicated on multiple machines in a cluster, with a default replication
factor of 3, to provide fault tolerance.
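The block layout rule (equal-sized blocks except possibly the last) can be expressed directly; 128 MB is used here as a typical HDFS block size, though the value is configurable.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # a common HDFS block size (configurable)
REPLICATION = 3                  # HDFS default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file is stored as: all blocks
    are block_size except, possibly, the last one."""
    full, rest = divmod(file_size, block_size)
    blocks = [block_size] * full
    if rest:
        blocks.append(rest)      # the last block may be smaller
    return blocks

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
print(len(blocks))   # 3 blocks: 128 MB + 128 MB + 44 MB
# With replication factor 3, the cluster holds 9 block copies in total.
```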
MapReduce – A framework that helps programs perform parallel computation on data. The map
task takes input data and converts it into a dataset that can be computed over as key-value pairs. The
output of the map task is consumed by reduce tasks, which aggregate the output and provide the desired
result.
YARN – Yet Another Resource Negotiator manages cluster resources and schedules jobs. It
is the resource management framework of Hadoop.
HBase – An open source, non-relational, versioned database that runs on top of Amazon S3
(using EMRFS) or the Hadoop Distributed File System (HDFS). HBase is a massively scalable,
distributed big data store built for random, strictly consistent, real-time access for tables with
billions of rows and millions of columns.
Map phase – It is the first phase of data processing. In this phase, the complex
logic, business rules, or costly processing is specified.
Reduce phase – It is the second phase of processing. In this phase, light-weight
processing such as aggregation or summation is specified.
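The two phases can be demonstrated with the classic word-count example, written as a minimal pure-Python sketch of map, shuffle and reduce (a real MapReduce job would distribute these steps across a cluster):

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input record.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle/sort: group all values by key before the reduce phase.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: light-weight aggregation (here, summation).
    return key, sum(values)

lines = ["big data big cluster", "big data"]
pairs = [kv for line in lines for kv in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)   # {'big': 3, 'data': 2, 'cluster': 1}
```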
Steps of Map Reduce Job Execution flow
MapReduce processes the data in various phases with the help of different components. Let us
discuss the steps of job execution in Hadoop.
1. Input Files
The data for a MapReduce job is stored in input files, which reside in HDFS. The input file
format is arbitrary; line-based log files and binary formats can also be used.
2. Input Format
Next, the Input Format defines how to split and read these input files. It selects the files or other
objects used for input, and it creates the Input Splits.
3. Input Splits
An Input Split represents the data that will be processed by an individual Mapper. One map
task is created for each split, so the number of map tasks equals the number of Input Splits. The
framework divides each split into records, which the mapper processes.
4. Record Reader
The Record Reader communicates with the Input Split and converts the data into key-value pairs
suitable for reading by the Mapper. By default, the Record Reader uses TextInputFormat to convert
data into key-value pairs. It communicates with the Input Split until the file reading is complete,
assigning a byte offset to each line present in the file. These key-value pairs are then sent to the
mapper for further processing.
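A line-oriented Record Reader that assigns byte offsets as keys can be sketched as follows (a simplification of Hadoop's TextInputFormat behaviour):

```python
def record_reader(data: bytes):
    """Turn a split into (byte offset, line) key-value pairs, the way
    a line-oriented Record Reader presents records to the mapper."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\n").decode()
        offset += len(line)   # the key is the byte offset of the line

split = b"first line\nsecond line\nthird\n"
for key, value in record_reader(split):
    print(key, value)
# 0 first line
# 11 second line
# 23 third
```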
5. Mapper
The Mapper processes the input records produced by the Record Reader and generates intermediate
key-value pairs. The intermediate output is completely different from the input pairs. The output
of the mapper is the full collection of key-value pairs.
The Hadoop framework doesn't store the output of the mapper on HDFS, as the data is
temporary and writing it to HDFS would create unnecessary multiple copies. The Mapper then
passes the output to the combiner for further processing.
6. Combiner
The combiner, also known as a mini-reducer, performs local aggregation on the output of each
mapper, which reduces the volume of data transferred between the mapper and the reducer.
7. Partitioner
The Partitioner comes into existence when we are working with more than one reducer. It takes
the output of the combiner and performs partitioning.
Partitioning of the output takes place on the basis of the key in MapReduce: a hash function over
the key (or a subset of the key) derives the partition. Records having the same key value go into
the same partition, and each partition is then sent to a reducer.
Partitioning in MapReduce execution allows even distribution of the map output over the
reducers.
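A minimal hash partitioner, matching the rule above (hash of the key modulo the number of reducers), might look like this; md5 is used here only to obtain a hash that is stable across runs:

```python
import hashlib

def partition(key, num_reducers):
    """Derive the partition from a hash of the key, so records with
    the same key always reach the same reducer."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % num_reducers

pairs = [("apple", 1), ("banana", 1), ("apple", 2)]
# Both 'apple' records must land in the same partition.
partitions = {partition(k, 4) for k, _ in pairs if k == "apple"}
print(len(partitions))   # 1
```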
8. Shuffling and Sorting
After partitioning, the output is shuffled to the reduce nodes. Shuffling is the physical
movement of the data, which is done over the network once all the mappers have finished and
their output is available on the reducer nodes.
The framework then merges and sorts this intermediate output and provides it as input to the
reduce phase.
9. Reducer
The Reducer takes the set of intermediate key-value pairs produced by the mappers as input
and runs a reducer function on each of them to generate the output.
The output of the reducer is the final output, which the framework stores on HDFS.
10. Record Writer
The Record Writer writes the output key-value pairs from the Reducer phase to the output files.
11. Output Format
The Output Format defines the way the Record Writer writes these output key-value pairs to the
output files. The Output Format instances provided by Hadoop write the files to HDFS; thus,
Output Format instances write the final output of the reducer on HDFS.
Hadoop Schedulers:
Prior to Hadoop 2, Hadoop MapReduce was a software framework for writing applications that
process huge amounts of data (terabytes to petabytes) in parallel on large Hadoop clusters.
This framework was also responsible for scheduling tasks, monitoring them, and re-executing
failed tasks.
In Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced. The basic idea
behind the introduction of YARN is to split the functionalities of resource management and job
scheduling/monitoring into separate daemons: the ResourceManager, the ApplicationMaster,
and the NodeManager.
The ResourceManager is the master daemon that arbitrates resources among all the applications in
the system. The NodeManager is the slave daemon responsible for containers, monitoring their
resource usage, and reporting the same to the ResourceManager. The ApplicationMaster
negotiates resources from the ResourceManager and works with the NodeManagers in order to
execute and monitor the tasks.
The ResourceManager has two main components: the Scheduler and the ApplicationsManager.
Hadoop provides three schedulers:
1. FIFO Scheduler
2. Capacity Scheduler
3. Fair Scheduler
1. FIFO Scheduler
First In First Out is the default scheduling policy used in Hadoop. The FIFO Scheduler gives more
preference to applications submitted earlier than to those submitted later. It places the applications in a
queue and executes them in the order of their submission (first in, first out).
Here, irrespective of size and priority, the requests of the first application in the queue are
allocated resources first. Only once the first application's request is satisfied is the next application
in the queue served.
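The FIFO policy can be sketched as a simple queue; the application names below are illustrative:

```python
from collections import deque

class FifoScheduler:
    """Applications run strictly in submission order; a later job
    waits until every earlier job has finished (first in, first out)."""

    def __init__(self):
        self.queue = deque()

    def submit(self, app):
        self.queue.append(app)

    def run_all(self):
        order = []
        while self.queue:
            order.append(self.queue.popleft())  # earliest submission first
        return order

sched = FifoScheduler()
for app in ["big-batch-job", "small-query", "medium-etl"]:
    sched.submit(app)
print(sched.run_all())   # runs in submission order, size/priority ignored
```

Note that the small query runs only after the big batch job, which is exactly the starvation problem the disadvantage below describes.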
Advantage:
It is simple to understand and doesn’t need any configuration.
Jobs are executed in the order of their submission.
Disadvantage:
It is not suitable for shared clusters. If a large application arrives before a shorter one, the
large application will use all the resources in the cluster, and the shorter application has to
wait for its turn. This leads to starvation.
It does not take into account the balance of resource allocation between the long applications
and short applications.
2. Capacity Scheduler
The Capacity Scheduler organizes cluster resources as a hierarchy of queues: the root queue
represents the cluster itself, parent queues represent organizations/groups or sub-organizations/sub-groups,
and the leaf queues accept application submissions.
The Capacity Scheduler allows the sharing of the large cluster while giving capacity guarantees
to each organization by allocating a fraction of cluster resources to each queue.
Also, when free resources are available on a queue that has completed its tasks, these resources
are assigned to the applications on queues running below capacity. This provides elasticity for the
organizations in a cost-effective manner.
Apart from it, the CapacityScheduler provides a comprehensive set of limits to ensure that a
single application/user/queue cannot use a disproportionate amount of resources in the cluster.
To ensure fairness and stability, it also provides limits on initialized and pending apps from a
single user and queue.
Advantages:
It maximizes the utilization of resources and throughput in the Hadoop cluster.
3. Fair Scheduler
The FairScheduler allows YARN applications to fairly share resources in large Hadoop clusters. With
the FairScheduler, there is no need for reserving a set amount of capacity, because it dynamically
balances resources between all running applications.
It assigns resources to applications in such a way that all applications get, on average, an equal
amount of resources over time.
The FairScheduler, by default, takes scheduling fairness decisions only on the basis of memory.
We can configure it to schedule with both memory and CPU.
When a single application is running, that application uses the entire cluster resources. When
other applications are submitted, freed-up resources are assigned to the new applications, so that
every application eventually gets roughly the same amount of resources. The FairScheduler enables
short applications to finish in a reasonable time without starving the long-lived applications.
Apart from fair scheduling, the FairScheduler allows for assigning minimum shares to queues for
ensuring that certain users, production, or group applications always get sufficient resources.
When an app is present in the queue, then the app gets its minimum share, but when the queue
doesn’t need its full guaranteed share, then the excess share is split between other running
applications.
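A much-simplified version of the minimum-share rule can be sketched as below; the real FairScheduler also weighs demand, priorities and memory/CPU, so this is only an illustration of splitting the excess share:

```python
def fair_shares(total, min_shares):
    """Give each queue its guaranteed minimum share, then split the
    excess evenly among all queues (a simplified fair-share rule)."""
    guaranteed = sum(min_shares.values())
    excess = max(total - guaranteed, 0)
    bonus = excess / len(min_shares)
    return {queue: share + bonus for queue, share in min_shares.items()}

# 100 resource units; production is guaranteed 40, research 20.
shares = fair_shares(100, {"production": 40, "research": 20})
print(shares)   # {'production': 60.0, 'research': 40.0}
```

Each queue keeps its minimum, and the 40 spare units are split evenly, mirroring the "excess share is split between other running applications" behaviour described above.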
Advantages:
It provides a reasonable way to share the Hadoop Cluster between the number of users.
Also, the FairScheduler can work with app priorities where the priorities are used as weights
in determining the fraction of the total resources that each application should get.
Disadvantage:
It requires configuration.
Hadoop Cluster:
A Hadoop cluster is a group of computers connected together via a LAN, used for
storing and processing large data sets. A Hadoop cluster consists of a number of commodity
machines connected together, which communicate with a high-end machine that acts as the master.
The master and slaves implement distributed computing over distributed data storage. The cluster
runs open source software that provides the distributed functionality.
Hadoop cluster has master-slave architecture.
i. Master in Hadoop Cluster
It is a machine with a good configuration of memory and CPU. There are two daemons running
on the master and they are NameNode and Resource Manager.
a. Functions of NameNode
Executes file system namespace operations like opening, closing, renaming files and
directories
The NameNode stores the metadata in the memory for fast retrieval. Hence we should configure
it on a high-end machine.
b. Functions of Resource Manager
As the name suggests, a single-node cluster gets deployed on a single machine, and multi-node
clusters get deployed on several machines.
In single-node Hadoop clusters, all the daemons like NameNode and DataNode run on the same
machine. In a single-node Hadoop cluster, all the processes run on one JVM instance. The user
need not make any configuration settings; the Hadoop user only needs to set the JAVA_HOME
variable. The default replication factor for a single-node Hadoop cluster is one.
In multi-node Hadoop clusters, the daemons run on separate hosts or machines. A multi-node
Hadoop cluster has master-slave architecture. In this, the NameNode daemon runs on the master
machine and the DataNode daemons run on the slave machines. In a multi-node Hadoop cluster, the
slave daemons like DataNode and NodeManager run on cheap machines, while master daemons
like NameNode and ResourceManager run on powerful servers. In a multi-node Hadoop cluster,
slave machines can be present in any location irrespective of the physical location of the master
server.
4. Communication Protocols Used in Hadoop Clusters
The HDFS communication protocols work on top of the TCP/IP protocol. The client establishes
a connection with the NameNode using a configurable TCP port and talks to the NameNode
using the Client Protocol. The DataNode talks to the NameNode using the DataNode
Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode
Protocol. The NameNode does not initiate any RPCs; instead, it responds to RPC requests from the DataNodes and clients.
5. How to Build a Cluster in Hadoop
Building a Hadoop cluster is a non-trivial job. Ultimately the performance of our system will
depend upon how we have configured our cluster. In this section, we will discuss various
parameters one should take into consideration while setting up a Hadoop cluster.
For choosing the right hardware, one must consider the following points:
Understand the kind of workloads the cluster will be dealing with, the volume of data the
cluster needs to handle, and the kind of processing required (CPU bound, I/O bound, etc.).
When you deploy your Hadoop cluster in production, it should scale along all
dimensions: volume, velocity, and variety. The features it should have to
become production-ready are robustness, round-the-clock availability, performance and
manageability. Hadoop cluster management is a key aspect of any big data initiative.
A good cluster management tool should have the following features:-
• It should provide diverse workload management, security, resource provisioning,
performance optimization and health monitoring. Also, it needs to provide policy management,
job scheduling, backup and recovery across one or more nodes.
• Implement NameNode high availability with load balancing, auto-failover, and hot standbys.
• Enable policy-based controls that prevent any application from consuming more resources than
others.
• Manage the deployment of any layer of software over Hadoop clusters by performing
regression testing. This is to make sure that jobs or data won't crash or encounter any
bottlenecks in daily operations.
7. Benefits of Hadoop Clusters
• Robustness
• Data disk failures, heartbeats and re-replication
• Cluster rebalancing
• Data integrity
• Metadata disk failure
• Snapshot
i. Robustness
The main objective of Hadoop is to store data reliably even in the event of failures. The various
kinds of failures are NameNode failure, DataNode failure, and network partition. Each DataNode
periodically sends a heartbeat signal to the NameNode. In a network partition, a set of DataNodes gets
disconnected from the NameNode, so the NameNode does not receive any heartbeats from these
DataNodes. It marks these DataNodes as dead and does not forward any I/O
requests to them. When the replication factor of the blocks stored on these DataNodes falls below its
specified value, the NameNode initiates replication of these blocks. In this way,
the cluster recovers from the failure.
ii. Data Disks Failure, Heartbeats, and Re-replication
The NameNode receives a heartbeat from each DataNode. It may fail to receive heartbeats
for reasons such as a network partition, in which case it marks those nodes as dead. This
decreases the replication factor of the data present on the dead nodes, so the NameNode initiates
replication for these blocks, thereby keeping the cluster fault tolerant.
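The re-replication decision can be sketched as follows: given the block-to-DataNode map and the set of nodes still sending heartbeats, find the blocks whose live replica count fell below the replication factor (the node and block names are hypothetical):

```python
REPLICATION_FACTOR = 3

def blocks_to_replicate(block_locations, live_nodes):
    """After dead DataNodes are detected (missed heartbeats), find
    blocks whose live replica count fell below the replication factor."""
    under_replicated = {}
    for block, nodes in block_locations.items():
        alive = [n for n in nodes if n in live_nodes]
        if len(alive) < REPLICATION_FACTOR:
            # Record how many extra copies must be created.
            under_replicated[block] = REPLICATION_FACTOR - len(alive)
    return under_replicated

locations = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn2", "dn4", "dn5"]}
live = {"dn1", "dn3", "dn4", "dn5"}          # dn2 missed its heartbeats
print(blocks_to_replicate(locations, live))  # {'blk_1': 1, 'blk_2': 1}
```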
iii. Cluster Rebalancing
The HDFS architecture automatically performs cluster rebalancing. If the free space on a
DataNode falls below a threshold level, it automatically moves some data to another
DataNode where enough space is available.
iv. Data Integrity
The Hadoop cluster computes a checksum on each block of a file. It does so to detect any
corruption due to buggy software, faults in storage devices, etc. If it finds a block corrupted, it
fetches the block from another DataNode that has a replica of it.
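The checksum idea can be sketched with an ordinary cryptographic hash (HDFS actually uses lightweight per-chunk checksums such as CRC32C, so this is only an illustration):

```python
import hashlib

def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

# On write: store a checksum alongside every block.
block = b"some block contents"
stored = {"data": block, "checksum": checksum(block)}

# On read: recompute and compare; a mismatch means this replica is
# corrupt and the block should be fetched from another DataNode.
corrupted = bytearray(stored["data"])
corrupted[0] ^= 0xFF   # flip bits to simulate a faulty storage device
print(checksum(bytes(corrupted)) == stored["checksum"])   # False
```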
v. Metadata Disk Failure
The FsImage and EditLog are the central data structures of HDFS. Corruption of these files can stop
the functioning of HDFS. For this reason, the NameNode can be configured to maintain multiple
copies of the FsImage and EditLog. Updating multiple copies of the FsImage and EditLog can
degrade the performance of namespace operations, but this is acceptable because Hadoop deals more
with data-intensive applications than with metadata-intensive operations.
vi. Snapshot
A snapshot stores a copy of the data at a particular instant of time. One of the uses
of snapshots is to roll back a failed HDFS instance to a known good point in time. We can take
snapshots of a sub-tree of the file system or of the entire file system. Some of the uses of snapshots
are disaster recovery, data backup, and protection against user error. We can take snapshots of
any directory, provided the directory has been set as snapshottable; the administrators
can set any directory as snapshottable. We cannot rename or delete a snapshottable directory while
there are snapshots in it. After removing all the snapshots from the directory, we can rename or
delete it.