UNIT-4 CLOUD PROGRAMMING AND SOFTWARE ENVIRONMENT
Cloud Capabilities
Thanks to its speed, scale, and capacity, the cloud offers more functionality with more
automation than nearly every on-premises solution. This is for a few reasons:
• Cloud is built around services. The more you have on-premises, the more you
and your technology need to be jacks-of-all-trades: you have to handle
everything yourselves. In the cloud's service-based model, by contrast, you have
access to individualized services optimized for specific functions instead of one
centralized group of servers and staff. That means Cloud Service Providers
(CSPs) focus on providing discrete elements of the cloud experience, and both
hardware and software are designed for performance.
• Cloud is automated. The computing power inherent to the cloud means that
everything mentioned above should be available to you on demand and in a self-
service manner; the CSP's experts who provide the computing power should be
behind the scenes, not in between you and the technology. (Remember,
on-demand self-service is one of NIST's five essential characteristics of cloud
computing.)
• Cloud (usually) oversees your networking. For the most part, networking, the
joining up of resources within a cloud environment, will be managed by your
Cloud Service Provider (CSP). The CSP has staff dedicated to monitoring and
optimizing cloud resources and, as the keeper of the infrastructure, balances
those resources (e.g., compute, storage, network). The CSP manages resources
within your defined configurations and in line with agreed-upon Service Level
Agreements (SLAs), freeing you from having to monitor, manage, and optimize
the resources yourself.
• Containers are close cousins of virtual machines, one step up the stack: rather
than each bundling its own operating system, they all share the host's operating
system. They're faster to spin up and down (and faster to destroy after they're
done) than virtual machines, which makes them much more efficient while
requiring fewer resources at the same time.
• Clusters in the cloud are, of course, virtual clusters: they follow the same rules
as physical clusters, and they're housed within a portion of a data center that's
been assigned to you. In a physical cluster, you're limited to the machines you
have on-premises, and if one machine fails, there can be ramifications for the
rest of the system. Virtual clusters can be made of physical or virtual machines
situated anywhere in the world, they can be spun up and down on demand, and
they aren't likely to be brought down by the failure of any one individual
component.
• Load balancing provides automatic oversight of every instance in your
production environment and makes sure no server is receiving an undue
amount of strain or reaching its capacity. This is rarely practical on-premises;
the increased latency and lack of scale wouldn't be worth the investment.
• Servers in a cloud context typically mean "space rented on a server." Renting
space on a server reserves that resource for you until you say otherwise; you can
scale up and down on demand, but there's always capacity ready for you at a
moment's notice.
• Serverless is a cloud computing service model where functions of code are the
unit of deployment; there are no machines, VMs, or containers to manage.
Computing power exists for you only when you need it: no resources are
"reserved" for you, and code is dynamically run as needed and then destroyed
when done (see the sketch after this list). Since there are no VMs or even
containers to manage, serverless computing has minimal operational overhead,
making it an ideal platform for developers' dev and test environments.
Serverless tends to be more affordable for infrequently or sporadically used
applications, since you pay only for code as it runs, but it can be slower than
traditional cloud services because no reserved resources are dedicated to your
applications. For these reasons, applications in continuous use, or that are
time-sensitive, are better served by more traditional cloud service models.
• Storage in a traditional data center, whether for databases or files, means
having a physical computer with an operating system and a configuration to run
your files. Storage in the cloud is fundamentally the same, but the resources
aren't usually located in the same place physically: the code might be
somewhere optimized to run code, the hardware somewhere else optimized for
hardware, and the operating system in a third place optimized to run operating
systems.
• Virtualization is effectively the same thing as having virtual machines (i.e.,
multiple instances of an operating system on one physical computer). Instead of
running these virtual machines yourself, however, your CSP will typically provide
the virtualization for you. (Note: software licensing can become sticky under
virtualization, because providers may charge you based on how many virtual
instances you have instead of how many physical computers, since you don't
own those computers yourself anymore.)
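To make the serverless bullet above concrete, here is a minimal sketch of a function-as-a-service handler in the style of the AWS Lambda Python runtime. The event payload shown is a hypothetical example; the platform provisions compute only for the duration of each invocation.

```python
# Minimal sketch of a serverless function in the style of AWS Lambda's
# Python runtime. The platform invokes handler() on demand and tears the
# compute down afterwards; there are no servers or containers to manage.
# The event fields below are a hypothetical example payload.

def handler(event, context):
    # 'event' carries the trigger's payload; 'context' carries runtime metadata.
    name = event.get("name", "world")
    # Business logic runs only while this invocation is alive.
    return {"statusCode": 200, "body": f"Hello, {name}!"}
```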
For example, meteorologists use grid computing for weather modeling. Weather
modeling is a computation-intensive problem that requires complex data management
and analysis. Processing massive amounts of weather data on a single computer is
slow and time consuming. That’s why meteorologists run the analysis over
geographically dispersed grid computing infrastructure and combine the results.
Efficiency
With grid computing, you can break down an enormous, complex task into multiple
subtasks. Multiple computers can work on the subtasks concurrently, making grid
computing an efficient computational solution.
Cost
Grid computing works with existing hardware, which means you can reuse existing
computers. You can save costs while accessing your excess computational resources.
You can also cost-effectively access resources from the cloud.
Flexibility
Grid computing is not constrained to a specific building or location. You can set up a
grid computing network that spans several regions. This allows researchers in different
countries to work collaboratively with the same supercomputing power.
Financial institutions use grid computing primarily to solve problems involving risk
management. By harnessing the combined computing powers in the grid, they can
shorten the duration of forecasting portfolio changes in volatile markets.
Gaming
Entertainment
Some movies have complex special effects that require a powerful computer to create.
The special effects designers use grid computing to speed up the production timeline.
They have grid-supported software that shares computational resources to render the
special-effect graphics.
Engineering
Engineers use grid computing to perform simulations, create models, and analyze
designs. They run specialized applications concurrently on multiple machines to
process massive amounts of data. For example, engineers use grid computing to
reduce the duration of a Monte Carlo simulation, a process that uses repeated
random sampling over past data to make future predictions.
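As a toy illustration of the Monte Carlo idea (a sketch, not an engineering-grade model), the Python snippet below estimates π by random sampling. Because every sample is independent, a grid can split the samples across nodes and simply combine the per-node results.

```python
import random

def monte_carlo_pi(samples: int) -> float:
    """Estimate pi by sampling random points in the unit square."""
    inside = 0
    for _ in range(samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    return 4.0 * inside / samples

# On a grid, each node would run monte_carlo_pi on its own batch of samples,
# and a coordinator would average the per-node estimates.
print(monte_carlo_pi(1_000_000))
```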
In grid computing, a network of computers works together to perform the same task.
The following are the components of a grid computing network.
Nodes
The computers or servers on a grid computing network are called nodes. Each node
offers unused computing resources such as CPU, memory, and storage to the grid
network. At the same time, you can also use the nodes to perform other unrelated
tasks. There is no limit to the number of nodes in grid computing. There are three main
types of nodes: control, provider, and user nodes.
Grid middleware
Grid middleware is the specialized software that sits between high-level
applications and the grid's computing resources, managing the sharing of those
resources across the network. Grid architecture, the internal structure of a grid
node, is broadly layered: applications at the top, middleware beneath them, and
the underlying computing resources at the bottom.
Grid nodes and middleware work together to perform the grid computing task. In grid
operations, the three main types of grid nodes perform three different roles.
User node
A user node is a computer that requests resources shared by other computers in grid
computing. When the user node requires additional resources, the request goes
through the middleware and is delivered to other nodes on the grid computing system.
Provider node
A provider node is a computer that shares its resources for grid computing. When
provider machines receive resource requests, they perform subtasks for the user
nodes, such as forecasting stock prices for different markets. At the end of the
process, the middleware collects and compiles all the results to obtain a global
forecast. In grid computing, nodes can often switch between the roles of user
and provider.
Control node
A control node administers the network and manages the allocation of the grid
computing resources. The middleware runs on the control node. When the user node
requests a resource, the middleware checks for available resources and assigns the
task to a specific provider node.
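To tie the three roles together, here is a toy, single-process Python sketch of how a control node's middleware might assign subtasks to provider nodes and compile the results for a user node. The class and method names are illustrative assumptions, not a real grid middleware API.

```python
# Toy simulation of grid scheduling: a control node's middleware takes a
# user node's request, splits it into subtasks, hands them to provider
# nodes, and compiles the results. All names here are illustrative.
from itertools import cycle

class ProviderNode:
    def __init__(self, name: str):
        self.name = name

    def run(self, subtask: str) -> str:
        # A provider performs one subtask, e.g. one market's forecast.
        return f"{self.name} finished {subtask}"

class ControlNode:
    """Plays the middleware role: allocates subtasks to providers."""
    def __init__(self, providers):
        self.providers = cycle(providers)  # simple round-robin allocation

    def handle_request(self, subtasks):
        # Assign each subtask to the next provider and collect results,
        # as the middleware would before replying to the user node.
        return [next(self.providers).run(task) for task in subtasks]

grid = ControlNode([ProviderNode("provider-1"), ProviderNode("provider-2")])
for result in grid.handle_request(["forecast-US", "forecast-EU", "forecast-APAC"]):
    print(result)
```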
Computational grid
A computational grid pools high-performance machines, letting users draw on
their combined computing power for large tasks.
Scavenging grid
While similar to computational grids, CPU scavenging grids have many regular
computers. The term scavenging describes the process of searching for available
computing resources in a network of regular computers. While other network users
access the computers for non-grid–related tasks, the grid software uses these nodes
when they are free. The scavenging grid is also known as CPU scavenging or cycle
scavenging.
Data grid
A data grid is a grid computing network that connects multiple computers to provide
large data storage capacity. You can access the stored data as if it were on your local
machine, without having to worry about the physical location of your data on the grid.
The amount of data generated and collected today is growing exponentially. It’s not
only more varied, but also wildly disparate. Data can now reside across on-premises
databases and distributed cloud applications and services, making it difficult to
integrate using traditional approaches. In addition, real-time data processing is
becoming essential to business success—delays and lags in data delivery to mission-
critical applications could have catastrophic consequences.
As cloud adoption accelerates and the way we use data continues to evolve, legacy
databases face significant challenges.
While the benefits of cloud databases can help organizations address many modern
obstacles that impede growth and digital transformation, there are some common
considerations of cloud databases to keep in mind as you plan your migration to the
cloud.
• Vendor lock-in
• Difficulty integrating data with other systems
• Complex and lengthy migrations
• Underestimating cloud costs
• Possibility of connection downtime
• Cloud security concerns
Below are the most popular, industry-leading programming languages that support
cloud infrastructure development.
Java
Java is an all-in-one developer toolset for building websites, desktop applications,
Android and iOS apps, and games. The language offers a resource-rich library to
support all programming tasks.
Java is the standard choice among cloud infrastructure developers for large-scale,
enterprise-grade applications.
Java offers robust security features, a large developer community, and excellent
compatibility with cloud platforms such as AWS, Azure, and Google Cloud, making it a
preferred choice for those looking to develop websites and deploy scalable
applications seamlessly. Beyond its established role in cloud computing, Java's
versatility extends to other domains, such as web scraping.
Python
Python has emerged as one of the leading languages for cloud computing due to its
ease of use, performance, open-source development, third-party integrations, and
popularity among developers.
Upskilling yourself with Python and its libraries can significantly increase your chances
of landing well-paid jobs and joining the community of cloud computing professionals.
Supported by AWS Lambda, Python is used for serverless computing in AWS Cloud. It
offers dedicated libraries to automate cloud-based workflows, perform data analysis,
and build cloud-native apps.
It includes:
• Boto3 SDK – This AWS SDK (Software Development Kit) for Python allows
developers to access various AWS services via a simple API.
• Apache Libcloud – An all-rounder cloud computing library in Python that offers a
unified API to interact with different cloud vendors, including AWS, Microsoft
Azure, and Google Cloud.
• OpenStack SDK – A complete user-oriented SDK package, including all
OpenStack Python libraries, to automate cloud-based workflows such as
creating virtual machines and managing network configurations.
• Pycloud – A pipeline for cloud computing to implement complex data analytics
on the cloud with the pCloud API.
• Google Cloud Client Library – A Python library to access Google Cloud services,
including Google Cloud Storage, Google Cloud Datastore, and Google Cloud
Pub/Sub.
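As a small, hedged example of the Boto3 SDK from the list above, the snippet below lists the account's S3 buckets and uploads a file. It assumes AWS credentials are already configured (environment variables, shared config files, or an IAM role); the bucket and file names are placeholders.

```python
import boto3

# Create an S3 client; credentials are resolved from the environment,
# shared config files, or an attached IAM role.
s3 = boto3.client("s3")

# List the buckets owned by the account.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# Upload a local file to a bucket (names here are placeholders).
s3.upload_file("report.csv", "my-example-bucket", "reports/report.csv")
```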
.NET
ASP.NET, built on Microsoft's .NET platform, is widely used for web development and
cloud-native applications. The framework is known for its wide-scale adoption,
thanks to its straightforward development of dynamic web pages.
A large community of .NET developers and plentiful resource material make the
onboarding and development journey easier for newcomers and experts alike.
Go
Go is a fast, simple programming language with an easy-to-adopt syntax and
cross-platform compatibility. Moreover, Go offers a unique combination of robust
C/C++-like performance, Python's simplicity, and Java's efficient concurrency
handling.
JavaScript
JavaScript, along with HTML and CSS, was instrumental in the development of the
internet. It has matured into a high-level, multi-paradigm language, driving front-end
development for web and Node.js development for cloud-native applications. Its
evolution reflects a broader trend towards using versatile, scalable languages in cloud
computing, underscoring the importance of JavaScript and Node.js in modern
development stacks.
It provides dynamic interactivity for web pages, including alerts, events, notifications,
and pop-ups. It is also well suited to serverless computing, as it allows developers to
easily trigger and respond to events, such as changes in data or user interactions.
All major cloud platforms, including AWS Lambda and Google Cloud Functions,
support JavaScript.
Ruby on Rails
Ruby on Rails is a web development framework known for producing a clean and
streamlined codebase, making it easier to implement new features.
The framework is ideal for developing complex SaaS and marketplace platforms;
Shopify, GitHub, and Zendesk all use Ruby to build SaaS products.
• Easy for newcomers to learn and implement.
• Open source, with extensive, easily accessible libraries from the Ruby on Rails
developer community.
• Supports multi-threading to facilitate fast processing.
Runtime Support
Google provides support for a runtime during its general availability (GA) period.
During this support window:
• Runtime components are regularly updated with security and bug fixes. Updates
are applied in accordance with your function's security update policy.
• To maintain stability, Cloud Functions avoids introducing breaking features or
changes into the runtime. Any breaking changes are announced in advance in
the Cloud Functions release notes.
Parallel computing also helps with faster application processing and task resolution
by increasing the computation power available to a system. Most supercomputers
operate on parallel computing principles, and parallel processing is commonly used
in operational scenarios that demand massive processing power or computation.
There are many reasons to use parallel computing, such as saving time and money,
providing concurrency, and solving larger problems. Furthermore, parallel computing
reduces complexity. As a real-life example, picture two queues for tickets: if two
cashiers are serving two people simultaneously, time is saved and the wait is simpler
for everyone. A code sketch of the same idea follows below.
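The two-cashier analogy maps directly onto code. Below is a minimal Python sketch using the standard multiprocessing module: one large task is split into subtasks that four worker processes execute concurrently. The square() function is just a stand-in for real work.

```python
from multiprocessing import Pool

def square(n: int) -> int:
    # A stand-in for one subtask of a larger computation.
    return n * n

if __name__ == "__main__":
    numbers = range(1_000_000)
    # Four worker processes act like four cashiers serving the queue at once.
    with Pool(processes=4) as pool:
        results = pool.map(square, numbers)
    print(sum(results))
```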
From the open-source and proprietary parallel computing vendors, there are generally
three types of parallel computing available: bit-level parallelism, instruction-level
parallelism, and task parallelism.
Applications of parallel computing
o One of the primary applications of parallel computing is databases and data
mining.
o Real-time simulation of systems is another use of parallel computing.
o Technologies such as networked video and multimedia.
o Science and engineering.
o Collaborative work environments.
o Augmented reality, advanced graphics, and virtual reality all use the concept of
parallel computing.
Advantages of parallel computing
o In parallel computing, more resources are used to complete a task, which
shortens completion time and can cut costs. Cheap components can also be
used to construct parallel clusters.
o Compared with serial computing, parallel computing can solve larger problems
in a shorter time.
o For simulating, modeling, and understanding complex, real-world phenomena,
parallel computing is much more appropriate than serial computing.
o When local resources are finite, parallel computing can offer the benefit of
drawing on non-local resources.
o Many problems are so large that it is impractical or impossible to solve them on
a single computer; parallel computing removes these limits.
o One of the best advantages of parallel computing is that it allows you to do
several things at once using multiple computing resources.
o Furthermore, parallel computing makes better use of the hardware, whereas
serial computing wastes potential computing power.
What is MapReduce?
With MapReduce, rather than sending data to where the application or logic
resides, the logic is executed on the server where the data already resides, to
expedite processing. Data access and storage are disk-based: the input is usually
stored as files containing structured, semi-structured, or unstructured data, and the
output is also stored in files.
MapReduce was once the only method through which the data stored in the HDFS
could be retrieved, but that is no longer the case. Today, there are other query-based
systems such as Hive and Pig that are used to retrieve data from the HDFS using SQL-
like statements. However, these usually run along with jobs that are written using the
MapReduce model. That's because MapReduce has unique advantages.
At the crux of MapReduce are two functions: Map and Reduce. They are sequenced one
after the other.
• The Map function takes input from the disk as <key,value> pairs, processes
them, and produces another set of intermediate <key,value> pairs as
output.
• The Reduce function also takes inputs as <key,value> pairs, and produces
<key,value> pairs as output.
The types of keys and values differ based on the use case. All inputs and outputs are
stored in the HDFS. While the map is a mandatory step to filter and sort the initial data,
the reduce function is optional.
Mappers and Reducers are the Hadoop servers that run the Map and Reduce functions
respectively. It doesn’t matter if these are the same or different servers.
Map
The input data is first split into smaller blocks. Each block is then assigned to a mapper
for processing.
For example, if a file has 100 records to be processed, 100 mappers can run together to
process one record each. Or maybe 50 mappers can run together to process two
records each. The Hadoop framework decides how many mappers to use, based on the
size of the data to be processed and the memory block available on each mapper
server.
Reduce
After all the mappers complete processing, the framework shuffles and sorts the
results before passing them on to the reducers. A reducer cannot start while a mapper
is still in progress. All the map output values that have the same key are assigned to a
single reducer, which then aggregates the values for that key.
Combine
A combiner is an optional, local reducer that runs on each mapper server and
condenses that mapper's output before sending it downstream. This makes shuffling
and sorting easier, as there is less data to work with. Often, the combiner class is set
to the reducer class itself, which works when the reduce function is cumulative and
associative. However, if needed, the combiner can be a separate class as well.
Partition
Partitioning is the process that translates the <key, value> pairs coming from the
mappers (or combiners) to another set of <key, value> pairs to feed into the reducer. It
decides how the data has to be presented to the reducer and also assigns it to a
particular reducer.
The default partitioner determines the hash value for the key produced by the
mapper and assigns a partition based on this hash value. There are as many partitions
as there are reducers, so once the partitioning is complete, the data from each
partition is sent to a specific reducer. A minimal sketch of this rule appears below.
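In code, the default partitioning rule reduces to a hash modulo the number of reducers. The Python sketch below mirrors Hadoop's default hash-based partitioner for illustration only.

```python
def default_partition(key: str, num_reducers: int) -> int:
    """Mirror of the default hash partitioner: hash the key and take it
    modulo the number of reducers, so a given key always reaches the
    same reducer within a job run."""
    # Note: Python salts string hashes per process; Hadoop's HashPartitioner
    # uses the key's hashCode() instead, which is stable across runs.
    return hash(key) % num_reducers

# With three reducers, all <"Exception A", 1> pairs land on one reducer.
print(default_partition("Exception A", 3))
```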
A MapReduce Example
Consider an ecommerce system that receives a million requests every day to process
payments. There may be several exceptions thrown during these requests such as
"payment declined by a payment gateway," "out of inventory," and "invalid address." A
developer wants to analyze the last four days' logs to understand how many times
each exception was thrown.
The objective is to isolate use cases that are most prone to errors, and to take
appropriate action. For example, if the same payment gateway is frequently throwing
an exception, is it because of an unreliable service or a badly written interface? If the
"out of inventory" exception is thrown often, does it mean the inventory calculation
service has to be improved, or does the inventory stocks need to be increased for
certain products?
The developer can ask relevant questions and determine the right course of action. To
perform this analysis on logs that are bulky, with millions of records, MapReduce is an
apt programming model. Multiple mappers can process these logs simultaneously:
one mapper could process a day's log or a subset of it based on the log size and the
memory block available for processing in the mapper server.
Map
For simplification, let's assume that the Hadoop framework runs just four mappers:
Mapper 1, Mapper 2, Mapper 3, and Mapper 4.
The value input to a mapper is one record of the log file. The key could be a text string
such as "file name + line number." The mapper then processes each record of the log
file to produce key-value pairs; here, we simply use '1' as a filler value. The
output from the mappers looks like this:
Mapper 1 -> <Exception A, 1>, <Exception B, 1>, <Exception A, 1>, <Exception C, 1>,
<Exception A, 1>
Mapper 2 -> <Exception B, 1>, <Exception B, 1>, <Exception A, 1>, <Exception A, 1>
Mapper 3 -> <Exception A, 1>, <Exception C, 1>, <Exception A, 1>, <Exception B, 1>,
<Exception A, 1>
Mapper 4 -> <Exception B, 1>, <Exception C, 1>, <Exception C, 1>, <Exception A, 1>
Combine
A combiner runs on each mapper's output, totaling the local counts before the data
leaves the mapper:
Mapper 1 -> <Exception A, 3>, <Exception B, 1>, <Exception C, 1>
Mapper 2 -> <Exception A, 2>, <Exception B, 2>
Mapper 3 -> <Exception A, 3>, <Exception B, 1>, <Exception C, 1>
Mapper 4 -> <Exception A, 1>, <Exception B, 1>, <Exception C, 2>
Partition
After this, the partitioner allocates the data from the combiners to the reducers, and
the data is also sorted for the reducers.
If there were no combiners involved, the input to the reducers would be as below:
Reducer 1: <Exception A> {1,1,1,1,1,1,1,1,1}
Reducer 2: <Exception B> {1,1,1,1,1}
Reducer 3: <Exception C> {1,1,1,1}
This example is a simple one, but when there are terabytes of data involved, the
combiner's improvement to the bandwidth is significant.
Reduce
Now, each reducer just calculates the total count of its exception:
Reducer 1: <Exception A, 9>
Reducer 2: <Exception B, 5>
Reducer 3: <Exception C, 4>
The data shows that Exception A is thrown more often than the others and requires
more attention. When there are weeks' or even months' worth of data to be processed
together, the potential of the MapReduce program can be truly exploited. A compact
simulation of this whole pipeline follows below.
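The exception-count pipeline above can be imitated in a few lines of Python. This is a single-process sketch of the programming model, not Hadoop itself; in a real job, the map, shuffle, and reduce phases run distributed across mapper and reducer servers, and the log records here are placeholders.

```python
from collections import defaultdict

# Map: emit <exception, 1> for each log record (records are placeholders).
def map_phase(records):
    return [(record, 1) for record in records]

# Shuffle/sort: group all values by key, as the framework does between phases.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: total the counts for each exception.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

logs = ["Exception A", "Exception B", "Exception A", "Exception C",
        "Exception A", "Exception B", "Exception B", "Exception A",
        "Exception A"]
print(reduce_phase(shuffle(map_phase(logs))))
# -> {'Exception A': 5, 'Exception B': 3, 'Exception C': 1}
```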
Advantages of MapReduce
1. Scalability
2. Flexibility
3. Security and authentication
4. Faster processing of data
5. Very simple programming model
6. Availability and resilient nature
Hadoop
Hadoop is an open source framework based on Java that manages the storage and
processing of large amounts of data for applications. Hadoop uses distributed storage
and parallel processing to handle big data and analytics jobs, breaking workloads
down into smaller workloads that can be run at the same time.
Four modules comprise the primary Hadoop framework and work collectively to form
the Hadoop ecosystem:
Hadoop Distributed File System (HDFS): As the primary component of the Hadoop
ecosystem, HDFS is a distributed file system in which individual Hadoop nodes
operate on data that resides in their local storage. This removes network latency,
providing high-throughput access to application data. In addition, administrators don’t
need to define schemas up front.
Yet Another Resource Negotiator (YARN): YARN is a resource-management platform
responsible for managing compute resources in clusters and using them to schedule
users' applications. It performs scheduling and resource allocation across the Hadoop
system.
MapReduce: MapReduce is the programming model, described above, that lets
applications process data in parallel: map tasks convert input data into intermediate
key/value pairs, and reduce tasks aggregate that output.
Hadoop Common: Hadoop Common includes the libraries and utilities used and
shared by the other Hadoop modules.
Beyond HDFS, YARN, and MapReduce, the entire Hadoop open source ecosystem
continues to grow and includes many tools and applications to help collect, store,
process, analyze, and manage big data. These include Apache Pig, Apache Hive,
Apache HBase, Apache Spark, Presto, and Apache Zeppelin.
How does Hadoop work?
Software clients input data into Hadoop. HDFS handles metadata and the distributed
file system. MapReduce then processes and converts the data. Finally, YARN divides
the jobs across the computing cluster.
All Hadoop modules are designed with a fundamental assumption that hardware
failures of individual machines or racks of machines are common and should be
automatically handled in software by the framework.
What are the benefits of Hadoop?
Scalability
Hadoop is important as one of the primary tools to store and process huge amounts
of data quickly. It does this by using a distributed computing model which enables the
fast processing of data that can be rapidly scaled by adding computing nodes.
Low cost
As an open source framework that can run on commodity hardware and has a large
ecosystem of tools, Hadoop is a low-cost option for the storage and management of
big data.
Flexibility
Hadoop allows for flexibility in data storage, as data does not require preprocessing
before being stored. An organization can store as much data as it likes and decide
how to utilize it later.
Resilience
As a distributed computing model, Hadoop allows for fault tolerance and system
resilience: if one of the hardware nodes fails, jobs are redirected to other nodes. Data
stored on one Hadoop cluster is replicated across other nodes within the system to
guard against hardware or software failure.
What are the challenges of Hadoop?
Complexity
As a file-intensive system, MapReduce can be a difficult tool to utilize for complex jobs,
such as interactive analytical tasks. MapReduce functions also need to be written in
Java and can require a steep learning curve. The MapReduce ecosystem is quite large,
with many components for different functions, which can make it difficult to determine
which tools to use.
Security
Data sensitivity and protection can be issues as Hadoop handles such large datasets.
An ecosystem of tools for authentication, encryption, auditing, and provisioning has
emerged to help developers secure data in Hadoop.
Governance and management
Hadoop does not have many robust tools for data management and governance, nor
for data quality and standardization.
Talent gap
Like many areas of programming, Hadoop has an acknowledged talent gap. Finding
developers with the combined requisite skills in Java to program MapReduce, operating
systems, and hardware can be difficult. In addition, MapReduce has a steep learning
curve, making it hard to get new programmers up to speed on its best practices and
ecosystem.
GFS is a scalable distributed file system developed by Google for its large data-
intensive applications.
GFS was built for handling batch processing on large data sets and is designed for
system-to-system interaction, not user-to-system interaction.
• Scalable: GFS should run reliably on a very large system built from commodity
hardware.
• Fault-tolerant: The design must be sufficiently tolerant of hardware and
software failures to enable application-level services to continue their
operation in the face of any likely combination of failure conditions.
• Large files: Files stored in GFS will be huge. Multi-GB files are common.
• Large sequential and small random reads: The workloads primarily consist of
two kinds of reads: large, streaming reads and small, random reads.
• Sequential writes: The workloads also have many large, sequential writes that
append data to files. Typical operation sizes are similar to those for reads. Once
written, files are seldom modified again.
• Not optimized for small data: Small, random reads and writes do occur and are
supported, but the system is not optimized for such cases.
• Concurrent access: The level of concurrent access will also be high, with large
numbers of concurrent appends being particularly prevalent, often
accompanied by concurrent reads.
• High throughput: GFS should be optimized for high and sustained throughput in
reading the data, and this is prioritized over latency. This is not to say that
latency is unimportant; rather, GFS needs to be optimized for high-performance
reading and appending large volumes of data for the correct operation of the
system.
APIs
GFS does not provide standard POSIX-like APIs; instead, user-level APIs are provided.
In GFS, files are organized hierarchically in directories and identified by their
pathnames. GFS supports the usual file system operations: create, delete, open, close,
read, and write, plus two specialized operations, snapshot (a low-cost copy of a file or
directory tree) and record append (which allows many clients to append to the same
file concurrently).
AWS meaning: The Amazon Web Services (AWS) platform provides more than 200 fully
featured services from data centers located all over the world, and is the world's most
comprehensive cloud platform.
Amazon Web Services is an online platform that provides scalable and cost-effective
cloud computing solutions.
AWS is a broadly adopted cloud platform that offers several on-demand operations,
such as compute power, database storage, and content delivery, to help corporations
scale and grow.
Disadvantages of AWS
1. AWS charges extra for premium support packages covering intensive or
immediate response, so users might need to pay additional money for support.
2. There might be some general cloud computing problems in AWS, especially
when you move to a cloud server, such as backup protection, downtime, and
limited control.
3. From region to region, AWS sets some default limits on resources such as
volumes, images, or snapshots.
4. If there is a sudden change in your hardware system, the application on the
cloud might not offer great performance.
Migration
Migration services use three different sub-services, DMS, SMS, and Snowball, to
transfer data physically from a data center to AWS.
1. DMS also known as Database Migration Service is used to migrate one
database to another.
2. SMS is a Server Migration Service that helps to migrate on-site servers to AWS
within a short period of time.
3. Snowball is a physical appliance used to migrate terabyte-scale data into and
out of the AWS environment.
Applications of AWS
The most common applications of AWS are storage and backup, websites, gaming,
mobile, web, and social media applications. Some of the most crucial applications in
detail are as follows:
1. Storage and Backup
One of the reasons why many businesses use AWS is that it offers multiple types
of storage to choose from and is easily accessible as well. It can be used for storage
and file indexing as well as to run critical business applications.
2. Websites
Businesses can host their websites on the AWS cloud, similar to other web
applications.
3. Gaming
There is a lot of computing power needed to run gaming applications. AWS makes it
easier to provide the best online gaming experience to gamers across the world.
4. Mobile, Web, and Social Applications
A feature that separates AWS from other cloud services is its capability to launch and
scale mobile, e-commerce, and SaaS applications. API-driven code on AWS can
enable companies to build uncompromisingly scalable applications without requiring
any OS and other systems.
5. Big Data Management and Analytics (Application)
• Amazon Elastic MapReduce (EMR) to process large amounts of data via the
Hadoop framework.
• Amazon Kinesis to analyze and process streaming data.
• AWS Glue to handle extract, transform, and load (ETL) jobs.
• Amazon Elasticsearch Service to enable a team to perform log analysis and
tool monitoring with the help of the open source tool Elasticsearch.
• Amazon Athena to query data.
• Amazon QuickSight to visualize data.
6. Internet of Things (IoT)
• AWS IoT service offers a back-end platform to manage IoT devices as well as
data ingestion to database services and AWS storage.
• AWS IoT Button offers limited IoT functionality to hardware.
• AWS Greengrass offers AWS computing for IoT device installation.
What is Amazon S3?
✔ Amazon S3 is an object storage service that offers industry-leading scalability, data
availability, security, and performance.
✔ Store and protect any amount of data for a range of use cases, such as data lakes,
websites, cloud-native applications, backups, archive, machine learning, and
analytics.
✔ Amazon S3 is designed for 99.999999999% (11 9's) of durability, and stores data for
millions of customers all around the world.
Features of Amazon S3
Storage management
Amazon S3 has storage management features that you can use to manage costs, meet
regulatory requirements, reduce latency, and save multiple distinct copies of your data
for compliance requirements.
• S3 Object Lambda – Add your own code to S3 GET, HEAD, and LIST requests to
modify and process data as it is returned to an application. Filter rows,
dynamically resize images, redact confidential data, and much more.
• Event notifications – Trigger workflows that use Amazon Simple Notification
Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), and AWS
Lambda when a change is made to your S3 resources, as in the sketch below.
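As a hedged sketch of the event-notification flow referenced in the list above: an AWS Lambda handler, in Python, that reads the bucket and object key from the S3 event payload. The Records structure follows S3's documented event format; the print action stands in for a real workflow step.

```python
# Sketch of an AWS Lambda handler triggered by an S3 event notification.
# The 'Records' structure follows S3's documented event format; what you
# do with each object is up to your workflow (logging here is illustrative).

def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"Object created: s3://{bucket}/{key}")
    return {"processed": len(event.get("Records", []))}
```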
Amazon S3 also provides logging and monitoring tools that you can use to monitor and
control how your Amazon S3 resources are being used.
Amazon Elastic Block Store (EBS) provides persistent block storage volumes for EC2
instances. If the EC2 instance stops or is terminated, all the data on the attached EBS
volume remains.
What are AWS EBS Snapshots?
EBS snapshots are incremental, point-in-time backups of an EBS volume: each
successive snapshot copies only the blocks of data that have changed since the last
snapshot. When you delete a snapshot, only the data unique to that snapshot is
removed.
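A hedged Boto3 sketch of creating an EBS snapshot follows; the volume ID is a placeholder, and AWS credentials are assumed to be configured.

```python
import boto3

ec2 = boto3.client("ec2")

# Create an incremental, point-in-time snapshot of an EBS volume.
# 'vol-0123456789abcdef0' is a placeholder volume ID.
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="Nightly backup",
)
print(snapshot["SnapshotId"])
```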
Azure is a cloud computing platform and an online portal that allows you to access and
manage cloud services and resources provided by Microsoft. These services and
resources include storing your data and transforming it, depending on your
requirements. To get access to these resources and services, all you need to have is an
active internet connection and the ability to connect to the Azure portal.
What are the Various Azure Services and How does Azure Work?
Azure provides more than 200 services, divided into 18 categories. These
categories include computing, networking, storage, IoT, migration, mobile, analytics,
containers, artificial intelligence and other machine learning, integration,
management tools, developer tools, security, databases, DevOps, media, identity, and
web services. Let's take a look at some of the major Azure services by category:
Compute Services
• Virtual Machine
This service enables you to create a virtual machine in Windows, Linux or any
other configuration in seconds.
• Cloud Service
This service lets you create scalable applications within the cloud. Once the
application is deployed, everything, including provisioning, load balancing,
and health monitoring, is taken care of by Azure.
• Functions
With Functions, you can create applications in any programming language. The
best part about this service is that you need not worry about hardware
requirements while developing applications, because Azure takes care of that.
All you need to do is provide the code.