DATA MINING AND BUSINESS INTELLIGENCE PRACTICAL-1

Practical 1

To study Big Data Analytics.

Abstract

Big Data has become the buzzword of the storage industry. It is a term applied to a new
generation of software, applications, and system and storage architectures. Advanced tools,
software, and systems are required to capture, store, manage, and analyze these data sets.
The technology requirements for big data differ significantly from those of traditional
database applications. Big data workloads strain traditional storage architectures: the data
sets are unpredictable and grow at an exponential rate, generating zettabytes of data, while
the need to keep data in a centralized repository remains constant. This requires a seamless
architecture that can accommodate new data sets. MapReduce is a programming model and an
associated implementation for processing and generating large data sets. Users specify a map
function and a reduce function, and programs written in this style are automatically
parallelized and executed on a large cluster of commodity machines.

Introduction

Every day, 2.5 quintillion bytes of data are created. This data comes from digital pictures,
videos, posts to social media sites, intelligent sensors, purchase transaction records, and
cell phone GPS signals, to name a few sources. This is Big Data. There is great interest in
both the commercial and the research communities around Big Data.

Big Data is a new label given to a diverse field of data-intensive informatics in which the
datasets are so large that they become hard to work with effectively. The term has mainly
been used in two contexts: first, as a technological challenge when dealing with data-intensive
domains such as high-energy physics, astronomy, or internet search; and second, as a
sociological problem when data about us is collected and mined by companies such as
Facebook, Google, mobile phone companies, retail chains, and governments. Users are becoming
aware of the personally relevant part of Big Data that is publicly available on the social
web. The amount of user-generated media uploaded to the web is expanding rapidly, and it is
beyond the capability of any human to sift through it all to see which media affect our
privacy; analyses of geo-tagged social media on Flickr, Locr, Facebook, and Google+
illustrate the implications and potential of this emerging trend. Big Data usually includes
data sets with sizes beyond the ability of commonly used software tools to capture, curate,
manage, and process the data within a tolerable elapsed time.

Big Data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes
to many petabytes of data in a single data set. The target moves due to constant improvement
in traditional DBMS technology, as well as new databases such as NoSQL and their ability to
handle larger amounts of data. In response to this difficulty, new platforms of "big data"
tools are being developed to handle various aspects of large quantities of data.

MapReduce is a programming model for processing large data sets, such as Big Data, with
a parallel, distributed algorithm on a cluster.

A MapReduce program comprises a Map() procedure that performs filtering and sorting and
a Reduce() procedure that performs a summary operation. The "MapReduce system" (also
called the "infrastructure" or "framework") orchestrates the processing by marshaling the
distributed servers, running the various tasks in parallel, managing all communications and
data transfers between the various parts of the system, providing for redundancy and
failures, and managing the overall process.
The model is inspired by the map and reduce functions commonly used in functional
programming, although their purpose in the MapReduce framework is not the same as their
original forms. Furthermore, the key contributions of the MapReduce framework are not the
actual map and reduce functions, but the scalability and fault tolerance achieved for a
variety of applications by optimizing the execution engine once.
MapReduce is a framework for processing parallelizable problems across huge datasets using
a large number of computers, collectively referred to as a cluster or a grid. Computational
processing can occur on data stored either in a filesystem or in a database. MapReduce can
take advantage of data locality, processing data on or near the storage assets to decrease
the transmission of data.
"Map" step: The master node takes the input, divides it into smaller sub-problems, and
distributes them to worker nodes. A worker node may do this again in turn, leading to a
multi- level tree structure. The worker node processes the smaller problem, and passes the
answer back to its master node.
"Reduce" step: The master node then collects the answers to all the sub-problems and
combines them in some way to form the output – the answer to the problem it was originally
trying to solve.
MapReduce allows for distributed processing of the map and reduction operations. Provided
each mapping operation is independent of the others, all maps can be performed in parallel –
though in practice it is limited by the number of independent data sources and/or the number
of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase,
provided all outputs of the map operation that share the same key are presented to the same
reducer at the same time, or provided the reduction function is associative.
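
As a concrete illustration of these two steps, the following is a minimal word-count sketch
in Python. The function names and the input format are illustrative assumptions, not part of
any particular MapReduce library.

```python
# Minimal word-count sketch of the two user-supplied MapReduce functions.
# The names (map_fn, reduce_fn) and the input format are illustrative.

def map_fn(document_name, text):
    """Map: emit one intermediate (word, 1) pair per word in the document."""
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum all counts that the shuffle grouped under the same word."""
    yield (word, sum(counts))

# Every (word, 1) pair with the same word reaches the same reducer, so
# reduce_fn sees the complete list of counts for that word.
print(list(map_fn("d1", "big data big cluster")))
# [('big', 1), ('data', 1), ('big', 1), ('cluster', 1)]
print(list(reduce_fn("big", [1, 1])))  # [('big', 2)]
```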
While this process can often appear inefficient compared to algorithms that are more
sequential, MapReduce can be applied to significantly larger datasets than "commodity"
servers can handle – a large server farm can use MapReduce to sort a petabyte of data in only
a few hours. The parallelism also offers some possibility of recovering from partial failure of
servers or storage during the operation: if one mapper or reducer fails, the work can be
rescheduled – assuming the input data is still available.

Another way to look at MapReduce is as a 5-step parallel and distributed computation:

1. Prepare the Map() input – the "MapReduce system" designates Map processors,
assigns the K1 input key value each processor would work on, and provides that
processor with all the input data associated with that key value.
2. Run the user-provided Map() code – Map() is run exactly once for each K1 key
value, generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors – the MapReduce system
designates Reduce processors, assigns the K2 key value each processor would work
on, and provides that processor with all the Map-generated data associated with that
key value.
4. Run the user-provided Reduce() code – Reduce() is run exactly once for each K2
key value produced by the Map step.
5. Produce the final output – the MapReduce system collects all the Reduce output,
and sorts it by K2 to produce the final outcome.
Logically these 5 steps can be thought of as running in sequence – each step starts only after
the previous step is completed – though in practice, of course, they can be intertwined, as
long as the final result is not affected.
In many situations the input data might already be distributed ("sharded") among many
different servers, in which case step 1 could sometimes be greatly simplified by assigning
Map servers that would process the locally present input data. Similarly, step 3 could
sometimes be sped up by assigning Reduce processors that are as much as possible local to
the Map-generated data they need to process.
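
To make the five steps concrete, here is a single-process Python sketch that simulates them
in memory. It is an illustrative simplification: a real MapReduce system distributes these
steps across many machines, and the function names here are assumptions.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Simulate the five steps of a MapReduce computation in one process."""
    # Steps 1-2: run Map() exactly once per (k1, v1) input pair.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))
    # Step 3: "shuffle" - group every intermediate value by its k2 key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Step 4: run Reduce() exactly once per k2 key.
    # Step 5: collect the output, sorted by k2.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output

docs = [("d1", "big data"), ("d2", "big cluster")]
print(run_mapreduce(docs,
                    lambda k, v: [(w, 1) for w in v.split()],
                    lambda k, vs: [(k, sum(vs))]))
# [('big', 2), ('cluster', 1), ('data', 1)]
```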
Logical view
The Map and Reduce functions of MapReduce are both defined with respect to data
structured in (key, value) pairs. Map takes one pair of data with a type in one data domain,
and returns a list of pairs in a different domain:
Map(k1, v1) → list(k2, v2)

The Map function is applied in parallel to every pair in the input dataset. This produces a list
of pairs for each call. After that, the MapReduce framework collects all pairs with the same
key from all lists and groups them together, creating one group for each key.
The Reduce function is then applied in parallel to each group, which in turn produces a
collection of values in the same domain:
Reduce(k2, list(v2)) → list(v3)
Each Reduce call typically produces either one value v3 or an empty return, though one call
is allowed to return more than one value. The returns of all calls are collected as the desired
result list.
Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values.
This behavior is different from the typical functional programming map and reduce
combination, which accepts a list of arbitrary values and returns one single value that
combines all the values returned by map.
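
Restated as Python type aliases (Python 3.9+; the alias names are illustrative), the two
signatures from the logical view look like this:

```python
from typing import Callable, TypeVar

K1 = TypeVar("K1"); V1 = TypeVar("V1")   # input key/value domain
K2 = TypeVar("K2"); V2 = TypeVar("V2")   # intermediate key/value domain
V3 = TypeVar("V3")                       # output value domain

# Map(k1, v1) -> list((k2, v2)): one input pair to a list in a new domain.
MapFn = Callable[[K1, V1], list[tuple[K2, V2]]]

# Reduce(k2, list(v2)) -> list(v3): one key group to a list of output values.
ReduceFn = Callable[[K2, list[V2]], list[V3]]
```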
It is necessary but not sufficient to have implementations of the map and reduce abstractions
in order to implement MapReduce. Distributed implementations of MapReduce require a
means of connecting the processes performing the Map and Reduce phases. This may be

a distributed file system. Other options are possible, such as direct streaming from mappers to
reducers, or for the mapping processors to serve up their results to reducers that query them.
As an example, consider a query that asks for the most frequently purchased products among a
given cross-section of customers. The input data for this query would be the profiles of the
individual customers within the specification. After the query is created and sent, the
mapping function would sort through the profiles, then identify and send the most frequently
purchased products to the reducer. The reducer would compare and aggregate the data generated
from each map task and return an output file featuring the top three most frequently
purchased products from the cross-section.
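
A sketch of how such a query could be expressed as map and reduce functions, again in
single-process Python; the profile data, function names, and the in-memory shuffle are
hypothetical illustrations.

```python
from collections import Counter, defaultdict

# Hypothetical input: one (customer_id, purchased_products) pair per profile.
profiles = [
    ("c1", ["milk", "bread", "eggs"]),
    ("c2", ["milk", "eggs"]),
    ("c3", ["bread", "milk"]),
]

def map_purchases(customer_id, products):
    # Map: emit (product, 1) for every purchase in a customer profile.
    return [(product, 1) for product in products]

def reduce_counts(product, ones):
    # Reduce: aggregate the per-map counts for a single product.
    return [(product, sum(ones))]

# Shuffle: group all intermediate values by product.
groups = defaultdict(list)
for customer_id, products in profiles:
    for product, one in map_purchases(customer_id, products):
        groups[product].append(one)

# Reduce each group, then keep the three most frequently purchased products.
counts = dict(pair for product, ones in groups.items()
              for pair in reduce_counts(product, ones))
print(Counter(counts).most_common(3))
# [('milk', 3), ('bread', 2), ('eggs', 2)]
```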

The MapReduce process is key in sorting through the big data that might be available when
submitting a query. The goal is to create the most accurate output in the shortest amount of
time. Figure 1 shows the overall flow of a MapReduce operation in our implementation.
When the user program calls the MapReduce function, the following sequence of actions
occurs (the numbered labels in Figure 1 correspond to the numbers in the list below):

1. The MapReduce library in the user program first splits the input files into M pieces of
typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an
optional parameter). It then starts up many copies of the program on a cluster of machines.

2. One of the copies of the program is special – the master. The rest are workers that are
assigned work by the master. There are M map tasks and R reduce tasks to assign. The master
picks idle workers and assigns each one a map task or a reduce task.

3. A worker who is assigned a map task reads the contents of the corresponding input split. It
parses key/value pairs out of the input data and passes each pair to the user-defined Map
function. The intermediate key/value pairs produced by the Map function are buffered in
memory.

4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by
the partitioning function (a sketch of a typical partitioning function follows this list).
The locations of these buffered pairs on the local disk are passed back to the master, who is
responsible for forwarding these locations to the reduce workers.

5. When a reduce worker is notified by the master about these locations, it uses remote
procedure calls to read the buffered data from the local disks of the map workers. When a
reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all
occurrences of the same key are grouped together. The sorting is needed because typically
many different keys map to the same reduce task. If the amount of intermediate data is too
large to fit in memory, an external sort is used.

6. The reduce worker iterates over the sorted intermediate data and, for each unique
intermediate key encountered, passes the key and the corresponding set of intermediate
values to the user's Reduce function. The output of the Reduce function is appended to a
final output file for this reduce partition.

7. When all map tasks and reduce tasks have been completed, the master wakes up the user
program. At this point, the MapReduce call in the user program returns to the user code.
After successful completion, the output of the MapReduce execution is available in the R
output files (one per reduce task, with file names as specified by the user).

Typically, users do not need to combine these R output files into one file – they often pass
these files as input to another MapReduce call, or use them from another distributed
application that is able to deal with input that is partitioned into multiple files.
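
Step 4's partitioning function is what assigns each intermediate key to one of the R reduce
regions. A common choice, and the default described for MapReduce, is a hash of the key
modulo R; the sketch below is illustrative.

```python
R = 4  # number of reduce tasks/regions, chosen by the user

def partition(key, r=R):
    """Assign an intermediate key to one of the r reduce regions.

    hash(key) % r sends every occurrence of a key to the same region, which
    is what guarantees that a single reduce task sees all values for that
    key. Note that a real system needs a hash that is stable across machines
    (Python's built-in hash of str is salted per process).
    """
    return hash(key) % r

print(partition("big") == partition("big"))  # True: same region both times
```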

Master Data Structures

The master keeps several data structures. For each map task and reduce task, it stores the state
(idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).
The master is the conduit through which the location of intermediate file regions is
propagated from map tasks to reduce tasks. Therefore, for each completed map task, the
master stores the locations and sizes of the R intermediate file regions produced by the map
task. Updates to this location and size information are received as map tasks are completed.
The information is pushed incrementally to workers that have in-progress reduce tasks.
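
As a sketch, the per-task bookkeeping described in this section might look like the
following Python (3.10+) data structures; the class and field names are assumptions, not
taken from the original implementation.

```python
from dataclasses import dataclass, field
from enum import Enum

class TaskState(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class TaskInfo:
    """Master-side state for one map or reduce task."""
    state: TaskState = TaskState.IDLE
    worker: str | None = None  # identity of the worker machine (non-idle tasks)

@dataclass
class MapTaskInfo(TaskInfo):
    # For a completed map task: the locations and sizes of the R intermediate
    # file regions it produced, pushed incrementally to workers that have
    # in-progress reduce tasks.
    region_locations: list[str] = field(default_factory=list)
    region_sizes: list[int] = field(default_factory=list)
```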

Fault Tolerance

Since the MapReduce library is designed to help process very large amounts of data using
hundreds or thousands of machines, the library must tolerate machine failures gracefully.

Worker Failure

The master pings every worker periodically. If no response is received from a worker in a
certain amount of time, the master marks the worker as failed. Any map tasks completed by
the worker are reset back to their initial idle state, and therefore become eligible for
scheduling on other workers. Similarly, any map task or reduce task in progress on a failed
worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are
re-executed on a failure because their output is stored on the local disk(s) of the failed
machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed
since their output is stored in a global file system.

When a map task is executed first by worker A and then later executed by worker B (because
A failed), all workers executing reduce tasks are notified of the re-execution. Any reduce
task that has not already read the data from worker A will read the data from worker B.
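
The failure-handling rules from the last two paragraphs can be sketched as follows; the
classes, field names, and the timeout value are illustrative assumptions.

```python
import time
from dataclasses import dataclass

@dataclass
class Worker:
    last_ping: float          # time of the last successful ping response
    failed: bool = False

@dataclass
class Task:
    kind: str                 # "map" or "reduce"
    state: str = "idle"       # "idle" | "in-progress" | "completed"
    worker: Worker | None = None

PING_TIMEOUT = 30.0  # seconds without a ping response; an illustrative value

def check_workers(workers, tasks, now=None):
    """Mark unresponsive workers as failed and reset their tasks to idle."""
    now = time.time() if now is None else now
    for w in workers:
        if not w.failed and now - w.last_ping > PING_TIMEOUT:
            w.failed = True
            for t in tasks:
                if t.worker is w and (
                    t.state == "in-progress"
                    or (t.state == "completed" and t.kind == "map")
                ):
                    # Completed map output lives on the failed machine's local
                    # disk, so the task must be redone; completed reduce output
                    # is in the global file system and survives.
                    t.state, t.worker = "idle", None
```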

MapReduce is resilient to large-scale worker failures. For example, during one MapReduce
operation, network maintenance on a running cluster was causing groups of 80 machines at a
time to become unreachable for several minutes. The MapReduce master simply re-executed
the work done by the unreachable worker machines, and continued to make forward progress,
eventually completing the MapReduce operation.

Master Failure

It is easy to make the master write periodic checkpoints of the master data structures
described above. If the master task dies, a new copy can be started from the last checkpointed
state. However, given that there is only a single master, its failure is unlikely; therefore our
current implementation aborts the MapReduce computation if the master fails. Clients can
check for this condition and retry the MapReduce operation if they desire.

The MapReduce programming model:

- Distinction between static and variable data
- Configurable, long-running (cacheable) map/reduce tasks
- Pub/sub messaging based communication and data transfers
- Efficient support for iterative MapReduce computations
- Combine phase to collect all reduce outputs

Related Work:

Chronologically, the first paper is the Google File System paper from 2003, which describes
a distributed file system. Basically, files are split into chunks which are stored in a
redundant fashion on a cluster of commodity machines. (Every article about Google has to
include the term "commodity machines"!)

Next up is the MapReduce paper from 2004. MapReduce has become synonymous with Big Data.
Legend has it that Google used it to compute their search indices. I imagine it worked like
this: they have all the crawled web pages sitting on their cluster, and every day or so they
run MapReduce to recompute everything.

Next up is the Bigtable paper from 2006, which has become the inspiration for countless
NoSQL databases like Cassandra, HBase, and others. About half of the architecture of
Cassandra is modeled after Bigtable, including the data model, SSTables, and write-through
logs (the other half being Amazon's Dynamo database, for the peer-to-peer clustering model).

Many systems have provided restricted programming models and used the restrictions to
parallelize the computation automatically. For example, an associative function can be
computed over all prefixes of an N-element array in log N time on N processors using
parallel prefix computations [6, 9]. MapReduce can be considered a simplification and
distillation of some of these models based on our experience with large real-world
computations.

More significantly, we provide a fault-tolerant implementation that scales to thousands of
processors. In contrast, most of the parallel processing systems have only been implemented
on smaller scales and leave the details of handling machine failures to the programmer. Bulk
Synchronous Programming and some MPI primitives provide higher-level abstractions that make
it easier for programmers to write parallel programs.

A key difference between these systems and MapReduce is that MapReduce exploits a restricted
programming model to parallelize the user program automatically and to provide transparent
fault tolerance. Our locality optimization draws its inspiration from techniques such as
active disks, where computation is pushed into processing elements that are close to local
disks, to reduce the amount of data sent across I/O subsystems or the network. We run on
commodity processors to which a small number of disks are directly connected, instead of
running directly on disk controller processors, but the general approach is similar.

Our backup task mechanism is similar to the eager scheduling mechanism employed in the
Charlotte System [3]. One of the shortcomings of simple eager scheduling is that if a given
task causes repeated failures, the entire computation fails to complete. We fix some instances
of this problem with our mechanism for skipping bad records.

The MapReduce implementation relies on an in-house cluster management system that is
responsible for distributing and running user tasks on a large collection of shared
machines. Though not the focus of this paper, the cluster management system is similar in
spirit to other systems such as Condor.

The sorting facility that is a part of the MapReduce library is similar in operation to
NOW-Sort [1]. Source machines (map workers) partition the data to be sorted and send it to
one of R reduce workers. Each reduce worker sorts its data locally (in memory if possible).
Of course, NOW-Sort does not have the user-definable Map and Reduce functions that make our
library widely applicable.

River [2] provides a programming model where processes communicate with each other by
sending data over distributed queues. Like MapReduce, the River system tries to provide
good average case performance even in the presence of non-uniformities introduced by
heterogeneous hardware or system perturbations. River achieves this by careful scheduling
of disk and network transfers to achieve balanced completion times. MapReduce takes a
different approach. By restricting the programming model, the MapReduce framework is able to
partition the problem into a large number of fine-grained tasks. These tasks are dynamically
scheduled on available workers so that faster workers process more tasks. The restricted
programming model also allows us to schedule redundant executions of tasks near the end of
the job, which greatly reduces completion time in the presence of non-uniformities (such as
slow or stuck workers).

Conclusions

The MapReduce programming model has been successfully used at Google for many
different purposes. We attribute this success to several reasons. First, the model is easily used
by programmers without experience with parallel and distributed systems, since it hides the
details of parallelization, fault-tolerance, locality optimization, and load balancing. Second, a
large variety of problems are easily expressible as MapReduce computations.

For example, MapReduce is used for the generation of data for Google's production web search
service, for sorting, for data mining, for machine learning, and for many other systems.
Third, we have developed an implementation of MapReduce that scales to large clusters
comprising thousands of machines. The implementation makes efficient use of these machine
resources and is therefore suitable for use on many of the large computational problems
encountered at Google. We have learned several things from this work. First, restricting the
programming model makes it easy to parallelize and distribute computations and to make such
computations fault-tolerant. Second, network bandwidth is a scarce resource. A number of
optimizations in our system are therefore targeted at reducing the amount of data sent
across the network: the locality optimization allows us to read data from local disks, and
writing a single copy of the intermediate data to local disk saves network bandwidth. Third,
redundant execution can be used to reduce the impact of slow machines, and to handle machine
failures and data loss.
