Big Assignment 2
ASSIGNMENT 2
SECTION II
ID: MTR/226/12
Let us understand more about MapReduce and its components. MapReduce has the
following three major classes:
Mapper Class
The first stage in data processing using MapReduce is the Mapper class. Here, the Record
Reader processes each input record and generates the corresponding key-value pairs. Hadoop
stores this intermediate mapper output on the local disk.
Input Split
It is the logical representation of the data: the unit of work that is processed by a single map
task in the MapReduce program.
Record Reader
It interacts with the input split and converts the data it reads into key-value pairs.
Reducer Class
The intermediate output generated by the mapper is fed to the reducer, which processes it and
generates the final output, which is then saved in HDFS.
Driver Class
The major component of a MapReduce job is the Driver class. It is responsible for setting up a
MapReduce job to run in Hadoop. In it we specify the names of the Mapper and Reducer
classes, along with the data types and the respective job names.
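The flow described above (record reader feeding a mapper, a shuffle grouping by key, a reducer producing final output, all wired together by a driver) can be sketched without Hadoop as a plain in-memory word count. The class and method names below are illustrative only, not the Hadoop API:

```java
import java.util.*;

// A minimal in-memory sketch of the MapReduce flow (no Hadoop dependency):
// map emits (word, 1) pairs, shuffle groups values by key, reduce sums each group.
public class WordCountSketch {
    // Mapper: one input record -> list of intermediate (key, value) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Shuffle: group intermediate values by key (sorted, as Hadoop does)
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reducer: (key, [values]) -> sum of values
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        // "Driver": wires the mapper and reducer together over the input splits
        String[] splits = {"big data big ideas", "data locality matters"};
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String split : splits) intermediate.addAll(map(split));

        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(intermediate).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        System.out.println(result); // {big=2, data=2, ideas=1, locality=1, matters=1}
    }
}
```

In real Hadoop each split would be handled by a separate map task on a separate node; here the loop over splits stands in for that parallelism.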
1. Parallel Processing:
In MapReduce, we divide the job among multiple nodes, and each node works on a part of
the job simultaneously. MapReduce is thus based on the divide-and-conquer paradigm, which
helps us process the data using different machines. As the data is processed by multiple
machines in parallel instead of by a single machine, the time taken to process the data is
reduced tremendously, as shown in the figure below (2).
2. Data Locality:
Instead of moving the data to the processing unit, in the MapReduce framework we move the
processing unit to the data. In the traditional system, we used to bring the data to the
processing unit and process it there. But as the data grew very large, bringing this huge
amount of data to the processing unit posed the following issues:
Moving huge amounts of data to the processing unit is costly and degrades network performance.
Processing takes time, as the data is processed by a single unit, which becomes the bottleneck.
The master node can get overburdened and may fail.
[Figure: architectural diagrams for the physical organization of compute nodes]
I. setup(): this method is used for configuring various parameters, such as the input data size
and the distributed cache.
protected void setup(Context context)
II. reduce(): the heart of the reducer, called once per key with the associated list of values.
protected void reduce(Key key, Iterable<Value> values, Context context)
III. cleanup(): this method is called only once, at the end of the task, to clean up temporary files.
protected void cleanup(Context context)
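This lifecycle can be sketched in plain Java (hypothetical names, not the real org.apache.hadoop.mapreduce.Reducer API): the framework calls setup() once, then reduce() once per key delivered by the shuffle, then cleanup() once at the end of the task:

```java
import java.util.*;

// A plain-Java sketch of the reducer lifecycle: setup() once,
// reduce() once per key, cleanup() once at the end of the task.
public class ReducerLifecycle {
    static List<String> calls = new ArrayList<>();

    static void setup() { calls.add("setup"); }            // configure parameters, caches
    static void reduce(String key, List<Integer> values) { // called once per key
        int sum = 0;
        for (int v : values) sum += v;
        calls.add("reduce:" + key + "=" + sum);
    }
    static void cleanup() { calls.add("cleanup"); }        // delete temporary files

    public static void main(String[] args) {
        // Grouped intermediate data, as delivered by the shuffle phase
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("a", List.of(1, 2));
        grouped.put("b", List.of(3));

        setup();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            reduce(e.getKey(), e.getValue());
        cleanup();

        System.out.println(calls); // [setup, reduce:a=3, reduce:b=3, cleanup]
    }
}
```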
3. Consider the following training dataset that describes the weather conditions for
playing a game of golf. Given the weather conditions, each tuple classifies the
conditions as fit ("Yes" = P) or unfit ("No" = N) for playing golf.
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
Let us test it on a new set of features (let us call it today):
Today = (sunny, hot, normal, false)
Find the following probabilities of playing golf
a) P(YES | today)
b) P(NO|today)
Compute a score for each class from the probability distribution in the training data: the class
prior multiplied by the probability of each attribute value given that class. Treat all attributes
as equally important and conditionally independent, i.e., multiply the probabilities:
a) P(YES | today)
P(yes|today) ∝ P(yes) * P(sunny|yes) * P(hot|yes) * P(normal|yes) * P(false|yes)
= (9/14) * (2/9) * (2/9) * (6/9) * (6/9) = 0.0141
b) P(NO|today)
P(no|today) ∝ (5/14) * (3/5) * (2/5) * (1/5) * (2/5) = 0.0069
Since 0.0141 > 0.0069, the classifier predicts that today is fit for playing golf (Yes).
4. How does data mining and big data analytics differ in terms of their application?
Data mining and big data are two different things: while both relate to the use of large
datasets, the two terms refer to different kinds of operations. Big data refers to collections of
datasets so large that they cannot be handled easily with conventional tools (for example,
datasets far too large for a spreadsheet). Data mining, on the other hand, refers to the activity
of going through a large body of data to look for relevant or pertinent information.
Big data may be structured, semi-structured, or unstructured, and is commonly
characterized by the 5 Vs (volume, velocity, variety, veracity, and value).
Data mining is important for various reasons; the most vital and useful is to understand
what is relevant in the data and make good use of it as new data arrives. This branches
into many use cases in areas such as the healthcare industry and financial market
analysis.
Having understood both concepts fairly well, we can say they are two very different things.
The main idea of data mining is to dig into the data and analyze the patterns and relationships
in it, which can then feed prediction algorithms such as linear regression in artificial
intelligence. The main concerns of big data, on the other hand, are the velocity, sources, and
security of the huge amount of data at our disposal.
It can be said that data mining does not depend on big data, as it can be done on any amount
of data, big or small (preferably big, as more data gives more cases and hence more accurate
results). Big data, on the other hand, depends heavily on data mining: we need to find a use
for the big volume of data we have, and it is of no use without analysis.
The tasks done internally by the Hadoop framework within the shuffle phase are as follows.
The threshold for triggering the merge to disk is configured using the following parameter:
mapreduce.reduce.merge.inmem.threshold - the number of sorted map outputs fetched
into memory before being merged to disk. In practice, this is either set very high (1000) or
disabled (0), since merging in-memory segments is often less expensive than merging from disk.
The merged file, which combines the data already written to disk with the data still held in
memory, constitutes the input to the reduce task.
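For reference, this threshold can be set in mapred-site.xml; a minimal sketch (the property name is from the Hadoop 2.x/3.x configuration, and 1000 is its usual default):

```xml
<configuration>
  <property>
    <name>mapreduce.reduce.merge.inmem.threshold</name>
    <value>1000</value>
  </property>
</configuration>
```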
B) What are the key differences between Apache Pig and Map Reduce?
Hadoop MapReduce jobs are written in Java, a compiled language, whereas Apache Pig uses Pig
Latin, a scripting language. Pig provides a higher level of abstraction, whereas MapReduce
exposes a low-level programming API. A MapReduce job therefore typically requires many more
lines of code than the equivalent Pig script.
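The difference in abstraction is easy to see on a word count: the Java MapReduce version needs full Mapper, Reducer, and Driver classes, while the whole job fits in a few lines of Pig Latin (a sketch; the file paths here are hypothetical):

```
-- Word count in Pig Latin
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'output';
```

Pig compiles this script into MapReduce jobs behind the scenes, which is exactly the higher level of abstraction described above.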