

BIG DATA ANALYSIS (IT 562)

ASSIGNMENT 2
SECTION II

Name : Melesse Bisema Dessie

ID: MTR/226/12

Submitted to: Dr. Vasu


1. A) Describe Map Reduce framework in detail. Draw the architectural diagram for
physical organization of compute nodes
Map Reduce is a programming framework that allows us to perform distributed and parallel
processing on large data sets in a distributed environment.

 Map Reduce consists of two distinct tasks – Map and Reduce.


 As the name Map Reduce suggests, the reducer phase takes place after the mapper phase
has been completed.
 So, the first is the map job, where a block of data is read and processed to produce key-
value pairs as intermediate outputs.
 The output of a Mapper or map job (key-value pairs) is input to the Reducer.
 The reducer receives the key-value pair from multiple map jobs.
 Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs)
into a smaller set of tuples or key-value pairs, which is the final output; a minimal word-count sketch of this flow is shown right after this list.
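
As an illustration of this map-then-reduce flow, the following is a minimal word-count sketch in the standard Hadoop Java API (the classic introductory example; the class names WordCount, TokenizerMapper and IntSumReducer are illustrative choices, not taken from the assignment): the Mapper emits a (word, 1) pair for every word it sees, and the Reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: read one line of text, emit (word, 1) for every word in it
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);       // intermediate key-value pair
            }
        }
    }

    // Reduce: sum all the counts emitted for the same word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));   // final key-value pair
        }
    }
}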

Let us understand more about Map Reduce and its components. Map Reduce mainly has the
following three classes. They are:

Mapper Class

The first stage in data processing using Map Reduce is the Mapper Class. Here, the Record Reader
processes each input record and generates the respective key-value pair. Hadoop's Mapper stores
this intermediate data on the local disk.

 Input Split

It is the logical representation of data. It represents a block of work that contains a single map
task in the Map Reduce Program.

 Record Reader

It interacts with the Input Split and converts the obtained data into key-value pairs.

Reducer Class

The intermediate output generated by the mapper is fed to the reducer, which processes it and
generates the final output; this output is then saved in HDFS.

Driver Class 

The major component in a Map Reduce job is the Driver Class. It is responsible for setting up a
Map Reduce job to run in Hadoop. In it we specify the names of the Mapper and Reducer classes,
along with the data types and the respective job name.
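
For example, a driver for the word-count job sketched earlier could look roughly like the following (the job name and the input/output paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");            // job name
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);      // Mapper class
        job.setReducerClass(WordCount.IntSumReducer.class);       // Reducer class
        job.setOutputKeyClass(Text.class);                        // output key type
        job.setOutputValueClass(IntWritable.class);               // output value type
        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}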

Advantages of Map Reduce

The two biggest advantages of Map Reduce are:

      1. Parallel Processing:

In Map Reduce, we divide the job among multiple nodes, and each node works on a part of the
job simultaneously. So Map Reduce is based on the Divide and Conquer paradigm, which helps
us process the data using different machines. As the data is processed by multiple machines in
parallel instead of by a single machine, the time taken to process the data is reduced tremendously.

2. Data Locality: 

Instead of moving data to the processing unit, in the Map Reduce framework we move the
processing unit to the data. In the traditional system, we used to bring the data to the processing
unit and process it there. But as the data grew and became very large, bringing this huge amount
of data to the processing unit posed the following issues:

 Moving huge amounts of data to the processing unit is costly and deteriorates network performance.
 Processing takes time, as the data is processed by a single unit, which becomes the bottleneck.
 The master node can get over-burdened and may fail.
[Figure: architectural diagrams for the physical organization of compute nodes]

B) What is the role of a JobTracker in Hadoop?


The JobTracker is the service within Hadoop that farms out Map Reduce tasks to specific nodes
in the cluster, ideally the nodes that hold the data, or at least nodes in the same rack. Client
applications submit Map Reduce jobs to the JobTracker, which talks to the NameNode to
determine the location of the data and then chooses the best TaskTracker nodes to execute the
tasks, based on data locality (proximity of the data) and the available slots for executing a task
on a given node.

In Hadoop 1, Map Reduce processing is handled by the JobTracker and TaskTracker daemons.
The JobTracker maintains a view of all available processing resources in the Hadoop cluster and,
as application requests come in, schedules and deploys them to the TaskTracker nodes for
execution. While applications are running, the JobTracker receives status updates from the
TaskTracker nodes to track their progress and, if necessary, coordinates the handling of any
failures. Because the JobTracker coordinates the execution of all Map Reduce applications in the
cluster, it must run on a master node and is therefore a mission-critical service.
2. A) What are the parameters of mappers and reducers?
The four basic parameters of a mapper (taking the classic word-count job as the example) are
I. LongWritable,
II. Text,
III. Text and
IV. IntWritable.
The first two represent the input key and value types and the last two represent the intermediate
output key and value types. Similarly, the four basic parameters of a reducer in this case are
Text, IntWritable, Text and IntWritable: the first two are the intermediate input key and value
types (matching the mapper's output) and the last two are the final output key and value types
(see the sketch below).
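
In code, these parameters appear as the generic type arguments of the Mapper and Reducer classes. The following fragment is an illustrative sketch using the word-count types (the class names MyMapper and MyReducer are hypothetical):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> with word-count types:
//   LongWritable : byte offset of the line within the input split (input key)
//   Text         : the line itself (input value)
//   Text         : a word (intermediate output key)
//   IntWritable  : the count 1 (intermediate output value)
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // map() omitted; this class only illustrates the type parameters
}

// Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>; its input types must match the
// mapper's output types:
//   Text, IntWritable : intermediate key-value pairs coming from the mappers
//   Text, IntWritable : final output key-value pairs written to HDFS
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // reduce() omitted; this class only illustrates the type parameters
}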
B) What are the core methods of a Reducer?
The three core methods of a Reducer are listed below; a combined code sketch follows the list.

I. setup(): called once at the start of the reduce task; used for configuring various parameters
such as the input data size or the distributed cache.
protected void setup(Context context)
II. reduce(): the heart of the reducer; called once per key, with all the values associated with
that key.
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
III. cleanup(): called only once, at the end of the task, e.g. to clean up temporary files.
protected void cleanup(Context context)
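
A minimal sketch of a reducer that overrides all three methods might look like the following (the class name SumReducer and the word-count style Text/IntWritable types are illustrative assumptions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) {
        // Called once before any reduce() call: read configuration values,
        // open side files from the distributed cache, etc.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key, with all the values that share that key
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        // Called once after the last reduce() call: release resources,
        // delete temporary files, etc.
    }
}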

3. Consider the following training dataset that describes the weather conditions for
playing a game of golf. Given the weather conditions, each tuple classifies the
conditions as fit (“Yes” = P) or unfit (“No” = N) for playing golf.
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N

Let us test it on a new set of features (let us call it today):
Today = (sunny, hot, normal, false)
Find the following probabilities of playing golf
a) P(YES | today)
b) P(NO|today)

Compute a score for each class from the probability distribution in the training data: multiply the
conditional probability of each attribute value given the class (treating all attributes as equally
important and independent) by the class prior, and compare the two scores.

From the 14 training tuples, 9 belong to class P (Yes) and 5 to class N (No), so P(Yes) = 9/14 and
P(No) = 5/14. For today = (sunny, hot, normal, false):

a) P(YES | today)
P(Yes|today) ∝ P(sunny|Yes) * P(hot|Yes) * P(normal|Yes) * P(false|Yes) * P(Yes)
= (2/9) * (2/9) * (6/9) * (6/9) * (9/14) ≈ 0.0141

b) P(NO|today)
P(No|today) ∝ P(sunny|No) * P(hot|No) * P(normal|No) * P(false|No) * P(No)
= (3/5) * (2/5) * (1/5) * (2/5) * (5/14) ≈ 0.0069

Since 0.0141 > 0.0069, today is classified as fit for playing golf (Yes).
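
The same calculation can be checked with a small, self-contained Java sketch (the class name GolfNaiveBayes is hypothetical; the fractions are simply the counts read off the training table above):

public class GolfNaiveBayes {
    public static void main(String[] args) {
        // Class priors: 9 of the 14 training tuples are P (Yes), 5 are N (No)
        double pYes = 9.0 / 14, pNo = 5.0 / 14;

        // Conditional probabilities counted from the table for
        // today = (sunny, hot, normal, false)
        double yesScore = (2.0 / 9) * (2.0 / 9) * (6.0 / 9) * (6.0 / 9) * pYes;
        double noScore  = (3.0 / 5) * (2.0 / 5) * (1.0 / 5) * (2.0 / 5) * pNo;

        System.out.printf("P(Yes|today) ~ %.4f%n", yesScore);   // ~0.0141
        System.out.printf("P(No|today)  ~ %.4f%n", noScore);    // ~0.0069
        System.out.println(yesScore > noScore
                ? "Classified as Yes (fit for golf)"
                : "Classified as No (unfit for golf)");
    }
}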

4. How do data mining and big data analytics differ in terms of their application?
Data Mining and Big Data are two different things. While both relate to the use of large datasets,
they refer to different kinds of operations. Big Data refers to a collection of large datasets (e.g.,
datasets in Excel sheets that are too large to be handled easily). Data Mining, on the other hand,
refers to the activity of going through a large chunk of data to look for relevant or pertinent
information.

 Big Data refers to huge amounts of data that are not easy to handle in conventional ways;
the data may be structured, semi-structured or unstructured. It is characterized by the 5 Vs
(volume, velocity, variety, veracity and value).

 Data Mining is important for various reasons; the most vital and useful of them is to
understand what is relevant and make good use of it to assess things as new data comes in.
This branches into various use cases in areas such as the healthcare industry and financial
market analysis.
Having understood both concepts fairly well, we can say they are two very different concepts.
The main idea in Data Mining is to dig into the data and analyse the patterns and relationships,
which can further be useful in prediction algorithms such as Linear Regression in Artificial
Intelligence. The main concerns in Big Data, on the other hand, are the velocity, sources and
security of the huge amount of data at our disposal.

It can be said that Data Mining is not dependent on Big Data, as it can be done on any amount of
data, big or small (preferably big, as that gives more test cases and hence more accurate results).
Big Data, on the other hand, is very much dependent on data mining, because we need to find a
use for the big volume of data we have; it is of no use without analysis.

5. A) Explain about the partitioning, shuffle and sort phases?


 In shuffle phase the framework fetches the relevant partition of the output of all the
mappers, via HTTP.
 In sort phase the framework groups Reducer inputs by keys from different map outputs.
 The shuffle and sort phases occur simultaneously; while map-outputs are being fetched
they are merged.

The tasks done internally by the Hadoop framework within the shuffle phase are as follows:

I. Data from the mappers is partitioned as per the number of reducers (see the Partitioner
sketch at the end of this answer).
II. Data is also sorted by keys within a partition.
III. Output from the maps is written to disk as many temporary files.
IV. Once the map task is finished, all the files written to disk are merged to create a single
file.
V. Data from a particular partition (from all mappers) is transferred to the reducer that is
supposed to process that particular partition.
VI. If the data transferred to a reducer exceeds the memory limit, it is copied to disk.
VII. Once the reducer has got its portion of data from all the mappers, the data is merged again,
while still maintaining the sort order of keys, to create the reduce task input.
When the buffer reaches a certain threshold, map output data is merged and written to disk.
This merging of map outputs is known as the sort phase. During this phase the framework groups
reducer inputs by keys, since different mappers may have produced the same key as output.

The threshold for triggering the merge to disk is configured using the following parameter.

mapreduce.reduce.merge.inmem.threshold - the number of sorted map outputs fetched
into memory before being merged to disk. In practice, this is usually set very high (1000) or
disabled (0), since merging in-memory segments is often less expensive than merging from disk.

The merged file, which is the combination of the data written to disk and the data still kept in
memory, constitutes the input for the reduce task.
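
The partitioning step described in item I above is performed by a Partitioner. The following sketch mirrors what Hadoop's default hash partitioner does (the class name WordPartitioner and the Text/IntWritable types are illustrative choices): the partition number, and therefore the reducer, is derived from the hash of the key, so every occurrence of a given key is routed to the same reducer.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then take the
        // remainder modulo the number of reduce tasks (= number of partitions)
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}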

B) What are the key differences between Apache Pig and Map Reduce?

Hadoop Map Reduce programs are written in a compiled language (typically Java), whereas Apache Pig
uses a scripting language (Pig Latin). Pig provides a higher level of abstraction, whereas Hadoop Map
Reduce provides a low level of abstraction. Hadoop Map Reduce requires many more lines of code for
the same task when compared to Pig.

C) In which language can Map Reduce program can be written?


Map Reduce is a programming model for performing distributed and parallel processing. Map
Reduce programs can be written in Java, Ruby, Python, C++, etc. The choice of programming
language depends on the programmer, i.e., how comfortable you are with a particular language.
