BSc in Information Technology (Data Science): Massive or Big Data Processing, J. Alosius
SLIIT – 2019 (Semester 2)
MapReduce Features
• Automatic parallelization and distribution
• Fault-Tolerance
• Provides a clean abstraction for programmers to use
MapReduce Algorithm
MAP:
• Iterate over a large number of records
• Extract something of interest from each
REDUCE:
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output
Key idea: provide a functional abstraction for these two operations
Input:
• The key is nothing but the byte offset of each line in the text file: LongWritable
• The value is each individual line of the file: Text
Output:
• The key is the tokenized word: Text
• The value is hardcoded in our case, which is 1: IntWritable
• Example – Dear 1, Bear 1, etc.
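The mapper class itself is not shown on the slide; a minimal sketch matching the input/output types above might look like this (class and field names are illustrative, chosen to pair with the Reduce class that follows):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line (LongWritable), value = the line itself (Text)
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);   // emit (word, 1) for every token
        }
    }
}
```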
Reducer Code:
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable x : values) {
            sum += x.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Both the input and the output of the Reducer is a key-value pair.
Input:
The key is nothing but one of the unique words generated after the sorting and shuffling phase: Text
The value is a list of integers corresponding to that key: IntWritable
Example – Bear, [1, 1], etc.
Output:
The key is all the unique words present in the input text file: Text
The value is the number of occurrences of each of the unique words: IntWritable
Example – Bear, 2; Car, 3, etc.
We have aggregated the values present in each of the lists corresponding to each key and produced the final answer.
• In the driver class, we set the configuration of our MapReduce job to run in Hadoop:
• specify the name of the job,
• the data types of input/output of the mapper and reducer,
• the names of the mapper and reducer classes,
• the paths of the input and output folders.
• The method setInputFormatClass() is used for specifying how a Mapper will read the input data, or what the unit of work will be. Here, we have chosen TextInputFormat so that a single line is read by the mapper at a time from the input text file.
• The main() method is the entry point for the driver. In this method, we instantiate a new Configuration object for the job.

Configuration conf = new Configuration();
Job job = new Job(conf, "My Word Count Program");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);

//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

Running the job:

hadoop jar hadoop-mapreduce-example.jar WordCount /sample/input /sample/output
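Wrapped in a complete main() method, the driver might look like this: a sketch that assumes the WordCount, Map, and Reduce classes described above, and adds the job.waitForCompletion() call that actually submits the job and blocks until it finishes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "My Word Count Program");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job and wait; the exit code reflects success or failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```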
HDFS Architecture
Distributed File System
• Don’t move data to workers… move workers to
the data!
• Store data on the local disks of nodes in the
cluster
• Start up the workers on the node that has
the data local
• Why DFS?
• Not enough RAM to hold all the data in
memory
• Disk access is slow, but disk throughput is reasonable
DFS - Features
• Single Namespace for entire cluster
• Data Coherency
• Write-once-read-many access model
• Client can only append to existing files
• Files are broken up into blocks
• Typically 128 MB block size
• Each block replicated on multiple DataNodes
• Intelligent Client
• Client can find location of blocks
• Client accesses data directly from DataNode
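The block arithmetic implied above is simple ceiling division: for example, a 400 MB file with 128 MB blocks occupies 4 blocks (three full blocks plus a 16 MB tail), and with the common replication factor of 3 the cluster stores 12 block replicas. A quick sketch (file and block sizes are illustrative):

```java
public class BlockMath {
    // Number of HDFS blocks needed for a file of the given size (ceiling division).
    static long blockCount(long fileSizeMb, long blockSizeMb) {
        return (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
    }

    public static void main(String[] args) {
        long blocks = blockCount(400, 128);   // 400 MB file, 128 MB blocks
        long replicas = blocks * 3;           // replication factor 3
        System.out.println(blocks + " blocks, " + replicas + " stored replicas");
        // -> 4 blocks, 12 stored replicas
    }
}
```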
NameNode – Metadata
• Meta-data in Memory
• The entire metadata is in main memory
• No demand paging of meta-data
• Types of Metadata
• List of files
• List of Blocks for each file
• List of DataNodes for each block
• File attributes, e.g. creation time, replication factor
• A Transaction Log
• Records file creations, file deletions, etc.
NameNode – Responsibilities
• Managing the file system namespace:
• Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc.
• Coordinating file operations:
• Directs clients to datanodes for reads and writes
• No data is moved through the namenode
• Maintaining overall health:
• Periodic communication with the datanodes
• Block re-replication and rebalancing
• Garbage collection
Datanode
• A Block Server
• Stores data in the local file system
• Stores meta-data of a block
• Serves data and meta-data to Clients
• Block Report
• Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
• Forwards data to other specified DataNodes
Block Placement
• Current Strategy
• One replica on local node
• Second replica on a remote rack
• Third replica on same remote rack
• Additional replicas are randomly placed
• Clients read from nearest replica
• Would like to make this policy pluggable
Data Correctness
• Use Checksums to validate data
• Use CRC32
• File Creation
• Client computes a checksum per 512-byte chunk
• DataNode stores the checksum
• File access
• Client retrieves the data and checksum from DataNode
• If Validation fails, Client tries other replicas
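The write-time/read-time comparison can be sketched with Java's built-in CRC32 (illustrative only; in HDFS the per-chunk checksums are stored by the DataNode in a separate metadata file):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumDemo {
    // CRC32 over one chunk of data, as HDFS does per 512-byte chunk.
    static long checksum(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] chunk = "some block data".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(chunk);      // computed at write time, stored by the DataNode
        long recomputed = checksum(chunk);  // recomputed by the client at read time
        // If the values differ, this replica is corrupt and the client tries another one.
        System.out.println(stored == recomputed ? "valid" : "corrupt - try another replica");
    }
}
```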
NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
• A directory on the local file system
• A directory on a remote file system (NFS/CIFS)
Writing a file:
1. Client consults Name Node
2. Name Node replies with the location of the Data Node
3. Client writes block directly to one Data Node
• Data Nodes replicate the block
• Cycle repeats for next block
4. Data Node replies with acknowledgement
5. Client sends to the Name Node a request to close the file

Reading a file:
1. An application client wishing to read a file must first contact the Name Node to determine where the actual data is stored.
2. In response to the client request, the Name Node returns:
• The relevant block ids
• The locations where the blocks are held
3. The client then contacts the Data Nodes to retrieve data.

• Important features of the design:
• Data is never moved through the Name Node
• All data transfer occurs directly between clients and the Data Nodes
• Communications with the Name Node only involve transfer of metadata
Introduction to YARN
MapReduce Vs YARN
Limits Scalability
• Maximum cluster size: 4,000 nodes
• Maximum concurrent tasks: 40,000
• Availability – Job Tracker is SPOF (Single Point of Failure)
• Problem with Resource Utilization
• Predefined number of map slots and reduce slots for each TaskTracker
• Slots sit idle (underutilized) when a job needs more map tasks than reduce tasks, or vice versa
• Runs only MapReduce applications
Advantages of YARN
• Application Master
• If application fails or if AM stops sending heartbeats, RM will re-attempt the whole application (2 times by
default)
• MR AppMaster optional setting: Job Recovery
• If false, all tasks will re-run
• If true, MR AppMaster retrieves the state of tasks when it restarts; only incomplete tasks will be re-run
• NodeManager
• If NM stops sending heartbeats to RM, it is removed from list of active nodes
• Tasks on the node will be treated as failed by MR AppMaster
• If the AppMaster node fails, it will be treated as a failed application
• Resource Manager
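The retry behaviour above is controlled by ordinary Hadoop configuration properties; a sketch of the relevant fragments (property names as in Hadoop 2.x; defaults may differ by version):

```xml
<!-- yarn-site.xml / mapred-site.xml fragments (illustrative) -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>2</value> <!-- RM re-attempts the whole application this many times -->
</property>
<property>
  <name>yarn.app.mapreduce.am.job.recovery.enable</name>
  <value>true</value> <!-- restarted MR AppMaster re-runs only incomplete tasks -->
</property>
```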
• The next step is to select the 7th index, because the Country data is located at index 7 in the array 'SingleCountryData'.
• Please note that the input data is in the below format (where Country is at the 7th index, with 0 as the starting index):
Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
• An output of the mapper is again a key-value pair, which is output using the 'collect()' method of 'OutputCollector'.
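Put together, the mapper described above might look like this: a sketch using the old `org.apache.hadoop.mapred` API with OutputCollector, as the text describes (the class name SalesMapper is illustrative).

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SalesMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Split the comma-separated record; Country sits at index 7 (0-based).
        String[] singleCountryData = value.toString().split(",");
        output.collect(new Text(singleCountryData[7]), ONE);
    }
}
```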
MapReduce Program
• An input to the reduce() method is a key with a list
of multiple values.
• For example, in our case, it will be-
<United Arab Emirates, 1>, <United Arab
Emirates, 1>, <United Arab Emirates, 1>,<United
Arab Emirates, 1>, <United Arab Emirates, 1>,
<United Arab Emirates, 1>.
• This is given to reducer as <United Arab Emirates,
{1,1,1,1,1,1}>
• So, to accept arguments of this form, first two data
types are used,
viz., Text and Iterator<IntWritable>. Text is a data
type of key and Iterator<IntWritable> is a data type
for list of values for that key.
• The next argument is of
type OutputCollector<Text,IntWritable> which
collects the output of reducer phase.
• The reduce() method begins by copying the key value and initializing the frequency count to 0.
• Then a 'while' loop is used to iterate through the list of values associated with the key and calculate the final frequency by summing up all the values.
• Finally, it pushes the result to the output collector in the form of the key and the obtained frequency count.
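The three steps above can be sketched as the following old-API reducer (class name SalesCountryReducer is illustrative):

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SalesCountryReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        Text countryKey = key;   // copy the key
        int frequency = 0;       // initialize the frequency count to 0
        while (values.hasNext()) {
            // Sum all the 1s in the value list for this country.
            frequency += values.next().get();
        }
        // Push (country, total frequency) to the output collector.
        output.collect(countryKey, new IntWritable(frequency));
    }
}
```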