
LAB ASSIGNMENT - 04

Name: Parth Shah


ID: 191080070
Branch: Information Technology

AIM
To write a MapReduce program to find the top five trending hashtags on Twitter.

THEORY
Hadoop is an open-source framework that allows the storage and processing of
big data in a distributed environment across clusters of computers using simple
programming models. It is designed to scale up from single servers to thousands
of machines, each offering local computation and storage.
Big data is a collection of large datasets that cannot be processed using traditional
computing techniques. It is not a single technique or tool; rather, it has become a
complete subject that involves various tools, techniques and frameworks.
Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel across the nodes of the cluster. In short, Hadoop is used to
develop applications that can perform complete statistical analysis on huge
amounts of data.
Hadoop Architecture
At its core, Hadoop has two major layers, namely:
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System)

MapReduce
MapReduce is a parallel programming model for writing distributed applications
devised at Google for efficient processing of large amounts of data (multiterabyte
data-sets), on large clusters (thousands of nodes) of commodity hardware in a
reliable, fault-tolerant manner. The MapReduce program runs on Hadoop which is
an Apache open-source framework.

Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is based on the Google File System
(GFS) and provides a distributed file system that is designed to run on commodity
hardware. It has many similarities with existing distributed file systems. However,
the differences from other distributed file systems are significant. It is highly
fault-tolerant and is designed to be deployed on low-cost hardware. It provides
high throughput access to application data and is suitable for applications having
large datasets.
Apart from the above-mentioned two core components, Hadoop framework also
includes the following two modules −
• Hadoop Common − These are Java libraries and utilities required by other
Hadoop modules.
• Hadoop YARN − This is a framework for job scheduling and cluster resource
management.

NameNode and DataNodes
HDFS has a master/slave architecture.


An HDFS cluster consists of a single NameNode, a master server that manages the
file system namespace and regulates access to files by clients. In addition, there
are a number of DataNodes, usually one per node in the cluster, which manage
storage attached to the nodes that they run on. HDFS exposes a file system
namespace and allows user data to be stored in files. Internally, a file is split into
one or more blocks and these blocks are stored in a set of DataNodes. The
NameNode executes file system namespace operations like opening, closing, and
renaming files and directories. It also determines the mapping of blocks to
DataNodes. The DataNodes are responsible for serving read and write requests
from the file system’s clients. The DataNodes also perform block creation,
deletion, and replication upon instruction from the NameNode.

NameNode
- It is the single master server in the HDFS cluster.
- As it is a single node, it can become a single point of failure.
- It manages the file system namespace by executing operations such as opening,
renaming and closing files.
- It simplifies the architecture of the system.

DataNode
- The HDFS cluster contains multiple DataNodes.
- Each DataNode contains multiple data blocks.
- These data blocks are used to store data.
- It is the responsibility of the DataNode to serve read and write requests from the
file system's clients.
- It performs block creation, deletion, and replication upon instruction from the
NameNode.

HDFS
In HDFS, data is distributed over several machines and replicated to ensure
durability in the face of failures and high availability to parallel applications.
It is cost-effective because it uses commodity hardware. It is built around the
concepts of blocks, DataNodes and the NameNode.
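As a small illustration (the path below is only a placeholder), once a file has been
copied into HDFS, its block and replica placement on the DataNodes can be inspected
with:
bin/hdfs fsck /path/to/file -files -blocks -locations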

Why is HDFS used? It is designed for:


Very Large Files: files of hundreds of megabytes, gigabytes or more.
Streaming Data Access: the time to read the whole dataset is more important than
the latency in reading the first record. HDFS is built around a write-once,
read-many-times pattern.
Commodity Hardware: it works on low-cost hardware.

Hadoop Tasks
Hadoop runs code across a cluster of computers. This process includes the
following core tasks that Hadoop performs −
• Data is initially divided into directories and files. Files are divided into uniformly
sized blocks of 128 MB or 64 MB (preferably 128 MB); the values configured on the
cluster can be checked as shown after this list.
• These files are then distributed across various cluster nodes for further
processing.
• HDFS, being on top of the local file system, supervises the processing.
• Blocks are replicated for handling hardware failure.
• Checking that the code was executed successfully.
• Performing the sort that takes place between the map and reduce stages.
• Sending the sorted data to a certain computer.
• Writing the debugging logs for each job.
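For reference, a minimal way to check the configured block size and replication
factor (assuming the commands are run from the Hadoop installation directory) is:
bin/hdfs getconf -confKey dfs.blocksize
bin/hdfs getconf -confKey dfs.replication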

IMPLEMENTATION
Start the multi-node setup on the master and slave machines.
Create a twitter_trends folder in hadoop/hadoop-3.3.0 and add the following
files:

Hashtags.java:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 * Driver class that chains the two MapReduce jobs: job 1 counts hashtag
 * occurrences, job 2 sorts the hashtags by count in descending order.
 *
 * @author parth
 */
public class Hashtags {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: count the occurrences of each hashtag.
        Job job1 = Job.getInstance(conf);
        job1.setJarByClass(Hashtags.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        job1.setMapperClass(TrendMapper1.class);
        job1.setCombinerClass(TrendReducer1.class);
        job1.setReducerClass(TrendReducer1.class);
        job1.setInputFormatClass(TextInputFormat.class);
        job1.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.setInputPaths(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, new Path(args[1]));

        boolean succ = job1.waitForCompletion(true);
        if (!succ) {
            System.exit(1);
        }

        // Job 2: swap (hashtag, count) to (count, hashtag) and sort by count, descending.
        Job job2 = Job.getInstance(conf, "top-k-pass-2");
        job2.setJarByClass(Hashtags.class);
        FileInputFormat.setInputPaths(job2, new Path(args[1]));
        FileOutputFormat.setOutputPath(job2, new Path(args[2]));
        job2.setMapperClass(TrendMapper2.class);
        job2.setReducerClass(TrendReducer2.class);
        job2.setInputFormatClass(TextInputFormat.class);
        job2.setMapOutputKeyClass(LongWritable.class);
        job2.setMapOutputValueClass(Text.class);
        job2.setSortComparatorClass(LongWritable.DecreasingComparator.class);
        job2.setOutputFormatClass(TextOutputFormat.class);

        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}

TrendMapper1.java:
Mapper 1, while tokenizing each line, collects only the tokens that start with '#'.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Emits (hashtag, 1) for every token that starts with '#'.
 *
 * @author parth
 */
public class TrendMapper1 extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            // Keep only hashtags, normalised to lower case.
            if (token.startsWith("#")) {
                word.set(token.toLowerCase());
                context.write(word, one);
            }
        }
    }
}

TrendReducer1.java:
Reducer 1 generates output key-value pairs with the hashtag as the key and its count
as the value. The problem with this output is that it is sorted by key, because the
shuffle-and-sort step between the map and reduce phases sorts the records
alphabetically by key. To get the desired result, sorted by the number of occurrences
of each hashtag, the records need to be sorted by value. So, this output is passed to a
second MapReduce job, which swaps the key and the value and then performs the sort.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Sums the counts for each hashtag and emits (hashtag, total count).
 *
 * @author parth
 */
public class TrendReducer1 extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

TrendMapper2.java:
Mapper 2 tokenizes each line of the first job's output and emits the second token
(the count) as the key and the first token (the hashtag) as the value. During the
shuffle phase the records are sorted by key, but the default sort order is ascending,
which would not give the desired list. So, a decreasing comparator is set as the sort
comparator for the second job.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Reads lines of the form "hashtag<TAB>count" and emits (count, hashtag)
 * so that the shuffle phase sorts the records by count.
 *
 * @author parth
 */
public class TrendMapper2 extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String hashtag = tokenizer.nextToken();
            long count = Long.parseLong(tokenizer.nextToken());
            context.write(new LongWritable(count), new Text(hashtag));
        }
    }
}

TrendReducer2.java:
Reducer 2 swaps the key and the value back, emitting (hashtag, count) pairs in descending order of count.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Swaps the (count, hashtag) pairs back to (hashtag, count); because the keys
 * arrive in descending order of count, the output is the sorted trend list.
 *
 * @author parth
 */
public class TrendReducer2 extends Reducer<LongWritable, Text, Text, Text> {

    @Override
    protected void reduce(LongWritable key, Iterable<Text> trends, Context context)
            throws IOException, InterruptedException {
        for (Text val : trends) {
            context.write(new Text(val.toString()), new Text(key.toString()));
        }
    }
}
Copying the input text file onto the master machine:
Paste the file into the Downloads directory.

Create an input directory on HDFS and add the input file to it, so that its blocks are stored on the DataNodes.
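For example, assuming the tweet file is named tweets.txt in the Downloads folder and
the HDFS directory is /twitter_trends/input (illustrative names, not the exact ones
used in the run):
bin/hdfs dfs -mkdir -p /twitter_trends/input
bin/hdfs dfs -put ~/Downloads/tweets.txt /twitter_trends/input/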

We can see the input folder and files in the Hadoop UI at master:9870.
We can see that the blocks are stored on one DataNode: hadoop-slave.
Go to the twitter_trends directory and view all the files.

Make a classes directory
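For example, assuming it is created inside the twitter_trends folder:
mkdir -p twitter_trends/classes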

Compiling all java files:


Command format: bin/hadoop com.sun.tools.javac.Main
path_where_java_files_are_stored/*.java -d path_where_class_files_are_stored
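For example, with the Java files in twitter_trends and the class files going to
twitter_trends/classes (the paths assumed above):
bin/hadoop com.sun.tools.javac.Main twitter_trends/*.java -d twitter_trends/classes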
Creating a jar file using the above classes:
Command format: jar -cf JAR_FILE_NAME *.class
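For example, assuming the jar is named hashtags.jar and the command is run from the
classes directory:
jar -cf hashtags.jar *.class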

Running the two MapReduce Jobs:


Command format:
hadoop jar (jar file name) (class name along with package name) (input file on
HDFS) (temporary folder path on HDFS) (output folder path on HDFS)
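For example, with the illustrative names used above (the driver class Hashtags has no
package):
bin/hadoop jar twitter_trends/classes/hashtags.jar Hashtags /twitter_trends/input/tweets.txt /twitter_trends/temp /twitter_trends/output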

MapReduce Job1:
MapReduce Job2:
Following is the output of the temporary file, which stores the intermediate results
containing the hashtags and their frequencies in unsorted form:
Following is the output of the main output file, which stores the results containing
the top hashtags and their frequencies sorted in descending order:
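The top five trending hashtags are the first five lines of this sorted output.
Assuming the illustrative output path and the default single-reducer output file name,
they can be listed with, for example:
bin/hdfs dfs -cat /twitter_trends/output/part-r-00000 | head -5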
CONCLUSION
Thus, I have successfully written and run a MapReduce program to find the top five
trending hashtags on Twitter on a multi-node cluster with one DataNode.
