BDALab Assn4


NAME: Pratham Shah    ROLL NO: 201080056

SUBJECT: BDA LAB


EXPERIMENT NO: 4

Aim: Write a MapReduce program to find the top ten trending hashtags on Twitter.
Theory:
MapReduce is a framework with which we can write applications that process
huge amounts of data, in parallel, on large clusters of commodity hardware in a
reliable manner.
MapReduce is a processing technique and a programming model for distributed
computing based on Java. The MapReduce algorithm contains two important
tasks, namely Map and Reduce. Map takes a set of data and converts it into
another set of data, where individual elements are broken down into tuples
(key/value pairs). The Reduce task then takes the output from a map as its
input and combines those data tuples into a smaller set of tuples. As the
name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing
over multiple computing nodes. Under the MapReduce model, the data
processing primitives are called mappers and reducers. Decomposing a data
processing application into mappers and reducers is sometimes nontrivial. But,
once we write an application in the MapReduce form, scaling the application to
run over hundreds, thousands, or even tens of thousands of machines in a cluster
is merely a configuration change. This simple scalability is what has attracted
many programmers to use the MapReduce model.
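For example (using a made-up tweet for illustration), the hashtag-counting job
in this experiment moves data through the two phases roughly as follows:

Input line:     great match today #Cricket #cricket #worldcup
Map output:     (#cricket, 1), (#cricket, 1), (#worldcup, 1)
Reduce input:   (#cricket, [1, 1]), (#worldcup, [1])
Reduce output:  (#cricket, 2), (#worldcup, 1)

Here the mapper lowercases #Cricket and #cricket to the same key, and the
reducer sums each key's list of ones.
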
Code and Method of Performance:
The Java file with the MapReduce code for finding the top 10 hashtags is as follows.
TopHashtags.java
import java.io.IOException;
import java.util.StringTokenizer;
import java.util.TreeMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopHashtags {


  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        String token = itr.nextToken();
        // Assuming hashtags are words starting with #
        if (token.startsWith("#")) {
          // Convert to lowercase to ensure case-insensitivity
          word.set(token.toLowerCase());
          context.write(word, one);
        }
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Sorted map from count to hashtag. Note: two hashtags with the same
    // count overwrite each other; this simple version accepts that trade-off.
    private TreeMap<Integer, String> topN = new TreeMap<>();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      // Store in the TreeMap with the count as key for sorting
      topN.put(sum, key.toString());
      // Keep only the ten hashtags with the highest counts
      if (topN.size() > 10) {
        topN.remove(topN.firstKey());
      }
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      // Emit the top 10 hashtags, highest count first
      for (Integer count : topN.descendingKeySet()) {
        context.write(new Text(topN.get(count)), new IntWritable(count));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "top hashtags");
    job.setJarByClass(TopHashtags.class);
    job.setMapperClass(TokenizerMapper.class);
    // No combiner here: IntSumReducer only emits in cleanup(), so using it
    // as a combiner would silently drop partial counts before the reducer.
    job.setReducerClass(IntSumReducer.class);
    // A single reducer is needed so the top-10 list is global, not per-reducer
    job.setNumReduceTasks(1);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Step 1: Start the Hadoop multi-node cluster on VirtualBox. Here I have
used 1 master (hadoop-master) and 2 slaves (hadoop-slave1 and
hadoop-slave2).

Step 2: Store the TopHashtags.java file in a folder and create an
‘input_data’ folder for storing the input. Also create a
‘toptentweets_java’ folder to hold the compiled Java classes (used in a
later step).

Step 3: Create a text file ‘input.txt’ inside the ‘input_data’ folder with all
the hashtags.
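A small sample of what ‘input.txt’ might contain (these tweets are made up
for illustration):

loving the weather today #sunny #travel
exam week again #mondayblues
what a game #cricket #Cricket #worldcup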

Step 4: Now, set the HADOOP_CLASSPATH environment variable using
export HADOOP_CLASSPATH=$(hadoop classpath)
and check it using
echo $HADOOP_CLASSPATH
Step 5: Create a directory on HDFS for the assignment. I have named it
‘TopTenTweetsAssn’. Create a sub-directory ‘Input’ inside this directory.
hadoop fs -mkdir <DIRECTORY_NAME>
hadoop fs -mkdir <HDFS_INPUT_DIRECTORY>
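For example, with the directory names used in this experiment:
hadoop fs -mkdir /TopTenTweetsAssn
hadoop fs -mkdir /TopTenTweetsAssn/Input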

Step 6: Check that these directories appear in the HDFS web UI at localhost:9870

Step 7: Upload the input file ‘input.txt’ to HDFS inside the /Input directory.
Check it on localhost:9870
hadoop fs -put <INPUT_FILE> <HDFS_INPUT_DIRECTORY>
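With the names used in this experiment, the command would look like:
hadoop fs -put input_data/input.txt /TopTenTweetsAssn/Input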

Step 8: Change the current directory to the assignment directory and compile
the Java code there. Check the class files that get created.
javac -classpath ${HADOOP_CLASSPATH} -d <CLASSES_FOLDER> <ASSIGNMENT_JAVA_FILE>
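With the folder names from Step 2, this would be:
javac -classpath ${HADOOP_CLASSPATH} -d toptentweets_java TopHashtags.java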

Step 9: Package the compiled class files into one jar file. The -C flag puts
the class files at the root of the jar, where Hadoop expects them for a
class that has no package.
jar -cvf <JAR_FILE_NAME> -C <CLASSES_FOLDER> .
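For example, assuming the jar is named ‘TopHashtags.jar’ (my choice of name,
not fixed by the assignment):
jar -cvf TopHashtags.jar -C toptentweets_java .
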
Step 10: Now, run the jar file on Hadoop.
hadoop jar <JAR_FILE> <CLASS_NAME> <HDFS_INPUT_DIRECTORY> <HDFS_OUTPUT_DIRECTORY>
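For example, assuming an output directory named ‘Output’ (this directory must
not already exist; Hadoop creates it when the job runs):
hadoop jar TopHashtags.jar TopHashtags /TopTenTweetsAssn/Input /TopTenTweetsAssn/Output
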
Step 11: Check the output.
hadoop fs -cat <HDFS_OUTPUT_DIRECTORY>/part-r-00000
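The single reducer writes its results to a file named part-r-00000 inside the
output directory, one hashtag and its count per line, highest count first.
With the hypothetical output directory from Step 10:
hadoop fs -cat /TopTenTweetsAssn/Output/part-r-00000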

Conclusion:
Through this experiment, I learnt about the MapReduce framework in Hadoop.
The MapReduce model is designed to work in a distributed and fault-tolerant
environment. It automatically handles parallelism, data distribution, and fault
recovery across the nodes in a Hadoop cluster. The framework is highly scalable
and can process massive datasets across a large number of nodes. We can
perform a variety of tasks by writing MapReduce code and executing it
through Hadoop to gain parallelism and speed.
