
1. Install the Eclipse IDE for Java.
2. In the Eclipse IDE, choose File -> New -> Project -> Java -> Java Project -> Next.
3. Give the project name as Wordcountpgm, then under the JRE section choose "Use an execution environment JRE" -> JavaSE-1.8 -> Next.
4. Select the project Wordcountpgm -> Finish.
5. Right-click the Wordcountpgm project folder in the Project Explorer on the left side of the Eclipse window -> New -> Class.
6. The class wizard opens -> give the name of the class as WCDriver -> Finish (leave the other default settings as they are).
7. Create two more classes, WCMapper.java and WCReducer.java, separately.
8. Include the code given below in WCMapper.java:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WCMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    // Map function
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
            Reporter rep) throws IOException
    {
        String line = value.toString();

        // Splitting the line on spaces
        for (String word : line.split(" "))
        {
            if (word.length() > 0)
            {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}
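(Optional check, not part of the lab steps: the small sketch below calls the mapper's map() method directly with a stub OutputCollector that just prints, so you can see which (word, 1) pairs one input line produces. The class name MapperTrace and the sample line are illustrative only; Reporter.NULL is the no-op Reporter provided by the org.apache.hadoop.mapred API.)

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MapperTrace {
    public static void main(String[] args) throws IOException {
        // A stub collector that simply prints whatever the mapper emits
        OutputCollector<Text, IntWritable> printer = new OutputCollector<Text, IntWritable>() {
            public void collect(Text key, IntWritable value) {
                System.out.println(key + "\t" + value);
            }
        };
        // Key 0 stands in for the byte offset; the sample line is made up
        new WCMapper().map(new LongWritable(0), new Text("the cat sat on the mat"),
                printer, Reporter.NULL);
        // Expected: one line per word with count 1; "the" appears twice here,
        // and the reducer later sums those duplicates.
    }
}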

9. Include the code given below in WCReducer.java:


import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WCReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Reduce function
    public void reduce(Text key, Iterator<IntWritable> value,
            OutputCollector<Text, IntWritable> output,
            Reporter rep) throws IOException
    {
        int count = 0;

        // Counting the frequency of each word with respect to its key
        while (value.hasNext())
        {
            IntWritable i = value.next();
            count += i.get();
        }

        output.collect(key, new IntWritable(count));
    }
}
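(Likewise optional: the sketch below feeds WCReducer.reduce() three 1s for the key "the", which should print "the 3". The class name ReducerTrace and the chosen key/values are illustrative only.)

import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ReducerTrace {
    public static void main(String[] args) throws IOException {
        // A stub collector that simply prints whatever the reducer emits
        OutputCollector<Text, IntWritable> printer = new OutputCollector<Text, IntWritable>() {
            public void collect(Text key, IntWritable value) {
                System.out.println(key + "\t" + value);
            }
        };
        // Three occurrences of the word "the", as the mapper would have emitted them
        Iterator<IntWritable> values = Arrays.asList(
                new IntWritable(1), new IntWritable(1), new IntWritable(1)).iterator();
        new WCReducer().reduce(new Text("the"), values, printer, Reporter.NULL);
    }
}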

10. Include the code given below in WCDriver.java:


import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WCDriver extends Configured implements Tool {

    public int run(String args[]) throws IOException
    {
        if (args.length < 2)
        {
            System.out.println("Please give valid inputs");
            return -1;
        }
        JobConf conf = new JobConf(WCDriver.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(WCMapper.class);
        conf.setReducerClass(WCReducer.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
        return 0;
    }

    // Main method
    public static void main(String args[]) throws Exception
    {
        int exitCode = ToolRunner.run(new WCDriver(), args);
        System.out.println(exitCode);
    }
}

11. Right-click the project Wordcountpgm -> Build Path -> Configure Build Path; the wizard opens.
12. In the wizard choose the Libraries tab -> Add External JARs.
13. Select the jar file C:\hadoop-3.1.3\hadoop-3.1.3\share\hadoop\common\hadoop-common-3.1.3.jar
14. Select the jar file C:\hadoop-3.1.3\hadoop-3.1.3\share\hadoop\mapreduce\hadoop-mapreduce-client-core-3.1.3.jar
15. After adding the jar files, click Apply and Close.
16. Create the jar file of this program and name it WCpgm.jar:
▪ Right-click the project Wordcountpgm -> Export; the export wizard opens.
▪ Under Java -> JAR file -> Next -> select the project Wordcountpgm (if already selected, leave it as it is) -> select the export destination (any folder) -> Next -> Next -> under "Select the class of the application entry point", Main class: browse and select the class containing the main method (here it is WCDriver) -> Finish; the jar file gets created.

17. Open a command prompt and run:

C:\Users\HP> hdfs namenode -format
C:\Users\HP> cd C:\hadoop-3.1.3\hadoop-3.1.3\sbin
C:\hadoop-3.1.3\hadoop-3.1.3\sbin> start-dfs.cmd
C:\hadoop-3.1.3\hadoop-3.1.3\sbin> start-yarn.cmd
C:\hadoop-3.1.3\hadoop-3.1.3\sbin> hdfs dfs -mkdir /achu
C:\hadoop-3.1.3\hadoop-3.1.3\sbin> hdfs dfs -ls /
C:\hadoop-3.1.3\hadoop-3.1.3\sbin> hdfs dfs -put C:\count.txt /achu
C:\hadoop-3.1.3\hadoop-3.1.3\sbin> hdfs dfs -ls /achu

18. Run the job: hadoop jar <your_program_jar_file_path> <input_dir> <output_dir>

>hadoop jar C:\copyhadoop\WCpgm.jar /achu/count.txt /outwc

>hdfs dfs -cat /outwc/part-00000

To remove the jps error:

If your JDK bin directory is not in the PATH variable, the jps command will not be found. To add it, do the steps below.

1. Go to Control Panel >> System >> Advanced system settings >> Environment Variables.
2. Click 'Path' under System variables.
3. Click Edit.
4. Add the path "C:\Program Files\Java\jdk1.8.0_72\bin". Now open a command window and type jps; it will work.

▪ JPS (Java Process Status) lists all the processes that are running on the Java virtual machine.

▪ public static class Map extends MapReduceBase implements Mapper

● extends MapReduceBase: the base class for Mapper and Reducer implementations. It provides default no-op implementations for a few methods; most non-trivial applications need to override some of them.

● implements Mapper: the Mapper interface declares the map method.

◄ The first type pair is the input key/value pair; the second is the output key/value pair.

◄ "The data types provided here are Hadoop-specific data types designed for operational efficiency, suited for massively parallel and lightning-fast read/write operations. All these data types are based on Java data types themselves; for example, LongWritable is the equivalent of long in Java, IntWritable of int, and Text of String." (See the small snippet after this list.)

▪ public void map(LongWritable key, Text value, OutputCollector output, Reporter reporter) throws IOException {

● LongWritable key, Text value: the data types of the input key and value to the mapper.

● OutputCollector output: OutputCollector is the generalization of the facility provided by the Map-Reduce framework to "collect data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job". The key and value collected here are the output key and value from the mapper, e.g. <"the", 1>.

● Reporter reporter: the reporter is used to report the task status internally in the Hadoop environment to avoid time-outs.
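As a quick illustration of the data-type correspondence mentioned above (the class name WritableDemo is made up for this note, not part of the lab program), the snippet below wraps plain Java values in the Writable types and converts them back with get() and toString():

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        LongWritable offset = new LongWritable(42L);   // wraps a Java long
        IntWritable count = new IntWritable(1);        // wraps a Java int
        Text word = new Text("hadoop");                // wraps a Java String

        long l = offset.get();        // back to long
        int c = count.get();          // back to int
        String s = word.toString();   // back to String
        System.out.println(l + " " + c + " " + s);
    }
}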

Reducer:

//Class header similar to the one in map
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    //Reduce header similar to the one in map, with different key/value data types
    //Data from map arrives as <key, list of values>, so we receive the values through an Iterator and can go through the set of values
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        //Initialize a variable 'sum' as 0
        int sum = 0;
        //Iterate through all the values with respect to a key and sum them up
        while (values.hasNext()) { sum += values.next().get(); }
        //Push the key and the obtained sum as value to the output collector
        output.collect(key, new IntWritable(sum));
    }
}
Main (The Driver)

● Given the Mapper and Reducer code, the short main() starts the MapReduce job running.
● The Hadoop system picks up a bunch of values from the command line on its own.
● Then the main() also specifies a few key parameters of the problem in the JobConf object.
● JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution (such as which Map and Reduce classes to use and the format of the input and output files).
● Other parameters, e.g. the number of machines to use, are optional, and the system will determine good values for them if not specified.
● Then the framework tries to faithfully execute the job as described by the JobConf.

Main

public static void main(String[] args) throws Exception {
    //Creating a JobConf object and assigning a job name for identification purposes
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    //Setting the configuration object with the data types of the output key and value for map and reduce
    //(if map and reduce have different output types, there are separate set methods for them)
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    //Providing the mapper and reducer class names
    conf.setMapperClass(Map.class);
    //Setting the combiner class
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    //The default input format, TextInputFormat, loads data in as (LongWritable, Text) pairs;
    //the long value is the byte offset of the line in the file
    conf.setInputFormat(TextInputFormat.class);
    //The basic (default) instance is TextOutputFormat, which writes (key, value) pairs on individual lines of a text file
    conf.setOutputFormat(TextOutputFormat.class);
    //The HDFS input and output directories are fetched from the command line
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    //Submits the job to MapReduce and returns only after the job has completed
    JobClient.runJob(conf);
}

By using ToolRunner.run(), any Hadoop application can handle the standard command-line options supported by Hadoop. ToolRunner uses GenericOptionsParser internally. In short, the Hadoop-specific options provided on the command line are parsed and set into the Configuration object of the application. If you simply use main(), this won't happen automatically.
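As a rough sketch of what this buys you (the class name GenericOptionsDemo and the property queried below are chosen only for illustration), the program reads a value that was passed on the command line as a generic -D option; ToolRunner and GenericOptionsParser place it into the Configuration before run() is called:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class GenericOptionsDemo extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // Any -D key=value passed on the command line has already been parsed
        // by GenericOptionsParser and placed into this Configuration.
        Configuration conf = getConf();
        System.out.println("mapreduce.job.reduces = "
                + conf.get("mapreduce.job.reduces", "not set"));
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // e.g. hadoop jar demo.jar GenericOptionsDemo -D mapreduce.job.reduces=2
        System.exit(ToolRunner.run(new Configuration(), new GenericOptionsDemo(), args));
    }
}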
