
Intellipaat Hands-On



Contents

Getting Started
Lab1: Working with HDFS
Lab2: Writing WordCount Program
Lab3: MapReduce with Combiner
Lab4: Writing Custom Partitioner
Lab5: Unit Test MapReduce
Lab6: Using ToolRunner
Lab7: Running MapReduce in LocalJobRunner Mode
Lab8: Using Counters and Map-Only Job
Lab9: Joining two datasets using Map-Side Join
Lab10: Reduce-Side Join



Getting Started

 
Login to the Virtual Machine with user training and password training.

Upload the training data from the local machine to the Virtual Machine using the WinSCP tool, or simply use the move or copy command.

Extract the content of 'labs.gz' provided by the trainer on the Virtual Machine.

Move the project files of the workspace directory to /home/training/workspace

$ cd /home/training/labs/workspace
$ cp -r * /home/training/workspace

Note:
For all the labs, each project directory under workspace consists of three folders:
1. Stubs: contains a minimal skeleton of the code for the Java classes

2. Hints: contains Java class stubs that include additional hints about what’s
required to complete the exercise.

3. Solution: Fully implemented Java code



Lab1: Working with HDFS

In this exercise, we will practise generic HDFS commands and familiarize ourselves with the Hadoop command-line interface.

Open a Terminal on your VM: navigate to Applications > System Tools > Terminal, or use the shortcut on the desktop.

1.1 Exploring HDFS


To view a description of all the commands associated with the FsShell subsystem, enter:

$hdfs dfs


To view the content of the root directory
$hdfs dfs -ls /




List the content of the /user directory.

 $hdfs dfs -ls /user









To list the content of the 'training' home directory.
$hdfs dfs -ls /user/training

$hdfs dfs -ls

Note:

Both commands returned silently because there are no files under the home directory.

Access any random non-existing directory and observe the error message.

$hdfs dfs -ls test

Remember:

The directory structure in HDFS has nothing to do with the directory structure of the local filesystem; they are separate namespaces.



1.2 Uploading Files


On your local filesystem, change the working directoryto the location where
 ‘training data’ is dumped i.e. under labs directory.









List the content of the 'data' directory and locate the file named shakespeare.tar.gz





 
Unzip the file shakespeare.tar.gz

$tar -xvf shakespeare.tar.gz


Upload the shakespeare directory and its content to HDFS
 $hdfs dfs -put shakespeare /user/training

List the content of the 'training' home directory now; you should see the 'shakespeare' directory.



Note:

If you pass any relative (non-absolute) paths to FsShell commands, they are considered relative to your home directory.


To test that, list the home directory files without using a path argument.

$hdfs dfs -ls





 
To create a directory weblog in HDFS

$hdfs dfs -mkdir weblog



Unzip the access_log.gz file located under the data directory and upload it to the weblog directory in HDFS.

$gunzip access_log.gz

$hdfs dfs -put access_log weblog

Verify the content using ls command



1.3 Manipulating Files


Remove the file shakespeare/glossary from HDFS

$hdfs dfs -rm shakespeare/glossary




 
To read the last 10 lines of the file shakespeare/histories

$hdfs dfs -cat shakespeare/histories | tail -n 10


To download a file from HDFS to the local filesystem
$hdfs dfs -get shakespeare/poems ~/shakespoems.txt



Lab2: Writing WordCount Program

In this exercise, we will count the number of occurrences of each word in a file or set of files. To achieve this, you need to first compile the Java files, create a JAR, and then run the MapReduce job.

Step1: Copy the wordcount folder from /home/training/labs/workspace to


/home/training/workspace.

Step2: Change directory to wordcount source directory

$cd ~/workspace/wordcount/src
$ls

This directory contains three package directories; navigate to the solution directory.

$cd solution
$ls

This package contains the following Java files:


WordCount.java: A simple MapReduce driver class

WordMapper.java: A mapper class for the job

SumReducer.java: A reducer class for the job
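
For reference, a minimal sketch of what the Mapper and Reducer typically look like is shown below. The class names match the lab, but the exact parsing (splitting on non-word characters, lower-casing) is an assumption; the code in the solution package may differ in detail.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// WordMapper: splits each input line into words and emits (word, 1)
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\W+")) {
            if (!token.isEmpty()) {
                word.set(token.toLowerCase());
                context.write(word, ONE);
            }
        }
    }
}

// SumReducer (a separate file in the lab): sums the counts emitted for each word
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}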

Step3: Before compiling, examine the classpath Hadoop is configured to use:


$hadoop classpath



Step4: Compile the three Java classes using the command line:

$cd ~/workspace/wordcount/src
$javac -classpath `hadoop classpath` solution/*.java
$ls

The compiled (.class) files are placed in the solution directory.

Step5: Collect your compiled class files into a JAR file:

$jar cvf wc.jar solution/*.class



Step6: Submit the MapReduce job using 'shakespeare' (uploaded in the previous exercise) as the input.

$hadoop jar wc.jar solution.WordCount shakespeare output

Where

wc.jar: the wordcount JAR

solution.WordCount: name of the driver class

shakespeare: input files present under the shakespeare directory

output: location of the MapReduce job's final output

Check the counter information on the terminal; job information can also be accessed from the Resource Manager URL.

Step7: Review the result of your MapReduce job:

$hdfs dfs -ls output

Note: The job ran with only one Reducer, so there should be only one part file.



Step8: View the content of the output for your job.

$hdfs dfs -cat output/part-r-00000 | tail -10



Lab3: MapReduce with Combiner

In this exercise, we will add a Combiner to the WordCount program to reduce the
amount of intermediate data sent from the Mapper to the Reducer.

Step1: Implement a Combiner


Copy WordMapper.java and SumReducer.java from the wordcount project to the combiner project (~/workspace/combiner)



Modify the WordCountDriver.java code to add a Combiner for the
WordCount program.
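
The change in the driver is essentially one line: register the existing SumReducer as the Combiner. A hedged sketch of the relevant part of WordCountDriver.java is shown below; the surrounding job setup is assumed to be the same as in Lab2.

Job job = Job.getInstance(new Configuration(), "Word Count with Combiner");
job.setJarByClass(WordCountDriver.class);
job.setMapperClass(WordMapper.class);
job.setCombinerClass(SumReducer.class);   // runs SumReducer on map output before the shuffle
job.setReducerClass(SumReducer.class);

SumReducer can double as the Combiner because summing counts is associative and commutative, so applying it to partial map output does not change the final result.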

Step2: Assemble and export the JAR using Eclipse.



Import the 'combiner' project into Eclipse.

Select 'Existing Projects into Workspace'.




Browse to the '/home/training/workspace' directory and select the 'combiner' project.




Click Finish

Resolve any errors detected after the import.

Hint: If any build path entries are missing, add the external JAR files under the project properties.


Remove the missing entry and click Add External JARs




From /usr/lib/hadoop-*, add all the available JARs to the project.



 
Select the Mapper, Reducer, and Driver Java files and export them as a JAR file.




Name the JAR file and click Finish

Step3: Submit the MapReduce Job



$hadoop jar combiner.jar solution.WordCountDriver
shakespeare combinerout

Step4: Verify the output; it should be identical to the output of the WordCount application without a Combiner.

$hdfs dfs -ls combinerout



Lab4: Writing Custom Partitioner

In this exercise, you will write a MapReduce job with multiple Reducers, and create
a Partitioner to determine which Reducer each piece of Mapper output is sent to.

Test Data: Web server access logs

The final output should consist of 12 files, one each for every month of the year.
Each file will contain a list of IP addresses and the number of hits from each address in that month.

To accomplish this, we will have 12 Reducers, each of which is responsible for processing the data for a particular month.

Step1: Write a Mapper

1. Start with the LogMonthMapper.java stub file located under the ~/workspace/partitioner/src/stub folder and write a Mapper that maps a log file line to an IP/month pair.

2. The Mapper should emit a Text key (IP address) and Text value (the
month).

Solution: The code is packaged under the ~/workspace/partitioner/src/solution folder.
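
As a rough sketch of what Step1 asks for, assuming the common Apache access-log layout where the client IP is the first field and the month name appears inside the [dd/Mon/yyyy:...] timestamp (the stub's actual parsing may differ):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMonthMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(" ");
        if (fields.length > 3) {
            String ip = fields[0];
            // the fourth field looks like [10/Sep/2013:07:29:11; take the month name
            String[] timestamp = fields[3].split("/");
            if (timestamp.length > 1) {
                context.write(new Text(ip), new Text(timestamp[1]));
            }
        }
    }
}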



Step2: Write the Partitioner

1. Modify the MonthPartitioner.java stub file to create a Partitioner that sends the (key, value) pair to the correct Reducer based on the month.

2. Remember that the Partitioner receives both the key and the value, so you can inspect the value to determine which Reducer to choose, as in the sketch below.
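
A sketch of such a Partitioner is shown below; mapping month abbreviations to Reducer indexes 0-11 is an assumption about how the solution assigns months.

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MonthPartitioner extends Partitioner<Text, Text> {
    private static final List<String> MONTHS = Arrays.asList(
            "Jan", "Feb", "Mar", "Apr", "May", "Jun",
            "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // the value carries the month emitted by the Mapper
        int idx = MONTHS.indexOf(value.toString());
        if (idx < 0) {
            idx = 0;                    // unknown month falls back to partition 0
        }
        return idx % numReduceTasks;    // with 12 Reducers, one partition per month
    }
}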

Step3: Modify the Driver

1. Modify the driver code to specify that you want 12 Reducers

2. Configure your job to use your custom Partitioner (see the sketch below)
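
The corresponding driver changes are minimal; a sketch (class names are the lab's, the rest of the driver is assumed unchanged):

job.setNumReduceTasks(12);                        // one Reducer per month
job.setPartitionerClass(MonthPartitioner.class);  // route each record to its month's Reducer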

Step4: Build the JAR file using Eclipse.



Name the JAR file and click FINISH.

Step5: Run the MapReduce Job




Ensure that the input file 'access_log' is uploaded to HDFS (completed in Lab1).







 
Submit the MapReduce job

$hadoop jar partitioner.jar solution.ProcessLogs weblog partoutput


List the partitioned output files

$hdfs dfs -ls partoutput

View the content of an output file:



Lab5: Unit Test MapReduce

In this exercise, we will write Unit Tests for the WordCount code.

Step1: Launch Eclipse and import the mrunit folder from the existing workspace.



Step2: Examine the TestWordCount.java file in the mrunit project stubs
package. Notice that three tests have been created, one each for the Mapper,
Reducer, and the entire MapReduce flow.

Currently, all three tests simply fail.



Step3: Run the tests by right-clicking TestWordCount.java in the Package Explorer panel and choosing Run As > JUnit Test.

Step4: Observe the failures. The results in the JUnit tab should indicate that three tests ran with three failures.

Step5: Now implement the three tests (refer to the code in the solution package)
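
For orientation, a hedged sketch of the Mapper test using the MRUnit MapDriver API is shown below; the input line and expected output pairs are illustrative, and the solution's test data may differ.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TestWordCount {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // wire the test driver to the Mapper under test
        mapDriver = MapDriver.newMapDriver(new WordMapper());
    }

    @Test
    public void testMapper() throws Exception {
        mapDriver.withInput(new LongWritable(1), new Text("cat cat dog"))
                 .withOutput(new Text("cat"), new IntWritable(1))
                 .withOutput(new Text("cat"), new IntWritable(1))
                 .withOutput(new Text("dog"), new IntWritable(1))
                 .runTest();
    }
}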



Step6: Run the tests again. The result in the JUnit tab should indicate that three tests ran with no failures.



Lab6: Using ToolRunner

In this exercise, you will implement a driver using ToolRunner.

Write a MapReduce job that reads any text input and computes the average length of all words that start with each character.

The Mapper should also reference a Boolean parameter called caseSensitive: if true, the Mapper should treat upper- and lower-case letters as different; if false or unset, all letters should be converted to lower case.
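
A minimal sketch of a ToolRunner-based driver is shown below, assuming the lab's class names AvgWordLength and LetterMapper; the output types and the commented-out Reducer line are guesses about the solution.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AvgWordLength extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: AvgWordLength <input dir> <output dir>");
            return -1;
        }
        Job job = Job.getInstance(getConf(), "Average Word Length");
        job.setJarByClass(AvgWordLength.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(LetterMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // job.setReducerClass(...): the solution's Reducer averages the word lengths per letter
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (-D, -fs, -jt, -files) into the
        // Configuration before calling run(), which Steps 3 to 5 rely on
        int exitCode = ToolRunner.run(new Configuration(), new AvgWordLength(), args);
        System.exit(exitCode);
    }
}
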
Step1: Locate the toolrunner folder under ~/workspace/toolrunner

Step2: Import this project in Eclipse.

Note: In the LetterMapper class, the setup method reads a parameter called caseSensitive, used to indicate whether to do case-sensitive or case-insensitive processing.
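
A sketch of how the setup and map methods might use that parameter; emitting (first letter, word length) follows the exercise description, and the rest is an assumption:

private boolean caseSensitive = false;

@Override
protected void setup(Context context) {
    // read the flag set programmatically or via -DcaseSensitive=... on the command line
    caseSensitive = context.getConfiguration().getBoolean("caseSensitive", false);
}

@Override
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    for (String word : value.toString().split("\\W+")) {
        if (word.isEmpty()) {
            continue;
        }
        if (!caseSensitive) {
            word = word.toLowerCase();
        }
        // emit (first letter, word length) so the Reducer can average lengths per letter
        context.write(new Text(word.substring(0, 1)), new IntWritable(word.length()));
    }
}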

Step3: Pass the parameter programmatically


Test and JAR your code twice, once passing false and once passing true.

Note: When set to true, the final output has both upper- and lower-case letters; when set to false, it has only lower-case letters.
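
For the programmatic approach, the driver sets the flag on its Configuration inside run(), before the Job is created; a one-line sketch (flip the literal between the two builds):

getConf().setBoolean("caseSensitive", true);   // or false for the second JAR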



Save the JAR file and click on FINISH

Submit the job using the JAR file

$hadoop jar toolrunnerf.jar solution.AvgWordLength shakespeare avgout



View the output when the option was set to FALSE:

$hdfs dfs -ls avgout

Now change the parameter to TRUE

Export the JAR again and Submit the job

$hadoop jar toolrunnert.jar solution.AvgWordLength shakespeare avgout_true

View the output again, when the option was set to TRUE:

$hdfs dfs -ls avgout_true



Step4: Pass the parameter as a runtime parameter

Comment out the code that sets the parameter programmatically and pass the parameter value using -D on the command line.

Export the JAR file and click FINISH.



Step5: Test passing both true and false to confirm the parameter works correctly.

$hadoop jar toolrunner.jar solution.AvgWordLength -DcaseSensitive=true shakespeare avgtrue

View the output file when the option is set to TRUE



Now change the parameter to FALSE

$hadoop jar toolrunner.jar solution.AvgWordLength -DcaseSensitive=false shakespeare avgfalse

View the output file when the option is set to FALSE

Note that the output does not change whether the parameter is set programmatically or at runtime.



Lab7: Running MapReduce in LocalJobRunner Mode

In this exercise, we will practise running a job locally for debugging and testing purposes.

We will run the Average Word Length program again using the LocalJobRunner on
the command line.

Step1: Specify -jt=local to run the job locally instead of submitting it to the cluster, and -fs=file:/// to use the local filesystem instead of HDFS.

Note: Your input and output paths should refer to local files rather than HDFS files.

Input: Copy shakespoems.txt to /home/training/input

Submit the job as:

$hadoop jar toolrunner.jar solution.AvgWordLength -fs=file:/// -Dmapreduce.framework.name=local -Dmapreduce.resourcemanager.address=local -Dmapreduce.jobtracker.address=local /home/training/input output



Verify the output locally:

$cat output/part-r-00000 | tail -10



Lab8: Using Counters and Map-Only Job

In this exercise, we will create a Map-Only MapReduce job.

This job will process a web server’s access log to count the number of times gifs,
jpegs and other resources have been retrieved.

Hint: Use a Map-Only MapReduce job by setting the number of Reducers to 0 in the driver code.
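
The driver-side change is a single call; a sketch:

// With zero Reducers there is no shuffle or sort: whatever the Mapper writes
// (here, nothing) goes straight to the output directory as part-m-* files.
job.setNumReduceTasks(0);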

For input data, use the web access log file that we uploaded to HDFS in Lab1.

Step1: Open the project named 'counters' in Eclipse by importing it from the existing workspace.

If build path entries are missing, add the external JARs.

Step2: In your driver code, retrieve the values of the counters after the job has
completed and report them using System.out.println
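
A hedged sketch of both sides of the counter flow is shown below; the class, enum, and counter names are illustrative, and the lab's ImageCounter solution may use different ones.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ImageCounterMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    // counters are grouped under the enum's class name in the job output
    public enum ImageType { GIF, JPEG, OTHER }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().toLowerCase();
        if (line.contains(".gif")) {
            context.getCounter(ImageType.GIF).increment(1);
        } else if (line.contains(".jpg") || line.contains(".jpeg")) {
            context.getCounter(ImageType.JPEG).increment(1);
        } else {
            context.getCounter(ImageType.OTHER).increment(1);
        }
        // nothing is written to the context, which is why the output files stay empty
    }
}

// In the driver, after job.waitForCompletion(true) returns:
//   long gifs = job.getCounters().findCounter(ImageCounterMapper.ImageType.GIF).getValue();
//   System.out.println("GIF requests: " + gifs);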



Step3: Export the project as a JAR file and run the MapReduce job

Name the JAR file and click FINISH.



Step4: Submit the job with weblog as the input
$hadoop jar counter.jar solution.ImageCounter
weblog counterout

Step5: View the counter information at the end of MapReduce job execution.



Notice that the output folder on HDFS contains Mapper output files which are empty, as the Mapper did not write any data.



Lab9: Joining two datasets using Map-Side Join

In this exercise, we will join two datasets, a customer orders dataset (order_custid.txt) and a customer details dataset (cust_data.txt), using the distributed cache to keep the small dataset (order_custid.txt) in memory.
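
A hedged sketch of the map-side join Mapper: the file passed with -files is available in each task's working directory, so setup() loads it into a HashMap and map() appends the matching order id to each customer record. Class and field names here are illustrative; the lab's App solution may differ.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Map<String, String> custIdToOrder = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // -files places order_custid.txt in the local working directory of each task
        try (BufferedReader reader = new BufferedReader(new FileReader("order_custid.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length == 2) {
                    custIdToOrder.put(parts[0], parts[1]);   // custId -> orderId
                }
            }
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        String custId = record.split("\\s+")[0];
        String orderId = custIdToOrder.get(custId);
        if (orderId != null) {
            // emit the customer record with the matching order id appended
            context.write(new Text(record + " " + orderId), NullWritable.get());
        }
    }
}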

Input Data: order_custid.txt

781571544 S9876126
781571545 S9876127
781571546 S9876128
781571547 S9876129
781571548 S9876130
781571549 S9876131
781571550 S9876132
781571551 S9876133
781571552 S9876134

Customer Data: cust_data.txt

781571544 Smith,John (248)-555-9430 jsmith@aol.com
781571545 Hunter,April (815)-555-3029 april@showers.org
781571546 Ching,Iris (305)-555-0919 iching@zen.org
781571547 Doe,John (212)-555-0912 jdoe@morgue.com
781571548 Jones,Tom (312)-555-3321 tj2342@aol.com
781571549 Smith,John (607)-555-0023 smith@pocahontas.com
781571550 Crosby,Dave (405)-555-1516 cros@csny.org
781571551 Johns,Pam (313)-555-6790 pj@sleepy.com
781571552 Jetter,Linda (810)-555-8761 netless@earthlink.net

Expected Output:
781571544 Smith,John (248)-555-9430 jsmith@aol.com S9876126
781571545 Hunter,April (815)-555-3029 april@showers.org S9876127
781571546 Ching,Iris (305)-555-0919 iching@zen.org S9876128
781571547 Doe,John (212)-555-0912 jdoe@morgue.com S9876129
781571548 Jones,Tom (312)-555-3321 tj2342@aol.com S9876130
781571549 Smith,John (607)-555-0023 smith@pocahontas.com S9876131
781571550 Crosby,Dave (405)-555-1516 cros@csny.org S9876132
781571551 Johns,Pam (313)-555-6790 pj@sleepy.com S9876133
781571552 Jetter,Linda (810)-555-8761 netless@earthlink.net S9876134


Step1: Upload cust_data.txt from local (/home/training/labs/data) to
HDFS.

Step2: Import the project ‘HadoopMapSideJoin’ to the Eclipse workspace

Step3: Export the project as JAR



Name the JAR and click FINISH



Step4: Submit the MapReduce Job

"Usage : App -files <location-to-cust-id-and-order-file> <input


path> <output path>"

$hadoop jar MapSide.jar App -files


/home/training/labs/data/order_custid.txt input output

Step5: Verify the output



Lab10: Reduce-Side Join

To perform this exercise, consider two separate datasets of a sports complex:

1. cust_details: It contains the details of the customer.


2. transaction_details: It contains the transaction record of the customer.

Using these two datasets, we will find out the lifetime value of each customer by computing the following:

1. The person's name along with the frequency of visits by that person.
2. The total amount spent by him/her on purchasing equipment.

Input:

The input files are available locally under /home/training/labs/data/reducejoin

Expected Output

Kristina 8 651.049999
Paige 6 706.970007
Sherri 3 527.589996
Gretchen 5 337.060005
Karen 5 325.150005
Patrick 5 539.380010
Elsie 6 699.550003
Hazel 10 859.419990
Malcolm 6 457.829996
Dolores 6 447.090004

Step1: Import the ‘ReduceSideJoin’ project to Eclipse workspace.



Step2: Export the project as a JAR
Right-click on the Java file and select Export; name the JAR and click FINISH.

Step3: Design of the project


Map Phase:

We have a separate Mapper for each dataset.

The Mapper for cust_details will yield

Key – Value pair: [cust ID, cust name]



Example: [4000001, cust Kristina], [4000002, cust Paige],
etc.

The Mapper for transaction_details will yield

Key – Value pair: [cust ID, tnxn amount]

Example: [4000001, tnxn 40.33], [4000002, tnxn 198.44], etc.

Sorting and Shuffling Phase:

This phase generates a list of values corresponding to each key

Key – list of Values:

{cust ID1 – [(cust name1), (tnxn amount1), (tnxn amount2), (tnxn amount3), …]}
{cust ID2 – [(cust name2), (tnxn amount1), (tnxn amount2), (tnxn amount3), …]}

Example:

{4000001 – [(cust kristina), (tnxn 40.33), (tnxn 47.05), …]}

Reducer:
To achieve the final goal, the reduce-side join operation will yield output like

Key – Value pair: [Name of the customer] (Key) – [total amount, frequency of the visit] (Value)
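
A hedged sketch of the Reducer that follows from this design: the two Mappers tag each value with a "cust " or "tnxn " prefix (as in the examples above), so the Reducer can tell the name apart from the amounts. The class name and exact formatting are assumptions; the lab's solution may differ.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String name = "";
        double total = 0.0;
        int visits = 0;
        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("cust ")) {
                name = v.substring(5);                        // customer name
            } else if (v.startsWith("tnxn ")) {
                total += Double.parseDouble(v.substring(5));  // one transaction amount
                visits++;
            }
        }
        // output: customer name, visit frequency, total amount spent
        context.write(new Text(name), new Text(visits + " " + total));
    }
}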

Step4: Prepare the input

$hdfs dfs -mkdir reduceside

$cd /home/training/labs/data/reducejoin
$hdfs dfs -put cust_details reduceside
$hdfs dfs -put transaction_details reduceside



Step5: Submit the job

$hadoop jar ReduceSide.jar ReduceJoin reduceside/cust_details reduceside/transaction_details out

Step6: View the output


$hdfs dfs -ls out
$hdfs dfs -cat out/part-r-00000

