Intellipaat Hands On Exercises PDF
Getting Started
Lab1: Working with HDFS
Lab2: Writing WordCount Program
Lab3: MapReduce with Combiner
Lab4: Writing Custom Partitioner
Lab5: Unit Test MapReduce
Lab6: Using ToolRunner
Lab7: Running MapReduce in LocalJobRunner Mode
Lab8: Using Counters and Map-Only Job
Lab9: Joining two datasets using Map-Side Join
Lab10: Reduce-Side Join
Getting Started

Login to the Virtual Machine with user ‘training’ and password ‘training’.
Upload the training data from your local machine to the Virtual Machine using the WinSCP tool, or simply use the move or copy command.
Extract the content of ‘labs.gz’ provided by the trainer on the Virtual Machine.
Note:
For all the labs, the project file under workspace consists of three folders:
1. Stubs: contains a minimal skeleton of the code for the Java classes.
2. Hints: contains Java class stubs that include additional hints about what’s required to complete the exercise.
3. Solution: contains the complete working code for the exercise.
Lab1: Working with HDFS

In this exercise, we will practise generic HDFS commands and get familiar with the Hadoop command line interface.
Open Terminal on your VM: navigate to Applications > System Tools > Terminal, or use the shortcut on the desktop.
To view a description of all the commands associated with the FsShell subsystem, enter:
$hdfs dfs
To view the content of the root directory:
$hdfs dfs -ls /
Note:
Both the commands returned silently because there are no files under the home directory.
Remember:
On your local filesystem, change the working directory to the location where the training data is dumped, i.e. under the labs directory.
List the content of the ‘data’ directory and locate the file named shakespeare.tar.gz.
Unzip the file shakespeare.tar.gz:
$tar -xzf shakespeare.tar.gz
Upload shakespeare directory and its content to HDFS
$hdfs dfs -put shakespeare /user/training
List the content of the ‘training’ home directory; you should now see the ‘shakespeare’ directory. To test that, list the home directory files without using a path argument.
$hdfs dfs -ls
To create a directory named weblog in HDFS:
$hdfs dfs -mkdir weblog
Unzip the access_log.gz file located under data directory and upload it to weblog
directory in HDFS.
$gunzip access_log.gz
$hdfs dfs -put access_log weblog
Remove the file named shakespeare/glossary from HDFS:
$hdfs dfs -rm shakespeare/glossary
To read the first 10 lines of the file shakespeare/histories:
$hdfs dfs -cat shakespeare/histories | head -n 10
To download a file from HDFS to local filesystem
$hdfs dfs -get shakespeare/poems ~/shakespoems.txt
Lab2: Writing WordCount Program

In this exercise, we will count the number of occurrences of each word in a file or set of files. To achieve this, you need to first compile the Java files, create a JAR and then run the MapReduce job.
$cd ~/workspace/wordcount/src
$ls
$cd solution
$ls
$cd ~/workspace/wordcount/src
$javac -classpath `hadoop classpath` solution/*.java
$ls
Create a JAR from the compiled classes and submit the job (the driver class name below assumes the solution package; use your own class name if it differs):
$jar cvf wc.jar solution/*.class
$hadoop jar wc.jar solution.WordCount shakespeare wordcounts
Check the counter information on the terminal; similarly, job information can also be accessed from the Resource Manager URL.
Note: Job ran with only one Reducer, so there should be only one part-file.
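The counting logic that WordCount implements can be sketched in plain Java, independent of Hadoop. This is a simplified in-memory analogue of the Mapper and Reducer, not the lab's stub code; class and method names here are illustrative:

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java analogue of the WordCount job: the "map" phase emits
// (word, 1) pairs and the "reduce" phase sums the counts per word.
// Here both phases are collapsed into a single in-memory pass.
public class WordCountSketch {
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Map phase: split the input into words, emitting (word, 1)
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            // Reduce phase: sum the ones per key
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("To be or not to be"));
    }
}
```

In the real job the two phases run on different machines and the framework sorts and groups the intermediate (word, 1) pairs between them.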
Lab3: MapReduce with Combiner

In this exercise, we will add a Combiner to the WordCount program to reduce the amount of intermediate data sent from the Mapper to the Reducer.
Copy WordMapper.java and SumReducer.java from the wordcount project to the combiner project (~/workspace/combiner).
Modify the WordCountDriver.java code to add a Combiner for the
WordCount program.
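The effect of the Combiner can be sketched in plain Java: each "mapper" chunk is pre-aggregated locally before the global merge, so far fewer (word, count) pairs cross the shuffle. This is an illustrative analogue, not the Hadoop API; in the actual driver the change amounts to a job.setCombinerClass(...) call:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrates what a Combiner buys: each mapper's chunk is
// pre-aggregated locally (the combine step), so the global
// "reduce" step merges small partial maps instead of raw pairs.
public class CombinerSketch {
    // Combine step: local word counts for one mapper's chunk of input.
    public static Map<String, Integer> combine(String chunk) {
        Map<String, Integer> local = new HashMap<>();
        for (String w : chunk.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) local.merge(w, 1, Integer::sum);
        return local;
    }

    // Reduce step: merge the pre-aggregated partial counts.
    public static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> p : partials)
            p.forEach((w, c) -> total.merge(w, c, Integer::sum));
        return total;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> partials =
            List.of(combine("the cat and the hat"), combine("the cat came back"));
        System.out.println(reduce(partials));
    }
}
```

The final totals are identical with or without the combine step; only the volume of intermediate data changes.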
Select ‘Existing Projects into Workspace’.
Resolve any errors detected after the import.
Hint: If any build-path file is missing under the project properties, remove the missing entry and click ‘Add External JARs’ to add the required external JAR.
Lab4: Writing Custom Partitioner

In this exercise, you will write a MapReduce job with multiple Reducers, and create a Partitioner to determine which Reducer each piece of Mapper output is sent to.
The final output should consist of 12 files, one for every month of the year. Each file will contain a list of IP addresses and the number of hits from each address in that month.
1. The Mapper should emit a Text key (IP address) and a Text value (the month).
2. Remember that the Partitioner receives both the key and value, so you can inspect the value to determine which Reducer to choose.
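The Partitioner's core decision can be sketched in plain Java. Names here are illustrative; the real class extends Hadoop's Partitioner<Text, Text> and overrides getPartition, but the routing logic is the same:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the custom Partitioner's decision: route each (IP, month)
// pair to one of 12 Reducers based on the month carried in the value.
// The three-letter month abbreviations are an assumption about the log format.
public class MonthPartitionSketch {
    private static final List<String> MONTHS = Arrays.asList(
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

    // Mirrors getPartition(key, value, numPartitions): the Reducer
    // index depends only on the month found in the value.
    public static int partitionFor(String monthValue, int numPartitions) {
        return MONTHS.indexOf(monthValue) % numPartitions;
    }

    public static void main(String[] args) {
        // "Mar" has index 2, so its records go to Reducer 2 (part-r-00002).
        System.out.println(partitionFor("Mar", 12));
    }
}
```

With 12 Reducers configured, each month's records land in their own part file, giving the 12 output files described above.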
List the partitioned output files.
Lab5: Unit Test MapReduce

In this exercise, we will write Unit Tests for the WordCount code.
Step1: Launch Eclipse and import mrunit folder from the existing workspace.
Step4: Observe the failure. Results in the JUnit tab should indicate that three tests ran with three failures.
Step5: Now implement the three tests (refer to the code in the solution package).
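The kind of checks these tests make can be sketched against a plain-Java stand-in for the Mapper and Reducer. MRUnit's MapDriver/ReduceDriver classes are not used here, and the names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of what the unit tests verify, using simplified
// stand-ins for the WordCount Mapper and Reducer.
public class WordCountTestSketch {
    // Stand-in mapper: emits a (word, 1) pair for each word in the line.
    public static List<String[]> map(String line) {
        List<String[]> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) out.add(new String[] { w, "1" });
        return out;
    }

    // Stand-in reducer: sums the values seen for one key.
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        // Mapper test: one input line should yield one pair per word.
        List<String[]> pairs = map("cat cat dog");
        assert pairs.size() == 3 && pairs.get(0)[0].equals("cat");
        // Reducer test: key with values [1, 1] should sum to 2.
        assert reduce(List.of(1, 1)) == 2;
        System.out.println("all tests passed");
    }
}
```

In the real lab, MRUnit drivers feed the actual Mapper and Reducer classes one input at a time and assert on the emitted key/value pairs in the same spirit.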
Lab6: Using ToolRunner

Write a MapReduce job that reads any text input and computes the average length of all words that start with each character.
Note: When the case-sensitivity parameter is set to true, the final output should have both upper- and lower-case letters; when set to false, it should have only lower-case letters.
View the output file again, with the option set to true.
Comment out the code that sets the parameter programmatically, and pass the parameter value using -D on the command line.
Note that there will not be any change in the output whether the job is submitted using the programmatic or the runtime approach.
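As a rough illustration of what the job computes, here is a plain-Java sketch of averaging word lengths per starting character, with a boolean standing in for the case-sensitivity parameter. Names are illustrative; the real job would read the flag from its Configuration:

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java analogue of the Average Word Length job: group words by
// their first character and average their lengths. When caseSensitive
// is false, first characters are lower-cased, merging 'N' and 'n'.
public class AvgWordLengthSketch {
    public static Map<Character, Double> averages(String text, boolean caseSensitive) {
        Map<Character, int[]> totals = new TreeMap<>(); // char -> {sumOfLengths, count}
        for (String w : text.split("\\W+")) {
            if (w.isEmpty()) continue;
            char first = caseSensitive ? w.charAt(0) : Character.toLowerCase(w.charAt(0));
            int[] t = totals.computeIfAbsent(first, k -> new int[2]);
            t[0] += w.length();
            t[1]++;
        }
        Map<Character, Double> avg = new TreeMap<>();
        totals.forEach((c, t) -> avg.put(c, (double) t[0] / t[1]));
        return avg;
    }

    public static void main(String[] args) {
        System.out.println(averages("Now is the time", false));
    }
}
```

In the MapReduce version, the Mapper emits (firstCharacter, wordLength) pairs and the Reducer computes each average, but the result is the same.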
Lab7: Running MapReduce in LocalJobRunner Mode

In this exercise, we will practise running a job locally for debugging and testing purposes.
We will run the Average Word Length program again using the LocalJobRunner on
the command line.
Step1: Specify -jt=local to run the job locally instead of submitting it to the cluster, and -fs=file:/// to use the local file system instead of HDFS.
Note: Your input and output files should refer to local files rather than HDFS
files.
Input: Copy shakespoems.txt to /home/training/input
Lab8: Using Counters and Map-Only Job

This job will process a web server’s access log to count the number of times gifs, jpegs and other resources have been retrieved.
For input data, use the Web access log file that we have uploaded in Lab1 to the
HDFS.
Step1: Open the project named ‘counters’ in eclipse by importing from existing
workspace.
Step2: In your driver code, retrieve the values of the counters after the job has
completed and report them using System.out.println
Step5: View the counter information at the end of MapReduce job execution.
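The counting the Mapper performs can be sketched in plain Java, with a map standing in for Hadoop's counters. The counter names and the extension matching below are assumptions about the log format:

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Sketch of the map-only counter job: each log line increments one
// counter according to the requested resource type; no Reducer output
// is produced, only the counters are reported.
public class FileTypeCounterSketch {
    public enum FileType { GIF, JPEG, OTHER }

    public static Map<FileType, Long> count(List<String> logLines) {
        Map<FileType, Long> counters = new EnumMap<>(FileType.class);
        for (FileType t : FileType.values()) counters.put(t, 0L);
        for (String line : logLines) {
            String lower = line.toLowerCase();
            FileType t = lower.contains(".gif") ? FileType.GIF
                       : (lower.contains(".jpg") || lower.contains(".jpeg")) ? FileType.JPEG
                       : FileType.OTHER;
            // Equivalent of context.getCounter(...).increment(1) in the Mapper.
            counters.put(t, counters.get(t) + 1);
        }
        return counters;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "GET /img/logo.gif HTTP/1.0",
            "GET /photos/cat.jpg HTTP/1.0",
            "GET /index.html HTTP/1.0");
        System.out.println(count(lines));
    }
}
```

In the real job the driver retrieves these counter values after completion, as described in Step2 above.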
Lab9: Joining two datasets using Map-Side Join

In this exercise, we will join two datasets: the customer orders dataset (order_custid.txt) and the customer details dataset (cust_data.txt). We need the distributed cache to keep the small dataset (order_custid.txt) in memory.
Input Data: order_custid.txt
781571544 S9876126
781571545 S9876127
781571546 S9876128
781571547 S9876129
781571548 S9876130
781571549 S9876131
781571550 S9876132
781571551 S9876133
781571552 S9876134
Customer Data: cust_data.txt
Expected Output:
781571544 Smith,John (248)-555-9430 jsmith@aol.com S9876126
781571545 Hunter,April (815)-555-3029 april@showers.org S9876127
781571546 Ching,Iris (305)-555-0919 iching@zen.org S9876128
781571547 Doe,John (212)-555-0912 jdoe@morgue.com S9876129
781571548 Jones,Tom (312)-555-3321 tj2342@aol.com S9876130
781571549 Smith,John (607)-555-0023 smith@pocahontas.com S9876131
781571550 Crosby,Dave (405)-555-1516 cros@csny.org S9876132
781571551 Johns,Pam (313)-555-6790 pj@sleepy.com S9876133
781571552 Jetter,Linda (810)-555-8761 netless@earthlink.net S9876134
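The join the Mapper performs against the cached dataset can be sketched in plain Java. The HashMap stands in for the distributed-cache copy of order_custid.txt, and the record layout follows the samples above; method names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the map-side join: the small dataset (order_custid.txt) is
// held in memory, the role the distributed cache plays, and each customer
// record from cust_data.txt is joined against it by customer id.
public class MapSideJoinSketch {
    public static String join(Map<String, String> custIdToOrder, String custRecord) {
        // The first field of the customer record is the customer id.
        String custId = custRecord.split("\\s+")[0];
        String order = custIdToOrder.get(custId);
        // Records with no matching order are dropped (inner join).
        return order == null ? null : custRecord + " " + order;
    }

    public static void main(String[] args) {
        // In-memory copy of the small dataset, as the cache would hold it.
        Map<String, String> orders = new HashMap<>();
        orders.put("781571544", "S9876126");
        // One record from the larger customer dataset.
        String joined = join(orders, "781571544 Smith,John (248)-555-9430 jsmith@aol.com");
        System.out.println(joined);
    }
}
```

Because the lookup happens entirely inside the Mapper, no Reducer (and no shuffle of the large dataset) is needed.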
Lab10: Reduce-Side Join

Using two datasets, customer details (cust_details) and transaction details (transaction_details), we will find the lifetime value of each customer by computing the following:
1. The person’s name along with the frequency of visits by that person.
2. The total amount spent by him/her on purchasing equipment.
Input:
Expected Output
Kristina 8 651.049999
Paige 6 706.970007
Sherri 3 527.589996
Gretchen 5 337.060005
Karen 5 325.150005
Patrick 5 539.380010
Elsie 6 699.550003
Hazel 10 859.419990
Malcolm 6 457.829996
Dolores 6 447.090004
At the Reducer, the reduce-side join operation will yield output like the Expected Output shown above.
$cd /home/training/labs/data/reducejoin
$hdfs dfs -mkdir reduceside
$hdfs dfs -put cust_details reduceside
$hdfs dfs -put transaction_details reduceside
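What the Reducer computes per customer, once the records from both datasets meet at the same key, can be sketched in plain Java. This is an in-memory stand-in, and the (name, amount) record layout is an assumption about the joined data:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the reduce-side aggregation: for each customer name, count
// the visits and sum the amounts spent, as in the Expected Output.
public class LifetimeValueSketch {
    public static Map<String, double[]> aggregate(List<String[]> nameAmountPairs) {
        Map<String, double[]> result = new LinkedHashMap<>(); // name -> {visits, total}
        for (String[] rec : nameAmountPairs) {
            double[] agg = result.computeIfAbsent(rec[0], k -> new double[2]);
            agg[0] += 1;                          // frequency of visits
            agg[1] += Double.parseDouble(rec[1]); // amount spent
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative transactions for one customer (values are made up).
        List<String[]> txns = List.of(
            new String[] { "Sherri", "200.00" },
            new String[] { "Sherri", "127.59" },
            new String[] { "Sherri", "200.00" });
        double[] agg = aggregate(txns).get("Sherri");
        System.out.println((int) agg[0] + " visits, " + agg[1] + " total");
    }
}
```

In the real job, the shuffle delivers all records sharing a customer key to one Reducer call, and this per-key loop is what that call performs.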