Intellipaat Hands On Exercises PDF
Getting Started
Lab1: Working with HDFS
Lab2: Writing WordCount Program
Lab3: MapReduce with Combiner
Lab4: Writing Custom Partitioner
Lab5: Unit Test MapReduce
Lab6: Using ToolRunner
Lab7: Running MapReduce in LocalJobRunner Mode
Lab8: Using Counters and Map-Only Job
Lab9: Joining two datasets using Map-Side Join
Lab10: Reduce-Side Join
Getting Started

Login to the Virtual Machine with user ‘training’ and password ‘training’.
Upload the training data from your local machine to the Virtual Machine using the WinSCP tool, or simply use the move or copy command.
Extract the content of ‘labs.gz’ provided by the trainer on the Virtual Machine.
Note:
For all the labs, the project file under workspace consists of three folders:
1. Stubs: contains a minimal skeleton of the code for the Java classes.
2. Hints: contains Java class stubs that include additional hints about what’s required to complete the exercise.
3. Solution: contains the complete working code for the exercise.
Lab1: Working with HDFS

In this exercise, we will practise generic HDFS commands and get familiar with the Hadoop command line interface.
Open Terminal on your VM: navigate to Applications > System Tools > Terminal, or use the shortcut on the desktop.
To view a description of all the commands associated with the FsShell subsystem, enter:
$hdfs dfs
To view the content of the root directory:
$hdfs dfs -ls /
Note:
Both the commands returned silently because there are no files under the home directory.
Remember:
On your local filesystem, change the working directory to the location where the training data is dumped, i.e. under the labs directory.
List the content of the ‘data’ directory and locate the file named shakespeare.tar.gz.
Unzip the file shakespeare.tar.gz:
$tar -xzf shakespeare.tar.gz
Upload shakespeare directory and its content to HDFS
$hdfs dfs -put shakespeare /user/training
List the content of the ‘training’ home directory; you should now see the ‘shakespeare’ directory. To test that, list the home directory files without using a path argument.
$hdfs dfs -ls
To create a directory named weblog in HDFS:
$hdfs dfs -mkdir weblog
Unzip the access_log.gz file located under data directory and upload it to weblog
directory in HDFS.
$gunzip access_log.gz
$hdfs dfs -put access_log weblog
Remove the file named shakespeare/glossary from HDFS:
$hdfs dfs -rm shakespeare/glossary
To read the first 10 lines of the file shakespeare/histories:
$hdfs dfs -cat shakespeare/histories | head -n 10
To download a file from HDFS to local filesystem
$hdfs dfs -get shakespeare/poems ~/shakespoems.txt
Lab2: Writing WordCount Program

In this exercise, we will count the number of occurrences of each word in a file or set of files. To achieve this, you need to first compile the Java files, create a JAR and then run the MapReduce job.
$cd ~/workspace/wordcount/src
$ls
$cd solution
$ls
$cd ~/workspace/wordcount/src
$javac -classpath `hadoop classpath` solution/*.java
$ls
Create a JAR from the compiled classes and submit the job (the driver class name below assumes the solution package; use your own class name if it differs):
$jar cvf wc.jar solution/*.class
$hadoop jar wc.jar solution.WordCount shakespeare wordcounts
Check the counter information on the terminal; similarly, job information can also be accessed from the Resource Manager URL.
Note: Job ran with only one Reducer, so there should be only one part-file.
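The counting logic that WordCount implements can be sketched in plain Java, independent of Hadoop. This is a simplified in-memory analogue of the Mapper and Reducer, not the lab's stub code; class and method names here are illustrative:

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java analogue of the WordCount job: the "map" phase emits
// (word, 1) pairs and the "reduce" phase sums the counts per word.
// Here both phases are collapsed into a single in-memory pass.
public class WordCountSketch {
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Map phase: split the input into words, emitting (word, 1)
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            // Reduce phase: sum the ones per key
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("To be or not to be"));
    }
}
```

In the real job the two phases run on different machines and the framework sorts and groups the intermediate (word, 1) pairs between them.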
Lab3: MapReduce with Combiner

In this exercise, we will add a Combiner to the WordCount program to reduce the amount of intermediate data sent from the Mapper to the Reducer.
Copy WordMapper.java and SumReducer.java from the wordcount project to the combiner project (~/workspace/combiner).
Modify the WordCountDriver.java code to add a Combiner for the
WordCount program.
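The effect of the Combiner can be sketched in plain Java: each "mapper" chunk is pre-aggregated locally before the global merge, so far fewer (word, count) pairs cross the shuffle. This is an illustrative analogue, not the Hadoop API; in the actual driver the change amounts to a job.setCombinerClass(...) call:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrates what a Combiner buys: each mapper's chunk is
// pre-aggregated locally (the combine step), so the global
// "reduce" step merges small partial maps instead of raw pairs.
public class CombinerSketch {
    // Combine step: local word counts for one mapper's chunk of input.
    public static Map<String, Integer> combine(String chunk) {
        Map<String, Integer> local = new HashMap<>();
        for (String w : chunk.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) local.merge(w, 1, Integer::sum);
        return local;
    }

    // Reduce step: merge the pre-aggregated partial counts.
    public static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> p : partials)
            p.forEach((w, c) -> total.merge(w, c, Integer::sum));
        return total;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> partials =
            List.of(combine("the cat and the hat"), combine("the cat came back"));
        System.out.println(reduce(partials));
    }
}
```

The final totals are identical with or without the combine step; only the volume of intermediate data changes.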
Select ‘Existing Projects into Workspace’.
Resolve any errors detected after the import.
Hint: If any build-path file is missing under the project properties, remove the missing entry and click ‘Add External JARs’ to add the required external JAR.
Lab4: Writing Custom Partitioner

In this exercise, you will write a MapReduce job with multiple Reducers, and create a Partitioner to determine which Reducer each piece of Mapper output is sent to.
The final output should consist of 12 files, one for every month of the year. Each file will contain a list of IP addresses and the number of hits from each address in that month.
1. The Mapper should emit a Text key (IP address) and a Text value (the month).
2. Remember that the Partitioner receives both the key and value, so you can inspect the value to determine which Reducer to choose.
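The Partitioner's core decision can be sketched in plain Java. Names here are illustrative; the real class extends Hadoop's Partitioner<Text, Text> and overrides getPartition, but the routing logic is the same:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the custom Partitioner's decision: route each (IP, month)
// pair to one of 12 Reducers based on the month carried in the value.
// The three-letter month abbreviations are an assumption about the log format.
public class MonthPartitionSketch {
    private static final List<String> MONTHS = Arrays.asList(
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

    // Mirrors getPartition(key, value, numPartitions): the Reducer
    // index depends only on the month found in the value.
    public static int partitionFor(String monthValue, int numPartitions) {
        return MONTHS.indexOf(monthValue) % numPartitions;
    }

    public static void main(String[] args) {
        // "Mar" has index 2, so its records go to Reducer 2 (part-r-00002).
        System.out.println(partitionFor("Mar", 12));
    }
}
```

With 12 Reducers configured, each month's records land in their own part file, giving the 12 output files described above.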
List the partitioned output files.
Lab5: Unit Test MapReduce

In this exercise, we will write Unit Tests for the WordCount code.
Step1: Launch Eclipse and import mrunit folder from the existing workspace.
Step4: Observe the failure. Results in the JUnit tab should indicate that three tests ran with three failures.
Step5: Now implement the three tests (refer to the code in the solution package).
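The kind of checks these tests make can be sketched against a plain-Java stand-in for the Mapper and Reducer. MRUnit's MapDriver/ReduceDriver classes are not used here, and the names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of what the unit tests verify, using simplified
// stand-ins for the WordCount Mapper and Reducer.
public class WordCountTestSketch {
    // Stand-in mapper: emits a (word, 1) pair for each word in the line.
    public static List<String[]> map(String line) {
        List<String[]> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) out.add(new String[] { w, "1" });
        return out;
    }

    // Stand-in reducer: sums the values seen for one key.
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        // Mapper test: one input line should yield one pair per word.
        List<String[]> pairs = map("cat cat dog");
        assert pairs.size() == 3 && pairs.get(0)[0].equals("cat");
        // Reducer test: key with values [1, 1] should sum to 2.
        assert reduce(List.of(1, 1)) == 2;
        System.out.println("all tests passed");
    }
}
```

In the real lab, MRUnit drivers feed the actual Mapper and Reducer classes one input at a time and assert on the emitted key/value pairs in the same spirit.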
Lab6: Using ToolRunner

Write a MapReduce job that reads any text input and computes the average length of all words that start with each character.
Note: When the case-sensitivity parameter is set to true, the final output should have both upper- and lower-case letters; when set to false, it should have only lower-case letters.
View the output file again, with the option set to true.
Comment out the code that sets the parameter programmatically, and pass the parameter value using -D on the command line.
Note that there will not be any change in the output whether the job is submitted using the programmatic or the runtime approach.
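As a rough illustration of what the job computes, here is a plain-Java sketch of averaging word lengths per starting character, with a boolean standing in for the case-sensitivity parameter. Names are illustrative; the real job would read the flag from its Configuration:

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java analogue of the Average Word Length job: group words by
// their first character and average their lengths. When caseSensitive
// is false, first characters are lower-cased, merging 'N' and 'n'.
public class AvgWordLengthSketch {
    public static Map<Character, Double> averages(String text, boolean caseSensitive) {
        Map<Character, int[]> totals = new TreeMap<>(); // char -> {sumOfLengths, count}
        for (String w : text.split("\\W+")) {
            if (w.isEmpty()) continue;
            char first = caseSensitive ? w.charAt(0) : Character.toLowerCase(w.charAt(0));
            int[] t = totals.computeIfAbsent(first, k -> new int[2]);
            t[0] += w.length();
            t[1]++;
        }
        Map<Character, Double> avg = new TreeMap<>();
        totals.forEach((c, t) -> avg.put(c, (double) t[0] / t[1]));
        return avg;
    }

    public static void main(String[] args) {
        System.out.println(averages("Now is the time", false));
    }
}
```

In the MapReduce version, the Mapper emits (firstCharacter, wordLength) pairs and the Reducer computes each average, but the result is the same.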
Lab7: Running MapReduce in LocalJobRunner Mode

In this exercise, we will practise running a job locally for debugging and testing purposes.
We will run the Average Word Length program again using the LocalJobRunner on
the command line.
Step1: Specify -jt=local to run the job locally instead of submitting it to the cluster, and -fs=file:/// to use the local file system instead of HDFS.
Note: Your input and output files should refer to local files rather than HDFS
files.
Input: Copy shakespoems.txt to /home/training/input
Lab8: Using Counters and Map-Only Job

This job will process a web server’s access log to count the number of times gifs, jpegs and other resources have been retrieved.
For input data, use the Web access log file that we have uploaded in Lab1 to the
HDFS.
Step1: Open the project named ‘counters’ in eclipse by importing from existing
workspace.
Step2: In your driver code, retrieve the values of the counters after the job has
completed and report them using System.out.println
Step5: View the counter information at the end of MapReduce job execution.
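The counting the Mapper performs can be sketched in plain Java, with a map standing in for Hadoop's counters. The counter names and the extension matching below are assumptions about the log format:

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Sketch of the map-only counter job: each log line increments one
// counter according to the requested resource type; no Reducer output
// is produced, only the counters are reported.
public class FileTypeCounterSketch {
    public enum FileType { GIF, JPEG, OTHER }

    public static Map<FileType, Long> count(List<String> logLines) {
        Map<FileType, Long> counters = new EnumMap<>(FileType.class);
        for (FileType t : FileType.values()) counters.put(t, 0L);
        for (String line : logLines) {
            String lower = line.toLowerCase();
            FileType t = lower.contains(".gif") ? FileType.GIF
                       : (lower.contains(".jpg") || lower.contains(".jpeg")) ? FileType.JPEG
                       : FileType.OTHER;
            // Equivalent of context.getCounter(...).increment(1) in the Mapper.
            counters.put(t, counters.get(t) + 1);
        }
        return counters;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "GET /img/logo.gif HTTP/1.0",
            "GET /photos/cat.jpg HTTP/1.0",
            "GET /index.html HTTP/1.0");
        System.out.println(count(lines));
    }
}
```

In the real job the driver retrieves these counter values after completion, as described in Step2 above.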
Lab9: Joining two datasets using Map-Side Join

In this exercise, we will join two datasets: the customer orders dataset (order_custid.txt) and the customer details dataset (cust_data.txt). We need the distributed cache to keep the small dataset (order_custid.txt) in memory.
Input Data: order_custid.txt
781571544 S9876126
781571545 S9876127
781571546 S9876128
781571547 S9876129
781571548 S9876130
781571549 S9876131
781571550 S9876132
781571551 S9876133
781571552 S9876134
Customer Data: cust_data.txt
Expected Output:
781571544 Smith,John (248)-555-9430 jsmith@aol.com S9876126
781571545 Hunter,April (815)-555-3029 april@showers.org S9876127
781571546 Ching,Iris (305)-555-0919 iching@zen.org S9876128
781571547 Doe,John (212)-555-0912 jdoe@morgue.com S9876129
781571548 Jones,Tom (312)-555-3321 tj2342@aol.com S9876130
781571549 Smith,John (607)-555-0023 smith@pocahontas.com S9876131
781571550 Crosby,Dave (405)-555-1516 cros@csny.org S9876132
781571551 Johns,Pam (313)-555-6790 pj@sleepy.com S9876133
781571552 Jetter,Linda (810)-555-8761 netless@earthlink.net S9876134
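The join the Mapper performs against the cached dataset can be sketched in plain Java. The HashMap stands in for the distributed-cache copy of order_custid.txt, and the record layout follows the samples above; method names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the map-side join: the small dataset (order_custid.txt) is
// held in memory, the role the distributed cache plays, and each customer
// record from cust_data.txt is joined against it by customer id.
public class MapSideJoinSketch {
    public static String join(Map<String, String> custIdToOrder, String custRecord) {
        // The first field of the customer record is the customer id.
        String custId = custRecord.split("\\s+")[0];
        String order = custIdToOrder.get(custId);
        // Records with no matching order are dropped (inner join).
        return order == null ? null : custRecord + " " + order;
    }

    public static void main(String[] args) {
        // In-memory copy of the small dataset, as the cache would hold it.
        Map<String, String> orders = new HashMap<>();
        orders.put("781571544", "S9876126");
        // One record from the larger customer dataset.
        String joined = join(orders, "781571544 Smith,John (248)-555-9430 jsmith@aol.com");
        System.out.println(joined);
    }
}
```

Because the lookup happens entirely inside the Mapper, no Reducer (and no shuffle of the large dataset) is needed.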
Lab10: Reduce-Side Join

Using two datasets, customer details (cust_details) and transaction details (transaction_details), we will find the lifetime value of each customer by computing the following:
1. The person’s name along with the frequency of visits by that person.
2. The total amount spent by him/her on purchasing equipment.
Input:
Expected Output
Kristina 8 651.049999
Paige 6 706.970007
Sherri 3 527.589996
Gretchen 5 337.060005
Karen 5 325.150005
Patrick 5 539.380010
Elsie 6 699.550003
Hazel 10 859.419990
Malcolm 6 457.829996
Dolores 6 447.090004
At the Reducer, the reduce-side join operation will yield output like the Expected Output shown above.
$cd /home/training/labs/data/reducejoin
$hdfs dfs -mkdir reduceside
$hdfs dfs -put cust_details reduceside
$hdfs dfs -put transaction_details reduceside
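What the Reducer computes per customer, once the records from both datasets meet at the same key, can be sketched in plain Java. This is an in-memory stand-in, and the (name, amount) record layout is an assumption about the joined data:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the reduce-side aggregation: for each customer name, count
// the visits and sum the amounts spent, as in the Expected Output.
public class LifetimeValueSketch {
    public static Map<String, double[]> aggregate(List<String[]> nameAmountPairs) {
        Map<String, double[]> result = new LinkedHashMap<>(); // name -> {visits, total}
        for (String[] rec : nameAmountPairs) {
            double[] agg = result.computeIfAbsent(rec[0], k -> new double[2]);
            agg[0] += 1;                          // frequency of visits
            agg[1] += Double.parseDouble(rec[1]); // amount spent
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative transactions for one customer (values are made up).
        List<String[]> txns = List.of(
            new String[] { "Sherri", "200.00" },
            new String[] { "Sherri", "127.59" },
            new String[] { "Sherri", "200.00" });
        double[] agg = aggregate(txns).get("Sherri");
        System.out.println((int) agg[0] + " visits, " + agg[1] + " total");
    }
}
```

In the real job, the shuffle delivers all records sharing a customer key to one Reducer call, and this per-key loop is what that call performs.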