
IBM Software

Information Management

Map/Reduce
Hands-On Lab

Table of Contents

1 Introduction
2 About this Lab
3 Background
4 Create a BigInsights Project
5 Create a new Java Map/Reduce Program
6 Implementing the Java Map/Reduce Program
7 Running the Java Map/Reduce Program
8 Debugging the Java Map/Reduce Program Locally
9 Running the Java Map/Reduce Program on the Cluster
10 Summary


1 Introduction
The BigInsights tools for Eclipse are the development tooling that comes with InfoSphere BigInsights and include wizards,
code generators, context-sensitive help, and a test environment to simplify your development efforts for programs written
in Jaql, Pig, Hive, or Java map/reduce. In addition, you can package a program into a BigInsights application and publish it
to the BigInsights console to make the program available for other users to run.

This lab focuses on the development of Java map/reduce programs and covers all steps: creating a new map/reduce
program, running it locally with a small subset of data, running it against the cluster, and debugging task attempts to find
problems. For creating a BigInsights application and publishing and running it on the cluster, see the BigInsights application
development lab.
The lab can be run in Eclipse on Linux or Windows. For Windows, follow the prerequisites in the install section for the
BigInsights tools for Eclipse in the information center.

2 About this Lab


After completing this hands-on lab, you'll be able to:

Create a BigInsights project

Create a new Java map/reduce program

Run and debug your Java map/reduce program in Eclipse

Run the Java map/reduce program on the cluster, monitor the jobs and see the log output

Debug failed task attempts on the cluster from your Eclipse environment

3 Background

The Java map/reduce program that you will be writing will analyze sales data from a CSV file (Comma-Separated Values)
and count the orders per customer id. The program is kept simple because the lab focuses on how the available tools can
help you during the development and the process of debugging.
The data that the program will be analyzing is a comma-separated file with sales data that contains order and customer
information.

The goal of the program is to find loyal customers who have ordered more than 3 times. A map/reduce program is ideal
for that scenario, especially for a large amount of sales data. The mapper will process the input file line by line, and since
the file is a CSV file, the work can easily be spread across multiple nodes. For each line, the mapper will find the customer
id, which is the third field in the file (CUST_CODE), and create an entry with the customer id as the key and the value 1 to
indicate that the customer has placed an order.
The reducer will take the output of all the mappers and add up the orders by counting how many entries exist for a
particular customer id.


The file we will analyze contains standard comma-separated data of customer sales.

CUST_ORDER_NUMBER,CUST_ORDER_DATE,CUST_CODE,CUST_CC_ID,PAY_METHOD_CODE,
166517,"2006-09-07-08.43.15.017000",125412,35582,29,5,3,1,1
167083,"2006-10-28-07.50.14.017000",125412,35803,22,5,3,2,2,
167336,"2006-10-15-19.20.32.017000",125412,35903,28,5,3,2,2,

As you can see, the first column contains the customer order number, followed by the order date and the customer code.
We will filter the data in the map task, splitting each row into its fields and extracting the customer id for each row.
We will then send the customer ids to the reducer. The reducer will compute a total count per customer id and then
remove all customer ids that have fewer than 4 orders. This returns the repeat customers.

In total, our MapReduce program will work very similarly to the following SQL query:

SELECT cust_code, count(*) AS num FROM data_small
GROUP BY cust_code HAVING num > 3;

As we explained in the presentation, map tasks normally execute per-row computations like WHERE conditions and scalar
functions, while the shuffle phase handles joins and groupings. Finally, the reducer executes tasks like aggregations and
HAVING clauses.

4 Create a BigInsights Project

1. From your desktop, launch BigInsights Studio from its icon. Note: The Eclipse IDE has been pre-loaded with
BigInsights tooling for this lab.

When prompted for a workspace, browse to /home/biadmin/labs/mapreduce/workspace.

2. Open the BigInsights perspective. Go to Window -> Open Perspective -> Other and select BigInsights.


3. To create a new project, go to File -> New -> BigInsights Project. If the BigInsights Project menu is not visible,
you can also go to File -> New -> Project and select BigInsights Project from the BigInsights folder.
Enter Sales as the project name and select Finish.

4. The project Sales will appear in the Project Explorer view. Click the arrow next to Sales to expand the project.
You will see a src folder where all your Java source files will be stored. You'll also see two library entries: JRE
System Library, which points to the Java runtime used to compile your project files, and BigInsights v2.0.0.0, which
includes all the BigInsights libraries that you will need for development against the BigInsights cluster. This includes
the Hadoop libraries that are needed to compile Java map/reduce programs. If you use a different version of
BigInsights, the BigInsights library entry will show a version other than 2.0.0.0, and the libraries underneath will
have different versions based on what is bundled with that specific version of the BigInsights platform.


5 Create a new Java Map/Reduce Program


A Java map/reduce program consists of a mapper, a reducer, and a driver class that provides the configuration and runs the
program. While the mapper, reducer, and driver methods can all be stored in one class that extends the required interfaces, it
is good design to keep them in separate classes. The BigInsights tools for Eclipse provide a wizard that creates templates
for all three classes.

1. In the BigInsights perspective, select File -> New -> Java MapReduce program.

If the Java MapReduce menu is not visible, you can also select File -> New -> Other and select Java
MapReduce program from the BigInsights folder.


2. A new wizard opens with three pages covering the details for a mapper and a reducer implementation as well as a
driver class. For both the mapper and the reducer we have to specify the input and output key and value types.

First, set the following values for the mapper class.


Package : myPackage
Name : SalesMapper

3. For this program, the input will be a CSV file and the mapper will be called with one row of data at a time. The input
key is the byte offset of each row, which lets us check whether we are on the header row, so we specify the input
key type as LongWritable; the value is one row of text, so the input value type is Text.
The customer id will be the key for the output of the mapper, so the output key type is Text. The output value is the
number of orders contributed by each row (the value 1), so it should be an IntWritable.

Therefore, fill in the following types for input and output keys/values.
Type of input keys : org.apache.hadoop.io.LongWritable
Type of input values : org.apache.hadoop.io.Text
Type of output keys : org.apache.hadoop.io.Text
Type of output values : org.apache.hadoop.io.IntWritable

The first page of the wizard should look like the figure below.


4. Click Next.

5. On the next page, we have to specify the values for the reducer implementation. The package is pre-filled with
the same package as the mapper, but you can change that value if you'd like.

Set the name of the reducer as below.


Name : SalesReducer

6. The types for the input keys and values are copied from the output keys and values from the mapper class
definition because the output of the mapper is processed as input by the reducer.


For this program, the reducer is aggregating the number of orders for each customer id. Hence we set the type of
the output keys to Text and the type of the output values to IntWritable.

Type of output keys : org.apache.hadoop.io.Text


Type of output values : org.apache.hadoop.io.IntWritable

The wizard page for the reducer should look like this.

7. Click Next.

8. On the last page of the wizard we only have to select a package and specify the name of the driver class.
Set the package and name to the following values.


Package : myPackage
Name : SalesDriver

The page should look like the figure below.

9. Click Finish.

10. After clicking Finish, the wizard will create templates for the three classes and open them in a Java editor.

The mapper and reducer classes extend the Hadoop Mapper and Reducer base classes, and it's up to us to
implement their methods.

The following are the templates created by the wizard.


The driver class creates a new Hadoop job configuration and sets the mapper and reducer classes accordingly.

Mapper:
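
A minimal sketch of roughly what the generated SalesMapper template looks like, assuming the standard
org.apache.hadoop.mapreduce API; the code generated by the wizard may differ slightly:

package myPackage;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SalesMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // TODO: implement the map logic for one input record
    }
}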


Reducer:
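
Similarly, a minimal sketch of the generated SalesReducer template, under the same assumption; the generated code
may differ slightly:

package myPackage;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SalesReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // TODO: implement the reduce logic for one key and its values
    }
}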

6 Implementing the Java Map/Reduce Program


Now that we have created the templates for the map/reduce program, we can use the Java editor to implement the
mapper and reducer classes and make a few changes to the driver class. Since the tutorial focuses on the tooling aspect
and not on the details of implementing a mapper and reducer, we'll just provide the implementation for the relevant
methods. Copy the code into the generated templates for each class.

1. SalesDriver.java:
The only change we have to make to the driver class is to read the input and output paths from the program
arguments so that the values can be specified at runtime. Replace lines 29 and 31 of the template, where the TODO
comments are, with the following code.
job.setOutputValueClass(IntWritable.class);

// TODO: Update the input path for the location of the inputs of the job
FileInputFormat.addInputPath(job, new Path(programArgs[0]));
// TODO: Update the output path for the output directory of the job
FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));
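
For context, the surrounding driver code looks roughly like the sketch below. The programArgs variable is assumed
here to come from GenericOptionsParser, a common pattern in Hadoop drivers; the actual template generated by the
wizard may differ in detail.

Configuration conf = new Configuration();
String[] programArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

Job job = new Job(conf, "SalesDriver");
job.setJarByClass(SalesDriver.class);
job.setMapperClass(SalesMapper.class);
job.setReducerClass(SalesReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// Input and output paths come from the program arguments at runtime
FileInputFormat.addInputPath(job, new Path(programArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));

System.exit(job.waitForCompletion(true) ? 0 : 1);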

2. SalesMapper.java:
Enter the following code into the map method.


public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

    if (key.get() > 0) {
        String line = value.toString();
        String[] elements = line.split(",");

        // customer id is the third field
        String custId = elements[2];
        context.write(new Text(custId), new IntWritable(1));
    }
}

This code will take each input line of text and split it into its elements, in our case comma-separated values. Since the
customer id is the third element, we then retrieve it with elements[2] (arrays in Java start at 0). Finally, we output
the customer id and the number 1 as input for the reducer. The shuffle phase will automatically take all output values
with the same customer id, send them to the same reducer, and collect all the 1 values together for that key.

You may be wondering what happens with the first row of the data file, which contains the column names. As
described in the presentation, the default TextInputFormat class will split input files into rows and use as the
key the byte offset of each row from the start of the file. For the first row the key is therefore 0, and since we
first filter the rows with key.get() > 0, the first row is effectively ignored.
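
For illustration, the mapper is invoked with key/value pairs roughly like the following (the byte offsets shown here
are hypothetical):

(0,   CUST_ORDER_NUMBER,CUST_ORDER_DATE,CUST_CODE,...)        key 0: header row, skipped
(74,  166517,"2006-09-07-08.43.15.017000",125412,...)          emits (125412, 1)
(151, 167083,"2006-10-28-07.50.14.017000",125412,...)          emits (125412, 1)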

3. SalesReducer.java:
Enter the following code into the reduce method.

public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {

    int count = 0;
    for (IntWritable value : values) {
        count++;
    }
    if (count > 3) {
        context.write(key, new IntWritable(count));
    }
}

In the reducer we receive each customer id (as the key) and an Iterable of input values, essentially a sequence of 1s,
one for each row with that customer id. We first count up the number of ones and then filter out all customer ids with
fewer than 4 orders. The remaining customer ids are sent to the output file by writing them into the context together with
the number of times they occur.


7 Running the Java Map/Reduce Program


Now that we have implemented the program, we can run it. The tooling provides the capability to run the
Java map/reduce program locally or remotely on the cluster. First we will run the program locally.

1. From the Run menu, select Run Configurations

2. A new Run configurations dialog opens up. Double-click on the entry Java MapReduce in the tree on the left
hand side to create a new empty run configuration.

3. In the Main tab, select the project Sales, and select the main class myPackage.SalesDriver. You can also give
the run configuration a more meaningful name in the Name field, but put SalesDriver for now.


4. In the BigInsights section, we can select whether we want to run against the cluster (the default) or whether we
want to run locally. For now we want to run on our local Eclipse machine, so we select Local.

5. Leave the Job name blank for now since the job name is an optional parameter. You can set it to override the
job name that Hadoop automatically assigns when running a job.


6. Our program requires two arguments: a value for the input data (a directory or a specific file) and a
folder where the output will be stored. Since we want to run locally, you should specify locations in your local file
system. As the input file, use the location of the file data_small.csv on your local system.

Enter the following values for Job arguments:

/home/biadmin/labs/mapreduce/data_small.csv /home/biadmin/labs/mapreduce/output

7. The delete directory is set to the same value as the output directory of the program. If the output directory already
exists when the program runs, Hadoop will throw an error that the directory already exists. Instead of deleting it
manually every time before running the program, we can simply set the delete directory, which comes in handy
when developing a new program or debugging.

Thus, enter the directory below for Delete directory:

/home/biadmin/labs/mapreduce/output

8. With the values filled in, the configuration should look like the figure below.

9. In addition to the settings in the Main tab, we'll also have to make a few changes in the JAR Settings tab. Click on
the JAR Settings tab to open it.


10. In order to run the map/reduce program, it will be handed to Hadoop as a JAR file. In the JAR Settings tab you
have the option to specify an existing JAR file or to let the JAR be rebuilt every time you run the program from the
source in your BigInsights project.

Click on Browse to open a file selection dialog, keep the default location (which is the location of your BigInsights
project), and type in the name salesprogram. Click OK to close the Browse dialog.

11. Check the checkbox Rebuild the JAR file on each run. This ensures that the JAR is rebuilt with the latest
source files every time you run the program.

12. Expand the Sales tree below and select the src folder to include all Java files.

The page should look like this and the error indicator next to the JAR settings name should disappear.


13. In the Additional JAR files list, you can add additional JAR files that your program may need; they will be
bundled along with the JAR and added to the classpath when Hadoop runs your program. Our sample program
doesn't require any additional JAR files, so we keep the list empty.

14. Click Run to run the program. Since we are running locally, you will see a console window with the output from
Hadoop.

15. When the job is completed, you'll see a line that indicates that both map and reduce are 100% done.


16. To see the output of the program, navigate with the file browser to the output directory that you specified in the run
configuration. You'll see two files: a success marker from Hadoop (typically named _SUCCESS) and a part file
(typically part-r-00000) that contains the output of your program. If you open the part file, you'll see that there is
one customer who has ordered more than 3 times.

8 Debugging the Java Map/Reduce Program Locally


During development of the program, you'll probably have to debug the program as well. When we run the program locally
and not on the cluster, we can debug it just like any other Java program.

First we want to set a couple of breakpoints to suspend the program execution at certain points.

1. In the editor, open the file SalesMapper.java. Set a breakpoint at the line

String line = value.toString();

by double-clicking in the left margin of the editor next to that line.


2. In the editor, open the file SalesReducer.java and set a breakpoint at the line
int count = 0;

3. (Optional) To debug the reducer only for a particular customer id, we can set a breakpoint property to only
suspend the program at the breakpoint if a certain condition is true. For example, we can check the input key of
the reducer (which is the customer id) and check for its value. To set breakpoint properties, right-click on the
breakpoint and select Breakpoint Properties from the context menu.


4. In the dialog, check Conditional and type in a condition. You can use the content assist to ensure that you type in
valid Java syntax. In our example, let's only stop in the reducer if our customer id is 125412:
key.toString().equals("125412");

5. Click OK to set the breakpoint properties.

6. To launch the program in debug mode, we can re-use the same run configuration as before. Click Run -> Debug
Configurations and select the previously created run configuration under the Java MapReduce entry in the tree.

7. Click Debug to start the program in debug mode.


8. Switch to the Debug perspective if prompted for it.

9. The program will be suspended in the SalesMapper class. You will see the stack trace in the Debug view, the
variables and their values in the Variables view, and the current line in the editor window.

10. Step through the code to see how the variables change. You can use either the navigation icons in the Debug
view or the shortcut keys F5 to F8. You will notice that the program is suspended for every row in the mapper class.

11. Disable or remove the breakpoint in SalesMapper.java to be able to continue the program without interruption and
proceed to the reducer class.
12. Since we set the breakpoint properties in the reducer class, the program is only suspended if the key value is
125412.

Using conditional breakpoint properties is very useful when dealing with a large number of input rows.

9 Running the Java Map/Reduce Program on the Cluster


Now that we have run and debugged the program in Eclipse locally, we want to test the program on the cluster as well. To
run your program on the cluster, you need to define a connection to an existing BigInsights cluster first.

1. In the BigInsights Perspective, click on the Add a new server icon in the BigInsights servers view (bottom-left
corner).


2. When prompted, specify the URL for your BigInsights web console as well as user id and password. The server
name will self-populate once you enter a value into the URL field, but you can change the value:

URL : http://bigdata:8080/data/html/index.html
User ID : biadmin
Password : passw0rd

3. Click Finish to register the BigInsights server in the BigInsights Servers view.

To run on the cluster, we still need a run configuration and we can use the one we created for the local run as the
base to create a new configuration.

4. Since we will now run on the cluster, we first need to upload the data files to the HDFS file system. Switch to a
terminal window.

5. Make sure you are in the mapreduce lab directory, or change to it with:

[biadmin@bigdata labs]$ cd /home/biadmin/labs/mapreduce/

6. Create a folder called mapreduce in the directory /user/biadmin on HDFS:


[biadmin@bigdata mapreduce]$ hadoop fs -mkdir mapreduce

7. Upload the csv files to this directory

[biadmin@bigdata mapreduce]$ hadoop fs -put *.csv mapreduce

8. Make sure the files have been uploaded correctly:

[biadmin@bigdata mapreduce]$ hadoop fs -ls /user/biadmin/mapreduce

You should see the three csv files in HDFS:

[biadmin@bigdata mapreduce]$ hadoop fs -ls /user/biadmin/mapreduce


Found 3 items
-rw-r--r-- 1 biadmin supergroup 5869221 2013-05-18 13:44
/user/biadmin/mapreduce/data_error.csv
-rw-r--r-- 1 biadmin supergroup 5869361 2013-05-18 13:44
/user/biadmin/mapreduce/data_full.csv
-rw-r--r-- 1 biadmin supergroup 60998 2013-05-18 13:44
/user/biadmin/mapreduce/data_small.csv

9. Switch back to Eclipse

10. Click on Run -> Run Configurations and select the previously created run configuration. Because we have to
change the locations for input and output directories to reflect the locations in the Hadoop file system, we will
create a copy of the run configuration first so that we don't have to fill out everything again and can easily switch
between local and cluster mode without changing our program parameters.

11. In the run configuration dialog, click the Duplicate button to create a copy of the existing run configuration and
give the copy a new name.


12. In the duplicated configuration, set the execution mode to Cluster and select the BigInsights server you just registered.

13. In the job arguments section, we need to update the input and output locations to paths in HDFS. This time we want
to use the input file data_error.csv. Your user id needs write access to the output directory in HDFS, which
biadmin has. Use the same value for the delete directory as for the output directory.

Be careful with the paths. The user directory on HDFS is /user/biadmin, which is different from /home/biadmin on
the Linux file system. Also make sure to remove the labs directory from the path (see the example below).
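
For example, assuming the files were uploaded to /user/biadmin/mapreduce as in the steps above, the values would
look something like this (the output path is just one possible choice):

Job arguments: /user/biadmin/mapreduce/data_error.csv /user/biadmin/mapreduce/output
Delete directory: /user/biadmin/mapreduce/output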

14. Click Run to run the program on the cluster.


15. When the program is run on the cluster, the JAR of the program will be copied to the cluster and run there using
the Hadoop command. A dialog with progress information will pop up in Eclipse, and when the program has been
successfully submitted on the server, you will see the job id information in the Console window of your Eclipse
workspace:
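
Under the covers, submitting the job corresponds roughly to running the generated JAR with the hadoop command on
the cluster, along the lines of the following sketch (the JAR name and exact invocation assembled by the tooling may
differ):

hadoop jar salesprogram.jar myPackage.SalesDriver /user/biadmin/mapreduce/data_error.csv /user/biadmin/mapreduce/output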

To see the details of your job, you need to go to the BigInsights console.

16. Open a web browser and go to the URL of the BigInsights console. Log on with your user ID and password.

17. Click on the Application Status page and then click on the Jobs sub-link.

The jobs page shows the job status information for any Hadoop job and is the starting point to drill down into more
job details.
In our case, the job failed and we want to investigate what caused the job to fail.

18. To drill down into the job details, click on the row of the job. At the bottom of the page, the breakdown of setup, map,
reduce, and cleanup tasks is shown in a table, and you can see that the map task failed: it was killed and was not
successful.

19. Click on the map row to drill down into more details. Keep clicking on the task attempts until you reach the page
with the log files. To have more space to see the log file content, click the Window icon on the right:


On the log file page, you'll see the stdout, stderr and syslog output from the job.

In our case, you can see that there was an ArrayIndexOutOfBoundsException in the SalesMapper.java class
at line 20.
We could now set up a remote debugging task that utilizes the IsolationRunner helper, but that would go
beyond the scope of this lab. If we look at the line in the code, we can guess what the problem is. If any of the lines
in the data file do not have at least 3 values, this line will fail. We should therefore make our program more robust by
checking for these eventualities, either by ignoring incomplete rows (for example, by catching the
ArrayIndexOutOfBoundsException) or by shutting down gracefully.

String custId = elements[2];
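
As a sketch of the first option, the map method could check the number of fields before accessing the customer id and
simply skip malformed rows (one possible approach, not the only one):

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

    if (key.get() > 0) {
        String[] elements = value.toString().split(",");

        // Skip rows that do not contain at least the first three fields
        if (elements.length < 3) {
            return;
        }

        String custId = elements[2];
        context.write(new Text(custId), new IntWritable(1));
    }
}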

10 Summary
To simplify the development of Big Data programs, BigInsights Enterprise Edition provides Eclipse tooling that features
wizards, code generators, context-sensitive help, and a test environment. This lab provided a quick tour of
developing Java map/reduce programs and showed how you can create, run, and debug these programs in
your local Eclipse environment or on the cluster.


Copyright IBM Corporation 2012


All Rights Reserved.

IBM Canada
8200 Warden Avenue
Markham, ON
L6G 1C7
Canada

IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered
trademarks of International Business Machines Corporation in the
United States, other countries, or both. If these and other
IBM trademarked terms are marked on their first occurrence in this
information with a trademark symbol (® or ™), these symbols indicate
U.S. registered or common law trademarks owned by IBM at the time
this information was published. Such trademarks may also be
registered or common law trademarks in other countries. A current list
of IBM trademarks is available on the Web at Copyright and
trademark information at ibm.com/legal/copytrade.shtml

Other company, product and service names may be trademarks or
service marks of others.

References in this publication to IBM products and services do not
imply that IBM intends to make them available in all countries in which
IBM operates.

No part of this document may be reproduced or transmitted in any form
without written permission from IBM Corporation.

Product data has been reviewed for accuracy as of the date of initial
publication. Product data is subject to change without notice. Any
statements regarding IBM's future direction and intent are subject to
change or withdrawal without notice, and represent goals and
objectives only.

THE INFORMATION PROVIDED IN THIS DOCUMENT IS
DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER
EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE OR NON-INFRINGEMENT.

IBM products are warranted according to the terms and conditions of
the agreements (e.g. IBM Customer Agreement, Statement of Limited
Warranty, International Program License Agreement, etc.) under which
they are provided.

