
RUNNING A PIG PROGRAM ON THE CDH SINGLE-NODE CLUSTER ON AN AWS EC2 INSTANCE


PREREQUISITES

● Ensure that you have installed the WinSCP tool on your Windows machine.

IMPORTANT INSTRUCTIONS

● The following notations have been used while running the commands in this document:
[ec2-user@ip-10-0-0-14 ~]$ hadoop command
Output of the command
As shown above, the command to be run is written in bold, and the output of the
command is written in italics. The prompt [ec2-user@ip-10-0-0-14 ~] indicates the user
through which the command is to be executed.

● Whenever you want to access a file whether to execute the code or to push it onto the
HDFS, you need to either be in the directory of the file or specify the relative path to the
file.
● Be careful with the spaces in the commands.
● If a series of commands is given in a particular order, make sure that you run them in
the same order.

NOTE: Before starting with the document below, you must first create the EC2
instance with Cloudera installed. Go through Video 1 before getting
started with this document.

In this document, we are running the wordcount program on the ‘dropbox-policy.txt’ file
using PIG.
Steps to Run a PIG Program on the CDH Single-Node Cluster on an
AWS EC2 Instance

1. Start CDH EC2 instance from the AWS console and wait until the instance state changes
to ​‘running’​.

2. Copy the public IP address from the EC2 dashboard.

3. Go to your browser and open Cloudera Manager. To access the Cloudera Manager page,
enter your public IP address followed by ‘:7180’, as shown below:
<your public IP>:7180
Username: admin
Password: admin
4. After logging in to Cloudera Manager, click on ‘Cloudera Management Service’
followed by ‘Restart’.
5. After the restart, click on ‘Close’.

6. Wait until all the services turn green, as shown in the image below.

7. Whenever a MapReduce or PIG program is to be run on the AWS EC2 instance, we need
to visit the YARN Configuration page and edit the following properties:
● yarn.scheduler.maximum-allocation-mb
● yarn.nodemanager.resource.memory-mb

8. To edit them, follow the instructions below:


a. Click on ​‘YARN (MR2 Included)​​’, after which the following page appears:
b. Click on ‘​Configuration​​’.

c. Now, we have to increase the RAM allotted to YARN. To do so, enter each of the
following properties in the ‘Search’ field and increase its value as mentioned
below:
i. yarn.scheduler.maximum-allocation-mb
The default value is 1 GB. Change it to 8 GB and then click on ‘Save
Changes’.
ii. yarn.nodemanager.resource.memory-mb
The default value is 1 GB. Change it to 10 GB and then click on ‘Save
Changes’.
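For reference, these are standard YARN property names, and the same two settings would look as follows in a yarn-site.xml file (values are given in MB, so 8 GB is 8192 and 10 GB is 10240). On CDH, Cloudera Manager manages this file for you, so make the change through the UI as described above; this fragment is illustrative only.

```xml
<!-- Illustrative only: on CDH, Cloudera Manager writes this file for you -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>   <!-- 8 GB: the largest single container YARN may allocate -->
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>10240</value>  <!-- 10 GB: total memory the NodeManager may hand out -->
</property>
```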
d. Restart the YARN service by clicking on the ‘Stale Configuration: Restart needed’
icon as shown in the image below:

e. Now, click on ‘Restart Stale Services’.


f. Click on ‘​Restart Now​’.

g. It generally takes a few minutes to restart the service. Wait until both the
green ticks appear. Now, click on ‘​Finish​’.
h. Wait until the YARN service turns green.

8. Next, we have to connect to the EC2 instance using PuTTY. Once connected, we log
in as the ec2-user. We then need to switch to the root user using the sudo -i
command.
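If you are on Mac/Linux rather than Windows, the equivalent of the PuTTY login is a plain ssh command. A minimal sketch, in which the key path and IP are placeholders that you must replace with your own .pem file and the public IP from the EC2 dashboard:

```shell
# Placeholders: substitute your own .pem path and your instance's public IP
KEY="$HOME/keys/mykey.pem"
IP="203.0.113.10"

# Echo the expanded command first to check it; remove 'echo' to actually connect
echo ssh -i "$KEY" "ec2-user@$IP"

# Once logged in as ec2-user, switch to the root user with:
#   sudo -i
```

Echoing the command before running it is a simple way to confirm the key path and IP expanded as you intended.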

9. PIG is installed in Cloudera by default. To verify, type ‘pig’ as shown below:
[root@ip-10-0-0-105 ~]# ​pig
10. Now, the grunt shell opens in the terminal, where you can run PIG code.
Use quit to exit from the grunt shell.

11. Now, we have to run a program as per the instructions in the video.
First, we have to create a directory named ‘Pig’. We will use the ‘mkdir’ command to
do so:
[root@ip-10-0-0-105 ~]# ​mkdir Pig

12. Download the files ‘dropbox-policy.txt’ and ‘count-words.pig’.

13. Now, we need to copy the files downloaded above from our local machine to the
EC2 instance.
a. Mac/Linux users:
Use the following command to copy a file from your local system to the EC2
instance.
scp -i <path of .pem> <path of the file in your local system>
ec2-user@<public IP>:<destination path in the EC2 instance>
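A concrete sketch of this scp invocation, using shell variables so you can check the expanded command before running it. Every value below is a placeholder; substitute your own key path, file path, and public IP:

```shell
# Placeholders: substitute your own values for all four variables
KEY="$HOME/keys/mykey.pem"
FILE="$HOME/Downloads/count-words.pig"
IP="203.0.113.10"
DEST="/home/ec2-user/"

# Echo first to verify the expansion; remove 'echo' to actually copy the file
echo scp -i "$KEY" "$FILE" "ec2-user@$IP:$DEST"
```

Run the same command once per file (here, for both ‘count-words.pig’ and ‘dropbox-policy.txt’).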
b. Windows users:
Move the downloaded files to a folder named ‘Pig_Data’ on the desktop of
your Windows machine. (This is not strictly necessary; you only need to
remember where you have stored the downloaded files.)
WinSCP is a tool to transfer files from a Windows machine to a Linux
machine (here, the EC2 instance).
i. Open WinSCP.

ii. Enter the following credentials:

Hostname​: Provide the public IP from the EC2 dashboard.

Username​: ​ec2-user
iii. Then, click on ‘​Advanced​’.
iv. After clicking on ‘​Authentication​’​,​ enter the path of your PPK file.

v. Click ‘​OK​’ followed by ‘​Login​’. Click ‘​Yes​’ on the pop-up that appears.
vi. Now, the following screen appears.
On the left-hand side of the screen, your local Windows machine is
displayed. On the right-hand side, your AWS EC2 instance is displayed.

vii. Browse to the folder where you have stored the downloaded files. In
our case, it was a folder named ‘Pig_Data’ on the desktop.

viii. Drag and drop the files ‘dropbox-policy.txt’ and ‘count-words.pig’
from the left-hand side of the screen to the right-hand side. Click ‘OK’
on the prompt that appears. Ensure that you are in the
/home/ec2-user directory on the right-hand side.
ix. We have now successfully copied the files ‘dropbox-policy.txt’ and
‘count-words.pig’ from our local machine to our EC2 instance.

14. Now, go back to the EC2 instance. We need to copy the files ‘count-words.pig’ and
‘dropbox-policy.txt’ from ​/home/ec2-user/ ​to ​/root/Pig/​ using the following
commands:
[root@ip-10-0-0-105 ~]#​ cp /home/ec2-user/count-words.pig /root/Pig/
[root@ip-10-0-0-105 ~]# ​cp /home/ec2-user/dropbox-policy.txt /root/Pig/

15. Verify whether the files have been copied to the Pig directory or not. To do so,
change the working directory to ‘Pig’ using the ‘cd’​ ​command followed by ‘ls’, and
check whether the files ‘count-words.pig’ and ‘dropbox-policy.txt’ exist.
[root@ip-10-0-0-105 ~]#​ cd Pig
[root@ip-10-0-0-105 Pig]# ​ls

16. Come out of the Pig directory using the ‘cd’ command.
[root@ip-10-0-0-105 Pig]# ​cd
[root@ip-10-0-0-105 ~]#
Creating a Directory Inside the HDFS and Changing its Owner
17. The commands used below demonstrate how to create a directory in the HDFS.
Note: A directory can be created here only by the ‘hdfs’ user. So now, switch
to the hdfs user. Note that there are spaces around the ‘-’ in the command
used below:
[root@ip-10-0-0-105 ~]# su - hdfs

[hdfs@ip-10-0-0-105 ~]$ ​hadoop fs -mkdir /user/root/

18. You can verify the directory that was created by the command, as shown below:

[hdfs@ip-10-0-0-105 ~]$ ​hadoop fs -ls /user/

Found 6 items

drwxrwxrwx - mapred hadoop 0 2018-02-09 09:28 /user/history

drwxrwxr-t - hive hive 0 2018-02-09 09:30 /user/hive

drwxrwxr-x - hue hue 0 2018-02-09 09:30 /user/hue

drwxrwxr-x - oozie oozie 0 2018-02-09 09:30 /user/oozie

drwxr-xr-x -​ hdfs ​ supergroup 0 2018-02-12 05:58 /user/root

drwxr-x--x - spark spark 0 2018-02-09 09:29 /user/spark

Now, as seen above, the owner of the /user/root directory is ‘hdfs’ (underlined above). To
send a file from any user to HDFS, the owner of the target directory inside HDFS should be
changed to the user sending the file. For example, if you have to send a file from the root
user to a directory inside HDFS, the owner of that particular directory should be
changed to root.
19. To change the owner of the directory created from hdfs to root, run the following
command:
​[hdfs@ip-10-0-0-105 ~]$ ​hadoop fs -chown root /user/root

20. You can verify whether or not the owner has been changed using the command
shown below:
[hdfs@ip-10-0-0-105 ~]$​ ​hadoop fs -ls /user/

Found 6 items
drwxrwxrwx - mapred hadoop 0 2018-02-18 07:16 /user/history
drwxrwxr-t - hive hive 0 2018-02-18 07:17 /user/hive
drwxrwxr-x - hue hue 0 2018-02-18 07:18 /user/hue
drwxrwxr-x - oozie oozie 0 2018-02-18 07:18 /user/oozie
drwxr-xr-x - ​root​ supergroup 0 2018-02-18 09:49 /user/root
drwxr-x--x - spark spark 0 2018-02-18 07:17 /user/spark

You can see that the owner has changed from ​hdfs​ to​ root​.
21. Now, use the ‘exit’ command to switch from the hdfs user back to the root user.
[hdfs@ip-10-0-0-105 ~]$ exit
logout
[root@ip-10-0-0-105 ~]#

22. Now, use the ‘put’ command to send the ‘dropbox-policy.txt’ file into the hdfs
/user/root directory.
Syntax: hadoop fs -put <source> <destination>
[root@ip-10-0-0-105 ~]# ​hadoop fs -put /root/Pig/dropbox-policy.txt /user/root

23. Verify the same using the ‘ls’ command.


[root@ip-10-0-0-105 ~]# hadoop fs -ls /user/root
Found 1 items
-rw-r--r-- 3 root supergroup 15048 2018-02-23 08:54
/user/root/dropbox-policy.txt
[root@ip-10-0-0-105 ~]#
24. Now, navigate to the Pig directory using the ‘cd’ command. After executing the cd
command, use the ‘cat’ command to view the code.
[root@ip-10-0-0-105 ~]# cd Pig
[root@ip-10-0-0-105 Pig]# cat count-words.pig
-- File: count-words.pig
-- Description: Count the number of words in a text file
file = LOAD 'dropbox-policy.txt' AS (line);
words = FOREACH file GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counted = FOREACH grouped GENERATE group, COUNT(words);
sorted_counted = ORDER counted BY $1;
-- DUMP sorted_counted;
STORE sorted_counted INTO '/user/cloudera/new-output.out';
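To see what this script computes without a cluster, the same tokenize, group, count, and sort pipeline can be sketched in plain shell on a tiny sample file (the sample file name and its contents below are invented for illustration):

```shell
# Make a small sample file standing in for dropbox-policy.txt (invented content)
printf 'we value your privacy\nwe value your data\n' > sample.txt

# One word per line (like TOKENIZE/FLATTEN), count identical words
# (like GROUP + COUNT), then sort ascending by count (like ORDER BY $1)
tr -s ' ' '\n' < sample.txt | sort | uniq -c | sort -n
```

As with the Pig script's ascending ORDER BY $1, the most frequent words appear last in the output.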

25. Now, we need to edit the count-words.pig file. We need to change the load and store
paths according to the HDFS.

a. In count-words.pig, instead of LOAD 'dropbox-policy.txt', we will write
LOAD '/user/root/dropbox-policy.txt'.
b. In count-words.pig, instead of STORE sorted_counted INTO
'/user/cloudera/new-output.out', we will write STORE sorted_counted INTO
'/user/root/new-output.out'.
To make the above changes, we will open the ‘count-words.pig’ file using the vi
editor. We will then move into the insert mode of vi using ‘​i​’ and make the
aforementioned changes.

[root@ip-10-0-0-105 Pig]# ​vi count-words.pig

After making the changes, we will switch to the command mode and then use ​:wq!
to save and exit the file in the vi editor.
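If you prefer not to use vi, the same two path edits can be made non-interactively with sed. The sketch below recreates only the two affected lines of the script so that it is self-contained; on the instance you would run just the two sed commands against the real file. Note that the in-place -i flag shown here is the GNU sed form (on macOS, sed requires -i '').

```shell
# Recreate the two lines that need editing (content taken from the script above)
printf "file = LOAD 'dropbox-policy.txt' AS (line);\nSTORE sorted_counted INTO '/user/cloudera/new-output.out';\n" > count-words.pig

# Edit 1: point LOAD at the copy of the file inside HDFS
sed -i "s|LOAD 'dropbox-policy.txt'|LOAD '/user/root/dropbox-policy.txt'|" count-words.pig

# Edit 2: point STORE at /user/root instead of /user/cloudera
sed -i "s|/user/cloudera/new-output.out|/user/root/new-output.out|" count-words.pig

# Show the edited result
cat count-words.pig
```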
26. Verify the changes using the ‘cat’ command.

27. Now, we will run the code using the ‘​pig count-words.pig’ ​command and check the
output as specified in the store command in the code file.
[root@ip-10-0-0-105 Pig]# ​pig count-words.pig

The output is as follows:

28. To verify whether the code has run successfully or not, go to the Resource Manager.
Access the Resource Manager using your public IP followed by ‘:8088’, as shown
below:
<Public IP>​:8088
Username​: admin
Password​: admin
Now, check whether the PIG program shows ‘SUCCEEDED’ in the FinalStatus column
or not.

29. Now, to verify the output, we need to first access Hue.


Access Hue using your public IP followed by ‘:8888’, as shown below:
<Public IP>​:8888
Username​: admin
Password​: admin
30. Then, click on ‘​Files​’.

31. After that, click on ‘​root​’.

32. Next, click on ‘​new-output.out​’.

33. Click on ‘​part-r-00000​’.

34. Now, you can see the output as shown below:
