
EBUS633

Big Data Analytics for Business

Week 2 Seminar
Hadoop
Hugo Lam
Today's Agenda
1. Use Hadoop in ENG-HHTC
2. Use Pig to Perform Word Count
3. Use Hive to Perform Word Count
4. Use Java to Perform Word Count

1. Use Hadoop in ENG-HHTC

➢ Click Hortonworks Sandbox 2.4 on the Desktop
➢ Select Hortonworks Sandbox with HDP 2.4 and then press Settings
➢ Check and address any Invalid Settings
➢ When no invalid settings are detected, press OK and then press Start
➢ Wait until the following screen appears (it may take a few minutes)
➢ Open a web browser, type 127.0.0.1:8888 in the address bar, and press Enter
➢ Click the URL for Ambari (i.e., http://127.0.0.1:8080)
➢ Sign in to Ambari with Username = maria_dev and Password = maria_dev
➢ Now you can explore the individual components (e.g., HDFS, Hive, Pig) inside the Hadoop ecosystem

2. Use Pig to Perform Word Count

➢ The Word Count Example

➢ Hadoop ecosystem (an incomplete list)

Source: http://www.mssqltips.com

➢ Pig Latin
A dataflow language: allows you to define a data stream and a series of transformations that are applied to the data as it flows through your application.

Sample code to illustrate the data flow sequence:
A = LOAD 'data_file.txt';
...
B = GROUP ... ;
...
C = FILTER ...;
...
DUMP C;
...
STORE C INTO 'Results';

Pig Latin Manual: https://pig.apache.org/docs/r0.17.0/basic.html

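➢ The Word Count Example (Pig Latin)

-- Load each line of the input file as a single string
myinput = LOAD '/user/maria_dev/wordcount/input/input.txt' USING TextLoader AS (line:CHARARRAY);
-- Trim and lowercase the line, strip punctuation and control characters, then emit one word per row
words = FOREACH myinput GENERATE FLATTEN(TOKENIZE(REPLACE(LOWER(TRIM(line)),'[\\p{Punct},\\p{Cntrl}]','')));
-- Collect identical words into one group
wordgroup = GROUP words BY $0;
-- Emit each word with the number of rows in its group
wordcount = FOREACH wordgroup GENERATE $0, COUNT($1);
-- Sort by word (ascending), then by count (descending)
myoutput = ORDER wordcount BY $0 ASC, $1 DESC;
-- Print the result
DUMP myoutput;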
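
For readers more comfortable with Java, the same steps can be tried locally without a cluster. This is only a rough sketch: the class name PigWordCountSketch and the sample lines are made up for illustration, and splitting on whitespace is an assumption standing in for Pig's TOKENIZE, which uses its own default delimiters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class PigWordCountSketch {

    // Mirrors REPLACE(LOWER(TRIM(line)), '[\\p{Punct},\\p{Cntrl}]', ''):
    // trim, lowercase, then delete punctuation and control characters.
    static String normalize(String line) {
        return line.trim()
                .toLowerCase(Locale.ROOT)
                .replaceAll("[\\p{Punct},\\p{Cntrl}]", "");
    }

    // Rough stand-in for TOKENIZE + GROUP + COUNT + ORDER BY word.
    // Assumption: splitting on runs of whitespace approximates TOKENIZE
    // once punctuation has been stripped.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .map(PigWordCountSketch::normalize)
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty())
                // TreeMap keeps the words in ascending order, like ORDER ... BY $0 ASC
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not the course's input.txt
        List<String> lines = Arrays.asList("Hello, Hadoop!", "hello pig");
        System.out.println(countWords(lines)); // prints {hadoop=1, hello=2, pig=1}
    }
}
```

Because each word appears only once after grouping, the second sort key in the Pig script ($1 DESC) never changes the order; the sketch therefore sorts by word only.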

➢ Click HDFS Files to access HDFS
➢ Go to /user/maria_dev
➢ Create a new directory named wordcount
➢ Go inside the wordcount directory and create a new directory named input
➢ Go inside the input directory and upload input.txt (you need to download input.txt from Canvas first)
➢ Click Pig to access Pig
➢ Click +New Script, type test in Name, and click Create
➢ Input the following text and click Execute

myinput = LOAD '/user/maria_dev/wordcount/input/input.txt' USING TextLoader AS (line:CHARARRAY);
words = FOREACH myinput GENERATE FLATTEN(TOKENIZE(REPLACE(LOWER(TRIM(line)),'[\\p{Punct},\\p{Cntrl}]','')));
wordgroup = GROUP words BY $0;
wordcount = FOREACH wordgroup GENERATE $0, COUNT($1);
myoutput = ORDER wordcount BY $0 ASC, $1 DESC;
DUMP myoutput;

➢ View Results

3. Use Hive to Perform Word Count

HiveQL
➢ Hive’s query language.
➢ A SQL-like declarative language.
➢ Enables users familiar with SQL to query the data in Hadoop without learning Java.
➢ Allows custom MapReduce scripts to be plugged into queries.
➢ Hive Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
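
➢ The Word Count Example (HiveQL)

-- Expose the files in the input directory as a one-column table
CREATE EXTERNAL TABLE myinput (line STRING) LOCATION '/user/maria_dev/wordcount/input/';
-- Strip punctuation and control characters, lowercase, split on spaces,
-- then count the occurrences of each word
CREATE TABLE wordcount AS
SELECT word, count(1) AS count
FROM (SELECT EXPLODE(SPLIT(LCASE(REGEXP_REPLACE(line,'[\\p{Punct},\\p{Cntrl}]','')),' ')) AS word FROM myinput) words
GROUP BY word
ORDER BY word ASC, count DESC;
-- Show the result
SELECT * FROM wordcount;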
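
Unlike the Pig script, the HiveQL query does not trim the line and splits on a single space. The sketch below shows what that can mean for the result, assuming SPLIT behaves like Java's regular-expression split that keeps empty tokens (an assumption, not something stated in the slides). The class name SplitOnSpaceSketch and the sample line are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class SplitOnSpaceSketch {

    // Assumption: this mirrors SPLIT(text, ' ') by using Java's regex split
    // with a negative limit, which keeps empty tokens.
    static List<String> splitOnSingleSpace(String text) {
        return Arrays.asList(text.split(" ", -1));
    }

    // Counts the empty tokens produced by leading, trailing or repeated spaces.
    static long emptyTokens(String text) {
        return splitOnSingleSpace(text).stream().filter(String::isEmpty).count();
    }

    public static void main(String[] args) {
        String line = "big  data analytics"; // two spaces after "big"
        System.out.println(splitOnSingleSpace(line)); // prints [big, , data, analytics]
        System.out.println(emptyTokens(line));        // prints 1
    }
}
```

If an empty "word" shows up in the Hive results for your input, this is a likely cause; the Pig and Hive counts may then differ slightly for the same file.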

➢ Click Hive View to access Hive
➢ Input the following text and click Execute

CREATE EXTERNAL TABLE myinput (line STRING) LOCATION '/user/maria_dev/wordcount/input/';
CREATE TABLE wordcount AS
SELECT word, count(1) AS count
FROM (SELECT EXPLODE(SPLIT(LCASE(REGEXP_REPLACE(line,'[\\p{Punct},\\p{Cntrl}]','')),' ')) AS word FROM myinput) words
GROUP BY word
ORDER BY word ASC, count DESC;
SELECT * FROM wordcount;

➢ View Results
➢ Input the following text and click Execute to delete tables myinput and wordcount

DROP TABLE myinput;
DROP TABLE wordcount;

4. Use Java to Perform Word Count

➢ The Word Count Example
✓ The input text file (input.txt) is stored in /user/maria_dev/wordcount/input/
✓ The output result is stored in /user/maria_dev/wordcount/output
✓ The Java source code file (WordCount.java) is stored in /user/maria_dev/wordcount/
✓ The Java Archive file is wordcount.jar
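
WordCount.java itself is not reproduced in these slides. The sketch below is not that file and does not use the Hadoop API; it only imitates, in plain Java, the three phases a MapReduce word count typically goes through (map, shuffle, reduce), so the idea can be tried on any machine. All names (MapReduceSketch, map, shuffle, reduce) and the sample lines are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {

    // Map phase: emit a (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: collect all values emitted for the same key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence on in-memory lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not the course's input.txt
        System.out.println(wordCount(Arrays.asList("big data", "big analytics")));
        // prints {analytics=1, big=2, data=1}
    }
}
```

On a real cluster the map and reduce steps run in parallel on different nodes and the framework performs the shuffle; here everything runs in one process purely for illustration.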

➢ Go to /user/maria_dev/wordcount and upload WordCount.java (you need to download it from Canvas first)
➢ Click the URL under Secure Shell (SSH) Client on the front page (i.e., http://127.0.0.1:4200)
➢ Log in with username = root and password = hadoop (hadoop is the default password; note that on a UNIX system the password is not displayed on the screen as you type it)
➢ Type hadoop again and press Enter
➢ Type bdab633 and press Enter (bdab633 is the new password; all students are advised to use the same new password for easy management)
➢ Type bdab633 again and press Enter
➢ Type ls and press Enter to show the contents of the local filesystem
➢ Copy WordCount.java from HDFS to the local filesystem
hadoop fs -get /user/maria_dev/wordcount/WordCount.java
ls
➢ Compile the WordCount class
mkdir build
javac -cp $(mapred classpath) WordCount.java -d build
ls
➢ Create the Java Archive file as wordcount.jar
jar -cvf wordcount.jar -C build/ .
ls
➢ Run wordcount.jar with input path /user/maria_dev/wordcount/input and output path /user/maria_dev/wordcount/output
hadoop jar wordcount.jar org.myorg.WordCount /user/maria_dev/wordcount/input /user/maria_dev/wordcount/output
➢ View the output result
hadoop fs -cat /user/maria_dev/wordcount/output/*
➢ View the output result in Ambari (/user/maria_dev/wordcount/output/)
➢ Delete the wordcount directory in HDFS
hadoop fs -rm -r /user/maria_dev/wordcount
➢ Delete build, wordcount.jar and WordCount.java in the local filesystem
rm -rf build wordcount.jar WordCount.java
ls

If you need to reset the password for the root account:

1. After you press Start in VirtualBox, press ESC immediately
2. Press ‘e’ to edit
3. Highlight the line that begins with ‘kernel’. Press ‘e’ again to edit
4. At the end of the line, add ‘ single’
5. Press ‘enter’ to make the change and press ‘b’ to boot
6. The system should load into single user mode and you will be left at the command line automatically logged in as root. Type ‘passwd’ to change the root password
7. Type ‘bdab633’ as the new password
8. Type ‘reboot’ to restart into your machine’s normal configuration
