Big Data & Management
Practical File
HADOOP
1. Introduction to HADOOP
4. Veracity: The term Veracity refers to inconsistent or incomplete data, which results in the generation of doubtful or uncertain information. Data inconsistency often arises because of the volume or amount of data, e.g. data in bulk can create confusion, whereas too little data can convey only half or incomplete information.
5. Value: After taking the above 4 V's into account, there comes one more V, which stands for Value. A bulk of data having no value is of no good to the company unless you turn it into something useful. Data in itself is of no use or importance; it needs to be converted into something valuable in order to extract information. Hence, you can state that Value is the most important of all the 5 V's.
HDFS Introduction
With growing data velocity, the data size easily outgrows the storage limit of a single machine. A solution is to store the data across a network of machines. Such file systems are called distributed file systems. Since data is stored across a network, all the complications of a network come in.
This is where Hadoop comes in. It provides one of the most reliable file systems. HDFS (Hadoop Distributed File System) has a unique design that provides storage for extremely large files with a streaming data access pattern, and it runs on commodity hardware.
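To make the idea concrete, the sketch below shows how a client program could read a file already stored on HDFS through the Hadoop Java API. This is only an illustration: the class name HdfsReadExample and the path /geeks/myfile.txt are assumptions, not part of the programs listed later in this file.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();    // picks up core-site.xml / hdfs-site.xml settings
        FileSystem fs = FileSystem.get(conf);        // handle to the distributed file system
        Path file = new Path("/geeks/myfile.txt");   // assumed example path on HDFS
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);            // stream the file line by line
            }
        }
    }
}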
Nodes: Master and slave nodes typically form the HDFS cluster.
1. Master Node (NameNode):
It manages all the slave nodes and assigns work to them.
It executes file system namespace operations like opening, closing, and renaming files and directories.
It should be deployed on reliable hardware with a high configuration, not on commodity hardware.
2. Slave Node (DataNode):
These are the actual worker nodes, which do the actual work like reading, writing, processing etc.
They also perform creation, deletion, and replication upon instruction from the master.
They can be deployed on commodity hardware.
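Once a cluster is up, this division of roles can be checked from the command line. The report below is a standard HDFS admin command (shown only as an illustration); it prints the NameNode's view of the cluster, including every live DataNode and its capacity:
bin/hdfs dfsadmin -report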
YARN
YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck on the Job Tracker which was present in Hadoop 1.0. YARN was described as a "Redesigned Resource Manager" at the time of its launch, but it has now evolved to be known as a large-scale distributed operating system used for Big Data processing.
The main components of YARN architecture include:
Client: It submits map-reduce jobs.
Resource Manager: It is the master daemon of YARN and is responsible for resource assignment and management among all the applications. Whenever it receives a processing request, it forwards it to the corresponding Node Manager and allocates resources for the completion of the request.
Scheduler: It performs scheduling based on the allocated application and available
resources.
Application manager: It is responsible for accepting the application and
negotiating the first container from the resource manager.
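With YARN running, these components can be observed from the standard YARN command line (shown here only as an illustration): yarn node -list prints the Node Managers registered with the Resource Manager, and yarn application -list prints the applications the Resource Manager is currently tracking.
$ yarn node -list
$ yarn application -list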
MAP REDUCE
One of the three components of Hadoop is MapReduce. The first component of Hadoop, the Hadoop Distributed File System (HDFS), is responsible for storing the file. The second component, MapReduce, is responsible for processing the file.
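As a small worked example of the two phases: if the input is the single line "to be or not to be", the map phase emits the pairs (to,1), (be,1), (or,1), (not,1), (to,1), (be,1); the framework then groups the pairs by key, and the reduce phase sums each group to give (be,2), (not,1), (or,1), (to,2). The word count program below implements exactly this.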
Program - 2
Write a program to implement the word count problem using MapReduce.
Driver Code: WordCountDriver.java:-
package com.javadeveloperzone.bigdata.hadoop;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration configuration = getConf();
        // Define the MapReduce job and the mapper/reducer classes it uses
        Job job = Job.getInstance(configuration, "WordCountJob");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths are taken from the command line arguments
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int result = ToolRunner.run(new Configuration(), new WordCountDriver(), args);
        if (result == 0) {
            System.out.println("Job Completed successfully...");
        } else {
            System.out.println("Job Execution Failed with status::" + result);
        }
    }
}
Mapper Code: WordCountMapper.java
package com.javadeveloperzone.bigdata.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable countOne = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each input line into words and emit (word, 1) for every word
        String[] words = value.toString().split(" ");
        for (String string : words) {
            word.set(string);
            context.write(word, countOne);
        }
    }
}
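The driver above also registers WordCountReducer, but that class is not listed in this file. A minimal reducer consistent with the driver's output types (Text key, IntWritable value) would look like the sketch below; it simply sums the 1s emitted by the mapper for each word.
Reducer Code: WordCountReducer.java
package com.javadeveloperzone.bigdata.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable totalCount = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted by the mappers for this word
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        totalCount.set(sum);
        context.write(key, totalCount);
    }
}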
Sample output (each word with its count):
11824
"'A 1
"'About 1
"'Absolute 1
"'Ah!' 2
"'Ah, 2
"'Ample.' 1
"'And 10
"'Arthur!' 1
"'As 1
"'At 1
"'Because 1
"'Breckinridge, 1
"'But 1
"'But, 1
"'But,' 1
"'Certainly 2
"'Certainly,' 1
"'Come! 1
"'Come, 1
"'DEAR 1
"'Dear 2
"'Dearest 1
"'Death,' 1
"'December 1
"'Do 3
"'Don't 1
"'Entirely.' 1
"'For 1
"'Fritz! 1
"'From 1
"'Gone 1
"'Hampshire. 1
"'Have 1
"'Here 1
"'How 2
"'I 22
"'If 2
"'In 2
"'Is 3
"'It 7
"'It's 1
"'Jephro,' 1
"'Keep 1
"'Ku 1
"'L'homme 1
"'Look 2
"'Lord 1
"'MY 2
"'May 1
"'Most 1
"'Mr. 2
"'My 4
"'Never 1
"'Never,' 1
Program - 3
Practice the basic HDFS commands.
HDFS Commands
ls: This command is used to list all the files in a directory. Use lsr for a recursive listing; it is useful when we want the hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
moveFromLocal: This command moves a file from the local file system to HDFS. Note: observe that we don't write bin/hdfs while referring to things present on the local filesystem.
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks
7. cp: This command is used to copy files within HDFS. Let's copy the folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
8. mv: This command is used to move files within HDFS. Let's cut-paste the file myfile.txt from the geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
9. rmr: This command deletes a file from HDFS recursively. It is a very useful command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the directory and then the directory itself.
10. du: It will give the size of each file in the directory.
Syntax:
bin/hdfs dfs -du <dirName>
Example:
bin/hdfs dfs -du /geeks
Hadoop Installation
Environment required for Hadoop: The production environment of Hadoop is UNIX, but it can also be used on Windows using Cygwin. Java 1.6 or above is needed to run MapReduce programs. For a Hadoop installation from a tar ball on UNIX you need:
1. Java Installation
2. SSH installation
3. Hadoop Installation and File Configuration
1) Java Installation
Step 1. Type "java -version" at the prompt to find out whether Java is installed or not. If not, then download Java from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html . The tar file jdk-7u71-linux-x64.tar.gz will be downloaded to your system.
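Step 2 (extracting the downloaded archive) is not spelled out in this file; assuming the tar ball name above, it would be:
# tar -xvzf jdk-7u71-linux-x64.tar.gz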
Step 3. To make Java available for all the users of UNIX, move the extracted folder to /usr/lib and set the path. At the prompt, switch to the root user and then type the command below to move the JDK to /usr/lib:
# mv jdk1.7.0_71 /usr/lib/
Now add the following lines to the ~/.bashrc file to set up the path:
export JAVA_HOME=/usr/lib/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now, you can check the installation by typing "java -version" in the prompt.
2) SSH Installation
SSH is used to interact with the master and slave computers without any prompt for a password. First of all, create a hadoop user on the master and slave systems:
# useradd hadoop
# passwd hadoop
To map the nodes, open the hosts file present in the /etc/ folder on all the machines and put the IP addresses along with their host names:
# vi /etc/hosts
190.12.1.114 hadoop-master
190.12.1.121 hadoop-slave-one
190.12.1.143 hadoop-slave-two
Set up an SSH key on every node so that they can communicate among themselves without a password. The commands for the same (using the hadoop user and the host names defined above) are:
# su hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-one
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-two
$ chmod 0600 ~/.ssh/authorized_keys
$ exit
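Passwordless login can then be verified by connecting to each node from the master; if the keys were copied correctly, no password prompt should appear:
$ ssh hadoop-master
$ ssh hadoop-slave-one
$ ssh hadoop-slave-two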
3) Hadoop Installation
In etc/hadoop/hadoop-env.sh set the Java path:
export JAVA_HOME=/usr/lib/jdk1.7.0_71
In etc/hadoop/hdfs-site.xml add the data directory, name directory and replication properties:
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/usr/hadoop/dfs/name/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
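Only the hdfs-site.xml properties are listed here; the NameNode address itself normally goes in etc/hadoop/core-site.xml. A typical entry, assuming the hadoop-master host name from the hosts file above and the commonly used port 9000, would be:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:9000</value>
</property>
</configuration>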
$ vi ~/.bashrc
Append the following lines at the end, then save and exit:
#Hadoop variables
export JAVA_HOME=/usr/lib/jdk1.7.0_71
export HADOOP_INSTALL=/usr/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
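For the new variables to take effect in the current shell, reload the file:
$ source ~/.bashrc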
Install Hadoop on the slave machines by copying it from the master using the commands below (run on the master):
# su hadoop
$ cd /usr
$ scp -r hadoop hadoop-slave-one:/usr/hadoop
$ scp -r hadoop hadoop-slave-two:/usr/hadoop
$ vi etc/hadoop/masters
hadoop-master
$ vi etc/hadoop/slaves
hadoop-slave-one
hadoop-slave-two
After this, format the NameNode and start all the daemons:
# su hadoop
$ cd /usr/hadoop
$ bin/hadoop namenode -format
$ cd $HADOOP_HOME/sbin
$ start-all.sh
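Whether the daemons have started can be checked with the jps command on each node; on the master it should list processes such as NameNode, SecondaryNameNode and ResourceManager, and on the slaves DataNode and NodeManager (the exact names depend on the Hadoop version):
$ jps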
Also, in etc/hadoop/mapred-site.xml the Job Tracker address is configured:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoop-master:9001</value>
</property>
</configuration>