Big Data


ADVANCED INSTITUTE OF TECHNOLOGY

& MANAGEMENT

Practical file

HADOOP

Submitted To : Ms. Deepika Sharma Submitted by :______________


Roll no. :______________
Class : MCA 3rd sem
INDEX
Sr. No.   Topic                                                     Sign

1.   Introduction to HADOOP

2.   WAP to implement word count problem using map reduce

3.   WAP to implement HDFS Commands

4.   WAP to install Hadoop


Program. 1
Introduction to HADOOP
HADOOP is an ecosystem of open source components that fundamentally changes the way
enterprises store, process, and analyze data. Unlike traditional systems, HADOOP enables
multiple types of analytic workloads to run on the same data, at the same time, at massive
scale on industry-standard hardware.

Big Data, which Hadoop is designed to handle, is commonly characterized by the following five V's:

1. Volume: With increasing dependence on technology, data is being produced at a large
volume. Common examples are data produced by various social networking
sites, sensors, scanners, airlines and other organizations.

2. Velocity: A huge amount of data is generated every second. It was estimated that by the
end of 2020, every individual would produce 3 MB of data per second. This large volume of
data is generated with great velocity.
3. Variety: The data being produced by different means is of three types:
 Structured Data: Relational data which is stored in the form of rows and
columns.
 Unstructured Data: Text, pictures, videos, etc. are examples of unstructured
data, which cannot be stored in the form of rows and columns.
 Semi-Structured Data: Log files are an example of this type of data.

4. Veracity: The term Veracity refers to inconsistent or incomplete data, which
results in the generation of doubtful or uncertain information. Data inconsistency often
arises because of the volume or amount of data: data in bulk can create confusion,
whereas too little data can convey only half or incomplete information.

5. Value: After taking the above four V's into account there comes one more V, which stands
for Value. A bulk of data that has no value is of no good to a company unless you
turn it into something useful. Data in itself is of no use or importance; it needs
to be converted into something valuable in order to extract information. Hence, you can
state that Value is the most important V of all the 5 V's.
HDFS Introduction
With growing data velocity, the data size easily outgrows the storage limit of a single machine.
A solution is to store the data across a network of machines. Such file systems are
called distributed file systems. Since data is stored across a network, all the complications
of a network come in.
This is where HADOOP comes in. It provides one of the most reliable file systems. HDFS
(HADOOP Distributed File System) is a uniquely designed file system that provides storage for
extremely large files with a streaming data access pattern, and it runs on commodity hardware.
Nodes: Master and slave nodes typically form the HDFS cluster.
1. Master Node:
 Manages all the slave nodes and assigns work to them.
 It executes file system namespace operations like opening, closing, and renaming files
and directories.
 It should be deployed on reliable hardware with a high configuration, not on
commodity hardware.
2. Slave Node:
 Actual worker nodes, which do the actual work like reading, writing, processing etc.
 They also perform creation, deletion, and replication upon instruction from the
master.
 They can be deployed on commodity hardware.

HDFS daemons: Daemons are the processes running in the background (a small client-side sketch follows this list).

 Name node:
 Runs on the master node.
 Stores metadata (data about data) like file paths, the number of blocks, block
IDs, etc.
 Requires a large amount of RAM.
 Stores the metadata in RAM for fast retrieval, i.e. to reduce seek time, though a
persistent copy of it is kept on disk.
 Data Nodes:
 Run on slave nodes.
 Require large disk capacity, as the actual data blocks are stored here.
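
This division of labour can also be seen from a client program. The sketch below is a minimal, assumed example (not part of the original practical): it uses the Hadoop FileSystem Java API to ask the Name node for a file's metadata and block locations, while the block contents themselves live on the Data nodes. The path /geeks/AI.txt is borrowed from the HDFS commands program later in this file.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMetadataDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();    // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);        // the client talks to the Name node here
        FileStatus status = fs.getFileStatus(new Path("/geeks/AI.txt"));
        // Metadata (length, replication, block size) is served by the Name node.
        System.out.println("Length: " + status.getLen()
                + ", replication: " + status.getReplication()
                + ", block size: " + status.getBlockSize());
        // Block locations tell us which Data nodes actually hold each block.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block hosts: " + Arrays.toString(loc.getHosts()));
        }
        fs.close();
    }
}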

YARN
YARN stands for “Yet Another Resource Negotiator“. It was introduced in Hadoop 2.0 to
remove the bottleneck of the Job Tracker which was present in Hadoop 1.0. YARN was
described as a “Redesigned Resource Manager” at the time of its launch, but it has
now evolved to be known as a large-scale distributed operating system used for Big Data
processing.
The main components of the YARN architecture include the following (a small client-side sketch follows this list):
 Client: It submits map-reduce jobs.
 Resource Manager: It is the master daemon of YARN and is responsible for resource
assignment and management among all the applications. Whenever it receives a
processing request, it forwards it to the corresponding node manager and allocates
resources for the completion of the request accordingly. It has two major components:
 Scheduler: It performs scheduling based on the allocated application and available
resources.
 Application Manager: It is responsible for accepting the application and
negotiating the first container from the Resource Manager.

 Application Master: An application is a single job submitted to the framework. The
Application Master is responsible for negotiating resources with the Resource
Manager, and for tracking the status and monitoring the progress of a single application.
 Container: It is a collection of physical resources such as RAM, CPU cores and disk
on a single node.
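
The sketch below is a minimal, assumed example (not part of the original practical) that uses the YarnClient API to query the Resource Manager for the node and application reports it maintains; it is only an illustration of the roles described above.

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInfo {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());   // reads yarn-site.xml
        yarn.start();
        // The Resource Manager tracks every running node and its container resources.
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " -> "
                    + node.getCapability().getMemory() + " MB, "
                    + node.getCapability().getVirtualCores() + " vcores");
        }
        // It also tracks every submitted application (each has its own Application Master).
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.println(app.getApplicationId() + " : " + app.getName()
                    + " [" + app.getYarnApplicationState() + "]");
        }
        yarn.stop();
    }
}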

MAP REDUCE
One of the three components of Hadoop is Map Reduce. The first component of Hadoop,
the Hadoop Distributed File System (HDFS), is responsible for storing the file. The
second component, Map Reduce, is responsible for processing the file, as the small trace below illustrates.
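
For example, here is an illustrative trace (the two-line input file is an assumed example, not part of the original practical):

Input file:
hello hadoop
hello world

Map phase (one (word, 1) pair per word):
(hello, 1) (hadoop, 1) (hello, 1) (world, 1)

Shuffle and sort (pairs grouped by key):
hadoop -> [1], hello -> [1, 1], world -> [1]

Reduce phase (values summed per key):
hadoop 1
hello 2
world 1

Program 2 below implements exactly this flow.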
Program: - 2
Wap to implement word count problem using map reduce
Driver Code: WordCountDriver.java:-
package com.javadeveloperzone.bigdata.hadoop;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration configuration = getConf();
        Job job = Job.getInstance(configuration, "WordCountJob");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int result = ToolRunner.run(new Configuration(), new WordCountDriver(), args);
        if (result == 0) {
            System.out.println("Job Completed successfully...");
        } else {
            System.out.println("Job Execution Failed with status::" + result);
        }
    }
}
Mapper Code: WordCountMapper.java
package com.javadeveloperzone.bigdata.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable countOne = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().split(" ");
        for (String string : words) {
            word.set(string);
            context.write(word, countOne);
        }
    }
}

Reducer Code: WordCountReducer.java


package com.javadeveloperzone.bigdata.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        // Sum the counts emitted for this word (the mapper emits 1 for every occurrence).
        for (IntWritable value : values) {
            total += value.get();
        }
        context.write(key, new IntWritable(total));
    }
}
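
Once the three classes are compiled and packaged into a jar, the job can be submitted as shown below (the jar name wordcount.jar and the HDFS paths /input and /output are assumed examples):

bin/hadoop jar wordcount.jar com.javadeveloperzone.bigdata.hadoop.WordCountDriver /input /output

The word counts are written to part files inside the output directory; a sample of such output is shown below.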
Output

11824

"'A 1
"'About 1
"'Absolute 1
"'Ah!' 2
"'Ah, 2
"'Ample.' 1
"'And 10
"'Arthur!' 1
"'As 1
"'At 1
"'Because 1
"'Breckinridge, 1
"'But 1
"'But, 1
"'But,' 1
"'Certainly 2
"'Certainly,' 1
"'Come! 1
"'Come, 1
"'DEAR 1
"'Dear 2
"'Dearest 1
"'Death,' 1
"'December 1
"'Do 3
"'Don't 1
"'Entirely.' 1
"'For 1
"'Fritz! 1
"'From 1
"'Gone 1
"'Hampshire. 1
"'Have 1
"'Here 1
"'How 2
"'I 22
"'If 2
"'In 2
"'Is 3
"'It 7
"'It's 1
"'Jephro,' 1
"'Keep 1
"'Ku 1
"'L'homme 1
"'Look 2
"'Lord 1
"'MY 2
"'May 1
"'Most 1
"'Mr. 2
"'My 4
"'Never 1
"'Never,' 1
Program: - 3
Wap to implement HDFS Commands
HDFS Commands
ls: This command is used to list all the files. Use lsr for recursive approach. It is useful
when we want a hierarchy of a folder.

1. Syntax: bin/hdfs dfs -ls <path>


Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains the executables,
so bin/hdfs means we want the hdfs executable, particularly its dfs (Distributed File
System) commands.

2. mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So


let’s first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>

creating home directory:

bin/hdfs dfs -mkdir /user

bin/hdfs dfs -mkdir /user/username -> write the username of your computer
Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder will be
created relative to the home directory.

3. touchz: It creates an empty file.


Syntax:
bin/hdfs dfs -touchz <file_path>
Example:
bin/hdfs dfs -touchz /geeks/myfile.txt
4 copyFromLocal (or) put: To copy files/folders from local file system to hdfs
store. This is the most important command. Local filesystem means the files
present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to
folder geeks present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
(OR)
bin/hdfs dfs -put ../Desktop/AI.txt /geeks

5 cat: To print file contents.


Syntax:
bin/hdfs dfs -cat <path>
Example:
// print the content of AI.txt present
// inside geeks folder.
bin/hdfs dfs -cat /geeks/AI.txt
6 copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero
(OR)
bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero
myfile.txt from geeks folder will be copied to folder hero present on Desktop.

Note: Observe that we don't write bin/hdfs while referring to things present on the
local filesystem.

7 cp: This command is used to copy files within hdfs. Let's copy
folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
8 mv: This command is used to move files within hdfs. Let's cut-paste a
file myfile.txt from the geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied

9 rmr: This command deletes a file or directory from HDFS recursively. It is a very useful
command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the
directory then the directory itself.
10 du: It will give the size of each file in the directory.
Syntax:
bin/hdfs dfs -du <dirName>
Example:
bin/hdfs dfs -du /geeks

11 dus: This command will give the total size of directory/file.


Syntax:
bin/hdfs dfs -dus <dirName>

12 moveFromLocal: This command will move file from local to hdfs.


Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example: bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks
13 stat: It will give the last modified time of a directory or path. In short, it gives
the stats of the directory or file.
Syntax:
bin/hdfs dfs -stat <hdfs file>
Example:
bin/hdfs dfs -stat /geeks

14 setrep: This command is used to change the replication factor of a file/directory

in HDFS. By default it is 3 for anything stored in HDFS (as set by
dfs.replication in hdfs-site.xml).
Example 1: To change the replication factor to 6 for geeks.txt stored in HDFS.
bin/hdfs dfs -setrep -R -w 6 geeks.txt
Example 2: To change the replication factor to 4 for the directory /geeks stored in
HDFS.
bin/hdfs dfs -setrep -R 4 /geeks
Note: The -w means wait till the replication is completed. And -R means recursively,
we use it for directories as they may also contain many files and folders inside them.
Note: There are more commands in HDFS but we discussed the commands which are
commonly used when working with Hadoop. You can check out the list of dfs commands
using the following command:
bin/hdfs dfs
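
The same operations can also be performed from Java. The sketch below is a minimal, assumed example (not part of the original practical) that mirrors a few of the shell commands above using the Hadoop FileSystem API; the paths reuse the /geeks and AI.txt examples from this program.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/geeks"));                                 // like: -mkdir /geeks
        fs.copyFromLocalFile(new Path("../Desktop/AI.txt"),            // like: -copyFromLocal
                new Path("/geeks"));

        for (FileStatus status : fs.listStatus(new Path("/geeks"))) {  // like: -ls /geeks
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        InputStream in = fs.open(new Path("/geeks/AI.txt"));           // like: -cat
        IOUtils.copyBytes(in, System.out, 4096, true);                 // true -> close the stream

        fs.setReplication(new Path("/geeks/AI.txt"), (short) 4);       // like: -setrep 4
        fs.close();
    }
}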
Program: - 4
WAP to install Hadoop

Hadoop Installation
Environment required for Hadoop: The production environment of Hadoop is UNIX, but it can
also be used on Windows using Cygwin. Java 1.6 or above is needed to run Map Reduce
programs. For a Hadoop installation from a tarball in a UNIX environment you need:

1. Java Installation
2. SSH installation
3. Hadoop Installation and File Configuration

1) Java Installation

Step 1. Type "java -version" at the prompt to find out whether Java is installed or not. If not, then
download Java from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-
downloads-1880260.html . The tar file jdk-7u71-linux-x64.tar.gz will be downloaded to your
system.

Step 2. Extract the file using the below command

#tar zxf jdk-7u71-linux-x64.tar.gz

Step 3. To make Java available to all the users of UNIX, move the file to /usr/lib and set the
path. At the prompt, switch to the root user and then type the command below to move the jdk
to /usr/lib:

# mv jdk1.7.0_71 /usr/lib/

Now in ~/.bashrc file add the following commands to set up the path.

# export JAVA_HOME=/usr/lib/jdk1.7.0_71
# export PATH=$PATH:$JAVA_HOME/bin

Now, you can check the installation by typing "java -version" in the prompt.

2) SSH Installation

SSH is used to interact with the master and slave computers without any password prompt.
First of all, create a hadoop user on the master and slave systems:

# useradd hadoop
# passwd hadoop

To map the nodes, open the hosts file present in the /etc/ folder on all the machines and put
the IP addresses along with their host names.

# vi /etc/hosts

Enter the lines below

190.12.1.114 hadoop-master
190.12.1.121 hadoop-slave-one
190.12.1.143 hadoop-slave-two

Set up SSH key in every node so that they can communicate among themselves without
password. Commands for the same are:

# su hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-one
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-two
$ chmod 0600 ~/.ssh/authorized_keys
$ exit

3) Hadoop Installation

Hadoop can be downloaded from


http://developer.yahoo.com/hadoop/tutorial/module3.html

Now extract the Hadoop and copy it to a location.


$ sudo mkdir /usr/hadoop
$ sudo tar vxzf hadoop-2.2.0.tar.gz -C /usr/hadoop

Change the ownership of Hadoop folder

$ sudo chown -R hadoop /usr/hadoop

Change the Hadoop configuration files:

All the configuration files are present in /usr/hadoop/etc/hadoop

1) In hadoop-env.sh file add

export JAVA_HOME=/usr/lib/jvm/jdk/jdk1.7.0_71

2) In core-site.xml add the following between the configuration tags:


<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>

3) In hdfs-site.xml add the following between the configuration tags:

<configuration>
<property>
<name>dfs.data.dir</name>
<value>/usr/hadoop/dfs/name/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

4) Open mapred-site.xml and make the change shown below:

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoop-master:9001</value>
</property>
</configuration>

5) Finally, update your $HOME/.bashrc

cd $HOME
vi .bashrc

Append the following lines at the end, then save and exit:

#Hadoop variables
export JAVA_HOME=/usr/lib/jvm/jdk/jdk1.7.0_71
export HADOOP_INSTALL=/usr/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL

On the slave machines, install Hadoop using the commands below:

# su hadoop
$ cd /usr/hadoop
$ scp -r hadoop hadoop-slave-one:/usr/hadoop
$ scp -r hadoop hadoop-slave-two:/usr/hadoop

Configure the master node and slave nodes:

$ vi etc/hadoop/masters
hadoop-master

$ vi etc/hadoop/slaves
hadoop-slave-one
hadoop-slave-two

After this, format the name node and start all the daemons:

# su hadoop
$ cd /usr/hadoop
$ bin/hadoop namenode -format

$ cd $HADOOP_HOME/sbin
$ start-all.sh
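
To confirm that the newly started cluster is reachable, the minimal sketch below (an assumed check, not part of the original practical) connects to the Name node address configured in core-site.xml above and prints some basic filesystem information:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClusterSanityCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://hadoop-master:9000");  // same value as core-site.xml
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        System.out.println("Root exists: " + fs.exists(new Path("/")));
        fs.close();
    }
}

Running jps on each node should also list the Hadoop daemons started by start-all.sh.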
