
LAB ASSIGNMENT - 01

Name: Parth Shah


ID: 191080070
Branch: Information Technology

AIM
To setup single node Hadoop cluster

THEORY
Hadoop is an open-source framework that allows users to store and process big
data in a distributed environment across clusters of computers using simple
programming models. It is designed to scale up from single servers to thousands
of machines, each offering local computation and storage.
Big data is a collection of large datasets that cannot be processed using
traditional computing techniques. It is not a single technique or tool; rather, it
has become a complete subject involving various tools, techniques and frameworks.
Hadoop runs applications using the MapReduce algorithm, in which the data is
processed in parallel across the nodes of the cluster. In short, Hadoop is used to
develop applications that can perform complete statistical analysis on huge
amounts of data.
Hadoop Architecture
At its core, Hadoop has two major layers namely −
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
MapReduce: MapReduce is a parallel programming model for writing distributed
applications devised at Google for efficient processing of large amounts of data
(multi-terabyte data-sets), on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner. The MapReduce program runs on
Hadoop which is an Apache open-source framework.

Hadoop Distributed File System: The Hadoop Distributed File System (HDFS) is
based on the Google File System (GFS) and provides a distributed file system that
is designed to run on commodity hardware. It has many similarities with existing
distributed file systems. However, the differences from other distributed file
systems are significant. It is highly fault-tolerant and is designed to be deployed on
low-cost hardware. It provides high throughput access to application data and is
suitable for applications having large datasets.
Apart from the above-mentioned two core components, the Hadoop framework also
includes the following two modules −
• Hadoop Common − These are Java libraries and utilities required by other
Hadoop modules.
• Hadoop YARN − This is a framework for job scheduling and cluster resource
management.

NameNode and DataNodes


HDFS has a master/slave architecture. An HDFS cluster consists of a single
NameNode, a master server that manages the file system namespace and
regulates access to files by clients. In addition, there are a number of DataNodes,
usually one per node in the cluster, which manage storage attached to the nodes
that they run on. HDFS exposes a file system namespace and allows user data to
be stored in files. Internally, a file is split into one or more blocks and these blocks
are stored in a set of DataNodes. The NameNode executes file system namespace
operations like opening, closing, and renaming files and directories. It also
determines the mapping of blocks to DataNodes. The DataNodes are responsible
for serving read and write requests from the file system’s clients. The DataNodes
also perform block creation, deletion, and replication upon instruction from the
NameNode.
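
As a rough illustration of this block-to-DataNode mapping, once the single-node
cluster built in the implementation below is running, the HDFS fsck utility can
report where each block of a file is stored (the path /user/test.txt here is
only a placeholder):
hdfs fsck /user/test.txt -files -blocks -locations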

The NameNode and DataNode are pieces of software designed to run on
commodity machines. These machines typically run a GNU/Linux operating system
(OS). HDFS is built using the Java language; any machine that supports Java can
run the NameNode or the DataNode software. Usage of the highly portable Java
language means that HDFS can be deployed on a wide range of machines. A
typical deployment has a dedicated machine that runs only the NameNode
software. Each of the other machines in the cluster runs one instance of the
DataNode software. The architecture does not preclude running multiple
DataNodes on the same machine but in a real deployment that is rarely the case.
The existence of a single NameNode in a cluster greatly simplifies the architecture
of the system. The NameNode is the arbitrator and repository for all HDFS
metadata. The system is designed in such a way that user data never flows
through the NameNode.

The File System Namespace


HDFS supports a traditional hierarchical file organization. A user or an application
can create directories and store files inside these directories. The file system
namespace hierarchy is similar to most other existing file systems; one can create
and remove files, move a file from one directory to another, or rename a file.
HDFS does not yet implement user quotas. HDFS does not support hard links or
soft links. However, the HDFS architecture does not preclude implementing these
features. The NameNode maintains the file system namespace. Any change to the
file system namespace or its properties is recorded by the NameNode. An
application can specify the number of replicas of a file that should be maintained
by HDFS. The number of copies of a file is called the replication factor of that file.
This information is stored by the NameNode.
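
For concreteness, the namespace operations and the replication factor described
above correspond to HDFS shell commands such as the following sketch (the paths
are placeholders, and the single-node cluster from the implementation section
must already be running; note that on a single DataNode a replication factor
above 1 cannot actually be satisfied):
hdfs dfs -mkdir -p /user/parth/demo
hdfs dfs -put localfile.txt /user/parth/demo/
hdfs dfs -mv /user/parth/demo/localfile.txt /user/parth/demo/renamed.txt
hdfs dfs -setrep 2 /user/parth/demo/renamed.txt
hdfs dfs -rm /user/parth/demo/renamed.txt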

How Does Hadoop Work?


It is quite expensive to build bigger servers with heavy configurations to handle
large-scale processing. As an alternative, you can tie together many single-CPU
commodity computers into a single functional distributed system; in practice, the
clustered machines can read the dataset in parallel and provide much higher
throughput, and the cluster is cheaper than one high-end server. This is the
first motivating factor behind using Hadoop: it runs across clusters of low-cost
machines.
Hadoop runs code across a cluster of computers. This process includes the
following core tasks that Hadoop performs −
• Data is initially divided into directories and files. Files are divided into
uniformly sized blocks (128 MB by default in current Hadoop releases; 64 MB in
older versions); see the command sketch after this list.
• These files are then distributed across various cluster nodes for further
processing.
• HDFS, being on top of the local file system, supervises the processing.
• Blocks are replicated for handling hardware failure.
• Checking that the code was executed successfully.
• Performing the sort that takes place between the map and reduce stages.
• Sending the sorted data to a certain computer.
• Writing the debugging logs for each job.
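
As a rough way to check the block size and replication factor of a stored file
once the cluster below is running, the HDFS stat command can print both (the
path is a placeholder):
hdfs dfs -stat "block size: %o, replication: %r, size: %b bytes" /user/parth/demo/renamed.txt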

IMPLEMENTATION
1. Install Java JDK
sudo apt install default-jre
sudo apt install openjdk-11-jre-headless

sudo apt install openjdk-8-jre-headless


2. Install OpenJDK via the following command:
sudo apt-get install openjdk-8-jdk
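
As an optional sanity check (not part of the original steps), the installed JDK
version and its installation path, which is needed later for JAVA_HOME, can be
confirmed with:
java -version
update-alternatives --list java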

3. Download Hadoop binary
Run the following command in the Ubuntu terminal to download the binary from the
internet:
wget http://mirror.intergrid.com.au/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
4. Unzip Hadoop binary
Run the following command to create a hadoop folder under the user's home folder:
mkdir ~/hadoop
And then run the following command to unzip the binary package:
tar -xvzf hadoop-3.3.0.tar.gz -C ~/hadoop
Once it is unpacked, change the current directory to the Hadoop folder:
cd ~/hadoop/hadoop-3.3.0/
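
Listing the unpacked directory should show the standard Hadoop layout (bin, etc,
sbin, share, and so on):
ls ~/hadoop/hadoop-3.3.0
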
5. Configure passphraseless ssh

Install openssh-server
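
A minimal sketch of this step, following the approach used in the official
Hadoop single-node guide (the install command assumes Ubuntu's package name):
sudo apt install openssh-server
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost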

6. Configure the pseudo-distributed mode (Single-node mode)


a. Set up environment variables by editing the file ~/.bashrc:
nano ~/.bashrc
Add the following environment variables:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export HADOOP_HOME=~/hadoop/hadoop-3.3.0
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

b. Run the following command to source the latest variables: source ~/.bashrc
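
If the variables were picked up correctly, the following should print the Hadoop
3.3.0 version banner (this check is a suggestion, not part of the original steps):
echo $HADOOP_HOME
hadoop version
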
c. Edit the etc/hadoop/hadoop-env.sh file:
nano etc/hadoop/hadoop-env.sh
Set the JAVA_HOME environment variable:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
d. Edit etc/hadoop/core-site.xml:
nano etc/hadoop/core-site.xml
Add the following configuration:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

e. Edit etc/hadoop/hdfs-site.xml:
nano etc/hadoop/hdfs-site.xml
Add the following configuration:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

f. Edit the file etc/hadoop/mapred-site.xml:
nano etc/hadoop/mapred-site.xml
Add the following configuration:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>

g. Edit the file etc/hadoop/yarn-site.xml:
nano etc/hadoop/yarn-site.xml
Add the following configuration:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>

7. Format namenode
Run the following command to format the name node:
bin/hdfs namenode -format
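
As an optional check, and assuming the default hadoop.tmp.dir (it was not
overridden in core-site.xml above), the freshly formatted NameNode metadata
should appear under /tmp:
ls /tmp/hadoop-$USER/dfs/name/current
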
8. Run DFS daemons
a. Run the following commands to start NameNode and DataNode daemons:
sbin/start-dfs.sh

b. Check the status via the jps command. When the services have started
successfully, you should be able to see four processes: NameNode, DataNode,
SecondaryNameNode and Jps.

9. Run YARN daemons: Run the following command to start the YARN daemons
(ResourceManager and NodeManager):
sbin/start-yarn.sh
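
Before shutting down, an end-to-end smoke test can be run with one of the
example jobs that ship with Hadoop (adapted from the official single-node guide;
the input and output paths are placeholders), executed from the
~/hadoop/hadoop-3.3.0 directory:
bin/hdfs dfs -mkdir -p /user/$USER/input
bin/hdfs dfs -put etc/hadoop/*.xml /user/$USER/input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar grep /user/$USER/input /user/$USER/output 'dfs[a-z.]+'
bin/hdfs dfs -cat /user/$USER/output/*
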
10. Shutdown services: Once you have completed your explorations, you can use
the following commands to shut down those daemons:
sbin/stop-yarn.sh
sbin/stop-dfs.sh
You can verify through the jps command, which will now show only the Jps process
itself.
CONCLUSION
Thus, in this experiment I set up a single-node Hadoop cluster on Linux. I
studied the architecture of the Hadoop Distributed File System and successfully
worked with Hadoop on Ubuntu.
