BDA Lab Assignment 1
AIM
To set up a single-node Hadoop cluster
THEORY
Hadoop is an open-source framework that allows users to store and process big data in
a distributed environment across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines,
each offering local computation and storage.
Big data is a collection of large datasets that cannot be processed using traditional
computing techniques. It is not a single technique or tool; rather, it has become a
complete subject involving various tools, techniques, and frameworks.
Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel across the nodes of the cluster. In short, Hadoop is used to
develop applications that can perform complete statistical analysis on huge
amounts of data.
Hadoop Architecture
At its core, Hadoop has two major layers namely −
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
MapReduce: MapReduce is a parallel programming model for writing distributed
applications, devised at Google for efficient processing of large amounts of data
(multi-terabyte datasets) on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner. MapReduce programs run on
Hadoop, an Apache open-source framework.
Hadoop Distributed File System: The Hadoop Distributed File System (HDFS) is
based on the Google File System (GFS) and provides a distributed file system that
is designed to run on commodity hardware. It has many similarities with existing
distributed file systems. However, the differences from other distributed file
systems are significant. It is highly fault-tolerant and is designed to be deployed on
low-cost hardware. It provides high throughput access to application data and is
suitable for applications having large datasets.
Apart from the two core components mentioned above, the Hadoop framework also
includes the following two modules −
• Hadoop Common − These are Java libraries and utilities required by other
Hadoop modules.
• Hadoop YARN − This is a framework for job scheduling and cluster resource
management.
IMPLEMENTATION
1. Install Java
Hadoop requires a working Java installation. Run the following commands:
sudo apt install default-jre
sudo apt install openjdk-11-jre-headless
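You can verify the installation by checking the reported version (the exact version string will vary):
java -version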
2. Install openssh-server
Hadoop uses SSH to manage its daemons, even on a single node:
sudo apt install openssh-server
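Hadoop's start-up scripts connect to localhost over SSH, so passwordless SSH is normally set up next; a minimal sketch, assuming the default key location:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost
The last command should log you in without prompting for a password.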
b. Run the following command to reload the updated environment variables:
source ~/.bashrc
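The edit being sourced here would typically export the Hadoop environment variables in ~/.bashrc, roughly as follows (both paths are assumptions; adjust them to the actual Java and Hadoop install locations):
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=$HOME/hadoop-3.3.6    # hypothetical unpack location
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin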
c. Edit the etc/hadoop/hadoop-env.sh file:
nano etc/hadoop/hadoop-env.sh
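The usual change in hadoop-env.sh is to point JAVA_HOME at the installed Java, for example (path assumed for Ubuntu's OpenJDK 11):
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64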
7. Format namenode
Run the following command to format the NameNode:
bin/hdfs namenode -format
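Formatting presupposes that the default file system has already been configured in etc/hadoop/core-site.xml; a minimal single-node configuration, with the conventional hdfs://localhost:9000 address assumed, looks like this:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>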
8. Run DFS daemons
a. Run the following command to start the NameNode and DataNode daemons:
sbin/start-dfs.sh
b. Check status via the jps command: when the services are initiated successfully, you
should see these four processes: NameNode, DataNode, SecondaryNameNode, and Jps.
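A jps listing at this point looks roughly like the following (PIDs are illustrative):
jps
# 4211 NameNode
# 4356 DataNode
# 4577 SecondaryNameNode
# 4802 Jps
On Hadoop 3.x you can also open the NameNode web UI at http://localhost:9870 to confirm HDFS is up.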
9. Run YARN daemons: Run the following command to start the YARN daemons (ResourceManager and NodeManager):
sbin/start-yarn.sh
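As a quick smoke test (not part of the original steps), you can create and list a directory in HDFS; $USER is used here as a stand-in for your login name:
bin/hdfs dfs -mkdir -p /user/$USER
bin/hdfs dfs -ls /user
On Hadoop 3.x, the ResourceManager web UI should also be reachable at http://localhost:8088.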
10. Shutdown services: Once you've finished exploring, you can use the
following commands to shut down the daemons:
sbin/stop-yarn.sh
sbin/stop-dfs.sh
You can verify through the jps command, which will now show only one process (Jps).
CONCLUSION
Thus, in this experiment I set up a single-node Hadoop cluster on Linux. I
studied the architecture of the Hadoop Distributed File System and successfully
worked with Hadoop on Ubuntu.