Lecture7 1

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 26

Big Data Analytics

Mirza Touseef
RISE
Lecture 7.1 – Install and Configure Hadoop
Objectives

• How To Install and Configure Hadoop on


CentOS/RHEL

Big Data Analytics 2


Introduction
• Hadoop is a free, open-source and Java-based software
framework used for storage and processing of large
datasets on clusters of machines.
• It uses HDFS to store its data and process these data
using MapReduce.
• It is an ecosystem of Big Data tools that are primarily
used for data mining and machine learning.
• It has four major components such as Hadoop
Common, HDFS, YARN, and MapReduce.

Big Data Analytics 3


Step 1 – Disable SELinux
• To disable SELinux, open the /etc/selinux/config file:

• Change the following line:

• Save the file when you are finished. Next, restart your system to
apply the SELinux changes.

Big Data Analytics 4


Step 2 – Install Java
• Hadoop is written in Java and supports only Java version 8.
You can install OpenJDK 8 and ant using DNF command as
shown below:

• Once installed, verify the installed version of Java with the


following command:

• You should get the following output:

Big Data Analytics 5


Step 3 – Create a Hadoop User
It is a good idea to create a separate user to run Hadoop for
security reasons.
• Run the following command to create a new user with
name hadoop:

• Next, set the password for this user with the following
command:

• Provide and confirm the new password as shown below:

Big Data Analytics 6


Step 4 – Configure SSH Key-
based Authentication
Next, you will need to configure passwordless SSH
authentication for the local system.
• First, change the user to hadoop with the following
command:

• Next, run the following command to generate Public and


Private Key Pairs:

Big Data Analytics 7


Step 4 – Configure SSH Key-
based Authentication
• You will be asked to enter the filename. Just press Enter to
complete the process:

Big Data Analytics 8


Step 4 – Configure SSH Key-
based Authentication
• Next, append the generated public keys from id_rsa.pub
to authorized_keys and set proper permission:

• Next, verify the passwordless SSH authentication with the


following command:

Big Data Analytics 9


Step 4 – Configure SSH Key-
based Authentication
• You will be asked to authenticate hosts by adding RSA keys
to known hosts. Type yes and hit Enter to authenticate the
localhost:

Big Data Analytics 10


Step 5 – Install Hadoop
• First, change the user to hadoop with the following
command:

• Next, download the latest version of Hadoop using the


wget command:

• Once downloaded, extract the downloaded file:

• Next, rename the extracted directory to hadoop:

Big Data Analytics 11


Step 5 – Install Hadoop
Next, you will need to configure Hadoop and Java
Environment Variables on your system.
• Open the ~/.bashrc file in your favorite text editor:

• Append the following lines:

Big Data Analytics 12


Step 5 – Install Hadoop
• Save and close the file. Then, activate the environment
variables with the following command:

• Next, open the Hadoop environment variable file:


• doop-env.sh

• Update the JAVA_HOME variable as per your Java


installation path:

• Save and close the file when you are finished.

Big Data Analytics 13


Step 6 – Configure Hadoop
First, you will need to create the namenode and datanode
directories inside Hadoop home directory:
• Run the following command to create both directories:

• Next, edit the core-site.xml file and update with your


system hostname:
• Change the following name as per your system hostname:

Big Data Analytics 14


Step 6 – Configure Hadoop
• Save and close the file. Then, edit the hdfs-site.xml file:

• Change the NameNode and DataNode directory path as


shown below:

Big Data Analytics 15


Step 6 – Configure Hadoop
• Save and close the file. Then, edit the mapred-site.xml file:

• Make the following changes:

• Save and close the file. Then, edit the yarn-site.xml file:

Big Data Analytics 16


Step 6 – Configure Hadoop
• Make the following changes:

• Save and close the file when you are finished.

Big Data Analytics 17


Step 7 – Start Hadoop Cluster
Before starting the Hadoop cluster. You will need to format
the Namenode as a hadoop user.
• Run the following command to format the hadoop
Namenode:

• You should get the following output:

Big Data Analytics 18


Step 7 – Start Hadoop Cluster
• After formating the Namenode, run the following
command to start the hadoop cluster:

• Once the HDFS started successfully, you should get the


following output:

• Next, start the YARN service as shown below:

Big Data Analytics 19


Step 7 – Start Hadoop Cluster
• You should get the following output:

• You can now check the status of all Hadoop services using
the jps command:
• You should see all the running services in the following
output:

Big Data Analytics 20


Step 8 – Access Hadoop Namenode
and Resource Manager
• To access the Namenode, open your web browser and
visit the URL http://your-server-ip:9870. You should see
the following screen:

Big Data Analytics 21


Step 9 – Access Hadoop Namenode
and Resource Manager
• To access the Resource Manage, open your web browser
and visit the URL http://your-server-ip:8088. You should
see the following screen:

Big Data Analytics 22


Step 10 – Verify the Hadoop
Cluster
At this point, the Hadoop cluster is installed and configured. Next, we will
create some directories in HDFS filesystem to test the Hadoop.
• Let’s create some directory in the HDFS filesystem using
the following command:

• Next, run the following command to list the above


directory:

• You should get the following output:

Big Data Analytics 23


Step 10 – Verify the Hadoop
Cluster
You can also verify the above directory in the Hadoop Namenode web
interface.
• Go to the Namenode web interface, click on the Utilities => Browse
the file system. You should see your directories which you have
created earlier in the following screen:

Big Data Analytics 24


Step 11 – Stop Hadoop Cluster
You can also stop the Hadoop Namenode and Yarn service
any time by running the stop-dfs.sh and stop-yarn.sh script
as a Hadoop user.
• To stop the Hadoop Namenode service, run the following
command as a hadoop user:

• To stop the Hadoop Resource Manager service, run the


following command:

Big Data Analytics 25


Conclusion

• In the above tutorial, you learned how to set up the


Hadoop single node cluster on CentOS 8. I hope you have
now enough knowledge to install the Hadoop in the
production environment.

Big Data Analytics 26

You might also like