

Hadoop Architecture Lab

CBD 3115

Name of Student : Chirag Mangla


Registration No. : 1419210023
Section : E
Course : B.Tech Computer Science Engineering
Specialization : Blockchain and IoT
Semester & Year : V & III

Submitted To: Dr. Ajay Kaushik


Computer Science Department
SRM University
Practical - 1

Aim: To learn the concept of Apache Ambari, its features and benefits.

Theory:

What is Apache Ambari?

Apache Ambari is an open-source administration tool responsible for keeping track of running applications and their status.

Basically, it is deployed on top of the Hadoop cluster. It can be considered an open-source, web-based management tool that provisions, manages and monitors the health of Hadoop clusters.

Ambari offers a highly interactive dashboard that lets administrators visualize the progress and status of every application running on the Hadoop cluster.

In addition, it is a very flexible and scalable user interface that allows a range of tools, for example Pig, MapReduce and Hive, to be installed on the cluster, and administers their performance in a user-friendly fashion.
Key points of this technology are:

● Instantaneous insight into the health of the Hadoop cluster using pre-configured operational metrics.
● Installation is very easy thanks to its user-friendly configuration.
● Apache Ambari can be installed easily through the Hortonworks Data Platform.
● Dependencies and performance are monitored by visualizing and analyzing jobs and tasks.
● Authentication, authorization and auditing are provided by installing Kerberos-based Hadoop clusters.
● Since it is a very flexible and adaptive technology, it fits perfectly in the enterprise environment.

At its core, Apache Ambari consists of the following applications:

• Ambari server
• Ambari agent
• Ambari web UI
• Database

Features of Ambari

• Pre-configured Operational Metrics: Ambari's pre-configured operational metrics give instantaneous insight into the health of the Hadoop cluster.
• User-Friendly Configuration: Thanks to its user-friendly configuration, Ambari offers an easy, step-by-step installation guide.
• Authentication: By installing Kerberos-based Hadoop clusters, Ambari provides authentication, authorization and auditing.
• Monitoring: Dependencies and performance are monitored by visualizing and analyzing jobs and tasks.
• Platform Independent: Architecturally, Apache Ambari supports any hardware and software system and can run on Windows, Mac and many other platforms; the platforms it commonly runs on include Ubuntu, SLES, RHEL and more.

• Additionally, platform-dependent components should be plugged in through well-defined interfaces, for example yum and rpm packages or Debian packages.
• Pluggable Component: We can easily customize any current Ambari application. Specific tools and technologies should be wrapped in pluggable components; note that the goal of pluggability does not extend to standardizing inter-component communication.

• Version Management and Upgrade: When it comes to maintaining versions, Ambari does not need any external tool such as Git. Moreover, it is quite easy to upgrade both Ambari and its applications.
• Extensibility: The functionality of existing Ambari applications can be extended by including different view components.
• Failure Recovery: If something goes wrong while we are working on an Ambari application, the system needs to recover from it gracefully. It is similar to working on a Word file: if there is a sudden power outage, after turning the system back on we get an autosaved version of the document in MS Word. Failure recovery in Ambari works the same way.

• Security: The Ambari application can sync with LDAP over Active Directory, and it also comes with robust security.

Apache Ambari Advantages

These benefits are given with respect to the Hortonworks Data Platform (HDP). Basically, Ambari eliminates the manual tasks that used to be needed to watch over Hadoop operations. It gives a simple, secure platform for provisioning, managing and monitoring HDP deployments. In addition, it is an easy-to-use Hadoop management UI and is solidly backed by REST APIs.
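Since Ambari is backed by REST APIs, a minimal sketch of calling them from Java 11's built-in HttpClient is shown below for illustration only. The host name ambari-server, port 8080 and the admin:admin credentials are assumptions for a default installation and must be replaced with your own values; the /api/v1/clusters endpoint simply lists the clusters managed by the server.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class AmbariClusterCheck {
    public static void main(String[] args) throws Exception {
        // Assumed values: replace host, port and credentials with your own Ambari setup.
        String clustersUrl = "http://ambari-server:8080/api/v1/clusters";
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(clustersUrl))
                .header("Authorization", "Basic " + auth)
                .GET()
                .build();

        // The response body is JSON listing the clusters this Ambari server manages.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP status: " + response.statusCode());
        System.out.println(response.body());
    }
}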

Now, here is a list of several benefits of Apache Ambari:

• Simplified Installation, Configuration, and Management

Moreover, we can say that Ambari's performance is optimal thanks to its wizard-driven approach. Master, slave and client components are assigned when configuring services, which helps to install, start and test the cluster.
Running clusters can be updated on the go with maintenance releases and feature-bearing releases, thanks to the rolling-upgrade feature. In addition, there is no unnecessary downtime.
• Centralized Security and Administration

Ambari reduces the complexity of cluster security configuration and administration across the components of the Hadoop ecosystem. The tool also helps with the automated setup of advanced security constructs such as Kerberos and Ranger.

• Complete Visibility to Cluster Health

We can monitor our cluster's health and availability by using Ambari. For each service in the cluster, such as HDFS, YARN and HBase, an easily customized web-based dashboard provides metrics that give status information.
Users can also browse, search and filter alerts for their clusters through the browser interface. In addition, Ambari lets us view and modify alert properties and alert instances.

• Security

The Ambari application can sync with LDAP over Active Directory, and it also comes with robust security.

• Metrics visualization and dashboarding

Moreover, Ambari offers scalable, low-latency storage systems for Hadoop component metrics. Additionally, Grafana, a leading graph and dashboard builder, simplifies the process of reviewing metrics; it is included with Ambari Metrics as part of HDP.

• Customization

Because of Ambari, we can easily and gracefully run Hadoop in one's enterprise setup. It is backed by a large, innovative community that improves the tool and also eliminates vendor lock-in.
Stacks provide a natural extension point for operators to plug in newly created services that can run alongside Hadoop, and through Ambari Views third parties can plug in their own views.

• Open-source

Its best advantage is that any user can make an improvement or suggestion, since it is designed by the community. Because anyone can read the source code, contributors and customers can identify security vulnerabilities and fixes that make it into the product.

• Extensible

It is possible to extend the functionality of existing Ambari applications by including different view components.
Practical - 2
Aim: Implement the following Data structures in Java
i) Linked Lists ii) Stacks iii) Queues iv) Set v) Map

i) Linked List :

import java.util.*;

public class LinkedListDemo {
    public static void main(String args[]) {
        // create a linked list
        LinkedList<String> ll = new LinkedList<>();

        // add elements to the linked list
        ll.add("E");
        ll.add("J");
        ll.add("Q");
        ll.add("O");
        ll.add("P");
        ll.addLast("Z");
        ll.addFirst("A");
        ll.add(1, "F6");
        System.out.println("Original contents of ll: " + ll);

        // remove elements from the linked list
        ll.remove("Q");
        ll.remove(4);
        System.out.println("Contents of ll after deletion: " + ll);

        // remove first and last elements
        ll.removeFirst();
        ll.removeLast();
        System.out.println("ll after deleting first and last: " + ll);

        // get and set a value
        String val = ll.get(2);
        ll.set(2, val + " Changed");
        System.out.println("ll after change: " + ll);
    }
}

Output:
ii) Stacks Program:

import java.util.*;

public class StackDemo {
    static void showpush(Stack<Integer> st, int a) {
        st.push(a);
        System.out.println("push(" + a + ")");
        System.out.println("stack: " + st);
    }

    static void showpop(Stack<Integer> st) {
        System.out.print("pop -> ");
        Integer a = st.pop();
        System.out.println(a);
        System.out.println("stack: " + st);
    }

    public static void main(String args[]) {
        Stack<Integer> st = new Stack<>();
        System.out.println("stack: " + st);
        showpush(st, 86);
        showpush(st, 43);
        showpush(st, 21);
        showpop(st);
        showpop(st);
        showpop(st);

        try {
            showpop(st);
        } catch (EmptyStackException e) {
            System.out.println("empty stack");
        }
    }
}

Output:
iii) Queues:

import java.util.LinkedList;
import java.util.Queue;

public class QueueExample {
    public static void main(String[] args) {
        Queue<Integer> q = new LinkedList<>();

        // Adds elements {21, 22, 23, 24, 25} to the queue
        for (int i = 21; i < 26; i++)
            q.add(i);

        // Display contents of the queue
        System.out.println("Elements of queue-" + q);

        // Remove the head of the queue
        int removedele = q.remove();
        System.out.println("removed element-" + removedele);
        System.out.println(q);

        // View the head of the queue
        int head = q.peek();
        System.out.println("head of queue-" + head);

        // The rest of the Collection interface methods,
        // like size and contains, can be used with this implementation
        int size = q.size();
        System.out.println("Size of queue-" + size);
    }
}

Output:
iv) Set:

import java.util.*;

public class SetDemo {
    public static void main(String args[]) {
        int count[] = {37, 48, 63, 72, 84, 14};
        Set<Integer> set = new HashSet<>();
        try {
            for (int i = 0; i < 6; i++) {
                set.add(count[i]);
            }
            System.out.println(set);

            TreeSet<Integer> sortedSet = new TreeSet<>(set);
            System.out.println("The sorted list is:");
            System.out.println(sortedSet);
            System.out.println("The First element of the set is: " + sortedSet.first());
            System.out.println("The last element of the set is: " + sortedSet.last());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Output:
v) Map Program:

import java.awt.Color;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class MapDemo {
    public static void main(String[] args) {
        Map<String, Color> favoriteColors = new HashMap<>();
        favoriteColors.put("aaaa", Color.RED);
        favoriteColors.put("bbbb", Color.BLUE);
        favoriteColors.put("cccc", Color.GREEN);
        favoriteColors.put("dddd", Color.RED);

        // Print all keys and values in the map
        Set<String> keySet = favoriteColors.keySet();
        for (String key : keySet) {
            Color value = favoriteColors.get(key);
            System.out.println(key + " : " + value);
        }
    }
}

Output:
Practical - 3
Aim: To set up and install Hadoop.
Hadoop software can be installed in three modes of operation:

• Standalone Mode: Hadoop is distributed software and is designed to run on a cluster of commodity machines. However, we can install it on a single node in standalone mode. In this mode, the Hadoop software runs as a single monolithic Java process. This mode is extremely useful for debugging purposes: you can first test-run your MapReduce application in this mode on small data, before actually executing it on a cluster with big data.
• Pseudo-Distributed Mode: In this mode also, Hadoop is installed on a single node, but the various Hadoop daemons run on the same machine as separate Java processes. Hence all the daemons, namely NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker, run on a single machine.
• Fully Distributed Mode: In this mode, the daemons NameNode, JobTracker and SecondaryNameNode (optional, and can be run on a separate node) run on the master node, while the DataNode and TaskTracker daemons run on the slave nodes.

Hadoop Installation:

Step 1 - Download Hadoop binary package

Open PowerShell in the destination folder where you want to install Hadoop, and then run the following commands one by one:

$dest_dir="F:\big-data"
$url = "http://apache.mirror.digitalpacific.com.au/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz"
$client = new-object System.Net.WebClient
$client.DownloadFile($url,$dest_dir+"\hadoop-3.2.1.tar.gz")

Step 2 - Unpack the package


Step 3 - Install Hadoop native IO binary
Hadoop on Linux includes optional native IO support. However, native IO is mandatory on Windows, and without it you will not be able to get your installation working. The Windows native IO libraries are not included as part of the Apache Hadoop release, so we need to build and install them.
Download all the files from the following location and save them to the bin folder under the Hadoop folder. For my environment, the full path is F:\big-data\hadoop-3.2.1\bin. Remember to change it to your own path accordingly.
Step 4 - (Optional) Java JDK installation
Step 5 - Configure environment variables
Configure HADOOP_HOME environment variable
Similarly, we need to create a new environment variable for HADOOP_HOME using the following command. The path should be your extracted Hadoop folder. For my environment it is F:\big-data\hadoop-3.2.1.

If you used PowerShell to download and if the window is still open, you can simply run the
following command:

SETX HADOOP_HOME $dest_dir+"/hadoop-3.2.1"


The output looks like the following screenshot:

Close the PowerShell window, open a new one, and type winutils.exe directly to verify that the above steps completed successfully:
Step 6 - Configure Hadoop
Now we are ready to configure the most important part: the Hadoop configuration, which involves the Core, YARN, MapReduce and HDFS configurations.

Configure core site


Edit file core-site.xml in %HADOOP_HOME%\etc\hadoop folder. For my environment, the
actual path is F:\big-data\hadoop-3.2.1\etc\hadoop.

Replace configuration element with the following:

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://0.0.0.0:19000</value>
</property>
</configuration>
Configure HDFS
Edit file hdfs-site.xml in %HADOOP_HOME%\etc\hadoop folder.

Before editing, please create two folders in your system: one for the namenode directory and another for the data directory. For my system, I created the following two sub-folders:

• F:\big-data\data\dfs\namespace_logs
• F:\big-data\data\dfs\data

Replace the configuration element with the following (remember to replace the paths accordingly):

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///F:/big-data/data/dfs/namespace_logs</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///F:/big-data/data/dfs/data</value>
</property>
</configuration>

Configure MapReduce and YARN site


Edit file mapred-site.xml in %HADOOP_HOME%\etc\hadoop folder.

Replace configuration element with the following:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>

<value>%HADOOP_HOME%/share/hadoop/mapreduce/*,%HADOOP_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_HOME%/share/hadoop/common/*,%HADOOP_HOME%/share/hadoop/common/lib/*,%HADOOP_HOME%/share/hadoop/yarn/*,%HADOOP_HOME%/share/hadoop/yarn/lib/*,%HADOOP_HOME%/share/hadoop/hdfs/*,%HADOOP_HOME%/share/hadoop/hdfs/lib/*</value>
</property>
</configuration>
Edit file yarn-site.xml in %HADOOP_HOME%\etc\hadoop folder.

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>

<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>

Step 7 - Initialise HDFS & bug fix


Run the following command in Command Prompt

hdfs namenode -format


Once this is done, the format command (hdfs namenode -format) will show something
like the following:
Practical - 4
Aim: Run a basic Word Count MapReduce program to understand the MapReduce paradigm.

Source code:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                value.set(tokenizer.nextToken());
                context.write(value, new IntWritable(1));
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable x : values) {
                sum += x.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "My Word Count Program");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        Path outputPath = new Path(args[1]);

        // Configuring the input/output paths from the file system into the job
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Deleting the output path automatically (recursively) from HDFS
        // so that we don't have to delete it explicitly
        outputPath.getFileSystem(conf).delete(outputPath, true);

        // Exiting the job only if the flag value becomes false
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

We have created a class Map that extends the Mapper class already defined in the MapReduce framework. We define the data types of the input and output key/value pairs after the class declaration, using angle brackets. Both the input and the output of the Mapper are key/value pairs: here the Mapper receives a (LongWritable byte offset, Text line) pair and emits a (Text word, IntWritable 1) pair for every token in the line.

Run the MapReduce code:

The command for running the MapReduce code is:
hadoop jar hadoop-mapreduce-example.jar WordCount /sample/input /sample/output
Practical - 5
Aim: Implement the following file management tasks in Hadoop

• Adding files and directories
• Retrieving files
• Deleting files

Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS using one of the command-line utilities below.

1. Create a directory in HDFS at the given path(s).
Usage:
hadoop fs -mkdir <paths>
Example:
hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2

2. List the contents of a directory.
Usage:
hadoop fs -ls <args>
Example:
hadoop fs -ls /user/saurzcode

3. Upload and download a file in HDFS.

Upload:
hadoop fs -put: Copy a single src file, or multiple src files, from the local file system to the Hadoop file system.
Usage:
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Example:
hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/

Download:
hadoop fs -get: Copies/downloads files to the local file system.
Usage:
hadoop fs -get <hdfs_src> <localdst>
Example:
hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/

4. See the contents of a file (same as the Unix cat command).
Usage:
hadoop fs -cat <path[filename]>
Example:
hadoop fs -cat /user/saurzcode/dir1/abc.txt

5. Copy a file from source to destination.
This command allows multiple sources as well, in which case the destination must be a directory.
Usage:
hadoop fs -cp <source> <dest>
Example:
hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

6. Copy a file from/to the local file system to/from HDFS.

copyFromLocal
Usage:
hadoop fs -copyFromLocal <localsrc> URI
Example:
hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt
Similar to the put command, except that the source is restricted to a local file reference.

copyToLocal
Usage:
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to the get command, except that the destination is restricted to a local file reference.

7. Move a file from source to destination.
Note: Moving files across file systems is not permitted.
Usage:
hadoop fs -mv <src> <dest>
Example:
hadoop fs -mv /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2
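The same adding, retrieving and deleting tasks can also be performed programmatically through the org.apache.hadoop.fs.FileSystem API. The following is a minimal sketch only; the paths reuse the examples above and are assumptions to be adapted to your own environment.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileManagement {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath,
        // so the default file system points at the running cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Adding a directory and a file (paths are illustrative).
        fs.mkdirs(new Path("/user/saurzcode/dir3"));
        fs.copyFromLocalFile(new Path("/home/saurzcode/Samplefile.txt"),
                             new Path("/user/saurzcode/dir3/Samplefile.txt"));

        // Retrieving the file back to the local file system.
        fs.copyToLocalFile(new Path("/user/saurzcode/dir3/Samplefile.txt"),
                           new Path("/home/saurzcode/Samplefile-copy.txt"));

        // Deleting the file (second argument: recursive delete).
        fs.delete(new Path("/user/saurzcode/dir3/Samplefile.txt"), false);

        fs.close();
    }
}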
EXPERIMENT-10
AIM: Installing ANACONDA

1. Download the Anaconda installer.
2. RECOMMENDED: Verify data integrity with SHA-256. For more information on hashes, see "What about cryptographic hash verification?"
3. Double-click the installer to launch it.
4. Click Next.
5. Read the licensing terms and click "I Agree".
6. Select an install for "Just Me" unless you're installing for all users (which requires Windows Administrator privileges) and click Next.
7. Select a destination folder to install Anaconda and click the Next button. See the FAQ.

8. Choose whether to add Anaconda to your PATH environment variable. We recommend not adding Anaconda to the PATH environment variable, since this can interfere with other software. Instead, use Anaconda software by opening Anaconda Navigator or the Anaconda Prompt from the Start Menu.
9. Choose whether to register Anaconda as your default Python. Unless you plan on
installing and running multiple versions of Anaconda or multiple versions of Python,
accept the default and leave this box checked.
10. Click the Install button. If you want to watch the packages Anaconda is
installing, click Show Details.
11. Click the Next button.
12. Optional: To install PyCharm for Anaconda, click on the link
to https://www.anaconda.com/pycharm.
Or to install Anaconda without PyCharm, click the Next button.
13. After a successful installation you will see the “Thanks for installing
Anaconda” dialog box:
14. If you wish to read more about Anaconda.org and how to get started with
Anaconda, check the boxes “Anaconda Individual Edition Tutorial” and “Learn
more about Anaconda”. Click the Finish button.
15. Verify your installation.
EXPERIMENT-11
AIM: To read data from a dataset taken from Kaggle.
Kaggle is the world's largest data science community with powerful tools and resources
to help you achieve your data science goals.
Procedure:
1. Open Anaconda and launch the Jupyter tool.
2. It will launch Jupyter in a browser.
3. Now open Kaggle in the browser with this link: https://www.kaggle.com/datasets.
4. Download any dataset of your interest.
5. Save it in any of your directories and then import it from the Jupyter home page.
6. Then click on New and select Python from the given options.
7. Use the following commands to get the shape of the data:
import pandas as pd
a = pd.read_csv('tested_worldwide.csv')
a.shape

8. To get summary statistics of the data, use the Python command:
a.describe()

9. To delete a column, the Python command is:
b = a.drop(columns=['active'])
This will delete the 'active' column, and we can again use the describe command to check the table.
