ChiragMangla - Hadoop Architecture
CBD 3115
Aim: To learn the concept of Apache Ambari, its features and benefits.
Theory:
Apache Ambari is an open-source administration tool responsible for keeping track of running applications and their status.
To visualize the progress and status of every application running over the Hadoop cluster, Ambari offers administrators a highly interactive dashboard.
In addition, it provides a flexible and scalable user interface that allows a range of tools, for example Pig, MapReduce, Hive, and many more, to be installed on the cluster, and it administers their performance in a user-friendly fashion.
The key components of this technology are:
• Ambari Server
• Ambari Agent
• Ambari Web UI
• Database
Features of Ambari
• Failure recovery: Think of it this way: if we are working on a Word document and there is a sudden power outage, then after turning the system back on we get an autosaved version of the document in MS Word. The failure recovery concept in Ambari works the same way.
• Security: Ambari can sync with LDAP over Active Directory, and the application also comes with robust security.
These benefits are described with respect to the Hortonworks Data Platform (HDP). Basically, Ambari eliminates the manual tasks that were previously needed to watch over Hadoop operations. It gives a simple, secure platform for provisioning, managing, and monitoring HDP deployments. In addition, it is an easy-to-use Hadoop management UI that is solidly backed by REST APIs.
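For example, cluster information can be queried through the REST API with a simple HTTP call; the host name and credentials below are placeholders (Ambari's default port is 8080):
curl -u admin:admin http://ambari-host.example.com:8080/api/v1/clusters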
Moreover, we can say that Ambari's performance is optimal due to its wizard-driven approach. Basically, master, slave, and client components are assigned while configuring services, which helps to install, start, and test the cluster.
Thanks to its rolling upgrade feature, running clusters can be updated on the go with maintenance releases and feature-bearing releases, without unnecessary downtime.
• Monitoring and Alerting
By using Ambari, we can monitor our cluster's health and availability. For each service in the cluster, such as HDFS, YARN, and HBase, an easily customized web-based dashboard displays metrics that give status information.
Users can also browse alerts for their clusters, and search and filter alerts, through the browser interface. In addition, with Ambari we can view and modify alert properties and alert instances.
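Alerts are exposed through the same REST interface; a call like the following (cluster name, host, and credentials are placeholders) returns the alerts defined for a cluster:
curl -u admin:admin http://ambari-host.example.com:8080/api/v1/clusters/MyCluster/alerts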
• Security
Over Active Directory, Ambari can sync with LDAP, and the application also comes with robust security.
Moreover, for Hadoop component metrics, Ambari offers a scalable, low-latency storage system. Additionally, Grafana, a leading graph and dashboard builder, simplifies the metrics reviewing process; it ships with Ambari Metrics as part of HDP.
• Customization
• Open-source
As its biggest advantage, any user can make an improvement or suggestion to it, since it is designed by committee. Because anyone can read the source code, contributors and customers can identify security vulnerabilities and contribute fixes that make it into the product.
• Extensible
i) Linked List:
import java.util.*;
public class LinkedListDemo {
    public static void main(String args[]) {
        // create a linked list and add a few elements
        LinkedList<String> ll = new LinkedList<>();
        ll.add("A");
        ll.addFirst("B");
        System.out.println("Contents of ll: " + ll);
    }
}
Output:
ii) Stacks Program:
import java.util.*;
public class StackDemo {
    static void showpush(Stack<Integer> st, int a) {
        st.push(a);
        System.out.println("push(" + a + ")");
        System.out.println("stack: " + st);
    }
    static void showpop(Stack<Integer> st) {
        // pop throws EmptyStackException when the stack is empty
        Integer a = st.pop();
        System.out.println("pop -> " + a);
        System.out.println("stack: " + st);
    }
    public static void main(String args[]) {
        Stack<Integer> st = new Stack<>();
        showpush(st, 42);
        showpush(st, 66);
        showpop(st);
        showpop(st);
        try {
            showpop(st);
        } catch (EmptyStackException e) {
            System.out.println("empty stack");
        }
    }
}
Output:
iii) Queues:
import java.util.LinkedList;
import java.util.Queue;
public class QueueExample {
    public static void main(String[] args) {
        Queue<Integer> q = new LinkedList<>(); // FIFO queue backed by a linked list
        q.add(10);
        q.add(20);
        System.out.println("head of queue: " + q.poll());
    }
}
Output:
iv) Set:
import java.util.*;
public class SetDemo {
    public static void main(String args[]) {
        int count[] = {37, 48, 63, 72, 84, 14};
        Set<Integer> set = new HashSet<Integer>();
        try {
            // add each array element to the set
            for (int i = 0; i < 6; i++) {
                set.add(count[i]);
            }
            System.out.println(set);
            // a TreeSet keeps its elements in sorted order
            TreeSet<Integer> sortedSet = new TreeSet<Integer>(set);
            System.out.println("The sorted list is:");
            System.out.println(sortedSet);
            System.out.println("The first element of the set is: " + sortedSet.first());
            System.out.println("The last element of the set is: " + sortedSet.last());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Output:
v) Map Program:
import java.awt.Color;
import java.util.HashMap;
import java.util.Map;
public class MapDemo {
    public static void main(String[] args) {
        // map each name to a favourite colour
        Map<String, Color> favoriteColors = new HashMap<String, Color>();
        favoriteColors.put("aaaa", Color.RED);
        favoriteColors.put("bbbb", Color.BLUE);
        favoriteColors.put("cccc", Color.GREEN);
        favoriteColors.put("dddd", Color.RED);
        // print every key/value pair
        for (Map.Entry<String, Color> entry : favoriteColors.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
Output:
Practical - 3
Aim: To set up and install Hadoop.
Hadoop software can be installed in three modes of operation: standalone (local) mode, pseudo-distributed mode, and fully distributed mode.
Hadoop Installation:
$dest_dir="F:\big-data"
$url = "http://apache.mirror.digitalpacific.com.au/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz"
$client = new-object System.Net.WebClient
$client.DownloadFile($url,$dest_dir+"\hadoop-3.2.1.tar.gz")
If you used PowerShell to download and if the window is still open, you can simply run the
following command:
Close the PowerShell window, open a new one, and type winutils.exe directly to verify that the above steps completed successfully:
Step 6 - Configure Hadoop
Now we are ready to configure the most important part: the Hadoop configuration files, which cover the Core, HDFS, MapReduce, and YARN settings.
Configure Core
Edit file core-site.xml in %HADOOP_HOME%\etc\hadoop folder and replace the configuration element with the following:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://0.0.0.0:19000</value>
</property>
</configuration>
Configure HDFS
Edit file hdfs-site.xml in %HADOOP_HOME%\etc\hadoop folder.
Before editing, please create two folders on your system: one for the namenode directory and another for the data directory. For my system, I created the following two sub-folders:
• F:\big-data\data\dfs\namespace_logs
• F:\big-data\data\dfs\data
Replace the configuration element with the following (remember to change the paths below to match your own folders):
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///F:/big-data/data/dfs/namespace_logs</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///F:/big-data/data/dfs/data</value>
</property>
</configuration>
Configure MapReduce
Edit file mapred-site.xml in %HADOOP_HOME%\etc\hadoop folder and replace the configuration element with the following:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>%HADOOP_HOME%/share/hadoop/mapreduce/*,%HADOOP_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_HOME%/share/hadoop/common/*,%HADOOP_HOME%/share/hadoop/common/lib/*,%HADOOP_HOME%/share/hadoop/yarn/*,%HADOOP_HOME%/share/hadoop/yarn/lib/*,%HADOOP_HOME%/share/hadoop/hdfs/*,%HADOOP_HOME%/share/hadoop/hdfs/lib/*</value>
</property>
</configuration>
Configure YARN
Edit file yarn-site.xml in %HADOOP_HOME%\etc\hadoop folder and replace the configuration element with the following:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
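Once these files are saved, the usual next steps are to format the namenode and start the HDFS and YARN daemons. A minimal sketch of those commands (assuming %HADOOP_HOME% is set as described above):
hdfs namenode -format
%HADOOP_HOME%\sbin\start-dfs.cmd
%HADOOP_HOME%\sbin\start-yarn.cmd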
Source code:
// Configuring the input/output path from the file system into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// deleting the output path automatically from HDFS so that we don't have to delete it explicitly
outputPath.getFileSystem(conf).delete(outputPath);
// exiting the job only if the flag value becomes false
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
We have created a class Map that extends the class Mapper, which is already defined in the MapReduce framework. We define the data types of the input and output key/value pairs after the class declaration using angle brackets. Both the input and the output of the Mapper are key/value pairs.
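The Map class itself is not reproduced in this excerpt; a minimal word-count-style sketch of such a mapper is shown below (the class and field names are illustrative, not the original source):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key/value: byte offset and line of text; output key/value: word and a count of 1
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // split the input line into words and emit (word, 1) for each
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, one);
        }
    }
}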
Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and
copies them into HDFS using one of the above command line utilities.
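For example, the upload direction uses hadoop fs -put; the paths below are placeholders only.
Usage:
hadoop fs -put <localsrc> <hdfs_dst>
Example:
hadoop fs -put /home/access.log /user/saurzcode/dir3/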
Download:
hadoop fs -get: Copies/Downloads files to the local file system
Usage:
hadoop fs -get <hdfs_src> <localdst>
Example:
hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/
6. Then click on New and select Python from the given options.
7. Use the following commands to get the shape of the dataset:
import pandas as pd
# load the dataset and display its (rows, columns) shape
a = pd.read_csv('tested_worldwide.csv')
a.shape