8CS4-21 BDA Lab - Dr. Varun P Saxena
LAB MANUAL
B.Tech. VIII Semester
CONTENTS

Lab Instructions
Lab Introduction
1. Implement basic data structures using the Java Collections Framework: LinkedList, Stack, Queue, and Map.
2. Perform setting up and installing Hadoop in its three operating modes: Standalone, Pseudo-distributed, Fully distributed.
3. Implement the following file management tasks in Hadoop: (i) Adding files and directories (ii) Retrieving files (iii) Deleting files.
4. Run a basic Word Count Map Reduce program to understand the Map Reduce paradigm.
5. Write a Map Reduce program that mines weather data.
6. Implement Matrix Multiplication with Hadoop Map Reduce.
7. Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.
8. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
9. Solve some real-life big data problems.
Lab Instructions
1. Keep silence in the lab.
2. Always follow the instructions given by the lab staff to perform the assigned experiment.
3. Do not switch on the supply of the panel until the circuit has been checked by the lab staff.
4. Every student is responsible for any damage to the panel or any component assigned for lab work.
5. Do not go to another table to assist any student without the permission of the lab staff; doing so may be punished.
6. Always come to the lab with your file, and your file work should be completed.
7. Always get your file work checked and signed on time (the experiment performed in the previous lab must positively be checked); after that, faculty may not check your work.
BTU SYLLABUS
1. Implement basic data structures using the Java Collections Framework: LinkedList, Stack, Queue, and Map.
2. Perform setting up and installing Hadoop in its three operating modes: Standalone, Pseudo-distributed, Fully distributed.
3. Implement the following file management tasks in Hadoop: (i) Adding files and directories (ii) Retrieving files (iii) Deleting files.
4. Run a basic Word Count Map Reduce program to understand the Map Reduce paradigm.
5. Write a Map Reduce program that mines weather data.
6. Implement Matrix Multiplication with Hadoop Map Reduce.
7. Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.
8. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
9. Solve some real-life big data problems.
LAB INTRODUCTION
The Big Data Analytics (BDA) Lab focuses on large-scale data analytics problems that arise in different application domains and disciplines, starting with fundamental concepts. The aim of the BDA lab is to give users hands-on experience with the Hadoop framework. Users are exposed to the installation of Hadoop and its related components such as YARN, Hive, and Pig, and write programs based on MapReduce, the programming model of the Hadoop framework. Along with this, users work with basic Pig and Hive commands and analyze data using both Pig and Hive.
EXPERIMENT No. 1
Objective: Implement basic data structures using the Java Collections Framework: LinkedList, Stack, Queue, and Map.
1.1 LinkedList
// Java code for LinkedList demonstration
import java.util.*;
public class LL {
    public static void main(String args[])
    {
        // Elements must be added before removal; the original listing
        // omitted this setup (sample values assumed)
        LinkedList<String> ll = new LinkedList<String>();
        ll.add("A");
        ll.add("B");
        ll.add("C");
        ll.add("D");
        ll.add("E");
        ll.remove("B");    // remove by value
        ll.remove(3);      // remove by index
        ll.removeFirst();  // remove head
        ll.removeLast();   // remove tail
        System.out.println(ll);
    }
}
1.2 Stack
// Java code for stack implementation
import java.io.*;
import java.util.*;

class Test
{
    // Push five elements onto the stack
    static void stack_push(Stack<Integer> stack)
    {
        for(int i = 0; i < 5; i++)
        {
            stack.push(i);
        }
    }

    // Search for an element; search() returns the 1-based
    // position from the top of the stack, or -1 if absent
    static void stack_search(Stack<Integer> stack, int element)
    {
        int pos = stack.search(element);
        if(pos == -1)
            System.out.println("Element not found");
        else
            System.out.println("Element is found at position: " + pos);
    }

    public static void main(String[] args)
    {
        Stack<Integer> stack = new Stack<Integer>();
        stack_push(stack);
        stack_search(stack, 3);
    }
}
1.3 Queue
// Java program to demonstrate a Queue
import java.util.LinkedList;
import java.util.Queue;

class QueueExample
{
    public static void main(String[] args)
    {
        // The original listing was truncated; a minimal body is
        // reconstructed here (sample values assumed)
        Queue<Integer> q = new LinkedList<Integer>();
        for(int i = 0; i < 5; i++)
            q.add(i);                    // enqueue 0..4
        System.out.println(q.peek());    // head of queue: 0
        System.out.println(q.remove());  // dequeue: 0
        System.out.println(q);
    }
}
1.5 Map
// Java Program to Demonstrate
// Working of Map interface
import java.util.*;

class MapExample
{
    public static void main(String[] args)
    {
        Map<String, Integer> hm = new HashMap<String, Integer>();
        hm.put("a", 100);   // autoboxing replaces the deprecated new Integer(...)
        hm.put("b", 200);
        hm.put("c", 300);
        hm.put("d", 400);
        // Print each key/value pair
        for (Map.Entry<String, Integer> e : hm.entrySet())
            System.out.println(e.getKey() + " -> " + e.getValue());
    }
}
Viva Voce:
1. What is the Collection Framework?
2. What is a LinkedList?
3. What is a Set/SortedSet?
4. What is a Map?
5. How do you create a Queue/Stack in Java?
References:
EXPERIMENT No. 2
Objective: Perform setting up and installing Hadoop in its three operating modes: Standalone, Pseudo-distributed, Fully distributed.
The first step is to check whether SSH is installed on your nodes. We can easily do this by use of the "which" UNIX command:
user@user-ThinkCentre-E73:~$ which ssh
If SSH is not found, the next step is to install the openssh server by typing the following command in the terminal:
user@user-ThinkCentre-E73:~$ sudo apt-get install openssh-server
user@user-ThinkCentre-E73:~$ cat /home/user/.ssh/id_rsa.pub >> /home/user/.ssh/authorized_keys
user@user-ThinkCentre-E73:~$ exit
11. Open the bashrc file and add the following lines at the end of the file.
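The exact lines depend on where the JDK and Hadoop were unpacked; assuming the same paths used later in this manual (under /home/user/Downloads), a typical addition is:
export JAVA_HOME=/home/user/Downloads/jdk1.8.0_77
export HADOOP_HOME=/home/user/Downloads/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin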
12. Now apply all the changes to the current running system by typing the following command in the terminal.
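Assuming a bash shell, the usual command is:
user@user-ThinkCentre-E73:~$ source ~/.bashrc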
13. Now verify the Java environment by giving the following command in the terminal.
user@user-ThinkCentre-E73:~$ echo $JAVA_HOME
14. Now verify the Hadoop version by giving the following command in the terminal.
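Assuming the PATH set above, the check is:
user@user-ThinkCentre-E73:~$ hadoop version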
2.2 Pseudo-Distributed Mode
The pseudo-distributed mode runs Hadoop in a “cluster of one”, with all daemons running on a single machine. This mode complements the standalone mode for debugging your code, allowing you to examine memory usage, HDFS input/output issues, and other daemon interactions. To work in pseudo-distributed mode, in addition to the above process, the four XML files (core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml) must be updated as below.
2.2.1 core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
2.2.2 hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
2.2.3 mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
2.2.4 yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
In core-site.xml and mapred-site.xml we specify the hostname and port of the NameNode and the JobTracker, respectively. In hdfs-site.xml we specify the default replication factor for HDFS, which should be one because we are running on only one node. The yarn-site.xml file configures YARN for Hadoop.
The hadoop-env.sh file contains other variables for defining your Hadoop environment. To set
environment variables we have to add following lines at the end of file in hadoop-env.sh.
export JAVA_HOME=/home/user/Downloads/jdk1.8.0_77
export HADOOP_HOME=/home/user/Downloads/hadoop
2.3 Fully Distributed Mode
1. First decide how many nodes are to be grouped. Assume one of them as master and the others as slave1 and slave2 respectively. You can download the hadoop-2.7.2.tar.gz file for Hadoop.
2. Now open the hostname file on each node. Remove the old name and type the new name as master, slave1 or slave2 respectively. Then save and exit. To apply these changes, restart the master and slave nodes.
3. Add the association between the hostnames and the IP addresses for the master and the slaves on all the nodes in the /etc/hosts file. Make sure that all the nodes in the cluster are able to ping each other.
user@master:~$ cd /etc
4. Next we need to copy the public key to every slave node as well as the master node.
Make sure that the master is able to do a password-less ssh to all the slaves.
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub user@master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub user@slave1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub user@slave2
$ chmod 0600 ~/.ssh/authorized_keys
$ exit
5. Open the bashrc file and add the following lines at the end of the file.
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/user/Downloads/hadoop/tmp</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8040</value>
</property>
</configuration>
10. Use the following commands to copy Hadoop from the master node to the slave nodes.
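The exact command depends on your directory layout; assuming Hadoop was unpacked to /home/user/Downloads/hadoop as above, one way is:
user@master:~$ scp -r /home/user/Downloads/hadoop user@slave1:/home/user/Downloads/
user@master:~$ scp -r /home/user/Downloads/hadoop user@slave2:/home/user/Downloads/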
Viva Voce:
1. What are the various modes of Hadoop?
2. What are the various configuration files in Hadoop?
3. Explain all three modes of Hadoop installation.
References:
EXPERIMENT No. 3
Objective: Implement the following file management tasks in Hadoop: (i) Adding files and directories (ii) Retrieving files (iii) Deleting files.
i) cat
HDFS command that reads a file on HDFS and prints the content of that file to the standard output:
hdfs dfs -cat /new_edureka/test
ii) mkdir
HDFS command to create a directory in HDFS.
iii) cp
HDFS Command to copy files from source to destination. This command allows
multiple sources as well, in which case the destination must be a directory.
iv) rm
HDFS command to remove a file or a directory from HDFS.
v) get
HDFS command to copy files from HDFS to the local file system.
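Illustrative usages of these commands (the paths are examples only):
hdfs dfs -mkdir /new_edureka
hdfs dfs -cp /new_edureka/test /new_edureka/test_copy
hdfs dfs -rm /new_edureka/test_copy
hdfs dfs -get /new_edureka/test /home/user/test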
Viva Voce:
1. How do local and distributed file systems differ?
2. How do you copy a file from the local file system to HDFS?
3. Explain the various commands used in HDFS.
References:
EXPERIMENT No. 4
Objective: Run a basic Word Count Map Reduce program to understand the Map Reduce paradigm.
// Mapper Code
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
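// The mapper class body is not shown above; a minimal sketch consistent
// with these imports (the class name WC_Mapper is illustrative):
public class WC_Mapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // Emit (word, 1) for every whitespace-separated token in the line
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                output.collect(word, one);
            }
        }
    }
}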
//Reducer
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
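// The reducer class body is not shown above; a minimal sketch consistent
// with these imports (the class name WC_Reducer is illustrative):
public class WC_Reducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        // Sum all counts emitted for this word
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}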
//Driver
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
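// The class declaration and the opening of run() are not shown above;
// a minimal sketch (the class name WordCount is illustrative) that
// leads into the surviving fragment below:
public class WordCount extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setJobName("wordcount");
        conf.setMapperClass(WC_Mapper.class);
        conf.setReducerClass(WC_Reducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));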
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
return 0;
}
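    // A sketch of the main() entry point, launching run() via ToolRunner:
    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCount(), args));
    }
}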
Viva Voce:
1. How do map and reduce differ?
2. Explain combiner and partitioner.
3. Explain the various commands used in HDFS.
References:
EXPERIMENT No. 5
Objective: Write a Map Reduce program that mines weather data.
//Mapper
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
// Reducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}
// Driver
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(MaxTemperature.class);
        job.setJobName("Max temperature");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Viva Voce:
1. How do map and reduce differ?
2. Explain combiner and partitioner.
3. Explain weather data mining in Hadoop.
References:
EXPERIMENT No. 6
Objective: Implement Matrix Multiplication with Hadoop Map Reduce.
// Mapper
package com.lendap.hadoop;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
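// The mapper body is not shown above; a sketch reconstructed to lead
// into the surviving fragment below. It assumes the matrix dimensions
// "m" (rows of M) and "p" (columns of N) are set in the job
// configuration, and that input records look like "M,i,j,value" or
// "N,j,k,value".
public class Map extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m")); // rows of M
        int p = Integer.parseInt(conf.get("p")); // columns of N
        String[] indicesAndValue = value.toString().split(",");
        Text outputKey = new Text();
        Text outputValue = new Text();
        if (indicesAndValue[0].equals("M")) {
            // Element M(i,j) contributes to every output cell (i,k)
            for (int k = 0; k < p; k++) {
                outputKey.set(indicesAndValue[1] + "," + k);
                outputValue.set("M," + indicesAndValue[2] + "," + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            }
        } else {
            // Element N(j,k) contributes to every output cell (i,k)
            for (int i = 0; i < m; i++) {
                outputKey.set(i + "," + indicesAndValue[2]);
                outputValue.set("N," + indicesAndValue[1] + "," + indicesAndValue[3]);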
context.write(outputKey, outputValue);
}
}
}
}
// Reducer
package com.lendap.hadoop;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.HashMap;
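// The reducer body is not shown above; a sketch consistent with the
// mapper output, assuming the shared dimension "n" is set in the job
// configuration:
public class Reduce extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Collect M and N entries for this output cell, keyed by the
        // shared index j
        HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
        HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
        for (Text val : values) {
            String[] v = val.toString().split(",");
            if (v[0].equals("M")) {
                hashA.put(Integer.parseInt(v[1]), Float.parseFloat(v[2]));
            } else {
                hashB.put(Integer.parseInt(v[1]), Float.parseFloat(v[2]));
            }
        }
        int n = Integer.parseInt(context.getConfiguration().get("n"));
        float result = 0.0f;
        for (int j = 0; j < n; j++) {
            float a = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
            float b = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
            result += a * b;
        }
        if (result != 0.0f) {
            // Emit "i,k,value" for nonzero cells (sparse output)
            context.write(null, new Text(key.toString() + "," + result));
        }
    }
}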
//Driver
package com.lendap.hadoop;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
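// The opening of the driver is not shown above; a sketch (the class
// name MatrixMultiply and the matrix dimensions are illustrative) that
// leads into the surviving fragment below:
public class MatrixMultiply {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // M is an m x n matrix and N is an n x p matrix (example sizes)
        conf.set("m", "1000");
        conf.set("n", "100");
        conf.set("p", "1000");
        Job job = new Job(conf, "MatrixMultiply");
        job.setJarByClass(MatrixMultiply.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);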
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
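        // Input and output paths taken from the command line (sketch)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));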
job.waitForCompletion(true);
}
}
Viva Voce:
1. How do map and reduce differ?
2. Explain the driver program in this experiment.
3. What is the sparse matrix format?
References:
EXPERIMENT No. 7
Objective: Install and Run Pig then write Pig Latin scripts to sort, group, join, project,
and filter your data.
Table/Relation
4. Group operator
5. Join Operator
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
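Before the group and join operators can be demonstrated, the two files must be loaded as relations; an illustrative sketch (the field names are assumptions based on the data above):
grunt> customers = LOAD 'customers.txt' USING PigStorage(',')
   AS (id:int, name:chararray, age:int, city:chararray, salary:double);
grunt> orders = LOAD 'orders.txt' USING PigStorage(',')
   AS (oid:int, date:chararray, customer_id:int, amount:int);
A sketch of the Group operator (item 4 above) on these relations:
grunt> group_by_age = GROUP customers BY age;
grunt> DUMP group_by_age;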
Self-join
Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least one relation. Assume two identical relations, customers1 and customers2, both loaded from the customers.txt file.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when
there is a match in both tables.
grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
Outer Join: Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways: left outer join, right outer join, and full outer join.
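Illustrative sketches of the three forms, using the relations loaded above:
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
grunt> outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;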
6. Filter Operator
7. Order by Operator:
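Illustrative sketches for the filter and order-by operators, using the customers relation loaded above:
grunt> salary_above_3k = FILTER customers BY salary > 3000.0;
grunt> by_salary_desc = ORDER customers BY salary DESC;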
Viva Voce:
1. How do Pig and Pig Latin differ?
2. What are the two modes of Pig execution?
3. Explain the Pig architecture.
References:
EXPERIMENT No. 8
Objective: Install and Run Hive then use Hive to create, alter, and drop databases,
tables, views, functions, and indexes.
4. Create Table
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
6. Alter Table
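An illustrative sketch of alter statements on the employee table created above:
hive> ALTER TABLE employee RENAME TO emp;
hive> ALTER TABLE emp ADD COLUMNS (dept String COMMENT 'Department name');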
7. Drop Table
hive> DROP TABLE IF EXISTS employee;
Viva Voce:
1. How does Hive work?
2. What are the various operators used in Hive?
3. Explain the Hive architecture.
References:
EXPERIMENT No. 9
Objective: Solve some real-life big data problems.
1. Example 1
The University of Alabama has more than 38,000 students and an ocean of data. In the past, when there were no real solutions to analyze that much data, much of it went unused. Now, administrators can use analytics and data visualizations on this data to draw out patterns of students, revolutionizing the university's operations, recruitment, and retention efforts.
2. Example 2
Wearable devices and sensors have been introduced in the healthcare industry which can provide real-time feeds to a patient's electronic health record. One such technology comes from Apple, which has developed Apple HealthKit, CareKit, and ResearchKit. The main goal is to empower iPhone users to store and access their real-time health records on their phones.
3. Example 3
The Food and Drug Administration (FDA), which runs under the jurisdiction of the Federal Government of the USA, leverages the analysis of big data to discover patterns and associations that identify and examine expected or unexpected occurrences of food-based infections.
4. Example 4
Spotify, an on-demand music platform, uses Big Data Analytics: it collects data from all its users around the globe and then uses the analyzed data to give informed music recommendations and suggestions to every individual user. Amazon Prime, which offers videos, music, and Kindle books in a one-stop shop, is also big on using big data.
5. Example 5
IBM Deep Thunder, a research project by IBM, provides weather forecasting through high-performance computing of big data. IBM is also assisting Tokyo with improved weather forecasting for natural disasters and predicting the probability of damaged power lines.
6. Example 6
Uber generates and uses a huge amount of data regarding drivers, their vehicles, locations, every trip from every vehicle, etc. All this data is analyzed and then used to predict supply, demand, the location of drivers, and the fares to be set for every trip.
And guess what? We too make use of this kind of analysis when we choose a route to save fuel and time, based on our knowledge of having taken that particular route sometime in the past. In this case, we analyzed and made use of the data we had previously acquired through experience, and then used it to make a smart decision. It's pretty cool that big data plays a part not only in big fields but also in our smallest day-to-day decisions.
Viva Voce:
1. Think of some problems of Big Data that Hadoop can solve.
2. How do Pig and Hive differ?
3. Discuss some solutions to big data problems in the healthcare industry.
References: