8CS4-21 BDA Lab - Dr. Varun P Saxena
LAB MANUAL
B.Tech. VIII Semester
CONTENTS

Lab Instructions
Lab Introduction
1. Implement basic data structures using the Java Collections Framework: LinkedList, Stack, Queue, and Map.
2. Perform setting up and installing Hadoop in its three operating modes: Standalone, Pseudo-distributed, Fully distributed.
3. Implement the following file management tasks in Hadoop: (i) Adding files and directories (ii) Retrieving files (iii) Deleting files.
4. Run a basic Word Count Map Reduce program to understand the Map Reduce paradigm.
5. Write a Map Reduce program that mines weather data.
6. Implement Matrix Multiplication with Hadoop Map Reduce.
7. Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.
8. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
9. Solve some real-life big data problems.
Lab Instructions
1. Keep silence in the lab.
2. Always follow the instructions given by the lab staff to perform the assigned experiment.
3. Do not switch on the supply of the panel until the circuit has been checked by the lab staff.
4. Every student is responsible for any damage to the panel or any component assigned for lab work.
5. Do not go to another table to assist any student without the permission of the lab staff; doing so may be punished.
6. Always come to the lab with your file, and your file work should be completed.
7. Always get your file work checked and signed on time (the experiment performed in the previous lab must positively be checked); after that, faculty may not check your work.
BTU SYLLABUS
1. Implement basic data structures using the Java Collections Framework: LinkedList, Stack, Queue, and Map.
2. Perform setting up and installing Hadoop in its three operating modes: Standalone, Pseudo-distributed, Fully distributed.
3. Implement the following file management tasks in Hadoop: (i) Adding files and directories (ii) Retrieving files (iii) Deleting files.
4. Run a basic Word Count Map Reduce program to understand the Map Reduce paradigm.
5. Write a Map Reduce program that mines weather data.
6. Implement Matrix Multiplication with Hadoop Map Reduce.
7. Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.
8. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
9. Solve some real-life big data problems.
LAB INTRODUCTION
The Big Data Analytics (BDA) Lab focuses on large-scale data analytics problems that arise in different application domains and disciplines, starting with fundamental concepts. The aim of the BDA lab is to give users hands-on experience with the Hadoop framework. Users are exposed to the installation of Hadoop and its related components such as YARN, Hive, and Pig, and write programs based on MapReduce, the programming model of the Hadoop framework. Along with this, users work with basic Pig and Hive commands and analyze data using both Pig and Hive.
EXPERIMENT No. 1
Objective: Implement basic data structures using the Java Collections Framework: LinkedList, Stack, Queue, and Map.
1.1 LinkedList
// Java code for LinkedList demonstration
import java.util.*;
public class LL {
    public static void main(String args[])
    {
        // Elements must be added before removal; the original listing
        // omitted this setup (sample values assumed)
        LinkedList<String> ll = new LinkedList<String>();
        ll.add("A");
        ll.add("B");
        ll.add("C");
        ll.add("D");
        ll.add("E");
        ll.remove("B");    // remove by value
        ll.remove(3);      // remove by index
        ll.removeFirst();  // remove head
        ll.removeLast();   // remove tail
        System.out.println(ll);
    }
}
1.2 Stack
// Java code for stack implementation
import java.io.*;
import java.util.*;

class Test
{
    // Push five elements onto the stack
    static void stack_push(Stack<Integer> stack)
    {
        for(int i = 0; i < 5; i++)
        {
            stack.push(i);
        }
    }

    // Search for an element; search() returns the 1-based
    // position from the top of the stack, or -1 if absent
    static void stack_search(Stack<Integer> stack, int element)
    {
        int pos = stack.search(element);
        if(pos == -1)
            System.out.println("Element not found");
        else
            System.out.println("Element is found at position: " + pos);
    }

    public static void main(String[] args)
    {
        Stack<Integer> stack = new Stack<Integer>();
        stack_push(stack);
        stack_search(stack, 3);
    }
}
1.3 Queue
// Java program to demonstrate a Queue
import java.util.LinkedList;
import java.util.Queue;

class QueueExample
{
    public static void main(String[] args)
    {
        // The original listing was truncated; a minimal body is
        // reconstructed here (sample values assumed)
        Queue<Integer> q = new LinkedList<Integer>();
        for(int i = 0; i < 5; i++)
            q.add(i);                    // enqueue 0..4
        System.out.println(q.peek());    // head of queue: 0
        System.out.println(q.remove());  // dequeue: 0
        System.out.println(q);
    }
}
1.5 Map
// Java Program to Demonstrate
// Working of Map interface
import java.util.*;

class MapExample
{
    public static void main(String[] args)
    {
        Map<String, Integer> hm = new HashMap<String, Integer>();
        hm.put("a", 100);   // autoboxing replaces the deprecated new Integer(...)
        hm.put("b", 200);
        hm.put("c", 300);
        hm.put("d", 400);
        // Print each key/value pair
        for (Map.Entry<String, Integer> e : hm.entrySet())
            System.out.println(e.getKey() + " -> " + e.getValue());
    }
}
Viva Voce:
1. What is the Collection Framework?
2. What is a LinkedList?
3. What is a Set/SortedSet?
4. What is a Map?
5. How do you create a Queue/Stack in Java?
References:
EXPERIMENT No. 2
Objective: Perform setting up and installing Hadoop in its three operating modes: Standalone, Pseudo-distributed, Fully distributed.
The first step is to check whether SSH is installed on your nodes. We can easily do this by use of the "which" UNIX command:
user@user-ThinkCentre-E73:~$ which ssh
If SSH is not found, the next step is to install the openssh server by typing the following command in the terminal:
user@user-ThinkCentre-E73:~$ sudo apt-get install openssh-server
user@user-ThinkCentre-E73:~$ cat /home/user/.ssh/id_rsa.pub >> /home/user/.ssh/authorized_keys
user@user-ThinkCentre-E73:~$ exit
11. Open the bashrc file and add the following lines at the end of the file.
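The exact lines depend on where the JDK and Hadoop were unpacked; assuming the same paths used later in this manual (under /home/user/Downloads), a typical addition is:
export JAVA_HOME=/home/user/Downloads/jdk1.8.0_77
export HADOOP_HOME=/home/user/Downloads/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin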
12. Now apply all the changes to the current running system by typing the following command in the terminal.
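Assuming a bash shell, the usual command is:
user@user-ThinkCentre-E73:~$ source ~/.bashrc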
13. Now verify the Java environment by giving the following command in the terminal.
user@user-ThinkCentre-E73:~$ echo $JAVA_HOME
14. Now verify the Hadoop version by giving the following command in the terminal.
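Assuming the PATH set above, the check is:
user@user-ThinkCentre-E73:~$ hadoop version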
2.2 Pseudo-Distributed Mode
The pseudo-distributed mode runs Hadoop in a “cluster of one”, with all daemons running on a single machine. This mode complements the standalone mode for debugging your code, allowing you to examine memory usage, HDFS input/output issues, and other daemon interactions. To work in pseudo-distributed mode, in addition to the above process, the four XML files (core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml) must be updated as below.
2.2.1 core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
2.2.2 hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
2.2.3 mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
2.2.4 yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
In core-site.xml and mapred-site.xml we specify the hostname and port of the NameNode and the JobTracker, respectively. In hdfs-site.xml we specify the default replication factor for HDFS, which should be one because we are running on only one node. The yarn-site.xml file configures YARN for Hadoop.
The hadoop-env.sh file contains other variables for defining your Hadoop environment. To set
environment variables we have to add following lines at the end of file in hadoop-env.sh.
export JAVA_HOME=/home/user/Downloads/jdk1.8.0_77
export HADOOP_HOME=/home/user/Downloads/hadoop
2.3 Fully Distributed Mode
1. First decide how many nodes are to be grouped. Assume one of them as master and the others as slave1 and slave2 respectively. You can download the hadoop-2.7.2.tar.gz file for Hadoop.
2. Now open the hostname file on each node. Remove the old name and type the new name as master, slave1 or slave2 respectively. Then save and exit. To apply these changes, restart the master and slave nodes.
3. Add the association between the hostnames and the IP addresses for the master and the slaves on all the nodes in the /etc/hosts file. Make sure that all the nodes in the cluster are able to ping each other.
user@master:~$ cd /etc
4. Next we need to copy the public key to every slave node as well as the master node.
Make sure that the master is able to do a password-less ssh to all the slaves.
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub user@master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub user@slave1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub user@slave2
$ chmod 0600 ~/.ssh/authorized_keys
$ exit
5. Open the bashrc file and add the following lines at the end of the file.
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/user/Downloads/hadoop/tmp</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8040</value>
</property>
</configuration>
10. Use the following commands to copy Hadoop from the master node to the slave nodes.
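The exact command depends on your directory layout; assuming Hadoop was unpacked to /home/user/Downloads/hadoop as above, one way is:
user@master:~$ scp -r /home/user/Downloads/hadoop user@slave1:/home/user/Downloads/
user@master:~$ scp -r /home/user/Downloads/hadoop user@slave2:/home/user/Downloads/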
Viva Voce:
1. What are the various modes of Hadoop?
2. What are the various configuration files in Hadoop?
3. Explain all three modes of Hadoop installation.
References:
EXPERIMENT No. 3
Objective: Implement the following file management tasks in Hadoop: (i) Adding files and directories (ii) Retrieving files (iii) Deleting files.
i) cat
HDFS command that reads a file on HDFS and prints the content of that file to the standard output:
hdfs dfs -cat /new_edureka/test
ii) mkdir
HDFS command to create a directory in HDFS.
iii) cp
HDFS Command to copy files from source to destination. This command allows
multiple sources as well, in which case the destination must be a directory.
iv) rm
HDFS command to remove a file or a directory from HDFS.
v) get
HDFS command to copy files from HDFS to the local file system.
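Illustrative usages of these commands (the paths are examples only):
hdfs dfs -mkdir /new_edureka
hdfs dfs -cp /new_edureka/test /new_edureka/test_copy
hdfs dfs -rm /new_edureka/test_copy
hdfs dfs -get /new_edureka/test /home/user/test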
Viva Voce:
1. How do local and distributed file systems differ?
2. How do you copy a file from the local file system to HDFS?
3. Explain the various commands used in HDFS.
References:
EXPERIMENT No. 4
Objective: Run a basic Word Count Map Reduce program to understand the Map Reduce paradigm.
// Mapper Code
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
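// The mapper class body is not shown above; a minimal sketch consistent
// with these imports (the class name WC_Mapper is illustrative):
public class WC_Mapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // Emit (word, 1) for every whitespace-separated token in the line
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                output.collect(word, one);
            }
        }
    }
}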
//Reducer
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
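// The reducer class body is not shown above; a minimal sketch consistent
// with these imports (the class name WC_Reducer is illustrative):
public class WC_Reducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        // Sum all counts emitted for this word
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}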
//Driver
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
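// The class declaration and the opening of run() are not shown above;
// a minimal sketch (the class name WordCount is illustrative) that
// leads into the surviving fragment below:
public class WordCount extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setJobName("wordcount");
        conf.setMapperClass(WC_Mapper.class);
        conf.setReducerClass(WC_Reducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));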
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
return 0;
}
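    // A sketch of the main() entry point, launching run() via ToolRunner:
    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCount(), args));
    }
}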
Viva Voce:
1. How do map and reduce differ?
2. Explain combiner and partitioner.
3. Explain the various commands used in HDFS.
References:
EXPERIMENT No. 5
Objective: Write a Map Reduce program that mines weather data.
//Mapper
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
// Reducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}
// Driver
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(MaxTemperature.class);
        job.setJobName("Max temperature");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Viva Voce:
1. How do map and reduce differ?
2. Explain combiner and partitioner.
3. Explain weather data mining in Hadoop.
References:
EXPERIMENT No. 6
Objective: Implement Matrix Multiplication with Hadoop Map Reduce.
// Mapper
package com.lendap.hadoop;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
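// The mapper body is not shown above; a sketch reconstructed to lead
// into the surviving fragment below. It assumes the matrix dimensions
// "m" (rows of M) and "p" (columns of N) are set in the job
// configuration, and that input records look like "M,i,j,value" or
// "N,j,k,value".
public class Map extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m")); // rows of M
        int p = Integer.parseInt(conf.get("p")); // columns of N
        String[] indicesAndValue = value.toString().split(",");
        Text outputKey = new Text();
        Text outputValue = new Text();
        if (indicesAndValue[0].equals("M")) {
            // Element M(i,j) contributes to every output cell (i,k)
            for (int k = 0; k < p; k++) {
                outputKey.set(indicesAndValue[1] + "," + k);
                outputValue.set("M," + indicesAndValue[2] + "," + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            }
        } else {
            // Element N(j,k) contributes to every output cell (i,k)
            for (int i = 0; i < m; i++) {
                outputKey.set(i + "," + indicesAndValue[2]);
                outputValue.set("N," + indicesAndValue[1] + "," + indicesAndValue[3]);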
context.write(outputKey, outputValue);
}
}
}
}
// Reducer
package com.lendap.hadoop;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.HashMap;
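// The reducer body is not shown above; a sketch consistent with the
// mapper output, assuming the shared dimension "n" is set in the job
// configuration:
public class Reduce extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Collect M and N entries for this output cell, keyed by the
        // shared index j
        HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
        HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
        for (Text val : values) {
            String[] v = val.toString().split(",");
            if (v[0].equals("M")) {
                hashA.put(Integer.parseInt(v[1]), Float.parseFloat(v[2]));
            } else {
                hashB.put(Integer.parseInt(v[1]), Float.parseFloat(v[2]));
            }
        }
        int n = Integer.parseInt(context.getConfiguration().get("n"));
        float result = 0.0f;
        for (int j = 0; j < n; j++) {
            float a = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
            float b = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
            result += a * b;
        }
        if (result != 0.0f) {
            // Emit "i,k,value" for nonzero cells (sparse output)
            context.write(null, new Text(key.toString() + "," + result));
        }
    }
}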
//Driver
package com.lendap.hadoop;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
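// The opening of the driver is not shown above; a sketch (the class
// name MatrixMultiply and the matrix dimensions are illustrative) that
// leads into the surviving fragment below:
public class MatrixMultiply {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // M is an m x n matrix and N is an n x p matrix (example sizes)
        conf.set("m", "1000");
        conf.set("n", "100");
        conf.set("p", "1000");
        Job job = new Job(conf, "MatrixMultiply");
        job.setJarByClass(MatrixMultiply.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);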
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
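        // Input and output paths taken from the command line (sketch)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));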
job.waitForCompletion(true);
}
}
Viva Voce:
1. How do map and reduce differ?
2. Explain the driver program in this experiment.
3. What is the sparse matrix format?
References:
EXPERIMENT No. 7
Objective: Install and Run Pig then write Pig Latin scripts to sort, group, join, project,
and filter your data.
Table/Relation
4. Group operator
5. Join Operator
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
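Before the group and join operators can be demonstrated, the two files must be loaded as relations; an illustrative sketch (the field names are assumptions based on the data above):
grunt> customers = LOAD 'customers.txt' USING PigStorage(',')
   AS (id:int, name:chararray, age:int, city:chararray, salary:double);
grunt> orders = LOAD 'orders.txt' USING PigStorage(',')
   AS (oid:int, date:chararray, customer_id:int, amount:int);
A sketch of the Group operator (item 4 above) on these relations:
grunt> group_by_age = GROUP customers BY age;
grunt> DUMP group_by_age;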
Self-join
Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least one relation. Assume two identical relations, customers1 and customers2, both loaded from the customers.txt file.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when
there is a match in both tables.
grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
Outer Join: Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways: left outer join, right outer join, and full outer join.
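Illustrative sketches of the three forms, using the relations loaded above:
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
grunt> outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;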
6. Filter Operator
7. Order by Operator:
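Illustrative sketches for the filter and order-by operators, using the customers relation loaded above:
grunt> salary_above_3k = FILTER customers BY salary > 3000.0;
grunt> by_salary_desc = ORDER customers BY salary DESC;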
Viva Voce:
1. How do Pig and Pig Latin differ?
2. What are the two modes of Pig execution?
3. Explain the Pig architecture.
References:
EXPERIMENT No. 8
Objective: Install and Run Hive then use Hive to create, alter, and drop databases,
tables, views, functions, and indexes.
4. Create Table
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
6. Alter Table
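An illustrative sketch of alter statements on the employee table created above:
hive> ALTER TABLE employee RENAME TO emp;
hive> ALTER TABLE emp ADD COLUMNS (dept String COMMENT 'Department name');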
7. Drop Table
hive> DROP TABLE IF EXISTS employee;
Viva Voce:
1. How does Hive work?
2. What are the various operators used in Hive?
3. Explain the Hive architecture.
References:
EXPERIMENT No. 9
Objective: Solve some real-life big data problems.
1. Example 1
The University of Alabama has more than 38,000 students and an ocean of data. In the past, when there were no real solutions to analyze that much data, much of it went unused. Now, administrators can use analytics and data visualizations on this data to draw out patterns of students, revolutionizing the university's operations, recruitment, and retention efforts.
2. Example 2
Wearable devices and sensors have been introduced in the healthcare industry which can provide real-time feeds to a patient's electronic health record. One such technology comes from Apple, which has developed Apple HealthKit, CareKit, and ResearchKit. The main goal is to empower iPhone users to store and access their real-time health records on their phones.
3. Example 3
The Food and Drug Administration (FDA), which runs under the jurisdiction of the Federal Government of the USA, leverages the analysis of big data to discover patterns and associations that identify and examine expected or unexpected occurrences of food-based infections.
4. Example 4
Spotify, an on-demand music platform, uses Big Data Analytics: it collects data from all its users around the globe and then uses the analyzed data to give informed music recommendations and suggestions to every individual user. Amazon Prime, which offers videos, music, and Kindle books in a one-stop shop, is also big on using big data.
5. Example 5
IBM Deep Thunder, a research project by IBM, provides weather forecasting through high-performance computing of big data. IBM is also assisting Tokyo with improved weather forecasting for natural disasters and predicting the probability of damaged power lines.
6. Example 6
Uber generates and uses a huge amount of data regarding drivers, their vehicles, locations, every trip from every vehicle, etc. All this data is analyzed and then used to predict supply, demand, the location of drivers, and the fares to be set for every trip.
And guess what? We too make use of this kind of analysis when we choose a route to save fuel and time, based on our knowledge of having taken that particular route sometime in the past. In this case, we analyzed and made use of the data we had previously acquired through experience, and then used it to make a smart decision. It's pretty cool that big data plays a part not only in big fields but also in our smallest day-to-day decisions.
Viva Voce:
1. Think of some problems of Big Data that Hadoop can solve.
2. How do Pig and Hive differ?
3. Discuss some solutions to big data problems in the healthcare industry.
References: