
Faculty Name: Dr. Varun Prakash Saxena

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

LAB MANUAL
B.Tech. VIII Semester

BIG DATA ANALYTICS LAB 8CS4-21
CONTENTS

Lab Instructions
BTU Syllabus
Lab Introduction

1. Implement the following Data structures in Java
   i) Linked Lists ii) Stacks iii) Queues iv) Set v) Map
2. Perform setting up and Installing Hadoop in its three operating modes: Standalone,
   Pseudo-distributed, Fully distributed.
3. Implement the following file management tasks in Hadoop:
   (i) Adding files and directories (ii) Retrieving files (iii) Deleting files
4. Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.
5. Write a Map Reduce program that mines weather data.
6. Implement Matrix Multiplication with Hadoop Map Reduce.
7. Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.
8. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
9. Solve some real-life big data problems.
Lab Instructions
1. Maintain silence in the lab.

2. Sit at the seat assigned to you by the faculty or lab staff.

3. Always follow the instructions given by the lab staff to perform the assigned experiment.

4. Do not switch on the supply of the panel until the circuit has been checked by the lab staff.

5. Every student is responsible for any damage to the panel or any component which
is assigned for lab work.

6. Do not go to any other table to assist any student without the permission of the lab staff; anyone
doing so may be punished for it.

7. Always come to the lab with your file, and your file work should be completed.

8. Please keep your bag at the right place inside the lab.

9. Always get your file work checked and signed in time (the experiment performed
in the previous lab must positively be checked); after that the faculty may not check your work.
BTU SYLLABUS

BIG DATA ANALYTICS LAB (8CS4-21)

S.No. Name of Experiment


1. Implement the following Data structures in Java
i) Linked Lists ii) Stacks iii) Queues iv) Set v) Map

2. Perform setting up and Installing Hadoop in its three operating modes: Standalone,
Pseudo-distributed, Fully distributed.

3. Implement the following file management tasks in Hadoop:


(i) Adding files and directories
(ii)Retrieving files
(iii)Deleting files

4. Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.

5. Write a Map Reduce program that mines weather data.

6. Implement Matrix Multiplication with Hadoop Map Reduce

7. Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter
your data.

8. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views
functions, and indexes.

9. Solve some real-life big data problems.

LAB INTRODUCTION

The Big Data Analytics (BDA) Lab focuses on large-scale data analytics problems that arise in
different application domains and disciplines, starting with fundamental concepts. The aim of the
BDA lab is to give the user hands-on experience with the Hadoop framework. The user will
be exposed to the installation of Hadoop and its related components such as YARN, Hive, Pig, etc. The
user will write programs based on MapReduce, which is the programming model of the Hadoop
framework. Along with this, the user will work with basic Pig and Hive commands and
analyze data using both Pig and Hive.


EXPERIMENT No. 1

Objective: Implement the following Data structures in Java


i) Linked Lists ii) Stacks iii) Queues iv) Set v) Map

1.1 Linked List


// Java Program to Demonstrate
// Implementation of LinkedList
// class

import java.util.*;

public class LL {
public static void main(String args[])
{

LinkedList<String> ll = new LinkedList<String>();


ll.add("A");              // [A]
ll.add("B");              // [A, B]
ll.addLast("C");          // [A, B, C]
ll.addFirst("D");         // [D, A, B, C]
ll.add(2, "E");           // [D, A, E, B, C]
System.out.println(ll);   // prints [D, A, E, B, C]

ll.remove("B");           // removes the first "B" -> [D, A, E, C]
ll.remove(3);             // removes index 3 ("C") -> [D, A, E]
ll.removeFirst();         // [A, E]
ll.removeLast();          // [A]

System.out.println(ll);   // prints [A]
}
}

1.2 Stack
// Java code for stack implementation
import java.io.*;
import java.util.*;
class Test
{
static void stack_push(Stack<Integer> stack)
{
for(int i = 0; i < 5; i++)
{

stack.push(i);
}
}

static void stack_pop(Stack<Integer> stack)


{
System.out.println("Pop Operation:");

for(int i = 0; i < 5; i++)


{
Integer y = (Integer) stack.pop();
System.out.println(y);
}
}

// Displaying element on the top of the stack


static void stack_peek(Stack<Integer> stack)
{
Integer element = (Integer) stack.peek();
System.out.println("Element on stack top: " + element);
}

static void stack_search(Stack<Integer> stack, int element)


{
Integer pos = (Integer) stack.search(element);

if(pos == -1)
System.out.println("Element not found");
else
System.out.println("Element is found at position: " +
pos);
}

public static void main (String[] args)


{
Stack<Integer> stack = new Stack<Integer>();
stack_push(stack);
stack_pop(stack);
stack_push(stack);
stack_peek(stack);
stack_search(stack, 2);
stack_search(stack, 6);
}
}

1.3 Queue
// Java program to demonstrate a Queue

import java.util.LinkedList;
import java.util.Queue;

public class QueueExample {

public static void main(String[] args)


{
Queue<Integer> q = new LinkedList<>();
for (int i = 0; i < 5; i++)
q.add(i);

System.out.println("Elements of queue "+ q);

int removedele = q.remove();


System.out.println("removed element-"+ removedele);
System.out.println(q);

int head = q.peek();


System.out.println("head of queue-"+ head);
int size = q.size();
System.out.println("Size of queue-"+ size);
}
}
1.4 Set
// Java program Illustrating Set Interface
import java.util.*;
public class SetExample {

public static void main(String[] args)


{
Set<String> hash_Set = new HashSet<String>();
hash_Set.add("Geeks");
hash_Set.add("For");
hash_Set.add("Geeks");
hash_Set.add("Example");
hash_Set.add("Set");
System.out.println(hash_Set);
}
}

1.5 Map
// Java Program to Demonstrate
// Working of Map interface

import java.util.*;

class MapExample {

public static void main(String args[])


{
Map<String, Integer> hm = new HashMap<String, Integer>();
hm.put("a", 100);   // autoboxing instead of the deprecated new Integer(...)
hm.put("b", 200);
hm.put("c", 300);
hm.put("d", 400);

// Iterate over the entries and print each key-value pair
for (Map.Entry<String, Integer> me : hm.entrySet()) {
System.out.print(me.getKey() + ":");
System.out.println(me.getValue());
}
}
}

Viva Voce:
1. What is the Collection Framework?
2. What is a LinkedList?
3. What is a Set/SortedSet?
4. What is a Map?
5. How do you create a Queue/Stack in Java?

References:

1. "Java: The Complete Reference", Herbert Schildt, TMH Publications.
2. "Hadoop in Action", Chuck Lam, Manning Publications.
3. "Hadoop: The Definitive Guide", Tom White, O'Reilly Publications.


EXPERIMENT No. 2

Objective: Perform setting up and installing Hadoop in its three operating modes:

Standalone, Pseudo-distributed, Fully distributed.

2.1 Local Standalone Mode


The standalone mode is the default mode for Hadoop. With empty configuration files, Hadoop will
run completely on the local machine. Because there’s no need to communicate with other nodes, the
standalone mode doesn’t use HDFS, nor will it launch any of the Hadoop daemons. Its primary use
is for developing and debugging the application logic of a MapReduce program without the
additional complexity of interacting with the daemons.
1. Place the software packages in the system's Downloads folder:
i. hadoop-2.7.2.tar.gz
ii. jdk-8u77-linux-i586.tar.gz
2. Extraction of tar files
Extract the above two files in Downloads itself. This can be done by right-clicking on each tar
file and selecting the "Extract Here" option.
3. Rename the extracted folder hadoop-2.7.2 to hadoop.
4. Update the Ubuntu operating system
Ubuntu can be updated with the following command:
user@user-ThinkCentre-E73:~$ sudo apt update
The above command asks the user to enter a password. Enter the corresponding password and
press the Enter key.

5. Install openssh server

The next step is to install the openssh server by typing the following command in the terminal:
user@user-ThinkCentre-E73:~$ sudo apt-get install openssh-server

6. Verify SSH installation

Check whether SSH is installed on your node. We can easily do this by using
the "which" UNIX command:
user@user-ThinkCentre-E73:~$ which ssh

7. Set up password-less SSH to localhost


user@user-ThinkCentre-E73:~$ ssh-keygen -t rsa


8. Add the generated key to the authorized keys

user@user-ThinkCentre-E73:~$ cat /home/user/.ssh/id_rsa.pub >> /home/user/.ssh/authorized_keys

9. To check the localhost connection, type ssh localhost in the terminal.

user@user-ThinkCentre-E73:~$ ssh localhost

10. After connecting to localhost, type exit.

user@user-ThinkCentre-E73:~$ exit
11. Open the .bashrc file and add the following lines at the end of the file:

user@user-ThinkCentre-E73:~$ sudo gedit .bashrc

export JAVA_HOME=/home/user/Downloads/jdk1.8.0_77
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/home/user/Downloads/hadoop
export PATH=$PATH:/home/user/Downloads/hadoop/bin
export PATH=$PATH:/home/user/Downloads/hadoop/sbin

12. Now apply all the changes to the current running shell by typing the
following command in the terminal:

user@user-ThinkCentre-E73:~$ source ~/.bashrc

13. Now verify the Java installation path by giving the following command in the terminal:
user@user-ThinkCentre-E73:~$ echo $JAVA_HOME

14. Now verify the Hadoop version by giving the following command in the terminal:

user@user-ThinkCentre-E73:~$ hadoop version
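
To confirm that the standalone setup works, one of the example MapReduce jobs bundled with Hadoop can be run entirely on the local file system; this is only an illustrative check (the example jar path matches Hadoop 2.7.2, and the input/output directories are arbitrary):

user@user-ThinkCentre-E73:~$ mkdir ~/input
user@user-ThinkCentre-E73:~$ cp /home/user/Downloads/hadoop/etc/hadoop/*.xml ~/input
user@user-ThinkCentre-E73:~$ hadoop jar /home/user/Downloads/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount ~/input ~/output
user@user-ThinkCentre-E73:~$ cat ~/output/part-r-00000

Because no daemons are running in this mode, both the input and the output stay on the local file system.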


2.2 Pseudo Distributed Mode

The pseudo-distributed mode is running Hadoop in a "cluster of one" with all daemons running on a
single machine. This mode complements the standalone mode for debugging your code, allowing
you to examine memory usage, HDFS input/output issues, and other daemon interactions. In order to
work with pseudo-distributed mode, in addition to the above process, the four XML files
(core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml) are to be updated as below.

2.2.1 core-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>

2.2.2 hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

2.2.3 mapred-site.xml

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>


2.2.4 yarn-site.xml

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

In core-site.xml and mapred-site.xml we specify the hostname and port of the NameNode and the
JobTracker, respectively. In hdfs-site.xml we specify the default replication factor for HDFS, which
should be only one because we are running on only one node. The yarn-site.xml file is used to
configure YARN in Hadoop.

The hadoop-env.sh file contains other variables for defining your Hadoop environment. To set
environment variables we have to add following lines at the end of file in hadoop-env.sh.

export JAVA_HOME=/home/user/Downloads/jdk1.8.0_77
export HADOOP_HOME=/home/user/Downloads/hadoop
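
After the configuration files are updated, HDFS is formatted once and the daemons are started; the following is a minimal sketch of the usual commands (the start scripts come from Hadoop's sbin directory, which was added to PATH earlier):

user@user-ThinkCentre-E73:~$ hdfs namenode -format
user@user-ThinkCentre-E73:~$ start-dfs.sh
user@user-ThinkCentre-E73:~$ start-yarn.sh
user@user-ThinkCentre-E73:~$ jps

When everything has started correctly, jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.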

2.3 Fully Distributed Mode

In the discussion below we’ll use the following server names:


■ master—The master node of the cluster and host of the NameNode and JobTracker daemons
■ backup—The server that hosts the Secondary NameNode daemon
■ slave1, slave2, slave3, ...—The slave boxes of the cluster running both DataNode and TaskTracker
daemons

1. First decide how many nodes are to be grouped. Assume one of them as master and the others as slave1 and
slave2 respectively. You can download the hadoop-2.7.2.tar.gz file for Hadoop.

2. Extract it to a Downloads folder, say /home/user/Downloads, on the master.

NOTE: The master and all the slaves must have the same user name. The slaves should not have a hadoop
folder, since it will be copied over from the master later (step 10).
To change the name of a host:

$ sudo gedit /etc/hostname

The terminal will ask you to enter the password (user123).

Once the hostname file is open, remove the old name and type the new name (master, slave1 or slave2
respectively). Then save and exit. To apply these changes, restart the master and slave nodes.

3. Add the association between the hostnames and the IP addresses for the master and the slaves
on all the nodes in the /etc/hosts file. Make sure that all the nodes in the cluster are able to
ping each other.
user@master:~$ cd /etc


user@master:/etc$ sudo vi hosts (OR) user@master:/etc$ sudo gedit hosts

Perform the same operation on slave1 and slave2 as well.

4. Next we need to copy the public key to every slave node as well as the master node.
Make sure that the master is able to do a password-less ssh to all the slaves.

$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub user@master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub user@slave1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub user@slave2
$ chmod 0600 ~/.ssh/authorized_keys
$ exit

5. Open the .bashrc file and add the following lines at the end of the file:

user@user-ThinkCentre-E73:~$ sudo gedit .bashrc


export JAVA_HOME=/home/user/Downloads/jdk1.8.0_77
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/home/user/Downloads/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

7. Edit Hadoop environment files

Add the following lines at the end of the script in etc/hadoop/hadoop-env.sh:

export JAVA_HOME=/home/user/Downloads/jdk1.8.0_77
export HADOOP_HOME=/home/user/Downloads/hadoop
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Add the following lines at the start of the script in etc/hadoop/yarn-env.sh:
export JAVA_HOME=/home/user/Downloads/jdk1.8.0_77
export HADOOP_HOME=/home/user/Downloads/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop


8. Set the configuration files for fully distributed mode as below.

core-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/user/Downloads/hadoop/tmp</value>
</property>
</configuration>

hdfs-site.xml :

<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>

mapred-site.xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

yarn-site.xml :

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8040</value>
</property>
</configuration>

9. Add the slave entries in the slaves file on the master machine:


slave1
slave2

10. Use the following commands to copy Hadoop from the master node to the slave nodes:

user@master:~/Downloads$ scp -r hadoop slave1:/home/user/Downloads/hadoop


user@master:~/Downloads$ scp -r hadoop slave2:/home/user/Downloads/hadoop
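
After copying, the NameNode is typically formatted once on the master and the cluster daemons are started from the master; this sequence is a sketch added here for completeness and is not part of the original steps:

user@master:~$ hdfs namenode -format
user@master:~$ start-dfs.sh
user@master:~$ start-yarn.sh
user@master:~$ jps

Running jps on the master should show NameNode and ResourceManager, while jps on each slave should show DataNode and NodeManager.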

Viva Voce:
1. What are the various modes of Hadoop?
2. What are the various configuration files in Hadoop?
3. Explain all three modes of Hadoop installation.

References:

1. "Hadoop in Action", Chuck Lam, Manning Publications.
2. "Hadoop: The Definitive Guide", Tom White, O'Reilly Publications.


EXPERIMENT No. 3
Objective: Implement the following file management tasks in Hadoop:

(i) Adding files and directories

(ii) Retrieving files

(iii) Deleting files

i) The cat command

HDFS Command that reads a file on HDFS and prints the content of that file to the standard output.

Usage: hdfs dfs -cat /path/to/file_in_hdfs

Command: hdfs dfs -cat /new_edureka/test

ii) mkdir

HDFS Command to create the directory in HDFS.


Usage: hdfs dfs -mkdir /directory_name
Command: hdfs dfs -mkdir /new_edureka

iii) cp

HDFS Command to copy files from source to destination. This command allows
multiple sources as well, in which case the destination must be a directory.

Usage: hdfs dfs -cp <src> <dest>


Command: hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
Command: hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

iv) rm

HDFS Command to remove the file from HDFS.


Usage: hdfs dfs -rm <path>
Command: hdfs dfs -rm /new_edureka/test


v) get

HDFS Command to copy files from hdfs to the local file system.

Usage: hdfs dfs -get <src> <localdst>


Command: hdfs dfs -get /user/test /home/edureka
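
The objective also calls for adding files to HDFS; this is done with the put (or copyFromLocal) command, shown here with illustrative paths, and ls can be used to list the directory afterwards:

vi) put

HDFS Command to copy a file from the local file system into HDFS.

Usage: hdfs dfs -put <localsrc> <dest>
Command: hdfs dfs -put /home/edureka/test /new_edureka
Command: hdfs dfs -ls /new_edureka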

Viva Voce:
1. How do local and distributed file systems differ?
2. How do you copy a file from the local file system to HDFS?
3. Explain various commands used in HDFS.

References:

1. "Hadoop in Action", Chuck Lam, Manning Publications.
2. "Hadoop: The Definitive Guide", Tom White, O'Reilly Publications.


EXPERIMENT No. 4

Objective: Run a basic Word Count Map Reduce program to understand Map Reduce
Paradigm.

// Mapper Code
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WCMapper extends


MapReduceBase implements
Mapper<LongWritable,Text, Text, IntWritable>
{

public void map(LongWritable key, Text value, OutputCollector<Text,


IntWritable> output, Reporter rep)
throws IOException
{
String line = value.toString();
for (String word : line.split(" "))
{
if (word.length() > 0)
{
output.collect(new Text(word), new IntWritable(1));
}
}
}
}

//Reducer

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WCReducer extends


MapReduceBase implements
Reducer<Text,IntWritable, Text, IntWritable> {

public void reduce(Text key, Iterator<IntWritable> value,


OutputCollector<Text, IntWritable> output,
Reporter rep) throws IOException
{
int count = 0;
while (value.hasNext())
{
IntWritable i = value.next();
count += i.get();
}
output.collect(key, new IntWritable(count));
}
}

//Driver
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WCDriver extends Configured implements Tool

{ public int run(String args[]) throws IOException


{
if (args.length < 2)
{
System.out.println("Please give valid inputs");
return -1;
}
JobConf conf = new JobConf(WCDriver.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(WCMapper.class);
conf.setReducerClass(WCReducer.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);


conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
return 0;
}

public static void main(String args[]) throws Exception


{
int exitCode = ToolRunner.run(new WCDriver(), args);
System.out.println(exitCode);
}
}
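
One common way to compile the three classes and run the job is shown below; the jar name, the input file and the HDFS paths are illustrative, and `hadoop classpath` is used only to pick up the Hadoop libraries:

user@user-ThinkCentre-E73:~$ mkdir wc_classes
user@user-ThinkCentre-E73:~$ javac -classpath `hadoop classpath` -d wc_classes WCMapper.java WCReducer.java WCDriver.java
user@user-ThinkCentre-E73:~$ jar -cvf wordcount.jar -C wc_classes/ .
user@user-ThinkCentre-E73:~$ hdfs dfs -mkdir /wc_input
user@user-ThinkCentre-E73:~$ hdfs dfs -put input.txt /wc_input
user@user-ThinkCentre-E73:~$ hadoop jar wordcount.jar WCDriver /wc_input /wc_output
user@user-ThinkCentre-E73:~$ hdfs dfs -cat /wc_output/part-00000

Because the driver uses the old mapred API, the reducer output file is named part-00000 rather than part-r-00000.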

Viva Voce:
1. How do map and reduce differ?
2. Explain combiner and partitioner.
3. Explain various commands used in HDFS.

References:

1. "Hadoop in Action", Chuck Lam, Manning Publications.
2. "Hadoop: The Definitive Guide", Tom White, O'Reilly Publications.


EXPERIMENT No. 5
Objective: Write a Map Reduce program that mines weather data.

//Mapper

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable>
{ private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]"))
{ context.write(new Text(year), new IntWritable(airTemperature));
}
}
}


// Reducer

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException
{ int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values)
{ maxValue = Math.max(maxValue,
value.get());
}
context.write(key, new IntWritable(maxValue));
}
}

// Driver

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
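
Once the three classes are compiled and packaged (in the same way as the Word Count example), the job can be run on an NCDC-style input file; the jar and path names below are only illustrative:

user@user-ThinkCentre-E73:~$ hdfs dfs -mkdir /weather_input
user@user-ThinkCentre-E73:~$ hdfs dfs -put sample.txt /weather_input
user@user-ThinkCentre-E73:~$ hadoop jar maxtemp.jar MaxTemperature /weather_input /weather_output
user@user-ThinkCentre-E73:~$ hdfs dfs -cat /weather_output/part-r-00000

Note that the mapper assumes fixed-width NCDC weather records, with the year in columns 16-19, the temperature sign in column 88, the temperature in columns 89-92 and the quality code in column 93, as reflected in the substring offsets above.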

Viva Voce:
1. How do map and reduce differ?
2. Explain combiner and partitioner.
3. Explain weather data mining in Hadoop.

References:

1. "Hadoop in Action", Chuck Lam, Manning Publications.
2. "Hadoop: The Definitive Guide", Tom White, O'Reilly Publications.


EXPERIMENT No. 6

Objective: Implement Matrix Multiplication with Hadoop Map Reduce.

//Mapper

package com.lendap.hadoop;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class Map


extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text,
Text, Text> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int m = Integer.parseInt(conf.get("m"));
int p = Integer.parseInt(conf.get("p"));
String line = value.toString();
// (M, i, j, Mij);
String[] indicesAndValue = line.split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("M"))
{ for (int k = 0; k < p; k++) {
outputKey.set(indicesAndValue[1] + "," + k);
// outputKey.set(i,k);
outputValue.set(indicesAndValue[0] + "," +
indicesAndValue[2]
+ "," + indicesAndValue[3]);
// outputValue.set(M,j,Mij);
context.write(outputKey, outputValue);
}
} else {
// (N, j, k, Njk);
for (int i = 0; i < m; i++) {
outputKey.set(i + "," + indicesAndValue[2]);
outputValue.set("N," + indicesAndValue[1] + ","
+ indicesAndValue[3]);

context.write(outputKey, outputValue);
}
}
}
}

// Reducer

package com.lendap.hadoop;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.HashMap;

public class Reduce


extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text,
Text> {
@Override
public void reduce(Text key, Iterable<Text> values, Context
context)
throws IOException, InterruptedException
{ String[] value;
//key=(i,k),
//Values = [(M/N,j,V/W),..]
HashMap<Integer, Float> hashA = new HashMap<Integer,
Float>();
HashMap<Integer, Float> hashB = new HashMap<Integer,
Float>();
for (Text val : values) {
value = val.toString().split(",");
if (value[0].equals("M")) {
hashA.put(Integer.parseInt(value[1]),
Float.parseFloat(value[2]));
} else {
hashB.put(Integer.parseInt(value[1]),
Float.parseFloat(value[2]));
}
}
int n =
Integer.parseInt(context.getConfiguration().get("n"));
float result = 0.0f;
float m_ij;
float n_jk;
for (int j = 0; j < n; j++) {
m_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
n_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;


result += m_ij * n_jk;


}
if (result != 0.0f) {
context.write(null,
new Text(key.toString() + "," +
Float.toString(result)));
}
}
}

//Driver

package com.lendap.hadoop;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MatrixMultiply {

public static void main(String[] args) throws Exception


{ if (args.length != 2) {
System.err.println("Usage: MatrixMultiply <in_dir> <out_dir>");
System.exit(2);
}
Configuration conf = new Configuration();
// M is an m-by-n matrix; N is an n-by-p matrix.
conf.set("m", "1000");
conf.set("n", "100");
conf.set("p", "1000");
@SuppressWarnings("deprecation")
Job job = new Job(conf, "MatrixMultiply");
job.setJarByClass(MatrixMultiply.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);


FileInputFormat.addInputPath(job, new Path(args[0]));


FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.waitForCompletion(true);
}
}

// Input file in sparse matrix form.
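
For reference, a small test input in this comma-separated sparse form could look like the lines below, where each line is matrixName,row,column,value and M is m-by-n while N is n-by-p. This sample is illustrative only; for such a 2x2 test the driver's conf.set("m"), conf.set("n") and conf.set("p") values would have to be changed to 2, 2 and 2.

M,0,0,1
M,0,1,2
M,1,0,3
M,1,1,4
N,0,0,5
N,0,1,6
N,1,0,7
N,1,1,8

With these values the expected output entries are 0,0,19.0  0,1,22.0  1,0,43.0  1,1,50.0.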

Viva Voce:
1. How do map and reduce differ?
2. Explain the driver program in this experiment.
3. What is the sparse matrix format?

References:

1. "Hadoop in Action", Chuck Lam, Manning Publications.
2. "Hadoop: The Definitive Guide", Tom White, O'Reilly Publications.


EXPERIMENT No. 7

Objective: Install and Run Pig then write Pig Latin scripts to sort, group, join, project,
and filter your data.

1. Install & Run

Pig comes built-in with Cloudera CDH. Just run Pig from the terminal by typing:
pig

Table/Relation

Student ID   First Name   Last Name     Phone        City

001 Rajiv Reddy 9848022337 Hyderabad

002 siddarth Battacharya 9848022338 Kolkata

003 Rajesh Khanna 9848022339 Delhi

004 Preethi Agarwal 9848022330 Pune

005 Trupthi Mohanthy 9848022336 Bhuwaneshwar

006 Archana Mishra 9848022335 Chennai


2. Load the data (stored in HDFS) into a relation


student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray,
phone:chararray, city:chararray );

3. Dump the relation: dump student;

4. Group operator (these examples assume a relation student_details that also includes an age field)

(i) group_data = GROUP student_details by age;

(ii) group_multiple = GROUP student_details by (age, city);

5. Join Operator
customers.txt

1,Ramesh,32,Ahmedabad,2000.00

2,Khilan,25,Delhi,1500.00

3,kaushik,23,Kota,2000.00

4,Chaitali,25,Mumbai,6500.00

5,Hardik,27,Bhopal,8500.00

6,Komal,22,MP,4500.00

7,Muffy,24,Indore,10000.00

orders.txt

102,2009-10-08 00:00:00,3,3000

100,2009-10-08 00:00:00,3,1500

101,2009-11-20 00:00:00,2,1560

103,2008-05-20 00:00:00,4,2060


grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')

as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')

as (oid:int, date:chararray, customer_id:int, amount:int);

Self-join

Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at
least one relation. Assume two copies of the same relation exist, customers1 and customers2, both loaded
from the customers.txt file.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

Inner Join

Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when
there is a match in both tables.
grunt> customer_orders = JOIN customers BY id, orders BY customer_id;

Outer Join: Unlike inner join, outer join returns all the rows from at least one of the relations. An
outer join operation is carried out in three ways −

 Left outer join


 Right outer join
 Full outer join

grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

6. Filter Operator

filter_data = FILTER student_details BY city == 'Chennai';

7. Order by Operator:

grunt> order_by_data = ORDER student_details BY age DESC;
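
8. Project (FOREACH) operator: projection is part of the objective but not shown above; a minimal sketch using the student relation loaded earlier (the relation name student_project is illustrative):

grunt> student_project = FOREACH student GENERATE id, city;
grunt> DUMP student_project;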


Viva Voce:
1. How do Pig and Pig Latin differ?
2. What are the two modes of Pig execution?
3. What is the Pig architecture? Explain.

References:

1. "Hadoop in Action", Chuck Lam, Manning Publications.
2. "Hadoop: The Definitive Guide", Tom White, O'Reilly Publications.


EXPERIMENT No. 8

Objective: Install and Run Hive then use Hive to create, alter, and drop databases,
tables, views, functions, and indexes.

1. Install & Run


Hive comes built-in with Cloudera CDH. Open a terminal and just type: hive
2. Create Database
hive> CREATE DATABASE IF NOT EXISTS userdb;
3. Drop Database
hive> DROP DATABASE IF EXISTS userdb;

4. Create Table
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

5. Load Data into Table

hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'


OVERWRITE INTO TABLE employee;
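
To verify that the data was loaded, a simple query can be run against the table (illustrative):

hive> SELECT * FROM employee;
hive> SELECT COUNT(*) FROM employee;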

6. Alter Table

hive> ALTER TABLE employee RENAME TO emp;

7. Drop Table
hive> DROP TABLE IF EXISTS employee;


8. Built-in Functions (See reference: https://www.tutorialspoint.com/hive/hive_built_in_functions.htm)


(i) hive> SELECT round(2.6) from temp;
(ii) hive> SELECT floor(2.6) from temp;
(iii) hive> SELECT ceil(2.6) from temp;

9. Working with Views


(i) hive> CREATE VIEW emp_30000 AS
SELECT * FROM employee
WHERE salary>30000;

(ii) hive> DROP VIEW emp_30000;

10. Working with Indexes

(i) hive> CREATE INDEX index_salary ON TABLE employee(salary)


AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';

(ii) hive> DROP INDEX index_salary ON employee;

Viva Voce:
1. How does Hive work?
2. What are the various operators used in Hive?
3. What is the Hive architecture? Explain.

References:

1. "Hadoop in Action", Chuck Lam, Manning Publications.
2. "Hadoop: The Definitive Guide", Tom White, O'Reilly Publications.


EXPERIMENT No. 9

Objective: Solve some real-life big data problems

1. Example 1
The University of Alabama has more than 38,000 students and an ocean of data. In the past,
when there were no real solutions to analyze that much data, much of it seemed useless.
Now, administrators can use analytics and data visualizations on this data to draw out patterns
among students, revolutionizing the university's operations, recruitment, and retention efforts.

2. Example 2
Wearable devices and sensors have been introduced in the healthcare industry which can provide a
real-time feed to the electronic health record of a patient. One such company is Apple, which has
come up with Apple HealthKit, CareKit, and ResearchKit. The main goal is to empower iPhone
users to store and access their real-time health records on their phones.

3. Example 3
The Food and Drug Administration (FDA), which runs under the jurisdiction of the Federal
Government of the USA, leverages the analysis of big data to discover patterns and associations
in order to identify and examine the expected or unexpected occurrences of food-based infections.

4. Example 4
Spotify, an on-demand music platform, uses Big Data Analytics, collects data from all its
users around the globe, and then uses the analyzed data to give informed music
recommendations and suggestions to every individual user. Amazon Prime, which offers videos,
music, and Kindle books in a one-stop shop, also makes heavy use of big data.


5. Example 5
IBM Deep Thunder, which is a research project by IBM, provides weather forecasting through
high-performance computing of big data. IBM is also assisting Tokyo with improved weather
forecasting for natural disasters or predicting the probability of damaged power lines.

6. Example 6
Uber generates and uses a huge amount of data regarding drivers, their vehicles, locations,
every trip from every vehicle, etc. All this data is analyzed and then used to predict supply,
demand, the location of drivers, and the fares that will be set for every trip.
And guess what? We too make use of this approach when we choose a route to save fuel
and time, based on our knowledge of having taken that particular route sometime in the past.
In this case, we analyzed and made use of the data that we had previously acquired on account
of our experience, and then we used it to make a smart decision. It's pretty cool that
big data has played a part not only in big fields but also in our smallest day-to-day
decisions.

Viva Voce:
1. Think of some problems of big data that Hadoop can solve.
2. How do Pig and Hive differ?
3. Discuss some solutions to big data problems in the healthcare industry.

References:

1. "Hadoop in Action", Chuck Lam, Manning Publications.
2. "Hadoop: The Definitive Guide", Tom White, O'Reilly Publications.
3. https://intellipaat.com/blog/10-big-data-examples-application-of-big-data-in-real-life/
