BIG DATA ANALYTICS LABORATORY MANUAL

(19IT24601)

For VI-Semester B.Tech IT

MAHENDRA ENGINEERING COLLEGE

(AUTONOMOUS)

Department of Information Technology

Mahendhirapuri, Mallasamudram(W)

Namakkal-637 503.

Extract Copy of Syllabus:

19IT24601 BIG DATA ANALYTICS LABORATORY    L T P C
                                           0 0 4 2

Hadoop
1. Study of Hadoop and HDFS
2. Implement word count / frequency programs using MapReduce
3. Implement an MR program that processes a weather dataset

R - Programming
4. Implement Linear and logistic Regression
5. To implement the SVM / Decision tree classification techniques
6. Implement clustering techniques
7. Visualize data using any plotting framework
8. Implement an application that stores big data in Hbase / MongoDB / Pig using
Hadoop / R.
TOTAL: 45 PERIODS

INDEX

Ex. No.  NAME OF THE EXPERIMENT
1        STUDY, INSTALL, CONFIGURE AND RUN HADOOP AND HDFS
2        IMPLEMENT WORD COUNT / FREQUENCY PROGRAMS USING MAPREDUCE
3        IMPLEMENT AN MR PROGRAM THAT PROCESSES A WEATHER DATASET
4        IMPLEMENT LINEAR AND LOGISTIC REGRESSION
5        IMPLEMENT SVM / DECISION TREE CLASSIFICATION TECHNIQUES
6        IMPLEMENT CLUSTERING TECHNIQUES
7        VISUALIZE DATA USING ANY PLOTTING FRAMEWORK
8        IMPLEMENT AN APPLICATION THAT STORES BIG DATA IN HBASE / MONGODB / PIG USING HADOOP / R.

19IT24601 BIG DATA ANALYTICS LABORATORY

CYCLE – I

Contents

Hadoop
1. Study of Hadoop and HDFS
2. Implement word count / frequency programs using MapReduce
3. Implement an MR program that processes a weather dataset
R - Programming
4. Implement Linear and logistic Regression

CYCLE – II

Contents

R - Programming
5. To implement the SVM / Decision tree classification techniques
6. Implement clustering techniques
7. Visualize data using any plotting framework
8. Implement an application that stores big data in Hbase / MongoDB / Pig using
Hadoop / R.

EX.NO: 1 INSTALL, CONFIGURE AND RUN HADOOP AND HDFS

DATE:

AIM:
To install a single-node Hadoop cluster backed by the Hadoop Distributed File System (HDFS).
PROCEDURE:
1) Installing Java
Hadoop is a framework written in Java for running applications on large clusters of commodity
hardware. Hadoop 2.7 and later need Java 7 or above to work.
Step 1: Download tar and extract
Download the JDK tar.gz file for Linux 64-bit and extract it into “/opt”:
# cd /opt
# sudo tar xvpzf /home/itadmin/Downloads/jdk-8u5-linux-x64.tar.gz
# cd /opt/jdk1.8.0_05
Step 2: Set Environments
• Open the “/etc/profile” file and add the following lines as per the version
• Set an environment variable for Java
• Use the root user to save /etc/profile; gedit can be used instead of vi
• The 'profile' file contains commands that ought to be run for login shells
# sudo vi /etc/profile
#--insert JAVA_HOME
JAVA_HOME=/opt/jdk1.8.0_05
#--in PATH variable just append at the end of the line
PATH=$PATH:$JAVA_HOME/bin
#--Append JAVA_HOME at end of the export statement
export PATH JAVA_HOME
Save the file by pressing the “Esc” key followed by :wq!
Step 3: Source the /etc/profile
# source /etc/profile
Step 4: Update the java alternatives
1. By default, the OS will have OpenJDK. Check with “java -version”; you will be prompted with
“openJDK”.
2. If you have OpenJDK installed, then you'll need to update the java alternatives.
3. If your system has more than one version of Java, configure which one your system
uses by entering the following commands in a terminal window.
4. After the update, check again with “java -version”; you should now be prompted with
“Java HotSpot(TM) 64-Bit Server”.
# update-alternatives --install "/usr/bin/java" java "/opt/jdk1.8.0_05/bin/java" 1
# update-alternatives --config java
-- type the selection number when prompted
# java -version
2) Configure SSH
• Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine
if you want to use Hadoop on it (which is what we want to do in this exercise). For our single-
node setup of Hadoop, we therefore need to configure SSH access to localhost. Password-less,
key-based SSH authentication is needed so that the master node can log in to the slave nodes
(and the secondary node) to start/stop them easily without any delay for authentication.
• If you skip this step, you will have to provide a password each time. Generate an SSH key for the
user and then enable password-less SSH access to localhost:
# sudo apt-get install openssh-server
-- You will be asked to enter a password
root@abc[ ]# ssh localhost
root@abc[ ]# ssh-keygen
root@abc[ ]# ssh-copy-id -i localhost
-- After the above 2 steps, you will be connected without a password
root@abc[ ]# ssh localhost
root@abc[ ]# exit
3) Hadoop installation
• Now download Hadoop from the official Apache site, preferably a stable release version
of Hadoop 2.7.x, and extract the contents of the Hadoop package to a location of your choice.
• For example, choose the location as “/opt/”
Step 1: Download the tar.gz file of the latest version of Hadoop (hadoop-2.7.x) from the official site.
Step 2: Extract (untar) the downloaded file with these commands to /opt
root@abc[]# cd /opt
root@abc[/opt]# sudo tar xvpzf /home/itadmin/Downloads/hadoop-2.7.0.tar.gz
root@abc[/opt]# cd hadoop-2.7.0/
Like java, update the Hadoop environment variable in /etc/profile
# sudo vi /etc/profile
#--insert HADOOP_PREFIX
HADOOP_PREFIX=/opt/hadoop-2.7.0
#--in PATH variable just append at the end of the line
PATH=$PATH:$HADOOP_PREFIX/bin
#--Append HADOOP_PREFIX at end of the export statement
export PATH JAVA_HOME HADOOP_PREFIX
Save the file by pressing the “Esc” key followed by :wq!
Step 3: Source the /etc/profile
# source /etc/profile
Verify Hadoop installation
# cd $HADOOP_PREFIX
# bin/hadoop version
3.1) Modify the Hadoop Configuration Files
• In this section, we will configure the directory where Hadoop will store its configuration files,
the network ports it listens to, etc. Our setup will use the Hadoop Distributed File System (HDFS),
even though we are using only a single local machine.

• Add the following properties in the various Hadoop configuration files which are available under
$HADOOP_PREFIX/etc/hadoop/
• core-site.xml, hdfs-site.xml, mapred-site.xml & yarn-site.xml
Update the Java and Hadoop paths in the Hadoop environment file
# cd $HADOOP_PREFIX/etc/hadoop
# vi hadoop-env.sh
Paste the following lines at the beginning of the file
export JAVA_HOME=/opt/jdk1.8.0_05
export HADOOP_PREFIX=/opt/hadoop-2.7.0
Modify the core-site.xml
# cd $HADOOP_PREFIX/etc/hadoop
# vi core-site.xml
Paste following between <configuration> tags
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Modify the hdfs-site.xml
# vi hdfs-site.xml
Paste following between <configuration> tags
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
YARN configuration - Single Node
Modify the mapred-site.xml
# cp mapred-site.xml.template mapred-site.xml
# vi mapred-site.xml
Paste following between <configuration> tags
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Modify yarn-site.xml
# vi yarn-site.xml
Paste following between <configuration> tags
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>
</property>
</configuration>
Formatting the HDFS file-system via the NameNode
• The first step to starting up your Hadoop installation is formatting the Hadoop file system,
which is implemented on top of the local file system of our “cluster” (which includes only our
local machine). You need to do this only the first time you set up a Hadoop cluster.
• Do not format a running Hadoop file system, as you will lose all the data currently in the cluster
(in HDFS)
root@abc[ ]# cd $HADOOP_PREFIX
root@abc[ ]# bin/hadoop namenode -format
Start NameNode daemon and DataNode daemon: (port 50070)
root@abc[ ]# sbin/start-dfs.sh
Start ResourceManager daemon and NodeManager daemon: (port 8088)
root@abc[ ]# sbin/start-yarn.sh
To see the running daemons, just type jps or /opt/jdk1.8.0_05/bin/jps
To stop the running daemons
root@abc[ ]# sbin/stop-dfs.sh
root@abc[ ]# sbin/stop-yarn.sh
Make the HDFS directories required to execute MapReduce jobs:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/mit
$ bin/hdfs dfs -mkdir /input
• Copy the input files into the distributed filesystem:
$ bin/hdfs dfs -put <input-path>/* /input
• Run some of the examples provided:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar
grep /input /output '(CSE)'
• Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hdfs dfs -get /output output
$ cat output/*
or
• View the output files on the distributed filesystem:
$ bin/hdfs dfs -cat /output/*

RESULT:
Thus the installation and configuration of Hadoop and HDFS is successfully executed.

EX.NO: 2 IMPLEMENT WORD COUNT / FREQUENCY PROGRAMS
USING MAPREDUCE
DATE:

Procedure:

Prepare:
1. Download MapReduceClient.jar (Link: https://github.com/MuhammadBilalYar/HADOOP-
INSTALLATION-ON-WINDOW-10/blob/master/MapReduceClient.jar)
2. Download Input_file.txt (Link: https://github.com/MuhammadBilalYar/HADOOP-
INSTALLATION-ON-WINDOW-10/blob/master/input_file.txt)

Place both files in "C:/"


Hadoop Operation:
1. Open cmd in Administrator mode, move to "C:/Hadoop-2.8.0/sbin" and start the cluster

Start-all.cmd

2. Create an input directory in HDFS.


hadoop fs -mkdir /input_dir
3. Copy the input text file named input_file.txt in the input directory (input_dir) of HDFS.
hadoop fs -put C:/input_file.txt /input_dir
4. Verify input_file.txt available in HDFS input directory (input_dir).
hadoop fs -ls /input_dir/
5. Verify content of the copied file.
hadoop dfs -cat /input_dir/input_file.txt

6. Run MapReduceClient.jar and also provide input and out directories.
hadoop jar C:/MapReduceClient.jar wordcount /input_dir /output_dir

7. Verify content for generated output file.


hadoop dfs -cat /output_dir/*

Some Other useful commands
8) To leave Safe mode
hadoop dfsadmin -safemode leave
9) To delete a file from an HDFS directory
hadoop fs -rm -r /input_dir/input_file.txt
10) To delete a directory from HDFS
hadoop fs -rm -r /input_dir

RESULT:
Thus the word count program using Map and Reduce tasks was demonstrated successfully.
EX.NO: 3 IMPLEMENT AN MR PROGRAM THAT PROCESSES
A WEATHER DATASET
DATE:

AIM:
To write a MapReduce program that processes a weather dataset.

PROCEDURE:
1. Analyze the input file content.
2. Develop the code.
a. Writing a map function.
b. Writing a reduce function.
c. Writing the Driver class.
3. Compiling the source.
4. Building the JAR file.
5. Starting the DFS.
6. Creating Input path in HDFS and moving the data into Input path.
7. Executing the program.
PROGRAM CODING:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
* @author devinline
*/
public class CalculateMaxAndMinTemeratureWithTime {
public static String calOutputName = "California";
public static String nyOutputName = "Newyork";
public static String njOutputName = "Newjersy";
public static String ausOutputName = "Austin";
public static String bosOutputName = "Boston";
public static String balOutputName = "Baltimore";

public static class WhetherForcastMapper extends
Mapper<Object, Text, Text, Text> {

public void map(Object keyOffset, Text dayReport, Context con)
throws IOException, InterruptedException {
StringTokenizer strTokens = new StringTokenizer(
dayReport.toString(), "\t");
int counter = 0;
Float currnetTemp = null;
Float minTemp = Float.MAX_VALUE;
Float maxTemp = -Float.MAX_VALUE; // most negative float, so days with only sub-zero readings are handled
String date = null;
String currentTime = null;
String minTempANDTime = null;
String maxTempANDTime = null;

while (strTokens.hasMoreElements()) {
if (counter == 0) {
date = strTokens.nextToken();
} else {
if (counter % 2 == 1) {
currentTime = strTokens.nextToken();
} else {
currnetTemp = Float.parseFloat(strTokens.nextToken());
if (minTemp > currnetTemp) {
minTemp = currnetTemp;
minTempANDTime = minTemp + "AND" + currentTime;
}
if (maxTemp < currnetTemp) {
maxTemp = currnetTemp;
maxTempANDTime = maxTemp + "AND" + currentTime;
}
}
}
counter++;
}
// Write to context - MinTemp, MaxTemp and corresponding time
Text temp = new Text();
temp.set(maxTempANDTime);
Text dateText = new Text();
dateText.set(date);
try {
con.write(dateText, temp);
} catch (Exception e) {
e.printStackTrace();
}

temp.set(minTempANDTime);
dateText.set(date);
con.write(dateText, temp);

}
}

public static class WhetherForcastReducer extends
Reducer<Text, Text, Text, Text> {
MultipleOutputs<Text, Text> mos;

public void setup(Context context) {
mos = new MultipleOutputs<Text, Text>(context);
}

public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
int counter = 0;
String reducerInputStr[] = null;
String f1Time = "";
String f2Time = "";
String f1 = "", f2 = "";
Text result = new Text();
for (Text value : values) {

if (counter == 0) {
reducerInputStr = value.toString().split("AND");
f1 = reducerInputStr[0];
f1Time = reducerInputStr[1];
}

else {
reducerInputStr = value.toString().split("AND");
f2 = reducerInputStr[0];
f2Time = reducerInputStr[1];
}

counter = counter + 1;
}
if (Float.parseFloat(f1) > Float.parseFloat(f2)) {
result = new Text("Time: " + f2Time + " MinTemp: " + f2 + "\t"
+ "Time: " + f1Time + " MaxTemp: " + f1);
} else {
result = new Text("Time: " + f1Time + " MinTemp: " + f1 + "\t"
+ "Time: " + f2Time + " MaxTemp: " + f2);
}
String fileName = "";
if (key.toString().substring(0, 2).equals("CA")) {
fileName = CalculateMaxAndMinTemeratureWithTime.calOutputName;
} else if (key.toString().substring(0, 2).equals("NY")) {
fileName = CalculateMaxAndMinTemeratureWithTime.nyOutputName;
} else if (key.toString().substring(0, 2).equals("NJ")) {
fileName = CalculateMaxAndMinTemeratureWithTime.njOutputName;
} else if (key.toString().substring(0, 3).equals("AUS")) {
fileName = CalculateMaxAndMinTemeratureWithTime.ausOutputName;
} else if (key.toString().substring(0, 3).equals("BOS")) {
fileName = CalculateMaxAndMinTemeratureWithTime.bosOutputName;
} else if (key.toString().substring(0, 3).equals("BAL")) {
fileName = CalculateMaxAndMinTemeratureWithTime.balOutputName;
}
String strArr[] = key.toString().split("_");
key.set(strArr[1]); //Key is date value
mos.write(fileName, key, result);
}

@Override
public void cleanup(Context context) throws IOException,
InterruptedException {
mos.close();
}
}

public static void main(String[] args) throws IOException,
ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Weather Statistics of USA");
job.setJarByClass(CalculateMaxAndMinTemeratureWithTime.class);

job.setMapperClass(WhetherForcastMapper.class);
job.setReducerClass(WhetherForcastReducer.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

MultipleOutputs.addNamedOutput(job, calOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, nyOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, njOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, bosOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, ausOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, balOutputName,
TextOutputFormat.class, Text.class, Text.class);
// FileInputFormat.addInputPath(job, new Path(args[0]));
// FileOutputFormat.setOutputPath(job, new Path(args[1]));
Path pathInput = new Path(
"hdfs://192.168.213.133:54310/weatherInputData/input_temp.txt");
Path pathOutputDir = new Path(
"hdfs://192.168.213.133:54310/user/hduser1/testfs/output_mapred3");
FileInputFormat.addInputPath(job, pathInput);
FileOutputFormat.setOutputPath(job, pathOutputDir);
try {
System.exit(job.waitForCompletion(true) ? 0 : 1);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}}}

OUTPUT:

Verify whether the output directory is in place on HDFS. Execute the following command to verify the same.

hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop fs -ls /user/hduser1/testfs/output_mapred3
Found 8 items
-rw-r--r-- 3 zytham supergroup 438 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/Austin-r-00000
-rw-r--r-- 3 zytham supergroup 219 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/Baltimore-r-00000
-rw-r--r-- 3 zytham supergroup 219 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/Boston-r-00000
-rw-r--r-- 3 zytham supergroup 511 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/California-r-00000
-rw-r--r-- 3 zytham supergroup 146 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/Newjersy-r-00000
-rw-r--r-- 3 zytham supergroup 219 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/Newyork-r-00000
-rw-r--r-- 3 zytham supergroup 0 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/_SUCCESS
-rw-r--r-- 3 zytham supergroup 0 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/part-r-00000

hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop fs -cat /user/hduser1/testfs/output_mapred3/Austin-r-00000
25-Jan-2018 Time: 12:34:542 MinTemp: -22.3 Time: 05:12:345 MaxTemp: 35.7
26-Jan-2018 Time: 22:00:093 MinTemp: -27.0 Time: 05:12:345 MaxTemp: 55.7
27-Jan-2018 Time: 02:34:542 MinTemp: -22.3 Time: 05:12:345 MaxTemp: 55.7
29-Jan-2018 Time: 14:00:093 MinTemp: -17.0 Time: 02:34:542 MaxTemp: 62.9
30-Jan-2018 Time: 22:00:093 MinTemp: -27.0 Time: 05:12:345 MaxTemp: 49.2
31-Jan-2018 Time: 14:00:093 MinTemp: -17.0 Time: 03:12:187 MaxTemp: 56.0

RESULT:
Thus the MapReduce program that processes a weather dataset was executed successfully.

EX.NO: 4a IMPLEMENT LINEAR REGRESSION

DATE:

AIM:
To write the implementation of linear regression.
PROCEDURE:
1. Linear regression is used to predict a quantitative outcome variable (y) on the basis of one or
multiple predictor variables (x).
2. The goal is to build a mathematical formula that defines y as a function of the x variable(s).
3. When you build a regression model, you need to assess the performance of the predictive
model.
4. Two important metrics are commonly used to assess the performance of a predictive
regression model (see the sketch after this list):
5. Root Mean Squared Error (RMSE), which measures the model prediction error. It corresponds to the
average difference between the observed values of the outcome and the values predicted by the
model, computed as RMSE = sqrt(mean((observeds - predicteds)^2)). The lower the RMSE, the
better the model.
6. R-squared (R2), representing the squared correlation between the observed outcome values
and the values predicted by the model. The higher the R2, the better the model.
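The two metrics above can be computed directly in R. The following is a minimal sketch (not part of the original program); it refits the same height/weight model used below, and the names predicted, rmse and r2 are illustrative:

# Sketch: compute RMSE and R-squared for a simple linear model
X = c(151,174,138,186,128,136,179,163,152,131)
Y = c(63,81,56,91,47,57,76,72,62,48)
relation = lm(Y ~ X)                    # fit Y as a function of X

predicted = predict(relation)           # fitted values on the training data
rmse = sqrt(mean((Y - predicted)^2))    # Root Mean Squared Error
r2 = summary(relation)$r.squared        # R-squared reported by summary()
print(rmse)
print(r2)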
PROGRAM:
X=c(151,174,138,186,128,136,179,163,152,131)
Y=c(63,81,56,91,47,57,76,72,62,48)
plot(X,Y)
relation=lm(Y~X)
print(relation)
print(summary(relation))
a=data.frame(X=170)
result=predict(relation,a)
print(result)
png(file="linearregression.png")
plot(Y,X,col="green",main="Height & Weight
Regression",abline(lm(X~Y)),cex=1.3,pch=16,Xlab="Weight in kg",Ylab="Height in cm")
dev.off()

OUTPUT:
> print(result)
       1
76.22869
> dev.off()
RStudioGD
        2

RESULT:
Thus the implementation of linear regression was executed and verified successfully.

EX.NO: 4b IMPLEMENT LOGISTIC REGRESSION
DATE:

AIM:
To write the implementation of logistic regression.
PROCEDURE:
1. Logistic regression is used to predict the class of individuals based on one or multiple
predictor variables (x).
2. It is used to model a binary outcome, that is a variable, which can have only two possible
values: 0 or 1, yes or no, diseased or non-diseased.
3. Logistic regression belongs to a family, named Generalized Linear Model (GLM), developed
for extending the linear regression model to other situations.
4. Other synonyms are binary logistic regression, binomial logistic regression and logit model.
5. Logistic regression does not return directly the class of observations. It allows us to estimate
the probability (p) of class membership. The probability will range between 0 and 1 (see the sketch after the output below).
PROGRAM:
input=mtcars[,c("am","cyl","hp","wt")]
am.data=glm(formula=am~cyl+hp+wt,data=input,family = binomial)
print(summary(am.data))

OUTPUT:
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.2297 on 31 degrees of freedom
Residual deviance: 9.8415 on 28 degrees of freedom
AIC: 17.841
Number of Fisher Scoring iterations: 8
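As noted in step 5 of the procedure, the fitted model returns probabilities rather than classes. The following is a small illustrative sketch (not part of the original program) showing how those probabilities could be inspected for the am.data model fitted above; the 0.5 cut-off and the names prob and pred_class are assumptions for illustration:

# Sketch: estimated probabilities of am = 1 from the fitted logistic model
prob = predict(am.data, type = "response")        # probabilities between 0 and 1
head(prob)
# Convert probabilities to class labels with a 0.5 cut-off
pred_class = ifelse(prob > 0.5, 1, 0)
table(Predicted = pred_class, Actual = mtcars$am) # simple confusion matrix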

RESULT:
Thus the implementation of logistic regression was executed and verified successfully.

EX.NO: 5 IMPLEMENT SVM / DECISION TREE CLASSIFICATION TECHNIQUES

DATE:

AIM:

To implement SVM/Decision Tree Classification Techniques

IMPLEMENTATION:(SVM)
To use SVM in R, we have the package e1071. The package is not preinstalled, hence one
needs to run the line install.packages("e1071") to install the package and then import the
package contents using the library command, library(e1071), as sketched below.
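A minimal sketch of the installation and loading step described above (install.packages only needs to be run once per machine):

install.packages("e1071")   # one-time installation from CRAN
library(e1071)              # load the package in the current session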

R CODE:
x=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
y=c(3,4,5,4,8,10,10,11,14,20,23,24,32,34,35,37,42,48,53,60)
 
#Create a data frame of the data
train=data.frame(x,y)
 
#Plot the dataset
plot(train,pch=16)
 
#Linear regression
model <- lm(y ~ x, train)
 
#Plot the model using abline
abline(model)
 
#SVM
library(e1071)
 
#Fit a model. The function syntax is very similar to lm function
model_svm <- svm(y ~ x , train)
 
#Use the predictions on the data
pred <- predict(model_svm, train)
 
#Plot the predictions and the plot to see our model fit
points(train$x, pred, col = "blue", pch=4)
 
#Linear model has a residuals part which we can extract and directly calculate rmse
error <- model$residuals
lm_error <- sqrt(mean(error^2)) # 3.832974
 
#For svm, we have to manually calculate the difference between the actual values (train$y)
#and our predictions (pred)
error_2 <- train$y - pred
svm_error <- sqrt(mean(error_2^2)) # 2.696281
# perform a grid search
svm_tune <- tune(svm, y ~ x, data = train,
 ranges = list(epsilon = seq(0,1,0.01), cost = 2^(2:9))
)
print(svm_tune)
 
#Parameter tuning of ‘svm’:
  
# - sampling method: 10-fold cross validation
 
#- best parameters:
# epsilon cost
#0 8
 
#- best performance: 2.872047
 
#The best model
best_mod <- svm_tune$best.model
best_mod_pred <- predict(best_mod, train)
 
error_best_mod <- train$y - best_mod_pred
 
# this value can be different on your computer
# because the tune method randomly shuffles the data
best_mod_RMSE <- sqrt(mean(error_best_mod^2)) # 1.290738
 plot(svm_tune)
 plot(train,pch=16)
points(train$x, best_mod_pred, col = "blue", pch=4)

OUTPUT:

IMPLEMENTATION:(DECISION TREE)
Use the below command in R console to install the package. You also have to install the
dependent packages if any.
install.packages("party")
The basic syntax for creating a decision tree in R is
ctree(formula, data)
input data:
We will use the R in-built data set named readingSkills to create a decision tree. It records the
variables "age", "shoeSize" and "score" for a number of people, along with whether each person is a
native speaker or not; the tree predicts nativeSpeaker from the other three variables.
# Load the party package. It will automatically load other dependent packages.
library(party)
# Print some records from data set readingSkills.
print(head(readingSkills))
We will use the ctree() function to create the decision tree and see its graph.
# Load the party package. It will automatically load other dependent packages.
library(party)
# Create the input data frame.
input.dat <- readingSkills[c(1:105),]
# Give the chart file a name.
png(file = "decision_tree.png")
# Create the tree.
output.tree <- ctree(nativeSpeaker ~ age + shoeSize + score, data = input.dat)
# Plot the tree.
plot(output.tree)
# Save the file.
dev.off()
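Once the tree has been built, it can also be used to classify new observations. The following optional sketch (not part of the original program) assumes the output.tree model created above; the values in new.obs are purely illustrative:

# Sketch: classify a new observation with the fitted ctree model
new.obs <- data.frame(age = 8, shoeSize = 25.0, score = 30.0)  # illustrative values
predict(output.tree, newdata = new.obs)  # returns the predicted nativeSpeaker class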

OUTPUT:

RESULT:
Thus the implementation of SVM and decision tree classification was executed and
verified successfully.

EX.NO:6 IMPLEMENT CLUSTERING TECHNIQUES
DATE:

AIM:
To implement clustering Techniques

PROGRAM CODING:

Installing and loading required R packages

install.packages("factoextra")
install.packages("cluster")
install.packages("magrittr")
library("cluster")
library("factoextra")
library("magrittr")

Data preparation

# Load and prepare the data

data("USArrests")
my_data <- USArrests %>% na.omit() %>% # Remove missing values (NA) scale() # Scale
variables
27
# View the firt 3 rows
head(my_data, n = 3)

Distance measures

res.dist <- get_dist(USArrests, stand = TRUE, method = "pearson")
fviz_dist(res.dist,
  gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

PARTITION CLUSTERING:

Determining the optimal number of clusters: use factoextra::fviz_nbclust()
library("factoextra")
fviz_nbclust(my_data, kmeans, method = "gap_stat")

Compute and visualize k-means clustering


set.seed(123)
km.res <- kmeans(my_data, 3, nstart = 25)
# Visualize
library("factoextra")
fviz_cluster(km.res, data = my_data, ellipse.type = "convex", palette = "jco",
ggtheme = theme_minimal())
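An optional sketch (not part of the original listing) of how the k-means result computed above could be inspected; km.res is the object created by kmeans() and the column name Cluster is illustrative:

# Sketch: inspect the k-means result
km.res$size                                        # number of observations in each cluster
km.res$centers                                     # cluster centers (on the scaled variables)
head(cbind(USArrests, Cluster = km.res$cluster))   # attach cluster labels to the original data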

MODEL BASED CLUSTERING:

# Load the data
library("MASS")
data("geyser")

# Scatter plot
library("ggpubr")
ggscatter(geyser, x = "duration", y = "waiting") + geom_density2d() # Add 2D density

library("mclust")
data("diabetes")
head(diabetes, 3)

Model-based clustering can be computed using the function Mclust() as follow:


library(mclust)
df <- scale(diabetes[, -1])   # Standardize the data
mc <- Mclust(df)              # Model-based clustering
summary(mc)                   # Print a summary

mc$modelName                  # Optimal selected model ==> "VVV"
mc$G                          # Optimal number of clusters ==> 3
head(mc$z, 30)                # Probability of belonging to a given cluster
head(mc$classification, 30)   # Cluster assignment of each observation

VISUALIZING MODEL-BASED CLUSTERING

library(factoextra)
# BIC values used for choosing the number of clusters
fviz_mclust(mc, "BIC", palette = "jco")
# Classification: plot showing the clustering
fviz_mclust(mc, "classification", geom = "point", pointsize = 1.5, palette = "jco")
# Classification uncertainty
fviz_mclust(mc, "uncertainty", palette = "jco")

RESULT:
Thus the implementation of clustering techniques using partitioning and model-based
clustering was executed and verified successfully.

EX.NO:7 VISUALIZE DATA USING ANY PLOTTING FRAMEWORK


DATE:

AIM:
To visualize data using a plotting framework, which provides an efficient graphical display for
summarizing and reasoning about quantitative information.
1. Histogram
A histogram is basically a plot that breaks the data into bins (or breaks) and shows the
frequency distribution of these bins. You can also change the breaks and see the effect it has on
the data visualization in terms of understandability.
Note: We have used the par(mfrow=c(2,3)) command to fit multiple graphs on the same page for the
sake of clarity (see the code below).
PROGRAM:
library(RColorBrewer)
data(VADeaths)
par(mfrow=c(2,3))
hist(VADeaths,breaks=10, col=brewer.pal(3,"Set3"),main="Set3 3 colors")
hist(VADeaths,breaks=3 ,col=brewer.pal(3,"Set2"),main="Set2 3 colors")
hist(VADeaths,breaks=7, col=brewer.pal(3,"Set1"),main="Set1 3 colors")
hist(VADeaths,breaks= 2, col=brewer.pal(8,"Set3"),main="Set3 8 colors")
hist(VADeaths,col=brewer.pal(8,"Greys"),main="Greys 8 colors")
hist(VADeaths,col=brewer.pal(8,"Greens"),main="Greens 8 colors")
OUTPUT:

2.1. Line Chart


Below is the line chart showing the increase in air passengers over the given time period. Line
charts are commonly preferred when we have to analyse a trend spread over a time period.
Furthermore, a line plot is also suitable where we need to compare relative changes in
quantities across some variable (like time). Below is the code:
PROGRAM:
data(AirPassengers)
plot(AirPassengers,type="l") #Simple Line Plot

OUTPUT:

2.2. Bar Chart
Bar Plots are suitable for showing comparison between cumulative totals across several groups.
Stacked Plots are used for bar plots for various categories. Here’s the code:
PROGRAM:
data("iris")
barplot(iris$Petal.Length) #Creating simple Bar Graph
barplot(iris$Sepal.Length,col = brewer.pal(3,"Set1"))
barplot(table(iris$Species,iris$Sepal.Length),col = brewer.pal(3,"Set1")) #Stacked Plot
OUTPUT:

3. Box Plot
A box plot shows five statistically significant numbers: the minimum, the 25th percentile, the
median, the 75th percentile and the maximum. It is thus useful for visualizing the spread of the
data and deriving inferences accordingly.
PROGRAM:
data(iris)
par(mfrow=c(2,2))
boxplot(iris$Sepal.Length,col="red")
boxplot(iris$Sepal.Length~iris$Species,col="red")
boxplot(iris$Sepal.Length~iris$Species,col=heat.colors(3))
boxplot(iris$Sepal.Length~iris$Species,col=topo.colors(3))
boxplot(iris$Petal.Length~iris$Species) #Creating Box Plot between two variable
OUTPUT:

4. Scatter Plot (including 3D and other features)


Scatter plots help in visualizing data easily and for simple data inspection. Here’s the code for
simple scatter and multivariate scatter plot:
PROGRAM:
plot(x=iris$Petal.Length) #Simple Scatter Plot
plot(x=iris$Petal.Length,y=iris$Species) #Multivariate Scatter Plot
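The section heading mentions 3D scatter plots, but no 3D example is shown. A minimal sketch, assuming the add-on package scatterplot3d is installed (it is not used elsewhere in this manual):

# Sketch: 3D scatter plot of three iris measurements
# install.packages("scatterplot3d")   # run once if the package is missing
library(scatterplot3d)
scatterplot3d(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length,
              color = as.numeric(iris$Species), pch = 16,
              xlab = "Sepal Length", ylab = "Sepal Width", zlab = "Petal Length")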

OUTPUT:

5. Heat Map
One of the most innovative data visualizations in R, the heat map emphasizes color intensity to
visualize relationships between multiple variables. The result is an attractive 2D image that is
easy to interpret. As a basic example, a heat map highlights the popularity of competing items
by ranking them according to their original market launch date. It breaks it down further by
providing sales statistics and figures over the course of time.
PROGRAM:
# simulate a dataset of 10 points
x <- rnorm(10, mean = rep(1:5, each = 2), sd = 0.7)
y <- rnorm(10, mean = rep(c(1, 9), each = 5), sd = 0.1)
dataFrame <- data.frame(x = x, y = y)
set.seed(143)
# convert to class 'matrix', then shuffle the rows of the matrix
dataMatrix <- as.matrix(dataFrame)[sample(1:10), ]
heatmap(dataMatrix) # visualize hierarchical clustering via a heatmap

OUTPUT:

6. Correlogram
Correlated data is best visualized through corrplot. The 2D format is similar to a heat map, but it
highlights statistics that are directly related.
Most correlograms highlight the amount of correlation between datasets at various points in
time. Comparing sales data between different months or years is a basic example.
PROGRAM:
library(corrplot)  # provides corrplot(); install.packages("corrplot") if needed
data("mtcars")

corr_matrix <- cor(mtcars)

# with circles
corrplot(corr_matrix)
# with numbers and lower triangle
corrplot(corr_matrix, method = 'number', type = "lower")

OUTPUT:

7. Area Chart
Area charts express continuity between different variables or data sets. It's akin to the traditional
line chart you know from grade school and is used in a similar fashion.
Most area charts highlight trends and their evolution over the course of time, making them
highly effective when trying to expose underlying trends whether they're positive or negative.
PROGRAM:
library(dplyr)    # for %>%, group_by() and summarise()
library(ggplot2)  # for ggplot(), geom_area() and labs()
data("airquality") #dataset used
airquality %>%
group_by(Day) %>%
summarise(mean_wind = mean(Wind)) %>%
ggplot() +
geom_area(aes(x = Day, y = mean_wind)) +
labs(title = "Area Chart of Average Wind per Day",
subtitle = "using airquality data", y = "Mean Wind")
OUTPUT:

RESULT:
Thus the data was visualized using different plotting frameworks successfully.

EX.NO:8 IMPLEMENT AN APPLICATION THAT STORES BIG DATA IN HBASE / MONGODB / PIG USING HADOOP / R.

DATE:

MongoDB with R
1) To use MongoDB with R, first we have to download and install MongoDB. Next, start
MongoDB. We can start MongoDB like so:
mongod
2) Inserting data
Let’s insert the crimes data from data.gov to MongoDB. The dataset reflects reported incidents
of crime (with the exception of murders where data exists for each victim) that occurred in the
City of Chicago since 2001.
library (ggplot2)
library (dplyr)
library (maps)
library (ggmap)
library (mongolite)
library (lubridate)
library (gridExtra)
crimes=data.table::fread("Crimes_2001_to_present.csv")
names (crimes)
Output:
'ID' 'Case Number' 'Date' 'Block' 'IUCR' 'Primary Type' 'Description' 'Location Description'
'Arrest' 'Domestic' 'Beat' 'District' 'Ward' 'Community Area' 'FBI Code' 'X Coordinate' 'Y
Coordinate' 'Year' 'Updated On' 'Latitude' 'Longitude' 'Location'
3) Let’s remove spaces in the column names to avoid any problems when we query it from
MongoDB.
names(crimes) = gsub(" ","",names(crimes))
names(crimes) 'ID' 'CaseNumber' 'Date' 'Block' 'IUCR' 'PrimaryType' 'Description'
'LocationDescription' 'Arrest' 'Domestic' 'Beat' 'District' 'Ward' 'CommunityArea' 'FBICode'
'XCoordinate' 'YCoordinate' 'Year' 'UpdatedOn' 'Latitude' 'Longitude' 'Location'

4) Let’s use the insert function from the mongolite package to insert rows into a collection in
MongoDB. Let’s create a database called Chicago and call the collection crimes.
my_collection = mongo(collection = "crimes", db = "Chicago") # create connection,
database and collection
my_collection$insert(crimes)
5) Let’s check if we have inserted the “crimes” data.
my_collection$count()
6261148
We see that the collection has 6261148 records
6) First, let’s look what the data looks like by displaying one record:
my_collection$iterate()$one()
$ID
1454164

$Case Number
' G185744'
$Date
' 04/01/2001 06:00:00 PM'
$Block
' 049XX N MENARD AV'
$IUCR
' 0910'
$Primary Type
' MOTOR VEHICLE THEFT'
$Description
' AUTOMOBILE'
$Location Description
' STREET'
$Arrest
' false'
$Domestic
' false'
$Beat
1622
$District
16
$FBICode
' 07'
$XCoordinate
1136545
$YCoordinate
1932203
$Year
2001
$Updated On
' 08/17/2015 03:03:40 PM'
$Latitude
41.970129962
$Longitude
-87.773302309
$Location
'(41.970129962, -87.773302309)'
7) How many distinct “Primary Type” do we have?
length(my_collection$distinct("PrimaryType"))
35
As shown above, there are 35 different crime primary types in the database. We will see the
patterns of the most common crime types below.
8) Now, let’s see how many domestic assaults there are in the collection.
my_collection$count('{"PrimaryType":"ASSAULT", "Domestic" : "true" }')
82470
9) To get the filtered data, we can also retrieve only the columns of interest.
query1= my_collection$find('{"PrimaryType" : "ASSAULT", "Domestic" : "true" }')
query2= my_collection$find('{"PrimaryType" : "ASSAULT", "Domestic" : "true" }',
fields = '{"_id":0, "PrimaryType":1, "Domestic":1}')
ncol(query1) # with all the columns
ncol(query2) # only the selected columns
22
2
10) To find out “Where do most crimes take place?” use the following command.
my_collection$aggregate('[{"$group":{"_id":"$LocationDescription", "Count":
{"$sum":1}}}]')%>%na.omit()%>%
arrange(desc(Count))%>%head(10)%>%
ggplot(aes(x=reorder(`_id`,Count),y=Count))+
geom_bar(stat="identity",color='skyblue',fill='#b35900')+geom_text(aes(label = Count), color =
"blue") +coord_flip()+xlab("Location Description")

11) If loading the entire dataset we are working with does not slow down our analysis, we can
use data.table or dplyr, but when dealing with big data, using MongoDB can give us a
performance boost as the whole data will not be loaded into memory. We can reproduce the
above plot without using MongoDB, like so:
crimes%>%group_by(`LocationDescription`)%>%summarise(Total=n())%>%
arrange(desc(Total))%>%head(10)%>%
ggplot(aes(x=reorder(`LocationDescription`,Total),y=Total))+
geom_bar(stat="identity",color='skyblue',fill='#b35900')+geom_text(aes(label = Total), color =
"blue") +coord_flip()+xlab("Location Description")

12) What if we want to query all records for certain columns only? This helps us to load only
the columns we want and to save memory for our analysis.
my_collection$find('{}', fields = '{"_id":0, "Latitude":1, "Longitude":1,"Year":1}')
13) We can explore any patterns of domestic crimes. For example, are they common in certain
days/hours/months?
domestic=my_collection$find('{"Domestic":"true"}', fields = '{"_id":0,
"Domestic":1,"Date":1}')
domestic$Date= mdy_hms(domestic$Date)
domestic$Weekday = weekdays(domestic$Date)
domestic$Hour = hour(domestic$Date)
domestic$month = month(domestic$Date,label=TRUE)
WeekdayCounts = as.data.frame(table(domestic$Weekday))
WeekdayCounts$Var1 = factor(WeekdayCounts$Var1, ordered=TRUE, levels=c("Sunday",
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday","Saturday"))
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1),size=2,color="red")
+ xlab("Day of the Week") + ylab("Total Domestic Crimes")+
ggtitle("Domestic Crimes in the City of Chicago Since 2001")+
theme(axis.title.x=element_blank(),axis.text.y =
element_text(color="blue",size=11,angle=0,hjust=1,vjust=0),
axis.text.x = element_text(color="blue",size=11,angle=0,hjust=.5,vjust=.5),
axis.title.y = element_text(size=14),
plot.title=element_text(size=16,color="purple",hjust=0.5))

14) Domestic crimes are more common over the weekend than on weekdays. What could be the
reason? We can also see the pattern for each day by hour:
DayHourCounts = as.data.frame(table(domestic$Weekday, domestic$Hour))
DayHourCounts$Hour = as.numeric(as.character(DayHourCounts$Var2))
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Var1),
size=1.4)+ylab("Count")+
ylab("Total Domestic Crimes")+ggtitle("Domestic Crimes in the City of Chicago Since 2001")+
theme(axis.title.x=element_text(size=14),axis.text.y =
element_text(color="blue",size=11,angle=0,hjust=1,vjust=0),
axis.text.x = element_text(color="blue",size=11,angle=0,hjust=.5,vjust=.5),
axis.title.y = element_text(size=14),
legend.title=element_blank(),
plot.title=element_text(size=16,color="purple",hjust=0.5))

15) The crimes peak mainly around midnight. We can also use one color for weekdays and
another color for the weekend, as shown below.
DayHourCounts$Type = ifelse((DayHourCounts$Var1 == "Sunday") | (DayHourCounts$Var1
== "Saturday"), "Weekend", "Weekday")
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Type),
size=2, alpha=0.5) +
ylab("Total Domestic Crimes")+ggtitle("Domestic Crimes in the City of Chicago Since 2001")+
theme(axis.title.x=element_text(size=14),axis.text.y =
element_text(color="blue",size=11,angle=0,hjust=1,vjust=0),
axis.text.x = element_text(color="blue",size=11,angle=0,hjust=.5,vjust=.5),
axis.title.y = element_text(size=14),
legend.title=element_blank(),
plot.title=element_text(size=16,color="purple",hjust=0.5))

16) The difference between weekend and weekdays is clearer from this figure than from the
previous plot. We can also see the above pattern from a heat map.
DayHourCounts$Var1 = factor(DayHourCounts$Var1, ordered=TRUE, levels=c("Monday",
"Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq)) +
scale_fill_gradient(name="Total MV Thefts", low="white", high="red") +
ggtitle("Domestic Crimes in the City of Chicago Since 2001")+theme(axis.title.y =
element_blank())+ylab("")+theme(axis.title.x=element_text(size=14),axis.text.y =
element_text(size=13),axis.text.x = element_text(size=13), axis.title.y =
element_text(size=14),
legend.title=element_blank(),plot.title=element_text(size=16,color="purple",hjust=0.5))

17) Let’s see the pattern of other crime types. Let’s focus on four of the most common ones.
crimes=my_collection$find('{}', fields = '{"_id":0, "PrimaryType":1,"Year":1}')
crimes%>%group_by(PrimaryType)%>%summarize(Count=n())%>%arrange(desc(Count))%>%head(4)
Imported 6261148 records. Simplifying into dataframe...
PrimaryType Count
THEFT 1301434
BATTERY 1142377
CRIMINAL DAMAGE 720143
NARCOTICS 687790
18) As shown in the table above, the most common crime type is theft followed by battery.
Narcotics is fourth most common while criminal damage is the third most common crime type
in the city of Chicago. Now, let’s generate plots by day and hour.
four_most_common=crimes%>%group_by(PrimaryType)%>%summarize(Count=n())%>%arrange(desc(Count))%>%head(4)
four_most_common=four_most_common$PrimaryType
crimes=my_collection$find('{}', fields = '{"_id":0, "PrimaryType":1,"Date":1}')
crimes=filter(crimes,PrimaryType %in%four_most_common)
crimes$Date= mdy_hms(crimes$Date)
crimes$Weekday = weekdays(crimes$Date)
crimes$Hour = hour(crimes$Date)
crimes$month=month(crimes$Date,label = TRUE)
g = function(data){WeekdayCounts = as.data.frame(table(data$Weekday))
WeekdayCounts$Var1 = factor(WeekdayCounts$Var1, ordered=TRUE, levels=c("Sunday",
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday","Saturday"))
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1),size=2,color="red") +
xlab("Day of the Week") + theme(axis.title.x=element_blank(),axis.text.y =
element_text(color="blue",size=10,angle=0,hjust=1,vjust=0),
axis.text.x = element_text(color="blue",size=10,angle=0,hjust=.5,vjust=.5),
axis.title.y = element_text(size=11),
plot.title=element_text(size=12,color="purple",hjust=0.5)) }
g1=g(filter(crimes,PrimaryType=="THEFT"))+ggtitle("Theft")+ylab("Total Count")
g2=g(filter(crimes,PrimaryType=="BATTERY"))+ggtitle("BATTERY")+ylab("Total Count")
g3=g(filter(crimes,PrimaryType=="CRIMINAL DAMAGE"))+ggtitle("CRIMINAL
DAMAGE")+ylab("Total Count")
g4=g(filter(crimes,PrimaryType=="NARCOTICS"))+ggtitle("NARCOTICS")+ylab("Total
Count")
grid.arrange(g1,g2,g3,g4,ncol=2)

From the plots above, we see that theft is most common on Friday. Battery and criminal
damage, on the other hand, are highest at the weekend. We also observe that narcotics decreases
over the weekend. We can also see the pattern of the above four crime types by hour:

19) We can also see a map for domestic crimes only:
domestic=my_collection$find('{"Domestic":"true"}', fields = '{"_id":0, "Latitude":1,
"Longitude":1,"Year":1}')
LatLonCounts=as.data.frame(table(round(domestic$Longitude,2),round(domestic$Latitude,2)))
LatLonCounts$Long = as.numeric(as.character(LatLonCounts$Var1))
LatLonCounts$Lat = as.numeric(as.character(LatLonCounts$Var2))
ggmap(chicago) + geom_tile(data = LatLonCounts, aes(x = Long, y = Lat, alpha = Freq),
fill="red")+
ggtitle("Domestic Crimes")+labs(alpha="Count")+theme(plot.title = element_text(hjust=0.5))

20) Let’s see where motor vehicle theft is common:
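The original manual shows only the resulting map at this point. A sketch of how such a map could be produced, following the same pattern as the domestic-crimes map above (the chicago map object, the my_collection connection and the query value "MOTOR VEHICLE THEFT" are assumptions carried over from the previous steps):

# Sketch: map where motor vehicle theft is concentrated
mvt = my_collection$find('{"PrimaryType":"MOTOR VEHICLE THEFT"}',
                         fields = '{"_id":0, "Latitude":1, "Longitude":1}')
MVTCounts = as.data.frame(table(round(mvt$Longitude, 2), round(mvt$Latitude, 2)))
MVTCounts$Long = as.numeric(as.character(MVTCounts$Var1))
MVTCounts$Lat = as.numeric(as.character(MVTCounts$Var2))
ggmap(chicago) + geom_tile(data = MVTCounts, aes(x = Long, y = Lat, alpha = Freq), fill = "red") +
  ggtitle("Motor Vehicle Thefts") + labs(alpha = "Count") + theme(plot.title = element_text(hjust = 0.5))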

Domestic crimes show concentration over two areas, whereas motor vehicle theft is widespread
over a large part of the city of Chicago.

RESULT:
Thus the application that stores big data in MongoDB using R was executed successfully.

BIG DATA VIVA QUESTIONS
1. What do you understand by the term 'big data'?
Big data deals with complex and large sets of data that cannot be handled using conventional
software.
2. How is big data useful for businesses?
Big Data helps organizations understand their customers better by allowing them to draw
conclusions from large data sets collected over the years. It helps them make better decisions.
3. What is the Port Number for NameNode?
NameNode – Port 50070
4. What is the function of the JPS command?
 The JPS command is used to test whether all the Hadoop daemons are running correctly or not.
5. What is the command to start up all the Hadoop daemons together?
./sbin/start-all.sh
6. Name a few features of Hadoop.
 Some of the most useful features of Hadoop,
1. Its open-source nature.
2. User-friendly.
3. Scalability.
4. Data locality.
5. Data recovery.
7. What are the five V’s of Big Data?
 The five V’s of Big data are Volume, Velocity, Variety, Veracity, and Value.
8. What are the components of HDFS?
 The two main components of HDFS are:
1. Name Node
2. Data Node
9. How is Hadoop related to Big Data?
Hadoop is a framework that specializes in big data operations.
10. Name a few data management tools used with Edge Nodes?
Oozie, Flume, Ambari, and Hue are some of the data management tools that work with edge
nodes in Hadoop.
11. What are the steps to deploy a Big Data solution?
The three steps to deploying a Big Data solution are:
1. Data Ingestion
2. Data Storage and
3. Data Processing
12. How many modes can Hadoop be run in?
 Hadoop can be run in three modes— Standalone mode, Pseudo-distributed mode and fully-
distributed mode.
13.    Name the core methods of a reducer
 The three core methods of a reducer are,
1. setup()
2. reduce()
3. cleanup()
14.    What is the command for shutting down all the Hadoop Daemons together?
./sbin/stop-all.sh
15. What is the role of NameNode in HDFS?
NameNode is responsible for processing metadata information for data blocks within HDFS.
16. What is FSCK?
FSCK (File System Check) is a command used to detect inconsistencies and issues in the file.
17. What are the real-time applications of Hadoop?
 Some of the real-time applications of Hadoop are in the fields of:
 Content management.
 Financial agencies.
 Defense and cybersecurity.
 Managing posts on social media.
18. What is the function of HDFS?
 The HDFS (Hadoop Distributed File System) is Hadoop’s default storage unit. It is used for
storing different types of data in a distributed environment.
19. What is commodity hardware?
Commodity hardware can be defined as the basic hardware resources needed to run the Apache
Hadoop framework.
20. Name a few daemons used for testing JPS command.
 NameNode
 NodeManager
 DataNode
 ResourceManager
21. What are the most common input formats in Hadoop?
 Text Input Format
 Key Value Input Format
 Sequence File Input Format
22. Name a few companies that use Hadoop.
 Yahoo, Facebook, Netflix, Amazon, and Twitter.
23. What is the default mode for Hadoop?
 Standalone mode is Hadoop's default mode. It is primarily used for debugging purposes.
24. What is the role of Hadoop in big data analytics?
By providing storage and helping in the collection and processing of data, Hadoop helps in the
analytics of big data.
25. What are the components of YARN?
 The two main components of YARN (Yet Another Resource Negotiator) are:
 Resource Manager
 Node Manager
26. Explain the core methods of a Reducer.
There are three core methods of a reducer. They are-
setup() – This is used to configure different parameters like heap size, distributed cache and
input data.
reduce() – A method that is called once per key with the associated reduce task
cleanup() – Clears all temporary files and called only at the end of a reducer task.
27. Talk about the different tombstone markers used for deletion purposes in HBase.
There are three main tombstone markers used for deletion in HBase. They are-
Family Delete Marker – For marking all the columns of a column family.
Version Delete Marker – For marking a single version of a single column.
Column Delete Marker – For marking all the versions of a single column.
28. What is Apache Hive?
Ans. Hive is basically a data warehousing tool built on top of Hadoop. It provides SQL-like
queries to perform analysis and also gives an abstraction. Although Hive is not a database, it
gives you a logical abstraction over the databases and the tables.
29. What kind of applications is supported by Apache Hive?
Ans. Hive supports all client applications written in Java, PHP, Python, C++ or Ruby by
exposing its Thrift server.
30. What is PIG?
Pig is a scripting platform that allows users to write MapReduce operations using a scripting
language called Pig Latin. Apache Pig is a platform for analyzing large data sets. Pig scripts are
converted into MapReduce jobs which run on data stored in HDFS.
