DWDM - Lab

Demonstration of Data Structures in R
_____________________________________________________________________________
Aim:
To demonstrate the basic data structures in R.
Algorithm:
i) VECTOR:
 To create a vector, use the c() function, e.g. c(1, 2, 3).
 Another way to create a vector is the assign() function, e.g. assign("x", c(1, 2, 3)).
 An easy way to make a vector of consecutive integers is the : operator, e.g. 1:10.
ii) MATRICES:
 To create a matrix in R, use the matrix() function.
 The first argument to matrix() is the vector of elements.
 Then pass the number of rows (nrow) and the number of columns (ncol) you want in the matrix.
NOTE: By default, matrix() fills the matrix column by column; pass byrow = TRUE to fill it row by row.
iii) DATA FRAME:
 To create a data frame, use the data.frame() function.
 Pass each of the vectors you have created as arguments to the function; each vector becomes a column.
iv) LIST:
 A list can be created using the list() function.
 A component can be deleted by assigning NULL to it.
Data Structures in R:
i) Vector
ii) Matrices
iii) Data Frame
iv) List
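The creation functions named in the algorithm can be tried out with a short sketch like the one below. The object names (v1, m_col, scores, info) are illustrative and not part of the lab program.

```r
# Vectors: three ways to create them
v1 <- c(10, 20, 30)          # combine values with c()
assign("v2", c(10, 20, 30))  # same result through assign()
v3 <- 1:5                    # consecutive integers with the : operator

# Matrix: filled column by column unless byrow = TRUE
m_col <- matrix(1:6, nrow = 2, ncol = 3)
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

# Data frame: each vector becomes a column
scores <- data.frame(id = v3, passed = c(TRUE, FALSE, TRUE, TRUE, FALSE))

# List: assigning NULL deletes a component
info <- list(name = "Emma", age = 54, retired = FALSE)
info$age <- NULL

print(m_col)
print(m_row)
print(names(info))
```

Comparing m_col with m_row shows the effect of the column-wise default mentioned in the note.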
i) Vector:
 A sequence of elements that share the same data type is known as a vector.
 The vector is the basic data structure in R; most other structures are built from it.
 If the data has only one dimension, like a set of numbers, a vector can be used to represent it.
ii) Matrices:
 A two-dimensional rectangular data set is known as a matrix.
 It is used when the data has two dimensions (rows and columns); data with more dimensions is stored in an array.
 It contains data of a single class only, e.g. only character or only numeric.
iii) Data Frame:
 A data frame is a two-dimensional, table-like structure in which each column contains the values of one variable and each row contains one set of values from each column.
 It is a list of equal-length vectors, so different columns can have different data types.
iv) List:
 A list is a data structure whose components can have mixed data types. In R, lists are the second type of vector (generic vectors, as opposed to atomic vectors).
 It is very flexible and is used when the data cannot be represented by a data frame.
 A list is a generic vector that can contain other objects, including other lists.

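The type rules in the definitions above (one type per vector or matrix, mixed types in a list, one type per data frame column) can be checked with a small sketch. All object names are illustrative.

```r
# A vector holds one type: mixing numbers and text coerces everything to character
mixed_vector <- c(1, "a", TRUE)
print(class(mixed_vector))

# A matrix also holds a single class
m <- matrix(c(1, 2, "x", "y"), nrow = 2)
print(class(m[1, 1]))

# A list keeps each component's own type
mixed_list <- list(1, "a", TRUE)
print(sapply(mixed_list, class))

# A data frame is a list of equal-length vectors, one per column
df <- data.frame(n = 1:3, s = c("a", "b", "c"), stringsAsFactors = FALSE)
print(is.list(df))
print(sapply(df, class))
```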
PROGRAM:
# Vector: the simplest structure in R; all elements have one data type.
x <- c(1, 2, 3, 4)
x
# List: a recursive (generic) vector that can hold different data types.
y <- list(1, 2, 3, 4)
y
my.list <- list(name = c("Robert", "Emma"), age = c(65, 54), retired = c(TRUE, FALSE))
my.list
# Looking at age alone.
my.list$age
my.list[["age"]]
my.list[["age"]][1]
my.list[["age"]][2]
# Similarly, by position.
my.list[[3]]
my.list[[3]][2]
# Matrices: a single table with rows and columns of data of one type.
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 3, ncol = 2)
B
# To access elements of a matrix:
# before the comma is the row, after the comma is the column.
B[1, ]   # first row
B[, 2]   # second column
B[1, 2]  # element in row 1, column 2
# Data Frame: a single table with rows and columns of data.
# Each column can have a different data type.
# Consider the following vectors:
Product <- c("Bag", "Shoes", "Belt", "Belt")
Total_price <- c(500, 1000, 150, 200)
Color <- c("Blue", "red", "red", "Blue")
Quantity <- c(5, 2, 3, 4)
Product_details <- data.frame(Product, Total_price, Color, Quantity, stringsAsFactors = FALSE)
Product_details
# The same data frame built in a single call:
Product_details <- data.frame(Product = c("Bag", "Shoes", "Belt", "Belt"),
                              Total_price = c(500, 1000, 150, 200),
                              Color = c("Blue", "red", "red", "Blue"),
                              Quantity = c(5, 2, 3, 4),
                              stringsAsFactors = FALSE)
Product_details
class(Product_details)
Product_details[, 2]   # second column
Product_details[2, ]   # second row
Product_details[2, 2]  # row 2, column 2
Product_details$Product
Output:
Result: Thus the demonstration of data structures in R is successfully executed and verified.
To Perform The Statistical Analysis Of Data
______________________________________________________________________________
Aim:
To perform the statistical analysis of data.
Measures of Central Tendency:
 Mean
 Median
 Mode
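Mean: The arithmetic average of a distribution of scores.
Example scores: 17, 4, 33, 2, 51, 23, 3, 41, 18, 2, 4, 2
Mean = (17+4+33+2+51+23+3+41+18+2+4+2)/12 = 200/12
= 16.67
Median: The median is the middle value in a list ordered from smallest to largest. With an even number of scores, it is the average of the two middle values.
Sorted scores: 2, 2, 2, 3, 4, 4, 17, 18, 23, 33, 41, 51
Median = (4+17)/2
= 10.5
Mode: The score in the distribution that occurs most frequently.
Mode = 2 (it occurs three times)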
Box Plot: A graphic representation of the distribution of scores on a variable that includes the
range, the median, and the interquartile range.
Hist: A histogram can be created using the hist() function in the R programming language. This
function takes a vector of values for which the histogram is plotted.
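The hand-worked mean, median, and mode above can be reproduced in R. This is a minimal sketch that uses the same twelve scores; the variable names are illustrative.

```r
scores <- c(17, 4, 33, 2, 51, 23, 3, 41, 18, 2, 4, 2)

# Mean: sum of the scores divided by how many there are
mean_value <- sum(scores) / length(scores)

# Median: average of the two middle values, since there are 12 scores
median_value <- median(scores)

# Mode: the value with the highest count in the frequency table
counts <- table(scores)
mode_value <- as.numeric(names(counts)[counts == max(counts)])

print(round(mean_value, 2))  # 16.67
print(median_value)          # 10.5
print(mode_value)            # 2
```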
Program:
x <- c(8,2,7,1,2,9,8,2,10,9)
#Exploratory Data Analysis
hist(x)
boxplot(x)
#Mean: The mean is the average of the numbers
sum(x)/length(x)
# ?mean opens the help page of a function when we do not know how it works.
# mean() is a function in base R
mean(x)
#Median: the middle number given the numbers are in order (sorted)
sort(x)
#?median
median(x)
#Mode: The number which appears most often in a set of numbers.
#There is no function in base R to find mode of set of numbers
x <- c(8,2,7,1,2,9,8,2,10,9)
#Function to find Mode
#?table
y <- table(x)
names(y)[which(y==max(y))]
#or in single line
names(table(x))[table(x)==max(table(x))]
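#Testing the case where two or more numbers have the same highest frequency
x <- c(8,2,7,1,2,9,8,2,10,9,8)
sort(x)
#Mode: both 2 and 8 occur three times, so both are returned
y <- table(x)
names(y)[which(y==max(y))]
#Mean, Median and Mode using `mtcars` dataset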
head(mtcars)
x <- mtcars$wt
#Mean
mean(x)
#Median
median(x)
#Mode
y <- table(x)
names(y)[which(y==max(y))]
#or in single line
names(table(x))[table(x)==max(table(x))]
#Mean, Median and Mode using `airquality` dataset
#The `airquality` dataset is used here because it has missing values
#Summary Statistics
dim(airquality)
names(airquality)
str(airquality)
head(airquality)
#Column names with missing Values
names(airquality)[colSums(is.na(airquality)) > 0]
airquality$Ozone
airquality$Solar.R
x <- airquality$Solar.R
table(is.na(x))
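#Mean
mean(x) # returns NA because x contains missing values
?mean
mean(x, na.rm = TRUE) # na.rm = TRUE drops the missing values first
#Median
median(x) # also NA
median(x, na.rm = TRUE)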
#Mode
#Removing NA is not an issue for the mode: table() ignores NA by default
sort(table(x))
#To count the NA values as well:
# sort(table(x, useNA = "always"))

#Summary functions
# summary()  - base R
# describe() - package `psych`
summary(mtcars)
summary(airquality)
#install.packages("psych")
library(psych)
describe(mtcars)
describe(airquality)
Measures of Shapes:
 Skewness
-Negative Skew
-Positive Skew
 Kurtosis
Skewness: When a distribution of scores has a high number of scores clustered at one end of the
distribution with relatively few scores spread out toward the other end of the distribution,
forming a tail.
Negative Skew: In a skewed distribution, when most of the scores are clustered at the
higher end of the distribution with a few scores creating a tail at the lower end of the distribution.
Positive Skew: In a skewed distribution, when most of the scores are clustered at the
lower end of the distribution with a few scores creating a tail at the higher end of the distribution.
Kurtosis: It is a measure of the combined weight of a distribution's tails relative to the center of
the distribution
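To make the definitions of skewness and kurtosis concrete, here is a minimal base R sketch built on central moments. The helper names skew() and kurt() are hypothetical. The formulas are the common moment-based ones; that they agree with the functions of the `moments` package is an assumption to check against its documentation.

```r
# Moment-based skewness: third central moment / (second central moment)^(3/2)
skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based kurtosis: fourth central moment / (second central moment)^2
kurt <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2
}

symmetric  <- c(1, 2, 3, 4, 5)       # no tail on either side
right_tail <- c(1, 1, 2, 2, 3, 20)   # few large scores: positive skew
left_tail  <- -right_tail            # mirrored data: negative skew

print(skew(symmetric))
print(skew(right_tail))
print(skew(left_tail))
print(kurt(symmetric))
```

A symmetric sample gives a skewness of zero, while the sign of the result follows the side on which the tail lies.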
Program:
# Calculate kurtosis and skewness in R
# install.packages("moments")  # run once if the package is not installed
library(moments)
test <- c(41,34,39,34,34,32,37,32,43,43,24,32)
kurtosis(test)
skewness(test)
Measures of Variability:
 Range
 Variance
 Standard Deviation
Range: The range is simply the difference between the largest score (the maximum value) and
the smallest score (the minimum value) of a distribution.
Variance: The variance provides a statistical average of the amount of dispersion in a
distribution of scores. In other words, it is the sum of the squared deviations divided by the number of
cases in the population, or by the number of cases minus one in the sample.
Standard Deviation: Deviation, in this case, refers to the difference between an individual score
in a distribution and the average score for the distribution. So if the average score for a
distribution is 10, and an individual child has a score of 12, the deviation is 2. The other word in
the term standard deviation is standard; here it means typical or average, so the standard deviation
is the typical deviation of a score from the mean. It is the square root of the variance.
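Variance and Standard Deviation Formulae:
Population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, $\sigma = \sqrt{\sigma^2}$
Sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, $s = \sqrt{s^2}$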
Quartile Deviation: Quartiles divide the ordered data into four equal parts. The quartile deviation
is half of the difference between the upper quartile (Q3) and the lower quartile (Q1).
Interquartile Range(IQR): The difference between the 75th percentile and 25th percentile
scores in a distribution.
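The measures of variability defined above can be computed step by step and compared with the built-in functions. This is a sketch under the definitions given in the text; the variable names are illustrative.

```r
x <- c(5, 10, 12, 15, 20, 25, 27, 30, 35)
n <- length(x)

# Range: largest score minus smallest score
range_value <- max(x) - min(x)

# Variance: sum of squared deviations from the mean
deviations <- x - mean(x)
pop_variance    <- sum(deviations^2) / n        # population: divide by N
sample_variance <- sum(deviations^2) / (n - 1)  # sample: divide by n - 1

# Standard deviation: square root of the variance
sample_sd <- sqrt(sample_variance)

# Interquartile range and quartile deviation
q <- quantile(x, probs = c(0.25, 0.75))
iqr_value <- unname(q[2] - q[1])
quartile_deviation <- iqr_value / 2

print(c(range = range_value, variance = sample_variance,
        sd = sample_sd, iqr = iqr_value, qd = quartile_deviation))
```

The base R functions var() and sd() use the sample (n - 1) form, so they should agree with sample_variance and sample_sd rather than with pop_variance.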
Program:
# Calculate standard error in R
x<- c(15,13,12,35,12,12,11,13,12,13,15,11,13,12,15)
# standard error of the mean = standard deviation / square root of the vector length
sd(x)/sqrt(length(x))
# set up standard deviation in R example
test <- c(41,34,39,34,34,32,37,32,43,43,24,32)
# standard deviation R function
# sample standard deviation in r
sd(test)
# calculate variance in R
test <- c(41,34,39,34,34,32,37,32,43,43,24,32)
var(test)
# quartile in R example
test = c(9,9,8,9,10,9,3,5,6,8,9,10,11,12,13,11,10)
# get quartiles in R (single line); the base R function is quantile()
quantile(test, probs=c(.25,.5,.75))
# quartile in R example - summary function
test = c(9,9,8,9,10,9,3,5,6,8,9,10,11,12,13,11,10)
summary(test)
# how to find interquartile range in R
x =c(5, 10,12,15,20,25,27,30, 35)
IQR(x)
# interquartile in R example - summary function
x =c(5, 10,12,15,20,25,27,30, 35)
summary(x)
Output:
Result: Thus the statistical analysis of data is successfully executed and verified.
Demonstration of Association Rule Mining using Apriori Algorithm on super market data.
______________________________________________________________________________
Aim:
To demonstrate the Association Rule Mining using Apriori Algorithm on super market data.
Algorithm:
Step 1: Load required library
'arules' package provides the infrastructure for representing, manipulating, and
analyzing transaction data and patterns.
library(arules)
'arulesViz' package is used for visualizing association rules and frequent itemsets.
library(arulesViz)
'RColorBrewer' package provides ColorBrewer palettes, i.e. color schemes for
maps and other graphics.
library(RColorBrewer)
Step 2: Import the dataset
'Groceries' dataset is included in the arules package.
Step 3: Applying apriori() function
The default behavior is to mine the rules with a minimum support of 0.1 and a minimum
confidence of 0.8. The program below lowers these to 0.01 and 0.2.
Step 4: Applying inspect() function
It displays the mined rules; here the first 10 rules of the result are shown.
Step 5: Applying itemFrequencyPlot() function
Creates a bar plot of item frequencies (support).
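The support and confidence thresholds mentioned in Step 3 can be illustrated without any package. The sketch below uses five invented baskets; the data and the helper names support() and confidence() are hypothetical and are not part of arules.

```r
# Five hypothetical market baskets (invented for illustration)
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread", "butter"),
  c("milk", "butter"),
  c("milk", "bread", "butter")
)

# Support: fraction of baskets that contain every item of the itemset
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Confidence of the rule lhs => rhs: support(lhs and rhs) / support(lhs)
confidence <- function(lhs, rhs) {
  support(c(lhs, rhs)) / support(lhs)
}

print(support(c("milk", "bread")))   # 3 of the 5 baskets
print(confidence("milk", "bread"))   # 3 of the 4 baskets that contain milk
```

A rule is reported by apriori() only if both values reach the chosen minimums, which is why lowering supp and conf yields more rules.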
Program:
# Loading Libraries
library(arules)
library(arulesViz)
library(RColorBrewer)
# import dataset
data("Groceries")
# using apriori() function
rules<-apriori(Groceries,
parameter = list(supp = 0.01, conf = 0.2))
# using inspect() function
inspect(rules[1:10])
# using itemFrequencyPlot() function
arules::itemFrequencyPlot(Groceries, topN = 20,
col = brewer.pal(8, 'Pastel2'),
main = 'Relative Item Frequency Plot',
type = "relative",
ylab = "Item Frequency (Relative)")

Output:
Result:
Thus the Demonstration of Association Rule Mining using Apriori Algorithm on super
market data is successfully executed and verified.
Demonstration of FP Growth algorithm on supermarket data
______________________________________________________________________________
Aim:
To demonstrate the FP Growth algorithm on supermarket data.
Algorithm:
 Count the occurrences of individual items.
 Filter out infrequent items using the minimum support.
 Order the items in each transaction by their individual occurrences.
 Create the tree and add the transactions to it one by one.
Program:
library("rCBA")
data("iris")
train <- sapply(iris,as.factor)
train <- data.frame(train, check.names=FALSE)
txns <- as(train,"transactions")
rules = rCBA::fpgrowth(txns, support=0.03, confidence=0.03, maxLength=2,
consequent="Species",
parallel=FALSE)
predictions <- rCBA::classification(train,rules)
table(predictions)
sum(as.character(train$Species)==as.character(predictions),na.rm=TRUE)/length(predictions)
prunedRules <- rCBA::pruning(train, rules, method="m2cba", parallel=FALSE)
predictions <- rCBA::classification(train, prunedRules)
table(predictions)
sum(as.character(train$Species)==as.character(predictions),na.rm=TRUE)/length(predictions)
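As a complement to the program above, the first three steps of the algorithm (count, filter, order) can be sketched in base R. The transactions and the minimum count are invented for illustration, and the sketch stops before building the tree itself.

```r
# Hypothetical transactions (invented for illustration)
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("milk", "jam"),
  c("bread", "butter"),
  c("bread", "milk", "butter", "eggs")
)
min_count <- 2  # hypothetical minimum support, expressed as a count

# Step 1: count the occurrences of individual items
item_counts <- table(unlist(transactions))

# Step 2: keep only the frequent items
frequent <- item_counts[item_counts >= min_count]

# Step 3: order frequent items by decreasing count (ties broken alphabetically)
item_order <- names(frequent)[order(-as.integer(frequent), names(frequent))]

# Rewrite each transaction without infrequent items and in that order;
# these ordered transactions are what would be inserted into the FP-tree.
ordered_txns <- lapply(transactions, function(txn) item_order[item_order %in% txn])

print(item_order)
print(ordered_txns)
```

Because every transaction now lists its items in the same global order, transactions that share frequent items also share a prefix, which is what lets the tree compress them.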
Output:
Result:
Thus the Demonstration of FP Growth on super market data is successfully executed and
verified.
To perform the classification by decision tree induction using R.

______________________________________________________________________________
Aim:
To perform the classification by decision tree induction using R.
Algorithm:
 Select the best attribute A.
 Assign A as the decision attribute for the root node.
 For each value of A, create a descendant of the node.
 Assign the classification to each leaf node.
 If the data is correctly classified: stop.
 Or else: iterate over the new leaf nodes.
Program:
library(party)
# readingSkills ships with the party package; use the first 105 rows as input
input.dat <- readingSkills[c(1:105),]
png(file = "decision_tree.png")
output.tree <- ctree(
nativeSpeaker ~ age + shoeSize + score,
data = input.dat)
plot(output.tree)
dev.off()
dim(readingSkills)
input.dat[2,]
readingSkills[c(1:105),]
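The first step of the algorithm, selecting the best attribute, is not spelled out in the text. One common criterion is information gain, and the sketch below assumes it. Note that this is an assumption for illustration: ctree() in the program above chooses splits with statistical tests rather than information gain. The toy data and the helper names entropy() and info_gain() are hypothetical.

```r
# Shannon entropy of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain obtained by splitting `labels` on the values of `attribute`
info_gain <- function(attribute, labels) {
  groups <- split(labels, attribute)
  weighted <- sum(sapply(groups, function(g) length(g) / length(labels) * entropy(g)))
  entropy(labels) - weighted
}

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  outlook = c("sunny", "sunny", "rain", "rain", "overcast", "overcast"),
  windy   = c("yes", "no", "yes", "no", "yes", "no"),
  play    = c("no", "no", "no", "yes", "yes", "yes")
)

# Gain of every candidate attribute; the largest one becomes the root node
gains <- sapply(toy[c("outlook", "windy")], info_gain, labels = toy$play)
best_attribute <- names(gains)[which.max(gains)]

print(gains)
print(best_attribute)
```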
Output :
Result :
Thus the classification by decision tree induction using R is performed and successfully
executed and verified.
To perform classification using Bayesian classification algorithm using R.

Aim:
To perform classification using Bayesian Classification Algorithm using R.
Algorithm:
Step 1: Import required libraries.
Step 2: Load the data set.
Step 3: Check the structure of the dataset.
Step 4: Check the summary.
Step 5: Train - Test Split.
Step 6: Separate the test labels from the test data.
Step 7: Train the model.
Step 8: Make predictions.
Step 9: Compare the predicted and actual values.
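The steps above can be followed end to end in base R with a hand-written Gaussian naive Bayes classifier. This is a sketch under stated assumptions: it uses the built-in iris data so that it runs without a download, it assumes a normal distribution for every numeric predictor, and all names (predict_row, priors, means, sds) are illustrative rather than part of the naivebayes package.

```r
set.seed(1234)
# Steps 2 to 5: load the data and make a random 80/20 train/test split
idx   <- sample(nrow(iris), size = round(0.8 * nrow(iris)))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Step 6: separate the test labels from the test data
test_labels   <- test$Species
test_features <- test[, 1:4]

# Step 7: "training" stores the class priors and the per-class mean and sd of each feature
classes <- levels(train$Species)
priors  <- table(train$Species) / nrow(train)
means   <- sapply(classes, function(k) colMeans(train[train$Species == k, 1:4]))
sds     <- sapply(classes, function(k) apply(train[train$Species == k, 1:4], 2, sd))

# Step 8: predict the class with the highest log posterior for each row
predict_row <- function(row) {
  log_post <- sapply(classes, function(k) {
    log(as.numeric(priors[k])) +
      sum(dnorm(as.numeric(row), means[, k], sds[, k], log = TRUE))
  })
  classes[which.max(log_post)]
}
predicted <- apply(test_features, 1, predict_row)

# Step 9: compare the predicted and actual values
confusion <- table(predicted, actual = test_labels)
accuracy  <- mean(predicted == test_labels)
print(confusion)
print(accuracy)
```

Working in log space avoids multiplying many small probabilities together, which is the usual reason for summing log densities instead.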
Program:
library(naivebayes)
library(dplyr)
library(ggplot2)
library(psych)
#Read data file
getwd()
data <- read.csv('https://raw.githubusercontent.com/bkrai/Statistical-Modeling-and-Graphs-with-R/main/binary.csv')
#contingency table
xtabs(~admit + rank, data = data)

#Rank & admit are categorical variables
data$rank <- as.factor(data$rank)
data$admit <- as.factor(data$admit)
# Visualization
pairs.panels(data[-1])
data %>%
group_by(admit) %>%
ggplot(aes(x=admit, y=gre, fill=admit)) +
geom_boxplot()
data %>%
ggplot(aes(x=admit, y=gpa, fill=admit)) +
geom_boxplot() +
ggtitle('Box Plot')
data %>%
ggplot(aes(x=gre, fill=admit)) +
geom_density(alpha=0.8, color='black') +
ggtitle('Density Plot')
data %>%
ggplot(aes(x=gpa, fill=admit)) +
geom_density(alpha=0.8, color='black') +
ggtitle('Density Plot')
#Split data into Training (80%) and Testing (20%) datasets
set.seed(1234)
ind <- sample(2,nrow(data),replace=TRUE, prob=c(0.8,.2))
train <- data[ind==1,]
test <- data[ind==2,]
# Naive Bayes
model <- naive_bayes(admit ~ ., data = train)
model
plot(model)
# numeric predictors - means (1st col) & sd's (2nd col)
train %>% filter(admit=="0") %>%
summarize(mean(gre), sd(gre))
# Predict
p <- predict(model, train, type= 'prob')
head(cbind(p, train))
# Misclassification error - train data
p1 <- predict(model, train)
(tab1 <- table(p1, train$admit))
1 - sum(diag(tab1))/ sum(tab1)
# Misclassification error - test data
p2 <- predict(model, test)
(tab2 <- table(p2, test$admit))
1 - sum(diag(tab2))/ sum(tab2)
Output :
Result:
Thus the classification using the Bayesian classification algorithm is performed using R,
successfully executed and verified.
Perform the cluster analysis by k-means method using R.
______________________________________________________________________________
Aim:
To perform the cluster analysis by k-means method using R.
Algorithm:
1. Choose the number K of clusters.
2. Select K points at random as the initial centroids (not necessarily from the given data).
3. Assign each data point to the closest centroid; this forms K clusters.
4. Compute and place the new centroid of each cluster.
5. Reassign each data point to the new closest centroid. If any reassignment took place, go to step 4; otherwise the clustering is final.
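The five steps above can be written out as a short loop in base R. This is a minimal sketch on invented two-dimensional data, with the initial centroids drawn from the data points for simplicity; the built-in kmeans() used in the program is more robust, for example it handles multiple random starts.

```r
set.seed(42)
# Hypothetical 2-D data: two well-separated groups of 20 points each
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

k <- 2                                                   # step 1
centroids <- pts[sample(nrow(pts), k), , drop = FALSE]   # step 2 (drawn from the data here)
cluster <- rep(0, nrow(pts))

repeat {
  # steps 3 and 5: squared distance of every point to every centroid,
  # then assign each point to its closest centroid
  d <- sapply(seq_len(k), function(j) rowSums(sweep(pts, 2, centroids[j, ])^2))
  new_cluster <- max.col(-d, ties.method = "first")
  if (all(new_cluster == cluster)) break   # no reassignment took place: stop
  cluster <- new_cluster
  # step 4: move each centroid to the mean of the points assigned to it
  for (j in seq_len(k)) {
    centroids[j, ] <- colMeans(pts[cluster == j, , drop = FALSE])
  }
}

print(table(cluster))
print(centroids)
```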
Program:
# Installing Packages
install.packages("ClusterR")
install.packages("cluster")
# Loading package
library(ClusterR)
library(cluster)
# Removing initial label of
# Species from original dataset
iris_1 <- iris[, -5]
# Fitting K-Means clustering Model
# to training dataset
set.seed(240) # Setting seed

kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)
kmeans.re
# Cluster identification for
# each observation
kmeans.re$cluster
# Confusion Matrix
cm <- table(iris$Species, kmeans.re$cluster)
cm
# Model Evaluation and visualization
plot(iris_1[c("Sepal.Length", "Sepal.Width")])
plot(iris_1[c("Sepal.Length", "Sepal.Width")],
col = kmeans.re$cluster)
plot(iris_1[c("Sepal.Length", "Sepal.Width")],
col = kmeans.re$cluster,
main = "K-means with 3 clusters")
## Plotting cluster centers
kmeans.re$centers
kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")]
# cex is symbol size, pch is symbol
points(kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")],
col = 1:3, pch = 8, cex = 3)
y_kmeans <- kmeans.re$cluster
clusplot(iris_1[, c("Sepal.Length", "Sepal.Width")],
y_kmeans,
lines = 0,
shade = TRUE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
main = paste("Cluster iris"),
xlab = 'Sepal.Length',
ylab = 'Sepal.Width')
Output:
Result:
Thus the cluster analysis by k-means method using R is successfully executed and
verified.
Perform the hierarchical clustering using R Programming.
______________________________________________________________________________
Aim:
To perform hierarchical clustering using R programming.
Algorithm:
1. Make each data point a single-point cluster; this forms N clusters.
2. Take the two closest data points and make them one cluster; this forms N-1 clusters.
3. Take the two closest clusters and make them one cluster; this forms N-2 clusters.
4. Repeat step 3 until there is only one cluster.
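The merging steps above can be traced on a tiny example where the distances are easy to check by hand. The five one-dimensional observations below are invented for illustration; average linkage is used to match the program that follows.

```r
# Five hypothetical one-dimensional observations
x <- c(a = 1, b = 2, c = 6, d = 7, e = 15)

d  <- dist(x)                          # pairwise Euclidean distances
hc <- hclust(d, method = "average")    # agglomerative clustering, average linkage

print(hc$merge)    # one row per merge: N - 1 = 4 merges in total
print(hc$height)   # distance at which each merge happened

# Stopping when two clusters remain instead of merging down to one
groups <- cutree(hc, k = 2)
print(groups)
```

The closest pairs (a, b) and (c, d) merge first at distance 1, then those two clusters join at average distance 5, and the isolated point e is absorbed last at distance 11.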
Program:
# Installing the package
install.packages("dplyr")
# Loading package
library(dplyr)
# Summary of dataset in package
head(mtcars)
# Finding distance matrix
distance_mat <- dist(mtcars, method = 'euclidean')
distance_mat
# Fitting Hierarchical clustering Model
# to training dataset
set.seed(240) # Setting seed

Hierar_cl <- hclust(distance_mat, method = "average")
Hierar_cl
# Plotting dendrogram
plot(Hierar_cl)
# Choosing no. of clusters
# Cutting tree by height
abline(h = 110, col = "green")
# Cutting tree by no. of clusters
fit <- cutree(Hierar_cl, k = 3 )
fit
table(fit)
rect.hclust(Hierar_cl, k = 3, border = "green")
Output:
Result:
Thus the hierarchical clustering using R programming is performed, executed and verified.
Study of Regression Analysis using R programming.
______________________________________________________________________________
Aim:
To study regression analysis using R programming.
Algorithm:
Step 1: Load the data into R.
Step 2: Make sure your data meet the assumptions.
Step 3: Perform the linear regression analysis.
Step 4: Check for homoscedasticity.
Step 5: Visualize the results with a graph.
Step 6: Report your results.
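The six steps above describe a linear regression workflow, which can be sketched with the built-in `cars` data set (stopping distance against speed). The choice of data set and the crude residual comparison used for step 4 are assumptions made for illustration; a residual plot is the more usual check.

```r
data(cars)                                   # step 1: load a built-in data set
str(cars)                                    # step 2: inspect the variables
print(cor(cars$speed, cars$dist))            #         is the relationship roughly linear?

fit <- lm(dist ~ speed, data = cars)         # step 3: fit the linear model

# step 4: crude homoscedasticity check, comparing the mean absolute residual
# for low and high fitted values (similar values suggest constant spread)
res <- residuals(fit)
spread <- tapply(abs(res), fitted(fit) > median(fitted(fit)), mean)
print(spread)

# step 5: in an interactive session, uncomment to draw the data and the fitted line
# plot(dist ~ speed, data = cars); abline(fit)

print(coef(fit))                             # step 6: report the estimates
print(summary(fit)$r.squared)
```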
Program:
# Generate random IQ values with mean = 30 and sd =2
IQ <- rnorm(40, 30, 2)
# Sorting IQ level in ascending order
IQ <- sort(IQ)
# Generate vector with pass and fail values of 40 students
result <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
1, 1, 1, 0, 1, 1, 1, 1, 0, 1)
# Data Frame
df <- as.data.frame(cbind(IQ, result))
# Print data frame
print(df)
# output to be present as PNG file
png(file="LogisticRegressionGFG.png")
# Plotting IQ on x-axis and result on y-axis
plot(IQ, result, xlab = "IQ Level",
ylab = "Probability of Passing")
# Create a logistic model
g = glm(result~IQ, family=binomial, df)
# Create a curve based on prediction using the regression model
curve(predict(g, data.frame(IQ=x), type="resp"), add=TRUE)
# Draw the fitted probabilities as a set of points
# (pch values 26 to 31 are unused in R, so a solid circle, pch = 19, is used here)
points(IQ, fitted(g), pch=19)
# Summary of the regression model
summary(g)
# saving the file
dev.off()
Output:
Result:
Thus the study of regression analysis using R programming was executed successfully and
verified.
Outlier detection using R programming.
______________________________________________________________________________
Aim:
To detect outliers using R programming.
Algorithm:
 Load the dataset.
 Detect outliers with the boxplot function.
 Replace the outliers with NA values.
 Verify that all outliers are replaced with NA.
Program:
library(DMwR2)
set.seed(937573)
x <- rnorm(1000)
x[1:5] <- c(7, 10, - 5, 16, - 23)
x
#visualizing outlier using boxplot

boxplot(x)
#computing outlier scores using the LOF (local outlier factor) algorithm
iris2 <- iris[,1:4]
outlier.scores <- lofactor(iris2, k=5)
plot(density(outlier.scores))
#visualizing the top 5 outliers using biplot
outliers <- order(outlier.scores, decreasing=TRUE)[1:5]
print(outliers)
n <- nrow(iris2)
labels <- 1:n
labels[-outliers] <- "."
biplot(prcomp(iris2), cex=.8, xlabs=labels)
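The program above visualizes outliers but does not carry out the replace and verify steps listed in the algorithm. A minimal base R sketch of those two steps follows, assuming the boxplot rule (values beyond the whiskers) defines an outlier and that NA is the replacement value. The data and variable names are illustrative.

```r
set.seed(1)
x <- rnorm(100)
x[1:3] <- c(10, -12, 15)              # planted outliers, in the spirit of the program above

# Detect: boxplot.stats() returns the values lying beyond the boxplot whiskers
out_values <- boxplot.stats(x)$out

# Replace: set every detected outlier to NA
x_clean <- x
x_clean[x_clean %in% out_values] <- NA

# Verify: none of the detected values remain, and the NA count matches
print(sum(is.na(x_clean)))
print(any(x_clean %in% out_values))
```

Functions such as mean() then need na.rm = TRUE when applied to the cleaned vector.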
Output:
Result:
Thus the outlier detection using R program was successfully executed and Verified.
