DWDM - Lab


Demonstration of Data Structures in R

_____________________________________________________________________________

Aim:

To demonstrate the basic data structures in R.

Algorithm:

i) VECTOR:

• To create a vector, use the c() function.
• Another way to create a vector is the assign() function.
• An easy way to make an integer vector is the : operator.

ii) MATRICES:

• To create a matrix in R, use the matrix() function.
• The first argument to matrix() is the vector of elements.
• Then pass the number of rows and the number of columns you want the matrix to have.

NOTE: By default, matrices are filled in column-wise order.

iii) DATA FRAME:

• To create a data frame, use the data.frame() function.
• Pass each of the vectors you have created as arguments to the function.

iv) LIST:

• A list can be created using the list() function.
• A component can be deleted by assigning NULL to it.
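
A minimal sketch of the points listed above that are easy to overlook: assign(), the : operator, the column-wise fill order of matrix(), and deleting a list component with NULL. The object names (v, s, m_col, m_row, info) are illustrative assumptions and not part of the lab program.

```r
# assign() creates a variable from its name given as a string
assign("v", c(10, 20, 30))   # same effect as v <- c(10, 20, 30)

# The : operator builds an integer sequence
s <- 1:5

# matrix() fills column by column unless byrow = TRUE is given
m_col <- matrix(1:6, nrow = 2, ncol = 3)
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_col[1, 2]   # 3, second column starts with the third element
m_row[1, 2]   # 2, first row is 1 2 3

# Assigning NULL removes a component from a list
info <- list(name = "Emma", age = 54, retired = FALSE)
info$age <- NULL
names(info)   # "name" "retired"
```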

Data Structures in R:

i) Vector

ii) Matrices

iii) Data Frame

iv) List

i) Vector:

• A sequence of elements that share the same data type is known as a vector.
• The vector is the basic data structure in R and plays an important role in R programming.
• If the data has only one dimension, like a set of numbers, a vector can be used to represent it.

ii) Matrices:

• A two-dimensional rectangular data set is known as a matrix.
• It is used when the data has two dimensions (rows and columns).
• It contains data of a single class only, e.g. only character or only numeric.

iii) Data Frame:

• A data frame is a two-dimensional, array-like structure or table in which each column contains the values of one variable and each row contains one set of values from each column.
• It is a list of equal-length vectors.

iv) List:

• A list is a data structure whose components can have mixed data types. In R, lists are the second type of vector (generic vectors, as opposed to atomic vectors).
• It is very flexible and is used when the data cannot be represented by a data frame.
• A list is a generic vector that can contain other objects.


PROGRAM:

# Vector: the simplest structure in R; it holds only one data type.
x <- c(1, 2, 3, 4)
x

# List: a recursive vector that can hold different data types.

y <- list(1, 2, 3, 4)
y
my.list <- list(name = c("Robert", "Emma"), age = c(65, 54), retired = c(TRUE, FALSE))
my.list
# Looking for age alone here.
my.list$age
my.list[["age"]]
my.list[["age"]][1]
my.list[["age"]][2]
# Similarly, by position
my.list[[3]]
my.list[[3]][2]

# Matrix: a single table with rows and columns of data of one type.

B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 3, ncol = 2)
B
# To access elements of the matrix:
# before the comma = row, after the comma = column.
B[1, ]    # row 1
B[, 2]    # column 2
B[1, 2]   # element in row 1, column 2

# Data Frame: a single table with rows and columns of data. Each column can be
# a different data type.
# Consider the following vectors:
Product <- c("Bag", "Shoes", "Belt", "Belt")
Total_price <- c(500, 1000, 150, 200)
Color <- c("Blue", "red", "red", "Blue")
Quantity <- c(5, 2, 3, 4)
Product_details <- data.frame(Product, Total_price, Color, Quantity, stringsAsFactors = FALSE)
Product_details
# The same data frame built in a single call:
Product_details <- data.frame(Product = c("Bag", "Shoes", "Belt", "Belt"),
                              Total_price = c(500, 1000, 150, 200),
                              Color = c("Blue", "red", "red", "Blue"),
                              Quantity = c(5, 2, 3, 4), stringsAsFactors = FALSE)
Product_details
class(Product_details)
Product_details[, 2]      # second column
Product_details[2, ]      # second row
Product_details[2, 2]     # row 2, column 2
Product_details$Product   # column by name
Output:

Result: Thus the demonstration of data structures in R is successfully executed and verified.

To Perform the Statistical Analysis of Data

______________________________________________________________________________

Aim:

To perform the statistical analysis of data.

Measures of Central Tendency:

• Mean
• Median
• Mode

Mean: The arithmetic average of a distribution of scores.

(17+4+33+2+51+23+3+41+18+2+4+2)/12

Mean= 16.67

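Median: The median is the middle value in a list ordered from smallest to largest. With an even number of scores it is the average of the two middle values.

Sorted scores: 2, 2, 2, 3, 4, 4, 17, 18, 23, 33, 41, 51 (the two middle values are 4 and 17)

Median = (4+17)/2 = 10.5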

Mode: The score in the distribution that occurs most frequently.

Mode=2
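
The worked values above can be checked with a short sketch in base R. The vector name scores and the helper variable names are assumptions made for illustration; the twelve numbers are the ones used in the mean calculation.

```r
# The twelve scores from the worked example
scores <- c(17, 4, 33, 2, 51, 23, 3, 41, 18, 2, 4, 2)

mean_value <- mean(scores)       # 200 / 12
median_value <- median(scores)   # average of the 6th and 7th sorted values

# Base R has no function for the statistical mode, so count with table()
freq <- table(scores)
mode_value <- as.numeric(names(freq)[freq == max(freq)])

round(mean_value, 2)   # 16.67
median_value           # 10.5
mode_value             # 2
```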

Box Plot: A graphic representation of the distribution of scores on a variable that includes the
range, the median, and the interquartile range.

Hist: A histogram can be created using the hist() function in the R programming language. This
function takes a vector of values for which the histogram is plotted.

Program:
x <- c(8,2,7,1,2,9,8,2,10,9)

#Exploratory Data Analysis

hist(x)

boxplot(x)

#Mean: The mean is the average of the numbers

sum(x)/length(x)

#?mean - the ? operator opens the help page of a function when we are unsure how to use it.

#Function in base R

mean(x)

#Median: the middle number given the numbers are in order (sorted)

sort(x)

#?median

median(x)

#Mode: The number which appears most often in a set of numbers.

#There is no function in base R to find mode of set of numbers

x <- c(8,2,7,1,2,9,8,2,10,9)

#Function to find Mode

#?table

y <- table(x)

names(y)[which(y==max(y))]
#or in single line

names(table(x))[table(x)==max(table(x))]

#Testing if there are two or more numbers with same frequency

x <- c(8,2,7,1,2,9,8,2,10,9,8)

sort(x)

#Mode

names(table(x))[table(x)==max(table(x))]

#Mean, Median and Mode using `mtcars` dataset

head(mtcars)

x <- mtcars$wt

#Mean

mean(x)

#Median

median(x)

#Mode

y <- table(x)

names(y)[which(y==max(y))]

#or

names(table(x))[table(x)==max(table(x))]

#Mean, Median and Mode using `airquality` dataset

# The `airquality` dataset is used here because it has missing values.

#Summary Statistics
dim(airquality)

names(airquality)

str(airquality)

head(airquality)

#Column names with missing Values

names(airquality)[colSums(is.na(airquality)) > 0]

airquality$Ozone

airquality$Solar.R

x <- airquality$Solar.R

table(is.na(x))

#Mean

mean(x)

?mean

mean(x, na.rm = TRUE)

#Median

median(x)

median(x, na.rm = TRUE)

#Mode

# table() drops NA values by default, so no NA removal is needed to find the mode.

sort(table(x))

names(table(x))[table(x)==max(table(x))]

#x<- airquality$Solar.R

# sort(table(x))

# sort(table(x, useNA = "always"))


#Summary functions

# summary()  - base R

# describe() - package `psych`

summary(mtcars)

summary(airquality)

#install.packages("psych")

library(psych)

describe(mtcars)

describe(airquality)

Measures of Shape:

• Skewness
  - Negative Skew
  - Positive Skew
• Kurtosis

Skewness: A distribution is skewed when a high number of scores are clustered at one end of the
distribution with relatively few scores spread out toward the other end, forming a tail.

Negative Skew: Most of the scores are clustered at the higher end of the distribution, with a
few scores creating a tail at the lower end.

Positive Skew: Most of the scores are clustered at the lower end of the distribution, with a
few scores creating a tail at the higher end.

Kurtosis: It is a measure of the combined weight of a distribution's tails relative to the center of
the distribution.
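
The two shape measures can be sketched in base R from central moments. The helper names (central_moment, skew, kurt) and the three small vectors are assumptions for illustration; the formulas are the plain moment-based definitions (kurtosis not reduced by 3), which is assumed here to be what the moments package computes.

```r
# k-th central moment: average of (x - mean)^k
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness and (non-excess) kurtosis
skew <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurt <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

symmetric  <- c(1, 2, 3, 4, 5)
right_tail <- c(1, 1, 1, 2, 10)      # tail at the higher end: positive skew
left_tail  <- c(-10, -2, -1, -1, -1) # tail at the lower end: negative skew

skew(symmetric)    # 0
kurt(symmetric)    # 1.7
skew(right_tail)   # greater than 0
skew(left_tail)    # less than 0
```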
Program:

# Calculate Kurtosis in R

install.packages("moments")

library(moments)

test <- c(41,34,39,34,34,32,37,32,43,43,24,32)

kurtosis(test)

skewness(test)

Measures of Variability:

• Range
• Variance
• Standard Deviation

Range: The range is simply the difference between the largest score (the maximum value) and
the smallest score (the minimum value) of a distribution.

Variance: The variance provides a statistical average of the amount of dispersion in a
distribution of scores. In other words, it is the sum of the squared deviations from the mean
divided by the number of cases in the population, or by the number of cases minus one in the sample.

Standard Deviation: Deviation, in this case, refers to the difference between an individual score
in a distribution and the average score for the distribution. So if the average score for a
distribution is 10, and an individual child has a score of 12, the deviation is 2. The other word in
the term, standard, means typical or average, so the standard deviation is the typical deviation
of individual scores from the mean. It is computed as the square root of the variance.

Variance and Standard Deviation Formulae:
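
The standard textbook forms, consistent with the description above (divide by the number of cases N for a population, by n - 1 for a sample), are sketched here. The symbols (N, n, mu for the population mean, x-bar for the sample mean) are notational assumptions.

```latex
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}, \qquad
\sigma = \sqrt{\sigma^2}
\qquad\text{(population)}

s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad
s = \sqrt{s^2}
\qquad\text{(sample)}
```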


Quartile Deviation: The quartile deviation is defined as half of the difference
between the upper and lower quartiles.

Interquartile Range(IQR): The difference between the 75th percentile and 25th percentile
scores in a distribution.
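
As a rough check of the definitions above, the measures can be computed by hand and compared with the built-in functions. This is a sketch only; the variable names are assumptions, and the data vector is the one used in the program that follows.

```r
test <- c(41, 34, 39, 34, 34, 32, 37, 32, 43, 43, 24, 32)
n <- length(test)

# Range: largest score minus smallest score
value_range <- max(test) - min(test)

# Variance: sum of squared deviations, divided by n - 1 (sample) or n (population)
deviations <- test - mean(test)
var_sample <- sum(deviations^2) / (n - 1)
var_population <- sum(deviations^2) / n

# Standard deviation: square root of the variance
sd_sample <- sqrt(var_sample)

# Interquartile range and quartile deviation
q <- quantile(test, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])
quartile_deviation <- iqr_manual / 2

# var(), sd() and IQR() use the sample definitions computed above
all.equal(var_sample, var(test))
all.equal(sd_sample, sd(test))
all.equal(iqr_manual, IQR(test))
```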

Program:

# Calculate Standard Error in R

x <- c(15,13,12,35,12,12,11,13,12,13,15,11,13,12,15)

# standard error = standard deviation / square root of the vector length

sd(x)/sqrt(length(x))

# set up standard deviation in R example

test <- c(41,34,39,34,34,32,37,32,43,43,24,32)

# standard deviation R function

# sample standard deviation in r

sd(test)
# calculate variance in R

test <- c(41,34,39,34,34,32,37,32,43,43,24,32)

var(test)

# quartile in R example

test = c(9,9,8,9,10,9,3,5,6,8,9,10,11,12,13,11,10)

# get quartiles in R code (single line); the base R function is quantile()

quantile(test, probs = c(.25, .5, .75))

# quartile in R example - summary function

test = c(9,9,8,9,10,9,3,5,6,8,9,10,11,12,13,11,10)

summary(test)

# how to find interquartile range in R

x =c(5, 10,12,15,20,25,27,30, 35)

IQR(x)

# interquartile in R example - summary function

x =c(5, 10,12,15,20,25,27,30, 35)

summary(x)
Output:

Result: Thus the statistical analysis of data is successfully executed and verified.

Demonstration of Association Rule Mining using Apriori Algorithm on super market data.
______________________________________________________________________________

Aim:

To demonstrate the Association Rule Mining using Apriori Algorithm on super market data.

Algorithm:

Step 1: Load required library

The 'arules' package provides the infrastructure for representing, manipulating, and
analyzing transaction data and patterns.

library(arules)

The 'arulesViz' package visualizes association rules and frequent itemsets.

library(arulesViz)

'RColorBrewer' provides the ColorBrewer palettes, which are color schemes for
maps and other graphics.

library(RColorBrewer)

Step 2: Import the dataset

The 'Groceries' dataset is included in the arules package.

Step 3: Applying apriori() function

By default, apriori() mines rules with a minimum support of 0.1 and a minimum
confidence of 0.8. The program lowers these thresholds to 0.01 and 0.2.

Step 4: Applying inspect() function

It displays the rules in readable form; the program shows the first 10 rules.

Step 5: Applying itemFrequencyPlot() function

It creates a bar plot of item frequencies (support).
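
To make the support and confidence thresholds in Step 3 concrete, here is a small base R sketch that computes them by hand. The five toy baskets and the helper name support() are hypothetical; they are not taken from the Groceries data and do not use the arules API.

```r
# Five hypothetical market baskets
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)

# Support: fraction of baskets that contain every item in the itemset
support <- function(items) {
  mean(sapply(baskets, function(basket) all(items %in% basket)))
}

# Rule {bread} => {milk}
supp_rule <- support(c("bread", "milk"))     # 3 of 5 baskets
conf_rule <- supp_rule / support("bread")    # of the 4 bread baskets, 3 have milk
lift_rule <- conf_rule / support("milk")     # below 1: slightly negative association

supp_rule   # 0.6
conf_rule   # 0.75
lift_rule   # 0.9375
```

A rule is kept by apriori() only if its support and confidence both reach the chosen minimums.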

Program:
# Loading Libraries

library(arules)

library(arulesViz)

library(RColorBrewer)

# import dataset

data("Groceries")

# using apriori() function

rules<-apriori(Groceries,

parameter = list(supp = 0.01, conf = 0.2))

# using inspect() function

inspect(rules[1:10])

# using itemFrequencyPlot() function

arules::itemFrequencyPlot(Groceries, topN = 20,

col = brewer.pal(8, 'Pastel2'),

main = 'Relative Item Frequency Plot',

type = "relative",

ylab = "Item Frequency (Relative)")


Output:

Result:

Thus the demonstration of Association Rule Mining using the Apriori Algorithm on supermarket
data is successfully executed and verified.

Demonstration of FP Growth algorithm on supermarket data

______________________________________________________________________________

Aim:
To demonstrate the FP Growth algorithm on supermarket data.

Algorithm:

• Count the occurrences of individual items.
• Filter out infrequent items using the minimum support.
• Order the items in each transaction by their individual occurrence counts.
• Create the tree and add the transactions one by one.
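
The first three steps (count, filter, order) can be sketched in base R before any tree is built. The toy baskets, the threshold min_count, and the variable names are hypothetical; this is only the preprocessing that FP Growth relies on, not the rCBA implementation.

```r
# Five hypothetical transactions
baskets <- list(
  c("milk", "bread", "eggs"),
  c("bread", "butter"),
  c("milk", "bread", "butter"),
  c("bread", "jam"),
  c("milk", "bread")
)
min_count <- 2   # assumed minimum support, expressed as a count

# Step 1: count the occurrences of individual items
counts <- table(unlist(baskets))

# Step 2: drop items below the minimum support
# Step 3: order the remaining items by descending count
frequent <- sort(counts[counts >= min_count], decreasing = TRUE)
item_order <- names(frequent)   # "bread" "milk" "butter"

# Rewrite each transaction with only frequent items, in that global order.
# These ordered lists are what would be inserted into the FP tree one by one.
ordered_baskets <- lapply(baskets, function(basket) {
  item_order[item_order %in% basket]
})
ordered_baskets
```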

Program:

library("arules")  # provides the "transactions" class used by as() below

library("rCBA")

data("iris")

train <- sapply(iris,as.factor)

train <- data.frame(train, check.names=FALSE)

txns <- as(train,"transactions")

rules = rCBA::fpgrowth(txns, support=0.03, confidence=0.03, maxLength=2,


consequent="Species",

parallel=FALSE)

predictions <- rCBA::classification(train,rules)

table(predictions)

sum(as.character(train$Species)==as.character(predictions),na.rm=TRUE)/length(predictions)
prunedRules <- rCBA::pruning(train, rules, method="m2cba", parallel=FALSE)

predictions <- rCBA::classification(train, prunedRules)

table(predictions)

sum(as.character(train$Species)==as.character(predictions),na.rm=TRUE)/length(predictions)

Output:

Result:

Thus the Demonstration of FP Growth on super market data is successfully executed and
verified.

To perform the classification by decision tree induction using R.


______________________________________________________________________________

Aim:

To perform the classification by decision tree induction using R.

Algorithm:

• Select the best attribute, A.
• Assign A as the decision tree root node.
• For each value of A, create a descendant of the node.
• Assign the classification to each leaf node.
• If the data is correctly classified: stop.
• Otherwise: iterate over the tree.
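
One common way to "select the best attribute" is information gain, sketched below in base R. This is an illustration under stated assumptions: the toy vectors and the helper names (entropy, info_gain) are hypothetical, and ctree() from the party package, used in the program, selects splits with statistical tests rather than information gain.

```r
# Entropy of a vector of class labels, in bits
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain: entropy before the split minus weighted entropy after it
info_gain <- function(attribute, labels) {
  groups <- split(labels, attribute)
  weights <- sapply(groups, length) / length(labels)
  entropy(labels) - sum(weights * sapply(groups, entropy))
}

# Hypothetical training data: six cases, two candidate attributes
labels <- c("yes", "yes", "yes", "no", "no", "no")
attr_a <- c("a", "a", "a", "b", "b", "b")   # separates the classes perfectly
attr_b <- c("a", "b", "a", "b", "a", "b")   # barely informative

gain_a <- info_gain(attr_a, labels)   # 1 bit
gain_b <- info_gain(attr_b, labels)   # close to 0

# The attribute with the larger gain would become the root node
best <- c("attr_a", "attr_b")[which.max(c(gain_a, gain_b))]
best
```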

Program:

library(party)

# Use the first 105 rows of the readingSkills dataset that ships with party

input.dat <- readingSkills[c(1:105),]

png(file = "decision_tree.png")

output.tree <- ctree(

nativeSpeaker ~ age + shoeSize + score,

data = input.dat)

plot(output.tree)

dev.off()

dim(readingSkills)

input.dat[2,]

readingSkills[c(1:105),]

Output :
Result :

Thus the classification by decision tree induction using R is performed and successfully
executed and verified.

To perform classification using Bayesian classification algorithm using R.


Aim:

To perform classification using Bayesian Classification Algorithm using R.

Algorithm:

Step 1: Import required libraries.
Step 2: Load the data set.
Step 3: Check the structure of the dataset.
Step 4: Check the summary.
Step 5: Train - Test Split.
Step 6: Separate the test labels from the test data.
Step 7: Train the model.
Step 8: Make predictions.
Step 9: Compare the predicted and actual values.
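
What Step 7 and Step 8 do internally can be sketched by hand for categorical predictors: multiply the class prior by the per-class likelihood of each predictor value, then normalize. The toy data frame and the helper score_for() are hypothetical; the naivebayes package, used in the program, additionally handles numeric predictors and smoothing.

```r
# Hypothetical training data: two predictors and a class label
toy <- data.frame(
  weather = c("sunny", "sunny", "rainy", "rainy", "rainy", "sunny", "sunny", "rainy"),
  windy   = c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE),
  play    = c("yes", "yes", "yes", "no", "no", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Unnormalized score for one class, for a new case: weather = sunny, windy = FALSE
score_for <- function(cls) {
  rows <- toy$play == cls
  prior <- mean(rows)                            # P(class)
  p_weather <- mean(toy$weather[rows] == "sunny") # P(sunny | class)
  p_windy <- mean(!toy$windy[rows])               # P(not windy | class)
  prior * p_weather * p_windy                     # "naive" independence assumption
}

scores <- sapply(c(yes = "yes", no = "no"), score_for)
posterior <- scores / sum(scores)
posterior                                  # yes = 27/32, no = 5/32
predicted <- names(posterior)[which.max(posterior)]
predicted                                  # "yes"
```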

Program:

library(naivebayes)

library(dplyr)

library(ggplot2)

library(psych)

#Read data file

getwd()

data <- read.csv('https://raw.githubusercontent.com/bkrai/Statistical-Modeling-and-Graphs-with-R/main/binary.csv')

#contingency table

xtabs(~admit + rank, data = data)


#Rank & admit are categorical variables

data$rank <- as.factor(data$rank)

data$admit <- as.factor(data$admit)

# Visualization

pairs.panels(data[-1])

data %>%

group_by(admit) %>%

ggplot(aes(x=admit, y=gre, fill=admit)) +

geom_boxplot()

data %>%

ggplot(aes(x=admit, y=gpa, fill=admit)) +

geom_boxplot() +

ggtitle('Box Plot')

data %>%

ggplot(aes(x=gre, fill=admit)) +

geom_density(alpha=0.8, color='black') +

ggtitle('Density Plot')

data %>%

ggplot(aes(x=gpa, fill=admit)) +

geom_density(alpha=0.8, color='black') +

ggtitle('Density Plot')
#Split data into Training (80%) and Testing (20%) datasets

set.seed(1234)

ind <- sample(2,nrow(data),replace=TRUE, prob=c(0.8,.2))

train <- data[ind==1,]

test <- data[ind==2,]

# Naive Bayes

model <- naive_bayes(admit ~ ., data = train)

model

plot(model)

# numeric predictors - means (1st col) & sd's (2nd col)

train %>% filter(admit=="0") %>%

summarize(mean(gre), sd(gre))

# Predict

p <- predict(model, train, type= 'prob')

head(cbind(p, train))

# Misclassification error - train data

p1 <- predict(model, train)

(tab1 <- table(p1, train$admit))

1 - sum(diag(tab1))/ sum(tab1)
# Misclassification error - test data

p2 <- predict(model, test)

(tab2 <- table(p2, test$admit))

1 - sum(diag(tab2))/ sum(tab2)

Output :
Result:

Thus the classification using the Bayesian classification algorithm is performed using R,
successfully executed and verified.

Perform the cluster analysis by k-means method using R.

______________________________________________________________________________

Aim:
To perform the cluster analysis by k-means method using R.
Algorithm:

1. Choose the number K of clusters.
2. Select K points at random as the initial centroids (not necessarily from the given data).
3. Assign each data point to the closest centroid; this forms K clusters.
4. Compute and place the new centroid of each cluster.
5. Reassign each data point to the new closest centroid. If any point changed cluster, go back to step 4; otherwise stop.
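
The loop in steps 3 to 5 can be sketched by hand on one-dimensional data. The data vector and the fixed starting centroids are assumptions chosen so that the result is reproducible; kmeans(), used in the program, works on several dimensions and picks random starts.

```r
x <- c(1, 2, 3, 10, 11, 12)   # hypothetical one-dimensional data
centers <- c(1, 2)            # assumed starting centroids, K = 2

repeat {
  # Step 3: assign each point to its closest centroid
  distances <- abs(outer(x, centers, "-"))   # 6 x 2 matrix of distances
  cluster <- apply(distances, 1, which.min)

  # Step 4: move each centroid to the mean of its cluster
  new_centers <- as.numeric(tapply(x, cluster, mean))

  # Step 5: stop once the centroids no longer move
  if (isTRUE(all.equal(new_centers, centers))) break
  centers <- new_centers
}

centers   # 2 11
cluster   # 1 1 1 2 2 2
```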

Program:

# Installing Packages

install.packages("ClusterR")

install.packages("cluster")

# Loading package

library(ClusterR)

library(cluster)

# Removing initial label of

# Species from original dataset

iris_1 <- iris[, -5]

# Fitting K-Means clustering Model

# to training dataset

set.seed(240) # Setting seed


kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)

kmeans.re

# Cluster identification for

# each observation

kmeans.re$cluster

# Confusion Matrix

cm <- table(iris$Species, kmeans.re$cluster)

cm

# Model Evaluation and visualization

plot(iris_1[c("Sepal.Length", "Sepal.Width")])

plot(iris_1[c("Sepal.Length", "Sepal.Width")],

col = kmeans.re$cluster)

plot(iris_1[c("Sepal.Length", "Sepal.Width")],

col = kmeans.re$cluster,

main = "K-means with 3 clusters")

## Plotting cluster centers

kmeans.re$centers

kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")]

# cex is symbol size, pch is symbol type

points(kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")],


col = 1:3, pch = 8, cex = 3)

y_kmeans <- kmeans.re$cluster

clusplot(iris_1[, c("Sepal.Length", "Sepal.Width")],

y_kmeans,

lines = 0,

shade = TRUE,

color = TRUE,

labels = 2,

plotchar = FALSE,

span = TRUE,

main = paste("Cluster iris"),

xlab = 'Sepal.Length',

ylab = 'Sepal.Width')

Output:
Result:

Thus the cluster analysis by k-means method using R is successfully executed and
verified.

Perform the hierarchical clustering using R Programming.

______________________________________________________________________________

Aim:

To perform hierarchical clustering using R programming.

Algorithm:

1. Make each data point a single-point cluster; this forms N clusters.
2. Take the two closest data points and make them one cluster; this forms N-1 clusters.
3. Take the two closest clusters and make them one cluster; this forms N-2 clusters.
4. Repeat step 3 until there is only one cluster.
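
The merge sequence described above can be followed on a handful of points. In this sketch the five one-dimensional values are hypothetical; dist(), hclust() and cutree() are the same base R functions the program uses, with the same "average" linkage.

```r
pts <- c(0, 1, 5, 7, 20)   # hypothetical data, N = 5 single-point clusters

d <- dist(pts)                         # pairwise Euclidean distances
hc <- hclust(d, method = "average")    # merge the closest clusters step by step

# Each row is one merge: negative numbers are original points,
# positive numbers refer to clusters formed in earlier rows
hc$merge

# Distance at which each merge happened: N - 1 = 4 merges in total
hc$height   # 1.00  2.00  5.50  16.75

# Cutting the tree into 2 clusters isolates the far-away point 20
groups <- cutree(hc, k = 2)
groups      # 1 1 1 1 2
```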

Program:

# Installing the package

install.packages("dplyr")

# Loading package

library(dplyr)

# Summary of dataset in package

head(mtcars)

# Finding distance matrix

distance_mat <- dist(mtcars, method = 'euclidean')

distance_mat

# Fitting Hierarchical clustering Model

# to training dataset

set.seed(240) # Setting seed


Hierar_cl <- hclust(distance_mat, method = "average")

Hierar_cl

# Plotting dendrogram

plot(Hierar_cl)

# Choosing no. of clusters

# Cutting tree by height

abline(h = 110, col = "green")

# Cutting tree by no. of clusters

fit <- cutree(Hierar_cl, k = 3 )

fit

table(fit)

rect.hclust(Hierar_cl, k = 3, border = "green")

Output:
Result:

Thus the hierarchical clustering using R Programming is performed, executed and verified.

Study of Regression Analysis using R programming.

______________________________________________________________________________

Aim:

To study regression analysis using R programming.

Algorithm:

Step 1: Load the data into R.
Step 2: Make sure your data meet the assumptions.
Step 3: Perform the regression analysis (the program fits a logistic regression with glm()).
Step 4: Check for homoscedasticity.
Step 5: Visualize the results with a graph.
Step 6: Report your results.
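
For the logistic model fitted in the program, the predicted probability is the logistic function of the linear predictor, 1 / (1 + exp(-(b0 + b1 * IQ))). The sketch below checks this against predict(). It assumes an evenly spaced IQ vector instead of the random one so that the result is reproducible; the data frame name toy is hypothetical, and the pass/fail vector is the one from the program.

```r
# Deterministic stand-in for the sorted random IQ values (assumption)
IQ <- seq(25, 35, length.out = 40)

# Pass (1) / fail (0) results of the 40 students, as in the program
result <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
            1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
            0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
            1, 1, 1, 0, 1, 1, 1, 1, 0, 1)
toy <- data.frame(IQ = IQ, result = result)

fit <- glm(result ~ IQ, family = binomial, data = toy)
b <- coef(fit)

# Probability of passing computed by hand from the coefficients
p_manual <- 1 / (1 + exp(-(b[["(Intercept)"]] + b[["IQ"]] * IQ)))

# The same probabilities from predict()
p_predict <- predict(fit, type = "response")

all.equal(p_manual, unname(p_predict))
```

A positive IQ coefficient means the estimated probability of passing rises with IQ.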

Program:

# Generate random IQ values with mean = 30 and sd =2

IQ <- rnorm(40, 30, 2)

# Sorting IQ level in ascending order

IQ <- sort(IQ)

# Generate vector with pass and fail values of 40 students

result <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1,

1, 0, 0, 0, 1, 1, 0, 0, 1, 0,

0, 0, 1, 0, 0, 1, 1, 0, 1, 1,

1, 1, 1, 0, 1, 1, 1, 1, 0, 1)

# Data Frame
df <- as.data.frame(cbind(IQ, result))

# Print data frame

print(df)

# output to be present as PNG file

png(file="LogisticRegressionGFG.png")

# Plotting IQ on x-axis and result on y-axis

plot(IQ, result, xlab = "IQ Level",

ylab = "Probability of Passing")

# Create a logistic model

g <- glm(result ~ IQ, family = binomial, data = df)

# Create a curve based on prediction using the regression model

curve(predict(g, data.frame(IQ = x), type = "response"), add = TRUE)

# This draws a set of points

# based on the fit of the regression model

points(IQ, fitted(g), pch = 19)  # filled circles (pch values 26 to 31 are unused in R)

# Summary of the regression model

summary(g)
# saving the file

dev.off()

Output:

Result:

Thus the study of Regression Analysis using R programming was executed successfully and
verified.

Outlier detection using R programming.

______________________________________________________________________________

Aim:

To detect outliers using an R program.

Algorithm:

• Load the dataset.
• Detect outliers with the boxplot function.
• Replace the outliers with NA values.
• Verify that all outliers are replaced with NA.
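
The replace and verify steps can be sketched with boxplot.stats(), the base R function that computes the values boxplot() draws. The data vector is hypothetical; NA is used as the missing-value marker because an R vector cannot hold NULL.

```r
# Hypothetical data with two obvious outliers (95 and -40)
x <- c(10, 12, 11, 13, 12, 11, 95, 12, 10, -40)

# Detect: points beyond 1.5 times the box length from the hinges
out <- boxplot.stats(x)$out
out

# Replace the detected outliers with NA
x_clean <- x
x_clean[x_clean %in% out] <- NA

# Verify: no outliers remain among the non-missing values
remaining <- boxplot.stats(x_clean)$out
length(remaining)   # 0
```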

Program:

library(DMwR2)
set.seed(937573)
x <- rnorm(1000)
x[1:5] <- c(7, 10, - 5, 16, - 23)
x

#visualizing outlier using boxplot


boxplot(x)

#visualizing outlier using LOF algorithm


iris2 <- iris[,1:4]
outlier.scores <- lofactor(iris2, k=5)
plot(density(outlier.scores))

#visualizing outlier using biplot


outliers <- order(outlier.scores, decreasing=TRUE)[1:5]  # indices of the 5 highest LOF scores
print(outliers)
n <- nrow(iris2)
labels <- 1:n
labels[-outliers] <- "."
biplot(prcomp(iris2), cex=.8, xlabs=labels)
Output:
Result:

Thus the outlier detection using R program was successfully executed and Verified.
