Avanthi's Research & Technological Academy: Data Mining Lab
ACADEMY
JNTUK/REGD.NO 19HQ1A0516
CERTIFICATE
Certified that this is a bona fide record of practical work done by
year 2021-2022.
Signature of the
1)External:
EXPERIMENT – 1: Implement all basic R commands
EXPERIMENT – 2: Interact data through .csv files (Import from and export to .csv files)
EXPERIMENT – 3: Get and Clean data using swirl exercises. (Use ‘swirl’ package, library and install that topic from swirl)
EXPERIMENT – 4: Visualize all Statistical measures (Mean, Mode, Median, Range, Inter Quartile Range etc., using Histograms, Boxplots and Scatter Plots)
EXPERIMENT – 5: Create a data frame with the following structure
EXPERIMENT – 6: Write R Program using ‘apply’ group of functions to create and apply normalization function on each of the numeric variables/columns of iris dataset
EXPERIMENT – 7: Create a data frame with 10 observations and 3 variables and add new rows and columns to it using ‘rbind’ and ‘cbind’ function
EXPERIMENT – 8: Implement linear and multiple regression on ‘mtcars’ dataset to estimate the value of ‘mpg’ variable, with best R² and plot the original values in ‘green’ and predicted values in ‘red’
EXPERIMENT – 9: Implement k-means clustering using R
EXPERIMENT – 10: Implement k-medoids clustering using R
EXPERIMENT – 11: Implement density based clustering on iris dataset
EXPERIMENT – 12: Implement decision trees using ‘readingSkills’ dataset
EXPERIMENT – 13: Implement decision trees using ‘iris’ dataset using package party and ‘rpart’
EXPERIMENT – 14: Use a Corpus() function to create a data corpus then Build a term Matrix and Reveal word frequencies
R Programming Language
R is an open-source programming language that is widely used as
statistical software and as a data analysis tool. R generally comes with a
command-line interface and is available on widely used platforms such as
Windows, Linux, and macOS.
It was designed by Ross Ihaka and Robert Gentleman at the University of
Auckland, New Zealand, and is currently developed by the R Development
Core Team. R programming language is an implementation of the S
programming language.
Basic Statistics: The most common basic statistics terms are the mean,
mode, and median. These are all known as “Measures of Central
Tendency.” So using the R language we can measure central tendency very
easily.
Static graphics: R is rich with facilities for creating and developing
interesting static graphics. R contains functionality for many plot types
including graphic maps, mosaic plots, biplots, and the list goes on.
Probability distributions: Probability distributions play a vital role in
statistics and by using R we can easily handle various types of probability
distribution such as Binomial Distribution, Normal Distribution, Chi-
squared Distribution and many more.
Data analysis: It provides a large, coherent and integrated collection of
tools for data analysis.
Advantages of R:
• R is one of the most comprehensive statistical analysis packages, and new
technology and concepts often appear first in R.
• R is open source, so you can run R anywhere and at any time.
Experiment- 1
AIM : Implement all basic R commands.
Source Code:
mat <- matrix(1:9, 3, 3)
rm(list = ls())
ls()
answer1 <- choose(3, 2)
answer2 <- choose(3, 7)
answer3 <- choose(7, 3)
print(answer1)
print(answer2)
print(answer3)
replace(): It replaces the values in x at the indices given in list with those
given in values; if necessary, the values are recycled.
Names <- c("Suresh", "sita", "Anu", "manasa", "Riya", "Ramesh", "Roopa", "Neha")
Roll_no <- 1:8
Marks <- c(15, 20, 3, -1, 14, -2, 10, 13)
Full_marks <- c(20, 10, 44, 21, 24, 36, 20, 13)
df <- data.frame(Roll_no, Names, Marks, Full_marks)
print("original DF")
print(df)
NAME : JAYA PRAKASH NARAYANA G ROLL : 19HQ1A0516
AVANTHI’S RESEARCH &TECHNOLOGICAL
ACADEMY
print("Replaced Value")
data <- replace(df$Marks, df$Marks < 0, 0)
print(data)
x1 <- 1.2
x2 <- 1.8
x3 <- -1.3
x4 <- 1.7
round(x1)
round(x2)
round(x3)
all() & any(): The any() and all() functions are handy shortcuts. They
report whether any or all of their arguments are TRUE.
x <- 1:10
all(x > 88)
all(x > 0)
any(x > 8)
any(x > 88)
x1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
x2 <- c(4, 2, 8, NA, 11)
min(x1)
min(x2, na.rm = FALSE)  # NA, because x2 contains a missing value
min(x2, na.rm = TRUE)
arr <- array(2:13, dim = c(2, 3, 2))
print(arr)
max(x1)
max(x2, na.rm = FALSE)  # NA, because x2 contains a missing value
max(x2, na.rm = TRUE)
max(arr)
sum() & mean(): The sum() and mean() functions are available in R to
compute the sum and the arithmetic mean of the values passed
as arguments.
vec <- c(1, 2, 3, 4)
print("sum of the vector:")
print(sum(vec))
print(mean(vec))
print("product of the vector:")
print(prod(vec))
vec <- 1:5
vec
vec_rev <- rev(vec)
vec_rev
Experiment- 2
AIM : Interact data through .csv files (Import from and export
to .csv files).
Source Code:
csv_data <- read.csv(file = "data.csv")
print(csv_data)
print(ncol(csv_data))
print(nrow(csv_data))
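The aim also asks for exporting to a .csv file, which the code above does not show. A minimal sketch of a write-and-read round trip follows; the data frame students and the temporary file path are assumptions made for illustration and are not part of the original record.

```r
# Hypothetical data frame used only for this illustration
students <- data.frame(
  roll_no = 1:3,
  name = c("Anu", "Riya", "Neha"),
  marks = c(15, 14, 13)
)

# Export to a temporary file so nothing in the working directory is overwritten
csv_path <- tempfile(fileext = ".csv")
write.csv(students, file = csv_path, row.names = FALSE)

# Import the file again and inspect it
students_back <- read.csv(csv_path)
print(students_back)
print(ncol(students_back))
print(nrow(students_back))
```

Passing row.names = FALSE keeps write.csv() from adding an extra column of row numbers, so the imported data frame has the same columns as the exported one.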
Experiment- 3
AIM : Get and Clean data using swirl exercises. (Use ‘swirl’
package, library and install that topic from swirl).
Source Code:
Swirl is designed in such a way that you can create your own
interactive content and share it freely with students in your
classroom or around the world.
install.packages("swirl")
library(swirl)
Experiment- 4
AIM : Visualize all Statistical measures (Mean, Mode, Median, Range,
Inter Quartile Range etc., using Histograms, Boxplots and
Scatter Plots).
Source Code:
data <- iris
head(data)
mean(data$Sepal.Length)
MEDIAN:- The median is the central number of a data set once the values
are sorted. If there are 2 numbers in the middle, the median is
the average of those 2 numbers.
median(data$Sepal.Length)
MODE:- The mode is the value that occurs most often. The mode is the
only average that can have no value, one value or more than
one value. When finding the mode, it helps to order the
numbers first.
tab <- table(data$Sepal.Length)
sort(tab, decreasing = TRUE)
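Base R has no built-in function for the statistical mode (the function named mode() reports the storage type of an object instead). The frequency-table approach above can be wrapped in a small helper. The name stat_mode is hypothetical, and this is only a sketch that returns every value tied for the highest count.

```r
# Hypothetical helper: returns all values that share the highest frequency
stat_mode <- function(x) {
  counts <- table(x)
  modes <- names(counts)[as.vector(counts) == max(counts)]
  # table() stores the values as character names; convert back for numeric input
  if (is.numeric(x)) as.numeric(modes) else modes
}

stat_mode(c(1, 2, 2, 3))       # a single mode
stat_mode(c(1, 1, 2, 2, 3))    # two values tied for the highest count
stat_mode(iris$Sepal.Length)
```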
RANGE:- The range can then be easily computed, as you have guessed,
by subtracting the minimum from the maximum.
max(data$Sepal.Length) - min(data$Sepal.Length)
IQR(data$Sepal.Length)
HISTOGRAM:
A histogram gives an idea about the distribution of a quantitative
variable. The idea is to break the range of values into intervals and
count how many observations fall into each interval. Histograms are a
bit similar to barplots, but histograms are used for quantitative variables
whereas barplots are used for qualitative variables. To draw a histogram
in R, use hist():
data <- iris
hist(data$Sepal.Length)
BOXPLOT:
Boxplots are really useful in descriptive statistics and are often
underused (mostly because they are not well understood by the public). A
boxplot graphically represents the distribution of a quantitative variable
by visually displaying five common location summaries (minimum,
median, first/third quartiles and maximum) and any observation that
was classified as a suspected outlier using the interquartile range (IQR)
criterion.
boxplot(data$Sepal.Length)
SCATTER PLOT:
Scatterplots make it possible to check whether there is a potential link
between two quantitative variables. For this reason, scatterplots are often
used to visualize a potential correlation between two variables, for
instance between the length of the sepal and the length of the petal.
plot(data$Sepal.Length, data$Petal.Length)
Experiment- 5
AIM : Create a data frame with the following structure.
Source Code:
result <- data.frame(df$emp_name, df$salary)
print(result)
result1 <- df[1:2]
print(result1)
(c): Extract 3rd and 5th row with 2nd and 4th column.
result2 <- df[c(3, 5), c(2, 4)]
print(result2)
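The structure referred to in the aim is not reproduced in this record, so the data frame df used above is undefined. The sketch below assumes a small employee table so that the extraction steps can be run. Only the column names emp_name and salary come from the code above; emp_id, start_date and all the values are hypothetical.

```r
# Hypothetical employee data; only emp_name and salary are named in the source
df <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.30, 515.20, 611.00, 729.00, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)
str(df)

# Two named columns
result <- data.frame(df$emp_name, df$salary)
print(result)

# 3rd and 5th row with 2nd and 4th column
result2 <- df[c(3, 5), c(2, 4)]
print(result2)
```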
Experiment- 6
AIM : Write R Program using ‘apply’ group of functions to create and
apply a normalization function on each of the numeric variables
/columns of iris dataset to transform them into:
Source Code:
(a). 0 to 1 range with min-max normalization.
min_max_norm <- function(x) { (x - min(x)) / (max(x) - min(x)) }
iris_norm <- as.data.frame(lapply(iris[1:4], min_max_norm))
head(iris_norm)
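Only part (a) is listed here. A common companion to min-max scaling is z-score normalization, which centres each column at 0 with unit standard deviation. The sketch below assumes that is the intended second transformation; the names z_score_norm and iris_z are hypothetical.

```r
# z-score normalization: subtract the column mean, divide by the column standard deviation
z_score_norm <- function(x) { (x - mean(x)) / sd(x) }

# Apply the function to the four numeric columns of iris
iris_z <- as.data.frame(lapply(iris[1:4], z_score_norm))
head(iris_z)

# After the transformation every column has mean ~0 and standard deviation 1
round(sapply(iris_z, mean), 10)
sapply(iris_z, sd)
```

Here lapply() returns a list of transformed columns, while sapply() simplifies the per-column summaries to a named vector.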
Experiment- 7
AIM : Create a data frame with 10 observations and 3 variables and add
new rows and columns to it using ‘rbind’ and ‘cbind’ function.
Source Code:
rbind : row-bind
The name of the rbind R function stands for row-bind. The rbind()
function combines several vectors, matrices and/or data frames by
rows. Its deparse.level argument controls how labels are constructed
for non-matrix-like arguments; the default value of deparse.level is 1.
cbind : column-bind
A common data manipulation task in R involves merging two data
frames together. One of the simplest ways to do this is with the
cbind function.
The cbind function, short for column bind, combines two data frames
with the same number of rows into a single data frame.
rbind:-
Rank <- 1:10
Country <- c("China", "India", "United States", "Indonesia", "Pakistan", "Brazil",
"Nigeria", "Bangladesh", "Russia", "Mexico")
Population.2019 <- c(1433783686, 1366417754, 329064917, 270625568,
216565318, 211049527, 200963599, 163046161, 145872256, 127575529)
Population.2018 <- c(147647786, 1352642280, 32709265, 267670543,
212228286, 209469323, 195874683, 161376708, 145734038, 126190788)
Growth.Rate <- c("0.43%", "1.02%", "0.60%", "1.10%", "2.04%", "0.75%", "2.60%",
"1.03%", "0.9%", "1.10%")
DataFrame.WorldPopulation <- data.frame(Rank, Country, Population.2019,
Population.2018, Growth.Rate)
DataFrame.WorldPopulation
Japan.Population <- data.frame(11, "Japan", 126860301, 12702192, "-0.27%")
names(Japan.Population) <- c("Rank", "Country", "Population.2019",
"Population.2018", "Growth.Rate")
WorldPopulation.Newdf <- rbind(DataFrame.WorldPopulation,
Japan.Population)
WorldPopulation.Newdf
DataFrame.WorldPopulation
cbind:-
DataFrame.WorldPopulation <- cbind(DataFrame.WorldPopulation,
Area.km.square = c(9706961, 3287590, 9372610, 1904569, 881912,
8515767, 923768, 147570, 17098242, 1964375))
DataFrame.WorldPopulation
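The aim asks for a data frame with 10 observations and 3 variables. A compact sketch of that exact shape is given below; the data frame students and its values are hypothetical and serve only to show how the dimensions change after each call.

```r
# Hypothetical data frame: 10 observations, 3 variables
students <- data.frame(
  roll_no = 1:10,
  name = paste0("student_", 1:10),
  marks = c(15, 20, 3, 12, 14, 18, 10, 13, 9, 17)
)

# rbind: the new row must use the same column names as the existing data frame
new_row <- data.frame(roll_no = 11, name = "student_11", marks = 16)
students_more_rows <- rbind(students, new_row)

# cbind: the new column must have one value per existing row
full_marks <- rep(20, nrow(students_more_rows))
students_more_cols <- cbind(students_more_rows, full_marks)

dim(students)            # rows and columns before
dim(students_more_rows)  # one more row
dim(students_more_cols)  # one more column
```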
Experiment- 8
AIM : Write R program to implement linear and multiple regression on
‘mtcars’ dataset to estimate the value of ‘mpg’ variable, with best R2
and plot the original values in ‘green’ and predicted values in ‘red’.
Source Code:
The built-in mtcars data frame contains information about 32 cars,
including their weight, fuel efficiency (in miles per gallon), speed,
etc. (To find out more about the dataset, use help(mtcars).)
plot(mtcars$wt, mtcars$mpg)
The plot shows a (linear) relationship. To perform linear regression
and determine the coefficients of a linear model, we use the lm
function.
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = 3, lwd = 2)
bs <- round(coef(fit), 3)
lmlab <- paste0("mpg = ", bs[1], ifelse(sign(bs[2]) == 1, " + ", " - "), abs(bs[2]),
" wt ")
mtext(lmlab, 3, line = -2)
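The aim also calls for multiple regression, a comparison of R², and a plot of original values in green against predicted values in red, none of which appear in the code above. The following is a sketch under stated assumptions: the predictors wt, hp and cyl are chosen only for illustration, and the names fit_simple, fit_multi and best_fit are hypothetical.

```r
# Simple linear regression: one predictor
fit_simple <- lm(mpg ~ wt, data = mtcars)

# Multiple regression: predictors picked for illustration only
fit_multi <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Compare the coefficient of determination of both models
r2_simple <- summary(fit_simple)$r.squared
r2_multi <- summary(fit_multi)$r.squared
print(c(simple = r2_simple, multiple = r2_multi))

# Keep the model with the higher R-squared and predict mpg for every car
best_fit <- if (r2_multi > r2_simple) fit_multi else fit_simple
predicted <- predict(best_fit, newdata = mtcars)

# Original values in green, predicted values in red
plot(mtcars$mpg, col = "green", pch = 19, xlab = "Car index", ylab = "mpg",
     ylim = range(c(mtcars$mpg, predicted)))
points(predicted, col = "red", pch = 19)
legend("topright", legend = c("original", "predicted"),
       col = c("green", "red"), pch = 19)
```

Plain R² never decreases when predictors are added, so summary(fit)$adj.r.squared may be the fairer measure when models of different size are compared.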
Experiment- 9
AIM : Implement k-means clustering using R.
Source Code: K-means
K-means clustering in R is an unsupervised, non-linear algorithm
that clusters data based on similarity.
It seeks to partition the observations into a pre-specified number of
clusters. Segmentation of data takes place to assign each training
example to a segment called a cluster.
Unsupervised algorithms rely heavily on raw data, and reviewing
the relevance of the results can require considerable manual
effort. K-means is used in a variety of fields such as banking,
healthcare, retail and media.
data(iris)
str(iris)
library(cluster)
iris_1 <- iris[, -5]
set.seed(240)
kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)
kmeans.re
kmeans.re$cluster
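One way to judge the clustering is to cross-tabulate the cluster labels against the known species and to plot the clusters with their centres. The sketch below repeats the fitting step so that it runs on its own; the choice of the two sepal columns for the plot and the name cm are assumptions made for illustration.

```r
# Fit k-means on the four numeric columns, as in the code above
iris_1 <- iris[, -5]
set.seed(240)
kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)

# Cross-tabulate species against cluster labels
cm <- table(iris$Species, kmeans.re$cluster)
print(cm)

# Plot two of the four dimensions, coloured by cluster
plot(iris_1[c("Sepal.Length", "Sepal.Width")], col = kmeans.re$cluster,
     main = "K-means with 3 clusters")

# Mark the cluster centres
points(kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 3)
```

The cluster numbers are arbitrary labels, so the table should be read by looking at which column dominates each species row, not by matching cluster 1 to the first species.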
Experiment- 10
AIM : Implement k-medoids clustering using R.
Source Code: k-medoids
The K-Medoids algorithm (also called Partitioning Around Medoids,
PAM) was proposed in 1987 by Kaufman and Rousseeuw. A
medoid can be defined as the point in the cluster whose total
dissimilarity to all the other points in the cluster is minimal.
set.seed(1234)
x <- rnorm(24, mean=rep(1:3, each=4), sd=0.2)
y <- rnorm(24, mean=rep(c(1,2,1), each=4), sd=0.2)
data <- data.frame(x, y)
plot(x, y, col="blue", pch=19, cex=1)
text(x+0.05, y+0.05, labels=as.character(1:24))
library(cluster)
kmedoidObj <- pam(x=data, k=4)
names(kmedoidObj)
kmedoidObj$objective
par(mfrow=c(2,2), mar=c(3,3,3,3))
for(i in 1:4){
plot(x, y, col=kmedoidObj$clustering, pch=19, cex=1)
points(kmedoidObj$medoids, col=1:4, pch=4, cex=3, lwd=3)
}
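The definition of a medoid given earlier can be checked by hand on a single cluster: compute all pairwise distances, sum them per point, and take the point with the smallest total. The sketch below uses a small hypothetical point set pts and compares the result with pam() for k = 1; Euclidean distance is assumed as the dissimilarity.

```r
library(cluster)

# Hypothetical single cluster of 8 points
set.seed(1234)
pts <- data.frame(x = rnorm(8, mean = 1, sd = 0.2),
                  y = rnorm(8, mean = 1, sd = 0.2))

# Pairwise Euclidean distances as a full matrix
d <- as.matrix(dist(pts))

# Total dissimilarity of each point to all the others
total_dissimilarity <- rowSums(d)

# The medoid is the point with the smallest total
medoid_index <- which.min(total_dissimilarity)
pts[medoid_index, ]

# pam() with one cluster should report the same observation as its medoid
pam_one <- pam(pts, k = 1)
pam_one$id.med
```

Unlike a k-means centre, the medoid is always one of the observed points, which is why it is reported as a row index.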
Experiment- 11
AIM : Implement density based clustering on iris dataset
Source Code:
data(iris)
str(iris)
install.packages("fpc")
library(fpc)
iris_1 <- iris[-5]
set.seed(220)
Dbscan_cl <- dbscan(iris_1, eps = 0.45, MinPts = 5)
Dbscan_cl
Dbscan_cl$cluster
table(Dbscan_cl$cluster, iris$Species)
plot(Dbscan_cl, iris_1, main = "DBScan")
plot(Dbscan_cl, iris_1, main = "Petal Width vs Sepal Length")
Experiment- 12
AIM : Implement decision trees using ‘readingSkills’ dataset.
Source Code:
library(datasets)
library(caTools)
library(party)
library(dplyr)
library(magrittr)
data("readingSkills")
head(readingSkills)
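The code above loads the data but stops before a tree is built. A possible continuation is sketched below, assuming the party package is installed and that an 80/20 train/test split is acceptable. The split ratio, the seed and the names train_data, test_data and tree_model are assumptions, not part of the original record.

```r
library(party)
data("readingSkills")

# Hypothetical 80/20 split into training and test rows
set.seed(1234)
train_idx <- sample(nrow(readingSkills), size = round(0.8 * nrow(readingSkills)))
train_data <- readingSkills[train_idx, ]
test_data <- readingSkills[-train_idx, ]

# Conditional inference tree predicting nativeSpeaker from the other columns
tree_model <- ctree(nativeSpeaker ~ age + shoeSize + score, data = train_data)
plot(tree_model)

# Evaluate on the held-out rows
predicted <- predict(tree_model, newdata = test_data)
confusion <- table(actual = test_data$nativeSpeaker, predicted = predicted)
print(confusion)
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```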
Experiment- 13
AIM : Implement decision trees using ‘iris’ dataset using package party
and ‘rpart’
Source Code:
str(iris)
set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
trainData <- iris[ind==1,]
testData <- iris[ind==2,]
library(party)
iris_ctree <- ctree(Species ~ ., data = trainData)
print(iris_ctree)
plot(iris_ctree)
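The aim also names the rpart package, which the code above does not use. A sketch of the same experiment with rpart follows. It repeats the split so that it runs on its own; the names iris_rpart, pred and confusion are hypothetical.

```r
library(rpart)

# Same split as above: roughly 70% training, 30% test
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
trainData <- iris[ind == 1, ]
testData <- iris[ind == 2, ]

# Classification tree on all four measurements
iris_rpart <- rpart(Species ~ ., data = trainData, method = "class")
print(iris_rpart)

# Draw the tree and label the nodes
plot(iris_rpart, margin = 0.1)
text(iris_rpart, use.n = TRUE)

# Predict the species of the test rows and compare with the true labels
pred <- predict(iris_rpart, newdata = testData, type = "class")
confusion <- table(actual = testData$Species, predicted = pred)
print(confusion)
```

The two packages build their trees differently: ctree() chooses splits with statistical significance tests, while rpart() follows the CART approach.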