Avanthi's Research & Technological Academy: Data Mining Lab

AVANTHI’S RESEARCH & TECHNOLOGICAL ACADEMY
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
DATA MINING LAB

For
B. Tech COMPUTER SCIENCE & ENGINEERING

(Applicable for batches admitted from 2019-2020)
NAME : G. JAYA PRAKASH NARAYANA

ROLL NO : 19HQ1A0516
STREAM : COMPUTER SCIENCE & ENGG.
AVANTHI’S RESEARCH & TECHNOLOGICAL ACADEMY
BASAVAPALEM (V), BHOGAPURAM (M), VIZIANAGARAM DIST., A.P.
JNTUK/REGD.NO 19HQ1A0516
CERTIFICATE
Certified that this is the bona fide record of practical work done by
JAYA PRAKASH NARAYANA G of III-I semester in the DATA MINING LAB
of the Department of COMPUTER SCIENCE & ENGINEERING during the academic
year 2021-2022.
No. of experiments done & certified : 14
Signature of the Laboratory In-charge
Signature of the H.O.D
Signature of the Examiner
1) External:
2) Internal:

CONTENTS
S.NO   Name of the Experiment   Date   Page no.   Marks awarded   Remarks
1.  Implement all basic R commands
2.  Interact data through .csv files (Import from and export to .csv files)
3.  Get and Clean data using swirl exercises. (Use ‘swirl’ package, library and install that topic from swirl)
4.  Visualize all Statistical measures (Mean, Mode, Median, Range, Inter Quartile Range etc., using Histograms, Boxplots and Scatter Plots)
5.  Create a data frame with the following structure
6.  Write R Program using ‘apply’ group of functions to create and apply normalization function on each of the numeric variables/columns of iris dataset
7.  Create a data frame with 10 observations and 3 variables and add new rows and columns to it using ‘rbind’ and ‘cbind’ function
8.  Implement linear and multiple regression on ‘mtcars’ dataset to estimate the value of ‘mpg’ variable, with best R² and plot the original values in ‘green’ and predicted values in ‘red’
9.  Implement k-means clustering using R
10. Implement k-medoids clustering using R
11. Implement density based clustering on iris dataset
12. Implement decision trees using ‘readingSkills’ dataset
13. Implement decision trees using ‘iris’ dataset using package party and ‘rpart’
14. Use a Corpus() function to create a data corpus then Build a term Matrix and Reveal word frequencies

R Programming Language
• R is an open-source programming language that is widely used as
statistical software and as a data analysis tool. R ships with a
command-line interface and is available on widely used platforms such as
Windows, Linux, and macOS.
• It was designed by Ross Ihaka and Robert Gentleman at the University of
Auckland, New Zealand, and is currently developed by the R Core Team.
The R programming language is an implementation of the S programming
language.

Features of R Programming Language

Statistical Features of R:
• Basic Statistics: The most common basic statistics are the mean,
mode, and median. These are known as “Measures of Central
Tendency”, and R makes it easy to compute them.
• Static graphics: R is rich in facilities for creating static graphics.
It supports many plot types, including graphic maps, mosaic plots and
biplots.
• Probability distributions: Probability distributions play a vital role in
statistics, and R handles many of them directly, such as the Binomial
distribution, the Normal distribution and the Chi-squared distribution.
• Data analysis: R provides a large, coherent and integrated collection of
tools for data analysis.
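The statistical features listed above can be tried directly at the R console. The sketch below is a minimal illustration using only base R; the sample vector `x` is made up for the example and is not part of the lab record.

```r
# Hypothetical sample data (not from the lab record)
x <- c(2, 4, 4, 5, 7, 9)

x_mean   <- mean(x)     # arithmetic mean
x_median <- median(x)   # middle value of the sorted data

# Probability distributions: d* functions give a density or mass,
# p* functions give a cumulative probability
p_two_heads <- dbinom(2, size = 4, prob = 0.5)  # P(exactly 2 heads in 4 fair tosses)
p_below_0   <- pnorm(0, mean = 0, sd = 1)       # P(Z <= 0) for a standard normal

print(c(x_mean, x_median, p_two_heads, p_below_0))
```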
Advantages of R:
• R is one of the most comprehensive statistical analysis packages, and
new techniques and concepts often appear first in R.
• R is open source, so you can run it anywhere and at any time.

Experiment- 1
AIM : Implement all basic R commands.
Source Code:
ls( ): The ls() function lists all the objects that are present in the
current environment (workspace).
    vec <- c(1, 2, 3)
    mat <- matrix(1:4, 2)
    # arr <- ...   (the array definition is cut off in the source)
    ls()           # lists the objects defined so far
rm( ): rm() deletes objects from memory. It can be combined with ls( )
to delete all objects.
    list1 <- list(1, 2, 3)   # example object to delete
    rm(list1)
    ls()

beta( ): The beta function can be computed with beta(a, b), where a and b
are non-negative numbers.
• Similarly, lbeta(a, b) returns the natural logarithm of the beta
function. The example below plots the density of the beta distribution
with dbeta().
    x_beta <- seq(0, 1.5, by = 0.025)
    y_beta <- dbeta(x_beta, shape1 = 2, shape2 = 4.5)
    plot(y_beta)
rm(list = ls()): Removes objects from a specified environment. ls()
returns the names of all the objects present in the workspace, so
passing it as the list argument makes rm() remove every object from
the workspace.
    vec <- c(1, 2, 3, 4)
    list1 <- list("number" = c(1, 2, 3), "characters" = c("a", "b", ""))
    mat <- matrix(1:9, 3, 3)
    rm(list = ls())
    ls()   # the workspace is now empty
gamma(x): gamma(x) computes the gamma function for a numeric vector x.
The function is undefined at zero and at negative integers; for such
arguments R returns NaN with a warning, as shown below.
    x1 <- c(2, 3, 5)
    x2 <- c(6, 7, 8)
    x3 <- c(-1, -2, -3)
    gamma(x1)
    gamma(x2)
    gamma(x3)   # NaN for negative integers
choose( ): R offers a built-in function that computes the nCr value
directly, without writing the whole code for the computation.
    answer1 <- choose(3, 2)
    print(answer1)
    # The source also prints answer2 and answer3, but their definitions are missing.
factorial( ): R offers a factorial function that computes the factorial
of a number without writing the whole code for the computation.
    answer1 <- factorial(c(0, 1, 2, 3, 4))
    print(answer1)
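The link between choose() and factorial() can be sketched as follows. The helper name `n_choose_r` is hypothetical and only serves to spell out the formula nCr = n! / (r! (n - r)!); in practice choose() is preferable because it avoids very large intermediate factorials.

```r
# n_choose_r() is a hypothetical helper that writes nCr out with factorial()
n_choose_r <- function(n, r) {
  factorial(n) / (factorial(r) * factorial(n - r))
}

manual   <- n_choose_r(5, 2)   # formula written by hand
built_in <- choose(5, 2)       # built-in equivalent

print(c(manual, built_in))
```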
replace( ): replace(x, list, values) replaces the values of x at the
indices given in list with those given in values; if necessary, the
values are recycled.
    Names <- c("Suresh", "sita", "Anu", "manasa", "Riya", "Ramesh", "Roopa", "Neha")
    Roll_no <- 1:8
    Marks <- c(15, 20, 3, -1, 14, -2, 10, 13)
    Full_marks <- c(20, 10, 44, 21, 24, 36, 20, 13)
    df <- data.frame(Roll_no, Names, Marks, Full_marks)
    print("original DF")
    print(df)
    print("Replaced Value")
    # set every negative mark to 0
    data <- replace(df$Marks, df$Marks < 0, 0)
    print(data)
list(values): Lists are R objects that contain elements of different
types, such as numbers, strings, vectors and even another list. A list
can also contain a matrix or a function as one of its elements. A list
is created using the list() function.
    x <- list(mt = matrix(1:6, nrow = 2), lt = letters[1:8], n = c(1:10))
    cat("Whole List:\n")
    print(x)
round(x, n): Rounds values to a specified number of decimal places
(0 if n is not given).
    x1 <- 1.2
    x2 <- 1.8
    x3 <- -1.3
    x4 <- 1.7
    round(x1)
    round(x2)
    round(x3)
ceiling( ): Returns the smallest integer that is greater than or equal to
the value passed to it as argument.
    # using the ceiling() function
    answer1 <- ceiling(1.2)
    answer4 <- ceiling(-2.6)
    print(answer1)
    print(answer4)
    # The source also prints answer2 and answer3, but their definitions are missing.

floor( ): Returns the largest integer that is smaller than or equal to
the value passed to it as argument.
    answer1 <- floor(1.2)
    answer4 <- floor(-2.6)
    print(answer1)
    print(answer4)
    # The source also prints answer2 and answer3, but their definitions are missing.
all( ) & any( ): The any() and all() functions are handy shortcuts. They
report whether any or all of their arguments are TRUE.
    x <- 1:10
    all(x > 88)
    all(x > 0)
    any(x > 8)
    any(x > 88)

min( ): Calculates the minimum of the elements of a vector, or the
minimum of a particular column.
    x1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
    x2 <- c(4, 2, 8, NA, 11)
    min(x1)
    min(x2, na.rm = FALSE)   # NA, because the missing value is kept
    min(x2, na.rm = TRUE)
    arr <- array(2:13, dim = c(2, 3, 2))
    print(arr)
    min(arr)                 # minimum over all elements of the array
max( ): Finds the maximum element present in an object.
    max(x1)
    max(x2, na.rm = FALSE)
    max(x2, na.rm = TRUE)
    arr <- array(2:13, dim = c(2, 3, 2))
    print(arr)
    max(arr)

sum( ) & mean( ): The sum() and mean() functions compute the sum and
the arithmetic mean of their arguments; prod() computes the product.
    vec <- c(1, 2, 3, 4)
    print("sum of the vector:")
    print(sum(vec))
    print(mean(vec))
    print("product of the vector:")
    print(prod(vec))
rev( ): Returns a reversed version of the data.
    vec <- 1:5
    vec
    vec_rev <- rev(vec)
    vec_rev

table( ): Creates a categorized representation of data, giving each
variable value and its frequency in the form of a table.
    df <- data.frame("Name" = c("abc", "cde", "def"),
                     "Gender" = c("male", "Female", "male"))
    table(df)

Experiment- 2
AIM : Interact data through .csv files (Import from and export
to .csv files).
Source Code:
CSV: CSV is a commonly used file format for spreadsheet data. Even
software that does not look and feel like a spreadsheet application
will frequently offer CSV as an output format for downloading a data
set, such as a report of results, actions, or contacts.
• A comma-separated values (CSV) file is a delimited text file that
uses a comma to separate values. Each line of the file is a data
record. Each record consists of one or more fields, separated by
commas. The use of the comma as a field separator is the source of
the name for this file format.
• Create valid data in an Excel sheet.
• Save the data in a valid location in .csv format.
• Import the CSV file in the R console.
    csv_data <- read.csv(file = "data.csv")
    print(csv_data)
    print(ncol(csv_data))
    print(nrow(csv_data))
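The aim also asks for exporting to .csv. A minimal round-trip sketch with write.csv() and read.csv() is given here; the data frame `students` and its values are made up for illustration, and a temporary file is used so that no existing file is overwritten.

```r
# Hypothetical data frame to export (the values are made up)
students <- data.frame(roll_no = 1:3,
                       name    = c("Anu", "Riya", "Neha"),
                       marks   = c(15, 20, 13))

# Export: row.names = FALSE keeps the row index out of the file
out_file <- tempfile(fileext = ".csv")
write.csv(students, file = out_file, row.names = FALSE)

# Import the file again to confirm the round trip
students_back <- read.csv(out_file)
print(students_back)
```

Replacing `out_file` with a path such as "data.csv" would write the file to the current working directory instead.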

Experiment- 3
AIM : Get and Clean data using swirl exercises. (Use ‘swirl’
package, library and install that topic from swirl).
Source Code:
SWIRL: swirl is a platform for teaching R programming and data
science. However, an educational platform is only as good as
the content it delivers to the students.
• swirl is designed in such a way that you can create your own
interactive content and share it freely with students in your
classroom or around the world.
• The swirlify R package provides a toolbox for swirl instructors.
Its authoring tools guide you through the process of creating
interactive content, so that you can focus on the message you want
to convey to students.
• swirl is a software package for the R programming language that
turns the R console into an interactive learning environment. Users
receive immediate feedback as they are guided through self-paced
lessons in data science and R programming.
Install Swirl Package:
Step-1: Install the swirl package.
    install.packages("swirl")
Step-2: Select a region to continue.
Step-3: The swirl package is installed.
Step-4: Type library(swirl) to load the package.
    library(swirl)
Step-5: Type swirl() when you are ready to begin.
    swirl()
Step-6: Enter the name you want to be called by.
Step-7: Type … to continue
Step-8: Select 1 for the basics of R programming.
Step-9: Select 1 to move to R programming.
Step-10: Select 0 after going through the menu to enter the repository.
Step-11: The console will redirect to GitHub in your browser.
Step-12: Select Getting and Cleaning Data to check the file.

Experiment- 4
AIM : Visualize all Statistical measures (Mean, Mode, Median, Range,
Inter Quartile Range etc., using Histograms, Boxplots and
Scatter Plots).
Source Code:
MEAN:- The mean is an essential concept in mathematics and statistics.
It is the arithmetic average of a collection of numbers: the sum of
the values divided by their count. In statistics, it is a measure of
the central tendency of a probability distribution, along with the
median and the mode. It is also referred to as the expected value.
    data <- iris
    head(data)
    mean(data$Sepal.Length)
MEDIAN:- The median is the central number of a data set once the values
are arranged in order. If there are 2 numbers in the middle, the
median is the average of those 2 numbers.
    median(data$Sepal.Length)
MODE:- The mode is the value that occurs most often. The mode is the
only average that can have no value, one value or more than
one value. When finding the mode, it helps to order the
numbers first.
    tab <- table(data$Sepal.Length)
    sort(tab, decreasing = TRUE)   # the most frequent value comes first
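Base R has no function for the statistical mode (the built-in mode() reports an object's storage mode instead). The frequency-table approach can be wrapped in a small helper; `get_modes` is a hypothetical name used only for this sketch.

```r
# get_modes() is a hypothetical helper built on table()
get_modes <- function(x) {
  counts <- table(x)
  # keep every value whose count equals the highest count
  as.numeric(names(counts)[counts == max(counts)])
}

toy_modes  <- get_modes(c(1, 2, 2, 3, 3))     # a data set with two modes
sepal_mode <- get_modes(iris$Sepal.Length)    # mode of the iris sepal lengths
print(toy_modes)
print(sepal_mode)
```

The helper returns all tied values, which matches the remark that a data set can have more than one mode.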
RANGE:- The range is computed by subtracting the minimum from the
maximum.
    max(data$Sepal.Length) - min(data$Sepal.Length)
INTERQUARTILE RANGE:- The interquartile range (i.e., the difference
between the third and first quartile) can be computed with the
IQR() function.
    IQR(data$Sepal.Length)
HISTOGRAM:
A histogram gives an idea about the distribution of a quantitative
variable. The idea is to break the range of values into intervals and
count how many observations fall into each interval. Histograms are
similar to barplots, but histograms are used for quantitative variables
whereas barplots are used for qualitative variables. To draw a histogram
in R, use hist().
    data <- iris
    hist(data$Sepal.Length)

BOXPLOT:
Boxplots are really useful in descriptive statistics and are often
underused (mostly because they are not well understood by the public). A
boxplot graphically represents the distribution of a quantitative variable
by displaying the five-number summary (minimum, first quartile, median,
third quartile and maximum) and any observation that was classified as
a suspected outlier using the interquartile range (IQR) criterion.
    boxplot(data$Sepal.Length)
SCATTER PLOT:
Scatterplots make it possible to check whether there is a potential link
between two quantitative variables. For this reason, scatterplots are often
used to visualize a potential correlation between two variables. For
instance, we can draw a scatterplot of the length of the sepal against
the length of the petal.
    plot(data$Sepal.Length, data$Petal.Length)

Experiment- 5
AIM : Create a data frame with the following structure.
EMP ID   EMP NAME   SALARY   START DATE
1        Satish     5000     01-11-2013
2        Vani       7500     05-06-2011
3        Ramesh     10000    21-09-1999
4        Praveen    9500     13-09-2005
5        Pallavi    4500     23-10-2000
a. Extract two columns using column names.
b. Extract the first two rows with all columns.
Data Frame :- A data frame is a table or two-dimensional array-like
structure in which each column contains the values of one variable and each
row contains one set of values from each column.
• The column names should be non-empty.
• The row names should be unique.
• The data stored in a data frame can be of numeric, factor or
character type.
• Each column should contain the same number of data items.
• Different columns can hold different types of data: the first column
can be character while the second and third are numeric or logical.

Source Code:
Step-1: Enter the valid data to insert in the data frame.
    df <- data.frame(emp_id = 1:5,
                     emp_name = c("Satish", "Vani", "Ramesh", "Praveen", "Pallavi"),
                     salary = c(5000, 7500, 10000, 9500, 4500),
                     start_date = as.Date(c("2013-11-01", "2011-06-05", "1999-09-21",
                                            "2005-09-13", "2000-10-23")))
    print(df)
(a): Extracting columns using column names.
    result <- data.frame(df$emp_name, df$salary)
    print(result)
(b): Extracting the first two rows with all columns.
    result1 <- df[1:2, ]
    print(result1)
(c): Extract 3rd and 5th row with 2nd and 4th column.
    result2 <- df[c(3, 5), c(2, 4)]
    print(result2)

Experiment- 6
AIM : Write R Program using ‘apply’ group of functions to create and
apply a normalization function on each of the numeric variables
/columns of iris dataset to transform them into:
a. 0 to 1 range with min-max normalization.

b. a value around 0 with z-score normalization.
• The most common reason to normalize variables is that you are
conducting some type of multivariate analysis (i.e. you want to
understand the relationship between several predictor variables and
a response variable) and you want each variable to contribute
equally to the analysis.
• Two common ways to normalize (or “scale”) variables are:
    Min-Max Normalization: (X - min(X)) / (max(X) - min(X))
    Z-Score Standardization: (X - μ) / σ
  where μ is the mean and σ the standard deviation of X.
Source Code:
(a). 0 to 1 range with min-max normalization.
    min_max_norm <- function(x) { (x - min(x)) / (max(x) - min(x)) }
    iris_norm <- as.data.frame(lapply(iris[1:4], min_max_norm))
    head(iris_norm)
(b). a value around 0 with z-score normalization.
    # mean and standard deviation used by the z-score formula
    mean(iris$Sepal.Width)
    sd(iris$Sepal.Width)
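Part (b) can be completed in the same style as part (a). The sketch below applies the z-score formula (X - μ) / σ to each numeric column with lapply(); the function name `z_score_norm` is an assumption chosen to mirror min_max_norm and does not appear in the source.

```r
# z_score_norm is a hypothetical name chosen to mirror min_max_norm
z_score_norm <- function(x) {
  (x - mean(x)) / sd(x)
}

# Apply the function to the four numeric columns of iris
iris_z <- as.data.frame(lapply(iris[1:4], z_score_norm))
head(iris_z)

# After standardization each column has mean ~0 and standard deviation 1
print(sapply(iris_z, mean))
print(sapply(iris_z, sd))
```

The built-in scale() function performs the same centering and scaling and could be used instead of a hand-written function.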

Experiment- 7
AIM : Create a data frame with 10 observations and 3 variables and add
new rows and columns to it using ‘rbind’ and ‘cbind’ function.
Source Code:
Rbind : row-bind
• The name of the rbind R function stands for row-bind. The rbind
function combines several vectors, matrices and/or data frames by rows.
• The deparse.level argument determines how labels are generated for
unnamed arguments. Its default value is 1.
Cbind : column-bind
• A common data manipulation task in R involves merging two data
frames together. One of the simplest ways to do this is with the
cbind function.
• The cbind function, short for column-bind, can be used to combine
two data frames with the same number of rows into a single data frame.
Rbind:-
    Rank <- 1:10
    Country <- c("China", "India", "United States", "Indonesia", "Pakistan", "Brazil",
                 "Nigeria", "Bangladesh", "Russia", "Mexico")
    Population.2019 <- c(1433783686, 1366417754, 329064917, 270625568,
                         216565318, 211049527, 200963599, 163046161, 145872256, 127575529)
    Population.2018 <- c(1427647786, 1352642280, 327096265, 267670543,
                         212228286, 209469323, 195874683, 161376708, 145734038, 126190788)
    Growth.Rate <- c("0.43%", "1.02%", "0.60%", "1.10%", "2.04%", "0.75%", "2.60%",
                     "1.03%", "0.9%", "1.10%")
    Dataframe.Worldpopulation <- data.frame(Rank, Country, Population.2019,
                                            Population.2018, Growth.Rate)
    Dataframe.Worldpopulation
    # new row; its column names must match those of the data frame
    Japan.Population <- data.frame(11, "Japan", 126860301, 127202192, "-0.27%")
    names(Japan.Population) <- c("Rank", "Country", "Population.2019",
                                 "Population.2018", "Growth.Rate")
    Worldpopulation.Newdf <- rbind(Dataframe.Worldpopulation, Japan.Population)
    Worldpopulation.Newdf
    Dataframe.Worldpopulation
Cbind:-
    # new column with one value per existing row
    Area.km.square <- c(9706961, 3287590, 9372610, 1904569, 881912, 8515767,
                        923768, 147570, 17098242, 1964375)
    Dataframe.Worldpopulation <- cbind(Dataframe.Worldpopulation, Area.km.square)
    Dataframe.Worldpopulation
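The aim asks for a data frame with 10 observations and 3 variables. A compact sketch of that exact shape is given here; the data frame `students` and all of its values are hypothetical and only illustrate how rbind() and cbind() change the dimensions.

```r
# Hypothetical data: 10 observations of 3 variables
students <- data.frame(id    = 1:10,
                       marks = c(15, 20, 3, 11, 14, 9, 10, 13, 18, 7),
                       grade = rep(c("A", "B"), times = 5))

# rbind(): append one new row (same column names as the data frame)
new_row  <- data.frame(id = 11, marks = 16, grade = "A")
students <- rbind(students, new_row)

# cbind(): append one new column (one value per row)
passed   <- students$marks >= 10
students <- cbind(students, passed)

print(dim(students))   # rows, then columns
```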

Experiment- 8
AIM : Write R program to implement linear and multiple regression on
‘mtcars’ dataset to estimate the value of ‘mpg’ variable, with best R²
and plot the original values in ‘green’ and predicted values in ‘red’.
Source Code:
• The built-in mtcars data frame contains information about 32 cars,
including their weight, fuel efficiency (in miles per gallon),
horsepower, etc. (To find out more about the dataset, use help(mtcars).)
• A plot of mpg against wt shows a roughly linear relationship. To
determine the coefficients of a linear model, we use the lm function.
    # original values in green
    plot(mpg ~ wt, data = mtcars, col = "green")
    fit <- lm(mpg ~ wt, data = mtcars)
    summary(fit)
    # fitted (predicted) line in red
    abline(fit, col = "red", lwd = 2)
    # label the plot with the fitted equation
    bs <- round(coef(fit), 3)
    lmlab <- paste0("mpg = ", bs[1], ifelse(sign(bs[2]) == 1, " + ", " - "),
                    abs(bs[2]), " wt ")
    mtext(lmlab, 3, line = -2)
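The aim also calls for multiple regression and a comparison by R². A possible sketch follows; the choice of predictors (wt, hp and cyl) is an assumption made for illustration and is not taken from the source.

```r
# Simple vs. multiple regression on mtcars.
# The predictor set wt + hp + cyl is an assumption, not part of the lab record.
fit_simple   <- lm(mpg ~ wt, data = mtcars)
fit_multiple <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# R-squared of both models; adding predictors never lowers it for nested models
r2_simple   <- summary(fit_simple)$r.squared
r2_multiple <- summary(fit_multiple)$r.squared
print(c(simple = r2_simple, multiple = r2_multiple))

# Original mpg values in green, predictions of the multiple model in red
predicted <- predict(fit_multiple, newdata = mtcars)
plot(mtcars$mpg, col = "green", pch = 19,
     xlab = "Car index", ylab = "mpg", main = "Original vs predicted mpg")
points(predicted, col = "red", pch = 19)
legend("topright", legend = c("Original", "Predicted"),
       col = c("green", "red"), pch = 19)
```

Because plain R² cannot decrease when predictors are added, summary(fit)$adj.r.squared is often the fairer figure when choosing between models of different size.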

Experiment- 9
AIM : Implement k-means clustering using R.
Source Code: K-means
• K-means clustering is an unsupervised, non-linear algorithm that
groups data based on similarity.
• It seeks to partition the observations into a pre-specified number of
clusters. Each training example is assigned to a segment called a
cluster.
• Unsupervised algorithms rely heavily on raw data, and reviewing the
relevance of the resulting clusters typically requires considerable
manual effort. K-means is used in a variety of fields such as banking,
healthcare, retail and media.
    data(iris)
    str(iris)
    library(cluster)
    # drop the Species column; k-means needs numeric data only
    iris_1 <- iris[, -5]
    set.seed(240)
    kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)
    kmeans.re
    kmeans.re$cluster
    # confusion matrix of species against cluster labels
    cm <- table(iris$Species, kmeans.re$cluster)
    cm
    plot(iris_1[c("Sepal.Length", "Sepal.Width")], col = kmeans.re$cluster)
    plot(iris_1[c("Sepal.Length", "Sepal.Width")], col = kmeans.re$cluster,
         main = "K-means with 3 clusters")
    kmeans.re$centers
    kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")]
    # mark the cluster centres on the plot
    points(kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3,
           pch = 8, cex = 3)
    y_kmeans <- kmeans.re$cluster
    clusplot(iris_1[, c("Sepal.Length", "Sepal.Width")], y_kmeans, lines = 0,
             shade = TRUE, color = TRUE, labels = 2, plotchar = FALSE, span = TRUE,
             main = paste("Cluster iris"), xlab = 'Sepal.Length',
             ylab = 'Sepal.Width')
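Since k-means needs the number of clusters in advance, a common heuristic is to compare the total within-cluster sum of squares for several values of k (the "elbow" method). The sketch below is one way to do this with base R; the range 1 to 6 is an arbitrary choice made for illustration.

```r
# Total within-cluster sum of squares for k = 1..6 (the range is an arbitrary choice)
iris_num <- iris[, -5]
set.seed(240)
wss <- sapply(1:6, function(k) {
  kmeans(iris_num, centers = k, nstart = 20)$tot.withinss
})

# A bend ("elbow") in this curve suggests a reasonable number of clusters
plot(1:6, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
print(round(wss, 2))
```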

Experiment- 10
AIM : Implement k-medoids clustering using R.
Source Code: k-medoids
• The K-Medoids algorithm (also called Partitioning Around Medoids, PAM)
was proposed in 1987 by Kaufman and Rousseeuw. A medoid can be
defined as the point in a cluster whose total dissimilarity to all
the other points in the cluster is minimal.
    set.seed(1234)
    # 24 simulated points scattered around three centres
    x <- rnorm(24, mean = rep(1:3, each = 4), sd = 0.2)
    y <- rnorm(24, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
    data <- data.frame(x, y)
    plot(x, y, col = "blue", pch = 19, cex = 1)
    text(x + 0.05, y + 0.05, labels = as.character(1:24))
    library(cluster)
    kmedoidObj <- pam(x = data, k = 4)
    names(kmedoidObj)
    kmedoidObj$objective
    par(mfrow = c(2, 2), mar = c(3, 3, 3, 3))
    for (i in 1:4) {
      plot(x, y, col = kmedoidObj$clustering, pch = 19, cex = 1)
      points(kmedoidObj$medoids, col = 1:4, pch = 4, cex = 3, lwd = 3)
    }
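The definition of a medoid can be checked by hand for a single cluster. The sketch below uses a made-up set of five points (not from the source) and picks the point with the smallest total distance to all the others.

```r
# Hypothetical toy cluster of five 2-D points
pts <- data.frame(x = c(1, 2, 3, 2, 10),
                  y = c(1, 2, 1, 1, 10))

# Pairwise Euclidean distances as a full matrix
d <- as.matrix(dist(pts))

# The medoid is the point with the smallest total distance to the others
total_dist   <- rowSums(d)
medoid_index <- which.min(total_dist)
print(pts[medoid_index, ])
```

Unlike a k-means centre, which is an average and may not coincide with any observation, the medoid is always one of the actual data points.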

Experiment- 11
AIM : Implement density based clustering on iris dataset
Source Code:
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
is an unsupervised, non-linear learning algorithm. It uses the
ideas of density reachability and density connectivity.
• The data is partitioned into groups with similar characteristics, or
clusters, but the number of those groups does not have to be specified
in advance. A cluster is defined as a maximal set of density-connected
points. The algorithm discovers clusters of arbitrary shape in spatial
databases with noise.
    data(iris)
    str(iris)
    install.packages("fpc")
    library(fpc)
    # drop the Species column
    iris_1 <- iris[-5]
    set.seed(220)
    Dbscan_cl <- dbscan(iris_1, eps = 0.45, MinPts = 5)
    Dbscan_cl
    Dbscan_cl$cluster
    # compare the clusters with the actual species
    table(Dbscan_cl$cluster, iris$Species)
    plot(Dbscan_cl, iris_1, main = "DBScan")
    plot(Dbscan_cl, iris_1, main = "Petal Width vs Sepal Length")
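The notion of density behind DBSCAN can be illustrated without any extra package. The sketch below counts, for each observation, how many observations lie within eps of it and flags "core points" that reach MinPts. It assumes the convention that a point counts as its own neighbour; the exact counting rule inside fpc's dbscan() may differ, so this is an illustration of the idea rather than a reimplementation.

```r
# Core points under eps = 0.45 and MinPts = 5, using base R only.
# Assumption: each point is counted as part of its own eps-neighbourhood.
iris_num <- as.matrix(iris[, -5])
eps      <- 0.45
min_pts  <- 5

# Number of observations within eps of each observation
d          <- as.matrix(dist(iris_num))
neighbours <- rowSums(d <= eps)

# A core point has at least min_pts observations in its eps-neighbourhood
is_core <- neighbours >= min_pts
print(table(is_core))
```

Points that are neither core points nor within eps of a core point are the ones DBSCAN labels as noise.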
Experiment- 12
AIM : Implement decision trees using ‘readingSkills’ dataset.
Source Code:
Decision Trees: Decision trees are supervised machine learning
algorithms that can perform both regression and classification tasks.
A decision tree is characterized by nodes and branches: the tests on
each attribute are represented at the nodes, the outcomes of those
tests are represented at the branches, and the class labels are
represented at the leaf nodes.
• A decision tree therefore uses a tree-like model of decisions to
compute their probable outcomes. Tree-based algorithms are widely
used because they are easy to interpret and use.
• Let us now examine this concept with the help of an example, which
in this case uses the “readingSkills” dataset from the party package,
by visualizing a decision tree for it.
    library(datasets)
    library(caTools)
    library(party)
    library(dplyr)
    library(magrittr)
    data("readingSkills")
    head(readingSkills)
    # split on the label vector so that about 80% of the rows go to training
    sample_data <- sample.split(readingSkills$nativeSpeaker, SplitRatio = 0.8)
    train_data <- subset(readingSkills, sample_data == TRUE)
    test_data <- subset(readingSkills, sample_data == FALSE)
    # conditional inference tree predicting nativeSpeaker from all other columns
    model <- ctree(nativeSpeaker ~ ., train_data)
    plot(model)

Experiment- 13
AIM : Implement decision trees using ‘iris’ dataset using package party
and ‘rpart’
Source Code:
DECISION TREES WITH PACKAGE PARTY :
• The party package is a computational toolbox for recursive
partitioning. The core of the package is ctree(), an implementation of
conditional inference trees which embed tree-structured regression
models into a well-defined theory of conditional inference procedures.
• This non-parametric class of regression trees is applicable to all
kinds of regression problems, including nominal, ordinal, numeric,
censored as well as multivariate response variables and arbitrary
measurement scales of the covariates.
• Based on conditional inference trees, cforest() provides an
implementation of Breiman's random forests. The function mob()
implements an algorithm for recursive partitioning based on
parametric models.
    str(iris)
    set.seed(1234)
    # random split: about 70% training rows, 30% test rows
    ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
    trainData <- iris[ind == 1, ]
    testData <- iris[ind == 2, ]
    library(party)
    myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
    iris_ctree <- ctree(myFormula, data = trainData)
    # predictions on the training data against the actual species
    table(predict(iris_ctree), trainData$Species)
    print(iris_ctree)
    plot(iris_ctree)
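The aim also names the ‘rpart’ package. A minimal sketch of the same experiment with rpart() is given below; it reuses the 70/30 split idea from the ctree() example, and the evaluation on the held-out rows is an addition for illustration rather than part of the original record.

```r
# Decision tree on iris with rpart, using the same kind of 70/30 split as above
library(rpart)

set.seed(1234)
ind       <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
trainData <- iris[ind == 1, ]
testData  <- iris[ind == 2, ]

# Classification tree: Species explained by all four measurements
iris_rpart <- rpart(Species ~ ., data = trainData, method = "class")
print(iris_rpart)

# Draw the tree with base graphics
plot(iris_rpart, margin = 0.1)
text(iris_rpart, use.n = TRUE)

# Confusion matrix and accuracy on the held-out rows
pred     <- predict(iris_rpart, newdata = testData, type = "class")
conf_mat <- table(predicted = pred, actual = testData$Species)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```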

