Avanthi's Research & Technological Academy: Data Mining Lab

AVANTHI’S RESEARCH & TECHNOLOGICAL ACADEMY

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

DATA MINING LAB

For B. Tech COMPUTER SCIENCE & ENGINEERING
(Applicable for batches admitted from 2019-2020)

NAME : G. JAYA PRAKASH NARAYANA
ROLL NO : 19HQ1A0516
STREAM : COMPUTER SCIENCE & ENGG.

NAME : JAYA PRAKASH NARAYANA G ROLL : 19HQ1A0516


AVANTHI’S RESEARCH & TECHNOLOGICAL ACADEMY
BASAVAPALEM (V), BHOGAPURAM (M), VIZIANAGARAM DIST., A.P.

JNTUK/REGD.NO 19HQ1A0516

CERTIFICATE
Certified that this is a bonafide record of practical work done by JAYA PRAKASH NARAYANA G of III-I semester in the DATA MINING LAB of the department of COMPUTER SCIENCE & ENGINEERING during the academic year 2021-2022.

No. of experiments done & certified : 14

Signature of the Laboratory In-charge

Signature of the examiner
1) External:
2) Internal:

Signature of the H.O.D

CONTENTS
S.NO   Name of the Experiment   Date   Page no.   Marks awarded   Remarks

1. EXPERIMENT - 1: Implement all basic R commands
2. EXPERIMENT - 2: Interact data through .csv files (Import from and export to .csv files)
3. EXPERIMENT - 3: Get and Clean data using swirl exercises. (Use ‘swirl’ package, library and install that topic from swirl)
4. EXPERIMENT - 4: Visualize all Statistical measures (Mean, Mode, Median, Range, Inter Quartile Range etc., using Histograms, Boxplots and Scatter Plots)
5. EXPERIMENT - 5: Create a data frame with the following structure
6. EXPERIMENT - 6: Write R Program using ‘apply’ group of functions to create and apply normalization function on each of the numeric variables/columns of iris dataset
7. EXPERIMENT - 7: Create a data frame with 10 observations and 3 variables and add new rows and columns to it using ‘rbind’ and ‘cbind’ function
8. EXPERIMENT - 8: Implement linear and multiple regression on ‘mtcars’ dataset to estimate the value of ‘mpg’ variable, with best R2, and plot the original values in ‘green’ and predicted values in ‘red’
9. EXPERIMENT - 9: Implement k-means clustering using R
10. EXPERIMENT - 10: Implement k-medoids clustering using R
11. EXPERIMENT - 11: Implement density based clustering on iris dataset
12. EXPERIMENT - 12: Implement decision trees using ‘readingSkills’ dataset
13. EXPERIMENT - 13: Implement decision trees using ‘iris’ dataset using package party and ‘rpart’
14. EXPERIMENT - 14: Use a Corpus() function to create a data corpus, then build a term matrix and reveal word frequencies


R Programming Language
• R is an open-source programming language that is widely used as statistical software and as a data analysis tool. R generally comes with a command-line interface and is available on widely used platforms such as Windows, Linux, and macOS. It is actively maintained and extended through contributed packages.
• It was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Core Team. R is an implementation of the S programming language.


Features of R Programming Language

Statistical Features of R:

• Basic statistics: The most common basic statistics are the mean, median, and mode, together known as “Measures of Central Tendency”. R computes the mean and median directly with mean() and median(); the mode can be read from a frequency table.
• Static graphics: R has rich facilities for creating static graphics, with many plot types including maps, mosaic plots, and biplots.
• Probability distributions: Probability distributions play a vital role in statistics, and R handles many of them, such as the binomial, normal, and chi-squared distributions.
• Data analysis: R provides a large, coherent, and integrated collection of tools for data analysis.

Advantages of R:
• R is one of the most comprehensive statistical analysis packages, and new techniques and concepts often appear first in R.
• R is open source, so it can be installed and run freely on any supported platform.


Experiment- 1
AIM : Implement all basic R commands.

Source Code:

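ls( ): The ls() function lists the names of all objects present in the current workspace (the global environment by default); it does not list files in the working directory.
  vec <- c(1, 2, 3)
  mat <- matrix(c(1:4), 2)
  arr <- array(1:8, dim = c(2, 2, 2))   # right-hand side is cut off in the record; this array is an assumed stand-in
  ls()

rm( ): rm() deletes objects from memory. It can be combined with ls() to delete all objects.
  list1 <- list(1, "a")   # assumed definition so that rm() has an object to remove
  rm(list1)
  ls()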


beta( ): The beta function is available in R as beta(a, b), where a and b are non-negative numbers. Similarly, lbeta(a, b) returns the natural logarithm of the beta function. The example below plots the related beta density, dbeta().
  x_beta <- seq(0, 1.5, by = 0.025)
  y_beta <- dbeta(x_beta, shape1 = 2, shape2 = 4.5)
  plot(y_beta)

rm(list = ls()): rm() removes objects from a specified environment. Passing list = ls() hands rm() the names of every object in the workspace, so the call clears the whole workspace.
  vec <- c(1, 2, 3, 4)
  list1 <- list(number = c(1, 2, 3), characters = c("a", "b", ""))
  mat <- matrix(c(1:9), 3, 3)
  rm(list = ls())
  ls()

gamma(x): The gamma function is available as gamma(x), where x is a numeric vector. The function is undefined at zero and at negative integers, so such arguments produce NaN with a warning, as shown below.
  x1 <- c(2, 3, 5)
  x2 <- c(6, 7, 8)
  x3 <- c(-1, -2, -3)
  gamma(x1)
  gamma(x2)
  gamma(x3)

choose( ): R offers a direct function, choose(n, k), that computes the nCr value (the number of ways to pick k items out of n) without writing out the whole computation. When k is greater than n the result is 0.
  answer1 <- choose(3, 2)
  answer2 <- choose(3, 7)
  answer3 <- choose(7, 3)
  print(answer1)
  print(answer2)
  print(answer3)

factorial( ): R offers a factorial() function that computes the factorial of a number without writing the whole code for the computation. It is vectorized, so it accepts several numbers at once.
  answer1 <- factorial(c(0, 1, 2, 3, 4))
  print(answer1)
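
The three functions above are closely related, and a short check can make the relationship concrete. The sketch below assumes the standard identities gamma(n) = (n-1)! for positive integers and nCr = n! / (r! (n-r)!); the helper name ncr_by_hand is made up for illustration and is not part of the record.

```r
# Hypothetical helper: nCr built from factorial(), for comparison with choose()
ncr_by_hand <- function(n, r) {
  factorial(n) / (factorial(r) * factorial(n - r))
}

n <- 7
r <- 3

# gamma(n) equals factorial(n - 1) for positive integers
gamma_matches <- isTRUE(all.equal(gamma(n), factorial(n - 1)))

# choose(n, r) agrees with the factorial formula
choose_matches <- isTRUE(all.equal(choose(n, r), ncr_by_hand(n, r)))

print(gamma_matches)
print(choose_matches)
print(choose(n, r))   # 35
```

For large n, choose() is the safer option because factorial() overflows to Inf long before the binomial coefficient itself does.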

replace( ): replace(x, list, values) replaces the elements of x at the positions given in list with the entries of values; if necessary, values is recycled.
  Names <- c("Suresh", "sita", "Anu", "manasa", "Riya", "Ramesh", "Roopa", "Neha")
  Roll_no <- 1:8
  Marks <- c(15, 20, 3, -1, 14, -2, 10, 13)
  Full_marks <- c(20, 10, 44, 21, 24, 36, 20, 13)
  df <- data.frame(Roll_no, Names, Marks, Full_marks)
  print("Original DF")
  print(df)
  print("Replaced Value")
  data <- replace(df$Marks, df$Marks < 0, 0)   # negative marks become 0
  print(data)

list(values): Lists are R objects that contain elements of different types, such as numbers, strings, vectors, and other lists. A list can also contain a matrix or a function as an element. A list is created using the list() function.
  x <- list(mt = matrix(1:6, nrow = 2), lt = letters[1:8], n = c(1:10))
  cat("Whole list:\n")
  print(x)

round(x, n): round() rounds values to a specified number of decimal places n (0 by default).
  x1 <- 1.2
  x2 <- 1.8
  x3 <- -1.3
  x4 <- 1.7
  round(x1)
  round(x2)
  round(x3)
  round(x4)

ceiling( ): It returns the smallest integer that is greater than or equal to the value passed to it as argument.
  # using the ceiling() function
  answer1 <- ceiling(1.2)
  answer2 <- ceiling(1.5)
  answer3 <- ceiling(2.6)
  answer4 <- ceiling(-2.6)
  print(answer1)
  print(answer2)
  print(answer3)
  print(answer4)


floor( ): It returns the largest integer that is smaller than or equal to the value passed to it as argument.
  answer1 <- floor(1.2)
  answer2 <- floor(1.5)
  answer3 <- floor(2.6)
  answer4 <- floor(-2.6)
  print(answer1)
  print(answer2)
  print(answer3)
  print(answer4)
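
The difference between round(), ceiling(), and floor() is easiest to see on the same inputs, especially for negative numbers and exact halves. The following is a small illustrative sketch; the input vector is chosen for the example and does not come from the record.

```r
x <- c(-2.6, -1.3, 1.2, 1.5, 2.5)

rounded   <- round(x)      # nearest integer; exact halves go to the even neighbour
ceilinged <- ceiling(x)    # smallest integer >= x
floored   <- floor(x)      # largest integer <= x

print(data.frame(x, rounded, ceilinged, floored))

# The second argument of round() sets the number of decimal places
two_places <- round(3.14159, 2)
print(two_places)
```

Note that round(1.5) and round(2.5) both give 2, because R follows the IEC 60559 "round half to even" rule for values that are exactly halfway.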

all( ) & any( ): The any() and all() functions are handy shortcuts. They report whether any or all of their arguments are TRUE.
  x <- 1:10
  all(x > 88)
  all(x > 0)
  any(x > 8)
  any(x > 88)


min( ): It calculates the minimum of the elements of a vector, array, or a particular column.

  x1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
  x2 <- c(4, 2, 8, NA, 11)
  min(x1)
  min(x2, na.rm = FALSE)   # NA, because the missing value is kept
  min(x2, na.rm = TRUE)    # missing value dropped before taking the minimum
  arr <- array(2:13, dim = c(2, 3, 2))
  print(arr)
  min(arr)

max( ): It finds the maximum element present in an object.

  max(x1)
  max(x2, na.rm = FALSE)
  max(x2, na.rm = TRUE)
  arr <- array(2:13, dim = c(2, 3, 2))
  print(arr)
  max(arr)

sum( ) & mean( ): sum() and mean() compute the total and the arithmetic mean of their arguments; prod() computes the product in the same way.
  vec <- c(1, 2, 3, 4)
  print("sum of the vector:")
  print(sum(vec))
  print("mean of the vector:")
  print(mean(vec))
  print("product of the vector:")
  print(prod(vec))

rev( ): It returns a reversed version of the data.

  vec <- 1:5
  vec
  vec_rev <- rev(vec)
  vec_rev


table( ): It creates a categorical summary of the data, listing each combination of values and its frequency in the form of a table.
  df <- data.frame(Name = c("abc", "cde", "def"), Gender = c("male", "female", "male"))
  table(df)


Experiment- 2
AIM : Interact data through .csv files (Import from and export
to .csv files).

Source Code:

CSV: CSV is a commonly used file format for spreadsheet data. Even software that does not look and feel like a spreadsheet application frequently offers CSV as an output format for downloading a data set, such as a report of results, actions, or contacts.

• A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record, and each record consists of one or more fields separated by commas. The use of the comma as a field separator is the source of the name of the format.

• Create valid data in an Excel sheet.

• Save the data in a known location in .csv format.

• Import the .csv file in the R console.

  csv_data <- read.csv(file = "data.csv")
  print(csv_data)
  print(ncol(csv_data))
  print(nrow(csv_data))
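
The aim also asks for exporting to a .csv file, which the steps above do not show. A minimal sketch of the round trip, assuming a small made-up data frame (the students table and the temporary file path are illustrative only), could look like this:

```r
# Small illustrative data frame (values are made up for the example)
students <- data.frame(
  roll_no = 1:3,
  name    = c("Anu", "Riya", "Neha"),
  marks   = c(15, 20, 13)
)

# Export: write to a temporary .csv file, without the row-name column
csv_path <- tempfile(fileext = ".csv")
write.csv(students, file = csv_path, row.names = FALSE)

# Import: read the same file back and inspect it
csv_back <- read.csv(file = csv_path)
print(csv_back)
print(ncol(csv_back))
print(nrow(csv_back))
```

Setting row.names = FALSE keeps write.csv() from adding an extra unnamed first column, so the file reads back with the same three columns.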


Experiment- 3
AIM : Get and Clean data using swirl exercises. (Use ‘swirl’
package, library and install that topic from swirl).

Source Code:

SWIRL: swirl is a platform for teaching R programming and data science. An educational platform, however, is only as good as the content it delivers to the students.

• swirl is a software package for the R programming language that turns the R console into an interactive learning environment. Users receive immediate feedback as they are guided through self-paced lessons in data science and R programming.

• swirl is designed so that instructors can create their own interactive content and share it freely with students in a classroom or around the world.

• The swirlify R package provides a toolbox for swirl instructors. Its authoring tools guide the instructor through the process of creating interactive content, so that the focus stays on the message to convey to students.

Install Swirl Package:

Step-1: Install the swirl package.

  install.packages("swirl")


Step-2: Select a CRAN mirror (region) to continue.


Step-3: Wait while the swirl package is installed.

Step-4: Type library(swirl) to load the package.

  library(swirl)

Step-5: Type swirl() when you are ready to begin.

  swirl()


Step-6: Enter the name you want to be called by.

Step-7: Type … to continue.

Step-8: Select 1 for the basics of R programming.


Step-9: Select 1 to move to R programming.

Step-10: Select 0 after going through the menu to go to the course repository.


Step-11: The console redirects to GitHub in your browser.

Step-12: Select "Getting and Cleaning Data" to check the course files.


Experiment- 4
AIM : Visualize all Statistical measures (Mean, Mode, Median, Range,
Inter Quartile Range etc., using Histograms, Boxplots and
Scatter Plots).

Source Code:

MEAN:- The mean is an essential concept in mathematics and statistics. It is the arithmetic average of a collection of numbers: their sum divided by their count. In statistics it is a measure of central tendency of a probability distribution, along with the median and mode, and it is also referred to as the expected value.

  data <- iris
  head(data)
  mean(data$Sepal.Length)

MEDIAN:- The median is the central number of a data set. Arrange the values in order and take the middle one; this is the median. If there are 2 numbers in the middle, the median is the average of those 2 numbers.

  median(data$Sepal.Length)

MODE:- The mode is the value that occurs most often. The mode is the only average that can have no value, one value, or more than one value. When finding the mode, it helps to order the numbers first. R has no built-in function for the statistical mode, so a sorted frequency table is used.

  tab <- table(data$Sepal.Length)
  sort(tab, decreasing = TRUE)
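
Reading the mode off the sorted table works, but it can be convenient to wrap the same idea in a function. The sketch below is one possible way to do it; stat_mode is a hypothetical helper name, not a base R function, and it returns every value tied for the highest count.

```r
# Hypothetical helper: returns every value tied for the highest frequency
stat_mode <- function(x) {
  counts <- table(x)
  modes  <- names(counts)[as.vector(counts == max(counts))]
  # table() stores values as character names; convert back for numeric input
  if (is.numeric(x)) as.numeric(modes) else modes
}

print(stat_mode(c(1, 2, 2, 3, 3)))     # two modes: 2 and 3
print(stat_mode(c("a", "b", "b")))     # single mode: "b"
print(stat_mode(iris$Sepal.Length))
```

Returning all tied values keeps the function consistent with the remark above that a data set can have more than one mode.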

RANGE:- The range is computed by subtracting the minimum from the maximum.

  max(data$Sepal.Length) - min(data$Sepal.Length)


INTERQUARTILE RANGE:- The interquartile range (i.e., the difference between the third and first quartiles) can be computed with the IQR() function.

  IQR(data$Sepal.Length)

HISTOGRAM:
A histogram gives an idea of the distribution of a quantitative variable. The idea is to break the range of values into intervals and count how many observations fall into each interval. Histograms are somewhat similar to barplots, but histograms are used for quantitative variables whereas barplots are used for qualitative variables. To draw a histogram in R, use hist().
  data <- iris
  hist(data$Sepal.Length)


BOXPLOT:
Boxplots are very useful in descriptive statistics and are often underused (mostly because they are not well understood by the public). A boxplot graphically represents the distribution of a quantitative variable by displaying five common location summaries (minimum, first quartile, median, third quartile, and maximum) and any observation classified as a suspected outlier using the interquartile range (IQR) criterion.
  boxplot(data$Sepal.Length)

SCATTER PLOT:
Scatterplots help to check whether there is a potential link between two quantitative variables. For this reason they are often used to visualize a potential correlation between two variables, for instance between the length of the sepal and the length of the petal.
  plot(data$Sepal.Length, data$Petal.Length)
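
Since the aim is to visualize the statistical measures, it may help to draw them on the plots themselves. The sketch below is one way to do that under simple assumptions: the colours, line styles, and titles are arbitrary choices, and the range and IQR are recomputed from their definitions only to show how they relate to the built-in functions.

```r
x <- iris$Sepal.Length

# Histogram with the mean and median marked as vertical lines
hist(x, main = "Sepal length", xlab = "Sepal.Length")
abline(v = mean(x),   col = "red",  lwd = 2)
abline(v = median(x), col = "blue", lwd = 2, lty = 2)
legend("topright", legend = c("mean", "median"),
       col = c("red", "blue"), lwd = 2, lty = c(1, 2))

# Range and IQR written out from their definitions
x_range <- max(x) - min(x)
x_iqr   <- unname(quantile(x, 0.75) - quantile(x, 0.25))
print(c(range = x_range, iqr = x_iqr))

# Correlation behind the sepal/petal scatter plot
sepal_petal_cor <- cor(iris$Sepal.Length, iris$Petal.Length)
print(sepal_petal_cor)
```

A positive correlation here matches the upward trend visible in the scatter plot of sepal length against petal length.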


Experiment- 5
AIM : Create a data frame with the following structure.

EMP ID   EMP NAME   SALARY   START DATE
1        Satish     5000     01-11-2013
2        Vani       7500     05-06-2011
3        Ramesh     10000    21-09-1999
4        Praveen    9500     13-09-2005
5        Pallavi    4500     23-10-2000

a. Extract two columns using column names.
b. Extract the first two rows with all columns.

Data Frame :- A data frame is a table or two-dimensional array-like structure in which each column contains the values of one variable and each row contains one set of values from each column.

• The column names should be non-empty.
• The row names should be unique.
• The data stored in a data frame can be of numeric, factor or character type.
• Each column should contain the same number of data items.
• Data frames display data in table form and can hold different types of data: the first column can be character while the second and third are numeric or logical.


Source Code:

Step-1: Enter valid data to insert in the data frame.

  df <- data.frame(emp_id = c(1:5),
                   emp_name = c("Satish", "Vani", "Ramesh", "Praveen", "Pallavi"),
                   salary = c(5000, 7500, 10000, 9500, 4500),
                   start_date = c("2013-11-01", "2011-06-05", "1999-09-21", "2005-09-13", "2000-10-23"))
  print(df)

(a): Extracting two columns using column names.

  result <- data.frame(df$emp_name, df$salary)
  print(result)


(b): Extracting the first two rows with all columns.

  result1 <- df[1:2, ]   # the trailing comma selects rows; df[1:2] would select columns
  print(result1)

(c): Extracting the 3rd and 5th rows with the 2nd and 4th columns.
  result2 <- df[c(3, 5), c(2, 4)]
  print(result2)
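
The table in the aim writes the start dates as day-month-year, so one option is to store them as real Date values rather than text. The sketch below assumes that format ("%d-%m-%Y") and uses the names emp and early purely for illustration; it also repeats extractions (a) and (b) with name-based and row-based indexing.

```r
emp <- data.frame(
  emp_id     = 1:5,
  emp_name   = c("Satish", "Vani", "Ramesh", "Praveen", "Pallavi"),
  salary     = c(5000, 7500, 10000, 9500, 4500),
  start_date = as.Date(c("01-11-2013", "05-06-2011", "21-09-1999",
                         "13-09-2005", "23-10-2000"), format = "%d-%m-%Y")
)

str(emp)

# (a) Two columns selected by name
print(emp[, c("emp_name", "salary")])

# (b) First two rows, all columns
print(emp[1:2, ])

# Employees who started before 2006, picked with a date comparison
early <- emp[emp$start_date < as.Date("2006-01-01"), ]
print(as.character(early$emp_name))
```

Storing the column as Date makes comparisons and sorting by start date behave chronologically, which plain character strings in day-month-year order would not.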


Experiment- 6
AIM : Write an R program using the ‘apply’ group of functions to create and apply a normalization function on each of the numeric variables/columns of the iris dataset, transforming them into

a. a 0 to 1 range with min-max normalization.
b. a value around 0 with z-score normalization.

• The most common reason to normalize variables is multivariate analysis (i.e. you want to understand the relationship between several predictor variables and a response variable) in which each variable should contribute equally to the analysis.
• Variables measured on different scales would otherwise carry unequal weight. Two common ways to normalize (or “scale”) variables are:

• Min-Max Normalization: (X - min(X)) / (max(X) - min(X))
• Z-Score Standardization: (X - μ) / σ, where μ is the mean and σ is the standard deviation of X

Source Code:
(a). 0 to 1 range with min-max normalization.
  min_max_norm <- function(x) { (x - min(x)) / (max(x) - min(x)) }
  iris_norm <- as.data.frame(lapply(iris[1:4], min_max_norm))
  head(iris_norm)


(b). A value around 0 with z-score normalization. The mean and standard deviation of a column are the two quantities the z-score needs.

  mean(iris$Sepal.Width)
  sd(iris$Sepal.Width)
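
The record computes only the mean and standard deviation for part (b). A possible completion, following the same lapply() pattern as part (a), is sketched below; the function name z_score_norm and the result name iris_z are assumptions made for this example.

```r
# Hypothetical helper: centre on the mean, scale by the standard deviation
z_score_norm <- function(x) {
  (x - mean(x)) / sd(x)
}

# Apply the function to the four numeric columns of iris
iris_z <- as.data.frame(lapply(iris[1:4], z_score_norm))
head(iris_z)

# Each column should now have mean ~0 and standard deviation 1
col_means <- sapply(iris_z, mean)
col_sds   <- sapply(iris_z, sd)
print(round(col_means, 10))
print(col_sds)
```

Base R's scale() performs the same centring and scaling, so it can serve as a cross-check on the hand-written function.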


Experiment- 7
AIM : Create a data frame with 10 observations and 3 variables and add
new rows and columns to it using ‘rbind’ and ‘cbind’ function.

Source Code:
rbind : row-bind
• The name of the rbind() function stands for row-bind. It combines several vectors, matrices and/or data frames by rows.
• Its deparse.level argument determines how labels are constructed for arguments that are not matrix-like; the default value is 1.

cbind : column-bind
• A common data manipulation task in R is combining two data frames. One of the simplest ways to do this is with the cbind() function.
• cbind(), short for column bind, combines data frames (or vectors and matrices) that have the same number of rows into a single data frame, side by side.
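
Before the larger population example, the aim can be met with a compact data frame of exactly 10 observations and 3 variables. The following is a minimal sketch; the students data and the derived passed column are made up for illustration.

```r
# Data frame with 10 observations and 3 variables (values are illustrative)
students <- data.frame(
  roll_no = 1:10,
  name    = paste0("student_", 1:10),
  marks   = c(15, 20, 3, 11, 14, 9, 10, 13, 18, 7)
)

# rbind(): append one new row; column names must match the existing ones
new_row <- data.frame(roll_no = 11, name = "student_11", marks = 16)
students_rows <- rbind(students, new_row)

# cbind(): append one new column; its length must match the row count
students_full <- cbind(students_rows, passed = students_rows$marks >= 10)

print(dim(students))        # 10 rows, 3 columns
print(dim(students_full))   # 11 rows, 4 columns
print(tail(students_full, 3))
```

rbind() matches data frame columns by name, which is why the new row is built as a one-row data frame with the same column names.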

rbind:-
  Rank <- 1:10
  Country <- c("China", "India", "United States", "Indonesia", "Pakistan", "Brazil", "Nigeria", "Bangladesh", "Russia", "Mexico")
  Population.2019 <- c(1433783686, 1366417754, 329064917, 270625568, 216565318, 211049527, 200963599, 163046161, 145872256, 127575529)
  Population.2018 <- c(1427647786, 1352642280, 327096265, 267670543, 212228286, 209469323, 195874683, 161376708, 145734038, 126190788)
  Growth.Rate <- c("0.43%", "1.02%", "0.60%", "1.10%", "2.04%", "0.75%", "2.60%", "1.03%", "0.09%", "1.10%")
  Dataframe.WorldPopulation <- data.frame(Rank, Country, Population.2019, Population.2018, Growth.Rate)
  Dataframe.WorldPopulation
  Japan.Population <- data.frame(11, "Japan", 126860301, 127202192, "-0.27%")
  # Column names must match the main data frame exactly for rbind() to work
  names(Japan.Population) <- c("Rank", "Country", "Population.2019", "Population.2018", "Growth.Rate")
  WorldPopulation.Newdf <- rbind(Dataframe.WorldPopulation, Japan.Population)
  WorldPopulation.Newdf

cbind:-
  Dataframe.WorldPopulation <- cbind(Dataframe.WorldPopulation, Area.km.square = c(9706961, 3287590, 9372610, 1904569, 881912, 8515767, 923768, 147570, 17098242, 1964375))
  Dataframe.WorldPopulation


Experiment- 8
AIM : Write R program to implement linear and multiple regression on
‘mtcars’ dataset to estimate the value of ‘mpg’ variable, with best R2
and plot the original values in ‘green’ and predicted values in ‘red’.
Source Code:
• The built-in mtcars data frame contains information about 32 cars, including their weight, fuel efficiency (in miles per gallon), horsepower, etc. (To find out more about the dataset, use help(mtcars).)
• A plot of mpg against wt shows a roughly linear relationship. To perform linear regression and determine the coefficients of the linear model, use the lm() function.

  plot(mpg ~ wt, data = mtcars, col = 2)
  fit <- lm(mpg ~ wt, data = mtcars)
  summary(fit)
  abline(fit, col = 3, lwd = 2)          # fitted regression line
  bs <- round(coef(fit), 3)              # intercept and slope
  lmlab <- paste0("mpg = ", bs[1], ifelse(sign(bs[2]) == 1, " + ", " - "), abs(bs[2]), " wt ")
  mtext(lmlab, 3, line = -2)             # write the fitted equation on the plot
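
The code above covers simple linear regression only. A sketch of the remaining parts of the aim (a multiple regression, a comparison of R2, and the green/red plot) is given below. The choice of predictors wt, hp, and cyl is an assumption made for illustration; the record does not say which combination was used to obtain the best R2.

```r
# Simple and multiple regression models for mpg
fit_simple   <- lm(mpg ~ wt, data = mtcars)
fit_multiple <- lm(mpg ~ wt + hp + cyl, data = mtcars)   # predictor choice is an assumption

# Compare goodness of fit
r2 <- c(simple   = summary(fit_simple)$r.squared,
        multiple = summary(fit_multiple)$r.squared)
print(round(r2, 3))

# Original values in green, predicted values in red
predicted <- predict(fit_multiple, newdata = mtcars)
plot(mtcars$mpg, col = "green", pch = 19,
     ylim = range(c(mtcars$mpg, predicted)),
     xlab = "Car index", ylab = "mpg",
     main = "Original (green) vs predicted (red) mpg")
points(predicted, col = "red", pch = 4)
legend("topright", legend = c("original", "predicted"),
       col = c("green", "red"), pch = c(19, 4))
```

Plain R2 never decreases when predictors are added, so summary(fit_multiple)$adj.r.squared is usually the fairer figure when comparing models of different sizes.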


Experiment- 9
AIM : Implement k-means clustering using R.
Source Code: K-means
• K-means clustering is an unsupervised algorithm that groups data based on similarity.
• It seeks to partition the observations into a pre-specified number of clusters, assigning each training example to a segment called a cluster.
• Because the algorithm is unsupervised, it relies heavily on the raw data, and the resulting clusters usually need manual review for relevance. It is used in a variety of fields such as banking, healthcare, retail, and media.

  data(iris)
  str(iris)

  library(cluster)
  iris_1 <- iris[, -5]                  # drop the Species label
  set.seed(240)
  kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)
  kmeans.re
  kmeans.re$cluster

  cm <- table(iris$Species, kmeans.re$cluster)   # species against assigned cluster
  cm

  plot(iris_1[c("Sepal.Length", "Sepal.Width")], col = kmeans.re$cluster)
  plot(iris_1[c("Sepal.Length", "Sepal.Width")], col = kmeans.re$cluster, main = "K-means with 3 clusters")
  kmeans.re$centers
  kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")]

  # Mark the cluster centres on the current plot
  points(kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex = 3)
  y_kmeans <- kmeans.re$cluster
  clusplot(iris_1[, c("Sepal.Length", "Sepal.Width")], y_kmeans, lines = 0, shade = TRUE, color = TRUE, labels = 2, plotchar = FALSE, span = TRUE, main = paste("Cluster iris"), xlab = 'Sepal.Length', ylab = 'Sepal.Width')
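
Because k-means needs the number of clusters in advance, a common way to pick it is the "elbow" heuristic: run the algorithm for several values of k and look at where the total within-cluster sum of squares stops dropping sharply. The sketch below is one illustrative way to do that; the range 1 to 6 is an arbitrary choice.

```r
features <- iris[, -5]

# Total within-cluster sum of squares for k = 1..6
set.seed(240)
wss <- sapply(1:6, function(k) {
  kmeans(features, centers = k, nstart = 20)$tot.withinss
})
print(round(wss, 2))

# The "elbow" where the curve flattens suggests a reasonable k
plot(1:6, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

For iris the curve bends around k = 2 to 3, which is consistent with using centers = 3 above.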


Experiment- 10
AIM : Implement k-medoids clustering using R.
Source Code: k-medoids
• The k-medoids algorithm (also called Partitioning Around Medoids, PAM) was proposed in 1987 by Kaufman and Rousseeuw. A medoid is the point in a cluster whose total dissimilarity to all the other points in the cluster is minimal.

  set.seed(1234)
  x <- rnorm(24, mean = rep(1:3, each = 4), sd = 0.2)
  y <- rnorm(24, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
  data <- data.frame(x, y)
  plot(x, y, col = "blue", pch = 19, cex = 1)
  text(x + 0.05, y + 0.05, labels = as.character(1:24))
  library(cluster)
  kmedoidObj <- pam(x = data, k = 4)
  names(kmedoidObj)

  kmedoidObj$objective
  par(mfrow = c(2, 2), mar = c(3, 3, 3, 3))
  for (i in 1:4) {
    # points coloured by cluster, medoids drawn as crosses
    plot(x, y, col = kmedoidObj$clustering, pch = 19, cex = 1)
    points(kmedoidObj$medoids, col = 1:4, pch = 4, cex = 3, lwd = 3)
  }
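
The definition of a medoid given above can be checked by hand on a tiny data set. The sketch below assumes Euclidean distance and five made-up points; it finds the point with the smallest total dissimilarity and compares it with what pam() reports for a single cluster.

```r
library(cluster)

# Five illustrative points: four close together and one far away
pts <- data.frame(x = c(1, 2, 1, 2, 10),
                  y = c(1, 1, 2, 2, 10))

# Sum of Euclidean distances from each point to all the others
d <- as.matrix(dist(pts))
total_dissimilarity <- rowSums(d)
print(round(total_dissimilarity, 2))

# The medoid is the point with the smallest total dissimilarity
medoid_by_hand <- unname(which.min(total_dissimilarity))

# pam() with a single cluster should pick the same observation
medoid_by_pam <- pam(pts, k = 1)$id.med

print(medoid_by_hand)
print(medoid_by_pam)
```

Unlike a k-means centre, which here would be the mean (3.2, 3.2) and is not an observed point, the medoid is always one of the actual observations, which makes k-medoids less sensitive to the outlier at (10, 10).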


Experiment- 11
AIM : Implement density based clustering on iris dataset
Source Code:

• Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an unsupervised learning algorithm. It uses the ideas of density reachability and density connectivity.

• The data is partitioned into groups with similar characteristics, or clusters, but the number of groups does not have to be specified in advance. A cluster is defined as a maximal set of density-connected points. The algorithm discovers clusters of arbitrary shape in spatial databases with noise.

  data(iris)
  str(iris)
  install.packages("fpc")   # needed only once

  library(fpc)
  iris_1 <- iris[-5]        # drop the Species label
  set.seed(220)
  dbscan_cl <- dbscan(iris_1, eps = 0.45, MinPts = 5)
  dbscan_cl

  dbscan_cl$cluster         # cluster 0 marks noise points
  table(dbscan_cl$cluster, iris$Species)
  plot(dbscan_cl, iris_1, main = "DBScan")
  plot(dbscan_cl, iris_1, main = "Petal Width vs Sepal Length")


Experiment- 12
AIM : Implement decision trees using ‘readingSkills’ dataset.
Source Code:

Decision Trees: Decision trees are supervised machine learning algorithms that can perform both regression and classification tasks. A tree is characterized by nodes and branches: the tests on each attribute are represented at the nodes, the outcomes of those tests are represented at the branches, and the class labels are represented at the leaf nodes.

• A decision tree therefore uses a tree-like model of decisions to compute probable outcomes. Tree-based algorithms are among the most widely used because they are easy to interpret and use.
• The example below examines this concept on the widely used “readingSkills” dataset by visualizing a decision tree for it and examining its accuracy.

  library(datasets)
  library(caTools)
  library(party)
  library(dplyr)
  library(magrittr)
  data("readingSkills")
  head(readingSkills)

  # Split on the response column so that rows (not columns) are sampled
  sample_data <- sample.split(readingSkills$nativeSpeaker, SplitRatio = 0.8)
  train_data <- subset(readingSkills, sample_data == TRUE)
  test_data <- subset(readingSkills, sample_data == FALSE)
  model <- ctree(nativeSpeaker ~ ., train_data)
  plot(model)

  # Accuracy on the held-out rows
  predicted <- predict(model, test_data)
  conf_mat <- table(test_data$nativeSpeaker, predicted)
  sum(diag(conf_mat)) / sum(conf_mat)


Experiment- 13
AIM : Implement decision trees using ‘iris’ dataset using package party
and ‘rpart’
Source Code:

DECISION TREES WITH PACKAGE PARTY :

• The party package is a computational toolbox for recursive partitioning. The core of the package is ctree(), an implementation of conditional inference trees which embed tree-structured regression models into a well-defined theory of conditional inference procedures.
• This non-parametric class of regression trees is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored as well as multivariate response variables and arbitrary measurement scales of the covariates.
• Based on conditional inference trees, cforest() provides an implementation of Breiman's random forests. The function mob() implements an algorithm for recursive partitioning based on parametric models.

  str(iris)
  set.seed(1234)
  # Random 70/30 split into training and test rows
  ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
  trainData <- iris[ind == 1, ]
  testData <- iris[ind == 2, ]
  library(party)

  myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
  iris_ctree <- ctree(myFormula, data = trainData)
  table(predict(iris_ctree), trainData$Species)   # predictions on the training rows

  print(iris_ctree)
  plot(iris_ctree)
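
The aim also names the rpart package, which the record does not show. A minimal sketch of the same experiment with rpart() follows, reusing the same seed and 70/30 split; the object names iris_rpart, pred, and conf_mat are chosen for this example only.

```r
library(rpart)

set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
trainData <- iris[ind == 1, ]
testData  <- iris[ind == 2, ]

# Classification tree on the training rows
iris_rpart <- rpart(Species ~ ., data = trainData, method = "class")
print(iris_rpart)

# Confusion matrix and accuracy on the held-out rows
pred <- predict(iris_rpart, newdata = testData, type = "class")
conf_mat <- table(predicted = pred, actual = testData$Species)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)

# Base-graphics drawing of the tree
plot(iris_rpart, margin = 0.1)
text(iris_rpart, use.n = TRUE)
```

Evaluating on testData rather than trainData gives a more honest accuracy estimate than the training-set table used with ctree() above.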
