
CIA- 4

MACHINE LEARNING ALGORITHMS

SUBMITTED BY
SHIVANGI GUPTA (20221026)

UNDER THE GUIDANCE OF


DURGANSH SHARMA

INSTITUTE OF BUSINESS AND MANAGEMENT


CHRIST (DEEMED TO BE UNIVERSITY), DELHI NCR
PROBLEM STATEMENT
The problem of predicting abalone age was tackled methodically. The dataset was first
examined, outliers were removed, and the variables were properly harmonized. Using the
Abalone dataset, we covered the fundamentals of machine learning, worked through the
model construction workflow, and saw several supervised classification algorithms in action.
By the end it was clear that the model's accuracy was poor. Because the age of the abalone is
economically important and the conventional process of determining it is cumbersome, much
research has been devoted to predicting abalone age from the physical measurements
available in the dataset.

BUSINESS UNDERSTANDING
Abalone is a widely harvested type of shellfish. Its flesh is prized as a delicacy, and its shell is
frequently used in jewellery. This paper addresses the problem of estimating the age of an
abalone from its physical properties; the alternative techniques for estimating age are
time-consuming, which is why the subject is of interest. Depending on the species, abalone can
live up to 50 years. Environmental factors such as water flow and wave activity play a
significant role in how quickly they grow. Those from protected waters often develop more
slowly than those from exposed reef areas due to differences in food availability. Estimating
the age of an abalone is challenging because its size is determined not only by its age but also
by the availability of food. Furthermore, abalone can form so-called 'stunted' populations,
whose growth characteristics differ substantially from those of other abalone populations.
Most research on this dataset has framed abalone age prediction as a classification problem,
which entails assigning a label to each case in the dataset. Here the label is the abalone's ring
count, an integer quantity; with so many possible classes, a classifier struggles to distinguish
between them and performs poorly. The age of abalone is positively correlated with its price.
However, identifying an abalone's age is a time-consuming operation. As the abalone matures,
rings form in its inner shell, generally at a pace of one ring per year. Cutting the shell gives
access to the rings: a lab technician polishes and stains a shell sample, then examines it under
a microscope and counts the rings.

DATA UNDERSTANDING
The abalone dataset is a collection of measurements of the physical features of different
abalones; it contains 4,177 examples. To demonstrate the algorithms in action, we use the
Abalone dataset that has already been collected. With this data we can build a number of
regression models to investigate how the independent variables affect our dependent variable,
Rings. Knowing how each factor influences an abalone's age can help oceanographers, jewellers,
and businesses better plan their production, distribution, and pricing strategies. To understand
the data, you must first understand what it contains: the type of each feature (continuous
numeric, discrete numeric, or categorical), its meaning, and the number of instances and
features in the dataset.

VARIABLES
 Sex: This is the gender of the abalone and has categorical value (M, F or I).

 Length: The longest measurement of the abalone shell in mm. Continuous numeric value.

 Diameter: The measurement of the abalone shell perpendicular to length in mm. Continuous numeric value.

 Height: Height of the shell in mm. Continuous numeric value.

 Whole Weight: Weight of the abalone in grams. Continuous numeric value.

 Shucked Weight: Weight of just the meat in the abalone in grams. Continuous numeric
value.

 Viscera Weight: Weight of the abalone after bleeding in grams. Continuous numeric
value.

 Shell Weight: Weight of the abalone after being dried in grams. Continuous numeric
value.

 Rings: This is the target, i.e. the feature the model will be trained to predict. As
mentioned earlier, we are interested in the age of the abalone, and it has been established that
the number of rings + 1.5 gives the age (a short snippet follows this list). Discrete numeric value.
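
For illustration, the age conversion can be computed directly in R. This is only a small sketch and assumes the dataset has already been read into a data frame named abalone with a Rings column:

# Sketch: derive age (in years) from the ring count
abalone$Age <- abalone$Rings + 1.5
head(abalone[, c("Rings", "Age")])
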
MODELLING
1. Logistic regression is a classification method used to estimate the probability of an event's
success or failure in R. Binary dependent variables (true/false, yes/no) are used in logistic
regression, and with a binomial distribution the logit function is employed as the link function
(a brief illustration of the logit link appears after this list). Multiple linear regression is also
fitted in the code that follows.

a. Model selection and assumptions, if any

• The model selected for predicting the lifespan/ageing of abalone was logistic regression.

• Objectives of conducting the logistic regression:

o Detect whether the number of rings places the abalone on an ordinal scale of long life or short life.

o Test interactions between attributes.
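
As a brief illustration of the logit link mentioned above (a sketch only, not part of the submitted analysis), the model relates the probability p of the event to the linear predictor through log(p / (1 - p)):

# The logit link and its inverse
logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))   # same as plogis(x)
inv_logit(logit(0.73))                        # recovers 0.73
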

CODES (LINEAR AND MULTILINEAR REGRESSION)


library(psych)       # pairs.panels(), describe()
library(corrplot)    # corrplot()
library(caTools)     # sample.split()
library(ggplot2)
library(magrittr)    # %>%

head(abalone)
summary(abalone)
pairs.panels(abalone)
abalone_1 <- na.omit(abalone)
describe(abalone_1)
b1 <- boxplot(abalone_1$Length, col="slategray2", pch=19, ylab="Length")
b2 <- boxplot(abalone_1$Diameter, col="slategray2", pch=19, ylab="Diameter")
b3 <- boxplot(abalone_1$Height, col="slategray2", pch=19, ylab="Height")
b4 <- boxplot(abalone_1$Whole.weight, col="slategray2", pch=19, ylab="Whole.weight")
b5 <- boxplot(abalone_1$Shucked.weight, col="slategray2", pch=19, ylab="Shucked.weight")
b6 <- boxplot(abalone_1$Viscera.weight, col="slategray2", pch=19, ylab="Viscera.weight")
b7 <- boxplot(abalone_1$Shell.weight, col="slategray2", pch=19, ylab="Shell.weight")
b8 <- boxplot(abalone_1$Rings, col="slategray2", pch=19, ylab="Rings")
dim(abalone)
cr <- cor(abalone[c(2,3,4,5,6,7,8,9)])
corrplot(cr, type="lower")
cor(abalone$Rings, abalone$Shell.weight)
cor(abalone$Rings, abalone$Length)
cor(abalone$Rings, abalone$Diameter)
ggplot(abalone, aes(x=Rings, y=Diameter, group=1)) + geom_point() +
  stat_summary(fun.y=mean, colour="Cyan", geom="line", size = 3)
set.seed(120)
split1 <- sample.split(abalone$Rings, SplitRatio = 0.70)
summary(split1)
datatrain <- subset(abalone, split1 == TRUE)
summary(datatrain)
dim(datatrain)
datatest <- subset(abalone, split1 == FALSE)
summary(datatest)
abalone.train = lm(Rings ~ ., data = datatrain)
summary(abalone.train)
par(mfrow=c(2,3))
lapply(1:6, function(x) plot(abalone.train, which=x, labels.id = 1:nrow(datatrain))) %>% invisible()
par(mfrow=c(1,1))
abaloneOutliers = c(1035, 1849, 1103)      # outliers identified from the diagnostic plots
removed_1 = datatrain[-abaloneOutliers, ]

CODES FOR SIMPLE BINARY LOGISTIC

> library(readxl)

> abalone <- read_excel("C:/Users/Shivangi Gupta/Desktop/abalone.xlsx")

> View(abalone)

> summary(abalone)

> table(abalone$W, abalone$S)


> abalonelogmod1<-glm(W~S, family = binomial(link="logit"), data = abalone)

> summary(abalonelogmod1)

> exp(cbind(coef(abalonelogmod1),confint(abalonelogmod1)))

> round(exp(cbind(coef(abalonelogmod1),confint(abalonelogmod1))),3)
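
The fitted coefficients reported in the outputs section (intercept 2.38370, slope -1.24595 for S) can be converted into predicted probabilities. A sketch, assuming abalonelogmod1 from above is in the workspace and S is coded 0/1:

# Predicted probability of W = 1 at each level of S
predict(abalonelogmod1, newdata = data.frame(S = c(0, 1)), type = "response")
# The same result by hand from the reported coefficients
plogis(2.38370)             # S = 0: about 0.92
plogis(2.38370 - 1.24595)   # S = 1: about 0.76
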

CODES UNIVARIATE BINARY LOGISTIC REGRESSION


>table(abalone$W, abalone$D)

> abalonelogmod2<-glm(abalone$W ~ abalone$D, family=binomial(link="logit"), data=abalone)

> summary(abalonelogmod2)

> round(exp(cbind(coef(abalonelogmod2), confint(abalonelogmod2))), 3)

> x <- data.frame(abalone$S, abalone$D)

> table(abalone$W, x$abalone.S, abalone$W, x$abalone.D)

CODES FOR MULTIVARIATE BINARY LOGISTIC REGRESSION


> abalonelogmod3<-glm(abalone$W ~ abalone$S + abalone$D, family=binomial(link="logit"),
data=abalone)

> round(exp(cbind(coef(abalonelogmod3), confint(abalonelogmod3))),3)

> library(readxl)
> abalone <- read_excel("C:/Users/Shivangi Gupta/Desktop/abalone.xlsx")
> View(abalone)
> abalonelogmod1<-glm(W ~ Sex + Length + Diameter + Rings, family=binomial (link="logit"),
data=abalone)

> abalonelogmod1<-glm(W ~ H + S + D + R, family=binomial (link="logit"), data=abalone)


> summary(abalonelogmod1)

> glm(formula = W ~ H + S + D + R, family = binomial(link = "logit"), data = abalone)


> exp(cbind(coef(abalonelogmod1), confint(abalonelogmod1)))
> round(exp(cbind(coef(abalonelogmod1),confint(abalonelogmod1))),3)

CODES LINEAR DISCRIMINANT ANALYSIS


> library(psych)

> set.seed(555)
> ind <- sample(2, nrow(abalone), replace = TRUE,prob = c(0.6, 0.4))
> training <- abalone[ind==1,]
> testing <- abalone[ind==2,]
> library(MASS)
> linear <- lda(W~., training)
> linear

> attributes(linear)
HISTOGRAM
> p <- predict(linear, training)
> ldahist(data = p$x[,1], g = training$W)
> library(devtools)
> library(klaR)

> p1 <- predict(linear, training)$class

> tab <- table(Predicted = p1, Actual = training$W)


> tab
> sum(diag(tab))/sum(tab)
> p2 <- predict(linear, testing)$class
> tab1 <- table(Predicted = p2, Actual = testing$Species)   # errors: there is no Species column; corrected below
> tab1 <- table(Predicted = p2, Actual = testing$W)

> tab1
> sum(diag(tab1))/sum(tab1)

CONCLUSION
The dataset was examined with logistic regression, the basics of machine learning were
covered, and the model construction and workflow were examined. It was evident that the
model accuracy was comparatively good. Apparently there isn't much of a difference between
males and females, a claim that can be made with reasonable confidence given the small
variation between the sex-wise means for each of the eight regressors. In addition, because the
accuracy exceeds the "No-Information Rate" (the theoretical accuracy that would be attained
by assigning every observation to the majority class and comparing against the actual data),
the model can be considered a more-or-less decent predictor of abalone sex. To summarize,
after accounting for all of the interfering factors, I was generally pleased with the findings
obtained by logistic regression. Given the abundance of other, more advanced, and generally
more effective classification algorithms, I strongly urge their use on this dataset, believing that
they would yield higher accuracy and, as a result, more precise results.

Measurement tells us how good a model is: after building a model on the training data we
apply it to the testing data and compute the error; if the error is small, the model is economical
and practical in nature. Our R-squared value for predicting the age of an abalone was low
(0.3607); at the same time, a high R-squared does not necessarily indicate a good fit. All of our
predictor variables were statistically significant, with p-values below the α of 0.05. For this
reason we were still able to draw some conclusions about our variables, but the multiple linear
regression model may not be the best way to predict the age of an abalone.
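
As a sketch of the out-of-sample check described above (assuming the regfinal model and the datatest split from the regression code are still in the workspace):

# Apply the model fitted on the training data to the held-out test data
pred_test <- predict(regfinal, newdata = datatest)
rmse <- sqrt(mean((datatest$Rings - pred_test)^2))   # root mean squared error on the test set
rmse
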

3. CLASSIFICATION

We'll use four classifiers to classify the data: random forest, decision tree, KNN and SVM. We'll
also figure out which parameters are best for each classifier. Because the target has numerous
levels, we don't use cross-validation to find the optimal parameters; instead, we use a simple
grid-search strategy to find the optimal parameters for each classifier.

RANDOM FOREST

Random Forest is an ensemble learning technique that builds a large number of decision trees
during training. For classification problems it predicts the mode of the classes chosen by the
individual trees, and for regression tasks it predicts the mean of the individual trees' predictions.
During tree construction it uses the random subspace method and bagging, and it comes with a
built-in feature-importance indicator.
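
The forest fitted in the code below uses only W as a predictor. As a sketch (an assumption for illustration, not the submitted model), a forest grown on the physical measurements would look like this, reusing the train split created below:

# Sketch: random forest on the physical measurements, with D (a factor) as the target
library(randomForest)
meas <- c("Length", "Diameter", "Height", "Whole weight",
          "Shucked weight", "Viscera weight", "Shell weight", "Rings")
rf_all <- randomForest(x = train[, meas], y = train$D, ntree = 300, importance = TRUE)
print(rf_all)
varImpPlot(rf_all)   # which measurements matter most for D
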

CODES
> datarf <- abalone1

> str(datarf)

spec_tbl_df [4,177 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)


> datarf$D <- as.factor(datarf$D)
> table(datarf$D)
> set.seed(123)

> ind <- sample(2, nrow(datarf), replace=TRUE, prob=c(0.7,0.3))

> train <- datarf[ind==1,]

> test <- datarf[ind==2,]

> library(randomForest)

> install.packages("randomForest")

> set.seed(222)

> rf <- randomForest(D~W, data=train, ntree = 300, mtry = 8, importance = TRUE, proximity = TRUE)

> print(rf)

> attributes(rf)
> rf$confusion
>library(caret)
>p1 <- predict(rf, train)
> head(p1)

> head(train$D)

> confusionMatrix(p1, train$D)

>p2 <- predict(rf, test)

>head(p2)
> head(test$D)
> confusionMatrix(p2, test$D)

> importance(rf)
> varUsed(rf)
> getTree(rf, 1, labelVar = TRUE)
> getTree(rf, 1, labelVar = TRUE)
> MDSplot(rf, train$D)

K-NN

KNN is a supervised learning algorithm that predicts the output for data points using a labelled
input data set. It is one of the most basic machine learning algorithms and can be used to solve
a wide range of problems. It is primarily based on feature similarity: KNN compares a data
point's similarity to that of its neighbours and assigns it to the most similar class. KNN is a
non-parametric model, which means it makes no assumptions about the data set, unlike most
algorithms, so it can handle realistic data more effectively. KNN is also a lazy algorithm: instead
of learning a discriminative function from the training data, it memorises the training data.
Both classification and regression problems can be solved with KNN.

> data <- read.csv(file.choose(), header = T)

> data$D[data$D == 0] <- 'No'

> data$D[data$D == 1] <- 'Yes'

> data$D <- factor(data$D)

> set.seed(1234)

> ind <- sample(2, nrow(data), replace = T, prob = c(0.7, 0.3))

> training <- data[ind == 1,]

> test <- data[ind == 2,]

> library(caret)
> trControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary)

> set.seed(222)
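
The listing above sets up repeated cross-validation but stops before the model is actually fitted. A sketch of how the k-NN fit might be completed with caret (the predictor columns named here are assumptions about the CSV, not taken from the submitted code):

# Sketch: fit k-NN with caret using the control object defined above
library(caret)
fit_knn <- train(D ~ Length + Diameter + Height, data = training,
                 method = "knn", tuneLength = 10,
                 trControl = trControl, metric = "ROC")
fit_knn                                    # shows the value of k chosen by cross-validation
pred_knn <- predict(fit_knn, newdata = test)
confusionMatrix(pred_knn, test$D)          # hold-out performance
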

SVM

Support vector machines (SVMs) are supervised learning models with associated learning
algorithms for classification and regression analysis in machine learning. They are primarily
used to solve classification problems. In this algorithm each data item is plotted as a point in
n-dimensional space (where n is the number of features), with the value of each feature being
the value of a particular coordinate. Classification is then performed by finding the hyper-plane
that best separates the two classes. In addition to linear classification, SVMs can also perform
non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces.
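
The description above concerns classification, while the model fitted in the code below ends up as eps-regression because D is numeric. A sketch of an explicitly classification-type SVM (an assumption for illustration, not the submitted code):

# Sketch: SVM classification on a factor response
library(e1071)
abalone$D <- as.factor(abalone$D)          # convert the 0/1 column to a factor first
svm_clf <- svm(D ~ Length + Diameter + Height, data = abalone,
               kernel = "radial", type = "C-classification")
pred_clf <- predict(svm_clf, abalone)
table(Predicted = pred_clf, Actual = abalone$D)   # confusion matrix on the full data
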

CODES

> View(abalone)

> data(abalone)

> data(Abalone)
> str(abalone)
> library(ggplot2)

> library(e1071)
> mymodel <- svm(D~., data=abalone)

>summary(mymodel)
> plot(mymodel, data=abalone, abalone.Height~abalone.Length,
slice=list(abalone.Height=3,abalone.Length=4))

> library(predtoolsTS)
> pred <- predict(mymodel, abalone)

> tab <- table(Predicted=pred, Actual=abalone$Height)

> tab
> plot(mymodel, data=abalone, abalone.Height~abalone.Length,
slice=list(abalone.Height=3,abalone.Length=4))

> pred <- predict(mymodel, abalone)

> tab <- table(Predicted=pred, Actual=abalone$Height)

> tab
>1-sum(diag(tab))/sum(tab)

>data=abalone,kernel="polynomial")   # incomplete line; produces "Error: unexpected ','" (see outputs)
> mymodel <- svm(Diameter~Length, data=abalone,kernel="polynomial")

>summary(mymodel)

> plot(mymodel, data=abalone, abalone.Height~abalone.Length, slice=list(abalone.Height=3,abalone.Length=4))

> library(predtoolsTS)

> pred <- predict(mymodel, abalone)

> tab <- table(Predicted=pred, Actual=abalone$Height)

> tab
>1-sum(diag(tab))/sum(tab)
> mymodel <- svm(Diameter~Length, data=abalone, kernel="sigmoid")

> summary(mymodel)
> set.seed(123)
> tmodel <- tune(svm, Diameter~Length, data=abalone, ranges = list(epsilon = seq(0,1,0.1),
cost=2^(2:9)))
>summary(tmodel)

>tmodel <- tune(svm, Diameter~Length, data=abalone, ranges = list(epsilon = seq(0,1,0.1), cost=2^(2:7)))

>plot(mymodel, data=abalone, abalone.Height~abalone.Length, slice=list(abalone.Height=3,abalone.Length=4))

>pred <- predict(mymodel, abalone)

>tab <- table(Predicted=pred, Actual=abalone$Height)

>tab

>1-sum(diag(tab))/sum(tab)

DECISION TREE

In machine learning, a decision tree is a supervised method. It assigns a target value to each
data sample using a binary tree graph (each node has two children); the tree leaves hold the
target values. Starting at the root node, a sample is propagated through the nodes until it
reaches a leaf. At each node a choice is made about which child node the sample should travel
to, based on a single feature of the sample (one feature is used per node to make the decision).
Decision tree learning is the process of discovering the best splitting rules at each internal tree
node according to the chosen metric.

CODES

mydata <- read.csv("abalone.csv")

mydata$D <- as.factor(mydata$D)

library(party)

>mytree <- ctree(D~H+W+R, mydata, controls=ctree_control(mincriterion=0.9, minsplit=50))

>print(mytree)

>plot(mytree,type="simple")

>tab<-table(predict(mytree), mydata$D)

>print(tab)
>1-sum(diag(tab))/sum(tab)
CONCLUSION

We cross-validated each of the models on the test data before optimising them. Because
cross-validation is a random process, we use pairwise t-tests to see whether there is a
statistically significant difference between the performance of any two tuned classifiers. First,
we run each of the best models through a 10-fold stratified cross-validation procedure (without
any repetitions). Second, we use a paired t-test to compare the accuracy of the RF model to that
of the other models, because the RF model is the most accurate. RF outperforms the other
models in terms of f1-score and weighted average recall, followed by KNN; at the same time,
KNN has a higher precision score. The other classes behave similarly, but because of the
enormous number of target levels we did not print them all. The situation is the same in the
confusion matrix.
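
A sketch of the paired comparison described above, assuming the datarf data frame (with D as a factor) from the random forest section is available; the fold count, predictor columns, and k are illustrative choices, not the settings actually used:

# Sketch: compare RF and k-NN fold accuracies with a paired t-test
library(caret)          # createFolds()
library(randomForest)
library(class)          # knn()
set.seed(1)
folds <- createFolds(datarf$D, k = 10)       # stratified 10-fold test indices
meas  <- c("Length", "Diameter", "Height")
acc_rf <- acc_knn <- numeric(length(folds))
for (i in seq_along(folds)) {
  tr <- datarf[-folds[[i]], ]
  te <- datarf[ folds[[i]], ]
  rf_i <- randomForest(x = tr[, meas], y = tr$D, ntree = 100)
  acc_rf[i]  <- mean(predict(rf_i, te[, meas]) == te$D)
  knn_i <- knn(train = tr[, meas], test = te[, meas], cl = tr$D, k = 5)
  acc_knn[i] <- mean(knn_i == te$D)
}
t.test(acc_rf, acc_knn, paired = TRUE)       # paired t-test on the per-fold accuracies
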

OUTPUTS
MULTILINEAR AND LINEAR REGRESSION
##   Sex Length Diameter Height Whole.weight Shucked.weight Viscera.weight
## 1   M  0.455    0.365  0.095       0.5140         0.2245         0.1010
## 2   M  0.350    0.265  0.090       0.2255         0.0995         0.0485
## 3   F  0.530    0.420  0.135       0.6770         0.2565         0.1415
## 4   M  0.440    0.365  0.125       0.5160         0.2155         0.1140
## 5   I  0.330    0.255  0.080       0.2050         0.0895         0.0395
## 6   I  0.425    0.300  0.095       0.3515         0.1410         0.0775
##   Shell.weight Rings
## 1        0.150    15
## 2        0.070     7
## 3        0.210     9
## 4        0.155    10
## 5        0.055     7
## 6        0.120     8

##      Sex                Length         Diameter          Height
##  Length:4177        Min.   :0.075   Min.   :0.0550   Min.   :0.0000
##  Class :character   1st Qu.:0.450   1st Qu.:0.3500   1st Qu.:0.1150
##  Mode  :character   Median :0.545   Median :0.4250   Median :0.1400
##                     Mean   :0.524   Mean   :0.4079   Mean   :0.1395
##                     3rd Qu.:0.615   3rd Qu.:0.4800   3rd Qu.:0.1650
##                     Max.   :0.815   Max.   :0.6500   Max.   :1.1300
##   Whole.weight    Shucked.weight   Viscera.weight    Shell.weight
##  Min.   :0.0020   Min.   :0.0010   Min.   :0.0005   Min.   :0.0015
##  1st Qu.:0.4415   1st Qu.:0.1860   1st Qu.:0.0935   1st Qu.:0.1300
# The pairwise plot indicates a lack of linearity in the dependent variable column (Rings). The other columns also show multicollinearity.
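
A possible follow-up check on the multicollinearity noted above (a sketch, assuming the car package is installed and the abalone.train model from the code section is in the workspace):

library(car)
vif(abalone.train)   # variance inflation factors; values well above ~10 suggest strong multicollinearity
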
## -- Attaching packages ------------------------------------ tidyverse 1.3.1 --
## v tibble  3.1.3     v purrr   0.3.4
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.0     v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'purrr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.5
##                   Length  Diameter    Height Whole.weight Shucked.weight
## Length         1.0000000 0.9868116 0.8275536    0.9252612      0.8979137
## Diameter       0.9868116 1.0000000 0.8336837    0.9254521      0.8931625
## Height         0.8275536 0.8336837 1.0000000    0.8192208      0.7749723
## Whole.weight   0.9252612 0.9254521 0.8192208    1.0000000      0.9694055
## Shucked.weight 0.8979137 0.8931625 0.7749723    0.9694055      1.0000000
## Viscera.weight 0.9030177 0.8997244 0.7983193    0.9663751      0.9319613
## Shell.weight   0.8977056 0.9053298 0.8173380    0.9553554      0.8826171
## Rings          0.5567196 0.5746599 0.5574673    0.5403897      0.4208837
##                Viscera.weight Shell.weight     Rings
## Length              0.9030177    0.8977056 0.5567196
# An upward inclination is observed, hence the model is not linear and contains outliers.

shapiro.test(abalone$Rings)
##
##  Shapiro-Wilk normality test
##
## data:  abalone$Rings
## W = 0.93115, p-value < 2.2e-16


par(mfrow=c(1,1))
abaloneOutliers= c(1035, 1849, 1103)
removed_1=datatrain[-abaloneOutliers,]
# The above values will be removed (redacted) from the original training data.
train.norm = datatrain[-abaloneOutliers, ]
regfinal = lm(Rings~., data = train.norm)

summary(regfinal)
##
## Call:
## lm(formula = Rings ~ ., data = train.norm)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.2569 -1.2994 -0.3121  0.8494 14.2412
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.64148    0.34718  10.489  < 2e-16 ***


# Autocorrelation - the error terms should not be correlated, i.e. they should be random.
# The presence of correlation among the residuals leads to autocorrelation. To check this we can
# run a Durbin-Watson test. The statistic lies between 0 and 4: DW = 2 implies no autocorrelation,
# 0 < DW < 2 implies positive autocorrelation, and 2 < DW < 4 indicates negative autocorrelation.
# From the result we can see the Durbin-Watson statistic is 1.68, which is less than 2; this implies
# there is positive autocorrelation.
# We employ a step-wise algorithm for feature selection, using three directions: backward, forward,
# and both. We have to define the models for the lower and upper threshold of the algorithm.
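
A sketch of the checks described in these comments, assuming the lmtest package is installed and the regfinal model from the preceding step is in the workspace:

library(lmtest)
dwtest(regfinal)                        # Durbin-Watson test for autocorrelation in the residuals
step(regfinal, direction = "backward")  # step-wise selection; "forward" and "both" need a scope argument
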

SIMPLE BINARY LOGISTIC REGRESSION


       S              Sex                Length            D
 Min.   :0.0000   Length:4177        Min.   :0.075   Min.   :0.0000
 1st Qu.:0.0000   Class :character   1st Qu.:0.450   1st Qu.:0.0000
 Median :1.0000   Mode  :character   Median :0.545   Median :1.0000
 Mean   :0.6342                      Mean   :0.524   Mean   :0.7328
 3rd Qu.:1.0000                      3rd Qu.:0.615   3rd Qu.:1.0000
 Max.   :1.0000                      Max.   :0.815   Max.   :1.0000

    Diameter          Height        Whole weight    Shucked weight
 Min.   :0.0550   Min.   :0.0000   Min.   :0.0020   Min.   :0.0010
 1st Qu.:0.3500   1st Qu.:0.1150   1st Qu.:0.4415   1st Qu.:0.1860
 Median :0.4250   Median :0.1400   Median :0.7995   Median :0.3360
 Mean   :0.4079   Mean   :0.1395   Mean   :0.8287   Mean   :0.3594
 3rd Qu.:0.4800   3rd Qu.:0.1650   3rd Qu.:1.1530   3rd Qu.:0.5020
 Max.   :0.6500   Max.   :1.1300   Max.   :2.8255   Max.   :1.4880

 Viscera weight    Shell weight        Rings              W
 Min.   :0.0005   Min.   :0.0015   Min.   : 1.000   Min.   :0.0000
 1st Qu.:0.0935   1st Qu.:0.1300   1st Qu.: 8.000   1st Qu.:1.0000
 Median :0.1710   Median :0.2340   Median : 9.000   Median :1.0000
 Mean   :0.1806   Mean   :0.2388   Mean   : 9.934   Mean   :0.8152
 3rd Qu.:0.2530   3rd Qu.:0.3290   3rd Qu.:11.000   3rd Qu.:1.0000
 Max.   :0.7600   Max.   :1.0050   Max.   :29.000   Max.   :1.0000

       0    1
  0  129  643
  1 1399 2006

Call:
glm(formula = W ~ S, family = binomial(link = "logit"), data = abalone)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.2235 0.4200 0.4200 0.7457 0.7457

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.38370 0.09201 25.91 <2e-16 ***
S -1.24595 0.10257 -12.15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3998.4  on 4176  degrees of freedom
Residual deviance: 3820.7  on 4175  degrees of freedom
AIC: 3824.7

Number of Fisher Scoring iterations: 5

UNIVARIATE BINARY LOGISTIC REGRESSION

> table(abalone$W, abalone$D)


0 1

0 771 1
1 345 3060
> abalonelogmod2<-glm(abalone$W ~ abalone$D, family=binomial(link="logit"), data=abalone)
> summary(abalonelogmod2)

Call:
glm(formula = abalone$W ~ abalone$D, family = binomial(link =
"logit"), data = abalone)

Deviance Residuals:
Min 1Q Median 3Q Max
-4.0066 0.0256 0.0256 0.0256 1.5323

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.80414    0.06477 -12.415   <2e-16 ***
abalone$D    8.83031    1.00207   8.812   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)


Null deviance: 3998.4 on 4176 degrees of freedom
Residual deviance: 1398.3 on 4175 degrees of freedom
AIC: 1402.3

Number of Fisher Scoring iterations: 10

> round(exp(cbind(coef(abalonelogmod2), confint(abalonelogmod2))), 3)
Waiting for profiling to be done...
                          2.5 %     97.5 %
(Intercept)    0.447     0.394      0.508
abalone$D   6838.434  1542.648 120165.822

> x <- data.frame(abalone$S, abalone$D)
> table(abalone$W, x$abalone.S, abalone$W, x$abalone.D)

, , = 0, = 0
0 1
0 129 642
1 0 0
, , = 1, = 0
0 1
0 0 0
1 81 264

, , = 0, = 1

0 1
0 0 1
1 0 0

, , = 1, = 1

0 1
0 0 0
1 1318 1742

MULTIVARIATE BINARY LOGISTIC REGRESSION

> abalonelogmod3<-glm(abalone$W ~ abalone$S + abalone$D, family=binomial(link="logit"),


data=abalone)

> round(exp(cbind(coef(abalonelogmod3), confint(abalonelogmod3))),3)


Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.632 0.478 0.832
abalone$S 0.649 0.476 0.889
abalone$D 6329.174 1426.810 111238.284

> library(readxl)
> abalone <- read_excel("C:/Users/Shivangi Gupta/Desktop/abalone.xlsx")
> View(abalone)
> abalonelogmod1<-glm(W ~ Sex + Length + Diameter + Rings, family=binomial (link="logit"),
data=abalone)

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> abalonelogmod1<-glm(W ~ H + S + D + R, family=binomial (link="logit"), data=abalone)
> summary(abalonelogmod1)

Call:
glm(formula = W ~ H + S + D + R, family = binomial(link =
"logit"), data = abalone)

Deviance Residuals:
Min 1Q Median 3Q Max
-3.9568 0.0130 0.0146 0.0282 2.2501
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.2090     0.2217  -9.964  < 2e-16 ***
H             3.3347     0.1779  18.741  < 2e-16 ***
S            -0.2395     0.2222  -1.078    0.281
D             6.9416     1.0064   6.897 5.30e-12 ***
R             1.3141     0.3195   4.113 3.91e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3998.45  on 4176  degrees of freedom
Residual deviance:  856.12  on 4172  degrees of freedom
AIC: 866.12

Number of Fisher Scoring iterations: 10

> exp(cbind(coef(abalonelogmod1), confint(abalonelogmod1)))


Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.1098121 0.07030839 1.677201e-01
H 28.0696722 19.92973074 4.005609e+01
S 0.7870051 0.50951643 1.218486e+00
D 1034.3932455 229.83152199 1.825352e+04
R 3.7212957 2.01601457 7.058820e+00
> round(exp(cbind(coef(abalonelogmod1),confint(abalonelogmod1))),3)
Waiting for profiling to be done...

2.5 % 97.5 %
(Intercept) 0.110 0.070 0.168
H 28.070 19.930 40.056
S 0.787 0.510 1.218
D 1034.393 229.832 18253.516
R 3.721 2.016 7.059

All predictors whose z-value probabilities are less than 0.05 are considered major factors for
gender determination, such as weight, height, diameter, rings, etc. The odds increase by a factor
of about 28.07 for a unit increase in height and about 3.72 for a unit increase in rings (40.056
and 7.059 are the corresponding upper 95% confidence limits).
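
The odds ratios quoted here are simply the exponentials of the fitted coefficients:

exp(3.3347)   # Height coefficient -> odds ratio of about 28.07
exp(1.3141)   # Rings coefficient  -> odds ratio of about 3.72
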

LINEAR DISCRIMINANT ANALYSIS


> library(psych)

DATA PARTITION

> set.seed(555)
> ind <- sample(2, nrow(abalone), replace = TRUE,prob = c(0.6, 0.4))
> training <- abalone[ind==1,]
> testing <- abalone[ind==2,]
> library(MASS)
> linear <- lda(W~., training)
Warning message:
In lda.default(x, grouping, ) : variables are collinear
> linear
Call:

lda(W ~ ., data = training)

Prior probabilities of groups:
        0         1
0.1787268 0.8212732

Group means:
          S      SexI      SexM    Length         D  Diameter         H     Height
0 0.8628319 0.7986726 0.1371681 0.3333407 0.0000000 0.2503319 0.1283186 0.08277655
1 0.5917188 0.2161772 0.4082812 0.5666827 0.8955224 0.4430380 0.9740010 0.15133606
  `Whole weight` `Shucked weight` `Viscera weight` `Shell weight`          R
0      0.1962909       0.08656305        0.0428219     0.05803982 0.03539823
1      0.9655279       0.42000289        0.2107434     0.27752792 0.41550313
      Rings
0  6.597345
1 10.630236

Coefficients of linear discriminants:
                         LD1
S                -0.02271797
SexI             -0.12657893
SexM              0.02271797
Length            6.21531058
D                 1.20696507
Diameter          3.67483137
H                 3.43807409
Height           -6.62207580
`Whole weight`   -1.82072215
`Shucked weight`  0.53309955
`Viscera weight`  0.01258240
`Shell weight`    1.81269316
R                -0.02220770
Rings             0.02356770
> attributes(linear)

$names
[1] "prior" "counts" "means" "scaling" "lev" "svd" "N" "call"
[9] "terms" "xlevels"

$class
[1] "lda"

HISTOGRAM
> p <- predict(linear, training)
> ldahist(data = p$x[,1], g = training$W)

> library(devtools)

In addition: Warning messages:


1: package ‘devtools’ was built under R version 4.0.5
2: package ‘usethis’ was built under R version 4.0.5
> library(klaR)
Warning message:

package ‘klaR’ was built under R version 4.0.5


> p1 <- predict(linear, training)$class

> tab <- table(Predicted = p1, Actual = training$W)


> tab

         Actual
Predicted    0    1
        0  394   48
        1   58 2029
> sum(diag(tab))/sum(tab)
[1] 0.9580862
> p2 <- predict(linear, testing)$class
> tab1 <- table(Predicted = p2, Actual = testing$Species)
Error in table(Predicted = p2, Actual = testing$Species) :

all arguments must have the same length


In addition: Warning message:

Unknown or uninitialised column: `Species`.


> tab1 <- table(Predicted = p2, Actual = testing$W)
> tab1

         Actual
Predicted    0    1
        0  286   40
        1   34 1288
> sum(diag(tab1))/sum(tab1)
[1] 0.9550971
Thus, Linear Discriminant Analysis has helped to produce robust, decent, and interpretable
classification results, classifying abalone shells on the basis of their gender, which was not
possible at first glance. The continuous independent variables help in determining the
classifying variable, i.e. gender.

RANDOM FOREST
> datarf <- abalone1

> str(datarf)

spec_tbl_df [4,177 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)

$S : num [1:4177] 0 0 1 0 1 1 1 1 0 1 ...

$ Sex : chr [1:4177] "M" "M" "F" "M" ...


$ Length : num [1:4177] 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...

$D : num [1:4177] 1 0 1 1 0 0 1 1 1 1 ...

$ Diameter : num [1:4177] 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...

$H : num [1:4177] 0 0 1 1 0 0 1 1 1 1 ...

$ Height : num [1:4177] 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...

$ Whole weight : num [1:4177] 0.514 0.226 0.677 0.516 0.205 ...

$ Shucked weight: num [1:4177] 0.2245 0.0995 0.2565 0.2155 0.0895 ...

$ Viscera weight: num [1:4177] 0.101 0.0485 0.1415 0.114 0.0395 ...

$ Shell weight : num [1:4177] 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...

$R : num [1:4177] 1 0 0 0 0 0 1 1 0 1 ...

$ Rings : num [1:4177] 15 7 9 10 7 8 20 16 9 19 ...

$W : num [1:4177] 1 0 1 1 0 1 1 1 1 1 ...

- attr(*, "spec")=

.. cols(

.. S = col_double(),

.. Sex = col_character(),

.. Length = col_double(),

.. D = col_double(),

.. Diameter = col_double(),

.. H = col_double(),

.. Height = col_double(),

.. `Whole weight` = col_double(),

.. `Shucked weight` = col_double(),

.. `Viscera weight` = col_double(),


.. `Shell weight` = col_double(),

.. R = col_double(),

.. Rings = col_double(),

.. W = col_double()

.. )

- attr(*, "problems")=<externalptr>

>

> datarf$D <- as.factor(datarf$D)

> table(datarf$D)

0 1

1116 3061

> set.seed(123)

>

> ind <- sample(2, nrow(datarf), replace=TRUE, prob=c(0.7,0.3))

> train <- datarf[ind==1,]

> test <- datarf[ind==2,]

> library(randomForest)

> install.packages("randomForest")

> set.seed(222)

> rf<-randomForest(D~W, data=train, ntree = 300, mtry = 8,importance = TRUE, proximity =


TRUE)

> print(rf)

Call:
randomForest(formula = D ~ W, data = train, ntree = 300, mtry = 8, importance =
TRUE, proximity = TRUE)

Type of random forest: classification

Number of trees: 300

No. of variables tried at each split: 1

OOB estimate of error rate: 7.98%

Confusion matrix:

0 1 class.error

0 548 232 0.2974358974

1 1 2137 0.0004677268

> attributes(rf)

$names

[1] "call" "type" "predicted" "err.rate"

[5] "confusion" "votes" "oob.times" "classes"

[9] "importance" "importanceSD" "localImportance" "proximity"

[13] "ntree" "mtry" "forest" "y"

[17] "test" "inbag" "terms"

$class

[1] "randomForest.formula" "randomForest"

> rf$confusion

0 1 class.error
0 548 232 0.2974358974

1 1 2137 0.0004677268

> library(caret)

> p1 <- predict(rf, train)

> head(p1)

1 2 3 4 5 6
1 1 1 1 1 1

Levels: 0 1

> head(train$D)

[1] 1 1 0 1 1 1

Levels: 0 1

> confusionMatrix(p1, train$D)

Confusion Matrix and Statistics

Reference

Prediction 0 1

0 548 1

1 232 2137
Accuracy : 0.9202

95% CI : (0.9097, 0.9297)

No Information Rate : 0.7327

P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.775

Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.7026
            Specificity : 0.9995
         Pos Pred Value : 0.9982
         Neg Pred Value : 0.9021
             Prevalence : 0.2673
         Detection Rate : 0.1878
   Detection Prevalence : 0.1881
      Balanced Accuracy : 0.8510
       'Positive' Class : 0

> p2 <- predict(rf, test)

> head(p2)
1 2 3 4 5 6
0 1 0 1 1 1

Levels: 0 1

> head(test$D)

[1] 0 1 0 1 1 1

Levels: 0 1

> confusionMatrix(p2, test$D)

Confusion Matrix and Statistics

Reference

Prediction 0 1

0 223 0

1 113 923

Accuracy : 0.9102

95% CI : (0.8931, 0.9255)

No Information Rate : 0.7331

P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.7432

Mcnemar's Test P-Value : < 2.2e-16


            Sensitivity : 0.6637
            Specificity : 1.0000
         Pos Pred Value : 1.0000
         Neg Pred Value : 0.8909
             Prevalence : 0.2669
         Detection Rate : 0.1771
   Detection Prevalence : 0.1771
      Balanced Accuracy : 0.8318
       'Positive' Class : 0

> plot(rf)
> hist(treesize(rf), main = "No. of nodes for the Trees", col = "green")

> varImpPlot(rf)
varImpPlot(rf, sort=T, n.var=10, main="Top 10 - Variable Importance")

> importance(rf)

0 1 MeanDecreaseAccuracy MeanDecreaseGini

W 321.381 257.9927 297.0707 722.7635

> varUsed(rf)

[1] 300

> partialPlot(rf, train, Height, "1")
> getTree(rf, 1, labelVar = TRUE)

left daughter right daughter split var split point status prediction

1 2 3 W 0.5 1 <NA>
2 0 0 <NA> 0.0 -1 0

3 0 0 <NA> 0.0 -1 1

> MDSplot(rf, train$D)

KNN
> data <- read.csv(file.choose(), header = T)

> data$D[data$D == 0] <- 'No'

> data$D[data$D == 1] <- 'Yes'

> data$D <- factor(data$D)

> set.seed(1234)

> ind <- sample(2, nrow(data), replace = T, prob = c(0.7, 0.3))

> training <- data[ind == 1,]

> test <- data[ind == 2,]

> trControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary)

> set.seed(222)
SVM

> View(abalone)

> data(abalone)

> data(Abalone)

> str(abalone)

tibble [4,177 x 14] (S3: tbl_df/tbl/data.frame)

$S : num [1:4177] 0 0 1 0 1 1 1 1 0 1 ...

$ Sex : chr [1:4177] "M" "M" "F" "M" ...

$ Length : num [1:4177] 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...

$D : num [1:4177] 1 0 1 1 0 0 1 1 1 1 ...

$ Diameter : num [1:4177] 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...

$H : num [1:4177] 0 0 1 1 0 0 1 1 1 1 ...

$ Height : num [1:4177] 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...

$ Whole weight : num [1:4177] 0.514 0.226 0.677 0.516 0.205 ...

$ Shucked weight: num [1:4177] 0.2245 0.0995 0.2565 0.2155 0.0895 ...

$ Viscera weight: num [1:4177] 0.101 0.0485 0.1415 0.114 0.0395 ...

$ Shell weight : num [1:4177] 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...

$R : num [1:4177] 1 0 0 0 0 0 1 1 0 1 ...

$ Rings : num [1:4177] 15 7 9 10 7 8 20 16 9 19 ...

$W : num [1:4177] 1 0 1 1 0 1 1 1 1 1 ...

> library(ggplot2)

> library(e1071)
> mymodel <- svm(D~., data=abalone)

> summary(mymodel)

Call:

svm(formula = D ~ ., data = abalone)

Parameters:

SVM-Type: eps-regression

SVM-Kernel: radial

cost: 1

gamma: 0.06666667

epsilon: 0.1

Number of Support Vectors: 1100

> plot(mymodel, data=abalone, abalone.Height~abalone.Length,


slice=list(abalone.Height=3,abalone.Length=4))

> library(predtoolsTS)

Warning message:

package ‘predtoolsTS’ was built under R version 4.0.5

> pred <- predict(mymodel, abalone)

> tab <- table(Predicted=pred, Actual=abalone$Height)

> tab

[Large, extremely sparse cross-tabulation of the continuous radial-kernel SVM predictions against Height omitted; nearly every cell is 0, and R stops printing after getOption("max.print") rows (4158 rows omitted).]

> 1-sum(diag(tab))/sum(tab)

[1] 0.9997606

> mymodel <- svm(Diameter~Length, data=abalone, kernel="linear")

> summary(mymodel)

Call:

svm(formula = Diameter ~ Length, data = abalone, kernel = "linear")

Parameters:

SVM-Type: eps-regression

SVM-Kernel: linear

cost: 1

gamma: 1

epsilon: 0.1

Number of Support Vectors: 1878

> plot(mymodel, data=abalone, abalone.Height~abalone.Length,


slice=list(abalone.Height=3,abalone.Length=4))

> pred <- predict(mymodel, abalone)

> tab <- table(Predicted=pred, Actual=abalone$Height)

> tab

[Large, extremely sparse cross-tabulation of the linear-kernel SVM predictions against Height omitted; nearly every cell is 0, and R stops printing after getOption("max.print") rows (115 rows omitted).]

> 1-sum(diag(tab))/sum(tab)

[1] 0.9997606

> data=abalone,kernel="polynomial")

Error: unexpected ',' in "data=abalone,"

> mymodel <- svm(Diameter~Length, data=abalone,kernel="polynomial")

> summary(mymodel)

Call:

svm(formula = Diameter ~ Length, data = abalone, kernel = "polynomial")

Parameters:

SVM-Type: eps-regression

SVM-Kernel: polynomial

cost: 1

degree: 3

gamma: 1

coef.0: 0

epsilon: 0.1

Number of Support Vectors: 3745


> plot(mymodel, data=abalone, abalone.Height~abalone.Length,
slice=list(abalone.Height=3,abalone.Length=4))

> library(predtoolsTS)

> pred <- predict(mymodel, abalone)

> tab <- table(Predicted=pred, Actual=abalone$Height)

> tab

[Large, extremely sparse cross-tabulation of the polynomial-kernel SVM predictions against Height omitted.]

> mymodel <- svm(Diameter~Length, data=abalone, kernel="sigmoid")
> summary(mymodel)

Call:
svm(formula = Diameter ~ Length, data = abalone, kernel = "sigmoid")

Parameters:

SVM-Type: eps-regression
SVM-Kernel: sigmoid

cost: 1

gamma: 1

coef.0: 0

epsilon: 0.1

Number of Support Vectors: 4149

> plot(mymodel, data=abalone, abalone.Height~abalone.Length,


slice=list(abalone.Height=3,abalone.Length=4))

> pred <- predict(mymodel, abalone)

>

> tab <- table(Predicted=pred, Actual=abalone$Height)

> tab

[Large, extremely sparse cross-tabulation of the sigmoid-kernel SVM predictions against Height omitted; the listing is truncated here.]