Stroke Prediction Dataset
Javier Galindos
5/29/2021
Data structure
## 'data.frame': 5110 obs. of 12 variables:
## $ id : int 9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
## $ gender : chr "Male" "Female" "Male" "Female" ...
## $ age : num 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : int 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : int 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : chr "Yes" "Yes" "Yes" "Yes" ...
## $ work_type : chr "Private" "Self-employed" "Private" "Private" ...
## $ Residence_type : chr "Urban" "Rural" "Rural" "Urban" ...
## $ avg_glucose_level: num 229 202 106 171 174 ...
## $ bmi : chr "36.6" "N/A" "32.5" "34.4" ...
## $ smoking_status : chr "formerly smoked" "never smoked" "never smoked" "smokes" ...
## $ stroke : int 1 1 1 1 1 1 1 1 1 1 ...
Data structure
## id gender age hypertension heart_disease ever_married work_type
## 1 9046 Male 67 0 1 Yes Private
## 2 51676 Female 61 0 0 Yes Self-employed
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.6 formerly smoked 1
## 2 Rural 202.21 N/A never smoked 1
## 3 Rural 105.92 32.5 never smoked 1
## 4 Urban 171.23 34.4 smokes 1
## 5 Rural 174.12 24 never smoked 1
## 6 Urban 186.21 29 formerly smoked 1
Data cleaning: handling missing data
First, let's check the values taken by the categorical variables.
## Gender:
## Married:
## Work type:
## Residence type:
## Smoking:
How many N/A values are in the dataset?
# how many "N/A" or "Unknown" values are in the dataset, per column?
# (miss_scan_count() comes from the naniar package)
miss_scan_count(data = StrokeData, search = list("N/A", "Unknown"))
## # A tibble: 12 x 2
## Variable n
## <chr> <int>
## 1 id 0
## 2 gender 0
## 3 age 0
## 4 hypertension 0
## 5 heart_disease 0
## 6 ever_married 0
## 7 work_type 0
## 8 Residence_type 0
## 9 avg_glucose_level 0
## 10 bmi 201
## 11 smoking_status 1544
## 12 stroke 0
We can see 201 "N/A" strings in bmi and 1544 "Unknown" values in smoking_status. These placeholders will be recoded as proper NA missing values so that they can be imputed.
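A minimal base-R sketch of this recoding, using a stand-in three-row data frame (in the analysis, StrokeData is the full 5110-row dataset):

```r
# Stand-in rows with the same placeholder strings as the real data
StrokeData <- data.frame(
  bmi            = c("36.6", "N/A", "32.5"),
  smoking_status = c("formerly smoked", "Unknown", "never smoked"),
  stringsAsFactors = FALSE
)

# Recode the placeholder strings as real missing values (NA)
StrokeData$bmi[StrokeData$bmi == "N/A"] <- NA
StrokeData$smoking_status[StrokeData$smoking_status == "Unknown"] <- NA

# bmi was read in as character, so convert it to numeric after recoding
StrokeData$bmi <- as.numeric(StrokeData$bmi)
```

Once the placeholders are genuine NAs, imputation functions can detect and fill them.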
BMI
Let's check whether this was an appropriate imputation for this variable.
# Explore the median values in bmi in the imputed dataset
ggplot(StrokeData.imp,
       aes(x = bmi_NA, y = bmi)) +
  geom_boxplot() +
  labs(title = "Comparison: non-missing vs. imputed values for BMI")
BMI
Looks like this worked well.
Smoking status

Next up is to handle smoking_status. We will use a random forest to impute those values.

StrokeData = rfImpute(stroke ~ ., data = StrokeData, iter = 6)

p1 = ggplot(StrokeData,
            aes(x = smoking_status, fill = smoking_status)) +
  geom_bar()
Descriptive Statistics

Scatterplot Matrices

Division by sex

Distribution by age

People with hypertension or without hypertension

Distribution by Smoking Status
Train / Test set.
We are going to split the data into 70% for training and 30% for testing.
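One way to sketch this split in base R (the exact call in the analysis may differ, and the seed is illustrative); note that 70% of 5110 rows gives the 3577 training and 1533 test observations that appear in the model summaries later:

```r
set.seed(42)   # illustrative seed for reproducibility
n <- 5110      # number of observations in the dataset
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

length(train_idx)      # 3577 training rows
n - length(train_idx)  # 1533 test rows
# In the analysis: StrokeTrain <- StrokeData[train_idx, ]
#                  StrokeTest  <- StrokeData[-train_idx, ]
```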
Decision Tree
First of all, we will build a simple decision tree and then prune it.
tree.Stroke=tree(stroke~.,data=StrokeTrain)
summary(tree.Stroke)
##
## Classification tree:
## tree(formula = stroke ~ ., data = StrokeTrain)
## Variables actually used in tree construction:
## [1] "age" "avg_glucose_level"
## Number of terminal nodes: 5
## Residual mean deviance: 0.2956 = 1056 / 3572
## Misclassification error rate: 0.04529 = 162 / 3577
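A quick arithmetic check of the reported misclassification rate (base R):

```r
# 162 misclassified observations out of 3577 training rows
round(162 / 3577, 5)  # 0.04529, matching the summary above
```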
BMI
Finally, we will also convert bmi from a continuous variable to a factor
according to the bmi categories.
Stroke.Oversample = Stroke.Oversample %>%
  mutate(bmi = case_when(bmi < 18.5 ~ "underweight",
                         bmi >= 18.5 & bmi < 25 ~ "normal weight",
                         bmi >= 25 & bmi < 30 ~ "overweight",
                         bmi >= 30 ~ "obese"),
         bmi = factor(bmi, levels = c("underweight", "normal weight",
                                      "overweight", "obese"), ordered = TRUE))

StrokeTest.Oversample = StrokeTest.Oversample %>%
  mutate(bmi = case_when(bmi < 18.5 ~ "underweight",
                         bmi >= 18.5 & bmi < 25 ~ "normal weight",
                         bmi >= 25 & bmi < 30 ~ "overweight",
                         bmi >= 30 ~ "obese"),
         bmi = factor(bmi, levels = c("underweight", "normal weight",
                                      "overweight", "obese"), ordered = TRUE))
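The same bucketing can be sketched with base R's cut(); the breaks mirror the categories above (the sample BMI values are illustrative):

```r
bmi_vals <- c(17.0, 22.3, 27.8, 33.1)  # illustrative BMI values
bmi_cat <- cut(bmi_vals,
               breaks = c(-Inf, 18.5, 25, 30, Inf),
               labels = c("underweight", "normal weight", "overweight", "obese"),
               right = FALSE,          # intervals are [lower, upper), matching the case_when
               ordered_result = TRUE)  # ordered factor, like the mutate() version
as.character(bmi_cat)
```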
New decision tree
Let's fit the tree on the balanced dataset.
tree.Stroke.Oversample = tree(stroke ~ ., data = Stroke.Oversample,
                              split = "deviance")
plot(tree.Stroke.Oversample);text(tree.Stroke.Oversample,pretty=0)
New decision tree
Prune the tree to four terminal nodes and check its performance:
prunetree.Stroke.Oversample=prune.tree(tree.Stroke.Oversample,best=4)
summary(prunetree.Stroke.Oversample)
##
## Classification tree:
## snip.tree(tree = tree.Stroke.Oversample, nodes = 2L)
## Variables actually used in tree construction:
## [1] "age" "avg_glucose_level"
## Number of terminal nodes: 4
## Residual mean deviance: 0.7715 = 5266 / 6826
## Misclassification error rate: 0.1432 = 978 / 6830
plot(prunetree.Stroke.Oversample);text(prunetree.Stroke.Oversample,pretty=0)
Check performance:

Random Forest

## If we want to compare random forests with different values of mtry
## (which controls how many variables are considered at each split)...
oob.values <- vector(length = 10)
for (i in 1:10) {
  temp.model <- randomForest(stroke ~ ., data = Stroke.Oversample, mtry = i, ntree = 500)
  oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate), 1]
}
oob.values
## [1] 0.05373353
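Having collected the ten OOB errors, the best mtry is simply the index of the smallest one. The vector below is a stand-in (only the first value of oob.values is shown above); in the analysis it comes from the loop:

```r
# Stand-in OOB errors for mtry = 1..10 (illustrative values, not the real output)
oob.values <- c(0.0610, 0.0568, 0.0551, 0.0537, 0.0549,
                0.0562, 0.0571, 0.0583, 0.0590, 0.0601)
best.mtry <- which.min(oob.values)
best.mtry  # the analysis settles on mtry = 4
```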
## fit the final model using the best value for mtry
rf.fit <- randomForest(stroke ~ .,
                       data = Stroke.Oversample,
                       ntree = 2000,
                       # proximity = TRUE,  # enable to also compute proximities
                       mtry = 4)
rf.fit
##
## Call:
## randomForest(formula = stroke ~ ., data = Stroke.Oversample, ntree = 2000, mtry =
## Type of random forest: classification
## Number of trees: 2000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 5.71%
## Confusion matrix:
## no yes class.error
## no 3120 295 0.08638360
## yes 95 3320 0.02781845
## Now check to see if the random forest is actually big enough...
## Up to a point, the more trees in the forest, the better. You can tell when
## you've made enough when the OOB no longer improves.
oob.error.data <- data.frame(
Trees=rep(1:nrow(rf.fit$err.rate), times=3),
Type=rep(c("OOB", "no stroke", "have stroke"), each=nrow(rf.fit$err.rate)),
Error=c(rf.fit$err.rate[,"OOB"],
rf.fit$err.rate[,"no"],
rf.fit$err.rate[,"yes"]))
Random forest
importance(rf.fit)
## MeanDecreaseGini
## gender 65.20942
## age 1882.06956
## hypertension 86.05656
## heart_disease 67.81507
## ever_married 92.08252
## work_type 168.42082
## Residence_type 75.68467
## avg_glucose_level 642.55035
## bmi 110.54556
## smoking_status 133.71465
Logistic Regression
logistic.fit=glm(stroke~.,data=StrokeTrain,family=binomial)
summary(logistic.fit)
##
## Call:
## glm(formula = stroke ~ ., family = binomial, data = StrokeTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0886 -0.3101 -0.1640 -0.0951 3.4469
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.986e+00 8.104e-01 -7.386 1.51e-13 ***
## genderMale -2.208e-01 1.770e-01 -1.248 0.21216
## genderOther -1.072e+01 1.455e+03 -0.007 0.99412
## age 7.116e-02 7.064e-03 10.074 < 2e-16 ***
## hypertension1 3.639e-01 2.015e-01 1.806 0.07096 .
## heart_disease1 2.131e-01 2.386e-01 0.893 0.37194
## ever_marriedYes -6.559e-02 2.791e-01 -0.235 0.81419
## work_typeGovt_job -1.391e+00 8.879e-01 -1.567 0.11715
## work_typeNever_worked -1.071e+01 3.622e+02 -0.030 0.97641
## work_typePrivate -1.236e+00 8.615e-01 -1.435 0.15140
## work_typeSelf-employed -1.577e+00 8.921e-01 -1.767 0.07717 .
logistic.prob=predict(logistic.fit,type="response", newdata = StrokeTest)
logistic.pred=ifelse(logistic.prob>0.5, 'yes','no')
table(logistic.pred,StrokeTest$stroke)
##
## logistic.pred no yes
## no 1446 87
confusionMatrix(as.factor(logistic.pred),StrokeTest[["stroke"]],positive = "yes")
## Warning in confusionMatrix.default(as.factor(logistic.pred),
## StrokeTest[["stroke"]], : Levels are not in the same order for reference and
## data. Refactoring data to match.
logistic.fit.os=glm(stroke~.,data=Stroke.Oversample,family=binomial)
summary(logistic.fit.os)
##
## Call:
## glm(formula = stroke ~ ., family = binomial, data = Stroke.Oversample)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9337 -0.3761 0.0823 0.6146 3.3661
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -1.005e+01 5.621e+01 -0.179
## gendergender.Male -2.285e-01 7.551e-02 -3.025
## gendergender.Other -1.377e+01 3.956e+03 -0.003
## age 1.222e-01 3.508e-03 34.839
## hypertensionhypertension.1 3.719e-01 8.887e-02 4.184
## heart_diseaseheart_disease.1 1.570e-01 1.115e-01 1.409
## Confusion Matrix and Statistics
##
##
## logistic.pred.os no yes
## no 1155 26
## yes 291 61
##
## Accuracy : 0.7932
## 95% CI : (0.7721, 0.8132)
## No Information Rate : 0.9432
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2056
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.70115
## Specificity : 0.79876
## Pos Pred Value : 0.17330
## Neg Pred Value : 0.97798
## Prevalence : 0.05675
## Detection Rate : 0.03979
## Detection Prevalence : 0.22962
## Balanced Accuracy : 0.74995
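As a sanity check, the headline statistics can be recomputed directly from the 2x2 table above with base R:

```r
# Confusion matrix from the output above: rows = prediction, columns = reference
cm <- matrix(c(1155, 291, 26, 61), nrow = 2,
             dimnames = list(pred = c("no", "yes"), ref = c("no", "yes")))

sens <- cm["yes", "yes"] / sum(cm[, "yes"])  # 61 / 87
spec <- cm["no",  "no"]  / sum(cm[, "no"])   # 1155 / 1446
acc  <- sum(diag(cm)) / sum(cm)              # (1155 + 61) / 1533

# matches the 0.70115, 0.79876 and 0.7932 reported above
round(c(sensitivity = sens, specificity = spec, accuracy = acc), 4)
```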
Logistic Regression Oversample
Bayes classifiers
LDA
library(MASS)
##
## Attaching package: 'MASS'
library(class)
lda.fit = lda(stroke~.,data=StrokeTrain)
lda.fit
## Call:
## lda(stroke ~ ., data = StrokeTrain)
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1441 79
## yes 5 8
##
## Accuracy : 0.9452
## 95% CI : (0.9326, 0.9561)
## No Information Rate : 0.9432
## P-Value [Acc > NIR] : 0.3971
##
## Kappa : 0.1474
##
## Mcnemar's Test P-Value : 1.653e-15
##
## Sensitivity : 0.091954
## Specificity : 0.996542
## Pos Pred Value : 0.615385
## Neg Pred Value : 0.948026
## Prevalence : 0.056751
## Detection Rate : 0.005219
## Detection Prevalence : 0.008480
## Balanced Accuracy : 0.544248
LDA
lda.fit.os = lda(stroke~.,data=Stroke.Oversample)
summary(lda.fit.os)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1110 21
## yes 336 66
##
## Accuracy : 0.7671
## 95% CI : (0.7451, 0.7881)
## No Information Rate : 0.9432
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1948
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.75862
## Specificity : 0.76763
## Pos Pred Value : 0.16418
## Neg Pred Value : 0.98143
## Prevalence : 0.05675
## Detection Rate : 0.04305
## Detection Prevalence : 0.26223
## Balanced Accuracy : 0.76313
LDA
# Plot good classifications in red and bad classifications in black
col.lda.pt <- c("black", "indianred1")[1*(lda.pred.os$class == StrokeTest.Oversample$stroke) + 1]
QDA