
Stroke Prediction dataset

Javier Galindos
5/29/2021
Data structure
## 'data.frame': 5110 obs. of 12 variables:
## $ id : int 9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
## $ gender : chr "Male" "Female" "Male" "Female" ...
## $ age : num 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : int 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : int 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : chr "Yes" "Yes" "Yes" "Yes" ...
## $ work_type : chr "Private" "Self-employed" "Private" "Private" ...
## $ Residence_type : chr "Urban" "Rural" "Rural" "Urban" ...
## $ avg_glucose_level: num 229 202 106 171 174 ...
## $ bmi : chr "36.6" "N/A" "32.5" "34.4" ...
## $ smoking_status : chr "formerly smoked" "never smoked" "never smoked" "smokes" ...
## $ stroke : int 1 1 1 1 1 1 1 1 1 1 ...

Data structure
## id gender age hypertension heart_disease ever_married work_type
## 1 9046 Male 67 0 1 Yes Private
## 2 51676 Female 61 0 0 Yes Self-employed
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.6 formerly smoked 1
## 2 Rural 202.21 N/A never smoked 1
## 3 Rural 105.92 32.5 never smoked 1
## 4 Urban 171.23 34.4 smokes 1
## 5 Rural 174.12 24 never smoked 1
## 6 Urban 186.21 29 formerly smoked 1

Data cleaning. Handling missing data.
First of all, we will check the values taken by the categorical variables.

## Gender:

## [1] "Male" "Female" "Other"

## Married:

## [1] "Yes" "No"

## Work type:

## [1] "Private" "Self-employed" "Govt_job" "children"


## [5] "Never_worked"

## Residence type:

## [1] "Urban" "Rural"

## Smoking:

## [1] "formerly smoked" "never smoked" "smokes" "Unknown"

How many N/A are in the dataset?
# how many "N/A" values are in my dataset per column?
miss_scan_count(data = StrokeData, search = list("N/A", "Unknown"))

## # A tibble: 12 x 2
## Variable n
## <chr> <int>
## 1 id 0
## 2 gender 0
## 3 age 0
## 4 hypertension 0
## 5 heart_disease 0
## 6 ever_married 0
## 7 work_type 0
## 8 Residence_type 0
## 9 avg_glucose_level 0
## 10 bmi 201
## 11 smoking_status 1544
## 12 stroke 0

How many N/A are in the dataset?
It can be observed that there are "N/A" strings in bmi and "Unknown" values in smoking_status.
They will be replaced with proper NA values.

# replace the "N/A" in bmi


StrokeData = replace_with_na(data = StrokeData, replace = list(bmi = c("N/A"), smoking_stat
# change bmi to numeric
mutate(bmi = as.numeric(bmi))
# check
summary(StrokeData)

##        id        gender               age         hypertension
##  Min.   :   67   Length:5110        Min.   : 0.08   Min.   :0.00000
##  1st Qu.:17741   Class :character   1st Qu.:25.00   1st Qu.:0.00000
##  Median :36932   Mode  :character   Median :45.00   Median :0.00000
##  Mean   :36518                      Mean   :43.23   Mean   :0.09746
##  3rd Qu.:54682                      3rd Qu.:61.00   3rd Qu.:0.00000
##  Max.   :72940                      Max.   :82.00   Max.   :1.00000
##
##  heart_disease     ever_married        work_type         Residence_type
##  Min.   :0.00000   Length:5110        Length:5110        Length:5110
##  1st Qu.:0.00000   Class :character   Class :character   Class :character
##  Median :0.00000   Mode  :character   Mode  :character   Mode  :character
BMI
We are going to assume this missingness is MCAR (missing completely at
random). This implies that we can impute the bmi values with the mean or
median.
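One way to probe the MCAR assumption (a step not taken on the slides) is Little's MCAR test; a minimal sketch, assuming the naniar package provides mcar_test() (introduced in naniar 0.6.0):

# Not in the original: Little's MCAR test on the numeric variables.
library(naniar)
library(dplyr)
StrokeData %>%
  select(age, avg_glucose_level, bmi) %>%
  mcar_test()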

ggplot(data = StrokeData, aes(x = bmi)) +
  geom_histogram(color = 'black', fill = 'steelblue') +  # fill color assumed; the original value was truncated
  labs(title = 'Distribution of BMI', y = 'frequency', x = 'bmi')

## Warning: Removed 201 rows containing non-finite values (stat_bin).

BMI
Let's check whether median imputation is appropriate for this variable.

# impute median and bind shadow to evaluate imputation
StrokeData.imp = bind_shadow(StrokeData) %>%
  impute_median_at(.vars = c("bmi")) %>%
  add_label_shadow()

# Explore the median values in bmi in the imputed dataset
# (bare column names are used in aes(); the $-style access from the slide is discouraged)
ggplot(StrokeData.imp,
       aes(x = bmi_NA, y = bmi)) +
  geom_boxplot() +
  labs(title = "Comparison, non-missing vs. imputed values for BMI")

BMI
Looks like this worked well.

StrokeData = impute_median_at(StrokeData, .vars = c("bmi"))

Let's also convert the categorical variables to factors.

StrokeData = StrokeData %>%
  mutate(across(c(hypertension, heart_disease), factor),
         across(where(is.character), as.factor),
         stroke = factor(ifelse(stroke == 0, "no", "yes")))

Smoking status
Next up is smoking_status. We will use a random forest (rfImpute) to impute those values.

StrokeData = rfImpute(stroke ~ ., data = StrokeData, iter = 6)

## ntree      OOB      1       2
##   300:   4.99%   0.12% 100.00%
## ntree      OOB      1       2
##   300:   4.95%   0.10%  99.60%
## ntree      OOB      1       2
##   300:   4.97%   0.10% 100.00%
## ntree      OOB      1       2
##   300:   4.95%   0.10%  99.60%
## ntree      OOB      1       2
##   300:   4.93%   0.06% 100.00%
## ntree      OOB      1       2
##   300:   4.99%   0.12% 100.00%

p1 = ggplot(StrokeData,
            aes(x = smoking_status, fill = smoking_status)) +
  geom_bar()
Descriptive Statistics
Scatterplot Matrices

Division by sex
Distribution by age

The older the patient, the more likely a stroke.
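The figure itself is not reproduced here; a density plot along these lines (assuming ggplot2 is loaded) would support the claim:

# Not in the original: age distribution split by stroke status.
ggplot(StrokeData, aes(x = age, fill = stroke)) +
  geom_density(alpha = 0.4) +
  labs(title = "Age distribution by stroke status")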

People with and without hypertension

Distribution by Smoking Status

Train / Test set.
We are going to split 70% for training and 30% for testing.

# Partition into training and testing data
# (column 2 is dropped; after rfImpute this is presumably the id column)
train = sample(nrow(StrokeData), nrow(StrokeData) * 0.7)
StrokeTrain = StrokeData[train, -2]
StrokeTest = StrokeData[-train, -2]
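A side note: sample() draws at random and no seed is shown on the slide, so for a reproducible split one would typically fix the RNG state first (an assumption, not from the original):

# Not in the original: fixing the seed makes the split reproducible.
set.seed(2021)  # hypothetical seed -- would precede the sample() call above
train = sample(nrow(StrokeData), nrow(StrokeData) * 0.7)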

Decision Tree
First of all, we will build a simple decision tree and then prune it.

tree.Stroke = tree(stroke ~ ., data = StrokeTrain)
summary(tree.Stroke)

##
## Classification tree:
## tree(formula = stroke ~ ., data = StrokeTrain)
## Variables actually used in tree construction:
## [1] "age" "avg_glucose_level"
## Number of terminal nodes: 5
## Residual mean deviance: 0.2956 = 1056 / 3572
## Misclassification error rate: 0.04529 = 162 / 3577
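From this point on the slides use a balanced dataset, Stroke.Oversample, together with a companion test set StrokeTest.Oversample; the slides that construct them are missing from this extract. A minimal sketch, assuming caret's upSample was used for the training data (the actual construction may differ):

# Hypothetical reconstruction -- the original construction is not shown.
library(caret)
Stroke.Oversample <- upSample(x = StrokeTrain[, setdiff(names(StrokeTrain), "stroke")],
                              y = StrokeTrain$stroke,
                              yname = "stroke")
StrokeTest.Oversample <- StrokeTest  # assumption: the test set is left unbalanced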

BMI
Finally, we will also convert bmi from a continuous variable to a factor
according to the bmi categories.

Stroke.Oversample = Stroke.Oversample %>%
  mutate(bmi = case_when(bmi < 18.5 ~ "underweight",
                         bmi >= 18.5 & bmi < 25 ~ "normal weight",
                         bmi >= 25 & bmi < 30 ~ "overweight",
                         bmi >= 30 ~ "obese"),
         bmi = factor(bmi, levels = c("underweight", "normal weight",
                                      "overweight", "obese"), ordered = TRUE))
StrokeTest.Oversample = StrokeTest.Oversample %>%
  mutate(bmi = case_when(bmi < 18.5 ~ "underweight",
                         bmi >= 18.5 & bmi < 25 ~ "normal weight",
                         bmi >= 25 & bmi < 30 ~ "overweight",
                         bmi >= 30 ~ "obese"),
         bmi = factor(bmi, levels = c("underweight", "normal weight",
                                      "overweight", "obese"), ordered = TRUE))

New decision tree
Let's model the tree with the balanced dataset.

tree.Stroke.Oversample = tree(stroke ~ ., data = Stroke.Oversample,
                              split = "deviance")
plot(tree.Stroke.Oversample); text(tree.Stroke.Oversample, pretty = 0)

New decision tree
Performance in test set:

## Confusion Matrix and Statistics


##
## Reference
## Prediction no yes
## no 1144 23
## yes 302 64
##
## Accuracy : 0.788
## 95% CI : (0.7667, 0.8082)
## No Information Rate : 0.9432
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2101
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.73563
## Specificity : 0.79115
## Pos Pred Value : 0.17486
## Neg Pred Value : 0.98029
## Prevalence : 0.05675
Pruning the tree
It can be observed that beyond 4 leaves the deviance barely decreases. We therefore prune to 4 terminal nodes, favouring better interpretability.
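The deviance-versus-size curve referred to above is not reproduced here; a sketch, assuming the tree package, of how it is typically obtained:

# Not in the original: cross-validated deviance against tree size.
cv.Stroke <- cv.tree(tree.Stroke.Oversample, FUN = prune.tree)
plot(cv.Stroke$size, cv.Stroke$dev, type = "b",
     xlab = "Number of terminal nodes", ylab = "Deviance")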

prunetree.Stroke.Oversample=prune.tree(tree.Stroke.Oversample,best=4)
summary(prunetree.Stroke.Oversample)

##
## Classification tree:
## snip.tree(tree = tree.Stroke.Oversample, nodes = 2L)
## Variables actually used in tree construction:
## [1] "age" "avg_glucose_level"
## Number of terminal nodes: 4
## Residual mean deviance: 0.7715 = 5266 / 6826
## Misclassification error rate: 0.1432 = 978 / 6830

plot(prunetree.Stroke.Oversample);text(prunetree.Stroke.Oversample,pretty=0)

Check performance:

## Confusion Matrix and Statistics
##
##           Reference
## Prediction   no  yes
##        no  1144   23
##        yes  302   64
##
##                Accuracy : 0.788
##                  95% CI : (0.7667, 0.8082)
##     No Information Rate : 0.9432
##     P-Value [Acc > NIR] : 1
##
##                   Kappa : 0.2101
##
##  Mcnemar's Test P-Value : <2e-16
##
##             Sensitivity : 0.73563
##             Specificity : 0.79115
##          Pos Pred Value :
Random forest
It can be observed that after approximately 250 trees the OOB error reaches a plateau.
Thus, 500 trees are enough.

## If we want to compare this random forest to others with different values for
## mtry (to control how many variables are considered at each step)...
oob.values <- vector(length=10)
for(i in 1:10) {
temp.model <- randomForest(stroke ~ ., data=Stroke.Oversample,mtry=i, ntree=500)
oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate),1]
}
oob.values

## [1] 0.14202050 0.09311859 0.06573939 0.05959004 0.05710102 0.05739385
## [7] 0.05710102 0.05475842 0.05431918 0.05373353

## find the minimum error
min(oob.values)

## [1] 0.05373353
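To see which mtry attains that minimum (a step not shown on the slide):

# Not in the original: the index corresponds to the mtry value.
which.min(oob.values)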

## create a model for proximities using a well-performing value for mtry
rf.fit <- randomForest(stroke ~ .,
data=Stroke.Oversample,
ntree=2000,
#proximity=TRUE,
mtry=4)
rf.fit

##
## Call:
## randomForest(formula = stroke ~ ., data = Stroke.Oversample, ntree = 2000, mtry = 4)
## Type of random forest: classification
## Number of trees: 2000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 5.71%
## Confusion matrix:
## no yes class.error
## no 3120 295 0.08638360
## yes 95 3320 0.02781845

## Now check to see if the random forest is actually big enough...
## Up to a point, the more trees in the forest, the better. You can tell when
## you've made enough when the OOB no longer improves.
oob.error.data <- data.frame(
Trees=rep(1:nrow(rf.fit$err.rate), times=3),
Type=rep(c("OOB", "no stroke", "have stroke"), each=nrow(rf.fit$err.rate)),
Error=c(rf.fit$err.rate[,"OOB"],
rf.fit$err.rate[,"no"],
rf.fit$err.rate[,"yes"]))

ggplot(data=oob.error.data, aes(x=Trees, y=Error)) +
  geom_line(aes(color=Type))

Random forest

rf.pred <- predict(rf.fit, newdata = StrokeTest.Oversample)
confusionMatrix(rf.pred, StrokeTest.Oversample[["stroke"]], positive = "yes")

## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1323 63
## yes 123 24
##
## Accuracy : 0.8787
## 95% CI : (0.8613, 0.8946)
## No Information Rate : 0.9432
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1441
##
## Mcnemar's Test P-Value : 1.518e-05
##
## Sensitivity : 0.27586
## Specificity : 0.91494
## Pos Pred Value : 0.16327
## Neg Pred Value : 0.95455
Random forest

importance(rf.fit)

## MeanDecreaseGini
## gender 65.20942
## age 1882.06956
## hypertension 86.05656
## heart_disease 67.81507
## ever_married 92.08252
## work_type 168.42082
## Residence_type 75.68467
## avg_glucose_level 642.55035
## bmi 110.54556
## smoking_status 133.71465

33/48
Random forest
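The figure on this slide is not reproduced; it is presumably the variable-importance plot, which the randomForest package draws with:

# Presumed source of the missing figure:
varImpPlot(rf.fit)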

Logistic Regression
logistic.fit=glm(stroke~.,data=StrokeTrain,family=binomial)
summary(logistic.fit)

##
## Call:
## glm(formula = stroke ~ ., family = binomial, data = StrokeTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0886 -0.3101 -0.1640 -0.0951 3.4469
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.986e+00 8.104e-01 -7.386 1.51e-13 ***

## genderMale -2.208e-01 1.770e-01 -1.248 0.21216
## genderOther -1.072e+01 1.455e+03 -0.007 0.99412
## age 7.116e-02 7.064e-03 10.074 < 2e-16 ***
## hypertension1 3.639e-01 2.015e-01 1.806 0.07096 .
## heart_disease1 2.131e-01 2.386e-01 0.893 0.37194
## ever_marriedYes -6.559e-02 2.791e-01 -0.235 0.81419
## work_typeGovt_job -1.391e+00 8.879e-01 -1.567 0.11715
## work_typeNever_worked -1.071e+01 3.622e+02 -0.030 0.97641
## work_typePrivate -1.236e+00 8.615e-01 -1.435 0.15140
## work_typeSelf-employed -1.577e+00 8.921e-01 -1.767 0.07717 .
logistic.prob=predict(logistic.fit,type="response", newdata = StrokeTest)
logistic.pred=ifelse(logistic.prob>0.5, 'yes','no')
table(logistic.pred,StrokeTest$stroke)

##
## logistic.pred no yes
## no 1446 87

confusionMatrix(as.factor(logistic.pred),StrokeTest[["stroke"]],positive = "yes")

## Warning in confusionMatrix.default(as.factor(logistic.pred),
## StrokeTest[["stroke"]], : Levels are not in the same order for reference and
## data. Refactoring data to match.

## Confusion Matrix and Statistics


##
## Reference
## Prediction no yes
## no 1446 87
## yes 0 0
##
## Accuracy : 0.9432
## 95% CI : (0.9305, 0.9543)
## No Information Rate : 0.9432
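At the default 0.5 cutoff the model trained on the unbalanced data never predicts "yes", so accuracy merely matches the no-information rate. A sketch, assuming the pROC package, of how a more sensitive cutoff could be chosen from the ROC curve:

# Not in the original: pick a threshold by the Youden criterion.
library(pROC)
roc.obj <- roc(StrokeTest$stroke, logistic.prob)
coords(roc.obj, "best", ret = c("threshold", "sensitivity", "specificity"))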
Logistic Regression Oversample

logistic.fit.os=glm(stroke~.,data=Stroke.Oversample,family=binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(logistic.fit.os)

##
## Call:
## glm(formula = stroke ~ ., family = binomial, data = Stroke.Oversample)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9337 -0.3761 0.0823 0.6146 3.3661
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -1.005e+01 5.621e+01 -0.179
## gendergender.Male -2.285e-01 7.551e-02 -3.025
## gendergender.Other -1.377e+01 3.956e+03 -0.003
## age 1.222e-01 3.508e-03 34.839
## hypertensionhypertension.1 3.719e-01 8.887e-02 4.184
## heart_diseaseheart_disease.1 1.570e-01 1.115e-01 1.409
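The remaining coefficients and the prediction step fall on slides missing from this extract; a sketch of the usual calls that would produce the confusion matrix below:

# Hypothetical reconstruction of the missing prediction step:
logistic.prob.os <- predict(logistic.fit.os, type = "response",
                            newdata = StrokeTest.Oversample)
logistic.pred.os <- ifelse(logistic.prob.os > 0.5, "yes", "no")
confusionMatrix(as.factor(logistic.pred.os),
                StrokeTest.Oversample[["stroke"]], positive = "yes")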
## Confusion Matrix and Statistics
##
##
## logistic.pred.os no yes
## no 1155 26
## yes 291 61
##
## Accuracy : 0.7932
## 95% CI : (0.7721, 0.8132)
## No Information Rate : 0.9432
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2056
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.70115
## Specificity : 0.79876
## Pos Pred Value : 0.17330
## Neg Pred Value : 0.97798
## Prevalence : 0.05675
## Detection Rate : 0.03979
## Detection Prevalence : 0.22962
## Balanced Accuracy : 0.74995
Logistic Regression Oversample

## now we can plot the data
predicted.data <- data.frame(
  probability.of.stroke = logistic.prob.os,
  stroke = StrokeTest.Oversample$stroke)
predicted.data <- predicted.data[
order(predicted.data$probability.of.stroke, decreasing=FALSE),]
predicted.data$rank <- 1:nrow(predicted.data)
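The plot built from predicted.data is not reproduced here; a sketch of the usual ggplot call:

# Presumed plotting step for the ranked probabilities (figure missing):
ggplot(predicted.data, aes(x = rank, y = probability.of.stroke, color = stroke)) +
  geom_point(alpha = 0.5) +
  labs(x = "Index (sorted by predicted probability)",
       y = "Predicted probability of stroke")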

Bayes classifiers
LDA

library(MASS)

##
## Attaching package: 'MASS'

## The following object is masked from 'package:patchwork':
##
##     area

## The following object is masked from 'package:dplyr':
##
##     select

library(class)

lda.fit = lda(stroke~.,data=StrokeTrain)
lda.fit

## Call:
## lda(stroke ~ ., data = StrokeTrain)
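The prediction step producing the confusion matrix below fell on a missing slide; a sketch of the usual calls:

# Hypothetical reconstruction of the missing step:
lda.pred <- predict(lda.fit, newdata = StrokeTest)
confusionMatrix(lda.pred$class, StrokeTest[["stroke"]], positive = "yes")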
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1441 79
## yes 5 8
##
## Accuracy : 0.9452
## 95% CI : (0.9326, 0.9561)
## No Information Rate : 0.9432
## P-Value [Acc > NIR] : 0.3971
##
## Kappa : 0.1474
##
## Mcnemar's Test P-Value : 1.653e-15
##
## Sensitivity : 0.091954
## Specificity : 0.996542
## Pos Pred Value : 0.615385
## Neg Pred Value : 0.948026
## Prevalence : 0.056751
## Detection Rate : 0.005219
## Detection Prevalence : 0.008480
## Balanced Accuracy : 0.544248
LDA

lda.fit.os = lda(stroke~.,data=Stroke.Oversample)
summary(lda.fit.os)

##         Length Class  Mode
## prior    2     -none- numeric
## counts   2     -none- numeric
## means   34     -none- numeric
## scaling 17     -none- numeric
## lev      2     -none- character
## svd      1     -none- numeric
## N        1     -none- numeric
## call     3     -none- call
## terms    3     terms  call
## xlevels  8     -none- list

## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1110 21
## yes 336 66
##
## Accuracy : 0.7671
## 95% CI : (0.7451, 0.7881)
## No Information Rate : 0.9432
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1948
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.75862
## Specificity : 0.76763
## Pos Pred Value : 0.16418
## Neg Pred Value : 0.98143
## Prevalence : 0.05675
## Detection Rate : 0.04305
## Detection Prevalence : 0.26223
## Balanced Accuracy : 0.76313
LDA
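lda.pred.os is used in the plotting code below, but its definition fell on a missing slide; presumably:

# Hypothetical reconstruction:
lda.pred.os <- predict(lda.fit.os, newdata = StrokeTest.Oversample)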
# Plot good classifications in red and bad classifications in black
col.lda.pt <- c("black", "indianred1")[1*(lda.pred.os$class == StrokeTest.Oversample$stroke) + 1]

pairs(StrokeTest.Oversample,
      main = "Good (in red) and bad (in black) classifications with LDA",
      col = col.lda.pt)  # col argument assumed; the call was truncated on the slide

QDA
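Neither the QDA fit nor its predictions survive in this extract; a sketch of the usual MASS calls (note that qda can fail on rank-deficient factor predictors, so the original may have used a subset of variables):

# Hypothetical reconstruction:
qda.fit.os <- qda(stroke ~ ., data = Stroke.Oversample)
qda.pred.os <- predict(qda.fit.os, newdata = StrokeTest.Oversample)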

# Plot good classifications in red and bad classifications in black
col.qda.pt <- c("black", "indianred1")[1*(qda.pred.os$class == StrokeTest.Oversample$stroke) + 1]

pairs(StrokeTest.Oversample,
      main = "Good (in red) and bad (in black) classifications with QDA",
      col = col.qda.pt)  # col argument assumed; the call was truncated on the slide

