An Integrated Machine Learning Approach To Stroke Prediction
Aditya Khosla, Dept. of Computer Science, Stanford University, Stanford, CA 94305, aditya86@stanford.edu
Yu Cao, Dept. of Electrical Engineering, Stanford University, Stanford, CA 94305, yufcao@stanford.edu
Cliff Chiung-Yu Lin, Dept. of Electrical Engineering, Stanford University, Stanford, CA 94305, chiungyu@stanford.edu
Hsu-Kuang Chiu, Dept. of Electrical Engineering, Stanford University, Stanford, CA 94305, hkchiu@stanford.edu
Junling Hu∗, eBay, Inc, 2145 Hamilton Avenue, San Jose, CA 95125, juhu@ebay.com
Honglak Lee†, Dept. of Computer Science, Stanford University, Stanford, CA 94305, hllee@cs.stanford.edu
ABSTRACT
Stroke is the third leading cause of death and the principal cause of serious long-term disability in the United States. Accurate prediction of stroke is highly valuable for early intervention and treatment. In this study, we compare the Cox proportional hazards model with a machine learning approach for stroke prediction on the Cardiovascular Health Study (CHS) dataset. Specifically, we consider the common problems of data imputation, feature selection, and prediction in medical datasets. We propose a novel automatic feature selection algorithm that selects robust features based on our proposed heuristic: conservative mean. Combined with Support Vector Machines (SVMs), our proposed feature selection algorithm achieves a greater area under the ROC curve (AUC) as compared to the Cox proportional hazards model and L1 regularized Cox model. Furthermore, we present a margin-based censored regression algorithm that combines the concept of margin-based classifiers with censored regression to achieve a better concordance index than the Cox model. Overall, our approach outperforms the current state-of-the-art in both metrics of AUC and concordance index. In addition, our work has also identified potential risk factors that have not been discovered by traditional approaches. Our method can be applied to clinical prediction of other diseases, where missing data are common and risk factors are not well understood.

Categories and Subject Descriptors
J.3 [Computer Application]: Life and medical sciences; I.2.6 [Artificial Intelligence]: Learning; I.5.2 [Pattern recognition]: Design methodology

General Terms
Experimentation, Algorithms, Performance

∗ This work was done while the author was at Robert Bosch LLC, Research and Technology Center
† Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD’10, July 25–28, 2010, Washington, DC, USA.
Copyright 2010 ACM 978-1-4503-0055-1/10/07 ...$10.00.

1. INTRODUCTION
Stroke is the third leading cause of death and the principal cause of serious long-term disability in the United States [2]. Stroke risk prediction can contribute significantly to its prevention and early treatment. Numerous medical studies and data analyses have been conducted to identify effective predictors of stroke. The Framingham Study [6, 34] reported a list of stroke risk factors including age, systolic blood pressure, the use of anti-hypertensive therapy, diabetes mellitus, cigarette smoking, prior cardiovascular disease, atrial fibrillation, and left ventricular hypertrophy by electrocardiogram. Furthermore, in the past decade, a number of other studies [25, 23, 24, 26] have led to the discovery of more risk factors such as creatinine level, time to walk 15 feet, and others.

Most previous prediction models have adopted features (risk factors) that are verified by clinical trials or selected manually by medical experts. For example, Lumley et al. [24] built a 5-year stroke prediction model based on the Cardiovascular Health Study [8] dataset using a set of 16 manually selected features (given in [25]) from a total of roughly one thousand features. With a large number of features in current medical datasets, it is a cumbersome task to identify and verify each risk factor manually. On the other hand, machine learning algorithms are capable of identifying features highly related to stroke occurrence efficiently from the huge set of features; therefore, we believe machine learning can be used to: (i) improve the prediction accuracy of stroke risk and (ii) discover new risk factors.

Lumley et al.’s [24] 5-year stroke prediction model adopted the Cox proportional hazards model, one of the most commonly used statistical methods in medical research [3]. It has been extensively studied [1, 3] and applied to the prediction of various diseases including stroke [16, 24, 21]. However, the performance of the original Cox model depends heavily on the quality of the pre-selected features. To address this problem, several approaches have been proposed recently [9, 28].

Thus far, there have been very few studies on comparing the Cox regression with machine learning methods in making predictions on censored data. Kattan [18] compared Cox proportional hazards regression with several machine learning methods (neural networks and tree-based methods) based on three urological datasets. However, Kattan’s
study focused on datasets with only five features, while machine learning algorithms are expected to effectively handle many more features. In addition, the paper considered only some relatively simple machine learning algorithms; high-performance machine learning algorithms such as SVM and logistic regression were not explored.

This paper presents an integrated machine learning approach for stroke risk prediction. We investigated machine learning algorithms to improve the prediction accuracy and conducted extensive comparisons between our results and those with the Cox proportional hazards model. Using the CHS dataset as a benchmark, we first duplicated the results of Lumley et al. [24] as a baseline. We then compared our machine learning approach with the baseline results and an extended version of the Cox model with feature selection. According to our experiments, our approach consistently outperformed the Cox model.

Our approach considers the problems of data imputation, feature selection, and prediction in medical datasets. We propose a novel automatic feature selection algorithm that selects robust features based on our proposed heuristic: conservative mean. We combine this feature selection algorithm with the popular SVM algorithm.

Furthermore, we present a margin-based censored regression algorithm that combines the concept of margin-based classifiers with censored regression to achieve a better concordance index than the Cox model. In addition, our work has also identified potential risk factors that have not been discovered by traditional approaches. Last, we note that this method can be applied to clinical prediction of other diseases, where missing data are common and risk factors are not well understood.

In summary, our main contributions are:

1. An extensive evaluation of the problems of data imputation, feature selection and prediction in medical data, with comparisons against the Cox proportional hazards model.

2. A novel feature selection algorithm, Conservative Mean feature selection, that outperforms both the L1 regularized Cox model and L1 regularized logistic regression on the CHS dataset.

3. A novel risk prediction algorithm, Margin-based Censored Regression, that outperforms the Cox model given the same set of features.

4. Discovery of new (previously unknown) potential risk factors of stroke.

5. An integrated machine learning approach that significantly outperforms the current state-of-the-art algorithm in stroke prediction.

This paper is organized as follows. Section 2 describes Cox proportional hazards regression and the L1 regularized Cox models that we use as the baselines in this study. Section 3 describes the various machine learning-based methods we compare against the Cox models, and Section 4 provides the experimental results of our approach. Finally, Section 5 presents our conclusions.

2. RELATED WORK
The Cox proportional hazards model is widely adopted in clinical studies and used heavily in stroke prediction. We briefly compare the Cox model to some of our other approaches.

2.1 Cox proportional hazards model
The Cox proportional hazards model is given by

$$h(t \mid x) = h_0(t) \exp(\beta^T x), \qquad (1)$$

where $h(t \mid x)$ is the hazard value at time $t$ given the feature vector $x \in \mathbb{R}^d$ for an individual, $h_0(t)$ is an arbitrary baseline hazard function, $\beta \in \mathbb{R}^d$ are the parameters that we are trying to estimate for the model, and $d$ is the number of features for each individual.

This model is known as a semi-parametric model because the baseline hazard function is treated non-parametrically. Thus, we can see that the parameters have a multiplicative effect on the hazard value, which makes it different from the linear regression models [20].

The Cox model is part of the Generalized Linear Model (GLM) family. Another member of this family is the logistic regression model, where the output takes the following form:

$$h(x) = (1 + \exp(-\beta^T x))^{-1} \qquad (2)$$

In this study, we investigated both the Cox model and the logistic regression model for stroke prediction. In addition, we broadened our approach to other non-regression models such as SVM, taking an agnostic view on what the best model is for stroke prediction. We found that while the Cox model performs reasonably well for stroke prediction, it is inferior to more general machine learning models, such as SVM or margin-based censored regression (proposed in this paper).

2.2 Feature selection and L1 regularization
Finding the best estimate for $\beta$ in equations (1) and (2) is typically computationally difficult, particularly given a large number of features. By introducing a complexity-based penalty term, we can identify irrelevant features and remove them from our model. The L1 regularized sparse learning problem has the following general form:

$$\min_{\beta} \; g(\beta) + \lambda \|\beta\|_1, \qquad (3)$$

where $g(\cdot)$ is a convex function, $\beta$ is a vector of length $d$, and $\lambda > 0$ is a regularization parameter.

In this study, we evaluated both the L1 regularized Cox model and L1 regularized logistic regression. We found that L1 regularized feature selection gives better performance than the baselines (i.e., selecting features manually) by reducing the feature set to the most relevant ones.

3. OUR APPROACH
We present an integrated machine learning approach to stroke prediction. Our approach takes the following steps:

1. Apply a systematic method for imputing the missing entries in the dataset.

2. Select the relevant feature subset based on an automatic procedure.

3. Apply learning algorithms to evaluate the prediction performance.
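
As a rough illustration of the three steps above, the following Python sketch imputes missing entries, selects features with L1 regularized logistic regression (equations (2) and (3)), and scores new individuals. It is not the pipeline used in this paper: column-mean imputation, the proximal-gradient solver, the toy data, and all function names are assumptions made for the example, and the Conservative Mean and SVM components are not shown.

```python
import math

def impute_column_means(rows):
    """Step 1 (stand-in): replace None with the mean of the observed column values."""
    d = len(rows[0])
    means = []
    for j in range(d):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(d)] for r in rows]

def sigmoid(z):
    """Equation (2), h = (1 + exp(-z))^-1, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrinks v toward zero, exactly zero if |v| <= t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_l1_logistic(X, y, lam, step=0.1, iters=2000):
    """Step 2: minimize g(beta) + lam * ||beta||_1 (equation (3)) by proximal
    gradient descent, with g the mean logistic loss (an assumed choice of g)."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(b * v for b, v in zip(beta, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return beta

def predict_risk(beta, x):
    """Step 3: predicted probability from equation (2)."""
    return sigmoid(sum(b * v for b, v in zip(beta, x)))

# Hypothetical toy data: column 0 tracks the label, column 1 does not.
rows = [[-3.0, 1.0], [-2.0, 3.0], [-1.0, 2.0], [1.0, None], [2.0, 3.0], [3.0, 1.0]]
y = [0, 0, 0, 1, 1, 1]

X = impute_column_means(rows)
beta = fit_l1_logistic(X, y, lam=0.1)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected feature indices:", selected)
print("risk for x = [3, 1]:", predict_risk(beta, [3.0, 1.0]))
```

On this toy data the L1 penalty keeps the coefficient of the uninformative second column at exactly zero, which is the sense in which the penalty term in equation (3) acts as a feature selector.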
3.1 Performance Metrics
For evaluating the performance of our methods, we used the following metrics: area under the ROC curve and concordance index. To define these precisely, we first outline the notation.

[...]

where $\mathbf{1}(\cdot)$ is the indicator function as before, and $|\varepsilon|$ denotes the number of edges in the ordered graph of $t$.³ We assume that a larger value of $f$ corresponds to a higher risk of stroke.

In the following sections, we describe the details of data imputation, feature selection, and prediction models.