Exploring The Risk Factors of Preterm Birth Using Data Mining
Article info
Keywords:
Preterm birth
Data mining
Neural network
Decision tree
Abstract
Preterm birth is the leading cause of perinatal morbidity and mortality, but its precise mechanism is still unknown. Hence, the goal of this study is to explore the risk factors of preterm birth using data mining with a neural network and the C5.0 decision tree. The original medical data were collected from a prospective pregnancy cohort by a professional research group at National Taiwan University. Using a nested case-control study design, a total of 910 mother-child dyads were recruited from the 14,551 in the original data. Thousands of variables are examined in these data, including basic characteristics, medical history, environment, and occupation factors of parents, and variables related to infants. The results indicate that multiple birth, hemorrhage during pregnancy, age, disease, previous preterm history, body weight before pregnancy and height of pregnant women, and paternal lifestyle risk factors related to drinking and smoking are the important risk factors of preterm birth. Hence, the findings of our study will be useful for parents, medical staff, and public health workers in attempting to detect high-risk pregnant women and provide early intervention to reduce and prevent preterm birth.
© 2010 Elsevier Ltd. All rights reserved.
1. Introduction
2. Literature review
Corresponding author.
E-mail addresses: i14248@mail.hku.edu.tw (H.Y. Chen), chchuang@mail.cjcu.edu.tw (C.H. Chuang).
1 These authors contributed equally to the work.
0957-4174/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2010.10.017
set at each of its tree nodes, seeking the attribute that best separates the instances. ID3 later evolved into C4.5 (Quinlan, 1993), and this was an important advance with regard to the splitting rule and the calculation method. C5.0 is a commercial version of C4.5, and is available in closed-source products such as Clementine and RuleQuest (Han & Kamber, 2007). C5.0 improves the rule generation of C4.5, and can obtain similar results with considerably smaller decision trees (Quinlan, 1997). Other decision tree methods include CART (Breiman, Friedman, Olshen, & Stone, 1984) and CHAID (Loh & Shih, 1997), which provide a set of rules that can be applied to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating two-way splits, while CHAID creates multi-way splits. CART typically requires less data preparation than CHAID. QUEST, another type of decision tree, is similar to the CART algorithm, but is designed to reduce the processing time required for large CART analyses (Agrawal, Mehta, Shafer, & Srikant, 1996).
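The attribute test that ID3 performs at each node can be illustrated with information gain, the criterion ID3 uses (C4.5 refines it into a gain ratio). The following is a minimal sketch, not the implementation used in this study; the function names and the four toy records are invented for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Toy records invented for illustration; they are NOT taken from the study data.
toy_rows = [
    {"multiple_birth": True, "smoker": False, "preterm": True},
    {"multiple_birth": True, "smoker": True, "preterm": True},
    {"multiple_birth": False, "smoker": True, "preterm": False},
    {"multiple_birth": False, "smoker": False, "preterm": False},
]

chosen = best_attribute(toy_rows, ["multiple_birth", "smoker"], "preterm")
```

In this toy set, splitting on `multiple_birth` separates the two classes completely, so it is the attribute selected at the root.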
A decision tree is exploratory in nature, identifying clusters or
segments of interest. We thus use a decision tree to identify
the 15 most important impact factors for preterm birth. We used
the C5.0 algorithm, which obtained considerably more results than
the other decision tree methods.
3. Empirical study
Fig. 1. The process of data mining to explore risk factors of preterm birth. (Figure labels: Strategy 1; Strategy 2.)

Table 1
Relative importance of inputs found by neural network.
Number  Factor                                  Coefficient
1       Number of birth                         0.3597
2       Paternal smoking                        0.1086
3       Hemorrhage during pregnancy             0.1077
4       Parity                                  0.0899
5       Maternal age                            0.0659
6       Paternal occupation                     0.0636
7       Maternal hypertension                   0.0620
8       Medicines taken during pregnancy        0.0600
9       Maternal gynecological diseases         0.0599
10      Maternal body height                    0.0567
11      Maternal body weight before pregnancy   0.0500
12      Paternal age                            0.0499
13      Paternal drinking                       0.0492
14      Previous preterm birth                  0.0415
15      Vitamins taken during pregnancy         0.0350
Table 2
Decision tree on predictors of preterm birth.
Number  Accuracy (%)  Rule
1       100.0         Multiple birth
2       100.0         Single birth, and hemorrhage during pregnancy
3       100.0         Single birth, no hemorrhage during pregnancy, previous preterm birth, and paternal drinking
4       100.0         Single birth, no hemorrhage during pregnancy, previous preterm birth, no paternal drinking, no maternal gynecological diseases, paternal smoking, maternal body weight before pregnancy is 46-50 kg, and medicines used during pregnancy
5       100.0         Single birth, no hemorrhage during pregnancy, no previous preterm birth, first parity, paternal age < 35, paternal smoking, maternal age >= 20, maternal body height > 160 cm, and paternal age <= 19
6       100.0         Single birth, no hemorrhage during pregnancy, no previous preterm birth, first parity, paternal age < 35, paternal smoking, maternal age >= 20, maternal body height > 160 cm, paternal age > 19, and maternal gynecological diseases
7       100.0         Single birth, no hemorrhage during pregnancy, no previous preterm birth, first parity, paternal age < 35, paternal smoking, maternal age >= 20, maternal body height is 156-160 cm, medicines used by pregnant women, and paternal occupation is neither manual nor non-manual
8       92.4          Single birth, no hemorrhage during pregnancy, no previous preterm birth, first parity, paternal age >= 35, paternal smoking
9       83.3          Single birth, no hemorrhage during pregnancy, no previous preterm birth, first parity, paternal age < 35, paternal smoking, maternal age >= 20, maternal body height is 156-160 cm, medicines used by pregnant women, and paternal occupation is non-manual
10      80.0          Single birth, no hemorrhage during pregnancy, previous preterm birth, paternal drinking, maternal gynecological diseases, paternal smoking, and maternal body weight before pregnancy over 55 kg
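To make the rule format of Table 2 concrete, the first three rules can be written as nested conditions. This is only an illustrative sketch: the record field names are hypothetical, and it assumes that a matching rule flags a pregnancy as high risk for preterm birth, which is how the table caption is read here.

```python
def matching_high_risk_rule(record):
    """Return the number (1-3) of the first Table 2 rule the record matches.

    `record` is a dict of booleans with hypothetical field names.
    Returns None when none of the first three rules applies.
    """
    if record["multiple_birth"]:
        return 1  # Rule 1: multiple birth
    # From here on the birth is a single birth.
    if record["hemorrhage_during_pregnancy"]:
        return 2  # Rule 2: single birth and hemorrhage during pregnancy
    if record["previous_preterm_birth"] and record["paternal_drinking"]:
        return 3  # Rule 3: single birth, no hemorrhage, previous preterm, drinking
    return None


# Example record invented for illustration only.
example = {
    "multiple_birth": False,
    "hemorrhage_during_pregnancy": False,
    "previous_preterm_birth": True,
    "paternal_drinking": True,
}
rule_number = matching_high_risk_rule(example)
```

Because each rule is a path from the root of the tree, checking the conditions in order reproduces the "single birth, no hemorrhage" prefixes without repeating them.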
questionnaire. Many factors were measured in the questionnaire, including the pregnant women's and their husbands' characteristics (age, education, occupation, body height and weight), lifestyle, family income, past and present medical information, and living environment. After the participating women gave birth, the birth weight, gestational duration and characteristics of the live-born infants were gathered from the Taiwan National Birth Register. A total of 14,551 pairs were recruited in the database (Chuang et al., 2006).
3. Data organization: Incomplete data were deleted, and, based on the nested case-control study method, 910 mother-infant pairs were selected from the 14,551 in the original set.
4. Analysis: A neural network was used to mine the 15 most important factors related to preterm birth. The C5.0 decision tree was then used to classify the risk factors, so that high-risk groups for preterm birth could be detected.
3.2. Data preparation
There were a total of 455 preterm births in this medical data set, and thus a marked imbalance between preterm (455) and full-term infants (14,096) in the original data. Therefore, according to the 1:1 principle for cases (preterm birth) and controls (full-term birth), we randomly sampled 455 full-term babies from the original medical data. A total of 910 pairs (mothers and their infants) were finally utilized in our analysis.
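The 1:1 sampling described above can be sketched as follows. The counts (455 cases, 14,096 controls) come from the text; the integer identifiers, the function name and the fixed random seed are assumptions made purely for illustration, since the actual sampling procedure is not specified beyond being random.

```python
import random


def sample_one_to_one(cases, controls, seed=0):
    """Keep every case and draw an equally sized random sample of controls."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    sampled_controls = rng.sample(controls, k=len(cases))
    return cases + sampled_controls


# Hypothetical record identifiers standing in for the mother-infant pairs.
preterm_ids = list(range(455))            # 455 preterm births (cases)
full_term_ids = list(range(455, 14551))   # 14,096 full-term births (controls)

study_ids = sample_one_to_one(preterm_ids, full_term_ids)
```

The resulting list holds 910 identifiers, half cases and half controls, which matches the size of the analysis set reported above.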
4. Research results
The relative importance of inputs, as derived by the neural network, is shown in Table 1. Because a large number of variables were measured in the current medical data, we decided to explore them in two stages. First, a neural network was used to investigate the risk factors of preterm birth, and the top 15 factors, with coefficients larger than 0.0300, were then used in the next stage of our study. These factors were as follows: number of birth, paternal smoking, hemorrhage during pregnancy, parity, maternal age, paternal occupation, maternal hypertension, medicines taken during pregnancy, maternal gynecological diseases, maternal body height, maternal body weight before pregnancy, paternal age, paternal drinking, previous preterm birth, and vitamins taken during pregnancy.
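The selection step can be expressed as a simple threshold filter over the importance coefficients. The sketch below uses only the 15 coefficients published in Table 1, because the coefficients of the variables that were dropped are not reported; the variable names `importance` and `THRESHOLD` are illustrative.

```python
# Relative importance coefficients as listed in Table 1.
importance = {
    "Number of birth": 0.3597,
    "Paternal smoking": 0.1086,
    "Hemorrhage during pregnancy": 0.1077,
    "Parity": 0.0899,
    "Maternal age": 0.0659,
    "Paternal occupation": 0.0636,
    "Maternal hypertension": 0.0620,
    "Medicines taken during pregnancy": 0.0600,
    "Maternal gynecological diseases": 0.0599,
    "Maternal body height": 0.0567,
    "Maternal body weight before pregnancy": 0.0500,
    "Paternal age": 0.0499,
    "Paternal drinking": 0.0492,
    "Previous preterm birth": 0.0415,
    "Vitamins taken during pregnancy": 0.0350,
}

THRESHOLD = 0.0300  # cut-off stated in the text

# Rank factors by coefficient and keep those above the cut-off.
ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
selected = [name for name, coefficient in ranked if coefficient > THRESHOLD]
```

All 15 listed factors exceed the cut-off, so `selected` reproduces the ranking of Table 1 and serves as the input list for the decision tree stage.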
We then used these 15 factors for the decision tree analysis. Through the construction of a decision tree, 17 rules were derived to predict preterm birth. Ten of these rules, with an accuracy of 80% or more, are listed in Table 2. A multiple birth
Table 2 (continued). Accuracy (%) for rules 5-10: 100.0, 100.0, 100.0, 92.4, 83.3, 80.0.
Liu, D., Yuan, Y., & Liao, S. (2009). Artificial neural networks for optimization of gold-bearing slime smelting. Expert Systems with Applications, 36(9), 11671-11674.
Loh, W. Y., & Shih, Y. S. (1997). Split selection methods for classification trees. Statistica Sinica, 7(4), 815-840.
MacDorman, M. F., Martin, J. A., Mathews, T. J., Hoyert, D. L., & Ventura, S. J. (2005). Explaining the 2001-02 infant mortality increase: Data from the linked birth/infant death data set. National Vital Statistics Reports, 53(12), 1-22.
McCormick, M. C. (1985). The contribution of low birth weight to infant mortality and childhood morbidity. New England Journal of Medicine, 312(2), 82-90.
Moore, M. L. (2003). Preterm labor and birth: what have we learned in the past two decades? Journal of Obstetric Gynecologic & Neonatal Nursing, 32(5), 638-649.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann.
Quinlan, J. R. (1997). C5.0 and See5: Illustrative examples. RuleQuest Research. http://www.rulequest.com.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
Rajan, K., Ramalingam, V., Ganesan, M., Palanivel, S., & Palaniappan, B. (2009). Automatic classification of Tamil documents using vector space model and artificial neural network. Expert Systems with Applications, 36(8), 10914-10918.
Romero, R., Espinoza, J., Kusanovic, J. P., Gotsch, F., Hassan, S., Erez, O., et al. (2006). The preterm parturition syndrome. British Journal of Obstetrics and Gynaecology, 113(Suppl 3), 17-42.
Slattery, M. M., & Morrison, J. J. (2002). Preterm delivery. Lancet, 360(9344), 1489-1497.
SPSS (2005). Introduction to Clementine. USA: SPSS Inc.
Wu, L. C., Lee, J. X., Huang, H. D., Liu, B. J., & Horng, J. T. (2009). An expert system to predict protein thermostability using decision tree. Expert Systems with Applications, 36(5), 9007-9014.