Framework for integration of domain knowledge into logistic regression


Sandro Radovanović, University of Belgrade - Faculty of Organizational Sciences, Serbia, sandro.radovanovic@fon.bg.ac.rs
Boris Delibašić, University of Belgrade - Faculty of Organizational Sciences, Serbia, boris.delibasic@fon.bg.ac.rs
Miloš Jovanović, University of Belgrade - Faculty of Organizational Sciences, Serbia, milos.jovanovic@fon.bg.ac.rs
Milan Vukićević, University of Belgrade - Faculty of Organizational Sciences, Serbia, milan.vukicevic@fon.bg.ac.rs
Milija Suknović, University of Belgrade - Faculty of Organizational Sciences, Serbia, milija.suknovic@fon.bg.ac.rs

ABSTRACT

Traditionally, machine learning extracts knowledge solely from data. However, a huge volume of knowledge is available in other sources and can be included in machine learning models. Still, domain knowledge is rarely used in machine learning. We propose a framework that integrates domain knowledge in the form of hierarchies into machine learning models, namely logistic regression. Integration of the hierarchies is done using stacking (stacked generalization). We show that the proposed framework yields better results than the standard logistic regression model. The framework is tested on the binary classification problem of predicting 30-day hospital readmission. Results suggest that the proposed framework improves AUC (area under the curve) compared to logistic regression models unaware of domain knowledge by 9% on average.

KEYWORDS

Domain knowledge, Logistic regression, Stacking, Hospital readmission

ACM Reference format:

Sandro Radovanović, Boris Delibašić, Miloš Jovanović, Milan Vukićević, and Milija Suknović. 2018. SIG Proceedings Paper in Word Format: Regular Research Paper. In WIMS ’18: 8th International Conference on Web Intelligence, Mining and Semantics, June 25-27, 2018, Novi Sad, Serbia, Jennifer B. Sartor, Theo D’Hondt, and Wolfgang De Meuter (Eds.). ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3227609.3227653

CCS CONCEPTS

• Machine learning → Learning paradigms → Supervised learning → Supervised learning by classification • Machine learning algorithms → Ensemble methods

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
WIMS ’18, June 25-27, 2018, Novi Sad, Serbia
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5489-9-18-06…$15.00
https://doi.org/10.1145/3227609.3227653

1 INTRODUCTION

One of the fastest growing fields in computer science is machine learning. The goal of machine learning is to develop algorithms which can artificially generate knowledge (induce models) from experience (data). However, in some domains (such as bioinformatics, medicine, etc.) a vast amount of expert or domain knowledge is readily available besides the experience stored in data. This knowledge is often presented in a machine-readable format, such as ontologies or hierarchies. It seems reasonable to include domain knowledge in machine learning models in such a manner that machine learning algorithms understand the context of the data better or relax some assumptions of the learning algorithms. This can presumably lead to better predictive performance, or to more understandable and consistent models. A typical example is one of the most commonly used machine learning algorithms, logistic regression, which assumes a linear relationship between input attributes. Consequently, logistic regression will have a high bias error if a non-linear relationship between input attributes exists. One can add attribute interactions in order to alleviate this problem, but the question is which interactions should be added. If domain knowledge is already available, adding attribute interactions which

utilize the hierarchical interactions between attributes would allow the logistic regression model to solve this problem.

The main hypothesis of this paper is that integrating knowledge into machine learning models can improve the results of machine learning models. In this paper we test this hypothesis partially by integrating Clinical Classifications Software (CCS) hierarchies into a logistic regression stacking (stacked generalization) framework. The experiments are performed on a real-world medical classification problem where the task is to predict whether pediatric patients will be readmitted to hospital within 30 days after discharge. This problem is one of the most challenging problems in medical applications of machine learning [30].

The problem on which our approach is tested is hospital readmission within 30 days of discharge. This is a binary classification problem which is well known in medical applications of machine learning. It is challenging because of high dimensionality (problems often have hundreds or thousands of attributes), sparsity (data at hand are often sparse) and class imbalance (hospital readmission has a high cost, but occurs on average for only 7% of pediatric patients [2]).

2 LITERATURE REVIEW

Using domain knowledge as a source of additional information for machine learning algorithms is not a new idea. The main reason why this is interesting is the fact that human-generated expert knowledge can sometimes outperform learning algorithms. This is especially true in areas such as medicine, where there are multiple complex patterns which need to be interpreted. Sometimes the integrated domain knowledge can be indispensable and irreplaceable, especially when the data at hand contain noise or missing values, when working with rare events, or when the domain problem is hard to define and solve. Therefore, efforts are invested in making machine learning algorithms more effective and efficient [11]. One can alter the goal function of the learning algorithm to solve the problem directly, i.e. instead of minimizing a penalty function one can minimize a cost function where different errors have different costs associated with them. One way to include domain knowledge is to include constraints in the learning model, i.e. to include regularizations [16]. Also, domain knowledge is used to obtain new data using virtual examples [32], [33]. However, domain knowledge is mostly used in the form of heuristics or exact rules for feature extraction or feature selection.

Introduction of domain knowledge into a learning algorithm must be carefully implemented. Learning algorithms seek an optimal solution of a search objective (goal) function, and domain knowledge must be in accordance with the goal function; if it alters the function, the altered function must be optimizable, i.e. have a global optimum. In other words, domain knowledge must satisfy three aspects of convergence [38]. First, the hypothesis space must contain a feasible solution. This means that the hypothesis space encompasses an acceptable approximation of the goal function. If the hypothesis space has zero acceptable solutions, the learning algorithm will not yield a predictive model. Second, inclusion of domain knowledge must be efficiently implemented. This requirement suggests that domain knowledge should be used to reduce the search space or help in faster convergence to an optimal solution. However, the search space can also be expanded if domain knowledge helps the learning algorithm perform better. Third, domain knowledge provides a tradeoff between computational cost and accuracy. Namely, domain knowledge should be used in an interactive learning process where the expert stops learning when an acceptable solution is achieved.

According to [35], domain knowledge and machine learning can be combined such that 1) a domain expert validates predictive models after machine learning has been applied, 2) the expert provides additional constraints to the learning models, and 3) experts and algorithms perform in turns, namely the algorithm presents a model and the expert provides feedback on the model. To this date, the first type of combination is mostly used (especially in sensitive applications such as medicine), while the second and third are less used due to the time needed for algorithm development, model creation and model evaluation by the domain expert. Most broadly, incorporation of domain knowledge into machine learning algorithms can be categorized into four groups [38]: 1) preparation of training examples, 2) initialization of the hypothesis space, 3) altering the search objective and 4) augmentation of the search procedure. All of the groups aim to improve predictive model generality and/or efficiency of the learning process.

Preparation of training examples aims to expand training examples and/or enlarge the number of training examples using a human heuristic mapped to an algorithm for data generation. Most often, the creation of these examples is called creation of virtual examples. The idea is to utilize similarities between concepts which are labels. This process is very useful in areas where rare events occur. Utilization of domain knowledge for example generation in medical applications, and more specifically the hospital readmission problem, is presented in [32], [33]. It has been shown that creation of virtual examples using similarity between diagnoses in the ICD-9-CM hierarchy outperforms classic approaches such as undersampling and oversampling. Additionally, better performance is obtained if there is a very small number of examples and if there exists a huge class imbalance (rare events problems). The reason why virtual examples work is that they are created in areas where the density of the minority class is high. In other words, virtual examples are created as an interpolation of existing examples, with domain knowledge used as guidance in the input space. A major drawback of this process is the increase in the computational cost of classifier training and the loss of interpretation of the model probability score if virtual examples are used for oversampling.

Initialization of the hypothesis space using domain knowledge is most often used for reducing the hypothesis space. One group of papers belonging to this category is focused on incorporation of domain knowledge into kernel methods. The concepts which present domain knowledge are most often expressed as a kernel quality criterion [34] and as such are used for kernel selection. However, this approach is less used. Another example is argument-based machine learning (ABML) [9], [22], which presents an extension of machine learning with domain knowledge in the form of arguments, with the idea of reducing the hypothesis space.
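The interpolation-based virtual examples discussed above can be illustrated with a minimal sketch (this is not the method of [32], [33]; the mixing scheme and toy data are assumptions for illustration):

```python
import numpy as np

def make_virtual_examples(X_minority, n_new, rng=None):
    """Create synthetic minority-class rows by interpolating
    between randomly chosen pairs of existing minority examples."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(X_minority)
    out = []
    for _ in range(n_new):
        i, j = rng.integers(0, n, size=2)  # pick a random pair
        lam = rng.random()                 # mixing coefficient in [0, 1]
        out.append(lam * X_minority[i] + (1 - lam) * X_minority[j])
    return np.array(out)

# Toy minority class: 3 examples, 2 attributes.
X_min = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
X_virtual = make_virtual_examples(X_min, n_new=5)
print(X_virtual.shape)  # (5, 2)
```

Because each virtual row is a convex combination of two real rows, new examples stay inside the region occupied by the minority class, which matches the density argument above.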


The idea is based on the fact that domain experts focus on a specific problem when providing domain knowledge, which means that arguments are not general (exceptions are handled using counter-arguments). Also, disagreements between experts can be imported into the model, and a reasoner will select which one of them is acceptable. The benefit of using arguments is in imposing a constraint over the search space, which reduces the overfitting problem; additionally, the proposed model will be in accordance with expert knowledge. Arguments can present positive feedback (an example belongs to a specific class) or negative feedback (an example does not belong to a specific class). As a drawback of this approach, one must use constrained optimization methods which are not available in standard off-the-shelf tools. To this category we can also add the creation of an initial hypothesis, which can be interpreted as help from domain knowledge to provide a better starting point and speed up the process of learning. It is most often used with neural networks, where the structure of the network is imposed by a domain expert. This research belongs to this category, as domain knowledge in the form of a hierarchy is used as a structure for the logistic regression model. More details about other similar papers are presented in the following paragraph.

The idea of enhancing logistic regression with medical domain knowledge in the form of hierarchies is not new and has already given better results compared to plain logistic regression. One approach tried to extract and select attributes using heuristics [27], [26]. It has been shown in these papers that the extracted attributes not only improve predictive performance but also provide better stability, because efficient attribute selection is performed on more general attributes (representing a broader population). Therefore, patterns in data have higher support and confidence. Additionally, more general attributes are interpretable to a medical expert. It is also worth noticing that these approaches outperformed traditional and modern attribute selection techniques in predictive performance. The main reason was the usage of more general attributes which present general medical concepts. These attributes were extracted using unsupervised learning methods, namely the logical OR operator. Another approach was to utilize a domain hierarchy in the form of regularization [14], [13]. Besides better generalizability of the predictive model, these models lead to improved interpretability of the logistic regression model. Additionally, these models can be used for gaining further insights into the causes of readmission and as risk indicators. Similar approaches have already shown that utilization of domain knowledge in the form of a hierarchy improves the predictive performance of the learning algorithm and model interpretability [40], [4], [6].

Altering the search objective is the third group of applications where domain knowledge helps machine learning models. This type of application modifies the predictive model's optimization problem either by introducing weights of examples or by imposing costs on the errors. In the latter, a wrapper around the learning algorithm is applied which optimizes the cost provided by a domain expert. A typical example is the MetaCost algorithm [7]. It performs a bagging-like procedure utilizing the cost and the probability obtained from the learning model. Namely, the output of the model is the weighted average of predicted classes, computed as the minimal value of the product of the cost and the probability score for each class. The other approach, where a weight is given to an example, is also often used. It is considered a very hard problem since the domain expert has to give a weight to each example; in the era of huge data sets this is a time-consuming task. However, one can give a weight to a group of examples per specific value of another feature or per class [39]. Example weighting can also be performed based on decision rules, as presented in [15]. With decision rules in mind, a domain expert can propose rules and give weights to the examples that satisfy the proposed rules.

Most often, domain knowledge is used to augment the search space. Augmentation of the search space is the expansion of the existing dataset by inclusion of logical interactions. One example is the First Order Inductive Learner (FOIL) [25], a system that learns Horn clauses (instead of the classical attribute-value languages) from relations. Relations are constructed by adding one literal at a time; in order to evaluate relations, information gain is calculated. A broader field of algorithms which heavily depends on domain knowledge is known as Inductive Logic Programming (ILP) [19]. These algorithms generate logical theories using examples in data and domain knowledge. We refer the interested reader to other examples of ILP such as PROGOL [23], ALEPH [29] and TILDE [3]. A similar approach called propositionalization [17], which decouples the feature construction phase from the model construction phase, has also been proposed. It is based on relational rule learning for classification and prediction, where relational rules augment the feature space in a process called constructive induction. Relational rules consume domain knowledge and structural properties of examples. Arguably, as a downside one can say that ILP does not warrant an optimal solution, since most of the approaches use heuristics or meta-heuristics such as hill-climbing. However, this optimization technique is justified, since the number of combinations would otherwise lead to combinatorial explosion. We would also like to mention description logic learners, which are used to construct descriptions of sets of examples and then reason about these descriptions [1].

Although the medical and biomedical domains have a vast amount of knowledge which can be exploited in order to improve predictive models, there are other applications of domain knowledge in machine learning models which can be classified as initialization of the hypothesis space and altering the search objective at the same time. Sometimes a classifier needs to classify multiple outputs at the same time. This problem is called multi-label classification. The main idea of multi-label methods is to use correlation between labels. However, often we can construct a hierarchy, or a hierarchy can be imposed from the domain. When a hierarchy is imposed, it can be utilized to improve predictive performance by hierarchical classification [28]. In other words, we can predict higher nodes of the hierarchy and then specialize to lower levels and finally the leaves of the hierarchy.
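The per-class example weighting mentioned above can be sketched with scikit-learn's `class_weight` parameter (the weights here are illustrative assumptions, not values from [39]):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced toy problem standing in for readmission data (~10% positives).
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Plain model vs. a model where the positive (rare) class gets a 10x
# larger weight in the loss -- one way of altering the search objective
# with domain knowledge about misclassification cost.
plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 10}).fit(X, y)

# Upweighting shifts the decision boundary toward predicting
# the rare class more often.
print(int(plain.predict(X).sum()), int(weighted.predict(X).sum()))
```

The same mechanism accepts per-example weights via `sample_weight` in `fit`, which corresponds to the rule-based weighting described above.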


[Tree diagram: Bacterial infection branches into Tuberculosis, …, Septicemia; Septicemia branches into Other specified septicemia and Unspecified septicemia; the leaves are ICD-9-CM codes such as ICD 01000, ICD 1374, ICD 0031, ICD 0389, ICD 77181 and ICD 99592.]

Figure 1: Excerpt of CCS hierarchy.

There are several algorithms that proved that hierarchical classification improves the predictive performance of learning algorithms, such as Predictive Clustering Trees (PCT) [20] and HOMER [31].

Compared to the existing approaches in the literature, we induce the hierarchical structure into the logistic regression model by using stacking, an approach that has not been reported in the literature so far.

3 METHODOLOGY

The methodology section consists of three parts. First, we explain the hospital readmission datasets used in this research. Next, we introduce the domain hierarchy and logistic regression. Finally, we explain the methodology and experiments in the experimental setup subsection.

3.1 Data

Data used in this research come from hospital discharge records from California, State Inpatient Database (SID), Healthcare Cost and Utilization Project (HCUP), Agency for Healthcare Research and Quality [24]. The purpose of this database is to track all hospital admissions at the individual level, where one patient can have a maximum of 25 diagnoses for one admission. Every diagnosis is presented as an ICD-9-CM code. For the purpose of this paper, several datasets were extracted which contain pediatric patients with diagnoses that are considered the most important for the pediatric subpopulation [2]. The datasets on which the experiments have been done are presented in Table 1.

Table 1: Dataset description

Diagnoses | # rows | # columns | % readmission
Anemia | 3,445 | 301 | 29.96%
Esophageal reflux (ER) | 4,286 | 301 | 19.58%
Epilepsy | 6,340 | 232 | 18.26%
Acute Respiratory Infections (ARI) | 3,610 | 173 | 13.32%
Asthma | 13,907 | 152 | 9.54%
Pneumonia | 6,931 | 142 | 7.62%

A dataset row presents one hospital admission, while a column presents an ICD-9-CM code. Since there are over 15,000 ICD-9-CM codes (diagnoses), only those ICD-9-CM codes were selected where over 0.5% of admissions had that specific diagnosis. Therefore, we reduced the input space considerably just by removing sparse columns. Values in the matrix present whether a diagnosis is present or not (binary values). Finally, it can be observed that hospital readmission is a fairly imbalanced problem, ranging from 7.62% to 29.96% hospital readmissions.

3.2 Domain hierarchy

In order to impute domain knowledge into logistic regression, we employed a domain hierarchy called Clinical Classifications Software (CCS) for ICD-9-CM [10], which presents clusters of clinically meaningful categories.
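The sparse-column filtering described in Section 3.1 can be sketched in a few lines of pandas (the 0.5% threshold is from the paper; the toy data frame and column names are made up for illustration):

```python
import pandas as pd

# Toy admission-by-diagnosis binary matrix (1 = diagnosis present).
df = pd.DataFrame({
    "ICD_0389":  [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "ICD_99592": [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "ICD_rare":  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
})

# Keep only diagnoses present in more than 0.5% of admissions.
prevalence = df.mean()                # fraction of admissions with each code
kept = df.loc[:, prevalence > 0.005]  # drops the all-zero sparse column
print(list(kept.columns))             # ['ICD_0389', 'ICD_99592']
```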


It was created so that one can analyze costs, utilization, and outcomes associated with particular diagnoses and medical procedures. CCS can be presented in a multi-level form such that diagnoses are presented in four levels, i.e. hierarchies. However, the hierarchy is not strict, meaning that not every path in the hierarchy is of depth four. The top level of the hierarchy presents diagnoses on a general level (e.g. Diseases of the circulatory system) which can be used mostly for descriptive analysis. As one goes down the hierarchy, diagnoses become more specific and more usable for analysis. At the bottom level, specific ICD-9-CM codes are presented. The relationship between levels is an "is-a" relationship. An example of a CCS diagnosis hierarchy is presented below:

7. Diseases of the circulatory system
7.1. Hypertension
7.1.1. Essential hypertension [ICD-9-CM 401.9]
7.1.2. Hypertension with complications and secondary hypertension [ICD-9-CM 405.9]
7.1.2.1. Hypertensive heart and/or renal disease
7.1.2.2. Other hypertensive complications
7.2. Diseases of the heart

It should be noted that the CCS hierarchy is a forest, which means that multiple hierarchies exist. More specifically, the CCS contains 17 hierarchies which present major diagnosis states.

The CCS hierarchy contains a mapping from every single ICD-9-CM code to a leaf node in the hierarchy. There are over 14,000 ICD-9-CM codes which correspond to diagnoses and over 3,900 ICD-9-CM codes which correspond to procedures. Since the data we used contain only diagnoses, we utilized the hierarchy with 14,000 ICD-9-CM codes. The depth of the hierarchy is variable, ranging from two to four, with depth 3 being most dominant.

3.3 Logistic Regression

Logistic regression is one of the most popular and most widely used algorithms in machine learning. We can present it [12] as:

\log\left(\frac{p}{1-p}\right) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n \quad (1)

where \log(p/(1-p)) presents the logarithm of the odds ratio and \theta_i the weight of the associated attribute x_i. By optimizing \theta such that the penalty function is minimized, we can obtain the probability that a hospital readmission will occur. The penalty function is:

L(\theta) = \sum_{i=1}^{m} \log\left(1 + \exp\left(-y_i (x_i^T \theta + c)\right)\right) \quad (2)

where x_i represents the input attributes, y_i the binary label which takes values from \{-1, 1\}, \theta the weight vector associated with x, m the number of examples and c random noise. This loss function presents the maximum likelihood of Formula 1. It is convex and continuous, and it tends to zero as y_i(x_i^T \theta + c) tends to infinity. In other words, if predictions are in the right direction, the function will be close to zero. We note that even if the model is certain in a prediction, there will be a small loss. The random noise c is normally distributed with zero mean. Optimization is often performed using gradient descent methods. After optimization we can calculate the probability of hospital readmission using:

p(\bar{y}) = \frac{\exp(\theta_0 + \theta_1 x_1 + \dots + \theta_n x_n)}{1 + \exp(\theta_0 + \theta_1 x_1 + \dots + \theta_n x_n)} \quad (3)

Formula 3 is obtained by solving Formula 1 for p. The probability of hospital readmission p(\bar{y}) is a value between zero and one. This formula presents the probabilistic interpretation of Formula 1, and we can say that if, e.g., \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n = 4, then the predicted probability p(\bar{y}) would be ≈ 0.98. Similarly, if \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n = -4, then p(\bar{y}) ≈ 0.02.

However, in order to make a prediction we introduce a decision threshold τ. If the probability is greater than or equal to τ, the hospital admission is classified as a readmission within 30 days, otherwise not. In this paper the value of τ is set to 0.5, which is the default value for the decision threshold. It can be argued that this is not an optimal value of the decision threshold; however, some performance metrics such as AUC are threshold independent and will be discussed accordingly. Also, we do not seek the best performing algorithm but an improvement using stacked generalization with a domain knowledge hierarchy, and setting τ to 0.5 allows the same experimental setup. Additionally, the same threshold allows the same evaluation of the proposed approach and the baseline method.

3.4 The stacking logistic regression framework

In this research we propose a stacking logistic regression framework which integrates domain knowledge from CCS, which groups clinically meaningful diagnoses in the form of a hierarchy. Integration of domain knowledge into the machine learning algorithm (in this case logistic regression) is performed using the stacking approach [36], where the structure of the stacking network is the CCS hierarchy.

Suppose we have a hierarchy H of parent-child elements. Namely, an element h from hierarchy H consists of a parent element p and a child element c. We define the set of higher medical concepts S as the non-leaf elements in hierarchy H. Additionally, we can extract each higher medical concept's level in the hierarchy and sort the set in descending order by level. Then, for each medical concept s, we select its children C and train a logistic regression model LR on dataset D using only the columns C. After model training, we apply the model to the same dataset D and create a new column with name s. We save the logistic regression model LR to a list of models M. The procedure is repeated for every medical concept s in the set of medical concepts S.

Using Figure 1 we can describe the intuition of our approach. Namely, we create a logistic regression model for Tuberculosis using attributes ICD 01000, …, ICD 1374. Similarly, ICD 0031, …, ICD 0389 are used for learning Other specified septicemia, while ICD 77181, …, ICD 99592 are used for learning Unspecified septicemia. Learning is performed with regard to hospital readmission within 30 days. Therefore, the learning phase produces probability scores of hospital readmission within 30 days.
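Formulas 1-3 and the decision threshold τ of Section 3.3 can be illustrated numerically (a minimal sketch; the weights below are made up for illustration, not learned from the data):

```python
import math

def predict_proba(theta0, theta, x):
    """Formula 3: the sigmoid of the linear score theta0 + theta . x."""
    z = theta0 + sum(t * xi for t, xi in zip(theta, x))
    return math.exp(z) / (1 + math.exp(z))

# Hypothetical weights for three binary diagnosis indicators.
theta0, theta = -1.0, [2.0, 1.5, 1.5]
p = predict_proba(theta0, theta, [1, 1, 1])  # linear score z = 4
print(round(p, 2))                           # 0.98, as stated for z = 4

tau = 0.5  # decision threshold from the paper
print("readmission" if p >= tau else "no readmission")
```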

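The training procedure of Section 3.4 can be sketched with scikit-learn (a simplified sketch: the mini-hierarchy, column names and random labels are illustrative stand-ins, not the full CCS or the SID data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
# Toy binary diagnosis columns (leaves) and a readmission label.
data = {c: rng.integers(0, 2, n)
        for c in ["ICD_0031", "ICD_0389", "ICD_77181", "ICD_99592"]}
y = rng.integers(0, 2, n)

# Concept -> children, already ordered bottom-up (deepest concepts first),
# mirroring the descending sort by hierarchy level in the paper.
hierarchy = {
    "Other_septicemia":  ["ICD_0031", "ICD_0389"],
    "Unspec_septicemia": ["ICD_77181", "ICD_99592"],
    "Septicemia":        ["Other_septicemia", "Unspec_septicemia"],
}

models = {}  # the list M of saved models
for concept, children in hierarchy.items():
    X = np.column_stack([data[c] for c in children])
    lr = LogisticRegression(max_iter=1000).fit(X, y)
    models[concept] = lr
    # New column s: the concept's predicted readmission probability,
    # which becomes an input attribute for its parent's model (stacking).
    data[concept] = lr.predict_proba(X)[:, 1]

print(sorted(models))  # three trained concept models
```

Each pass replaces a group of leaf diagnoses with one probability score, so by the time a parent concept is trained, its children's columns already exist in the dataset.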

values present values in dataset for attribute which correspond to Stable of produced model is considered as a desirable property
that higher order (more genera) medical concept. Further, Other of machine learning models in sensitive application, such as
specified septicemia and Unspecified septicemia are used for medical. It has been shown that regularization techniques such as
learning more general concept Septicemia. This process is repeated lasso tends to be unstable if dataset is slightly perturbed in presence
until the most general concept is learned, in this case Bacterial of highly correlated attributes [21]. However, utilizing hierarchy
infection. Since CCS hierarchy have more hierarchies (recall that has shown, both theoretically and practically, more stable models
CCS is forest of medical concepts) this procedure is repeated for [40].
every hierarchy. After every hierarchy is learned most general
concepts are used for hospital readmission prediction. 3.5 Experimental Setup
The idea is to group attributes into a hierarchy and gradually As a baseline classifier logistic regression will be used. The
predicting the outcome attribute, instead of predicting the outcome diagnoses (bottom level of hierarchy) are used as inputs.
variable by having all attributes in the same level of hierarchy. This Experiments are performed using 10-fold cross validation. This
idea makes sense [18], [5] as many authors have noticed that the means that each dataset is split into 10 subsets where 9 are used for
assumption of linearity among attributes does not hold for all model training and one for testing the predictive performance. The
relationship between attributes. process is repeated 10 times such that each of the subsets are used
This way of integration of domain knowledge into logistic exactly once for testing. Values of predictive performance are
regression can be seen as deep feature extraction model, which presented as the average value with standard deviation of
resembles neural networks, where domain knowledge provide the performance on the ten test sets. However, one must be careful
structure of the networks. The benefits of using domain knowledge interpreting standard deviation since samples are not independent.
are 1) alleviating linear decision boundary problem of logistic Besides classification accuracy, as a predictive performance
regression, 2) interpretability of the model and 3) more stable measure, we also report the area under the curve (AUC), precision
solutions [40]. Also, one can state that this is mixture of experts and recall. AUC can be interpreted as a probability that a random
[37] model because each feature is constructed as an independent positive example (hospital readmission within 30-days) have
expert used for hospital readmission prediction. higher probability score than random negative example (patient did
Logistic regression is known for linear decision boundary. not readmit to hospital within 30-days). Precision, which present
Because of that sometimes it has problem when there are multiple percentage of predicted hospital readmission which are truly
groups of examples on different sides of attribute space. In the hospital readmission, and recall, which is percentage of true
literature one can find several proposed solutions for this problem. hospital readmission that has been predicted as a hospital
First, most simple one, is to use higher order polynomials and readmission. For all performance measure mentioned above higher
attribute interactions. Defining higher order polynomial is prone to value is better. The decision threshold for the logistic regression
overfitting, and therefore attribute interaction is more useful. Interactions can be defined automatically from data or through a hierarchy provided by a domain expert. This approach has already yielded performance improvements in hospital readmission prediction [26]. The second approach involves kernel methods [41]. Using kernels yields a nonlinear decision boundary in the original input space by constructing a linear decision boundary in a transformed version of that space. The transformed space can, theoretically, be infinite-dimensional. However, this approach is seldom used. Finally, the output of one logistic regression can be used as an input to another logistic regression (similar to a neural network). We utilized the latter approach because we believe it is highly compatible with domain knowledge, which is widely available and heavily underused.

A hierarchy consists of concepts created in the domain, and it can therefore be interpreted by a domain expert. Logistic regression is considered a highly interpretable predictive algorithm: its coefficients can be interpreted as increases or decreases in the odds of an event (in this paper, hospital readmission). In the first step of the proposed approach, we obtain coefficients for specific diagnoses. The result of that step is the probability of hospital readmission for a higher-order concept. The process is repeated, and we obtain the odds of hospital readmission for the higher concept, which are again interpretable as odds of hospital readmission.

model was set to 0.5 in order for each predictive model to be comparable.

We note that the datasets used in the experimental setup differ in the number of examples and columns. Additionally, each dataset exhibits class imbalance, ranging from 7.62% to 29.96% of readmitted patients. A description of the datasets is presented in Table 1. The hierarchy used for stacked learning is CCS for ICD-9-CM, as explained in Section 3.2.

4 RESULTS

The experimental results are presented in Table 2. As stated in the experimental setup section, AUC, classification accuracy (CA), precision, and recall are reported. Logistic regression, used as the baseline method, is denoted Basic, while the newly proposed method is denoted Stacked Domain Hierarchy (SDH). We note that we could select a decision threshold such that CA, precision, and recall are optimized; however, this would lead to performances that would not be comparable.
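The stacked use of logistic regression outlined above can be sketched as follows. This is a minimal illustration under stated assumptions: the two-concept hierarchy, the column grouping, and the data are synthetic placeholders, not the CCS/ICD-9-CM hierarchy or the patient cohorts used in the paper.

```python
# Sketch: one first-level logistic regression per higher-order concept
# (a group of diagnosis columns), whose predicted probabilities feed a
# second-level logistic regression, as in stacking.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 6)).astype(float)  # binary diagnosis indicators
y = (X[:, 0] + X[:, 3] + rng.normal(0, 0.5, 500) > 1.5).astype(int)

# Hypothetical domain hierarchy: each higher-order concept covers a
# subset of the specific diagnoses.
hierarchy = {"concept_A": [0, 1, 2], "concept_B": [3, 4, 5]}

# Level 1: fit one logistic regression per concept on its diagnoses only;
# its output is a readmission probability for that concept.
level1 = {}
meta_features = np.zeros((X.shape[0], len(hierarchy)))
for j, (concept, cols) in enumerate(hierarchy.items()):
    model = LogisticRegression().fit(X[:, cols], y)
    level1[concept] = model
    meta_features[:, j] = model.predict_proba(X[:, cols])[:, 1]

# Level 2: stack the concept-level probabilities with a final logistic
# regression; its coefficients remain interpretable as odds changes per
# higher-order concept.
level2 = LogisticRegression().fit(meta_features, y)
pred = (level2.predict_proba(meta_features)[:, 1] >= 0.5).astype(int)
```

In a full stacked-generalization setup, the level-1 probabilities fed to the level-2 model would be produced out-of-fold to avoid overfitting; here they are fitted on the same data only for brevity.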

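The evaluation protocol described above, with a fixed decision threshold of 0.5 for CA, precision, and recall, and AUC computed threshold-free, can be sketched as follows. The labels and scores here are synthetic placeholders, not the study's readmission data.

```python
# Sketch: thresholding every model at 0.5 makes CA, precision, and
# recall comparable across models, while AUC does not depend on the
# threshold at all.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)
# Predicted readmission probabilities, mildly correlated with the labels.
y_score = np.clip(0.3 * y_true + rng.uniform(0, 0.7, size=200), 0, 1)

y_pred = (y_score >= 0.5).astype(int)  # fixed decision threshold of 0.5

auc = roc_auc_score(y_true, y_score)   # threshold-independent
ca = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
```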
Table 2: Experimental evaluation of the proposed model for hospital readmission prediction

Dataset     Method  AUC            CA             Precision      Recall
Anemia      Basic   0.696 ± 0.030  0.720 ± 0.027  0.547 ± 0.069  0.375 ± 0.044
Anemia      SDH     0.722 ± 0.037  0.747 ± 0.023  0.638 ± 0.061  0.365 ± 0.038
ER          Basic   0.599 ± 0.035  0.790 ± 0.012  0.390 ± 0.060  0.131 ± 0.021
ER          SDH     0.618 ± 0.034  0.797 ± 0.014  0.435 ± 0.103  0.112 ± 0.029
Epilepsy    Basic   0.601 ± 0.019  0.813 ± 0.005  0.426 ± 0.124  0.070 ± 0.021
Epilepsy    SDH     0.602 ± 0.018  0.818 ± 0.014  0.506 ± 0.101  0.063 ± 0.019
ARI         Basic   0.569 ± 0.111  0.357 ± 0.355  0.238 ± 0.179  0.766 ± 0.377
ARI         SDH     0.749 ± 0.019  0.861 ± 0.020  0.406 ± 0.100  0.109 ± 0.049
Asthma      Basic   0.500 ± 0.000  0.095 ± 0.005  0.095 ± 0.005  1.000 ± 0.000
Asthma      SDH     0.717 ± 0.029  0.901 ± 0.005  0.311 ± 0.144  0.034 ± 0.019
Pneumonia   Basic   0.649 ± 0.052  0.920 ± 0.006  0.271 ± 0.159  0.029 ± 0.020
Pneumonia   SDH     0.685 ± 0.046  0.922 ± 0.006  0.399 ± 0.310  0.033 ± 0.025

It can be noticed that the SDH model outperforms the Basic model on AUC and classification accuracy on every dataset. More specifically, on average the SDH models have 9.58% better AUC than the Basic model. The other performance measures, except recall, are also better with the proposed SDH model: classification accuracy is on average 14.37% better and precision 12.13% better, while recall is 31.93% worse. It is worth mentioning that the standard deviations of AUC and of the other performance measures (although they cannot be directly interpreted as standard deviations) are also smaller.

Further inspection of Table 2 leads to the dataset for which the improvement is not significant: Epilepsy. Inspection of the papers regarding hospital readmission among epilepsy patients in pediatric hospitals shows that patient characteristics such as prior health care utilization, demographics, insurance status, social history, patient engagement (i.e., missed appointments), and use of specialty care are the most important features for prediction [8]. None of those features are available in the diagnosis codes of the ICD-9-CM hierarchy. However, we can see a dramatic improvement on the Asthma dataset. Inspection of classification accuracy and the other metrics suggests that the decision threshold was set too high to classify examples as hospital readmissions. Moreover, the basic logistic regression model achieved an AUC of 0.500, which means that the predictive model was only as good as a random model. A significant improvement in terms of AUC is also observed on the Acute Respiratory Infections (ARI) dataset.

Having the performance improvements in mind, we can conclude that introducing more general concepts and making predictions based on them does improve the predictive performance of logistic regression. As we can observe by combining Table 1 and Table 2, the size of the dataset is not correlated with predictive performance. Therefore, the main reason for the performance improvement is the introduction of a non-linear decision boundary. This leads us to conclude that interactions of diagnoses at higher-order concepts have an influence which cannot be seen just by inspecting specific diagnoses. Additionally, since the standard deviations are smaller for the SDH models, we can say that the stacking logistic regression framework improves model stability, as shown in [40].

5 CONCLUSION

In this paper we proposed the integration of domain knowledge into logistic regression models on real-world medical data. The reported results show that the proposed framework consistently improves the predictive performance of the logistic regression models. The proposed framework utilizes domain knowledge in the form of hierarchies, i.e. forests, to perform stacking. Since a hierarchy is used, we can say that our framework is a type of mixture-of-experts model. The contributions of this paper are methodological, namely a framework that allows the integration of domain or expert knowledge into logistic regression. The integration is performed in such a manner that the hierarchies extend classical logistic regression in a stacking framework. One can say that this methodology can be seen as a neural network where domain knowledge provides the structure of the network. It is expected to perform better than the traditional approach because of the ensemble nature of the framework. Additionally, the proposed framework allows non-linear problems to be solved and yields more stable solutions (less prone to overfitting).

However, further experimentation, both on simulated and on real-world datasets, is needed in order to validate the quality of the integration of domain knowledge in the form of hierarchies (forests) into the proposed stacking framework. A possible option for validation is a combination of randomly chosen attributes in the proposed framework, in order to test whether the results improve because of the domain knowledge or because of the stacking framework itself. Also, we can test hierarchical clustering in order to obtain a hierarchy (or forest); namely, we can extract hierarchies from data using, e.g., the AGNES or DIANA algorithms and use those hierarchies instead of domain-knowledge-based hierarchies. If the SDH framework outranks these other approaches, we could be confident that the expert knowledge makes the difference, and not the methodology itself. Further experimentation is left for future studies.

REFERENCES
[1] Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, and Peter F. Patel-Schneider (Eds.). 2003. The description logic handbook: Theory, implementation and applications. Cambridge University Press.
[2] Jay G. Berry, Sara L. Toomey, Alan M. Zaslavsky, Ashish K. Jha, Mari M. Nakamura, David J. Klein, Jeremy Y. Feng, Shanna Shulman, Vincent W. Chiang, William Kaplan, Matt Hall, and Mark A. Schuster. 2013. Pediatric

readmission prevalence and variability across hospitals. JAMA 309, 4 (2013), 372-380.
[3] Hendrik Blockeel, Sašo Džeroski, and Jasna Grbović. 1999. Simultaneous prediction of multiple chemical parameters of river water quality with TILDE. In European Conference on Principles of Data Mining and Knowledge Discovery, Springer, Berlin, Heidelberg, 32-40.
[4] Marko Bohanec, and Boris Delibašić. 2015. Data-mining and expert models for predicting injury risk in ski resorts. In International Conference on Decision Support System Technology, Springer, Cham, 46-60.
[5] Marko Bohanec, and Vladislav Rajkovič. 1990. DEX: An expert system shell for decision support. Sistemica 1, 1 (1990), 145-157.
[6] Boris Delibašić, Sandro Radovanović, Miloš Jovanović, Marko Bohanec, and Milija Suknović. 2018. Integrating knowledge from DEX hierarchies into a logistic regression stacking model for predicting ski injuries. Journal of Decision Systems (2018), 1-8.
[7] Pedro Domingos. 1999. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (1999), ACM, 155-164.
[8] Zachary M. Grinspan, Anup D. Patel, Baria Hafeez, Erika L. Abramson, and Lisa M. Kern. 2018. Predicting frequent emergency department use among children with epilepsy: A retrospective cohort study using electronic health data from 2 centers. Epilepsia 59, 1 (2018), 155-169.
[9] Vida Groznik, Matej Guid, Aleksander Sadikov, Martin Možina, Dejan Georgiev, Veronika Kragelj, Simo Ribarič, Zvezdan Pirtošek, and Ivan Bratko. 2013. Elicitation of neurological knowledge with argument-based machine learning. Artificial Intelligence in Medicine 57, 2 (2013), 133-144.
[10] Healthcare Cost and Utilization Project. 2011. Clinical Classifications Software (CCS) for ICD-9-CM. Available at: www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp. Accessed January 23, 2018.
[11] Andreas Holzinger. 2016. Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Informatics 3, 2 (2016), 119-131.
[12] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. New York: Springer.
[13] Miloš Jovanović, Sandro Radovanović, Milan Vukićević, Sven Van Poucke, and Boris Delibašić. 2016. Building interpretable predictive models for pediatric hospital readmission using Tree-Lasso logistic regression. Artificial Intelligence in Medicine 72 (2016), 12-21.
[14] Iman Kamkar, Sunil K. Gupta, Dinh Phung, and Svetha Venkatesh. 2015. Stable feature selection for clinical prediction: Exploiting ICD tree structure using Tree-Lasso. Journal of Biomedical Informatics 53 (2015), 277-290.
[15] Branko Kavšek, and Nada Lavrač. 2004. Analysis of example weighting in subgroup discovery by comparison of three algorithms on a real-life data set. In Proceedings of the 15th European Conference on Machine Learning and 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (2004), 64-76.
[16] Seyoung Kim, and Eric P. Xing. 2010. Tree-guided group lasso for multi-task regression with structured sparsity. In Proceedings of the 27th International Conference on Machine Learning (2010).
[17] Stefan Kramer, Nada Lavrač, and Peter Flach. 2001. Propositionalization approaches to relational data mining. In Relational Data Mining, Springer, Berlin, Heidelberg, 262-291.
[18] Niels Landwehr, Mark Hall, and Eibe Frank. 2005. Logistic model trees. Machine Learning 59, 1-2 (2005), 161-205.
[19] Nada Lavrač, and Sašo Džeroski. 1994. Inductive logic programming. In WLP (1994), 146-160.
[20] Gjorgji Madjarov, Dragi Kocev, Dejan Gjorgjevikj, and Sašo Džeroski. 2012. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition 45, 9 (2012), 3084-3104.
[21] Nicolai Meinshausen, and Peter Bühlmann. 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, 4 (2010), 417-473.
[22] Sanjay Modgil, Francesca Toni, Floris Bex, Ivan Bratko, Carlos I. Chesnevar, Wolfgang Dvořák, Marcelo A. Falappa, et al. 2013. The added value of argumentation. In Agreement Technologies, Springer, Dordrecht, 357-403.
[23] Stephen Muggleton. 1995. Inverse entailment and Progol. New Generation Computing 13, 3-4 (1995), 245-286.
[24] NIS, HCUP Nationwide Inpatient Sample. 2011. Healthcare Cost and Utilization Project (HCUP).
[25] John Ross Quinlan. 1990. Learning logical definitions from relations. Machine Learning 5, 3 (1990), 239-266.
[26] Sandro Radovanovic, Milan Vukicevic, Ana Kovacevic, Gregor Stiglic, and Zoran Obradovic. 2015. Domain knowledge based hierarchical feature selection for 30-day hospital readmission prediction. In Conference on Artificial Intelligence in Medicine in Europe, Springer, Cham, 96-100.
[27] Petar Ristoski, and Heiko Paulheim. 2014. Feature selection in hierarchical feature spaces. In International Conference on Discovery Science, Springer, Cham, 288-300.
[28] Luiz M. Romao, and Julio C. Nievola. 2012. Hierarchical classification of gene ontology with learning classifier systems. In Ibero-American Conference on Artificial Intelligence, Springer, Berlin, Heidelberg (2012), 120-129.
[29] Ashwin Srinivasan. 2001. The Aleph manual. Available at http://www.cs.ox.ac.uk/activities/machinelearning/Aleph/aleph.html.
[30] Gregor Stiglic, Petra P. Brzan, Nino Fijacko, Fei Wang, Boris Delibasic, Alexandros Kalousis, and Zoran Obradovic. 2015. Comprehensible predictive modeling using regularized logistic regression and comorbidity based features. PLoS ONE 10, 12, e0144439.
[31] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. 2008. Effective and efficient multilabel classification in domains with large number of labels. In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD'08) 21 (2008), 53-59.
[32] Milan Vukicevic, Sandro Radovanovic, Ana Kovacevic, Gregor Stiglic, and Zoran Obradovic. 2015. Improving hospital readmission prediction using domain knowledge based virtual examples. In International Conference on Knowledge Management in Organizations, Springer, Cham (2015), 695-706.
[33] Milan Vukicevic, Sandro Radovanovic, Gregor Stiglic, Boris Delibasic, Sven Van Poucke, and Zoran Obradovic. 2016. A data and knowledge driven randomization technique for privacy-preserving data enrichment in hospital readmission prediction. In 5th Workshop on Data Mining for Medicine and Healthcare (2016), 10.
[34] Lei Wang, Yan Gao, Kap L. Chan, Ping Xue, and Wei-Yun Yau. 2005. Retrieval with knowledge-driven kernel design: an approach to improving SVM-based CBIR with relevance feedback. In Tenth IEEE International Conference on Computer Vision (ICCV 2005) 2, IEEE (2005), 1355-1362.
[35] Geoffrey I. Webb, Jason Wells, and Zijian Zheng. 1999. An experimental evaluation of integrating machine learning with knowledge acquisition. Machine Learning 35, 1, 5-23.
[36] Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal. 2016. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.
[37] Lei Xu, Michael I. Jordan, and Geoffrey E. Hinton. 1995. An alternative model for mixtures of experts. In Advances in Neural Information Processing Systems (1995), 633-640.
[38] Ting Yu, Tony Jan, Simeon Simoff, and John Debenham. 2007. Incorporating prior domain knowledge into inductive machine learning. Unpublished doctoral dissertation, Computer Sciences.
[39] Bianca Zadrozny, John Langford, and Naoki Abe. 2003. Cost-sensitive learning by cost-proportionate example weighting. In Third IEEE International Conference on Data Mining (ICDM 2003), IEEE, 435-442.
[40] Jiayu Zhou, Zhaosong Lu, Jimeng Sun, Lei Yuan, Fei Wang, and Jieping Ye. 2013. FeaFiner: biomarker identification from medical data through feature generalization and selection. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM (2013), 1034-1042.
[41] Ji Zhu, and Trevor Hastie. 2005. Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics 14, 1, 185-205.
