
Knowledge-Based Systems 203 (2020) 106164


SurvLIME: A method for explaining machine learning survival models



Maxim S. Kovalev, Lev V. Utkin ∗, Ernest M. Kasimov
Peter the Great St. Petersburg Polytechnic University (SPbPU), St. Petersburg, Russia
∗ Corresponding author. E-mail addresses: maxkovalev03@gmail.com (M.S. Kovalev), lev.utkin@gmail.com (L.V. Utkin), kasimov.ernest@gmail.com (E.M. Kasimov).

Article history: Received 17 March 2020; Received in revised form 28 May 2020; Accepted 17 June 2020; Available online 22 June 2020.

Keywords: Interpretable model; Explainable AI; Survival analysis; Censored data; Convex optimization; The Cox model

Abstract: A new method called SurvLIME for explaining machine learning survival models is proposed. It can be viewed as an extension or modification of the well-known method LIME. The main idea behind the proposed method is to apply the Cox proportional hazards model to approximate the survival model in the local area around a test example. The Cox model is used because it considers a linear combination of the example covariates, so that the coefficients of the covariates can be regarded as quantitative impacts on the prediction. Another idea is to approximate the cumulative hazard functions of the explained model and the Cox model by using a set of perturbed points in a local area around the point of interest. The method reduces to solving an unconstrained convex optimization problem. Numerous numerical experiments demonstrate the efficiency of SurvLIME.

© 2020 Elsevier B.V. All rights reserved. https://doi.org/10.1016/j.knosys.2020.106164

1. Introduction

Many complex problems in various applications are nowadays solved by means of deep machine learning models, in particular deep neural networks. One demonstrative example is disease diagnosis by models trained on medical images or other medical information. At the same time, deep learning models often work as black-box models whose internal functioning is completely unknown, and it is difficult to explain how a certain result or decision is reached. As a result, machine learning models face difficulties in being incorporated into many important applications, for example into medicine, where doctors need an explanation of a stated diagnosis in order to choose a corresponding treatment. The lack of explanation elements in many machine learning models has motivated the development of methods which could interpret or explain deep machine learning predictions and help understand the decision-making process or the key factors involved in a decision [1–4].

The methods explaining black-box machine learning models can be divided into two main groups: local methods, which derive an explanation locally around a test example, and global methods, which try to explain the overall behavior of the model. A key component of explanations is the contribution of individual input features: a prediction is considered explained when every feature is assigned a number quantifying its impact on the prediction. One of the first local explanation methods is Local Interpretable Model-agnostic Explanations (LIME) [5], which uses simple and easily understandable linear models to locally approximate the predictions of black-box models. The main intuition of LIME is that the explanation may be derived locally from a set of synthetic examples generated randomly in the neighborhood of the example to be explained, with every synthetic example weighted according to its proximity to the explained example. Moreover, the method uses simple and easily understandable models like decision rules or linear models to locally or globally approximate the predictions of black-box models. The method is agnostic to the black-box model: no details of the black-box model need to be known, and only its input and the corresponding output are used for training the explanation model. It is important to mention the work [6], which provides a thorough theoretical analysis of LIME.

LIME, as well as other methods, has been successfully applied to many machine learning models for explanation. However, to the best of our knowledge, there is a large class of models for which no explanation methods exist: models applied to problems which take into account survival aspects of applications. Survival analysis, as a basis for these models, can be regarded as a fundamental tool used in many applied areas, especially in medicine. Survival models can be divided into three parts: parametric, nonparametric and semiparametric [7,8]. Most machine learning models are based on nonparametric and semiparametric survival models. One of the popular regression models for the analysis of survival data is the well-known Cox proportional hazards model, which is a semiparametric model that calculates the effects of observed covariates on the risk of an event occurring, for example, death or failure [9].

The proportional hazards assumption in the Cox model means that different examples (patients) have hazard functions that are proportional, i.e., the ratio of the hazard functions for two examples with different prognostic factors or covariates is a constant and does not vary with time. The model assumes that a patient's log-risk of failure is a linear combination of the example covariates. This is a very important peculiarity of the Cox model, which will be used below in the explanation models.

A lot of machine learning implementations of survival analysis models have been developed [8], and most of them (random survival forests, deep neural networks) can be regarded as black-box models. Therefore, the problem of explaining survival analysis results is topical. However, in contrast to other machine learning models, one of the main difficulties of survival model explanation is that the result of most survival models is a time-dependent function (the survival function, the hazard function, the cumulative hazard function, etc.). This implies that many available explanation methods like LIME cannot be applied directly to survival models. In order to cope with this difficulty, a new method called SurvLIME (Survival LIME) for explaining machine learning survival models is proposed, which can be viewed as an extension or modification of LIME. The main idea of the proposed method is to apply the Cox proportional hazards model to approximate the survival model in a local area around a test example. It has been mentioned that the Cox model considers a linear combination of the example covariates. Moreover, it is important that the covariates as well as their combination do not depend on time. Therefore, the coefficients of the covariates can be regarded as quantitative impacts on the prediction. However, we approximate not a point-valued black-box model prediction but functions, for example, the cumulative hazard function (CHF). In accordance with the proposed explanation method, synthetic examples are randomly generated around the explainable example, and the CHF is calculated for every synthetic example by means of the black-box survival model. Simultaneously, we write the CHF corresponding to the approximating Cox model as a function of the coefficients of interest. By writing the distance between the CHF provided by the black-box survival model and the CHF of the approximating Cox model, we construct an unconstrained convex optimization problem for computing the coefficients of the covariates. Numerical results using synthetic and real data illustrate the proposed method.

The paper is organized as follows. Related work can be found in Section 2. A short description of basic concepts of survival analysis, including the Cox model, is given in Section 3. Basic ideas of the method LIME are briefly considered in Section 4. Section 5 provides a description of the proposed SurvLIME and its basic ideas. A formal derivation of the convex optimization problem for determining important features and a scheme of an algorithm implementing SurvLIME can be found in Section 6. Numerical experiments with synthetic data are provided in Section 7. Similar numerical experiments with real data are given in Section 8. Discussion of different peculiarities of SurvLIME can be found in Section 9. Concluding remarks are provided in Section 10.

2. Related work

Local explanation methods. A lot of methods have been developed to locally explain black-box models. Along with the original LIME [5], many of its modifications have been proposed due to the success and simplicity of the method, for example ALIME [10], NormLIME [11], DLIME [12], Anchor LIME [13], LIME-SUP [14], LIME-Aleph [15], GraphLIME [16]. Another very popular method is SHAP [17], which takes a game-theoretic approach to optimizing a regression loss function based on Shapley values [18]. Alternative methods are influence functions [19], a multiple hypothesis testing framework [20], and many others.

An increasingly important family of methods is based on counterfactual explanations [21], which try to explain what to do in order to achieve a desired outcome by finding changes to some features of an explainable input example such that the resulting data point, called a counterfactual, has a different specified prediction than the original input. Due to the intuitive and human-friendly explanations provided by this family of methods, it is growing very quickly [22–25]. Counterfactual modifications of LIME have also been proposed by Ramon et al. [26] and White and Garcez [27].

Many of the aforementioned explanation methods, starting from LIME [5], are based on perturbation techniques [28–31]. These methods assume that the contribution of a feature can be determined by measuring how the prediction score changes when the feature is altered [32]. One of the advantages of perturbation techniques is that they can be applied to a black-box model without any need to access the internal structure of the model. A possible disadvantage of perturbation techniques is their computational complexity when the perturbed input examples are of high dimensionality. Descriptions of many explanation methods and approaches, together with critical reviews, can be found in survey papers [2,33–36].

It should be noted that most explanation methods deal with the point-valued predictions produced by black-box models. By point-valued predictions we mean some finite set of possible model outcomes, for example, classes of examples. The main problem of the considered survival models is that their outcome is a function. Therefore, we propose a new approach, which uses LIME as a possible tool for its implementation, dealing with CHFs as the model outcomes.

Machine learning models in survival analysis. A review of survival analysis methods is presented by Wang et al. [8]. The Cox model is a very powerful and popular method for dealing with survival data. Therefore, a lot of approaches modifying the Cox model have been proposed over the last decades. In particular, Tibshirani [37] proposed a modification of the Cox model based on the Lasso method in order to take into account the high dimensionality of survival data. Following this paper, several modifications of the Lasso methods for the Cox model were introduced [38–42]. In order to relax the linear relationship assumption accepted in the Cox model, simple neural networks as well as deep neural networks were proposed by several authors [43–47]. The SVM approach to survival analysis has also been studied by several authors [48–51]. It turned out that random survival forests (RSFs) became a very powerful, efficient and popular tool for survival analysis. Therefore, this tool and its modifications, which can be regarded as extensions of the standard random forest [52] to survival data, were proposed and investigated in many papers, for example in [53–60], in order to take into account the limited survival data.

Most of the above models dealing with survival data can be regarded as black-box models and should be explained. However, only the Cox model has a simple explanation due to its linear relationship between covariates. Therefore, it can be used to approximate more powerful models, including survival deep neural networks and RSFs, in order to explain the predictions of these models.

3. Some elements of survival analysis

3.1. Basic concepts

In survival analysis, an example (patient) i is represented by a triplet (x_i, δ_i, T_i), where x_i^T = (x_{i1}, ..., x_{id}) is the vector of the patient parameters (characteristics) or the vector of the example features, and T_i is the time to event of the example.

If the event of interest is observed, then T_i corresponds to the time between the baseline time and the time of the event, and we have an uncensored observation with δ_i = 1. If the event is not observed and its time to event is greater than the observation time, then T_i corresponds to the time between the baseline time and the end of the observation, and we have a censored observation with δ_i = 0. We deal only with right-censoring, where the observed survival time is less than or equal to the true survival time. Suppose a training set D consists of n triplets (x_i, δ_i, T_i), i = 1, ..., n. The goal of survival analysis is to estimate the time to the event of interest T for a new example (patient) with a feature vector denoted by x by using the training set D.

The survival and hazard functions are key concepts in survival analysis for describing the distribution of event times. The survival function, denoted by S(t) as a function of time t, is the probability of surviving up to that time, i.e., S(t) = Pr{T > t}. The hazard function h(t) is the rate of the event at time t given that no event occurred before time t, i.e., it describes the relative likelihood of the event occurring at time t, conditional on the subject's survival up to that time. The hazard function can be defined as

h(t) = f(t) / S(t),  (1)

where f(t) is the density function of the event of interest. By using the fact that the density function can be expressed through the survival function as

f(t) = −dS(t)/dt,  (2)

we can write the following expression for the hazard rate:

h(t) = −(d/dt) ln S(t).  (3)

Another important concept in survival analysis is the CHF H(t), which is defined as the integral of the hazard function h(t) and can be interpreted as the probability of an event at time t given survival until time t, i.e.,

H(t) = ∫_{−∞}^{t} h(z) dz.  (4)

The survival function is determined through the hazard function and through the CHF as follows:

S(t) = exp( −∫_{0}^{t} h(z) dz ) = exp(−H(t)).  (5)

The dependence of the above functions on a feature vector x is omitted for brevity.

To compare survival models, the C-index proposed by Harrell et al. [61] is used. It estimates how good a survival model is at ranking survival times: it is the probability that, in a randomly selected pair of examples, the example that fails first had the worse predicted outcome, i.e., the probability that the event times of a pair of examples are correctly ranked.
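To make the relations (1)–(5) concrete, the following minimal sketch (Python with NumPy; the time grid and hazard values are purely illustrative assumptions, not taken from the paper) evaluates a hazard function on a discrete grid, accumulates it into a CHF, and recovers the survival function via S(t) = exp(−H(t)).

```python
import numpy as np

# Illustrative sketch (not from the paper): a hazard h(t) on a discrete grid,
# its cumulative hazard H(t) as in Eq. (4) and the survival function from Eq. (5).
t = np.linspace(0.0, 10.0, 101)          # time grid, arbitrary choice
h = 0.05 + 0.01 * t                      # assumed increasing hazard rate

dt = np.diff(t, prepend=t[0])            # step sizes (first step is 0)
H = np.cumsum(h * dt)                    # H(t) ~ integral of h up to t
S = np.exp(-H)                           # Eq. (5): S(t) = exp(-H(t))

# Eq. (3) in reverse: the hazard recovered as -d/dt ln S(t)
h_back = -np.gradient(np.log(S), t)

print(S[0], S[-1])                                 # S decreases from ~1 towards 0
print(np.allclose(h_back[1:], h[1:], atol=1e-2))   # numerical agreement with h
```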
3.2. The Cox model

Let us consider the main concepts of the Cox proportional hazards model [7,9]. According to the model, the hazard function at time t given predictor values x is defined as

h(t|x, b) = h_0(t) Ψ(x, b) = h_0(t) exp(ψ(x, b)).  (6)

Here h_0(t) is a baseline hazard function which does not depend on the vectors x and b; Ψ(x, b) is the covariate effect or the risk function; b^T = (b_1, ..., b_m) is an unknown vector of regression coefficients or parameters. The baseline hazard function represents the hazard when all of the covariates are equal to zero, i.e., it describes the hazard for the zero feature vector x^T = (0, ..., 0). It can be seen from the above expression for the hazard function that the reparametrization Ψ(x, b) = exp(ψ(x, b)) is used in the Cox model. The function ψ(x, b) in the model is linear, i.e.,

ψ(x, b) = b^T x = Σ_{k=1}^{m} b_k x_k.  (7)

A hazard ratio is the ratio of two hazard functions h(t|x_1, b) and h(t|x_2, b); for the Cox model it is constant over time. In the framework of the Cox model, the survival function S(t|x, b) is computed as

S(t|x, b) = exp( −H_0(t) exp(ψ(x, b)) ) = ( S_0(t) )^{exp(ψ(x, b))}.  (8)

Here H_0(t) is the cumulative baseline hazard function, and S_0(t) is the baseline survival function, defined as the survival function for the zero feature vector x^T = (0, ..., 0). It is important to note that the functions H_0(t) and S_0(t) do not depend on x and b.

The partial likelihood in this case is defined as follows [9]:

L(b) = Π_{j=1}^{n} [ exp(ψ(x_j, b)) / Σ_{i ∈ R_j} exp(ψ(x_i, b)) ]^{δ_j}.  (9)

Here R_j is the set of patients who are at risk at time t_j ("at risk at time t" means patients who die at time t or later), and x_j denotes the covariate vector of the unit that actually experienced the event at time t_j. The jth term in the product equals 1 when the indicator δ_j is 0, which implies that only uncensored observations affect the partial likelihood.
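The short sketch below illustrates Eqs. (6)–(8) under assumed baseline values and coefficients (none of the numbers come from the paper): the time-independent factor exp(b^T x) scales the baseline CHF, and the hazard ratio between two examples is constant over time.

```python
import numpy as np

# Sketch of Eqs. (6)-(8) with assumed quantities: baseline CHF values H0 on a
# time grid and regression coefficients b.
t = np.array([1.0, 2.0, 5.0, 7.0, 10.0])        # event times (assumed)
H0 = np.array([0.05, 0.12, 0.30, 0.55, 0.90])    # baseline CHF values (assumed)
b = np.array([0.4, -0.2, 0.0])                   # regression coefficients (assumed)

def cox_chf(x):
    """CHF form of Eq. (8): H(t|x, b) = H0(t) * exp(b^T x)."""
    return H0 * np.exp(b @ x)

def cox_survival(x):
    """Eq. (8): S(t|x, b) = exp(-H0(t) exp(b^T x)) = S0(t)^{exp(b^T x)}."""
    return np.exp(-cox_chf(x))

x1, x2 = np.array([1.0, 0.5, 2.0]), np.array([0.0, 1.0, 1.0])
ratio = cox_chf(x1) / cox_chf(x2)     # hazard ratio between two examples
print(np.allclose(ratio, ratio[0]))   # True: proportional hazards (time-independent)
print(cox_survival(x1))
```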
4. LIME

Before studying the LIME modification for survival data, the method itself is briefly described.

LIME approximates a black-box model, denoted q, with a simple function g in the vicinity of the point of interest x, whose prediction by means of q has to be explained, under the condition that the approximation function g belongs to a set of explanation models G, for example, linear models. In order to construct the function g in accordance with LIME, a new dataset consisting of perturbed samples is generated, and predictions corresponding to the perturbed samples are obtained by means of the explained model. New samples are assigned weights w_x in accordance with their proximity to the point of interest x, computed using a distance metric, for example the Euclidean distance.

An explanation (local surrogate) model is trained on the newly generated samples by solving the following optimization problem:

arg min_{g ∈ G} L(q, g, w_x) + Φ(g).  (10)

Here L is a loss function, for example mean squared error, which measures how close the explanation is to the prediction of the black-box model, and Φ(g) is the model complexity.

The result of the original LIME is a local linear model, and the prediction is explained by analyzing the coefficients of this local linear model.
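As an illustration of the surrogate-fitting idea in (10), the sketch below fits a weighted linear model with an intercept around a point of interest. The black-box function, the Gaussian proximity kernel and the ridge penalty standing in for Φ(g) are assumptions made for this example, not choices prescribed by LIME or by this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    # Stand-in for the explained model q; any scalar-output predictor works here.
    return np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

x0 = np.array([0.3, -0.8])                 # point of interest (assumed)
N, d, sigma = 1000, x0.size, 0.5           # perturbation settings (assumed)

X = x0 + sigma * rng.normal(size=(N, d))   # perturbed samples around x0
y = black_box(X)
w = np.exp(-np.sum((X - x0) ** 2, axis=1) / sigma ** 2)   # proximity weights w_x

# Weighted ridge regression as the surrogate g in (10); the ridge term
# lam * I plays the role of the complexity penalty Phi(g).
A = np.hstack([X, np.ones((N, 1))])        # linear model with intercept
lam = 1e-3
lhs = A.T @ (A * w[:, None]) + lam * np.eye(d + 1)
rhs = A.T @ (w * y)
coef = np.linalg.solve(lhs, rhs)
print("local feature impacts:", coef[:d])
```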

5. A basic sketch of SurvLIME

Suppose that a training set D and a black-box model are available. For every new example x, the black-box model with input x produces the corresponding output in the form of the CHF H(t|x) or the hazard function h(t|x). The basic idea behind the explanation algorithm SurvLIME is to approximate the output of the black-box model in the form of the CHF by means of the CHF produced by the Cox model for the same input example x. The idea stems from the fact that the function ψ(x, b) in the Cox model (see (6)) is linear and does not depend on time t. The linearity means that the coefficients b_i of the covariates in ψ(x, b) can be regarded as quantitative impacts on the prediction; hence, the largest coefficients indicate the corresponding important features. The independence of ψ(x, b) from time t means that time and covariates can be considered separately, and the optimization problem minimizing the difference between CHFs is significantly simplified.

In order to find the important features of x, we have to compute optimal values of the elements of the vector b (see the previous section) of the Cox model such that H(t|x) is as close as possible to the Cox CHF denoted by H_Cox(t|x, b). However, the use of a single point may lead to incorrect results. Therefore, we generate many nearest points x_k in a local area around x. For every generated point x_k, the CHF H(t|x_k) of the black-box model can be computed as a prediction provided by the black-box survival model, and the Cox CHF H_Cox(t|x_k, b) can be written as a function of the unknown vector b. The optimal values of b can then be computed by minimizing the average distance between every pair of CHFs H(t|x_k) and H_Cox(t|x_k, b) over all generated points x_k. Every distance between CHFs has a weight w_k which depends on the distance between x_k and x: smaller distances between x_k and x produce larger weights of the distances between CHFs. While explanation methods like LIME deal with point-valued predictions of the example processed through the black-box model, the proposed method is based on functions characterizing every point from D or new points x. Fig. 1 illustrates the explanation algorithm. It can be seen from Fig. 1 that a set of examples {x_1, ..., x_N} is fed to the black-box survival model, which produces a set of CHFs {H(t|x_1), ..., H(t|x_N)}. Simultaneously, we write the CHFs H_Cox(t|x_k, b), k = 1, ..., N, as functions of the coefficients b for all generated examples. The weighted average distance between the CHFs of the Cox model and of the black-box survival model allows us to construct an optimization problem and to compute the optimal vector b by solving it. The distance between two functions is defined in the metric L_2. The weights w_k, k = 1, ..., N, are defined in accordance with the distance between x_k and x by means of a kernel; an example of the kernel used is given below in (24). One should distinguish between the following distances: distances between the CHFs H(t|x_k) and H_Cox(t|x_k, b), which are to be minimized for computing the optimal vector b, and distances between the vectors x_k and x, which allow us to take into account how far the vectors x_k and x are from each other.

Fig. 1. A schematic illustration of the explanation algorithm.

It should be noted that the above description is only a sketch in which every step is a separate task. All steps are considered in detail below.

6. Minimization of distances between functions

It has been mentioned that the main peculiarity of machine learning survival models is that their output is a function (the CHF or the survival function). Therefore, in order to approximate the output of the black-box model by means of the CHF produced by the Cox model at the input example x, we have to generate many points x_k in a local area around x and to consider the mean distance between the CHFs for the generated points x_k, k = 1, ..., N, and the Cox model CHF for the point x. Before deriving the mean distance and minimizing it, we introduce some notation and conditions.

Let t_0 < t_1 < ... < t_m be the distinct times to the event of interest, for example times to death, from the set {T_1, ..., T_n}, where t_0 = min_{k=1,...,n} T_k and t_m = max_{k=1,...,n} T_k. The black-box model maps the feature vectors x ∈ R^d into piecewise constant CHFs H(t|x) having the following properties:

1. H(t|x) ≥ 0 for all t;
2. max_t H(t|x) < ∞;
3. ∫_0^∞ H(t|x) dt → ∞.

Let us introduce the time T = t_m + γ in order to restrict the integral of H(t|x), where γ is a very small positive number. Let Ω = [0, T]. Then we can write

∫_Ω H(t|x) dt < ∞.  (11)

Since the CHF H(t|x) is piecewise constant, it can be written in a special form. Let us divide the set Ω into m + 1 subsets Ω_0, ..., Ω_m such that

1. Ω = ∪_{j=0,...,m} Ω_j;
2. Ω_m = [t_m, T], Ω_j = [t_j, t_{j+1}) for all j ∈ {0, ..., m − 1};
3. Ω_j ∩ Ω_k = ∅ for all j ≠ k.

Let us introduce the indicator functions

χ_{Ω_j}(t) = 1 if t ∈ Ω_j, and 0 if t ∉ Ω_j.  (12)

Hence, the CHF H(t|x) can be expressed through the indicator functions as follows:

H(t|x) = Σ_{j=0}^{m} H_j(x) · χ_{Ω_j}(t),  (13)

under the additional condition H_j(x) ≥ ε > 0, where ε is a small positive number. Here H_j(x) is the part of the CHF in the interval Ω_j. It is important that H_j(x) does not depend on t and is constant in the interval Ω_j. The last condition will be necessary below in order to deal with logarithms of the CHFs.

Let g be a monotone function. Then

g(H(t|x)) = Σ_{j=0}^{m} g(H_j(x)) χ_{Ω_j}(t).  (14)

By using the above representation of the CHF and integrating it over Ω, we get

∫_Ω H(t|x) dt = ∫_Ω [ Σ_{j=0}^{m} H_j(x) χ_{Ω_j}(t) ] dt = Σ_{j=0}^{m} H_j(x) [ ∫_Ω χ_{Ω_j}(t) dt ] = Σ_{j=0}^{m} H_j(x) (t_{j+1} − t_j).  (15)

The same expressions can be written for the Cox CHF:

H_Cox(t|x, b) = H_0(t) exp(b^T x) = Σ_{j=0}^{m} [ H_{0j} exp(b^T x) ] χ_{Ω_j}(t),  H_{0j} ≥ ε.  (16)

It should be noted that, for the optimization problem, the distance between two CHFs can be replaced with the distance between the logarithms of the corresponding CHFs; the introduced condition H_j(x) ≥ ε > 0 allows the use of logarithms. Therefore, in order to obtain a convex optimization problem for finding optimal values of b, we consider logarithms of the CHFs. It is important to point out that the difference of the logarithms of the CHFs is not equal to the difference between the CHFs themselves. However, we make this replacement to simplify the optimization problem for computing important features.
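A small sketch of the piecewise-constant representation (13) and the integral (15), with assumed event times and CHF values: the CHF is stored as the values H_j on the intervals Ω_j, and its integral over Ω collapses to the finite sum Σ_j H_j (t_{j+1} − t_j).

```python
import numpy as np

# Sketch of Eqs. (13) and (15) with assumed numbers: a piecewise-constant CHF
# is fully described by its values H_j on Omega_j = [t_j, t_{j+1}) and Omega_m = [t_m, T].
t = np.array([0.0, 1.0, 3.0, 4.5, 6.0])          # distinct event times t_0 < ... < t_m
gamma = 0.1                                       # small positive gamma (assumed)
T = t[-1] + gamma                                 # upper limit T = t_m + gamma
edges = np.append(t, T)                           # boundaries of Omega_0, ..., Omega_m
H_vals = np.array([0.1, 0.25, 0.4, 0.7, 0.9])     # H_j(x), assumed, all >= eps > 0

def H_of(time):
    """Evaluate H(t|x) from Eq. (13) via the indicator functions chi_Omega_j."""
    j = np.searchsorted(edges, time, side="right") - 1
    j = np.clip(j, 0, len(H_vals) - 1)
    return H_vals[j]

# Eq. (15): the integral over Omega is a finite weighted sum of the H_j.
integral = np.sum(H_vals * np.diff(edges))
print(H_of(2.0), H_of(6.05), integral)
```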

Let φ(t|x_k) and φ_Cox(t|x_k, b) be the logarithms of H(t|x_k) and H_Cox(t|x_k, b), where x_k is a generated point. The difference between the functions φ(t|x_k) and φ_Cox(t|x_k, b) can be written as follows:

φ(t|x_k) − φ_Cox(t|x_k, b) = Σ_{j=0}^{m} (ln H_j(x_k)) χ_{Ω_j}(t) − Σ_{j=0}^{m} ln( H_{0j} exp(b^T x_k) ) χ_{Ω_j}(t) = Σ_{j=0}^{m} ( ln H_j(x_k) − ln H_{0j} − b^T x_k ) χ_{Ω_j}(t).  (17)

Let us consider the distance between the functions φ(t|x_k) and φ_Cox(t|x_k, b) in the metric L_2:

D_{2,k}(φ, φ_Cox) = ‖φ(t|x_k) − φ_Cox(t|x_k, b)‖_2^2 = ∫_Ω |φ(t|x_k) − φ_Cox(t|x_k, b)|^2 dt = Σ_{j=0}^{m} ( ln H_j(x_k) − ln H_{0j} − b^T x_k )^2 (t_{j+1} − t_j).  (18)

The function ln H_j(x_k) − ln H_{0j} − b^T x_k is linear in b and therefore convex. Moreover, it is easy to prove, by taking the second derivative with respect to b_j, that the function (A − b^T x_k)^2 is also convex, where A = ln H_j(x_k) − ln H_{0j}. Since the term (t_{j+1} − t_j) is positive, the function D_{2,k}(φ, φ_Cox) is convex as a linear combination of convex functions with positive coefficients.

According to the explanation algorithm, we have to consider many points x_k generated in a local area around x and to minimize the objective function Σ_{k=1}^{N} w_k D_{2,k}(φ, φ_Cox), which takes into account all generated examples and their weights. This function is obviously also convex with respect to b. The weight of a point x_k can be assigned as a decreasing function of the distance between x and x_k, for example w_k = K(x, x_k), where K(·, ·) is a kernel. In the numerical experiments, we use the function defined in (24).

Finally, we write the following convex optimization problem:

min_b Σ_{k=1}^{N} w_k Σ_{j=0}^{m} ( ln H_j(x_k) − ln H_{0j} − b^T x_k )^2 (t_{j+1} − t_j).  (19)

One of the difficulties of solving the above problem is that the difference between the functions H(t|x) and H_Cox(t|x, b) may be significantly different from the distance between their logarithms. Therefore, in order to take this fact into account, it is proposed to introduce weights which "straighten" the functions φ(t|x) and φ_Cox(t|x, b). These weights are defined as

v(t|x) = H(t|x) / φ(t|x) = H(t|x) / ln(H(t|x)).  (20)

Taking into account these weights and their representation v_{kj} = H_j(x_k) / ln( H_j(x_k) ), the optimization problem can be rewritten as

min_b Σ_{k=1}^{N} w_k Σ_{j=0}^{m} v_{kj}^2 ( ln H_j(x_k) − ln H_{0j} − b^T x_k )^2 (t_{j+1} − t_j).  (21)

The problem can be solved by many well-known methods of convex programming. Finally, we write the scheme of the method as Algorithm 1.

Algorithm 1 The algorithm for computing the vector b for a point x

Require: Training set D; point of interest x; number of generated points N; the black-box survival model to be explained, q(x)
Ensure: Vector b of important features
1: Compute the baseline cumulative hazard function H_0(t) of the approximating Cox model on the dataset D by using the Nelson–Aalen estimator
2: Generate N − 1 random nearest points x_k in a local area around x; the point x is the N-th point
3: Get the prediction of the cumulative hazard function H(t|x_k) by using the black-box survival model (the function q)
4: Compute the weights w_k = K(x, x_k) of the perturbed points, k = 1, ..., N
5: Compute the weights v_{kj} = H_j(x_k) / ln( H_j(x_k) ), k = 1, ..., N, j = 0, ..., m
6: Find the vector b by solving the convex optimization problem (21)
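Below is a minimal sketch of Algorithm 1 under several assumptions: the black-box model is any callable returning positive piecewise-constant CHF values H_j(x_k) on the event-time grid; the baseline values H_{0j} are precomputed (e.g., by the Nelson–Aalen estimator on D); the kernel is the one defined later in (24); a small guard keeps ln H_j(x_k) away from zero; and problem (21) is solved in closed form as a weighted linear least-squares problem rather than with a general convex solver. The function and argument names are illustrative, not the authors' implementation.

```python
import numpy as np

def survlime_explain(x, black_box_chf, H0, times, N=1000, r=0.5, rng=None):
    """Sketch of Algorithm 1 (see the assumptions stated above).

    x             -- point of interest, shape (d,)
    black_box_chf -- callable mapping an (N, d) array to (N, m+1) CHF values H_j(x_k)
    H0            -- baseline CHF values H_0j at the event times, shape (m+1,)
    times         -- distinct event times t_0 < ... < t_m, shape (m+1,)
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]

    # Step 2: N - 1 points sampled uniformly in the d-sphere of radius r around x;
    # x itself is taken as the N-th point.
    u = rng.normal(size=(N - 1, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    radii = r * rng.uniform(size=(N - 1, 1)) ** (1.0 / d)
    X = np.vstack([x + radii * u, x])

    # Step 3: CHFs of the perturbed points predicted by the black-box model,
    # clipped away from zero so that H_j(x_k) >= eps > 0 as required in (13).
    H = np.clip(black_box_chf(X), 1e-8, None)

    # Step 4: kernel weights w_k from Eq. (24).
    w = np.clip(1.0 - np.sqrt(np.linalg.norm(X - x, axis=1) / r), 0.0, None)

    # Step 5: "straightening" weights v_kj = H_j(x_k) / ln H_j(x_k) from (20),
    # with a small guard against ln H_j(x_k) = 0 (an assumption of this sketch).
    logH = np.log(H)
    logH = np.where(np.abs(logH) < 1e-6, 1e-6, logH)
    v = H / logH

    # Step 6: problem (21) is a weighted linear least-squares problem in b.
    gamma = 1e-3 * (times[-1] - times[0] + 1.0)          # assumed small gamma
    dt = np.diff(np.append(times, times[-1] + gamma))    # interval lengths t_{j+1} - t_j
    sqw = np.sqrt(w[:, None] * v ** 2 * dt[None, :])     # square roots of the weights
    target = logH - np.log(H0)[None, :]                  # ln H_j(x_k) - ln H_0j
    A = (sqw[:, :, None] * X[:, None, :]).reshape(-1, d)
    y = (sqw * target).reshape(-1)
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return b
```

Because (21) is quadratic in b, stacking one row per pair (k, j) with the square-root weights and calling an ordinary least-squares routine is equivalent to solving the d × d normal equations discussed in the complexity analysis below.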

Let us consider the computational complexity of SurvLIME for explaining a single example. According to [8], the complexity of computing the baseline survival function is O(n), where n is the number of examples in D. The complexity of solving the unconstrained quadratic optimization problem (21) over R^d is O(d^3), under the condition that the problem is solved as a linear d × d system with a symmetric positive semidefinite matrix. The complexities of computing the N weights w_k and the N·(m + 1) weights v_{kj} are O(Nd) and O(N(m + 1)n), respectively. The procedure of generating N points around x has complexity O(Nd). In total, the complexity of SurvLIME is O(d^3 + 2Nd + N(m + 1)n). It is important to point out that the complexity of the black-box model is not taken into account because it depends on the specific implementation of the model. If the complexity of computing the CHF by the black-box model is O(K), and this model is used N times (computing N CHFs for the N generated points x_k), then the complexity O(KN) has to be added to the above complexity of SurvLIME.

7. Numerical experiments with synthetic data

In order to study the proposed explanation algorithm, we generate random survival times to events by using Cox model estimates. This generation allows us to compare the data used for generating every point with the results of SurvLIME.

7.1. Generation of random covariates, survival times and perturbations

Two clusters of covariates x ∈ R^d are randomly generated such that the points of every cluster are generated from the uniform distribution in a sphere. The covariate vectors are generated in the d-sphere with some predefined radius R. The center p of the d-sphere and its radius R are parameters of the experiments. There are several methods for the uniform sampling of points x in the d-sphere with unit radius R = 1, for example [62,63]; every generated point is then multiplied by R. The following parameters are used for the points of every cluster:

1. cluster 0: center p_0 = (0, 0, 0, 0, 0); radius R = 8; number of points N = 1000;
2. cluster 1: center p_1 = (4, −8, 2, 4, 2); radius R = 8; number of points N = 1000.

The parameters of the clusters are chosen in order to get an intersecting area containing points from both clusters; in particular, the radius is determined from the centers as follows:

R = ⌈ ‖p_0 − p_1‖_2 / 2 ⌉ + 2.  (22)

The clusters after applying the well-known t-SNE algorithm are depicted in Fig. 2. The value 2 in (22) is taken in order to implement the overlap of the clusters (see Fig. 2); by increasing this value, we can extend the area of the overlap.

Fig. 2. Two clusters of generated covariates depicted by using the t-SNE method.

In order to generate random survival times by using Cox model estimates, we apply the results obtained by Bender et al. [64] for survival time data for the Cox model with Weibull-distributed survival times. The Weibull distribution with scale λ and shape v parameters is used to generate appropriate survival times for simulation studies because this distribution shares the assumption of proportional hazards with the Cox regression model [64]. Taking into account the parameters λ and v of the Weibull distribution, we use the following expression for the generated survival times [64]:

T = ( −ln(U) / (λ exp(b^T x)) )^{1/v},  (23)

where U is a random variable uniformly distributed in the interval [0, 1].

The generation parameters for every cluster are

1. cluster 0: λ = 10^{−5}, v = 2, b^T = (10^{−6}, 0.1, −0.15, 10^{−6}, 10^{−6});
2. cluster 1: λ = 10^{−5}, v = 2, b^T = (10^{−6}, −0.15, 10^{−6}, 10^{−6}, −0.1).

It can be seen from the above that every vector b has three almost zero-valued elements and two "large" elements which correspond to important features. The generated values T_i are restricted by the condition that if T_i > 2000, then T_i is replaced with the value 2000. The event indicator δ_i is generated from the binomial distribution with probabilities Pr{δ_i = 1} = 0.9 and Pr{δ_i = 0} = 0.1.

Perturbation is one of the steps of the algorithm. According to it, we generate N nearest points x_k in a local area around x. These points are uniformly generated in the d-sphere with a predefined radius r = 0.5 and center at the point x. In the numerical experiments, N = 1000. Weights are assigned to every point as follows:

w_k = 1 − sqrt( ‖x − x_k‖_2 / r ).  (24)
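The sketch below reproduces the data-generation recipe of this subsection under stated assumptions: points are sampled uniformly in a d-sphere by one standard construction (a random direction times a radius drawn as R·U^{1/d}), survival times follow Eq. (23), times above 2000 are truncated, and the censoring indicator is Bernoulli with Pr{δ = 1} = 0.9; the random seed and function names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_sphere(center, R, n, rng):
    """Uniform points in a d-sphere of radius R (one standard construction)."""
    d = len(center)
    u = rng.normal(size=(n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    radii = R * rng.uniform(size=(n, 1)) ** (1.0 / d)
    return center + radii * u

def weibull_cox_times(X, b, lam=1e-5, v=2.0, rng=rng):
    """Eq. (23): T = (-ln U / (lam * exp(b^T x)))**(1/v), with U ~ Uniform(0, 1)."""
    U = rng.uniform(size=len(X))
    return (-np.log(U) / (lam * np.exp(X @ b))) ** (1.0 / v)

# Cluster 0 with the parameters reported in Section 7.1.
p0 = np.zeros(5)
X0 = sample_sphere(p0, R=8.0, n=1000, rng=rng)
b0 = np.array([1e-6, 0.1, -0.15, 1e-6, 1e-6])
T0 = np.minimum(weibull_cox_times(X0, b0), 2000.0)     # truncate times at 2000
delta0 = rng.binomial(1, 0.9, size=len(X0))            # Pr{delta = 1} = 0.9
print(X0.shape, T0.mean(), delta0.mean())
```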

Fig. 3. The best, mean and worst approximations for the Cox model (training and testing sets from cluster 0).

7.2. Black-box models and approximation measures

As black-box models, we use the Cox model and the RSF model [53]. The RSF consists of 250 decision survival trees. The approximating Cox model has the baseline CHF H_0(t) constructed on the generated training data using the Nelson–Aalen estimator. The Cox model is used in order to check whether the selected important features explaining the CHF H(t|x) at a point x coincide with the corresponding features accepted in the Cox model for generating the training set. It should be noted that the Cox model as well as the RSF are viewed as black-box models whose predictions (CHFs or survival functions) are explained. To study how different cases impact the quality of the approximation, we use the following two measures for the Cox model:

RMSE_model = sqrt( (1/n) Σ_{i=1}^{n} ‖ b_i^model − b_i^expl ‖_2 ),  (25)

RMSE_true = sqrt( (1/n) Σ_{i=1}^{n} ‖ b_i^true − b_i^expl ‖_2 ),  (26)

where b_i^model are the coefficients of the Cox model used as the black-box model, b_i^true are the coefficients used for generating the training data (see (23)), and b_i^expl are the explaining coefficients obtained by the proposed algorithm.

The first measure characterizes how well the obtained important features coincide with the corresponding features obtained by using the Cox model as the black-box model. The second measure considers how well the obtained important features coincide with the features used for generating the random times to events. Every measure is computed by taking n random points x from the testing set and computing the corresponding coefficients.

For the RSF, a third measure, RMSE_approx, considers how close the obtained Cox model approximation H_Cox(t_j|x_i, b_i^expl) is to the RSF output H(t_j|x_i). We cannot estimate the proximity of the important features explaining the model because we do not have the corresponding features for the RSF. A comparison of the important features with the generated ones is also incorrect because we explain the model (RSF) output H(t|x_i), not the training data. Therefore, we use this measure to estimate the proximity of two CHFs: the Cox model approximation and the RSF output.
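A small sketch of the measures (25) and (26) as written above (the square root of the average Euclidean distance between paired coefficient vectors); the coefficient arrays below are placeholders, not results from the paper.

```python
import numpy as np

def rmse_between(B_a, B_b):
    """Eqs. (25)-(26): sqrt of the mean Euclidean distance between paired
    coefficient vectors, e.g. b_model vs b_expl or b_true vs b_expl."""
    B_a, B_b = np.asarray(B_a), np.asarray(B_b)
    return np.sqrt(np.mean(np.linalg.norm(B_a - B_b, axis=1)))

# Placeholder data: n explained test points with d = 5 coefficients each.
rng = np.random.default_rng(1)
b_model = rng.normal(size=(100, 5))
b_expl = b_model + 0.1 * rng.normal(size=(100, 5))
print(rmse_between(b_model, b_expl))
```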

Fig. 4. The best, mean and worst approximations for the Cox model (training and testing sets from cluster 1).

7.3. Experiment 1

To evaluate the algorithm, 900 examples are randomly selected from every cluster for training and 100 examples for testing. Three cases of training and testing the black-box Cox and RSF models are studied:

1. cluster 0 for training and for testing;
2. cluster 1 for training and for testing;
3. clusters 0 and 1 jointly used for training, with testing separately on each cluster.

The testing phase includes:

• Computing the explanation vector b^expl for every point from the testing set.
• Depicting the best, mean and worst approximations in accordance with the Euclidean distance between the vectors b^expl and b^model (for the Cox model) and with the Euclidean distance between H(t_j|x_i) and H_Cox(t_j|x_i, b_i^expl) (for the RSF). In order to get these approximations, the points with the best, mean and worst approximations are selected among all testing points.
• Computing the measures RMSE_model and RMSE_true for the Cox model and RMSE_approx for the RSF over all points of the testing set.

The three cases of approximation (best (pictures in the first row), mean (pictures in the second row) and worst (pictures in the third row)) for the black-box Cox model, under the condition that cluster 0 is used for training and testing, are depicted in Fig. 3. The left pictures show the values of the important features b^expl, b^model and b^true. It can be seen from these pictures that all experiments show a very clear coincidence of the important features for all models. The right pictures in Fig. 3 show the survival functions obtained from the black-box Cox model and from the Cox approximation; the approximation is very accurate even in the worst case. Similar results can be seen in Fig. 4, where the training and testing examples are taken from cluster 1.

Figs. 5 and 6 illustrate different results corresponding to the cases when the training examples are taken from both clusters 0 and 1. One can see that the approximation of the survival functions is very accurate, but the important features obtained by SurvLIME do not coincide with the features used for generating the random times in accordance with the Cox model. This fact can be explained in the following way. We try to explain the CHF obtained by using the black-box model (the Cox model in the considered case). But the black-box Cox model is trained on all examples from two different clusters, i.e., on a mix of data from two clusters with different parameters. Therefore, this model itself provides results different from the generation model. At the same time, one can observe from Figs. 5 and 6 that the explaining important features coincide with the features obtained from the black-box Cox model. These results are interesting: they show that we explain the CHFs of the black-box model, not the CHFs of the training data.

The approximation accuracy measures for the four cases are given in Table 1; the table essentially repeats the results shown in Figs. 3–6.

Table 1. Approximation accuracy measures for four cases of using the black-box Cox model.

Clusters for training | Cluster for testing | RMSE_model | RMSE_true
0      | 0 | 0.189 | 0.177
1      | 1 | 0.291 | 0.328
{0, 1} | 0 | 0.227 | 0.446
{0, 1} | 1 | 0.177 | 0.381

In the same way, we study the black-box RSF model using the same cases of training and testing. The results are shown in Fig. 7, where the first, second, third and fourth rows correspond to the four cases of experiments: cluster 0 for training and testing; cluster 1 for training and testing; clusters 0 and 1 jointly for training with testing on cluster 0; and clusters 0 and 1 jointly for training with testing on cluster 1. Every row contains pairs of survival functions corresponding to the best, mean and worst approximations. The important features are not shown in Fig. 7 because they cannot be compared with the features of the generative Cox model.

Fig. 5. The best, mean and worst approximations for the Cox model (the training set from clusters 0,1 and the testing set from cluster 0).

Moreover, the RSF does not provide important features in the way the Cox model does. One can again see from Fig. 7 that SurvLIME provides a very accurate approximation of the RSF output by the Cox model.

Let us apply the well-known two-sample Kolmogorov–Smirnov test, which tests whether two underlying one-dimensional probability distributions differ, to the above numerical experiments with the Cox model and the RSF. The results of the test (p-values) for the mean approximation and the different considered cases of training and testing data are given in Table 2. The null hypothesis is that the data obtained by means of the black-box Cox model and the approximating Cox model (the third column), or by the black-box RSF and the approximating Cox model (the fourth column), are from the same distribution. The third and fourth columns show the corresponding p-values. It can be seen from Table 2 that none of the p-values is small enough to reject the null hypothesis.

Table 2. Results of the two-sample Kolmogorov–Smirnov test for the Cox model and the RSF.

Clusters for training | Cluster for testing | p-value (Cox model) | p-value (RSF)
0      | 0 | 0.1299 | 0.9942
1      | 1 | 0.3869 | 0.9998
{0, 1} | 0 | 0.9553 | 0.6423
{0, 1} | 1 | 0.2146 | 0.5697
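A hedged sketch of the comparison behind Table 2: scipy.stats.ks_2samp applied to the values of two survival curves treated as samples. The two curves below are synthetic placeholders, not the paper's data, and the exact way the samples were formed in the original experiments is an assumption of this sketch.

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder stand-ins for the two survival curves being compared
# (e.g. the black-box model vs. the approximating Cox model at the mean case).
t = np.linspace(0.0, 10.0, 200)
s_black_box = np.exp(-0.08 * t)
s_approx = np.exp(-0.085 * t)

stat, p_value = ks_2samp(s_black_box, s_approx)
print(f"KS statistic = {stat:.4f}, p-value = {p_value:.4f}")
# A large p-value means the null hypothesis (same distribution) is not rejected,
# which is the reading of Table 2 in the paper.
```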

Fig. 6. The best, mean and worst approximations for the Cox model (the training set from clusters 0,1 and the testing set from cluster 1).

One of the advantages of SurvLIME is that it has few tuning parameters; in fact, only two parameters have to be tuned. The first one is the number of generated points N in the local area around a test example. The generated points have weights w_k, which are defined by the function given in (24) with the radius r of the local sphere; the radius r is the second parameter.

We used different values for the parameters N and r, choosing those leading to the best approximation of the black-box model cumulative hazard function by the corresponding cumulative hazard function provided by the Cox model. One of the measures for estimating the impact of the parameters N and r is the standard deviation of the optimal vector b. Values of this measure are shown in Table 3 as a function of r and N. On the one hand, we would like to take the values r = 0.5 and N = 2500, which provide the smallest standard deviation. However, too large values of N lead to heavy computations while the standard deviation is not significantly improved. Therefore, we select r = 0.5 and N = 1000 for the experiments.

Table 3. The standard deviation of b as a function of r and N.

r     | N = 500      | N = 750      | N = 1000     | N = 2500
0.075 | 1.87 × 10^−3 | 1.68 × 10^−3 | 1.44 × 10^−3 | 1.42 × 10^−3
0.1   | 2.08 × 10^−3 | 1.58 × 10^−3 | 1.39 × 10^−3 | 1.41 × 10^−3
0.25  | 1.97 × 10^−3 | 1.64 × 10^−3 | 1.38 × 10^−3 | 1.37 × 10^−3
0.5   | 1.99 × 10^−3 | 1.58 × 10^−3 | 1.38 × 10^−3 | 1.36 × 10^−3

7.4. Experiment 2

In the second experiment, our aim is to study how SurvLIME depends on the number of training examples. We use only the Cox model for explanation, trained on 100, 200, 300, 400 and 500 examples and tested on 100 examples from cluster 0. We study how the difference between b^model and b^true depends on the sample size. The results of the experiments are shown in Fig. 8, where the rows correspond to 100, 200, 300, 400 and 500 training examples, respectively; the left pictures illustrate the relationships between the important features, and the right pictures show the survival functions obtained from the black-box Cox model and from the Cox approximation (see the similar pictures in Figs. 3–6). It should be noted that the dataset is generated from the same distribution described in Section 7.1, so we can compare the experimental results for different numbers of training points. It is interesting to observe in Fig. 8 how the vectors b^expl, b^model and b^true approach each other as the number of training examples grows.

The measures RMSE_model and RMSE_true as functions of the sample size are provided in Table 4, which shows a tendency of the measures to decrease as the training sample size increases. The C-index, as a measure of the black-box model quality, is also given in Table 4.

Table 4. Approximation accuracy measures as functions of the training sample size for the black-box Cox model.

Training size | C-index | RMSE_model | RMSE_true
100 | 0.777 | 0.246 | 0.306
200 | 0.785 | 0.201 | 0.263
300 | 0.787 | 0.217 | 0.183
400 | 0.838 | 0.206 | 0.177
500 | 0.844 | 0.195 | 0.183
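For reference, the C-index reported in Tables 4 and 5 can be computed by simple pair counting, as in the sketch below. The toy data, the risk-score convention (a higher risk score means earlier expected failure) and the decision to ignore ties are assumptions of this example.

```python
import numpy as np

def c_index(times, events, risk):
    """Harrell's C-index: fraction of comparable pairs ranked concordantly.
    A pair (i, j) is comparable if the earlier time belongs to an observed
    event; it is concordant if the earlier-failing example has the higher risk."""
    times, events, risk = map(np.asarray, (times, events, risk))
    concordant, comparable = 0, 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
    return concordant / comparable

# Toy data (assumed): higher risk should fail earlier.
times = [2.0, 5.0, 6.0, 8.0, 12.0]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.7, 0.4, 0.5, 0.1]
print(c_index(times, events, risk))   # 1.0 for this perfectly ranked toy data
```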

Fig. 7. The best, mean and worst approximations for the RSF model.

However, it can also be seen from Fig. 8 that a large amount of training data may lead to some worsening of the results (see the last row of pictures in Fig. 8). A possible reason is that a large dataset may contain many outliers which affect the approximation results.

7.5. Experiment 3

Another interesting question is how SurvLIME behaves with a very small amount of training data. We use the Cox model and the RSF as black-box models and train them on 10, 20, 30 and 40 examples from cluster 0; the models are tested on 10 examples from cluster 0. The results of the experiments for the Cox model are shown in Fig. 9, where the rows correspond to 10, 20, 30 and 40 training examples, respectively; the left pictures illustrate the relationships between the important features, and the right pictures show the survival functions obtained from the black-box Cox model and from the Cox approximation (see the similar pictures in Fig. 8). It is interesting to point out from Fig. 9 that, in the case of 10 training examples, the important features obtained by means of SurvLIME differ from the parameters of the Cox model used for generating the random examples; however, they almost coincide with the features obtained by the black-box model. Moreover, it follows from Fig. 9 that the important features of SurvLIME quickly converge to the parameters of the generating Cox model as the training set grows.

The C-index and the measures RMSE_model and RMSE_true as functions of the sample size are provided in Table 5. It can be seen from Table 5 that RMSE_model and RMSE_true are strongly reduced as the sample size increases.

Table 5. Approximation accuracy measures for the black-box Cox model trained on small amounts of data.

Training size | C-index | RMSE_model | RMSE_true
10 | 0.444 | 0.719 | 0.809
20 | 0.489 | 0.659 | 0.664
30 | 0.622 | 0.347 | 0.428
40 | 0.733 | 0.324 | 0.344

The results of the experiments for the RSF are shown in Fig. 10, where the rows correspond to 10, 20, 30 and 40 training examples, respectively; the left pictures illustrate the important features, and the right pictures show the survival functions obtained from the black-box RSF and from the Cox approximation (see the similar pictures in Fig. 9).

Fig. 8. Comparison of important features (left pictures) and survival functions (right pictures) for the explainable Cox model by 100, 200, 300, 400, 500 training examples.

Fig. 9. Comparison of important features (left pictures) and survival functions (right pictures) for the explainable Cox model by 10, 20, 30, 40 training examples.

It can be seen from Fig. 10 that the difference between the survival functions is reduced as the training set grows. However, in contrast to the black-box Cox model (see Fig. 9), the important features are very unstable: they are different for every training set. The reason for this behavior of the important features is that they explain the RSF outputs, which change significantly for such small numbers of training examples.

Fig. 10. Important features (left pictures) and survival functions (right pictures) for the explainable RSF by 10, 20, 30, 40 training examples.

8. Numerical experiments with real data

In order to illustrate SurvLIME, we test it on several well-known real datasets. A short introduction to the benchmark datasets is given in Table 6, which shows the sources of the datasets (R packages), the number of features d of the corresponding dataset, the number of training examples m, and the number of extended features d* obtained by hot-coding the categorical data.

Figs. 11 and 12 illustrate the numerical results for the CML dataset. Three cases of approximation are considered: best (pictures in the first row), mean (pictures in the second row) and worst (pictures in the third row). These cases are similar to the cases studied for synthetic data. The cases are studied for the black-box Cox model (Fig. 11) and the black-box RSF (Fig. 12). Again, the left pictures in the figures show the values of the important features b^model and b^expl for the Cox model and b^expl for the RSF, and the right pictures illustrate the approximated survival function and the survival function obtained by the explained model.
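The black-box models used in Sections 7.2 and 8 can be reproduced, for instance, with the scikit-survival library, which also ships one of the benchmark datasets of Table 6 (the Veterans' lung cancer data). The sketch below is an assumption-laden illustration, not the authors' code: the class and function names belong to scikit-survival, the RSF uses 250 trees as in Section 7.2, and the one-hot encoding mirrors the d* column of Table 6.

```python
import numpy as np
from sksurv.datasets import load_veterans_lung_cancer
from sksurv.preprocessing import OneHotEncoder
from sksurv.ensemble import RandomSurvivalForest
from sksurv.linear_model import CoxPHSurvivalAnalysis

# The Veteran dataset from Table 6 is also shipped with scikit-survival.
X, y = load_veterans_lung_cancer()
X = OneHotEncoder().fit_transform(X).values   # hot-coding of categorical features

# Black-box models as in the experiments: an RSF with 250 trees and a Cox model.
rsf = RandomSurvivalForest(n_estimators=250, random_state=0).fit(X, y)
cox = CoxPHSurvivalAnalysis().fit(X, y)

# Piecewise-constant CHFs evaluated on the models' event-time grids; these
# arrays are what SurvLIME consumes as the black-box output H(t|x_k).
chf_rsf = rsf.predict_cumulative_hazard_function(X[:5], return_array=True)
chf_cox = cox.predict_cumulative_hazard_function(X[:5])
print(chf_rsf.shape, len(chf_cox))
```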

Fig. 11. The best, mean and worst approximations for the Cox model trained on the CML dataset.

Table 6. A brief introduction to the datasets.

Data set | Abbreviation | R package | d | d* | m
Chronic Myelogenous Leukemia Survival | CML | Multcomp | 5 | 9 | 507
NCCTG Lung Cancer | LUNG | Survival | 8 | 11 | 228
Primary Biliary Cirrhosis | PBC | Survival | 17 | 22 | 418
Stanford Heart Transplant | Stanford2 | Survival | 2 | 2 | 185
Trial of Ursodeoxycholic Acid | UDCA | Survival | 4 | 4 | 170
Veterans' Administration Lung Cancer Study | Veteran | Survival | 6 | 9 | 137

Figs. 13–17 show the numerical results for the other datasets. Since most results are very similar to those obtained for the CML dataset, we provide only the case of the mean approximation for every dataset in order to reduce the number of similar pictures. Moreover, we do not show the important features explaining the RSFs because, in contrast to the Cox model, they cannot be compared with true features. Every figure consists of three pictures: the first one illustrates the explaining important features and the important features obtained from training the Cox model; the second picture shows two survival functions for the Cox model; the third picture shows two survival functions for the RSF.

If the Cox model is used for training, then one can see an explicit coincidence of the explaining important features and the features obtained from the trained Cox model for all datasets. This again follows from the fact that SurvLIME does not explain a dataset; it explains the model, in particular the Cox model. We can also observe that the approximated survival function is very close to the survival function obtained by the explained Cox model in all cases. The same cannot be said about the RSF: for example, one can see from Fig. 12 that the important features obtained by explaining the RSF mainly do not coincide. The reason is the difference between the results provided by the Cox model and the RSF.

Fig. 12. The best, mean and worst approximations for the RSF trained on the CML dataset.

Fig. 13. The mean approximation for the Cox model (the first and the second picture) and the RSF (the third picture) trained on the LUNG dataset.

At the same time, it can be seen from Figs. 12–17 that the survival functions obtained from the RSF and from the approximating Cox model are close to each other for many models. This implies that SurvLIME provides correct results.

9. Discussion

Fig. 14. The mean approximation for the Cox model (the first and the second picture) and the RSF (the third picture) trained on the PBC dataset.

Fig. 15. The mean approximation for the Cox model (the first and the second picture) and the RSF (the third picture) trained on the Stanford2 dataset.

Fig. 16. The mean approximation for the Cox model (the first and the second picture) and the RSF (the third picture) trained on the UDCA dataset.

Developing SurvLIME and illustrating it with several numerical examples gave us experience in applying some of the ideas behind the well-known explanation method LIME to a large group of methods dealing with survival data. Although many interesting and efficient methods have been proposed to explain the prediction results of machine learning models, to the best of our knowledge none of them has been developed to explain functional survival analysis predictions, including CHFs and survival functions.

LIME has been chosen as a basis for SurvLIME as one of the efficient and computationally simple explanation methods. However, it is important to note that the LIME method is not entirely a basis for the developed SurvLIME method: we applied only two main ideas underlying LIME. The first idea is the approximation in the local area around a test example by means of a linear model. The second idea is to generate points around a test example. Moreover, we did not take these ideas directly. The first idea is realized in SurvLIME by using the Cox model rather than a usual linear model as in LIME. Regarding the second idea, LIME uses generated points in order to get a function separating classes of examples, whereas SurvLIME generates points to take into account small changes of the CHFs with changes of the feature vectors. Another interesting explanation method which uses a linear approximation is SHAP. In spite of some difficulties of its use, for example computational difficulties and perturbation peculiarities, SHAP can also be modified to be applied to survival analysis. This is an important direction for further research.

It turns out that SurvLIME inherits the drawbacks of LIME. One of the drawbacks of LIME is that the precision of the explanations is not guaranteed due to the linear approximation [65]. The same drawback may be attributed to SurvLIME. However, in contrast to LIME, the precision of SurvLIME can be controlled by considering the difference between the survival function predicted by the black-box model and the survival function of the approximating Cox model.

Fig. 17. The mean approximation for the Cox model (the first and the second picture) and the RSF (the third picture) trained on the Veteran dataset.

model is trained based on a large set of randomly perturbed for example, survival functions, provide more correct results in
examples, which is computationally costly. The same drawback comparison with using point-valued predictions.
takes place for SurvLIME. According to [66], LIME is sensitive to It can be seen from various numerical examples with synthetic
the dimensionality of data. In this case, the local explanation is data generated in accordance with the Cox model that SurvLIME
unable to discriminate among relevant and irrelevant features. provides outperforming results in most cases. This follows from
The same can be said about SurvLIME. SurvLIME as well as LIME the comparison of the important features and approximations of
do not provide counterfactual explanations [27]. This is a direc- survival functions. However, we do not give a standard compari-
tion for further research to develop methods for counterfactual son of the proposed explanation method with other methods due
explanations of the black-box survival models. to two problems. The first problem of using other methods for
It is interesting to point out that SurvLIME have demonstrated comparison is that there exists no a commonly accepted measure
that the classical Cox model is a power and outstanding tool for for comparison of explanation methods. A reason is that the ex-
survival analysis in spite of its limitations caused by its strong planation is local and depends on both an explained example and
specific assumptions. One of the limitations, linear relationship a black-box model explained. But the main problem of using other
of covariates, became a crucial advantage for developing the methods for comparison is that there exist no explanation meth-
explanation method. Of course, this advantage is valid for a local ods dealing with survival data nowadays. Moreover, all methods
area around the explained point. try to explain point-valued predictions whereas the proposed
An important question arising with respect to the proposed method is what the explanation of survival functions or CHFs means. Indeed, the predictions of survival models differ from the usual classification or regression predictions, whose meanings are well known. Predictions of survival models are less obvious because they are functions. It is difficult for a doctor to understand an explanation of the form: “The main reason for the obtained survival function is the patient’s blood pressure”. In order to cope with this difficulty, we propose two ways. First, one can consider some point-valued characteristics for a clear explanation, for example, the mean time to event or the probability of an event before some time, as in “The main reason for the obtained mean time to recession is the patient’s blood pressure” or “A very small probability of recession before 5 years is due to a special treatment of the patient”. Second, survival functions or their parts can be grouped in accordance with some classes of objects, for example, with diseases. Then we can explain the corresponding disease of a patient by indicating the important features.
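As a rough illustration of the first way, the following sketch shows how such point-valued characteristics can be read off a survival function returned by a black-box survival model. The function names, the time grid and the horizon are hypothetical and are not part of the SurvLIME implementation.

```python
import numpy as np

def point_valued_characteristics(times, surv, horizon):
    """Derive simple point-valued summaries from a survival function.

    times   : increasing grid of time points t_1 < ... < t_m
    surv    : values S(t_1), ..., S(t_m) of the survival function
    horizon : time point T for the probability of an event before T
    """
    times = np.asarray(times, dtype=float)
    surv = np.asarray(surv, dtype=float)

    # Mean time to event approximated by the area under the survival curve
    # (restricted to the observed time grid).
    mean_time = np.trapz(surv, times)

    # Probability of the event occurring before the chosen horizon:
    # 1 - S(T), with S taken at the last grid point not exceeding T.
    idx = np.searchsorted(times, horizon, side="right") - 1
    prob_before = 1.0 - surv[max(idx, 0)]

    return mean_time, prob_before

# Hypothetical usage with a survival function predicted by a black-box model.
t_grid = [1.0, 2.0, 3.0, 5.0, 8.0]
s_vals = [0.95, 0.90, 0.70, 0.40, 0.10]
print(point_valued_characteristics(t_grid, s_vals, horizon=5.0))
```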
At first glance, it seems that it would be simpler to take one of the point-valued survival characteristics, for example, the mean time to event, as the outcome and to apply a well-known explanation method such as LIME. However, in this case we lose a lot of information by replacing the survival function or the CHF with a few points. For example, two patients may have identical mean times to recession but quite different survival functions. Moreover, we do not know a priori whether the mean time depends on the important features at all; perhaps these features affect the variance of the time to event rather than its mean. Therefore, explanations using functions, for example survival functions, provide more correct results than explanations using point-valued predictions.
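A tiny numerical illustration of this loss of information (the two curves are invented purely for illustration): the patients below have exactly the same restricted mean time to event, although their survival functions are entirely different.

```python
import numpy as np

# Patient A: survival probability declines linearly from 1 to 0 over 10 years.
t_a = np.array([0.0, 10.0])
s_a = np.array([1.0, 0.0])

# Patient B: certain survival up to year 5, then an immediate drop to 0.
t_b = np.array([0.0, 5.0, 5.0, 10.0])
s_b = np.array([1.0, 1.0, 0.0, 0.0])

# Both areas under the survival curves (restricted mean times) equal 5 years.
print(np.trapz(s_a, t_a), np.trapz(s_b, t_b))  # 5.0 5.0
```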
It can be seen from various numerical examples with synthetic data generated in accordance with the Cox model that SurvLIME provides accurate results in most cases. This follows from the comparison of the important features and from the approximations of the survival functions. However, we do not give a standard comparison of the proposed explanation method with other methods due to two problems. The first problem is that there is no commonly accepted measure for comparing explanation methods; the reason is that an explanation is local and depends on both the explained example and the explained black-box model. The main problem, however, is that no explanation methods dealing with survival data exist at present. Moreover, all available methods try to explain point-valued predictions, whereas SurvLIME explains functions. Therefore, we verify the explanation method by using synthetic data generated with respect to the Cox model and by considering the difference between the predefined vector b_true of the Cox model coefficients and the vector b_expl of coefficients obtained by the proposed method for indicating the important features. The RMSE measure is used to compare the vectors of coefficients.
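The verification measure itself is straightforward; a minimal sketch is given below. The particular coefficient values are hypothetical and only show how b_true and b_expl would be compared.

```python
import numpy as np

def rmse(b_true, b_expl):
    """Root mean squared error between the ground-truth Cox coefficients
    and the coefficients recovered by the explanation method."""
    b_true = np.asarray(b_true, dtype=float)
    b_expl = np.asarray(b_expl, dtype=float)
    return np.sqrt(np.mean((b_true - b_expl) ** 2))

# Hypothetical example: coefficients used to generate the synthetic Cox data
# versus coefficients returned by the explanation method for one test point.
b_true = np.array([0.0, 1.0, 0.0, 0.5, 0.0])
b_expl = np.array([0.1, 0.9, -0.05, 0.45, 0.02])
print(f"RMSE = {rmse(b_true, b_expl):.3f}")
```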
10. Conclusion

A new explanation method called SurvLIME, which can be regarded as a modification of the well-known method LIME for survival data, has been presented in the paper. The main idea behind the method is to approximate a survival machine learning model at a point by the Cox proportional hazards model, which assumes a linear combination of the example covariates. This assumption allows us to determine the important features explaining the survival model.
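For reference, the approximating model is the classical Cox proportional hazards model [9], whose cumulative hazard function has the standard form below; the coefficients of the linear combination are the quantities that serve as feature impacts.

```latex
H(t \mid \mathbf{x}) = H_0(t)\,\exp\!\left(\mathbf{b}^{\mathrm{T}}\mathbf{x}\right),
\qquad
\ln H(t \mid \mathbf{x}) = \ln H_0(t) + b_1 x_1 + \dots + b_d x_d,
```

where H_0(t) denotes the baseline cumulative hazard function.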
In contrast to LIME and other explanation methods, SurvLIME deals with survival data. It is not complex from the computational point of view: we have derived a simple convex unconstrained optimization problem whose solution does not meet any difficulties. Moreover, many numerical experiments with synthetic and real datasets have clearly illustrated the accuracy and correctness of SurvLIME. It has coped even with problems characterized by small datasets.

The main advantage of the method is that it opens a door for developing many explanation methods that take censored data into account. In particular, an interesting problem is to develop a method explaining survival models with time-dependent covariates. This is a direction for further research.
Only the quadratic norm has been considered to estimate the distance between two CHFs and to construct the corresponding optimization problem. However, other distance metrics are also interesting with respect to constructing new explanation methods; this is another direction for further research. An open and very interesting direction is also the counterfactual explanation with censored data, which could be a promising extension of SurvLIME.
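To make the role of the quadratic norm concrete, the following sketch minimizes a weighted squared distance between the logarithm of the black-box CHF and the logarithm of a Cox-type CHF over the coefficient vector. The variable names, the weights, the time grid and the exact form of the objective are illustrative assumptions rather than a verbatim reproduction of the paper's optimization problem.

```python
import numpy as np
from scipy.optimize import minimize

def fit_local_cox(X_pert, log_chf_bb, log_h0, weights):
    """Fit Cox coefficients b so that ln H0(t) + x^T b is close, in the
    weighted quadratic norm, to the logarithm of the black-box CHF.

    X_pert     : (n, d) perturbed examples around the point of interest
    log_chf_bb : (n, m) log CHF of the black-box model on a time grid
    log_h0     : (m,)   log baseline CHF on the same grid
    weights    : (n,)   proximity weights of the perturbed examples
    """
    d = X_pert.shape[1]

    def objective(b):
        # Cox-type approximation of the log CHF: ln H(t|x) = ln H0(t) + x^T b.
        log_chf_cox = log_h0[None, :] + (X_pert @ b)[:, None]
        residual = log_chf_bb - log_chf_cox
        # Weighted quadratic norm of the residuals (convex in b).
        return np.sum(weights[:, None] * residual ** 2)

    res = minimize(objective, x0=np.zeros(d), method="BFGS")
    return res.x  # coefficients interpreted as feature impacts

# Hypothetical toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
t_grid = np.linspace(0.1, 10.0, 20)
log_h0 = np.log(0.1 * t_grid)  # assumed baseline CHF, for illustration only
log_bb = log_h0[None, :] + (X @ np.array([0.8, 0.0, -0.5, 0.0]))[:, None]
w = np.ones(50)
print(fit_local_cox(X, log_bb, log_h0, w))
```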
It should be noted that SurvLIME itself can be further investigated by considering its different parameters, for example, the assignment of weights to the perturbed examples in different ways, the robustness of the method to outliers, etc. The original Cox model has several modifications based on the Lasso method. These modifications could improve the explanation method in some applications, for example, in medical applications, in order to take into account the high dimensionality of survival data. This is also a direction for further research. Many tasks are devoted to the analysis of groups of patients with similar symptoms. In this case, we face the problem of explaining the overall behavior of a black-box model when the contribution of the input features is considered with respect to their impact on predictions for a group of patients. In other words, we have to develop a global explanation method. The main difficulty in this case is that the linear approximation by the Cox model cannot be directly used because the area around the analyzed points is no longer local. This is a very interesting problem whose solution is a direction for further research.
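One of the parameters mentioned above, the assignment of weights to perturbed examples, can be varied independently of the rest of the method. A common LIME-style choice is a distance-based kernel; the sketch below uses a Gaussian kernel as an assumed example, with the kernel width as a free parameter (the function name and default value are hypothetical).

```python
import numpy as np

def proximity_weights(X_pert, x_explained, kernel_width=0.5):
    """Assign a weight to every perturbed example according to its
    Euclidean distance to the explained point (Gaussian kernel)."""
    dists = np.linalg.norm(X_pert - x_explained[None, :], axis=1)
    return np.exp(-(dists ** 2) / (kernel_width ** 2))

# Hypothetical usage: closer perturbations receive larger weights.
rng = np.random.default_rng(1)
x0 = np.array([0.2, -1.0, 0.7])
X_pert = x0 + 0.3 * rng.normal(size=(10, 3))
print(proximity_weights(X_pert, x0))
```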
CRediT authorship contribution statement

Maxim S. Kovalev: Methodology, Validation, Formal analysis, Writing - original draft, Writing - review & editing. Lev V. Utkin: Conceptualization, Methodology, Writing - original draft, Writing - review & editing. Ernest M. Kasimov: Software, Data curation, Validation.
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments

The authors would like to express their appreciation to the anonymous referees whose very valuable comments have improved the paper. The reported study was funded by RFBR, project number 20-01-00154.
References

[1] V. Arya, R.K.E. Bellamy, P.-Y. Chen, A. Dhurandhar, M. Hind, S.C. Hoffman, S. Houde, Q.V. Liao, R. Luss, A. Mojsilovic, S. Mourad, P. Pedemonte, R. Raghavendra, J. Richards, P. Sattigeri, K. Shanmugam, M. Singh, K.R. Varshney, D. Wei, Y. Zhang, One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques, 2019, arXiv:1909.03012.
[2] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Comput. Surv. 51 (5) (2019) 93.
[3] C. Molnar, Interpretable machine learning: A guide for making black box models explainable, 2019, Published online, https://christophm.github.io/interpretable-ml-book/.
[4] W.J. Murdoch, C. Singh, K. Kumbier, R. Abbasi-Asl, B. Yua, Interpretable machine learning: definitions, methods, and applications, 2019, arXiv:1901.04592.
[5] M.T. Ribeiro, S. Singh, C. Guestrin, Why should I trust You? Explaining the predictions of any classifier, 2016, arXiv:1602.04938v3.
[6] D. Garreau, U. von Luxburg, Explaining the explainer: A first theoretical analysis of LIME, 2020, arXiv:2001.03447.
[7] D. Hosmer, S. Lemeshow, S. May, Applied Survival Analysis: Regression Modeling of Time to Event Data, John Wiley & Sons, New Jersey, 2008.
[8] P. Wang, Y. Li, C.K. Reddy, Machine learning for survival analysis: A survey, 2017, arXiv:1708.04649.
[9] D.R. Cox, Regression models and life-tables, J. R. Stat. Soc. Ser. B Stat. Methodol. 34 (2) (1972) 187–220.
[10] S.M. Shankaranarayana, D. Runje, ALIME: Autoencoder based approach for local interpretability, 2019, arXiv:1909.02437.
[11] I. Ahern, A. Noack, L. Guzman-Nateras, D. Dou, B. Li, J. Huan, NormLime: A new feature importance metric for explaining deep neural networks, 2019, arXiv:1909.04200.
[12] M.R. Zafar, N.M. Khan, DLIME: A deterministic local interpretable model-agnostic explanations approach for computer-aided diagnosis systems, 2019, arXiv:1906.10263.
[13] M.T. Ribeiro, S. Singh, C. Guestrin, Anchors: High-precision model-agnostic explanations, in: AAAI Conference on Artificial Intelligence, 2018, pp. 1527–1535.
[14] L. Hu, J. Chen, V.N. Nair, A. Sudjianto, Locally interpretable models and effects based on supervised partitioning (LIME-SUP), 2018, arXiv:1806.00663.
[15] J. Rabold, H. Deininger, M. Siebers, U. Schmid, Enriching visual with verbal explanations for relational concepts: Combining LIME with Aleph, 2019, arXiv:1910.01837v1.
[16] Q. Huang, M. Yamada, Y. Tian, D. Singh, D. Yin, Y. Chang, GraphLIME: Local interpretable model explanations for graph neural networks, 2020, arXiv:2001.06216.
[17] E. Strumbel, I. Kononenko, An efficient explanation of individual classifications using game theory, J. Mach. Learn. Res. 11 (2010) 1–18.
[18] S.M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: Advances in Neural Information Processing Systems, 2017, pp. 4765–4774.
[19] P.W. Koh, P. Liang, Understanding black-box predictions via influence functions, in: Proceedings of the 34th International Conference on Machine Learning, volume 70, 2017, pp. 1885–1894.
[20] C. Burns, J. Thomason, W. Tansey, Interpreting black box models with statistical guarantees, 2019, arXiv:1904.00045.
[21] S. Wachter, B. Mittelstadt, C. Russell, Counterfactual explanations without opening the black box: Automated decisions and the GPDR, Harv. J. Law Technol. 31 (2017) 841–887.
[22] Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, S. Lee, Counterfactual visual explanations, 2019, arXiv:1904.07451.
[23] L.A. Hendricks, R. Hu, T. Darrell, Z. Akata, Grounding visual explanations, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 264–279.
[24] A. Van Looveren, J. Klaise, Interpretable counterfactual explanations guided by prototypes, 2019, arXiv:1907.02584.
[25] J. van der Waa, M. Robeer, J. van Diggelen, M. Brinkhuis, M. Neerincx, Contrastive explanations with local foil trees, 2018, arXiv:1806.07470.
[26] Y. Ramon, D. Martens, F. Provost, T. Evgeniou, Counterfactual explanation algorithms for behavioral and textual data, 2019, arXiv:1912.01819.
[27] A. White, A.dA. Garcez, Measurable counterfactual local explanations for any classifier, 2019, arXiv:1908.03020v2.
[28] R. Fong, A. Vedaldi, Explanations for attributing deep neural network predictions, in: Explainable AI, Volume 11700 of LNCS, Springer, Cham, 2019, pp. 149–167.
[29] R.C. Fong, A. Vedaldi, Interpretable explanations of black boxes by meaningful perturbation, in: Proceedings of the IEEE International Conference on Computer Vision, IEEE, 2017, pp. 3429–3437.
[30] V. Petsiuk, A. Das, K. Saenko, RISE: Randomized input sampling for explanation of black-box models, 2018, arXiv:1806.07421.
[31] M.N. Vu, T.D. Nguyen, N. Phan, M.T. Thai, R. Gera, Evaluating explainers via perturbation, 2019, arXiv:1906.02032v1.
[32] M. Du, N. Liu, X. Hu, Techniques for interpretable machine learning, 2019, arXiv:1808.00033.
[33] A. Adadi, M. Berrada, Peeking inside the black-box: A survey on explainable artificial intelligence (XAI), IEEE Access 6 (2018) 52138–52160.
[34] A.B. Arrieta, N. Diaz-Rodriguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcia, S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila, F. Herrera, Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, 2019, arXiv:1910.10045.
[35] D.V. Carvalho, E.M. Pereira, J.S. Cardoso, Machine learning interpretability: A survey on methods and metrics, Electronics 8 (832) (2019) 1–34.
[36] C. Rudin, Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, Nat. Mach. Intell. 1 (2019) 206–215.
[37] R. Tibshirani, The lasso method for variable selection in the Cox model, Stat. Med. 16 (4) (1997) 385–395.
[38] S. Kaneko, A. Hirakawa, C. Hamada, Enhancing the lasso approach for developing a survival prediction model based on gene expression data, Comput. Math. Methods Med. 2015 (2015) 1–7, 259474.
[39] J. Kim, I. Sohn, S.-H. Jung, S. Kim, C. Park, Analysis of survival data with group lasso, Comm. Statist. Simulation Comput. 41 (9) (2012) 1593–1605.
[40] N. Ternes, F. Rotolo, S. Michiels, Empirical extensions of the lasso penalty to reduce the false discovery rate in high-dimensional Cox regression models, Stat. Med. 35 (15) (2016) 2561–2573.
[41] D.M. Witten, R. Tibshirani, Survival analysis with high-dimensional covariates, Stat. Methods Med. Res. 19 (1) (2010) 29–51.
[42] H.H. Zhang, W. Lu, Adaptive Lasso for Cox's proportional hazards model, Biometrika 94 (3) (2007) 691–703.
[43] D. Faraggi, R. Simon, A neural network model for survival data, Stat. Med. 14 (1) (1995) 73–82.
[44] C. Haarburger, P. Weitz, O. Rippel, D. Merhof, Image-based survival analysis for lung cancer patients using CNNs, 2018, arXiv:1808.09679v1.
[45] J.L. Katzman, U. Shaham, A. Cloninger, J. Bates, T. Jiang, Y. Kluger, DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network, BMC Med. Res. Methodol. 18 (24) (2018) 1–12.
[46] R. Ranganath, A. Perotte, N. Elhadad, D. Blei, Deep survival analysis, 2016, arXiv:1608.02158.
[47] X. Zhu, J. Yao, J. Huang, Deep convolutional neural network for survival analysis with pathological images, in: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, 2016, pp. 544–547.
[48] V. Van Belle, K. Pelckmans, S. Van Huffel, J.A. Suykens, Support vector methods for survival analysis: a comparison between ranking and regression approaches, Artif. Intell. Med. 53 (2) (2011) 107–118.
[49] F.M. Khan, V.B. Zubek, Support vector regression for censored data (SVRc): a novel tool for survival analysis, in: 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008, pp. 863–868.
[50] P.K. Shivaswamy, W. Chu, M. Jansche, A support vector approach to censored targets, in: Seventh IEEE International Conference on Data Mining, ICDM 2007, IEEE, 2007, pp. 655–660.
[51] A. Widodo, B.-S. Yang, Machine health prognostics using survival probability and support vector machine, Expert Syst. Appl. 38 (7) (2011) 8430–8437.
[52] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.
[53] N.A. Ibrahim, A. Kudus, I. Daud, M.R. Abu Bakar, Decision tree for competing risks survival probability in breast cancer study, Int. J. Biol. Med. Res. 3 (1) (2008) 25–29.
[54] U.B. Mogensen, H. Ishwaran, T.A. Gerds, Evaluating random forests for survival analysis using prediction error curves, J. Stat. Softw. 50 (11) (2012) 1–23.
[55] J.B. Nasejje, H. Mwambi, K. Dheda, M. Lesosky, A comparison of the conditional inference survival forest model to random survival forests based on a simulation study as well as on two applications with time-to-event data, BMC Med. Res. Methodol. 17 (115) (2017) 1–17.
[56] I.K. Omurlu, M. Ture, F. Tokatli, The comparisons of random survival forests and Cox regression analysis with simulation and an application related to breast cancer, Expert Syst. Appl. 36 (2009) 8582–8588.
[57] M. Schmid, M.N. Wright, A. Ziegler, On the use of Harrell's C for clinical risk prediction via random survival forests, Expert Syst. Appl. 63 (2016) 450–459.
[58] H. Wang, L. Zhou, Random survival forest with space extensions for censored data, Artif. Intell. Med. 79 (2017) 52–61.
[59] M.N. Wright, T. Dankowski, A. Ziegler, Random forests for survival analysis using maximally selected rank statistics, 2016, arXiv:1605.03391v1.
[60] M.N. Wright, T. Dankowski, A. Ziegler, Unbiased split variable selection for random survival forests using maximally selected rank statistics, Stat. Med. 36 (8) (2017) 1272–1284.
[61] F. Harrell, R. Califf, D. Pryor, K. Lee, R. Rosati, Evaluating the yield of medical tests, JAMA 247 (1982) 2543–2546.
[62] F. Barthe, O. Guedon, S. Mendelson, A. Naor, A probabilistic approach to the geometry of the l_p^n-ball, Ann. Probab. 33 (2) (2005) 480–513.
[63] R. Harman, V. Lacko, On decompositional algorithms for uniform sampling from n-spheres and n-balls, J. Multivariate Anal. 101 (2010) 2297–2304.
[64] R. Bender, T. Augustin, M. Blettner, Generating survival times to simulate Cox proportional hazards models, Stat. Med. 24 (11) (2005) 1713–1723.
[65] N. Xie, G. Ras, M. van Gerven, D. Doran, Explainable deep learning: A field guide for the uninitiated, 2020, arXiv:2004.14545.
[66] G. Visani, E. Bagli, F. Chesani, A. Poluzzi, D. Capuzzo, Statistical stability indices for LIME: obtaining reliable explanations for machine learning models, 2020, arXiv:2001.11757.
