
A NEURAL-BAYESIAN APPROACH FOR PREDICTING SMEAR NEGATIVE

PULMONARY TUBERCULOSIS

Alcione M. dos Santos∗, Aristófanes C. Silva†, Leonardo de O. Martins†



Universidade Federal do Maranhão - UFMA
Departamento de Saúde Pública
Rua Barão de Itapary, 155, Centro
CEP: 65020-070, São Luís - MA, Brasil

Universidade Federal do Maranhão - UFMA
Departamento de Engenharia e Eletricidade
Av. dos Portugueses, SN, Campus do Bacanga, Bacanga
65085-580, São Luís - MA, Brasil

Emails: alcione.miranda@terra.com.br, ari@dee.ufma.br, leonardo@gia.deinf.ufma.br

Abstract— This work analyzes the application of a Bayesian neural network for predicting Smear Negative Pulmonary Tuberculosis (SNPT). The data used for developing the proposed model comprise one hundred and thirty-six patients from Health Care Units. They were referred to the University Hospital in Rio de Janeiro, Brazil, from March 2001 to September 2002, with clinical-radiological suspicion of smear negative pulmonary tuberculosis. Only symptoms and physical signs were used for constructing the Bayesian neural network model, which was able to correctly classify 82% of the patients from a test sample.

Keywords— Neural networks, Bayesian learning, Markov chain Monte Carlo, medical diagnosis.

Resumo— This work analyzes the application of a Bayesian neural network to predict the diagnosis of paucibacillary (smear negative) pulmonary tuberculosis. The data used for developing the neural model refer to the study of patients attended at the Hospital Universitário Clementino Fraga Filho of the Universidade Federal do Rio de Janeiro, who presented suspicion of pulmonary TB with negative smear results, from March 2001 to September 2002. Only symptoms and physical signs were used for constructing the Bayesian neural network, which was able to correctly classify 82% of the patients in the test sample.

Palavras-chave— Artificial neural networks, Bayesian learning, Markov chain Monte Carlo, medical diagnosis.

1 Introduction

In recent years, the application of artificial neural networks (ANN) for prognostic and diagnostic classification in clinical medicine has attracted growing interest in the medical literature (Edwards et al., 1999; El-Solh et al., 1999; Tu, 1996).

(Santos, 2003) uses neural networks and classification trees to identify patients with clinical-radiological suspicion of smear negative pulmonary tuberculosis. (Kiyan and Yildirim, 2003) employed radial basis function, general regression and probabilistic neural networks for breast cancer diagnosis, and showed that such statistical neural networks can be used effectively to help oncologists.

An attractive feature of ANN is their ability to model highly complex non-linear relationships in data (Bishop, 1996). Recently, Bayesian methods have been proposed for neural networks to solve regression and classification problems (Lampinen and Vehtari, 2001). These methods claim to overcome some difficulties encountered in the standard approach, such as overfitting. Bayesian neural networks (BNN) provide not only the means of the predictive weights but also their uncertainties. Another advantage of BNN is the appropriate choice of the number of hidden layers and their dimensions (Rios Insua and Müller, 1998).

In this work, we use Bayesian neural networks to develop and evaluate a prediction model for diagnosing SNPT, useful for patients attended in Health Care Units within areas of limited resources. In addition, we analyze the information extracted by the model and compare it to expert analysis for diagnosis. This analysis aims at helping doctors to understand the way the model works and to make them more confident in its practical application.

2 Bayesian Neural Networks

In this paper, we focus exclusively on feedforward networks with a single hidden layer of m nodes and k outputs, and we do not allow direct connections from the inputs to the outputs. The particular form of the neural model we work with is

f_k(x, w) = \phi_k \Big( w_{k0} + \sum_{j=1}^{m} w_{kj} \, \phi_j \big( w_{j0} + \sum_{i=1}^{p} w_{ij} x_i \big) \Big)    (1)

where x is the input vector with p explanatory variables x_i, φ represents the activation function, and w is the vector of all weights (parameters), including input-to-hidden weights, biases and hidden-to-output weights. When the neural model is used for a classification problem, the output f_k(x, w) is the final value used in the classification process.
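To make Equation 1 concrete, the following sketch (ours, not part of the original paper) implements the forward pass in Python/NumPy for a single output with logistic activations, as used later in Section 3.2; all function and variable names are illustrative assumptions.

import numpy as np

def sigmoid(z):
    # Logistic activation: maps any real value to (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_in, b_in, w_out, b_out):
    """Forward pass of a single-hidden-layer network (Equation 1).

    x     : input vector with p explanatory variables
    W_in  : (m, p) input-to-hidden weights w_ij
    b_in  : (m,)   hidden biases w_j0
    w_out : (m,)   hidden-to-output weights w_kj (single output, k = 1)
    b_out : scalar output bias w_k0
    """
    h = sigmoid(b_in + W_in @ x)          # hidden activations phi_j(.)
    return sigmoid(b_out + w_out @ h)     # output in (0, 1): P(active PT | x)

# Small usage example with random weights (12 inputs, 2 hidden nodes).
rng = np.random.default_rng(0)
p, m = 12, 2
x = rng.integers(0, 2, size=p).astype(float)   # hypothetical binary symptoms
out = forward(x, rng.normal(size=(m, p)), rng.normal(size=m),
              rng.normal(size=m), rng.normal())
print(f"predicted probability of active PT: {out:.3f}")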
In the Bayesian approach to learning neural networks (Buntine and Weigend, 1991; Mackay, 1991; Mackay, 1992), the objective is to find the mode of the posterior distribution of the weights. To obtain the posterior distribution of the weights, we need to specify the prior distribution, which is a probability distribution that represents the prior information associated with the weights of the network, and the data likelihood. Firstly, we discuss how we choose the prior distribution of the weights.

2.1 Prior Distribution for Neural Networks

Many implementations of Bayesian neural networks use a Gaussian distribution, with zero mean and some specified width, as the prior for all weights and biases in the network. To specify the prior distributions, the weights were divided into three separate groups: bias terms, input-to-hidden weights and hidden-to-output weights.

We consider a Gaussian prior with zero mean and unknown variance 1/λα for the input-to-hidden weights, where λα is a precision parameter. Instead of fixing the value of λα, we regard it as another parameter; we then call it a hyperparameter, to separate it from the weights and biases. Now we need to specify a hyperprior distribution for λα.

Although there are several ways to implement the required hyperprior, we choose a Gamma hyperprior, with specified mean and shape parameter (Berger, 1985). This process can be extended so that each input weight has its own prior and hyperprior; this is called Automatic Relevance Determination (ARD) (Neal, 1996). Using this prior distribution, it is possible to determine the relative importance of the different inputs: the relevance of each input is considered to be inversely proportional to the variance of its prior distribution.

Similarly, the prior distribution for the hidden-to-output weights was also taken to be Gaussian with zero mean and unknown variance 1/λβ, and a Gamma hyperprior with specified mean and shape was used for the hyperparameter λβ.

Finally, all bias terms are assumed to follow a Gaussian prior with zero mean and variance 1/λγ, where a Gamma hyperprior, with specified mean and shape, was again used for the hyperparameter λγ. To simplify the notation, let us denote the set of hyperparameters by ϕ = (λα, λβ, λγ).

Once we have chosen the prior distributions, we combine them with the evidence from the data to obtain the posterior distribution of the parameters and hyperparameters.
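As an illustration of the hierarchy just described (Gamma hyperpriors over the precisions, Gaussian weights given the precisions), one might sketch the prior sampling step as follows; this is our reading of Section 2.1, not code from the paper, and the shape and rate values are placeholders (Section 3.2 later uses 0.001 for both).

import numpy as np

rng = np.random.default_rng(1)

def sample_from_prior(p, m, shape=0.001, rate=0.001):
    """Draw one set of network weights from the hierarchical prior.

    Each weight group has its own precision (lambda_alpha, lambda_beta,
    lambda_gamma) drawn from a Gamma hyperprior; given its precision, each
    weight is Gaussian with mean zero and variance 1/precision.
    """
    lam_alpha = rng.gamma(shape, 1.0 / rate)   # input-to-hidden precision
    lam_beta = rng.gamma(shape, 1.0 / rate)    # hidden-to-output precision
    lam_gamma = rng.gamma(shape, 1.0 / rate)   # bias precision
    return {
        "W_in": rng.normal(0.0, 1.0 / np.sqrt(lam_alpha), size=(m, p)),
        "w_out": rng.normal(0.0, 1.0 / np.sqrt(lam_beta), size=m),
        "b_in": rng.normal(0.0, 1.0 / np.sqrt(lam_gamma), size=m),
        "b_out": rng.normal(0.0, 1.0 / np.sqrt(lam_gamma)),
    }

weights = sample_from_prior(p=12, m=2)   # hypothetical sizes from Section 3.2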
2.2 Likelihood Function and Posterior Distribution

We assume that we have a data set consisting of n input vectors x_1, ..., x_n and the corresponding targets y_i. For classification problems with two classes, it is known that, under suitable conditions, the output f_k(x, w) can be interpreted as the probability that a target y_i belongs to a certain class. For the specific problem at hand, the BNN is used to find the probability that a patient has active pulmonary tuberculosis (PT).

Assuming that (x_1, y_1), ..., (x_n, y_n) are independent and identically distributed according to a Bernoulli distribution, we have the following likelihood function for the training data

P(D | w) = \prod_{i=1}^{n} f_k(x_i, w)^{y_i} \, [1 - f_k(x_i, w)]^{1 - y_i}    (2)

After observing the data, using Bayes' theorem and the likelihood, the prior distribution is updated to the posterior distribution

P(w, \varphi | D) = \frac{P(D | w, \varphi) \, P(w, \varphi)}{P(D)} = \frac{P(D | w, \varphi) \, P(w | \varphi) \, P(\varphi)}{\int\!\!\int P(D | w, \varphi) \, P(w, \varphi) \, dw \, d\varphi}    (3)

The denominator in Equation 3 is sometimes called the normalizing constant; it ensures that the posterior distribution integrates to one. This constant can be ignored, since it is irrelevant to the first level of inference. Therefore, the theorem may also be written as

P(w, \varphi | D) \propto P(D | w, \varphi) \, P(w | \varphi) \, P(\varphi)    (4)

Given the training data, finding the weight vector w*, corresponding to the maximum of the posterior distribution, is equivalent to minimizing the error function E(w), which is given by

E(w) = - \ln P(D | w) - \ln P(w, \varphi)    (5)

where P(D | w) is the likelihood function presented in Equation 2.
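A minimal sketch of Equations 2, 4 and 5, assuming the single-output logistic network of Equation 1, fixed precisions, and the weight grouping of Section 2.1; the names below are ours and the constant terms of the log densities are omitted.

import numpy as np

def log_likelihood(probs, y):
    # Bernoulli log-likelihood of Equation 2; probs = f_k(x_i, w), y in {0, 1}.
    eps = 1e-12                                # guard against log(0)
    probs = np.clip(probs, eps, 1.0 - eps)
    return np.sum(y * np.log(probs) + (1.0 - y) * np.log(1.0 - probs))

def log_prior(weights, lam_alpha, lam_beta, lam_gamma):
    # Unnormalized Gaussian log-prior for the three weight groups
    # (weights is the dict used in the earlier sketches).
    return -0.5 * (lam_alpha * np.sum(weights["W_in"] ** 2)
                   + lam_beta * np.sum(weights["w_out"] ** 2)
                   + lam_gamma * (np.sum(weights["b_in"] ** 2)
                                  + weights["b_out"] ** 2))

def neg_log_posterior(probs, y, weights, lam_alpha, lam_beta, lam_gamma):
    # Error function E(w) of Equation 5, up to an additive constant;
    # its minimizer is the posterior mode w*.
    return -(log_likelihood(probs, y)
             + log_prior(weights, lam_alpha, lam_beta, lam_gamma))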
In Bayesian learning, the posterior distribution is used to find the predictive distribution for the target value of a new test case, given the inputs for that case as well as the inputs and targets of the training cases (Ghosh et al., 2004). To predict the new output y_{n+1} for a new input x_{n+1}, the predictive distribution is obtained by integrating the predictions of the model with respect to the posterior distribution of the model parameters

P(y_{n+1} | x_{n+1}, D) = \int\!\!\int P(y_{n+1} | x_{n+1}, w, \varphi) \, P(w, \varphi | D) \, dw \, d\varphi    (6)

The posterior distribution presented in Equation 4 is very complex and its evaluation requires high-dimensional numerical integration, so it is impossible to compute it exactly. In order to make this integral analytically tractable, we need to introduce some simplifying approximations.

2.3 Markov Chain Monte Carlo Methods

There are different approaches to computing the posterior distribution. (Mackay, 1992) uses a Gaussian approximation to the posterior distribution, while (Neal, 1996) introduces a hybrid Monte Carlo method. Another approach to approximating the posterior distribution uses Markov chain Monte Carlo (MCMC) methods (Rios Insua and Müller, 1998). For a review of these methods see, for instance, (Bishop, 1996).

The idea of MCMC is to draw a sample of values w^{(t)}, t = 1, ..., M, from the posterior distribution of the network parameters. In this work, we used Gibbs sampling (Geman and Geman, 1984) to generate samples from the posterior distribution.

Gibbs sampling is perhaps the simplest MCMC method, and it is applicable when the joint distribution is not known explicitly but the full conditional distribution of each parameter is known. In a single iteration, Gibbs sampling involves sampling each parameter from its full conditional distribution given all the other parameters. Gibbs sampling requires that all conditional distributions of the target distribution can be sampled exactly. When a full conditional distribution was unknown, the Metropolis-Hastings algorithm (Hastings, 1970) or an adaptive rejection sampling procedure (Gilks and Wild, 1992) was used instead. For more details on these methods, see (Gamerman, 1997).
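The paper itself relies on Gibbs and Metropolis-within-Gibbs updates inside WinBUGS (see Section 3.2). Purely as a self-contained illustration of MCMC sampling from an unnormalized posterior, a random-walk Metropolis-Hastings sketch could look as follows; the step size and iteration count are arbitrary assumptions.

import numpy as np

def metropolis_hastings(log_posterior, w0, n_iter=20000, step=0.05, seed=2):
    """Random-walk Metropolis-Hastings over a flat parameter vector w.

    log_posterior : callable returning the unnormalized log posterior of w
    w0            : initial parameter vector
    Returns an array of shape (n_iter, len(w0)), one sample per iteration.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    logp = log_posterior(w)
    samples = np.empty((n_iter, w.size))
    for t in range(n_iter):
        proposal = w + step * rng.normal(size=w.size)   # symmetric proposal
        logp_prop = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < logp_prop - logp:
            w, logp = proposal, logp_prop
        samples[t] = w
    return samples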
We can observe that the integral in Equation 6 is the expectation of the function f_k(x_{n+1}, w) with respect to the posterior distribution of the parameters. This expectation can be approximated by MCMC, using a sample of values w^{(t)} drawn from the posterior distribution of the parameters. These values are then used to calculate

y_{n+1} \approx \frac{1}{M} \sum_{t=1}^{M} f_k(x_{n+1}, w^{(t)})    (7)
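Equation 7 is just an average of the network output over the retained posterior draws. A sketch, assuming the forward function from the Equation 1 example, a hypothetical list of sampled weight sets, and an optional burn-in (Section 3.2 later discards the first 10000 iterations):

import numpy as np

def predictive_mean(x_new, weight_samples, burn_in=0):
    """Monte Carlo estimate of Equation 7.

    x_new          : new input vector x_{n+1}
    weight_samples : list of weight dicts drawn from the posterior, one per
                     MCMC iteration (hypothetical structure, as in the
                     Equation 1 sketch)
    burn_in        : number of initial samples to discard
    """
    kept = weight_samples[burn_in:]
    outputs = [forward(x_new, w["W_in"], w["b_in"], w["w_out"], w["b_out"])
               for w in kept]
    return float(np.mean(outputs))   # estimated P(active PT | x_new, D)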
3 Methodology

3.1 Data Set

The data set refers to one hundred and thirty-six patients who agreed to participate in the study. They were referred to Hospital Clementino Fraga Filho, the University Hospital of the Federal University of Rio de Janeiro, Brazil, from March 2001 to September 2002, with clinical-radiological suspicion of SNPT. The data consisted of information from the anamnesis interview and included demographic and risk factors typically known for tuberculosis diagnosis.

These patients were under suspicion of active pulmonary tuberculosis, presenting negative smears. Forty-three per cent of these patients actually showed PT in activity.

Twelve clinical variables were considered for model development. These included age, cough, sputum, sweat, fever, weight loss, chest pain, shudder, alcoholism and others.

To obtain the training and testing sets, the original data were first randomly divided into training and testing sets. Given the statistical limitations of the available data, a division with 80% of the patients in the training set and 20% in the testing set would be preferable. With this strategy, model development was carried out varying the number of hidden neurons. It was observed that some neural networks exhibited poor performance, due to poor statistical representation of the training set with respect to patterns belonging to the testing set (Santos et al., 2006).

To avoid the possibility of some regions of the data not being represented, training and testing sets were also obtained by data clustering. An efficient alternative for data selection is cluster analysis (Morrison, 1990). This technique makes it possible to detect the existence of clusters in the data; the grouping process can be seen as an unsupervised learning technique.

The clustering method under investigation pointed out three clusters in the data set. In this case, the training set was obtained by randomly selecting 75% of the patients in each cluster, and the remaining 25% of the patients in each cluster were selected to form the testing set.
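The paper gives no code for this cluster-stratified split; the sketch below shows one way to reproduce the idea (75% of each cluster for training, 25% for testing), assuming cluster labels have already been computed by whatever clustering method was actually used.

import numpy as np

def cluster_stratified_split(X, y, labels, train_frac=0.75, seed=3):
    """Split patients so each cluster contributes 75% to train, 25% to test.

    X      : (n, p) matrix of clinical variables
    y      : (n,) binary outcome (active PT or not)
    labels : (n,) cluster label of each patient (e.g. from a k = 3 clustering)
    """
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        rng.shuffle(members)
        cut = int(round(train_frac * members.size))
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    train_idx, test_idx = np.array(train_idx), np.array(test_idx)
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]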
3.2 Implementation of BNN

We generated a BNN with 12 inputs, a single hidden layer with a fixed number m of hidden nodes, and one output node. The nonlinear activation function used for the hidden and output units was the logistic sigmoid, which produces an output between 0 and 1.

We used the Gaussian prior distributions described in Section 2, with three separate weight groups. The priors over the network parameters are

u_i | \lambda_\alpha \sim N(0, \lambda_\alpha^{-1}), \quad i = 1, \dots, I,
v_j | \lambda_\beta \sim N(0, \lambda_\beta^{-1}), \quad j = 1, \dots, H,
b_k | \lambda_\gamma \sim N(0, \lambda_\gamma^{-1}), \quad k = 1, \dots, S.    (8)

where u_i represents the input-to-hidden weights, v_j the hidden-to-output weights and b_k the bias terms.

A convenient form for the hyperprior distributions is a vague Gamma distribution. Here, we considered all hyperparameters to be distributed according to Gamma distributions with scale and shape parameters equal to 0.001. The priors for the different parameters and hyperparameters are all independent.

The software WinBUGS (Spiegelhalter et al., 2003) was used to implement the Bayesian neural network. Through WinBUGS, we specified the model described in Section 2. Next, the software simulated values from the posterior distribution of each parameter of interest, using the Metropolis-within-Gibbs procedure. We computed a single MCMC chain in WinBUGS for each parameter of interest (weights and biases). We simulated 20000 iterations and discarded the first 10000 in each sequence. The experiment was configured with 102 training samples and 34 test samples.

The posterior distribution samples for the model parameters were used to estimate the predictive distribution for the new test inputs. For each iteration t, the BNN has parameters w^{(t)} and produces an output y^{(t)} for an input vector x. Thus, for each test sample, we calculate the arithmetic mean of the M network outputs, according to Equation 7.

Several feedforward neural networks were also tested. These networks have log-sigmoid activation functions, one output neuron and one hidden layer, with the optimum number of hidden neurons found empirically. Alternative values of the backpropagation algorithm parameters (learning rate, momentum and number of iterations) were also tried.

The performance of the BNN and ANN was evaluated through the classification rate on the testing set, which is referred to here as accuracy. Other descriptive statistics were also used to evaluate the performance of the neural networks under study, namely sensitivity and specificity, since these measures are in general use in medicine. The sensitivity of the neural model tells us how well the model classifies the patients with PT in activity, while the specificity tells us how well the model classifies patients without PT in activity.
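For reference, the three figures of merit used in the next section can be computed from the test-set confusion matrix as sketched below; thresholding the predictive probability at 0.5 is our assumption, since the paper does not state the cut-off.

import numpy as np

def classification_metrics(y_true, p_pred, threshold=0.5):
    """Accuracy, sensitivity and specificity on the test set.

    y_true : (n,) true labels, 1 = active PT, 0 = no active PT
    p_pred : (n,) predicted probabilities of active PT (Equation 7)
    """
    y_hat = (np.asarray(p_pred) >= threshold).astype(int)
    y_true = np.asarray(y_true).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))   # true positives
    tn = np.sum((y_hat == 0) & (y_true == 0))   # true negatives
    fp = np.sum((y_hat == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_hat == 0) & (y_true == 1))   # false negatives
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity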
4 Results

Bayesian neural networks with 2, 3, 4 and 5 hidden neurons were tested, and we report just the best one, which was obtained with two hidden neurons. The results of the BNN are comparable to those of an artificial neural network with four hidden neurons; this network was selected because it had the smallest classification error on the test set.

According to Table 1, the BNN presented the largest accuracy (82%), as well as the largest specificity (85%). However, the ANN possesses the largest sensitivity (100%), i.e., this network better classifies the individuals with active PT.

Table 1: Classification efficiencies

                 ANN    BNN
Accuracy         76%    82%
Specificity      60%    85%
Sensitivity     100%    79%

5 Conclusion

The present study applied a Bayesian neural network to predict a medical diagnosis. The Bayesian neural network model achieved good classification performance, exhibiting a sensitivity of 79% and a specificity of 85%.

It is known that parsimonious models with few hidden neurons are preferable, because they tend to show better generalization ability, reducing the overfitting problem. Although the ANN possesses the largest sensitivity, that network has more hidden neurons, which is not good when the amount of available data is small.

Many issues remain to be explored. For example, we should try other prior distributions and treat the number of hidden neurons as an additional parameter.

Acknowledgements

The authors thank the staff and students of the Tuberculosis Research Unit, Faculty of Medicine, Federal University of Rio de Janeiro, for making available the data used in this work.

References

Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd edn, Springer-Verlag, New York.

Bishop, C. M. (1996). Neural Networks for Pattern Recognition, 2nd edn, Oxford University Press, New York.

Buntine, W. L. and Weigend, A. S. (1991). Bayesian backpropagation, Complex Systems 5: 603–643.

Edwards, D. F., H., H., Zazulia, A. R. and Diringer, M. N. (1999). Artificial neural networks improve the prediction of mortality in intracerebral hemorrhage, Neurology 53: 351–356.

El-Solh, A. A., Hsiao, C.-B., Goodnough, S., Serghani, J. and Grant, B. J. B. (1999). Predicting active pulmonary tuberculosis using an artificial neural network, Chest 116(4): 968–973.

Gamerman, D. (1997). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Chapman & Hall, London.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 721–741.

Ghosh, M., Maiti, T., Chakraborty, S. and Tewari, A. (2004). Hierarchical Bayesian neural networks: An application to prostate cancer study, Journal of the American Statistical Association 99(467): 601–608.

Gilks, W. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling, Journal of the Royal Statistical Society Series C (Applied Statistics) 41: 337–348.

Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57: 97–109.

Kiyan, T. and Yildirim, T. (2003). Breast cancer diagnosis using statistical neural networks, International XII Turkish Symposium on Artificial Intelligence and Neural Networks.

Lampinen, J. and Vehtari, A. (2001). Bayesian approach for neural networks: review and case studies, Neural Networks 14: 7–24.

Mackay, D. J. C. (1991). Bayesian Methods for Adaptive Models, PhD thesis, California Institute of Technology.

Mackay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks, Neural Computation 4(3): 448–472.

Morrison, D. F. (1990). Multivariate Statistical Methods, 3rd edn, McGraw Hill, New York.

Neal, R. M. (1996). Bayesian Learning for Neural Networks, Springer-Verlag, New York.

Rios Insua, D. and Müller, P. (1998). Feedforward neural networks for nonparametric regression, in D. Dey, P. Müller and D. Sinha (eds), Practical Nonparametric and Semiparametric Bayesian Statistics, Springer-Verlag, New York, pp. 181–193.

Santos, A. M. (2003). Redes neurais e árvores de classificação aplicadas ao diagnóstico da tuberculose pulmonar paucibacilar, PhD thesis, COPPE/UFRJ, Rio de Janeiro.

Santos, A. M., Pereira, B. B., Seixas, J. M., Mello, F. C. Q. and Kritski, A. L. (2006). Neural networks: An application for predicting smear negative pulmonary tuberculosis, in A. Jean-Louis, N. Balakrishnan, M. Mesbah and G. Molenberghs (eds), Advances in Statistical Methods for the Health Sciences: Applications to Cancer and AIDS Studies, Genome Sequence Analysis, and Survival Analysis, Birkhäuser Boston, New York, pp. 181–193.

Spiegelhalter, D. J., Thomas, A., Best, N. and Lunn, D. (2003). WinBUGS manual version 1.4, Technical report, MRC Biostatistics Unit, Institute of Public Health, Robinson Way, Cambridge.

Tu, J. V. (1996). Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes, Journal of Clinical Epidemiology 49(11): 1225–1231.
