The Application of Naive Bayes Model Averaging
to Predict Alzheimer's Disease from Genome-Wide Data

Wei Wei, Shyam Visweswaran and Gregory F. Cooper

Background
° Genome-wide association studies (GWASs)
° Single-nucleotide polymorphism (SNP)
° High-throughput genotyping technologies
° Alzheimer's disease (AD)
  ° AD afflicts about 10% of persons over 65 and almost half of those over 85
  ° ~5.5 million cases currently in the U.S.
  ° 95% of all AD cases are Late-Onset AD (LOAD)
 
Dataset

° Source
  ° TGEN dataset by Reiman et al.*
° Cases
  ° 1411 individuals
  ° 861 LOAD and 550 controls
° SNPs
  ° 312,316 SNPs
  ° Two additional SNPs (rs429358 and rs7412) genotyped separately (these determine APOE status)
____________________________________________________________________
* Reiman E, Webster J, Myers A, Hardy J, Dunckley T, Zismann V, et al. GAB2 alleles modify Alzheimer's risk in APOE epsilon4 carriers. Neuron. 2007;54(5):713-20.
 
° Bayesian Model Averaging
  ° Represents uncertainty about the correctness of any given model
  ° Performs inference by weighting the prediction of each model by our uncertainty in that model
° Model-Averaged Naïve Bayes (MANB)
  ° MANB efficiently averages over all naive Bayes models (on a given set of variables) in making a prediction for an individual patient case
Naive Bayes (NB)

[Network diagram: the class node LOAD with an arc to each of SNP 1, SNP 2, SNP 3, …, SNP 312,318]
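For reference, plain naive Bayes over genotype features can be sketched as follows. This is an illustrative outline only (binary LOAD label, genotypes coded 0/1/2, Laplace smoothing), not the authors' implementation:

```python
# Minimal naive Bayes sketch (illustrative, not the authors' code):
# binary class (LOAD vs. control), genotype features coded 0/1/2,
# Laplace-smoothed probability estimates, log-space inference.
import math

def nb_train(X, y, k=3):
    """X: list of genotype vectors, y: 0/1 labels.
    Returns class priors and per-feature conditional probability tables."""
    n_feat = len(X[0])
    prior = {c: (sum(1 for yi in y if yi == c) + 1) / (len(y) + 2)
             for c in (0, 1)}
    # Start every count at 1 (Laplace smoothing).
    cpt = {c: [[1] * k for _ in range(n_feat)] for c in (0, 1)}
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            cpt[yi][j][v] += 1
    for c in (0, 1):
        for j in range(n_feat):
            tot = sum(cpt[c][j])
            cpt[c][j] = [cnt / tot for cnt in cpt[c][j]]
    return prior, cpt

def nb_predict(prior, cpt, x):
    """Posterior P(class = 1 | x) computed in log space for stability."""
    logp = {c: math.log(prior[c])
               + sum(math.log(cpt[c][j][v]) for j, v in enumerate(x))
            for c in (0, 1)}
    return 1.0 / (1.0 + math.exp(logp[0] - logp[1]))
```

With 312,318 SNP features, training and prediction are both a single linear pass, which is what makes NB (and MANB, below) tractable at genome-wide scale.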
Feature Selection Naive Bayes (FSNB)

° Perform feature selection using a greedy, forward-stepping search that optimizes the prediction of LOAD

[Network diagram: LOAD with arcs to only the selected SNPs, e.g., SNP 25,920, SNP 276,455, SNP 104,582, SNP 1,100]
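The greedy, forward-stepping search can be outlined as below. Here `score` is a stand-in for whatever measure of LOAD prediction the search optimizes (an assumption on our part), and the stop-when-no-improvement rule is one common choice rather than the authors' exact criterion:

```python
# Greedy forward-stepping feature selection (illustrative sketch).
# `score(features)` stands in for the criterion FSNB optimizes
# (prediction of LOAD); any callable returning a number works here.
def greedy_forward_select(candidates, score, max_features=10):
    selected = []
    best = score(selected)
    while len(selected) < max_features:
        # Try adding each remaining candidate; keep the best single addition.
        gains = [(score(selected + [f]), f)
                 for f in candidates if f not in selected]
        if not gains:
            break
        new_best, new_feat = max(gains)
        if new_best <= best:   # stop when no addition improves the score
            break
        selected.append(new_feat)
        best = new_best
    return selected
```

Each step re-scores every remaining candidate, so with 312,318 SNPs even a handful of steps is expensive; this is consistent with the run-time results reported later in the talk.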
Model-Averaged Naive Bayes (MANB)

[Network diagram: LOAD with an uncertain (model-averaged) arc to each of SNP 1, SNP 2, …, SNP 312,318]
Model Averaging over NB Structures

Model 1, …, Model i, …, Model 2^312,318

P(LOAD | SNPs, data) = Σ_i P(LOAD | SNPs, Model_i) · P(Model_i | data)

Each SNP's arc from LOAD is either present or absent, giving 2^312,318 possible naive Bayes models to average over.
Efficient Model Averaging

° We can take advantage of the conditional independence relationships in NB models to make it efficient to model average over all those many models.
° The computational "trick" is as follows*:
  ° For each SNP_i we construct a model-averaged conditional probability, P(SNP_i | LOAD), by averaging over whether or not there is an arc from LOAD to SNP_i.
  ° This step can be viewed as a "soft" form of feature selection.
____________________________________________________________________
* Dash D, Cooper G. Exact model averaging with naive Bayesian classifiers. International Conference on Machine Learning (2002) 91-98.
° We use these model-averaged conditional probabilities to define a new NB model M over which we now perform NB inference.
° Performing inference with M is the same as model averaging over the exponential number of NB models discussed previously.
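A minimal sketch of this trick for one feature, under simplifying assumptions not stated on the slides (binary class and feature, Beta(1,1) parameter priors, independent arc prior `p_arc`); the function names are illustrative, not from the paper:

```python
# Sketch of per-feature model averaging in the spirit of Dash & Cooper
# (2002). Assumptions (ours, not the slides'): binary class C and binary
# feature X, uniform Beta(1,1) parameter priors, arc prior p_arc.
import math

def log_marginal_binary(n0, n1):
    # Log marginal likelihood of n1 ones and n0 zeros under a Beta(1,1)
    # prior: integral of theta^n1 (1-theta)^n0 = n1! n0! / (n0+n1+1)!
    return (math.lgamma(n0 + 1) + math.lgamma(n1 + 1)
            - math.lgamma(n0 + n1 + 2))

def manb_feature(xs, cs, p_arc=0.5):
    """For one binary feature X and binary class C, return
    ({c: P_avg(X=1 | C=c)}, posterior probability of the arc C -> X)."""
    n1 = sum(xs)
    n0 = len(xs) - n1
    ll_absent = log_marginal_binary(n0, n1)      # arc absent: one binomial
    ll_present = 0.0                             # arc present: one per class
    p_cond = {}
    for c in (0, 1):
        xc = [x for x, ci in zip(xs, cs) if ci == c]
        m1 = sum(xc)
        m0 = len(xc) - m1
        ll_present += log_marginal_binary(m0, m1)
        p_cond[c] = (m1 + 1) / (m0 + m1 + 2)     # posterior mean, arc present
    # Posterior probability of the arc, by Bayes' rule over the 2 structures.
    log_w = math.log(p_arc) + ll_present
    log_v = math.log(1 - p_arc) + ll_absent
    w = 1.0 / (1.0 + math.exp(log_v - log_w))
    p_marg = (n1 + 1) / (n0 + n1 + 2)            # posterior mean, arc absent
    # The model-averaged conditional that the new NB model M uses.
    return {c: w * p_cond[c] + (1 - w) * p_marg for c in (0, 1)}, w
```

A feature that tracks the class drives the arc posterior `w` toward 1, so its conditional table is used almost unchanged; an uninformative feature is shrunk toward its marginal, which is the "soft" feature selection the slide describes.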
Priors

° Structure priors
  ° FSNB and MANB assume each arc is present with some probability p, independent of the status of other arcs in the model.
  ° Informed by the literature, we chose a value of p that yields an expected number of arcs of 20.
° Parameter priors
  ° If we think of P(SNP_i | LOAD) as defining a table of probabilities, then we assume that every way of filling in that table (consistent with the axioms of probability) is equally likely.
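Since each of the 312,318 candidate arcs is present independently with probability p, the expected number of arcs is 312,318 · p; a quick check of the value implied by the stated target of 20 expected arcs:

```python
# Independent-arc structure prior: expected number of arcs = n * p,
# so a target of 20 expected arcs fixes p = 20 / n.
n_arcs_possible = 312318   # one candidate arc LOAD -> SNP_i per SNP
expected_arcs = 20
p = expected_arcs / n_arcs_possible
print(p)   # about 6.4e-05
```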
Experimental Methods

° Five-fold cross-validation
° Performance measures
  ° Area under the ROC curve (AUC) as a measure of discrimination
  ° Calibration plots and Hosmer-Lemeshow goodness-of-fit statistics
  ° Run time
° Control algorithms
  ° NB
  ° FSNB
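As a reminder of what the discrimination measure computes: the AUC equals the probability that a randomly chosen case receives a higher predicted probability than a randomly chosen control. A minimal sketch (illustrative, not the authors' evaluation code):

```python
# AUC as the normalized Mann-Whitney U statistic (illustrative sketch):
# the probability that a random positive case scores above a random
# control, counting ties as one half.
def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This rank-based form makes clear why AUC measures discrimination but not calibration: rescaling all predicted probabilities leaves the AUC unchanged, which is why the calibration plots and Hosmer-Lemeshow statistics are reported separately.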
Results: Training Time

[Bar chart of training time in seconds: MANB 16.1, NB 15.6, FSNB 1684.2]

Machine parameters: CPU 2.33 GHz, RAM 2 GB. Training time was the average over the five cross-validation folds. Time for loading data into memory is not included, but was about XYZ seconds.
º " º  "m
  
 "m
 ! 
@"(95%
confidence interval of
their AUC difference is
-0.008 to 0.029). Their
performance is strongly
influenced by several
APOE SNPs.
 "m
 @"
 


 (p<0.00001).
Results: Calibration

° MANB and NB produced extreme probability estimates, with almost all the test cases having probability predictions near 0 or 1. Such extreme predictions occur because there are such a large number of features in the model.
° FSNB was the best-calibrated algorithm among the three we evaluated. This result is likely due to the FSNB models containing only a few SNP features (< 4).
Results: Goodness of Fit

° By the Hosmer-Lemeshow statistic, FSNB showed the best goodness of fit. We believe this result may be due to FSNB having such a small number of features in its models.

Summary

[Summary table comparing NB, FSNB, and MANB; the values did not survive extraction.]

° A full description of the MANB algorithm is available in the appendix of our paper.
° It provides all the details needed to readily implement the algorithm.
Future Work

° Apply the MANB algorithm to additional datasets
° Predict additional clinical outcomes
° Use both genomic and clinical data to predict clinical outcomes
° Explore the use of additional genome-wide measurement platforms, including next-generation sequencing data
° Include additional control algorithms in future evaluations
"  

° We thank Mr. Kevin Bui for his help in data


preparation, software development, and the preparation
of the appendix. We thank Dr. Pablo Hennings-
Yeomans, Dr. Michael Barmada, and the other members
of our research group for helpful discussions.
° The research reported here was funded by NLM grant
R01-LM010020 and NSF grant IIS-0911032.
Thank you

Questions?
