2 PROPOSED APPROACH

We introduce a novel approach for the discretization of gene expression data to obtain gene expression profiles, interpretable by

GECCO ’22 Companion, July 9–13, 2022, Boston, MA, USA. A. Lopez, et al.

    f(I) = w_p · 1.0 / (1.0 + A(X, y)) + w_r · n_p        (1)

where X is the discretized dataset, y is the vector with the known labels (mild/severe symptoms) for each sample, A(X, y) is the classification accuracy (number of correct label predictions by a classifier over the total number of samples), n_p is the number of different profiles in the dataset after discretization, and w_p and w_r are weights. The fitness function is to be minimized, as the ideal candidate solution features both high accuracy and a low number of different profiles, to ease the sensemaking of the domain experts.

Figure 2: 10 runs of the REFS algorithm. The solution considered as the best compromise between accuracy and number of features is marked with a red line (n=12).

3 EXPERIMENTAL EVALUATION

The proposed approach is implemented in Python 3, relying on the cma package for CMA-ES and the scikit-learn [9] package for classification. All the code and data needed to reproduce the experiments are freely available in the GitHub repository: https://github.com/albertotonda/ea-discretization-health.

3.3 Profile Generator

The EA selected for the profile generator is the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [7], considered the state of the art for the numerical optimization of non-convex functions with continuous values. After a few trial runs, the algorithm is set with the following parameters: N = 12, μ = 100, x_0 = {0.5, 0.5, ..., 0.5}, σ_0 = 0.1, w_p = 1.0, w_r = 10^-5, all default stop conditions, and Logistic Regression as the classifier chosen to compute the classification accuracy A for the fitness function described in Eq. 1. The choice of Logistic Regression is motivated by its effectiveness and training speed, making it one of the most suitable algorithms for our scenario. The classifier is run in a 10-fold cross-validation at each evaluation, in order to obtain a more reliable estimate of accuracy.
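The fitness of Eq. 1 is straightforward to compute once the accuracy A(X, y) and the profile count n_p are known. The sketch below uses our own function name, not one from the paper's repository; in the actual pipeline, A would come from Logistic Regression in 10-fold cross-validation, and the threshold vector would be optimized by CMA-ES (e.g. through cma.CMAEvolutionStrategy):

```python
def fitness(accuracy, n_profiles, w_p=1.0, w_r=1e-5):
    """Eq. 1: f(I) = w_p * 1.0 / (1.0 + A(X, y)) + w_r * n_p.

    Minimized by CMA-ES: higher accuracy shrinks the first term,
    fewer distinct profiles shrink the second (defaults are the
    paper's weights, w_p = 1.0 and w_r = 10^-5).
    """
    return w_p * 1.0 / (1.0 + accuracy) + w_r * n_profiles
```

With these weights, accuracy dominates: a perfect classifier yields a first term of 0.5, while each extra profile adds only 10^-5, so the profile count mostly acts as a tie-breaker between equally accurate threshold sets.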
An Evolutionary Approach to the Discretization of Gene Expression Profiles to Predict the Severity of COVID-19. GECCO ’22 Companion, July 9–13, 2022, Boston, MA, USA

Figure 3: Heatmap of normalized gene expression data, showing the values of each patient for the 12 most important features selected by the REFS algorithm.

Table 1: Classification accuracy of the eight classifiers in 10-fold cross-validation. Mean (stdev) is computed over the thresholds produced by the 20 runs; the remaining columns report accuracy under each transformation: healthy-control reference thresholds (Healthy), best evolved thresholds (Best), and best thresholds restricted to the 9 most representative genes (9 Genes). The last two rows report the mean accuracy and the mean number of profiles for each column.

Classifier                             Mean (stdev)     Healthy  Best    9 Genes
Gradient Boosting [6]                  0.9395 (0.0200)  0.8846   0.9703  0.9484
Random Forest [2]                      0.9461 (0.0171)  0.9137   0.9703  0.9637
Logistic Regression                    0.9608 (0.0198)  0.9060   0.9929  0.9929
Passive Aggressive Classifier [4]      0.9403 (0.0291)  0.8484   0.9929  1.0000
Stochastic Gradient Descent [13]       0.9344 (0.0298)  0.8764   0.9780  0.9709
Support Vector Machines (linear) [10]  0.9768 (0.0147)  0.8978   1.0000  0.9929
Ridge [8]                              0.9372 (0.0203)  0.8918   0.9786  0.9418
Bagging [1]                            0.9344 (0.0179)  0.8923   0.9703  0.9484
Mean accuracy                          0.9462 (0.0143)  0.8889   0.9817  0.9699
Mean #Profiles                         45.4 (8.9129)    122.0    48.0    45.0
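The accuracies in Table 1 come from the 10-fold cross-validation protocol described above. The paper evaluates scikit-learn classifiers; the sketch below implements only the protocol itself with the standard library, taking the train/predict functions as parameters (function and parameter names are ours, for illustration):

```python
import random
from statistics import mean

def k_fold_accuracy(X, y, train_fn, predict_fn, k=10, seed=42):
    """Mean accuracy over k cross-validation folds.

    train_fn(X_train, y_train) -> model; predict_fn(model, x) -> label.
    Any classifier can be plugged in through these two callables.
    """
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint test folds
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict_fn(model, X[i]) == y[i] for i in fold)
        accuracies.append(correct / len(fold))
    return mean(accuracies)
```

Averaging over the k held-out folds is what gives the more reliable accuracy estimate used inside the fitness function.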
In order to provide a comparison, we also compute profiles based on a classical technique of the domain, using the gene expression levels of the healthy controls as a reference to discretize the gene expressions of the patients.

A possible downside of our methodology is that it may create discretization thresholds uniquely tailored to the classifier used in the fitness function, in this case Logistic Regression, and thus lose generality. In order to test the generality of our method, we transform the data using all the resulting thresholds and compute the mean accuracy in a 10-fold cross-validation, using Logistic Regression and seven other state-of-the-art classifiers. The results, available in Table 1, show that even classifiers not included in the fitness function reach high levels of accuracy, providing evidence against overfitting. The number of profiles obtained by each discretization, compared to the global accuracy, is shown in Fig. 4.
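The classical baseline, discretizing against healthy controls, can be sketched as follows. The paper does not specify which statistic of the healthy controls serves as the reference, so the per-gene mean used here is purely an assumption for illustration (the function name is ours):

```python
def healthy_reference_thresholds(healthy_rows):
    """One discretization threshold per gene, derived from healthy controls.

    ASSUMPTION: the per-gene mean of the healthy samples is used as the
    reference level; the paper does not state the exact statistic.
    """
    n_genes = len(healthy_rows[0])
    n_samples = len(healthy_rows)
    return [sum(row[g] for row in healthy_rows) / n_samples
            for g in range(n_genes)]
```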
Table 2: Generated thresholds for each selected gene, considering the best individual obtained over the 20 runs.

Ensembl ID         Gene ID      Threshold
ENSG00000198826    ARHGAP11A    0.0032
ENSG00000170298    LGALS9B      0.3639
ENSG00000214548    MEG3         0.1874
ENSG00000287576    -            0.4734
ENSG00000240403    KIR3DL2      0.1129
ENSG00000214174    AMZ2P1       0.0005
ENSG00000214460    TPT1P6       0.0746
ENSG00000263551    -            0.2120
ENSG00000220785    MTMR9LP      0.6295
ENSG00000224227    OR2L1P       0.0371
ENSG00000186523    FAM86B1      0.0012
ENSG00000155657    TTN          0.3101
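Per-gene thresholds such as those in Table 2 turn each patient's expression vector into a binary profile, and the number of distinct profiles is the n_p of Eq. 1. A minimal sketch, with our own helper names and assuming "over-expressed" means strictly above the threshold:

```python
def to_profile(sample, thresholds):
    """Binary profile: 1 = over-expressed, 0 = under-expressed, per gene."""
    return tuple(int(value > threshold)
                 for value, threshold in zip(sample, thresholds))

def distinct_profiles(samples, thresholds):
    """Number of different profiles in the discretized dataset (n_p)."""
    return len({to_profile(s, thresholds) for s in samples})
```

Applied with the best thresholds of Table 2, this kind of transformation yields the 0/1 heatmap of Fig. 5 and the 48 distinct profiles reported in Table 1.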
Figure 4: Runs of the EA profile generator. The best solution found by the proposed approach is marked in red (n=48). The solution computed using the classical technique of considering the gene expression levels of healthy controls as references is marked in orange (n=122).

From the results, we consider the best run to be the solution generating 48 different profiles, with an average classification accuracy of 0.9817 over the 8 classifiers. Our framework clearly outperforms the more traditional approach of discretizing profiles based uniquely on the gene expression data from healthy controls, both from the point of view of classification accuracy and of the number of different profiles created after discretization. Transforming the dataset using the thresholds of the best individual found during the 20 experimental runs (Table 2) results in the heatmap shown in Fig. 5, where we have only values of 0 or 1, corresponding to under- or over-expressed genes.

Now, using the dataset transformed with the best thresholds found (Table 2), we are able to apply the REFS algorithm again (Fig. 6), which further reduces the number of genes (n=9) needed to separate the two groups, with a global accuracy of 0.9699, further reducing the number of profiles from 48 to 45 (Fig. 7).

4 DISCUSSION

The results presented in Table 1 clearly show that the proposed approach outperforms the more traditional discretization methodology, both concerning the classification accuracy and the number of profiles produced after discretization.
Furthermore, the excellent accuracy results in a 10-fold cross-validation for several classifiers provide evidence that using just Logistic Regression as part of the fitness function does not overfit the discretization thresholds to a single classifier.

Figure 6: 10 runs of the REFS algorithm on the dataset discretized by the thresholds found by the best individual. The solution with the best compromise between accuracy and number of features is marked with the red line at n=9.

Figure 7: Heatmap obtained after applying the thresholds represented by the best individual found to the dataset, considering only the 9 most representative genes.

REFERENCES
[1] Leo Breiman. 1999. Pasting small votes for classification in large databases and on-line. Machine Learning 36, 1-2 (1999), 85–103.
[2] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
[3] Joseph J. Cavallo, Daniel A. Donoho, and Howard P. Forman. 2020. Hospital Capacity and Operations in the Coronavirus Disease 2019 (COVID-19) Pandemic—Planning for the Nth Patient. JAMA Health Forum 1, 3 (March 2020), e200345. https://doi.org/10.1001/jamahealthforum.2020.0345
[4] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online passive-aggressive algorithms. Journal of Machine Learning Research 7, Mar (2006), 551–585.
[5] Manuel Castro de Moura, Veronica Davalos, Laura Planas-Serra, Damiana Alvarez-Errico, Carles Arribas, Montserrat Ruiz, Sergio Aguilera-Albesa, Jesús Troya, Juan Valencia-Ramos, Valentina Vélez-Santamaria, et al. 2021. Epigenome-wide association study of COVID-19 severity with respiratory failure. EBioMedicine 66 (2021), 103339.
[6] Jerome H. Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics (2001), 1189–1232.
[7] N. Hansen and A. Ostermeier. 2001. Completely Derandomized Self-adaptation in Evolution Strategies. Evolutionary Computation 9, 2 (2001), 159–195.
[8] Arthur E. Hoerl and Robert W. Kennard. 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 1 (1970), 55–67. https://doi.org/10.1080/00401706.1970.10488634
[9] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[10] John Platt et al. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers 10, 3 (1999), 61–74.
[11] Max Roser. 2022. COVID-19 Data Explorer. https://ourworldindata.org/explorers/coronavirus-data-explorer
[12] Marilyn Safran, Irina Dalah, Justin Alexander, Naomi Rosen, Tsippi Iny Stein, Michael Shmoish, Noam Nativ, Iris Bahir, Tirza Doniger, Hagit Krug, et al. 2010. GeneCards Version 3: the human gene integrator. Database 2010 (2010).
[13] Tong Zhang. 2004. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Twenty-first International Conference on Machine Learning - ICML '04. ACM Press. https://doi.org/10.1145/1015330.1015332