
pubs.acs.org/molecularpharmaceutics | Article

Bioactivity Comparison across Multiple Machine Learning Algorithms Using over 5000 Datasets for Drug Discovery
Thomas R. Lane, Daniel H. Foil, Eni Minerali, Fabio Urbina, Kimberley M. Zorn, and Sean Ekins*
Cite This: Mol. Pharmaceutics 2021, 18, 403−415



Supporting Information

ABSTRACT: Machine learning methods are attracting considerable attention from the

pharmaceutical industry for use in drug discovery and applications beyond. In recent studies, we
and others have applied multiple machine learning algorithms and modeling metrics and, in some
cases, compared molecular descriptors to build models for individual targets or properties on a
relatively small scale. Several research groups have used large numbers of datasets from public
databases such as ChEMBL in order to evaluate machine learning methods of interest to them. The
largest of these types of studies used on the order of 1400 datasets. We have now extracted well over
5000 datasets from ChEMBL for use with the ECFP6 fingerprint and in comparison of our
proprietary software Assay Central with random forest, k-nearest neighbors, support vector
classification, Naïve Bayesian, AdaBoosted decision trees, and deep neural networks (three layers).
Model performance was assessed using an array of fivefold cross-validation metrics including area-
under-the-curve, F1 score, Cohen’s kappa, and Matthews correlation coefficient. Based on ranked
normalized scores for the metrics or datasets, all methods appeared comparable, while the distance
from the top indicated that Assay Central and support vector classification were comparable. Unlike
prior studies which have placed considerable emphasis on deep neural networks (deep learning), no advantage was seen in this case.
If anything, Assay Central may have been at a slight advantage as the activity cutoff for each of the over 5000 datasets representing
over 570,000 unique compounds was based on Assay Central performance, although support vector classification seems to be a
strong competitor. We also applied Assay Central to perform prospective predictions for the toxicity targets PXR and hERG to
further validate these models. This work appears to be the largest scale comparison of these machine learning algorithms to date.
Future studies will likely evaluate additional databases, descriptors, and machine learning algorithms and further refine the methods
for evaluating and comparing such models.
KEYWORDS: deep learning, drug discovery, machine learning, pharmaceutics, support vector machines

■ INTRODUCTION
The pharmaceutical industry is increasingly looking to machine learning for drug discovery applications to leverage its considerable high-throughput screening output as well as many other types of data.1 Some recent examples outside of big pharma have demonstrated the speed with which machine learning and in vitro testing can produce new leads compared to traditional efforts.2 With many thousands of structure activity datasets containing data from hundreds to hundreds of thousands of molecules screened against a single target or organism, this represents a potentially valuable starting point for drug discovery as well as for understanding molecular properties and predicting toxicity. Accessible public databases like ChEMBL,3 PubChem,4,5 and many others have spawned numerous studies utilizing such data for a wide array of applications in machine learning with many algorithms such as deep neural networks (DNNs),6−8 support vector machines (SVMs),9−15 k-nearest neighbors (Knns),16 Naïve Bayesian,17−21 decision trees,22 and others23,24 which have been increasingly used.25−28 We are also observing wider usage of deep learning for pharmaceutical applications,29 and this has warranted comparisons with other methods30 described as follows.

Large scale analyses of public structure activity relationship datasets have been primarily focused on creating chemogenomic databases which can be used to predict off-target effects as well as compound repurposing. In 2013, TargetHunter was described as a web portal that integrated ECFP6 descriptors with two different algorithms [Targets Associated with its MOst Similar Counterparts (TAMOSIC) and multiple category models (MCM)] for 794 targets in ChEMBL. Seven-fold cross-validation results gave an average accuracy of 90.9% for TAMOSIC and 74.8% for MCM.31 894 human protein targets from ChEMBL were used to train and compare two probabilistic machine-learning algorithms (Naïve Bayes and

Received: October 12, 2020
Revised: November 13, 2020
Accepted: December 4, 2020
Published: December 16, 2020
© 2020 American Chemical Society https://dx.doi.org/10.1021/acs.molpharmaceut.0c01013


403 Mol. Pharmaceutics 2021, 18, 403−415

Parzen−Rosenblatt window) against a test set from the WOMBAT database used as an external test set. Recall was consistently higher for the Parzen−Rosenblatt window.32 Smaller subsets of ChEMBL have also been used for various studies including extraction of 29 datasets for comparing 11 ligand efficiency scores with pIC50 as well as using Morgan fingerprints or physicochemical descriptors with four machine learning algorithms.33 Models were found to have a higher predictive power when based on ligand efficiencies. Viral focused datasets from ChEMBL (including human immunodeficiency virus, hepatitis C virus, hepatitis B virus, and human herpes virus among a selection of 26 viruses) were used to create SVM regression or classification models after generating 18,000 descriptors with Padel software. Ten-fold cross-validation statistics as well as validation with molecules left out of the models were generated. Validation statistics for classification models indicated sensitivity, specificity, and accuracy >80%. These models were subsequently used to generate an integrated web server called AVCpred.34 A set of G-protein coupled receptor targets for 5-HT2c, melanin-concentrating hormone, and adenosine A1 were used to build Gaussian process models with CATS2 descriptors which were more predictive for a test set than models developed from a proprietary Boehringer Ingelheim dataset. This was likely due to the fact that the ChEMBL set was larger and more structurally diverse.35 173 human targets with Ki bioactivity data were extracted from ChEMBL and used to build Naïve Bayes, logistic regression, and random forest (RF) models using Morgan fingerprints or FP2 that were in turn used for target inference.36 Seven-fold cross-validation and temporal cross-validation demonstrated that cutoffs that were more potent had better accuracy and Matthews correlation coefficient (MCC). These models were then used for multiple target prediction using the most similar ligand with false discovery rate control procedures to correct the p values. Only two molecules were used as examples for which predictions were made.36 25 datasets from ChEMBL were used to compare RF, multiple linear regression, ridge regression, similarity searching, and random selection of compounds as approaches to identify highly potent molecules.37 Linear and ridge regression were 2−3 times faster than RF and similarity searching at finding highly potent molecules. 550 human protein targets from ChEMBL version 22 were used for RF and conformal prediction models and used to predict additional targets from versions 23 and 24. The conformal prediction approach generally outperformed RF on internal and temporal validation, whereas for external validation, the situation was reversed.38 There are likely many additional published examples like this, but these clearly point to the progress that has been made to date.

As a database like ChEMBL is so large, it has become a central resource for many types of machine learning projects.39−48 It also contains a large number of cell lines with cytotoxicity data and cancer-related targets, and these have been combined and used recently to derive combined perturbation theory machine learning models;49 however, the authors did not explore prospective external testing of this method. Natural products are important as they are the basis for many small molecule drugs. Efforts to identify compounds that are natural product-like have curated 265,000 natural products (including those in ChEMBL) alongside 322,000 synthetic molecules to derive RF classification models using Morgan2 fingerprints and MACCS keys. Models were tested using 10-fold cross-validation and an independent test set, and in both cases, the receiver operating characteristic (ROC) score was 0.996−0.997.50 Subsets of natural products such as macrolactones have also been the recent focus of machine learning. In one case using biological data from ChEMBL, models were generated for macrolactones that possessed a range of activities for Plasmodium falciparum, hepatitis C virus, and T-cells using RF, SVM regression, Naïve Bayes, Knn, DNN, as well as consensus and hybrid approaches from the two best algorithms (RF_KNN and RF_DNN, the average prediction from RF and Knn or DNN). These three case studies demonstrated that RF was the best predictor across six descriptor sets while consensus modeling or the hybrid slightly increased the predictive power of the quantitative structure activity relationship (QSAR) models.51

All of the preceding examples have in common that few, if any, used prospective prediction as a form of validation. The main aim of this study was to evaluate a Bayesian method, Assay Central, which we have previously used in several examples where we have compared it versus other machine learning methods against a relatively small number of datasets or targets such as drug-induced liver injury,52 rat acute oral toxicity,53 estrogen receptor,54,55 androgen receptor,56 GSK3β,57 Mycobacterium tuberculosis,58,59 non-nucleoside reverse transcriptase, and whole cell HIV.60 Our evaluation has now been expanded with this study to over 5000 ChEMBL datasets and provides statistics and data visualization methods in order to assess each algorithm. In contrast to other large-scale machine learning comparisons, we also now describe prospective prediction for a selection of absorption, distribution, metabolism, excretion and toxicity (ADME/Tox) properties and retrospective analysis with infectious disease datasets for M. tuberculosis and HIV.

■ EXPERIMENTAL SECTION
Computing. Computational servers consisted of the following components: Supermicro EATX DDR4 LGA 2011, Intel Computer CPU 2.1 8 BX80660E52620V4, Crucial 64GB Kit (16GB × 4) DDR4 2133 (PC42133) DR × 4 288 Pin Server Memory CT4K16G4RFD4213/CT4C16G4RFD4213, 10 × EVGA GeForce GTX 1080 Ti FOUNDERS EDITION GAMING, 11GB GDDR5X, Intel 730 SERIES 2.5 in. Solid State Drive SSDSC2BP480G410, WD Gold 4TB Datacenter Hard Disk Drive 7200 RPM Class SATA 6 Gb/s 128MB Cache 3.5 in. WD4002FYYZ, and Supermicro 920 W 4U Server. The following software modules were installed: nltk 3.2.2, scikit-learn 0.18.1, Python 3.5.2, Anaconda 4.2.0 (64-bit), Keras 1.2.1, Tensorflow 0.12.1, and Jupyter Notebook 4.3.1.

Datasets. Datasets were generated in the following manner as described previously.61 A list of all molecules present in ChEMBL62 version 25 was generated from the chembl_25.sdf.gz file downloaded from the ChEMBL ftp server. A list of all the targets present in ChEMBL25 was generated via the interface on the ChEMBL API. Targets of types “ADMET”, “cell-line”, “chimeric protein”, “macromolecule”, “metal”, “oligosaccharide”, “organism”, “phenotype”, “protein complex”, “protein complex group”, “protein family”, “protein nucleic-acid complex”, “single protein”, “small molecule”, and “tissue” were included in the analysis, while targets of types “lipid”, “nonmolecular”, “no target”, “nucleic-acid”, “selectivity group”, “subcellular”, “unchecked”, and “unknown” were excluded. In total, 12,310 targets were present in the initial

data collation. Assays which matched the selected targets were collated via the ChEMBL API, producing a list of 915,781 assays. From these assays, all activities expressed in molar units (i.e. units of type “M”, “mM”, “μM”, and “nM”) were collected. For each target, the activities were partitioned into datasets based on the type of assay (“functional”, “binding”, “ADME”, or “all”) and the type of assay endpoint (“C50” corresponding to combined IC50/EC50/AC50/GI50, “K” corresponding to Ki/Kd, “MIC”, or “all” which combined all possible endpoints). This produced 16,135 datasets, of which 2915 had only one activity and were excluded from further analysis. From the resulting 13,220, we excluded datasets with fewer than 100 activities so as to compare to our previous work,61 to give 5279 datasets. Of these, eight datasets failed to converge when used to train SVM machine learning models and were excluded from further consideration. The binarization of these continuous datasets by our Assay Central algorithm occasionally picked a threshold so that there were fewer than five actives. These datasets have also been excluded from the analysis, resulting in a final total of 5091 datasets with at least 100 activity measurements each, the largest of which has 15,994 molecules. As these datasets were curated in an automated manner, as opposed to manually seeking out specific endpoints or assays, these datasets are referred to herein as “auto-curated” datasets and we have provided these files for others to use (Supplemental Datasets).

Machine Learning – Assay Central. Assay Central is proprietary software for curating high-quality datasets, generating Bayesian machine learning models with extended connectivity fingerprints (ECFP6) generated from the CDK library63 and making prospective predictions of potential bioactivity.1,54,55,58,60,64−70 We have previously described this software in detail1,55,58,60,64−71 as well as the interpretation of prediction scores61,72 and the metrics generated from internal five-fold cross-validation used to evaluate and compare predictive performances. These metrics include, but are not limited to: ROC score,55 Cohen’s kappa (CK),73,74 and MCC75 as described by us previously.

Each of the 5091 ChEMBL version 25 datasets described previously was subjected to the same standardization processes (i.e. removing salts, metal complexes, and mixtures and merging finite activities for duplicate compounds) prior to building a Bayesian classification model.72 Classification models such as these require a defined threshold of bioactivity, and Assay Central implements an automated method to select this threshold to optimize internal model performance metrics.61 After the initial generation of the Bayesian model with Assay Central, a binary activity column was applied to the output dataset for the accurate comparison of machine learning methods.

Other Machine Learning Methods. The threshold for all the datasets was set by Assay Central, and these were outputted (i.e. with the threshold applied so that the activity is binary, merged duplications, etc.) so that they could then be applied by the alternative machine learning algorithms. Additional machine learning algorithms were used which also utilized ECFP6 molecular descriptors for consistency between methods and were generated from the RDKit (www.rdkit.org) cheminformatics library. These additional algorithms include RF, Knn, SVM classification, Naïve Bayesian, AdaBoosted decision trees, and DNN of three layers; these algorithms have been described fully in detail in our earlier publications.55,58,60 It should also be noted that the hyperparameters for these different models were determined as previously described.76

Validation Metrics. Internal five-fold cross-validation metrics between algorithms for 5091 datasets were compared with a rank-normalized score, which was also used in previous investigations60,76,77 of various machine learning methods. Rank normalized scores60 (RNS) can be evaluated in either a pairwise or independent fashion: the former is informative for the comparison of algorithms per training set, while the latter provides a more generalized comparison of algorithms overall. A total of 188 datasets were excluded from this study because one or more of the internal five-fold cross-validation metrics could not be calculated (i.e. contained a denominator calculated to be zero).

The RNS takes the mean of the normalized scores of several traditional measurements of model performance including accuracy (ACC), recall, precision, F1 score, ROC area under the curve (AUC), CK, and MCC. As ACC, ROC AUC, precision, recall, specificity, and F1 scores are fixed from 0 to 1, the raw scores were used directly to calculate the rank-normalized score. As CK and MCC range from −1 to 1, the normalized score was calculated as (CK or MCC + 1)/2 in order to truncate the range from 0 to 1. The rank normalized score is simply the mean of these values.

Additionally, there is a “difference from the top” (ΔRNS) metric, where the rank normalized score for each algorithm is subtracted from the highest corresponding score of a specific dataset.76 This metric maintains the pairwise comparison results and enables a direct assessment of the performance of multiple machine learning algorithms while maintaining information from the other machine learning algorithms tested simultaneously.

External Validation Examples for Assay Central Models. Several recent internal projects have involved using manually curated and specific machine learning models (i.e. a single measurement type such as IC50, Ki, etc.) to score compounds prior to selection of compounds for testing in vitro with Assay Central. We have compared the predictions made previously for PXR (ChEMBL target ID 3401) and hERG (ChEMBL target ID 240) using manually curated datasets to those predictions from auto-curated “C50” datasets from ADME or binding assays (described earlier in the “Datasets” section). All datasets were subjected to the same standardization steps (i.e. removing salts, neutralizing anions, and merging duplicates) as described for generating models with Assay Central. When structures are predicted, they are also desalted and neutralized for compatibility with a model.

Additionally, several test sets from previous studies were subjected to external validation by corresponding auto-curated datasets from ChEMBL. Two HIV datasets from NIAID ChemDB used previously by us60 as training datasets were utilized herein as test sets for either whole cell HIV or reverse transcriptase. All whole cell data was from MT-4 cells with a molecular weight limit of 500 g/mol and totalled 13,783 compounds. A test set of three reverse transcriptase training sets from the same publication60 (from colorimetric, immunological, and incorporation assays) totalled 1578 compounds. A final test set of literature MIC data against the M. tuberculosis H37Rv strain was utilized in another previous publication58 and totalled 155 compounds. All these test sets were re-evaluated to remove compounds overlapping with the

Table 1. Median Values for Each Model and Metric Color Coded by Scale.a

a AC = Assay Central (Bayesian), RF = random forest, KNN = k-nearest neighbors, SVC = support vector classification, BNB = Naïve Bayesian, Ada = AdaBoosted decision trees, DL = deep learning.

Figure 1. Machine learning algorithm comparisons for ChEMBL datasets across multiple five-fold cross-validation metrics based on the rank normalized scores. (A) Rank normalized score and (B) ΔRNS distributions. Truncated violin plots are shown with minimal smoothing to retain an accurate distribution representation. The solid central line represents the median with the quartiles indicated. AC = Assay Central (Bayesian), RF = random forest, Knn = k-nearest neighbors, SVC = support vector classification, Bnb = Naïve Bayesian, Ada = AdaBoosted decision trees, DL = deep learning.
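The rank normalized score and ΔRNS plotted in Figure 1 follow directly from the description in the Validation Metrics section. The sketch below is a minimal reimplementation, not the authors' code; the metric names and dictionary layout are illustrative assumptions:

```python
def normalize(metric_name, value):
    # CK and MCC span [-1, 1]; rescale them to [0, 1].
    # ACC, AUC, precision, recall, specificity, and F1 already lie in [0, 1].
    if metric_name in ("CK", "MCC"):
        return (value + 1) / 2
    return value

def rank_normalized_score(metrics):
    """Mean of the normalized cross-validation metrics for one model on one dataset."""
    normalized = [normalize(name, value) for name, value in metrics.items()]
    return sum(normalized) / len(normalized)

def delta_rns(rns_by_algorithm):
    """'Difference from the top': each algorithm's RNS subtracted from the best
    RNS obtained on the same dataset (0.0 marks the winner)."""
    top = max(rns_by_algorithm.values())
    return {alg: top - score for alg, score in rns_by_algorithm.items()}
```

For example, a model with ACC = 0.9, AUC = 0.8, and MCC = 0.6 has RNS = (0.9 + 0.8 + 0.8)/3 ≈ 0.83.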

corresponding ChEMBL auto-curated dataset in order to enable a true external validation.

The HIV whole cell datasets from NIAID ChemDB were evaluated with the auto-curated “C50” datasets from functional assays from ChEMBL target ID 378 (HIV-1), and the reverse transcriptase dataset with auto-curated “C50” datasets from binding assays from ChEMBL target ID 2366516 (HIV-1 reverse transcriptase). The M. tuberculosis test set was evaluated against the auto-curated MIC dataset from functional assays from ChEMBL target ID 2111188 (M. tuberculosis H37Rv strain). All external validations were then evaluated at the calculated activity threshold unique to each training model.

Descriptor Calculations. To assess the molecular properties of the ChEMBL datasets (570,872 unique compounds) versus known drug-like molecules, we used the SuperDrug2 library (3992 unique compounds)78 and generated molecular descriptors (A log P, molecular weight, number of rotatable bonds, number of rings, number of aromatic rings, number of hydrogen bond acceptors, number of hydrogen bond donors, and molecular fractional polar surface area) using Discovery Studio (Biovia, San Diego, CA) for each dataset individually, and these property distributions were compared.

Statistics. All statistics were performed using GraphPad Prism 8 for macOS version 8.4.3.

External Biological Testing. External PXR and hERG testing was performed by the Thermo Fisher Scientific SelectScreen Profiling Service (Life Technologies Corporation, Chicago, IL 60693). The PXR assay uses a LanthaScreen TR-FRET competitive binding format with SR-12813 as a positive control. A Tb-labeled anti-GST antibody is used to indirectly label the receptor by binding to its GST tag. Competitive ligand binding is detected by a test compound’s ability to displace a fluorescent ligand (tracer) from the receptor, which results in the loss of FRET signal between the Tb-anti-GST antibody and the tracer. The hERG screening used a previously described fluorescence polarization assay79 and E-4031 as a positive control.

Table 2. Rank Normalized Score Comparisons (Friedman (Pairwise Comparison) (A) or Kruskal−Wallis Test (Independent Comparison) (B) with Dunn’s Multiple Comparisons Follow-up Tests).a

a AC = Assay Central (Bayesian), rf = random forest, knn = k-nearest neighbors, svc = support vector classification, bnb = Naïve Bayesian, Ada = AdaBoosted decision trees, DL = deep learning.
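The Friedman test behind Table 2A ranks the algorithms within each dataset and compares the rank sums across datasets. A bare-bones version of the test statistic (assuming no tied metric values within a dataset, and omitting the p-value and Dunn's follow-up comparisons) can be written as:

```python
def friedman_statistic(blocks):
    """Friedman chi-square for k algorithms evaluated on n datasets.
    blocks: list of n rows, each with k scores (higher = better, rank k).
    Ties within a row are not handled, for simplicity."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        # Rank the algorithms within this dataset: worst gets rank 1.
        order = sorted(range(k), key=lambda j: block[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
```

With three datasets that all rank three algorithms in the same order, the statistic is 6.0, the maximum for n = 3, k = 3.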

■ RESULTS
Model Statistics. Multiple machine learning models were generated and their internal five-fold cross-validation metrics were compared to assess their performance. The distribution of the model metrics for each model approach was not normally distributed (Figure S1); therefore, median model statistics are used for comparative purposes (Table 1). Even though the magnitudes vary drastically, there is a statistically significant difference between essentially all of the algorithms using the RNS metrics (RNS and ΔRNS). As shown in Figure 1A, the two algorithms that show the most pronounced score superiority with these metrics are the Bayesian algorithm used by Assay Central (AC) and SVC. Even though the RNS between these two algorithms were statistically indistinguishable from each other (Table 2), there was a statistically significant difference with ΔRNS between AC and SVC, suggesting AC as the better algorithm based on five-fold cross-validation for these datasets. This difference in the ΔRNS metric is visualized in

Table 3. Distance from the Top Normalized Score (Kruskal−Wallis Test (Independent Comparison) with Dunn’s Multiple Comparisons Follow-up Tests).a

a AC = Assay Central (Bayesian), rf = random forest, knn = k-nearest neighbors, svc = support vector classification, bnb = Naïve Bayesian, Ada = AdaBoosted decision trees, DL = deep learning.

Figure 1B and quantified in Table 3. Knn and RF lacked statistically significant differences when comparing RNS or ΔRNS independently, but there was a statistically significant win for RF when utilizing a pairwise RNS comparison (Table 2). As the ΔRNS score comparison shows no significant difference and should be a more sensitive metric to compare the algorithms in the same way as the RNS pairwise comparison is intended to do, this difference, while significant, is likely inconsequential. DL showed a similar but reduced performance as compared to Knn and RF using both the pairwise and independent methods. Ada and Bnb had the most drastic disparity in their performances as compared to all the other algorithms tested (Table 2). This is quantified by the mean rank differences, which show the extent of these differences (Table 3).

Overall, the trends seen with RNS and ΔRNS score comparisons are mimicked in each individual metric comparison (Figure 2, Table S2), with some minor exceptions being highlighted. AC or SVC ranked in the top 2 for every metric examined except for specificity, where RF scored the best. The most pronounced metric difference between the top scoring algorithms was AUC, with AC and SVC having median scores of 0.887 (95% CI: 0.884−0.888) and 0.826 (95% CI: 0.823−0.829), respectively. While the RNS of Knn and RF were similar, each algorithm had multiple metrics where one dominated over the other. RF had significantly higher precision and specificity where Knn was substantially better in recall. Finally, while AdaBoosted decision trees (Ada) had one of the lowest RNSs, it did have the second-best specificity score out of all the algorithms tested. The median score for each metric by algorithm is shown in Table 1, which is expanded to show the confidence intervals in Table S1. A full statistical comparison of each metric is shown in Table S2.

Molecular Properties. As the ChEMBL dataset is proposed to represent drug-like molecules, we sought to compare it to actual drugs, using the SuperDrug2 library. The distribution of each of the eight calculated molecular properties selected was assessed for normality and none of these were statistically likely to be normally distributed (D’Agostino & Pearson test, p < 0.0001); therefore, the differences were evaluated nonparametrically. The Gaussian fit displayed on each graph is meant only for easier visual discrimination of the two datasets and not for statistical purposes. Both the distributions (Kolmogorov−Smirnov test) and medians (Mann−Whitney U test) of each chemical library were determined to be statistically significantly different for all measured metrics except for the number of hydrogen-bond donors, where the median was not significantly different between the libraries (p = 0.1404) (Figure 3). The molecular weight, A log P, number of aromatic rings, number of rings, number of hydrogen bond acceptors, and number of rotatable bonds were all higher in the ChEMBL dataset. These results would suggest that the ChEMBL dataset consists of molecules that have properties outside of the known drugs. Individual models built with subsets of the ChEMBL data would likely also vary in their overlap with the approved drug properties.

External Test Sets. We used machine learning to prospectively score libraries of compounds to test in vitro for several toxicology properties of interest important for drug discovery. We have now collated this unpublished experimental data against the ion channel hERG (N = 10) and the nuclear receptor PXR (N = 14). We then compared the predictions for these compounds from the auto-curated models with additional machine learning models that used manually curated datasets in order to evaluate whether there is a difference between these techniques in the model prediction outcome. These two datasets therefore represent prospective external test sets. We have also anonymized these molecules to maintain confidentiality of our pipeline which consists of multiple infectious disease targets.

Figure 2. Machine learning algorithm comparisons for ChEMBL datasets across multiple classical five-fold cross-validation metrics. Distributions are shown as either a raw score (A) or as a “difference from the top” metric score (B). Truncated violin plots are shown with minimal smoothing to retain an accurate distribution representation. The solid central line represents the median with the quartiles indicated. AC = Assay Central (Bayesian), RF = random forest, Knn = k-nearest neighbors, SVC = support vector classification, Bnb = Naïve Bayesian, Ada = AdaBoosted decision trees, DL = deep learning.

The two auto-curated datasets for PXR were quite dissimilar from the manually curated dataset (Table S3a, Figure S2). While the model metrics remained in a reasonably tight range, the calculated model thresholds were ∼16 μM and 329 nM for the “ADME” and “binding” models, respectively. The manually curated ChEMBL model contained agonism data exclusively and had an activity threshold of 665 nM. This resulted in little agreement between the predictions of PXR activity for the compounds chosen as a prospective test set from our internal library of compounds. Only three of the 14 molecules of the test set were predicted to have the same classification across the PXR models. However, both the manually curated and the auto-curated “ADME” assay datasets generated similar prediction classifications for nine of the molecules in the test set. This is despite the vast threshold difference and less than half the amount of data present in the manually curated model (Table S3b). Additionally, this finding is interesting considering the specificity of the manually curated dataset (agonism assay descriptions), whereas the other models are capturing different assay types. An enrichment curve (Figure 4A) suggests that the manually curated and auto-curated “ADME” models performed similarly, though the manually curated model appears to have a slight edge.

The metrics generated from five-fold cross-validation of auto-curated and manually curated models of hERG were largely similar, the number of actives and total size were nearly the same, and the calculated threshold was the same (Table S4a, Figure S3). Unsurprisingly, predictions of hERG inhibition were nearly identical between these two models (Table S4b). While the predicted classifications were identical with each of the models, a small difference can be visualized using an enrichment curve (Figure 4B), which appears to favor the auto-curated model.

To further test the predictive performance of auto-curated datasets on much larger external test sets (consisting of thousands of molecules), we utilized data from our previous

Figure 3. Comparison of the distribution of molecular properties for >570,000 unique compounds in the ChEMBL datasets versus 3992
compounds in the SuperDrug2 library as an example of approved drugs. Median values are highlighted along with statistical analyses.
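The two-sample Kolmogorov−Smirnov comparison used for Figure 3 reduces to the maximum vertical gap between the two empirical cumulative distribution functions. A self-contained sketch of the D statistic follows (the p-value computation is omitted):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: the largest distance between the
    empirical CDFs of the two samples, evaluated at every observed value."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_values, x):
        # Fraction of the sample less than or equal to x.
        return bisect.bisect_right(sorted_values, x) / len(sorted_values)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)
```

In practice one would use a library routine that also returns the p-value; this version only exposes the distance that the test is built on.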
Figure 4. Comparison of enrichment rates of finding actives in the external test sets for (A) PXR and (B) hERG models using manually curated
and auto-curated Assay Central models.
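The enrichment comparison in Figure 4 can be expressed as a small ranking calculation: sort the external test set by predicted score and track what fraction of the true actives has been recovered as a function of the fraction of the list screened. The scores and labels below are invented for illustration and are not data from the paper's test sets.

```python
# Minimal sketch of an enrichment curve: rank compounds by model
# score, then record (fraction screened, fraction of actives found)
# after each compound. Illustrative data only.

def enrichment_curve(scores, labels):
    """Return (fraction screened, fraction of actives found) points."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    total_actives = sum(labels)
    found, curve = 0, []
    for i, (_, label) in enumerate(ranked, start=1):
        found += label
        curve.append((i / len(ranked), found / total_actives))
    return curve

scores = [0.95, 0.91, 0.85, 0.70, 0.64, 0.52, 0.40, 0.33, 0.21, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]  # 1 = experimentally active

curve = enrichment_curve(scores, labels)
# After screening the top 40% of the ranked list, 3 of the 4 actives are found.
print(curve[3])  # -> (0.4, 0.75)
```

Plotting these points for two models scored on the same test set gives curves like those in Figure 4; the model whose curve rises faster enriches actives earlier in the ranked list.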
publications for HIV whole cell and reverse transcriptase from NIAID60 and M. tuberculosis whole cell data from the literature.58 These datasets were predicted against auto-curated training models of HIV whole cell and reverse transcriptase (Table S5, Figure S4) or M. tuberculosis (Table S6, Figure S5). All compounds which overlapped with the auto-curated training set were removed from the original published test set, and the activity threshold was set as equivalent to that of the training set.

The auto-curated HIV whole cell training model predicted its corresponding test set more accurately (ROC = 0.79) than the reverse transcriptase model predicted its test set (ROC = 0.73) (Table S5, Figure S4). Relative to the whole cell validations, most of the reverse transcriptase test set metrics were diminished, with the exception of ROC and specificity (Table S5c, Figure S4d). The M. tuberculosis validation (ROC = 0.77) had the highest specificity of all test sets, but all other metrics were comparable to the HIV test sets (Table S6, Figure S5).

■ DISCUSSION

Several of the largest comparisons of machine learning models relevant to drug discovery to date have used over 1000 models. For example, 1227 datasets from ChEMBL version 20 were used to compare Naïve Bayes, RF, SVM, logistic regression, and DNN using random split and temporal validation.39 Morgan fingerprints were used as descriptors, and validation metrics included MCC and Boltzmann enhanced discrimination of ROC (BEDROC). The random split used 70% training and 30% validation; DNN appeared to perform better than other methods when BEDROC was used, while RF was the best for MCC using random split validation. Temporal validation demonstrated that DNN outperformed all other methods, but all approaches performed better in the random split validation.39 Another study used 1310 assays from ChEMBL version 20 to compare binary models generated with SVM, kNN, RF, Naïve Bayes, and several types of DNN, including feed-forward neural networks, convolutional neural networks, and recurrent neural networks, for target prediction. Models were generated with a wide array of molecular descriptors and evaluated using ROC-AUC. Feed-forward neural networks were found to perform better than all other methods, followed by SVM.40 In another study, 1067 human targets in ChEMBL version 23 were selected to train and test multitask and single task DNN models considering sequence similarities of targets. Multitask learning outperformed single task learning when the targets were similar (ROC > 0.8), but when this was not considered or the targets were diverse, single task learning was superior (ROC 0.76−0.79).41 ChEMBL version 22 was used to extract molecules tested in 758 cell lines, which were then used for machine learning with SVM and similarity analysis; the two approaches gave comparable results on 10-fold cross-validation (accuracy = 0.65), but SVM was superior on external validation.42 ChEMBL version 22 was also used for compound−target interactions, and 1720 targets were selected with at least 10 compounds. Various molecular fingerprints and machine learning methods (Naïve Bayes and DNN) were compared with nearest neighbor approaches. It was demonstrated that combining Naïve Bayes and k-nearest neighbors performed the best in terms of recall and in a case study using a TRPV6 inhibitor.43 1360 datasets from ChEMBL version 19 were used to build RF regression models with ECFP4 descriptors, out of which 440 were reliable and, in turn, were used to create a QSAR-based affinity fingerprint. This was then evaluated for similarity searching and biological activity classification and found to be equivalent to Morgan fingerprints for similarity searching and to outperform Morgan fingerprints for scaffold hopping.44 DL has been used to learn compound−protein interactions using data from ChEMBL, BindingDB, and DrugBank, applying latent semantic analysis to compare a method called DeepCPI to other similar approaches; it was found to perform well, with an AUC ROC of 0.92. Several G-protein coupled receptors (GLP-1R, GCGR, and VIPR) were then selected for prospective virtual screening, and three positive allosteric modulators for GLP-1R were identified.45 These examples are representative of the types of comparisons performed to date using subsets of ChEMBL.

There has also been considerable discussion about how best to evaluate machine learning models. One approach offered the quantile-activity bootstrap to mimic extrapolation onto previously unseen areas of the molecular space in order to evaluate 25 sets of targets from ChEMBL with multiple machine learning methods. This approach was found to remove any advantage of DNNs or RF versus SVM and ridge regression.46 The work of Mayr et al.40 was reanalyzed in another study on the optimal way to benchmark models, which questioned AUC ROC and instead proposed that the area under the precision−recall curve should be used as well; SVM was comparable to feed-forward neural networks.47 It is important to note that not all computational approaches using machine learning models are successful, and this likely represents publication bias. One rare example described using ChEMBL and REAXYS to generate SVM models used along with pharmacophores and generative topographic mapping to select compounds as bromodomain BRD4 binders
with a 2.6-fold increase in hit rate relative to random screening.48

We have previously described Assay Central, which uses a previously described Bayesian method and ECFP6 fingerprints that were made open source,61,72 and the underlying components are therefore publicly available. We have shown that ECFP6 fingerprints compare favorably with other descriptors.54 We have compared Assay Central with other machine learning methods for a relatively small number of targets such as drug-induced liver injury,52 rat acute oral toxicity,53 estrogen receptor,54,55 androgen receptor,56 GSK3β,57 M. tuberculosis,58,59 non-nucleoside reverse transcriptase, and whole cell HIV.60 In general, we have found from our earlier studies that while DNN generally performed the best for five-fold cross-validation, this superiority was not observed with external test sets, and Assay Central generally performed comparably with SVM classification or DNN. Our evaluation of these algorithms is now expanded to well over 5000 ChEMBL datasets in this current study, making it the largest performed to date. We find that Assay Central compares very well to SVM classification using the commonly used five-fold cross-validation when many metrics are compared and visualized using several methods in order to assess each algorithm. While this is only one form of leave-out validation and it has limitations, it is certainly not as optimistic as leave-one-out, and it provides a useful guide for model utility. Unlike prior studies that have placed their emphasis on demonstrating the capabilities of DNN, we saw little or no advantage here, where model tuning was performed as previously described.76 In this study, Assay Central had a slight advantage, as the algorithm used to select the optimal classification cutoff61 for each of the >5000 datasets was based on that used for Assay Central. SVM classification seems a very strong competitor overall, producing comparable model metrics. Ideally, a standardized threshold would be applied to all datasets, but because of their heterogeneity and the uncertainty in how that would skew datasets to be extremely unbalanced (i.e., some datasets at a 1 μM threshold would be 90% active and others 2%), we opted for a calculated, albeit nonstandardized, threshold in order to simplify the process. One of the reasons for the performance observed with the Bayesian approach may be that it can handle unbalanced datasets well, and in this study, no attempts were made to balance datasets. In such cases, it is likely that other methods would perform better. We have not attempted to understand the effects of different algorithms with models of different sizes or degrees of (im)balance. However, dataset balancing will also reduce the dataset size, and in some cases, this would likely make the datasets too small to be of use. In our previous multiple-dataset comparisons of multiple machine learning methods and metrics, we limited our assessment to just eight datasets ranging in size from hundreds to hundreds of thousands of molecules curated from different sources.76 A rank-normalized score across the various model metrics suggested in that case that DNN outperformed SVM during external testing with a validation set held out from model training.76 The current analysis represents a study that is orders of magnitude larger, allowing us to draw statistically significant conclusions. It should also be noted that our prior comparisons of leave-out validation groups have reported stability in the ROC obtained when large datasets are used, whether this is leave-one-out or 100-fold 30−50% leave-out validation.80,81 Five-fold cross-validation is one way to obtain a snapshot comparison of how the models perform.

To our knowledge, this also appears to be the largest scale comparison of such machine learning algorithms and metrics. As we have now shown, the ChEMBL dataset is also statistically different from approved drugs based on important molecular properties (Figure 3). This is important because it illustrates the structural diversity of ChEMBL; the dataset diversity in general is also much larger than in previous studies performing machine learning comparisons. This should therefore provide more confidence in the algorithm comparison results, as it suggests that we are covering more diverse chemical property space. Future studies could certainly evaluate additional descriptors, algorithms, and machine learning methods for evaluating the resulting models. Public datasets and databases such as ChEMBL will likely increase in size, and they will continue to be utilized more extensively to generate machine learning models by many more scientists. In order for this to be the case, we need to make such datasets accessible for machine learning, and this requires some degree of curation. In this study, we have demonstrated that minimally curated datasets (desalted, neutralized, and excluding any problematic compounds) can produce useful machine learning models with good statistics across a range of different algorithms. We have termed this approach “auto-curation”. We would naturally expect that it would be advantageous to expend significant effort to further manually curate datasets before using them for prediction; however, there may be some value in such minimally curated models, and it may be impractical to manually curate thousands of models. While it is cost prohibitive and virtually impossible to prospectively validate all of the over 5000 ChEMBL models, we have selected several to demonstrate their potential. We have therefore tested the utility of these auto-curated and manually curated models for PXR (Table S3) and hERG (Table S4). We noted that the model built with auto-curation (PXR/ADME/c50) predicted several of the most active PXR activators, including CPI1072, CPI1073, and CPI1077. These molecules were also scored highly with the manually curated model with the lowest cutoff. Similarly, the most potent inhibitors of hERG, CPI1071, CPI1223a, and CPI1286, were also predicted by the auto-curated ChEMBL model; however, other compounds were similarly scored, indicating poor prediction capability. Similar results were seen with the manually curated models, suggesting, in these two limited cases (for relatively small datasets), the benefits of different curation approaches. We would certainly benefit from a much larger evaluation using many more examples to see whether there is a statistically relevant improvement in models after curation; however, the cost of doing this either internally or through an external organization is currently prohibitive. In addition, an alternative approach to validation is to use several of these auto-curated models to predict data that was previously manually curated and used for training or testing models for HIV whole cell, HIV reverse transcriptase, and M. tuberculosis inhibition.58,60 In all cases, the ROC for these test sets was >0.7 (Figures S4 and S5). These results also suggest the potential of auto-curated models (Figure 4). Efforts are ongoing to develop more advanced auto-curation approaches which might bring more of the benefits of full manual curation but with limited human input. This work builds on our recent comparisons of different machine learning methods and algorithms for various
individual projects52,53,55−60,82 to enable a very large scale comparison. It also demonstrates that several widely used (and openly accessible) machine learning methods consistently outperform the others at classification. Assay Central uses a Bayesian approach which appears comparable to SVM classification (Figure 1B), and these in turn outperform all other algorithms used, including the much hyped deep learning which is being more widely applied in pharmaceutical research.29 This study may be enlightening as it echoes our earlier findings on individual datasets after extensive manual curation.52,53,55−60,82 It also goes some way further in using external validation with these methods for toxicology and drug discovery properties. Now that we have generated such a vast array of over 5000 machine learning models, it presents opportunities for using them for predicting the potential of new molecules to interact with targets of interest or avoiding others (such as PXR and hERG) that may result in undesirable effects. We will also be able to compare the prospective ability of all the machine learning methods, which is also rarely undertaken. This current study therefore represents a step toward using large scale machine learning models to assist in creating a drug discovery pipeline.1 Our approach with models created with ChEMBL could also be taken with other very large databases such as PubChem or even considered by pharmaceutical companies to leverage their internal databases.

■ ASSOCIATED CONTENT

Supporting Information
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.molpharmaceut.0c01013.

Separate collection of files for over 5000 datasets auto-curated from ChEMBL (PDF)

■ AUTHOR INFORMATION

Corresponding Author
Sean Ekins − Collaborations Pharmaceuticals, Inc., Raleigh, North Carolina 27606, United States; orcid.org/0000-0002-5691-5790; Phone: 215-687-1320; Email: sean@collaborationspharma.com

Authors
Thomas R. Lane − Collaborations Pharmaceuticals, Inc., Raleigh, North Carolina 27606, United States
Daniel H. Foil − Collaborations Pharmaceuticals, Inc., Raleigh, North Carolina 27606, United States
Eni Minerali − Collaborations Pharmaceuticals, Inc., Raleigh, North Carolina 27606, United States
Fabio Urbina − Department of Cell Biology and Physiology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599-7545, United States
Kimberley M. Zorn − Collaborations Pharmaceuticals, Inc., Raleigh, North Carolina 27606, United States

Complete contact information is available at:
https://pubs.acs.org/10.1021/acs.molpharmaceut.0c01013

Funding
We kindly acknowledge NIH funding: R44GM122196-02A1 from NIGMS, 3R43AT010585-01S1 from NCCAM, and 1R43ES031038-01 from NIEHS (PI Sean Ekins). “Research reported in this publication was supported by the National Institute of Environmental Health Sciences of the National Institutes of Health under award number R43ES031038. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.”

Notes
The authors declare the following competing financial interest(s): S.E., D.H.F., E.M., K.M.Z., and T.R.L. work for Collaborations Pharmaceuticals, Inc. F.U. has no conflicts of interest.

■ ACKNOWLEDGMENTS

Dr. Daniel P. Russo is kindly acknowledged for initially setting up and providing code for large scale machine learning comparisons and for considerable discussions on machine learning. Valery Tkachenko is kindly acknowledged for hardware and software support. Kushal Batra is kindly acknowledged for providing a script used in data analysis. Dr. Vadim Makarov is acknowledged for providing several compounds that were used in our hERG and PXR test sets. Dr. Alex M. Clark is thanked for Assay Central support. We have made use of the ChEMBL database and kindly acknowledge the team at EBI that has over several years worked hard to provide this valuable resource to the scientific community. Biovia are kindly acknowledged for providing Discovery Studio.

■ ABBREVIATIONS

AUC, area under the curve; DNN, deep neural networks; kNN, k-nearest neighbors; QSAR, quantitative structure−activity relationships; RF, random forest; ROC, receiver operating characteristic; SVM, support vector machines

■ REFERENCES

(1) Ekins, S.; Puhl, A. C.; Zorn, K. M.; Lane, T. R.; Russo, D. P.; Klein, J. J.; Hickey, A. J.; Clark, A. M. Exploiting machine learning for end-to-end drug discovery and development. Nat. Mater. 2019, 18, 435−441.
(2) Zhavoronkov, A.; Ivanenkov, Y. A.; Aliper, A.; Veselov, M. S.; Aladinskiy, V. A.; Aladinskaya, A. V.; Terentiev, V. A.; Polykovskiy, D. A.; Kuznetsov, M. D.; Asadulaev, A.; Volkov, Y.; Zholus, A.; Shayakhmetov, R. R.; Zhebrak, A.; Minaeva, L. I.; Zagribelnyy, B. A.; Lee, L. H.; Soll, R.; Madge, D.; Xing, L.; Guo, T.; Aspuru-Guzik, A. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol. 2019, 37, 1038−1040.
(3) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J. P. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012, 40, D1100−D1107.
(4) Kim, S.; Thiessen, P. A.; Bolton, E. E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B. A.; Wang, J.; Yu, B.; Zhang, J.; Bryant, S. H. PubChem Substance and Compound databases. Nucleic Acids Res. 2016, 44, D1202−D1213.
(5) Anon. The PubChem Database. http://pubchem.ncbi.nlm.nih.gov/ (accessed October 12, 2020).
(6) Schmidhuber, J. Deep learning in neural networks: an overview. Neural Network. 2015, 61, 85−117.
(7) Capuzzi, S. J.; Politi, R.; Isayev, O.; Farag, S.; Tropsha, A. QSAR Modeling of Tox21 Challenge Stress Response and Nuclear Receptor Signaling Toxicity Assays. Front. Environ. Sci. 2016, 4, 3.
(8) Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; Fei-Fei, L. ImageNet large scale visual recognition challenge. 2014, arXiv:1409.0575. https://arxiv.org/pdf/1409.0575.pdf.
(9) Bennet, K. P.; Campbell, C. Support vector machines: Hype or hallelujah? SIGKDD Explor. 2000, 2, 1−13.
(10) Cristianini, N.; Shawe-Taylor, J. Support Vector Machines and Other Kernel-Based Learning Methods; Cambridge University Press: Cambridge, MA, 2000.
(11) Chang, C. C.; Lin, C. J. LIBSVM: A Library for Support Vector Machines, 2001.
(12) Lei, T.; Chen, F.; Liu, H.; Sun, H.; Kang, Y.; Li, D.; Li, Y.; Hou, T. ADMET Evaluation in Drug Discovery. Part 17: Development of Quantitative and Qualitative Prediction Models for Chemical-Induced Respiratory Toxicity. Mol. Pharm. 2017, 14, 2407−2421.
(13) Kriegl, J. M.; Arnhold, T.; Beck, B.; Fox, T. A support vector machine approach to classify human cytochrome P450 3A4 inhibitors. J. Comput. Aided Mol. Des. 2005, 19, 189−201.
(14) Guangli, M.; Yiyu, C. Predicting Caco-2 permeability using support vector machine and chemistry development kit. J. Pharm. Pharmaceut. Sci. 2006, 9, 210−221.
(15) Kortagere, S.; Chekmarev, D.; Welsh, W. J.; Ekins, S. Hybrid scoring and classification approaches to predict human pregnane X receptor activators. Pharm. Res. 2009, 26, 1001−1011.
(16) Shen, M.; Xiao, Y.; Golbraikh, A.; Gombar, V. K.; Tropsha, A. Development and validation of k-nearest neighbour QSPR models of metabolic stability of drug candidates. J. Med. Chem. 2003, 46, 3013−3020.
(17) Wang, S.; Sun, H.; Liu, H.; Li, D.; Li, Y.; Hou, T. ADMET Evaluation in Drug Discovery. 16. Predicting hERG Blockers by Combining Multiple Pharmacophores and Machine Learning Approaches. Mol. Pharm. 2016, 13, 2855−2866.
(18) Li, D.; Chen, L.; Li, Y.; Tian, S.; Sun, H.; Hou, T. ADMET evaluation in drug discovery. 13. Development of in silico prediction models for P-glycoprotein substrates. Mol. Pharm. 2014, 11, 716−726.
(19) Nidhi; Glick, M.; Davies, J. W.; Jenkins, J. L. Prediction of biological targets for compounds using multiple-category Bayesian models trained on chemogenomics databases. J. Chem. Inf. Model. 2006, 46, 1124−1133.
(20) Azzaoui, K.; Hamon, J.; Faller, B.; Whitebread, S.; Jacoby, E.; Bender, A.; Jenkins, J. L.; Urban, L. Modeling promiscuity based on in vitro safety pharmacology profiling data. ChemMedChem 2007, 2, 874−880.
(21) Bender, A.; Scheiber, J.; Glick, M.; Davies, J. W.; Azzaoui, K.; Hamon, J.; Urban, L.; Whitebread, S.; Jenkins, J. L. Analysis of Pharmacology Data and the Prediction of Adverse Drug Reactions and Off-Target Effects from Chemical Structure. ChemMedChem 2007, 2, 861−873.
(22) Susnow, R. G.; Dixon, S. L. Use of robust classification techniques for the prediction of human cytochrome P450 2D6 inhibition. J. Chem. Inf. Comput. Sci. 2003, 43, 1308−1315.
(23) Mitchell, J. B. O. Machine learning methods in chemoinformatics. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2014, 4, 468−481.
(24) Wacker, S.; Noskov, S. Y. Performance of Machine Learning Algorithms for Qualitative and Quantitative Prediction Drug Blockade of hERG1 channel. Comput. Toxicol. 2018, 6, 55−63.
(25) Zhu, H.; Zhang, J.; Kim, M. T.; Boison, A.; Sedykh, A.; Moran, K. Big data in chemical toxicity research: the use of high-throughput screening assays to identify potential toxicants. Chem. Res. Toxicol. 2014, 27, 1643−1651.
(26) Clark, A. M.; Ekins, S. Open Source Bayesian Models: 2. Mining A "big dataset" to create and validate models with ChEMBL. J. Chem. Inf. Model. 2015, 55, 1246−1260.
(27) Ekins, S.; Clark, A. M.; Swamidass, S. J.; Litterman, N.; Williams, A. J. Bigger data, collaborative tools and the future of predictive drug discovery. J. Comput. Aided Mol. Des. 2014, 28, 997−1008.
(28) Ekins, S.; Freundlich, J. S.; Reynolds, R. C. Are Bigger Data Sets Better for Machine Learning? Fusing Single-Point and Dual-Event Dose Response Data for Mycobacterium tuberculosis. J. Chem. Inf. Model. 2014, 54, 2157−2165.
(29) Ekins, S. The Next Era: Deep Learning in Pharmaceutical Research. Pharm. Res. 2016, 33, 2594−2603.
(30) Baskin, I. I.; Winkler, D.; Tetko, I. V. A renaissance of neural networks in drug discovery. Expert Opin. Drug Discov. 2016, 11, 785−795.
(31) Wang, L.; Ma, C.; Wipf, P.; Liu, H.; Su, W.; Xie, X.-Q. TargetHunter: an in silico target identification tool for predicting therapeutic potential of small organic molecules based on chemogenomic database. AAPS J. 2013, 15, 395−406.
(32) Koutsoukas, A.; Lowe, R.; Kalantarmotamedi, Y.; Mussa, H. Y.; Klaffke, W.; Mitchell, J. B. O.; Glen, R. C.; Bender, A. In silico target predictions: defining a benchmarking data set and comparison of performance of the multiclass Naive Bayes and Parzen-Rosenblatt window. J. Chem. Inf. Model. 2013, 53, 1957−1966.
(33) Cortes-Ciriano, I. Benchmarking the Predictive Power of Ligand Efficiency Indices in QSAR. J. Chem. Inf. Model. 2016, 56, 1576−1587.
(34) Qureshi, A.; Kaur, G.; Kumar, M. AVCpred: an integrated web server for prediction and design of antiviral compounds. Chem. Biol. Drug Des. 2017, 89, 74−83.
(35) Bieler, M.; Reutlinger, M.; Rodrigues, T.; Schneider, P.; Kriegl, J. M.; Schneider, G. Designing Multi-target Compound Libraries with Gaussian Process Models. Mol. Inf. 2016, 35, 192−198.
(36) Huang, T.; Mi, H.; Lin, C. Y.; Zhao, L.; Zhong, L. L.; Liu, F. B.; Zhang, G.; Lu, A. P.; Bian, Z. X. MOST: most-similar ligand based approach to target prediction. BMC Bioinf. 2017, 18, 165.
(37) Cortés-Ciriano, I.; Firth, N. C.; Bender, A.; Watson, O. Discovering Highly Potent Molecules from an Initial Set of Inactives Using Iterative Screening. J. Chem. Inf. Model. 2018, 58, 2000−2014.
(38) Bosc, N.; Atkinson, F.; Felix, E.; Gaulton, A.; Hersey, A.; Leach, A. R. Large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery. J. Cheminf. 2019, 11, 4.
(39) Lenselink, E. B.; Ten Dijke, N.; Bongers, B.; Papadatos, G.; van Vlijmen, H. W. T.; Kowalczyk, W.; IJzerman, A. P.; van Westen, G. J. P. Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J. Cheminf. 2017, 9, 45.
(40) Mayr, A.; Klambauer, G.; Unterthiner, T.; Steijaert, M.; Wegner, J. K.; Ceulemans, H.; Clevert, D.-A.; Hochreiter, S. Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem. Sci. 2018, 9, 5441−5451.
(41) Lee, K.; Kim, D. In-Silico Molecular Binding Prediction for Human Drug Targets Using Deep Neural Multi-Task Learning. Genes 2019, 10, 906.
(42) Tejera, E.; Carrera, I.; Jimenes-Vargas, K.; Armijos-Jaramillo, V.; Sánchez-Rodríguez, A.; Cruz-Monteagudo, M.; Perez-Castillo, Y. Cell fishing: A similarity based approach and machine learning strategy for multiple cell lines-compound sensitivity prediction. PLoS One 2019, 14, No. e0223276.
(43) Awale, M.; Reymond, J.-L. Polypharmacology Browser PPB2: Target Prediction Combining Nearest Neighbors with Machine Learning. J. Chem. Inf. Model. 2019, 59, 10−17.
(44) Škuta, C.; Cortés-Ciriano, I.; Dehaen, W.; Kříž, P.; van Westen, G. J. P.; Tetko, I. V.; Bender, A.; Svozil, D. QSAR-derived affinity fingerprints (part 1): fingerprint construction and modeling performance for similarity searching, bioactivity classification and scaffold hopping. J. Cheminf. 2020, 12, 39.
(45) Wan, F.; Zhu, Y.; Hu, H.; Dai, A.; Cai, X.; Chen, L.; Gong, H.; Xia, T.; Yang, D.; Wang, M.-W.; Zeng, J. DeepCPI: A Deep Learning-based Framework for Large-scale in silico Drug Screening. Genomics, Proteomics Bioinf. 2019, 17, 478−495.
(46) Watson, O. P.; Cortes-Ciriano, I.; Taylor, A. R.; Watson, J. A. A decision-theoretic approach to the evaluation of machine learning algorithms in computational drug discovery. Bioinformatics 2019, 35, 4656−4663.
(47) Robinson, M. C.; Glen, R. C.; Lee, A. A. Validating the validation: reanalyzing a large-scale comparison of deep learning and machine learning models for bioactivity prediction. J. Comput. Aided Mol. Des. 2020, 34, 717−730.
(48) Casciuc, I.; Horvath, D.; Gryniukova, A.; Tolmachova, K. A.; Vasylchenko, O. V.; Borysko, P.; Moroz, Y. S.; Bajorath, J.; Varnek, A. Pros and cons of virtual screening based on public "Big Data": In silico mining for new bromodomain inhibitors. Eur. J. Med. Chem. 2019, 165, 258−272.
(49) Bediaga, H.; Arrasate, S.; González-Díaz, H. PTML Combinatorial Model of ChEMBL Compounds Assays for Multiple Types of Cancer. ACS Comb. Sci. 2018, 20, 621−632.
(50) Chen, Y.; Stork, C.; Hirte, S.; Kirchmair, J. NP-Scout: Machine Learning Approach for the Quantification and Visualization of the Natural Product-Likeness of Small Molecules. Biomolecules 2019, 9, 43.
(51) Zin, P. P. K.; Williams, G. J.; Ekins, S. Cheminformatics Analysis and Modeling with MacrolactoneDB. Sci. Rep. 2020, 10, 6284.
(52) Minerali, E.; Foil, D. H.; Zorn, K. M.; Lane, T. R.; Ekins, S. Comparing Machine Learning Algorithms for Predicting Drug-Induced Liver Injury (DILI). Mol. Pharm. 2020, 17, 2628−2637.
(53) Minerali, E.; Foil, D. H.; Zorn, K. M.; Ekins, S. Evaluation of Assay Central Machine Learning Models for Rat Acute Oral Toxicity Prediction. ACS Sustainable Chem. Eng. 2020, 8, 16020−16027.
(54) Zorn, K. M.; Foil, D. H.; Lane, T. R.; Russo, D. P.; Hillwalker, W.; Feifarek, D. J.; Jones, F.; Klaren, W. D.; Brinkman, A. M.; Ekins, S. Machine Learning Models for Estrogen Receptor Bioactivity and Endocrine Disruption Prediction. Environ. Sci. Technol. 2020, 54, 12202−12213.
(55) Russo, D. P.; Zorn, K. M.; Clark, A. M.; Zhu, H.; Ekins, S. Comparing Multiple Machine Learning Algorithms and Metrics for Estrogen Receptor Binding Prediction. Mol. Pharm. 2018, 15, 4361−4370.
(56) Zorn, K. M.; Foil, D. H.; Lane, T. R.; Hillwalker, W.; Feifarek, D. J.; Jones, F.; Klaren, W. D.; Brinkman, A. M.; Ekins, S. Comparison of Machine Learning Models for the Androgen Receptor. Environ. Sci. Technol. 2020, 54, 13690−13700.
(57) Vignaux, P. A.; Minerali, E.; Foil, D. H.; Puhl, A. C.; Ekins, S. Machine Learning for Discovery of GSK3β Inhibitors. ACS Omega 2020, 5, 26551−26561.
(58) Lane, T.; Russo, D. P.; Zorn, K. M.; Clark, A. M.; Korotcov, A.; Tkachenko, V.; Reynolds, R. C.; Perryman, A. L.; Freundlich, J. S.; Ekins, S. Comparing and Validating Machine Learning Models for Mycobacterium tuberculosis Drug Discovery. Mol. Pharm. 2018, 15, 4346−4360.
(59) Puhl, A. C.; Lane, T. R.; Vignaux, P. A.; Zorn, K. M.; Capodagli, G. C.; Neiditch, M. B.; Freundlich, J. S.; Ekins, S. Computational Approaches to Identify Molecules binding to Mycobacterium Tuberculosis KasA. ACS Omega 2020, 5, 29935. In Press.
(60) Zorn, K. M.; Lane, T. R.; Russo, D. P.; Clark, A. M.; Makarov, V.; Ekins, S. Multiple Machine Learning Comparisons of HIV Cell-based and Reverse Transcriptase Data Sets. Mol. Pharm. 2019, 16, 1620−1632.
(61) Clark, A. M.; Ekins, S. Open Source Bayesian Models. 2. Mining a "Big Dataset" To Create and Validate Models with ChEMBL. J. Chem. Inf. Model. 2015, 55, 1246−1260.
(62) Gaulton, A.; Hersey, A.; Nowotka, M.; Bento, A. P.; Chambers,
(65) Dalecki, A. G.; Zorn, K. M.; Clark, A. M.; Ekins, S.; Narmore, W. T.; Tower, N.; Rasmussen, L.; Bostwick, R.; Kutsch, O.; Wolschendorf, F. High-throughput screening and Bayesian machine learning for copper-dependent inhibitors of Staphylococcus aureus. Metallomics 2019, 11, 696−706.
(66) Ekins, S.; Gerlach, J.; Zorn, K. M.; Antonio, B. M.; Lin, Z.; Gerlach, A. Repurposing Approved Drugs as Inhibitors of Kv7.1 and Nav1.8 to Treat Pitt Hopkins Syndrome. Pharm. Res. 2019, 36, 137.
(67) Ekins, S.; Mottin, M.; Ramos, P.; Sousa, B. K. P.; Neves, B. J.; Foil, D. H.; Zorn, K. M.; Braga, R. C.; Coffee, M.; Southan, C.; Puhl, A. C.; Andrade, C. H. Deja vu: Stimulating open drug discovery for SARS-CoV-2. Drug Discov. Today 2020, 25, 928.
(68) Hernandez, H. W.; Soeung, M.; Zorn, K. M.; Ashoura, N.; Mottin, M.; Andrade, C. H.; Caffrey, C. R.; de Siqueira-Neto, J. L.; Ekins, S. High Throughput and Computational Repurposing for Neglected Diseases. Pharm. Res. 2018, 36, 27.
(69) Sandoval, P. J.; Zorn, K. M.; Clark, A. M.; Ekins, S.; Wright, S. H. Assessment of Substrate-Dependent Ligand Interactions at the Organic Cation Transporter OCT2 Using Six Model Substrates. Mol. Pharmacol. 2018, 94, 1057−1068.
(70) Wang, P.-F.; Neiner, A.; Lane, T. R.; Zorn, K. M.; Ekins, S.; Kharasch, E. D. Halogen Substitution Influences Ketamine Metabolism by Cytochrome P450 2B6: In Vitro and Computational Approaches. Mol. Pharm. 2019, 16, 898−906.
(71) Zorn, K. M.; Foil, D. H.; Lane, T. R.; Russo, D. P.; Hillwalker, W.; Feifarek, D. J.; Jones, F.; Klaren, W. D.; Brinkman, A. M.; Ekins, S. Machine Learning Models for Estrogen Receptor Bioactivity and Endocrine Disruption Prediction. Environ. Sci. Technol. 2020, 54, 12202−12213.
(72) Clark, A. M.; Dole, K.; Coulon-Spektor, A.; McNutt, A.; Grass, G.; Freundlich, J. S.; Reynolds, R. C.; Ekins, S. Open Source Bayesian Models. 1. Application to ADME/Tox and Drug Discovery Datasets. J. Chem. Inf. Model. 2015, 55, 1231−1245.
(73) Carletta, J. Assessing agreement on classification tasks: The kappa statistic. Comput. Ling. 1996, 22, 249−254.
(74) Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37−46.
(75) Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 1975, 405, 442−451.
(76) Korotcov, A.; Tkachenko, V.; Russo, D. P.; Ekins, S. Comparison of Deep Learning With Multiple Machine Learning Methods and Metrics Using Diverse Drug Discovery Data Sets. Mol. Pharm. 2017, 14, 4462−4475.
(77) Caruana, R.; Niculescu-Mizil, A. An empirical comparison of supervised learning algorithms. 23rd International Conference on Machine Learning: Pittsburgh, PA, 2006.
(78) Siramshetty, V. B.; Eckert, O. A.; Gohlke, B.-O.; Goede, A.; Chen, Q.; Devarakonda, P.; Preissner, S.; Preissner, R. SuperDRUG2: a one stop resource for approved/marketed drugs. Nucleic Acids Res. 2018, 46, D1137−D1143.
(79) Piper, D. R.; Duff, S. R.; Eliason, H. C.; Frazee, W. J.; Frey, E. A.; Fuerstenau-Sharp, M.; Jachec, C.; Marks, B. D.; Pollok, B. A.; Shekhani, M. S.; Thompson, D. V.; Whitney, P.; Vogel, K. W.; Hess, S. D. Development of the predictor HERG fluorescence polarization
J.; Mendez, D.; Mutowo, P.; Atkinson, F.; Bellis, L. J.; Cibrián-Uhalte, assay using a membrane protein enrichment approach. Assay Drug
E.; Davies, M.; Dedman, N.; Karlsson, A.; Magariños, M. P.; Dev. Technol. 2008, 6, 213−223.
Overington, J. P.; Papadatos, G.; Smit, I.; Leach, A. R. The ChEMBL (80) Ekins, S.; Williams, A. J.; Xu, J. J. A predictive ligand-based
database in 2017. Nucleic Acids Res. 2017, 45, D945−D954. Bayesian model for human drug-induced liver injury. Drug Metab.
(63) Willighagen, E. L.; Mayfield, J. W.; Alvarsson, J.; Berg, A.; Dispos. 2010, 38, 2302−2308.
Carlsson, L.; Jeliazkova, N.; Kuhn, S.; Pluskal, T.; Rojas-Cherto, M.; (81) Zientek, M.; Stoner, C.; Ayscue, R.; Klug-McLeod, J.; Jiang, Y.;
Spjuth, O.; Torrance, G.; Evelo, C. T.; Guha, R.; Steinbeck, C. The West, M.; Collins, C.; Ekins, S. Integrated in silico-in vitro strategy for
Chemistry Development Kit (CDK) v2.0: atom typing, depiction, addressing cytochrome P450 3A4 time-dependent inhibition. Chem.
Res. Toxicol. 2010, 23, 664−676.
molecular formulas, and substructure searching. J. Cheminf. 2017, 9,
(82) Zorn, K. M.; Foil, D. H.; Lane, T. R.; Russo, D. P.; Hillwalker,
33.
W.; Feifarek, D. J.; Jones, F.; Klaren, W. D.; Brinkman, A. M.; Ekins,
(64) Anantpadma, M.; Lane, T.; Zorn, K. M.; Lingerfelt, M. A.;
S. Machine Learning Models for Estrogen Receptor Bioactivity and
Clark, A. M.; Freundlich, J. S.; Davey, R. A.; Madrid, P. B.; Ekins, S.
Endocrine Disruption Prediction. Environ. Sci. Technol. 2020, 54,
Ebola Virus Bayesian Machine Learning Models Enable New in Vitro 12202−12213.
Leads. ACS Omega 2019, 4, 2353−2361.