Identification of Active Molecules Against Mycobacterium tuberculosis Through Machine Learning

Briefings in Bioinformatics, 22(5), 2021, 1–15

https://doi.org/10.1093/bib/bbab068
Problem Solving Protocol

Identification of active molecules against

Downloaded from https://academic.oup.com/bib/article/22/5/bbab068/6209685 by Zhejiang University user on 23 February 2022


Mycobacterium tuberculosis through machine learning
Qing Ye†, Xin Chai†, Dejun Jiang, Liu Yang, Chao Shen, Xujun Zhang,
Dan Li, Dongsheng Cao and Tingjun Hou
Corresponding authors: Dan Li, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China. Tel.: +86-571-88208412;
E-mail: lidancps@zju.edu.cn; Dongsheng Cao, Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410003, P.R. China.
Tel.: +86-139-7488-0914; E-mail: oriental-cds@163.com; Tingjun Hou, College of Pharmaceutical Sciences and State Key Lab of CAD&CG, Zhejiang
University, Hangzhou, Zhejiang 310058, P. R. China. Tel.: +86-571-88208412; E-mail: tingjunhou@zju.edu.cn
† These authors have contributed equally.

Abstract
Tuberculosis (TB) is an infectious disease caused by Mycobacterium tuberculosis (Mtb) and it has been one of the top 10 causes
of death globally. Extensively drug-resistant tuberculosis (XDR-TB), which is resistant to the commonly used first-line drugs, has
emerged as a major challenge to TB treatment. Hence, it is quite necessary to discover novel drug candidates for TB
treatment. In this study, based on different types of molecular representations, four machine learning (ML) algorithms,
including support vector machine, random forest (RF), extreme gradient boosting (XGBoost) and deep neural networks
(DNN), were used to develop classification models to distinguish Mtb inhibitors from noninhibitors. The results demonstrate
that the XGBoost model exhibits the best prediction performance. Then, two consensus strategies were employed to
integrate the predictions from multiple models. The evaluation results illustrate that the consensus model by stacking the
RF, XGBoost and DNN predictions offers the best predictions with area under the receiver operating characteristic curve of
0.842 and 0.942 for the 10-fold cross-validated training set and external test set, respectively. Besides, the association
between the important descriptors and the bioactivities of molecules was interpreted by using the Shapley additive
explanations method. Finally, an online webserver called ChemTB (http://cadd.zju.edu.cn/chemtb/) was developed, and it
offers a freely available computational tool to detect potential Mtb inhibitors.

Qing Ye is now a postgraduate in the College of Pharmaceutical Sciences at Zhejiang University under the supervision of Prof. Tingjun Hou. His research
interests mainly lie in the area of computer-aided drug design.
Xin Chai is currently a PhD student in the College of Pharmaceutical Sciences at Zhejiang University under the supervision of Prof. Tingjun Hou. His
research interests mainly lie in the area of computer-aided drug design.
Dejun Jiang is currently a PhD student in the College of Pharmaceutical Sciences at Zhejiang University under the supervision of Prof. Tingjun Hou. His
research interests mainly lie in the area of computer-aided drug design.
Liu Yang is now a postgraduate in the College of Pharmaceutical Sciences at Zhejiang University under the supervision of Prof. Tingjun Hou. His research
interests mainly lie in the area of computer-aided drug design.
Chao Shen is currently a PhD student in the College of Pharmaceutical Sciences at Zhejiang University under the supervision of Prof. Tingjun Hou. His
research interests mainly lie in the area of computer-aided drug design.
Xujun Zhang is now a postgraduate in the College of Pharmaceutical Sciences at Zhejiang University under the supervision of Prof. Tingjun Hou. His
research interests mainly lie in the area of computer-aided drug design.
Dan Li is now an associate professor in the College of Pharmaceutical Sciences, Zhejiang University, China. Her research interests include design and
discovery of novel drug candidates through computer-aided drug design and biological evaluation.
Dongsheng Cao is currently a professor in the Xiangya School of Pharmaceutical Sciences at Central South University. More information can be found at
the website of his group: http://www.scbdd.com.
Tingjun Hou is currently a professor in the College of Pharmaceutical Sciences at Zhejiang University. More information can be found at the website of
his group: http://cadd.zju.edu.cn.
Submitted: 14 December 2020; Received (in revised form): 23 January 2021

© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

2 Ye et al.

Key words: Mycobacterium tuberculosis; machine learning; extreme gradient boosting; deep learning; model fusion

Introduction

Tuberculosis (TB) is a contagious disease in humans mainly caused by Mycobacterium tuberculosis (Mtb), and it is one of the top 10 causes of death [1]. Globally, an estimated 1.7 billion people are infected with Mtb and therefore at risk for developing active TB. Besides, people living with HIV are more likely to develop TB than others, and an estimated 8.6% of the incident TB cases were among people living with HIV in 2018 [2, 3]. Although a number of drugs (e.g. rifampicin, isoniazid, pyrazinamide and ethambutol) have been developed for the treatment of TB, the emergence of multidrug-resistant TB (MDR-TB, defined as TB with resistance to at least isoniazid and rifampin) and extensively drug-resistant TB (XDR-TB, defined as TB with resistance to both isoniazid and rifampin, plus any fluoroquinolone and at least one of the three injectable second-line drugs) would significantly compromise the therapeutic efficiency of these drugs [4–9]. According to the drug surveillance data in 2016, around 600 000 people developed MDR-TB and 240 000 people died. Globally, 4.1% of new cases and 19% of previously treated cases were diagnosed with either MDR-TB or rifampicin-resistant TB (RR-TB) in 2016 [3]. Obviously, there is an urgent need to develop new anti-TB drugs that are effective against drug-resistant TB strains. However, discovery of new anti-TB drugs is extremely difficult. The last first-line anti-TB drug, rifampicin, was licensed in 1963 [10]. Since then, a few drug candidates have been pushed into clinical trials, and encouragingly two new drugs (i.e. Bedaquiline and Delamanid) have been approved for the treatment of MDR-TB. However, both drugs are associated with unfavorable side effects and are only recommended when other treatment regimens are not effective. Therefore, more efficient and cost-saving strategies are needed to accelerate the design and discovery of new anti-TB therapeutics.

In the past decade, with the accumulation of extensive experimental data, a number of computational models have been developed to predict Mtb-active molecules based on the quantitative structure–activity relationship (QSAR) methodology [11]. For example, in 2008, based on a dataset of 3779 compounds having minimum inhibition concentration (MIC) values ranging from 0.00316 nM to 4094 μM, Prathipati et al. established Bayesian classification models using 15 types of fingerprints, and the best classifier based on the ECFP12 fingerprints yielded a prediction accuracy of 0.73 on the test set of 2880 compounds [12]. In 2014, based on a dataset consisting of 773 compounds tested in the mouse TB infection model, Ekins et al. used three machine learning (ML) algorithms, including naïve Bayes (NB), support vector machine (SVM) and recursive partitioning (RP), to construct classification models, and the Bayesian model successfully predicted 8 out of 11 in vivo actives in the external test set, highlighting the reliability of the cost-effective predictor [13]. More recently, Lane et al. developed a set of classification models based on an Mtb dataset with 18 886 molecules using six ML algorithms, including Bernoulli naïve Bayes (BNB), linear logistic regression, AdaBoost Decision Tree, random forest (RF), SVM and deep neural networks (DNN). The results demonstrate that DNN and SVM outperformed the other algorithms regardless of the descriptor type for training and cross-validation, but the BNB model yielded the best predictions for an external test set with 1171 molecules [14]. Although many different ML algorithms have been used for the prediction of Mtb inhibition activities, no single algorithm clearly outperformed the others. Moreover, the new ensemble learning algorithm, extreme gradient boosting (XGBoost), which has been successfully employed in ADMET predictions and QSPR modeling, has never been used in Mtb inhibition prediction [15, 16]. Thus, there is a recognized need to explore how well those state-of-the-art methods perform in the prediction of Mtb inhibition activity. Furthermore, the reported prediction models have never been integrated into any software package or online prediction system, which hinders the practical applications of these models.

In this study, with the aim of establishing reliable classification models for Mtb inhibition prediction and providing a freely accessible platform for sharing our models, 24 models were developed using four ML algorithms (SVM, RF, XGBoost and DNN) based on six different types of molecular descriptors/fingerprints. Then, to further improve the performance of the models, the strategy of combining molecular descriptors and fingerprints and two ensemble methods (voting and stacking) were employed in model construction. The local outlier factor (LOF) algorithm was adopted to assess the application domain (AD) of the best model. Furthermore, the essential molecular descriptors were highlighted by the SHapley Additive exPlanations (SHAP) algorithm. Finally, a webserver for detecting potential Mtb inhibitors was developed and is freely accessible at http://cadd.zju.edu.cn/chemtb/.

Materials and methods

Dataset preparation

In this study, the Mtb dataset was collected from the ChEMBL database by searching three key ChEMBL Target IDs (360, 2111188 and 2366634) and the related bioactivity data were downloaded [17]. Then, to ensure the quality of the data, the dataset was processed by the following three steps: (i) the molecules without bioactivity records or explicit chemical structures (SMILES strings) were removed; (ii) the molecules with the bioactivity measured by the MIC were selected, and the units of bioactivity (i.e. μg/ml, μM, nM) were all converted into μM; (iii) for a molecule with multiple bioactivity records, if the bioactivity values were all above or all below the bioactivity threshold, the bioactivity record was retained as the average of the bioactivity values; otherwise, the molecule was removed. Here, according to previous studies [12, 14], a reasonable threshold of 5 μM was defined to distinguish active and inactive compounds, and the whole dataset contains 2424 Mtb inhibitors and 6094 Mtb noninhibitors. The collected dataset was divided into three subsets including the training set, scaffold set and time set for model construction, validation and evaluation, respectively. Moreover, to evaluate the generalization ability of our models, the bioactivity data published in 2018 were extracted from the whole dataset to construct the time set for external model evaluation. Next, to ensure the similarity of the chemical space between the training set and test set for accurate model evaluation, the remaining molecules were clustered by the Murcko scaffolds, and 50% of the molecules were extracted randomly from each cluster to construct the scaffold set and the other 50% were used as the training set. As a result, the training set contains 4669 compounds (1363 positive
and 3306 negative samples), the scaffold test set contains 3123 compounds (877 positive and 2246 negative samples), and the time test set contains 726 compounds (184 positive and 542 negative samples).

Generation of molecular descriptors and fingerprints

The selection of suitable molecular representations plays an important role in the development of acceptable and robust predictive QSAR models. To some extent, this depends upon the end point to be modeled. In this study, the bioactivity of a molecule is closely related to physicochemical properties and molecular substructural fragments [18]. To comprehensively characterize this chemical information, the following six different types of molecular descriptors and fingerprints that have been widely used in QSAR modeling were used to develop the classification models [19]: (i) RDKit descriptors (RDKitDes): 200 RDKit 2D molecular descriptors that describe global molecular properties; (ii) AtomPairs fingerprints (PairsFP): 1024 bits of atom pairs fingerprints that represent the topologies of molecules [20]; (iii) Morgan fingerprints (MorganFP): 1024-bit extended connectivity fingerprints with a bond diameter of four, which represent circular atom neighborhoods within a radius of two bonds [21]; (iv) MACCS Keys (MACCS): 166 MACCS structural fragments with frequency information to represent special substructure occurrence [22]; (v) RDKit fingerprints (RDKitFP): 1024 bits of Daylight-like fingerprints based on hashing molecular subgraphs; (vi) PubChem fingerprints (PubChemFP): 881 bits of fingerprints that represent element counts, type of ring systems, atom pairing, atom environment, etc. The RDKitDes, PairsFP, MACCS and RDKitFP were calculated by the open-source package RDKit, and the MorganFP and the PubChemFP were calculated by PyBioMed [23, 24].

ML algorithms

Four ML algorithms (i.e. SVM, RF, XGBoost and DNN) were used to develop the classification models for discriminating TB inhibitors and TB noninhibitors. Besides, two consensus methods (i.e. voting and stacking) were used to combine the predictions of the four ML models [25, 26].

The scikit-learn python package was used to build the SVM and RF models, the XGBoost python package was used to develop the XGBoost model [27, 28], and the Keras deep learning python package was used to build the DNN models, with TensorFlow as a backend [29].

Support vector machine

SVM is a discriminative classifier defined by a separating hyperplane [30]. The SVM classifier is generated by a two-step process: firstly, the sample data vector is projected into a very high-dimensional space, and then the algorithm finds the hyperplane with the largest margin in the space to separate the data [30]. SVM is quite effective in high-dimensional space with a relatively low number of samples, but it would cost a lot of time if the number of samples is enormous. In this study, the radial basis function (rbf) kernel was used and two hyperparameters were optimized, including the regularization parameter (C, from 1 to 10, interval = 0.1), and the kernel coefficient (gamma, from 0.0001 to 0.1, interval = 0.0001) [31].

Random forest

RF is a highly efficient decision tree-based ensemble algorithm [32, 33]. Compared with a single decision tree, perturbations such as random feature subset selection are introduced, which increase the bias of the forest but decrease the corresponding variance, thereby enhancing the predictive accuracy and generalization ability of RF. In the construction of RF, three main hyperparameters were optimized, including the number of trees in the forest (n_estimators, from 50 to 300, interval = 10), the maximum depth of the tree (max_depth, from 1 to 30, interval = 1), and the randomness of the bootstrapping of the samples used when building trees (random_state, from 1 to 100, interval = 1).

Extreme gradient boosting (XGBoost)

XGBoost is an efficient implementation of the gradient boosting framework and has shown extraordinary predictive power in many ML competitions [28]. Compared with traditional Gradient Boosted Regression Trees (GBRT) algorithms, XGBoost uses a clever penalization of individual trees, and the trees are consequently allowed to have varying numbers of terminal nodes. Moreover, GBRT only uses shrinkage to reduce the leaf weights, but XGBoost can also shrink them using penalization. In addition, XGBoost employs Newton boosting rather than gradient boosting. By doing this, XGBoost is likely to learn better tree structures [34]. Three hyperparameters were optimized, including the number of gradient boosted trees (n_estimators, from 50 to 300, interval = 10), the maximum depth of the tree (max_depth, from 1 to 30, interval = 1), and the balance of positive and negative weights (scale_pos_weight, from 0.1 to 10, interval = 0.1).

Deep neural networks

DNN has achieved remarkable success in drug discovery research over the past decade. Here, the classical multilayer perceptron (MLP), a class of feedforward artificial neural network (ANN), was used to build the classification model [35, 36]. An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. Here, three key hyperparameters were optimized, including the number of units of each hidden layer (128, 256 and 512), the optimization algorithm (sgd, adam and lbfgs) and the L2 regularization rate (alpha, from 0.0001 to 0.1, interval = 0.0001).

Model fusion

Model fusion may improve the predictive performance of a single model by combining the predictions from multiple models [37, 38]. There are two main approaches for model fusion: weighting and meta-learning. The weighting methods assign different weights to different base models and then combine the predictions from the base models. Majority voting is the simplest weighting method for classification, as the selected class is the one with the most votes. Here, the voting rule we used is to combine multiple models by averaging the predicted probabilities and then predict the class labels. The meta-learning methods refer to a process of learning from multiple learners. Unlike standard ML models, a meta-learning model includes more than one learning stage. Stacking is probably the most popular meta-learning technique, and it involves training a new ML model by combining the predictions of several base learners. First, the base learners are trained using the available training data, and then a meta algorithm is used to train a new model based on the predictions of multiple base learners. In this
study, the logistic regression algorithm was used in the stacked generalization.

Model evaluation

Here, the 10-fold cross-validation on the training set was used for internal validation. To ensure that the percentage of the samples for each class is approximately preserved in each training and validation fold, the ‘stratified shuffle split’ strategy was used to split the dataset for cross-validation. The random search method, where each parameter is sampled from a distribution over possible parameter values, was used to optimize hyperparameters [39]. It should be noted that the hyperparameter tuning for all models was guided by the area under the receiver operating characteristic curve (AUC) through the 10-fold cross-validation on the training set.

Several binary classification evaluation metrics were used to evaluate the performance of the classification models, including accuracy, sensitivity (also called recall), specificity, F1-measure (F1), Matthews correlation coefficient (MCC) and AUC–ROC [40–42]. These metrics are defined as follows:

$$\mathrm{accuracy} = \frac{TN + TP}{TN + TP + FP + FN} \qquad (1)$$

$$\mathrm{sensitivity} = \frac{TP}{TP + FN} \qquad (2)$$

$$\mathrm{specificity} = \frac{TN}{TN + FP} \qquad (3)$$

$$\mathrm{F1} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (4)$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (5)$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively, and precision = TP/(TP + FP).

Model interpretation

The SHAP method was used to interpret our models. SHAP is a game theoretic approach to explain the output of ML models [43, 44]. By assigning each feature an importance value for a particular prediction, SHAP defines a class of additive feature importance measures and has a unique solution in this class with a set of desirable properties. In this study, TreeExplainer, an explanation method in SHAP for trees that enables the tractable computation of optimal local explanations, was used. The explanations were calculated in two steps: (i) a local explanation was obtained by assigning a numeric measure of credit to each input feature; (ii) many local explanations were combined to represent the global structure while retaining local faithfulness to the original model.

Model AD

Due to the limited structural diversity of the molecules in the training set, it is necessary to define an AD for QSAR models according to the OECD principles [45, 46]. The AD of a QSAR model is defined as the response and chemical structure space of the molecules in the training set. In this study, the AD of the stacking model was defined by the LOF algorithm [47]. The LOF is based on a concept of a local density, where locality is given by the k nearest neighbors whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify the regions of similar density, and the points with a substantially lower density than their neighbors are considered to be outliers. According to the definition of LOF, a value of approximately 1 indicates that the molecule is comparable to its neighbors (and thus not an outlier), a value below 1 indicates a denser region (which would be an inlier), whereas values significantly higher than 1 indicate outliers. LOF is defined by the following equations:

$$\text{reachability-distance}_k(A, B) = \max\{k\text{-distance}(B),\ d(A, B)\} \qquad (6)$$

$$\mathrm{lrd}_k(A) = 1 \Big/ \left( \frac{\sum_{B \in N_k(A)} \text{reachability-distance}_k(A, B)}{|N_k(A)|} \right) \qquad (7)$$

$$\mathrm{LOF}_k(A) = \frac{\sum_{B \in N_k(A)} \frac{\mathrm{lrd}_k(B)}{\mathrm{lrd}_k(A)}}{|N_k(A)|} = \frac{\sum_{B \in N_k(A)} \mathrm{lrd}_k(B)}{|N_k(A)| \cdot \mathrm{lrd}_k(A)} \qquad (8)$$

where k-distance(A) is the distance of molecule A to its kth nearest neighbor, N_k(A) is the set of the k nearest neighbors of A, d(A, B) is the true distance between the two molecules, reachability-distance_k(A, B) is the larger of d(A, B) and the k-distance of B, and lrd_k(A) is the local reachability density of A, calculated as the inverse of the average reachability distance of the molecule A from its neighbors. The LOF is the ratio of the average local reachability density of the neighbors of point A to the local reachability density of point A, called the local relative density.

Results and discussions

Dataset analysis

To develop reliable and predictive QSAR models, the three datasets were analyzed in the following three steps [48]. Firstly, the data sizes and distributions of the training set, the scaffold set and the time set were analyzed, and the corresponding results are presented in Figure 2A. The results illustrate that the counts of the compounds for the two classes in the three datasets are unbalanced and the inhibitor class contains fewer samples. Then, to assess the reliability of the two test sets for model validation, a two-dimensional principal component analysis (PCA) was used to explore the chemical spaces of the three datasets [49]. As shown in Figure 2B, it is apparent that the chemical space distributions of the two test sets roughly fall within that of the training set, indicating that the scaffold set and the time set are suitable to evaluate the prediction performance of QSAR models. Finally, to assess the chemical diversity of our datasets, the Murcko scaffolds were extracted from the molecules [50, 51]. It should be noted that the training set and the scaffold set were merged for scaffold analysis since they share the same scaffolds. As a result, 2472 different Murcko scaffolds were extracted from the merged dataset and 288 different Murcko scaffolds were extracted from the time set. For the merged dataset, 58% of the scaffolds are unique and 86% of the scaffolds contain no more than five molecules, implying that the chemical diversity of the training set was enough to construct models for predicting structurally diverse compounds. For the time set, 61% of the scaffolds are unique and 91% of the scaffolds contain fewer than five molecules. The high chemical diversity indicates that the time set is suitable for assessing the generalization ability of the built models. To further explore the chemical structures of the datasets, the most frequent scaffolds from the merged dataset and time set were extracted and are shown in Figure 2C. For the merged dataset, each top-counted scaffold can be found in both inhibitor and noninhibitor molecules. However, the inhibitors and noninhibitors for each scaffold are unbalanced. As a result, the training and prediction for such compounds may be difficult. For the time set, it can be observed that the substructure pyridine is contained in both the merged dataset and the time set since numerous pyridine molecules are potent antitubercular agents [52].
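The LOF definition in Equations (6)–(8) of the Model AD section can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (the source does not say which LOF implementation or distance metric was used); it assumes Euclidean distance on small numeric vectors, distinct points, and the function names `knn` and `local_outlier_factors` are hypothetical.

```python
import math


def knn(points, i, k):
    """Return (k-distance, neighbor indices) of point i; ties at the k-distance are kept."""
    dists = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points)) if j != i
    )
    k_dist = dists[k - 1][0]
    neighbors = [j for d, j in dists if d <= k_dist]
    return k_dist, neighbors


def local_outlier_factors(points, k):
    """Compute LOF_k for every point (assumes distinct points and Euclidean distance)."""
    n = len(points)
    info = [knn(points, i, k) for i in range(n)]

    def reach_dist(a, b):
        # Eq. (6): max of the k-distance of B and the true distance d(A, B)
        return max(info[b][0], math.dist(points[a], points[b]))

    # Eq. (7): inverse of the mean reachability distance of A from its neighbors
    lrd = []
    for a in range(n):
        neigh = info[a][1]
        lrd.append(len(neigh) / sum(reach_dist(a, b) for b in neigh))

    # Eq. (8): mean ratio of the neighbors' lrd to the lrd of A
    return [
        sum(lrd[b] for b in info[a][1]) / (len(info[a][1]) * lrd[a])
        for a in range(n)
    ]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
    for p, score in zip(pts, local_outlier_factors(pts, k=2)):
        print(p, round(score, 3))
```

In this toy example the four corner points sit in a region of uniform density and receive an LOF of about 1, whereas the isolated point receives a value well above 1, which is the behavior the text describes for molecules outside the AD.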
Figure 1. Model construction pipeline.
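As an illustration of the first stage of this pipeline, the data curation rules described in the Dataset preparation section (conversion of MIC values to μM, the 5 μM threshold and the handling of molecules with multiple records) can be sketched as follows. This is a sketch under stated assumptions, not the authors' code: the function names are hypothetical, the unit strings are ASCII stand-ins (`"uM"`, `"nM"`, `"ug/ml"`), the molecular weight needed for the μg/ml conversion is assumed to be supplied by the caller, and a value equal to the threshold is assumed to count as active, which the source does not specify.

```python
from statistics import mean

THRESHOLD_UM = 5.0  # activity threshold stated in the text (uM)


def to_micromolar(value, unit, mol_weight=None):
    """Convert an MIC value to uM; mol_weight (g/mol) is needed only for ug/ml."""
    unit = unit.lower()
    if unit == "um":
        return value
    if unit == "nm":
        return value / 1000.0
    if unit == "ug/ml":
        if mol_weight is None:
            raise ValueError("molecular weight is required to convert ug/ml")
        # 1 ug/ml = 1 mg/L, so uM = (ug/ml) * 1000 / MW
        return value * 1000.0 / mol_weight
    raise ValueError(f"unsupported unit: {unit}")


def label_molecule(mic_values_um, threshold=THRESHOLD_UM):
    """Return (mean MIC, label) or None when the records disagree about the class.

    Label 1 = inhibitor, 0 = noninhibitor. Treating MIC == threshold as active
    is an assumption of this sketch.
    """
    if not mic_values_um:
        return None  # no bioactivity record: molecule is dropped
    active = [v <= threshold for v in mic_values_um]
    if all(active):
        return mean(mic_values_um), 1
    if not any(active):
        return mean(mic_values_um), 0
    return None  # records fall on both sides of the threshold: molecule is dropped
```

For example, `label_molecule([1.0, 3.0])` keeps the molecule as an inhibitor with an averaged MIC of 2 μM, while `label_molecule([1.0, 20.0])` returns `None` because the records are contradictory.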

Performance comparison of different ML algorithms

Firstly, 24 models were developed based on the six different types of molecular representation (RDKitDes, MACCSFP, MorganFP, PairsFP, PubChemFP and RDKitFP) using four ML algorithms (SVM, RF, XGBoost and DNN). The model performance (AUC) of the 10-fold cross-validation for the training set is summarized in Figure 3A. Generally, most models present good discrimination capability with the AUC values ranging from 0.801 to 0.832. The XGBoost model based on MorganFP and RDKitFP performs the best with AUC = 0.832. Then, to further improve model performance, RDKitDes and a fingerprint set (MACCSFP, MorganFP, PairsFP, PubChemFP or RDKitFP) were combined as the molecular features for model building since they are different types of molecular representations. Based on the five types of combined features, 20 models were constructed using four ML algorithms (SVM, RF, XGBoost and DNN). As shown in Figure 3B, compared with the models based on a single set of descriptors or fingerprints, those based on the combination of descriptors and fingerprints offer more accurate predictions (AUC = 0.812–0.843). For the five combined molecular features, the models based on RDKitDes + MorganFP perform the best, with an average AUC of 0.832. The performance of the classification models based on RDKitDes + MorganFP is summarized in Table 1. For the models based on RDKitDes + MorganFP, the XGBoost model performs the best (AUC = 0.843). Here, it can be recognized that the predictive ability of each model toward noninhibitors (specificity > 90%) is much better than that toward inhibitors (sensitivity < 60%), which was potentially caused by the imbalance of the training set. In addition, it is observed that the tree-based models have higher specificity, whereas the SVM and DNN models provide higher sensitivity. Therefore, to achieve more accurate predictions, we combined different ML classifiers by two model fusion methods (voting and stacking). The ensemble models were built based on RDKitDes + MorganFP. For each fusion method, we tried 11 different combination strategies by changing the number and type of ML algorithms. As shown in Figure 4, the stacking models are slightly better than the voting models. For the fusion models built with different algorithms, the stacking model based on the combination of SVM, RF, XGBoost and DNN performs the best, indicated by an AUC of 0.846. However, it should be noted that the improvement is not significant.

To evaluate the stability and predictive ability of the models, the scaffold test set was randomly extracted from the prepared dataset based on the molecular scaffolds 100 times and the remaining data were used as the training set. The 100 evaluation results of the five models based on RDKitDes + MorganFP are shown in Figure 5. Overall, all the models perform well, with the AUC values all higher than 0.91. Based on the average AUC values and the other evaluation metrics, it is obvious that the stacking model outperforms the other four individual models, with an average AUC = 0.935 and ACC = 0.878 for the scaffold test set. Interestingly, the performance of the DNN model varies dramatically with the variation of the scaffold test set, indicating that the DNN model is quite sensitive to the structural composition of the dataset.

Model interpretation

To gain a deeper insight into the built models, the contributions of the molecular descriptors in each model were calculated by the SHAP method [44]. Due to the good interpretation ability and relatively high predictive performance, the XGBoost model based on RDKitDes + MorganFP was analyzed. The top 10 features and the corresponding SHAP values are shown in Figure 6. More detailed descriptions of the 2D molecular descriptors in the top 10 features are listed in Table 2. As shown in Figure 6A, the descriptors that characterize the number of pyridine rings (fr_pyridine) and the number of halogens (fr_halogen) have high contributions to the XGBoost model, suggesting that a higher number of pyridine rings and halogens will increase the predicted probability of a molecule being an Mtb inhibitor. For the inhibitors in the training set, nearly 41% (554 of 1363) and 59% (798 of 1363) of the molecules contain pyridine rings and halogens, respectively. These interpretations explained by SHAP are consistent with experimental observations [53–55]. The pyridine fused systems were reported to show promising anti-TB activity against replicating Mtb H37Rv and the substitution of a halogen group, especially fluorine, on the phenyl ring in
Figure 2. Dataset analysis. (A) The bioactivity class distribution of the training set, scaffold test set and time set. (B) Principal component analysis (PCA) plot of the training
set, scaffold test set and time set based on the Morgan fingerprints. The horizontal and vertical coordinates represent the two main directions of maximum variance in the
datasets. (C) Six most common molecular scaffolds in the datasets and the corresponding bioactivity distribution.
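The scaffold statistics quoted in the Dataset analysis section (the share of unique scaffolds and the share of scaffolds containing no more than five molecules) can be reproduced with a few lines of Python once a scaffold identifier is available for every molecule. The sketch below assumes the Murcko scaffolds have already been computed elsewhere and are passed in as strings, and it reads "unique" as "occurring in exactly one molecule", which is an interpretation rather than something the source states. The function name `scaffold_diversity` is hypothetical.

```python
from collections import Counter


def scaffold_diversity(scaffolds, max_size=5):
    """Summarize scaffold diversity from one scaffold identifier per molecule."""
    counts = Counter(scaffolds)
    n_scaffolds = len(counts)
    # Scaffolds represented by exactly one molecule (assumed meaning of "unique")
    singletons = sum(1 for c in counts.values() if c == 1)
    # Scaffolds represented by at most max_size molecules
    small = sum(1 for c in counts.values() if c <= max_size)
    return {
        "n_scaffolds": n_scaffolds,
        "unique_fraction": singletons / n_scaffolds,
        "small_fraction": small / n_scaffolds,
        "most_common": counts.most_common(6),  # cf. the six scaffolds in Figure 2C
    }
```

A high `unique_fraction` together with a high `small_fraction` corresponds to the structurally diverse situation reported for both the merged dataset and the time set.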

Table 1. Performance of the classification models based on different ML algorithms

Model     Evaluation set      ACC    Sensitivity  Specificity  F1     MCC    AUC
SVM       10-fold validation  0.814  0.581        0.909        0.759  0.527  0.828
SVM       Scaffold set        0.873  0.710        0.937        0.870  0.691  0.920
RF        10-fold validation  0.808  0.511        0.931        0.740  0.505  0.840
RF        Scaffold set        0.864  0.626        0.956        0.856  0.645  0.922
XGBoost   10-fold validation  0.807  0.474        0.943        0.730  0.495  0.843
XGBoost   Scaffold set        0.863  0.624        0.956        0.856  0.643  0.921
DNN       10-fold validation  0.804  0.531        0.917        0.740  0.499  0.817
DNN       Scaffold set        0.862  0.659        0.941        0.856  0.643  0.917
Voting    10-fold validation  0.813  0.521        0.931        0.618  0.515  0.845
Voting    Scaffold set        0.877  0.660        0.961        0.871  0.682  0.942
Stacking  10-fold validation  0.813  0.540        0.924        0.750  0.518  0.846
Stacking  Scaffold set        0.878  0.685        0.954        0.872  0.684  0.942
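The threshold-based metrics reported in Table 1 follow from the confusion-matrix counts through Equations (1)–(5). A minimal sketch is given below; the function name `classification_metrics` is hypothetical, the F1 score is computed for the positive (inhibitor) class, and the MCC uses the standard denominator (TP + FP)(TP + FN)(TN + FP)(TN + FN). AUC is not included because it requires the predicted probabilities rather than the counts.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity, F1 and MCC from confusion counts."""
    accuracy = (tn + tp) / (tn + tp + fp + fn)
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Toy counts chosen for illustration only, not taken from the paper
    for name, value in classification_metrics(tp=40, tn=45, fp=5, fn=10).items():
        print(f"{name}: {value:.3f}")
```

The sketch does not guard against empty classes (for example, no predicted positives), in which case some denominators become zero.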
Figure 3. Model performance (ROC AUC) for the 10-fold cross-validated training set. (A) Based on six individual molecular representations. (B) Based on five feature combinations of molecular descriptors and fingerprints.

Figure 4. Performance (ROC AUC) of different types of fusion models for the 10-fold cross-validated training set. All the fusion models were built on the combination of Des and Morgan.
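
To make the idea of a voting-type fusion model concrete, the following is a minimal sketch of how the outputs of several base classifiers can be combined. It is illustrative only: this section does not state which voting variant was used, so both the soft (mean probability) and hard (majority) forms are shown, the function names are hypothetical, and the probabilities are made-up values rather than model outputs from the study. Stacking, which trains a second-level model on the base outputs, is not shown here.

```python
def soft_vote(probabilities, threshold=0.5):
    """Average the base-model probabilities, then threshold the mean."""
    mean_probability = sum(probabilities) / len(probabilities)
    return mean_probability, int(mean_probability >= threshold)


def hard_vote(probabilities, threshold=0.5):
    """Threshold each base model first, then require a strict majority."""
    votes = sum(p >= threshold for p in probabilities)
    return int(votes > len(probabilities) / 2)


# Hypothetical inhibitor probabilities from four base models
# (e.g. SVM, RF, XGBoost and DNN); these are not values from the paper.
base_probabilities = [0.62, 0.48, 0.55, 0.41]

mean_probability, soft_label = soft_vote(base_probabilities)
hard_label = hard_vote(base_probabilities)
print(round(mean_probability, 3), soft_label, hard_label)
```

With these toy inputs the two variants disagree: the mean probability is just above the threshold, while only two of the four base models vote for the inhibitor class, which is not a strict majority.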

thiacetazone can enhance the anti-TB activity of this compound [53–55]. Indeed, a relatively high proportion (538 of 1363) of the inhibitors in the training set contain fluorine atoms. However, it should be noted that 614 noninhibitors in the training set also contain fluorine atoms and 144 have the substructure of bitvector 43. This suggests that the impact of a given substructure is not absolute and that global molecular properties still need to be taken into consideration when predicting Mtb inhibitors. For 2D molecular descriptors, it can be seen that EState-VSA, which is intended to encode the topology and electronic environment of molecular fragments, plays an important role in the predictions of the XGBoost models. Specifically, a higher value of the EState-VSA descriptor increases the probability that a molecule is predicted as an inhibitor, indicating the importance of energy metabolism in Mtb. To our knowledge, Mtb contains a diverse range of multidrug transporters, many of which are dependent on the proton motive force (PMF) or the availability of ATP [56]. This result further illustrates that energy metabolism and ATP

Figure 5. Performance of the classification models on 100 test sets. Each test set was randomly extracted from the dataset according to the molecular scaffold and the
remaining data was used as the training set. Sampling of the same data is allowed in the dataset extraction.

production through the PMF that is established by the electron transport chain are critical in determining the drug susceptibility of Mtb.

Besides molecular descriptors, molecular fingerprints also play an important role in the predictions of the XGBoost model. To illustrate the relationship between molecular fingerprints and model output, two representative compounds, ciprofloxacin and analogs of PA-824, were carefully analyzed. The contribution of each feature was calculated and the top-ranked features with their specific values are shown in Figure 6B. Ciprofloxacin is a synthetic broad-spectrum fluoroquinolone antibiotic, which binds to and inhibits bacterial DNA gyrase, an enzyme essential for DNA replication [57]. Two substructures, the cyclopropyl group (Morgan 338) and the fluorine atom (Morgan 904) shown in Figure 6D, make important contributions to the prediction for ciprofloxacin and are considered to be potent modifications on the quinolone

Figure 6. Importance of the representative molecular descriptors (top 20) based on the XGBoost model. (A) The SHAP values for each molecular descriptor. (B) Ten most important molecular descriptors toward ciprofloxacin. (C) Ten most important molecular descriptors toward the analogs of PA-824. (D) Important molecular substructures of ciprofloxacin. (E) Important molecular substructures of the analogs of PA-824.

Table 2. Important descriptors identified by SHAP for the XGBoost model based on Des + Morgan

Molecular descriptor     Description
fr_pyridine              Number of pyridine rings
MinAbsPartialCharge      Minimal absolute partial charge
fr_halogen               Number of halogens
FpDensityMorgan          Morgan fingerprint
Estate_VSA               MOE-type descriptors using EState indices and surface area contributions
PEOE_VSA                 Sum of vi where qi is less than −0.30
SMR_VSA                  Sum of vi such that Ri is (0.40 ≤ x < 0.50). Ri denotes the contribution to molar
                         refractivity for atom i as calculated in the SMR descriptor
Chi3v                    Molecular connectivity indexes, characterize the structural attributes of molecule
MolLogP                  Solvation energy. In the Potential Setup panel, the term Wildman-Crippen LogP value
Kappa3                   Hall-Kier Kappa3 value, intended to capture the overall aspects of molecular shape
qed                      Weighted sum of ADS mapped properties that stand for quantitative estimation of
                         drug-likeness
Min/MaxAbs EstateIndex   Minimum/maximum absolute EState index. An electrotopological-state index for
                         atoms in a molecule

Table 3. Performance of the ML models developed based on Des + Morgan for the time set

Model     ACC    Sensitivity  Specificity  F1     MCC    AUC
SVM       0.775  0.386        0.908        0.758  0.343  0.741
RF        0.764  0.266        0.934        0.731  0.270  0.747
XGBoost   0.771  0.283        0.937        0.739  0.296  0.749
DNN       0.738  0.413        0.849        0.731  0.276  0.717
Voting    0.795  0.375        0.938        0.773  0.389  0.750
Stacking  0.795  0.380        0.935        0.773  0.388  0.750
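
The averages quoted for the time set in the text (ACC of 0.773 and AUC of 0.742) can be checked directly against Table 3. The short sketch below simply transcribes the ACC and AUC columns and takes their unweighted mean over the six models; the helper name `column_mean` is introduced here for illustration only.

```python
# Time-set results transcribed from Table 3: model -> (ACC, AUC).
table3 = {
    "SVM": (0.775, 0.741),
    "RF": (0.764, 0.747),
    "XGBoost": (0.771, 0.749),
    "DNN": (0.738, 0.717),
    "Voting": (0.795, 0.750),
    "Stacking": (0.795, 0.750),
}


def column_mean(rows, index):
    """Unweighted mean of one metric column over all models."""
    values = [row[index] for row in rows.values()]
    return sum(values) / len(values)


mean_acc = column_mean(table3, 0)
mean_auc = column_mean(table3, 1)
print(f"mean ACC = {mean_acc:.3f}, mean AUC = {mean_auc:.3f}")
```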

molecules in the structure–activity relationships (SAR) study of quinolone molecules [58]. As another example, the analogs of PA-824 were reported to inhibit the synthesis of protein and cell wall lipid via nitroimidazopyrans, and the important nitroimidazopyran scaffold was identified by the XGBoost model and presented as Morgan 43 and Morgan 880 [59, 60].

In conclusion, the feature analysis provides a useful reference for lead compound selection and shows that the built XGBoost models are logically reliable to a certain extent. Furthermore, the explanation of molecular fingerprint bits given by the SHAP values reveals some significant substructures for further lead compound optimization.

Model performance on the external test sets and AD

To further evaluate the generalization ability of our methods, we retrained the six models (SVM, RF, XGBoost, DNN, voting and stacking) using the dataset consisting of the previous training set and the scaffold set based on RDKitDes + Morgan. The predictive performance of the models was assessed by the external test set (called the time set) that contains 726 compounds reported in 2018. The predictive performance of the four individual models and two ensemble models is summarized in Table 3.

As shown in Table 3, the performance of the models on the time set (an average ACC value of 0.773 and an average AUC value of 0.742) is worse than that on the scaffold set. Comparison of the six models illustrates that the ensemble models (voting and stacking) are more stable and reliable than the four individual models (SVM, RF, XGBoost and DNN) according to multiple evaluation metrics. In detail, the overall predictive performances of the four individual models are quite similar, but those toward the positive and negative samples are still quite different. For example, the RF and XGBoost models can correctly detect more true negatives with higher specificity values of 0.934 and 0.937, respectively, whereas the SVM model can correctly detect more true positives with a sensitivity value of 0.386. The DNN model detects the most true positives and gives the highest sensitivity value of 0.413. However, due to its decreased discrimination capability for noninhibitors, the performance of the DNN model on the time set was not satisfactory. The stacking model combining the SVM, RF, XGBoost and DNN models gives the best classification capability for the time set (ACC = 0.795, sensitivity = 0.380, specificity = 0.935, F1 = 0.773 and AUC = 0.750). However, it can be observed that the predictive performances of the stacking model for the inhibitors and noninhibitors in the time set are quite different. Compared with the predictions on the scaffold set, the sensitivity on the time set is about 0.30 lower (0.380 versus 0.685), whereas the specificity is only about 0.02 lower (0.935 versus 0.954). Besides, we observed that 73 molecules in the time set share molecular scaffolds with some molecules in the new-training set whereas 653 molecules in the time set contain new scaffolds, implying that around 90% of the molecules in the time set contain new molecular scaffolds. Therefore, it is quite possible that the molecules with the new scaffolds in the external test set cannot be reliably predicted by the model developed based on the training set. As shown in Table 4, the sensitivity for the molecules with the new scaffolds in the time set is clearly lower than that for the molecules with the existing scaffolds in the time set, indicating that the generalization ability of the model becomes much worse for the molecules with unfamiliar scaffolds.

Then, we analyzed the relationship between molecule bioactivity (represented as MIC), the similarity between each molecule in the test set and the most similar molecule in the new-training set, and the predicted probability given by the stacking model [61, 62]. Herein, the similarity was evaluated by the Tanimoto coefficient based on the Morgan fingerprints. As shown in Figure 7A, for the molecules with new scaffolds, the similarities are mainly distributed in the range of 0.3–0.5 and the corresponding predicted probabilities are mainly lower than 0.3, whereas for the molecules with existing scaffolds, the similarities are mainly distributed in the range of 0.7–0.9 and the corresponding predicted probabilities are mainly higher than 0.5. Besides, as shown in Figure 7B, with the increase of the similarity, the sensitivity of the stacking model increases and the specificity decreases. These results suggest that the molecules that are similar to the training set are more likely to be predicted as inhibitors. As shown in Figure 7C, the compounds with bioactivities close to the threshold (5 μM) are more difficult to predict accurately, which may be attributed to imprecise bioactivity values: the experimental MIC values measured in different laboratories are not directly comparable, and as a result mislabeled compounds may introduce confusion in model construction. Hence, it is quite necessary to define an appropriate applicability domain for the stacking model.

We applied the local outlier factor (LOF) to estimate the outliers in the external test set. After obtaining the LOF value of each molecule, we calculated the accuracy based on the predictions of molecules with different LOF thresholds (Figure 8A). Since the molecules with LOF close to 1 are few, the curve shows dramatic swings in the range of 0.9–1.5. Then, with the increase of LOF, the accuracy gradually decreases. Therefore, we decided to set the LOF threshold to 1.5, and the results show that 482 molecules are outliers under this condition. After removing these outliers, the accuracy of the stacking model reaches 0.861, an improvement of 0.066 over the accuracy for the original test set (0.795). Therefore, the LOF method is a reliable way to check how accurately the query molecules are predicted by the stacking model. Then, four inhibitors (Figure 8B) and four noninhibitors (Figure 8C) outside the model AD were selected to explore why these molecules cannot be predicted successfully. The misclassification of the compounds CHEMBL4171231 and

Table 4. Impact of molecular scaffolds on the predictive performance of the stacking model

Group               Positives  Negatives  Sensitivity  Specificity
Existing scaffolds  34         39         0.853        0.590
New scaffolds       150        503        0.273        0.962
Sum                 184        542        0.380        0.935
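
The "Sum" row of Table 4 is the pooled result of the two scaffold groups, which a short calculation can confirm. The sketch below recovers the number of correctly predicted molecules in each group by rounding rate × group size; that rounding step is an assumption made here, since the table reports only rates and group sizes, not the underlying counts.

```python
# Rows of Table 4: group -> (positives, negatives, sensitivity, specificity).
groups = {
    "existing scaffolds": (34, 39, 0.853, 0.590),
    "new scaffolds": (150, 503, 0.273, 0.962),
}

true_pos = true_neg = n_pos = n_neg = 0
for positives, negatives, sensitivity, specificity in groups.values():
    # Assumption: correct-prediction counts are recovered by rounding.
    true_pos += round(positives * sensitivity)
    true_neg += round(negatives * specificity)
    n_pos += positives
    n_neg += negatives

pooled_sensitivity = true_pos / n_pos
pooled_specificity = true_neg / n_neg
pooled_accuracy = (true_pos + true_neg) / (n_pos + n_neg)

print(f"sensitivity = {pooled_sensitivity:.3f}")
print(f"specificity = {pooled_specificity:.3f}")
print(f"accuracy    = {pooled_accuracy:.3f}")
```

Under that assumption the pooled values agree with the Sum row (0.380 and 0.935) and with the stacking accuracy of 0.795 reported for the time set in Table 3.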

Figure 7. Impact of similarity on model performance. (A) Relationship among molecule bioactivity (represented as MIC values), similarity between each molecule in the test set and the most similar molecule in the new-training set, and the predicted probability given by the stacking model. (B) The model performance indicated by sensitivity and specificity with different similarity thresholds (from 0 to 0.9, interval = 0.1). (C) The difference in the bioactivity distribution of compounds in the test set between true predictions and false predictions.
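
The similarity used in Figure 7 is the Tanimoto coefficient between a test molecule and its most similar training molecule. A minimal sketch of that calculation is given below, assuming each fingerprint is represented as the set of its on-bit indices. The bit sets are toy values, not real Morgan fingerprints, and the function names are hypothetical; in practice the bits would be produced by a cheminformatics toolkit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_similarity(query_bits, training_fingerprints):
    """Similarity between the query and its most similar training molecule."""
    return max(tanimoto(query_bits, fp) for fp in training_fingerprints)


# Toy on-bit sets standing in for fingerprints (hypothetical values).
training_fingerprints = [
    {1, 5, 9, 12},
    {2, 5, 9, 30, 41},
    {7, 8},
]
query = {1, 5, 9, 13}

similarity = nearest_similarity(query, training_fingerprints)
print(f"nearest-neighbour Tanimoto similarity = {similarity:.2f}")
```

Here the query shares three of five distinct bits with the first training molecule, so the nearest-neighbour similarity is 0.6.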

CHEMBL4173210 may be caused by the molecular scaffolds of hybridized 1H-benzo[d]imidazoles and 3,4-dihydroquinazolin-4-ones since these substructures do not exist in the training set [63]. The compounds CHEMBL4289962 and CHEMBL4285907 may be misclassified due to the substructure of (3Z)-3-(hydroxyimino)-2,3-dihydro-1H-indol-2-one [64]. As for the false positives, the misclassification of the four noninhibitors may be caused by their special spatial configuration, which cannot be well characterized by molecular fingerprints.

Webserver for the identification of Mtb inhibitors

To share our models with other chemists and pharmacologists, we developed a webserver called ChemTB (http://cadd.zju.edu.cn/chemtb/) to detect potential Mtb inhibitors. ChemTB was developed in Python with the Django framework. The webserver mainly includes two functions: similarity search and bioactivity prediction (Figure 9). For the similarity search, the similarity can be calculated via two metrics (Tanimoto and Dice) based on three types of molecular fingerprints (MACCS, Morgan and AtomPair

Figure 8. Analysis of molecules misclassified by the stacking model. (A) Relationship between the LOF values of molecules in the time set and the accuracy calculated for the molecules within the corresponding LOF values. (B) Inhibitors in the external test set misclassified by the stacking model. (C) Noninhibitors in the time set misclassified by the stacking model.
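
The curve in Figure 8A is obtained by keeping only the molecules whose LOF value does not exceed a threshold and computing the accuracy on that subset. The sketch below illustrates this filtering step under stated assumptions: the LOF values are taken as already computed (for example with a local outlier factor implementation such as scikit-learn's `LocalOutlierFactor`, which is an assumption rather than something stated here), and the records are made-up values, not data from the time set.

```python
def accuracy_within_lof(records, threshold):
    """Accuracy over molecules whose LOF value is at most `threshold`.

    `records` holds (lof_value, true_label, predicted_label) triples.
    Returns None when no molecule falls inside the threshold.
    """
    kept = [(truth, pred) for lof, truth, pred in records if lof <= threshold]
    if not kept:
        return None
    return sum(truth == pred for truth, pred in kept) / len(kept)


# Hypothetical (LOF value, true label, predicted label) triples.
records = [
    (1.02, 1, 1), (1.10, 0, 0), (1.25, 0, 0), (1.40, 1, 1),
    (1.60, 1, 0), (1.85, 0, 0), (2.30, 1, 0), (3.10, 0, 1),
]

for threshold in (1.5, 2.0, 3.5):
    accuracy = accuracy_within_lof(records, threshold)
    print(f"LOF <= {threshold}: accuracy = {accuracy:.3f}")
```

In this toy example the accuracy falls as the threshold is relaxed, which mirrors the qualitative trend described for Figure 8A.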

fingerprints). As a result, the three most similar compounds in our datasets and the related information are presented. For the bioactivity prediction, the stacking model with excellent performance was employed in the webserver. To broaden the AD of our model, the model implemented by ChemTB was retrained based on the Morgan fingerprints using all the available datasets used in our study (including 2424 inhibitors and 6094 noninhibitors). According to the 10-fold cross-validation (AUC = 0.90 and ACC = 0.85), the retrained model shows excellent predictive performance. ChemTB accepts a query molecule for bioactivity prediction by drawing its structure in the JSME molecular editor or by inputting its SMILES. The open-source cheminformatics tool, RDKit, was used to process molecules and calculate molecular descriptors. Then, the LOF was used to test whether a query molecule is located in the AD. Finally, the webserver returns the AD detection results and the bioactivity prediction. Besides the functions introduced above, the datasets used in this study and the pretrained models are also available on this website.

Conclusion

In our study, we collected a bioactivity dataset toward Mtb H37Rv from the ChEMBL database and built a set of models using four ML algorithms (i.e. SVM, RF, XGBoost and DNN) based on six types of molecular descriptors and fingerprints. The stacking model combining the predictions from the SVM, RF, XGBoost and DNN models based on the 2D descriptors and Morgan fingerprints performs the best, with AUC of 0.846 for the 10-fold cross-validation, AUC of 0.942 for the scaffold set and AUC of 0.750

Figure 9. Website schematic diagram containing similarity search and bioactivity prediction.

and accuracy of 0.795 for the external test set (the time set). Given the applicability domain and the required prediction accuracy, a prediction is credible only when the query molecule satisfies the following requirements: (i) a similarity value higher than 0.7 to the molecules in the training set; (ii) an LOF value no higher than 1.5; (iii) no stereoisomerism. The important molecular features for classification were analyzed by the SHAP method. The essential molecular descriptors and fingerprints highlighted by SHAP are generally consistent with expert experience, suggesting the reliability of the built models. Finally, a freely accessible webserver, ChemTB, was developed for online prediction with the well-trained models.

Key Points
• Four machine learning algorithms were used to develop the classification models to identify potential Mtb inhibitors.
• The evaluation results illustrate that the XGBoost model exhibits the best prediction performance.
• Two consensus strategies were employed to further improve the predictive ability and robustness of the classification models.
• We developed a freely available computational webserver to predict potential Mtb inhibitors.

Funding
Key R&D Program of Zhejiang Province (2020C03010), National Natural Science Foundation of China (21575128, 81773632) and Natural Science Foundation of Zhejiang Province (LZ19H300001).

References
1. Orme I, Secrist J, Anathan S, et al. Search for new drugs for treatment of tuberculosis. Antimicrob Agents Chemother 2001;45:1943–6.
2. Reid MJA, Arinaminpathy N, Bloom A, et al. Building a tuberculosis-free world: the Lancet Commission on tuberculosis. Lancet 2019;393:1331–84.
3. World Health Organization (2019). Global Tuberculosis Report 2019. Geneva: World Health Organization, 2019.
4. Abubakar I, Zignol M, Falzon D, et al. Tuberculosis 2013:5 drug-resistant tuberculosis: time for visionary political leadership. Lancet Infect Dis 2013;13:529–39.
5. Zumla AI, Gillespie SH, Hoelscher M, et al. New antituberculosis drugs, regimens, and adjunct therapies: needs, advances, and future prospects. Lancet Infect Dis 2014;14:327–40.
6. Halsey NA, Coberly JS, Desormeaux J, et al. Randomised trial of isoniazid versus rifampicin and pyrazinamide for prevention of tuberculosis in HIV-1 infection. Lancet 1998;351:786–92.
7. Goble M, Iseman MD, Madsen LA, et al. Treatment of 171 patients with pulmonary tuberculosis resistant to isoniazid and Rifampin. N Engl J Med 1993;328:527–32.
8. Zhang Y, Wade MM, Scorpio A, et al. Mode of action of pyrazinamide: disruption of Mycobacterium tuberculosis membrane transport and energetics by pyrazinoic acid. J Antimicrob Chemother 2003;52:790–5.
9. Telenti A, Philipp WJ, Sreevatsan S, et al. The emb operon, a gene cluster of Mycobacterium tuberculosis involved in resistance to ethambutol. Nat Med 1997;3:567–70.
10. Opromolla DV, Lima Lde S, Caprara G. Rifamycin SV in the treatment of lepromatous leprosy. Lepr Rev 1965;36:123–31.
11. Lewis RA, Wood D. Modern 2D QSAR for drug discovery. Wiley Interdisciplinary Reviews-Computational Molecular Science 2014;4:505–22.
12. Prathipati P, Ma NL, Keller TH. Global Bayesian models for the prioritization of antitubercular agents. J Chem Inf Model 2008;48:2362–70.
13. Ekins S, Pottorf R, Reynolds RC, et al. Looking back to the future: predicting in vivo efficacy of small molecules versus Mycobacterium tuberculosis. J Chem Inf Model 2014;54:1070–82.
14. Lane T, Russo DP, Zorn KM, et al. Comparing and validating machine learning models for Mycobacterium tuberculosis drug discovery. Mol Pharm 2018;15:4346–60.
15. Lei TL, Sun HY, Kang Y, et al. ADMET evaluation in drug discovery. 18. Reliable prediction of chemical-induced urinary tract toxicity by boosting machine learning-approaches. Mol Pharm 2017;14:3935–53.
16. Sheridan RP, Wang WM, Liaw A, et al. Extreme gradient boosting as a method for quantitative structure-activity relationships. J Chem Inf Model 2016;56:2353–60.
17. Mendez D, Gaulton A, Bento AP, et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res 2019;47:D930–40.
18. Klekota J, Roth FP. Chemical substructures that enrich for biological activity. Bioinformatics 2008;24:2518–25.
19. Duan J, Dixon SL, Lowrie JF, et al. Analysis and comparison of 2D fingerprints: insights into database screening performance using eight fingerprint methods. J Mol Graph Model 2010;29:157–70.
20. Carhart RE, Smith DH, Venkataraghavan R. Atom pairs as molecular-features in structure activity studies - definition and applications. J Chem Inf Comput Sci 1985;25:64–73.
21. Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model 2010;50:742–54.
22. Weininger D, Weininger A, Weininger JL. SMILES. 2. Algorithm for generation of unique SMILES notation. J Chem Inf Comput Sci 1989;29:97–101.
23. Landrum G. RDKit: Open-Source Cheminformatics Software. http://www.rdkit.org/, https://github.com/rdkit/rdkit, 2016, Accessed 17 Nov 2019.
24. Dong J, Yao ZJ, Zhang L, et al. PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions. J Chem 2018;10:1–11.
25. Wolpert DH. Stacked generalization. Neural Netw 1992;5:241–59.
26. Dietterich TG. Ensemble methods in machine learning. Multiple Classifier Systems 2000;1857:1–15.
27. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in python. J Mach Learn Res 2011;12:2825–30.
28. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. arXiv e-prints 2016; https://ui.adsabs.harvard.edu/#abs/2016arXiv160302754C (1 March 2020, date last accessed).
29. Abadi M. TensorFlow: learning functions at scale. Acm Sigplan Notices 2016;51:1–1.
30. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. Acm Trans Intell Syst Technol 2011;2:1–27.
31. Camps-Valls G, Bruzzone L. Kernel-based methods for hyperspectral image classification. IEEE Trans Geosci Remote Sens 2005;43:1351–62.
32. Breiman L. Random forests. Mach Learn 2001;45:5–32.
33. Svetnik V, Liaw A, Tong C, et al. Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci 2003;43:1947–58.
34. Mitchell R, Frank E. Accelerating the XGBoost algorithm using GPU computing. Peerj Comput Sci 2017;3:e127.
35. Attali JG, Pages G. Approximations of functions by a multilayer perceptron: a new approach. Neural Netw 1997;10:1069–81.
36. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–44.
37. Sagi O, Rokach L. Ensemble learning: a survey. WIREs Data Min Knowl Discovery 2018;8:e1249.
38. Kittler J. Multiple classifier systems. Soft Computing Approach to Pattern Recognition and Image Processing 2002;3–22.
39. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res 2012;13:281–305.
40. Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett 2006;27:861–74.
41. Powers D. Evaluation: from precision, recall and F-factor to ROC, Informedness, Markedness & Correlation. Mach Learn Technol 2008;2:37–63.
42. Boughorbel S, Jarray F, El-Anbari M. Optimal classifier for imbalanced data using Matthews correlation coefficient metric. Plos One 2017;12:e0177678.
43. Lundberg SM, Erion G, Chen H, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2020;2:56–67.
44. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Adv Neural Inform Process Syst 2017;30:30.
45. Jaworska JS, Comber M, Auer C, et al. Summary of a workshop on regulatory acceptance of (Q)SARs for human health and environmental endpoints. Environ Health Perspect 2003;111:1358–60.
46. Gramatica P. Principles of QSAR models validation: internal and external. QSAR Combinatorial Science 2007;26:694–701.
47. Breunig MM, Kriegel HP, Ng RT, et al. LOF: identifying density-based local outliers. Sigmod Record 2000;29:93–104.
48. Tropsha A. Best practices for QSAR model development, validation, and exploitation. Mol Inf 2010;29:476–88.
49. Wold S, Esbensen K, Geladi P. Principal component analysis. Chemometrics and Intelligent Laboratory Systems 1987;2:37–52.
50. Bemis GW, Murcko MA. The properties of known drugs. 1. Molecular frameworks. J Med Chem 1996;39:2887–93.
51. Shelat AA, Guy RK. Scaffold composition and biological relevance of screening libraries. Nat Chem Biol 2007;3:442–6.
52. Chaudhari KS, Patel HM, Surana SJ. Pyridines: multidrug-resistant tuberculosis (MDR-TB) inhibitors. Indian J Tuberc 2017;64:119–28.
53. Abrahams KA, Cox JAG, Spivey VL, et al. Identification of novel Imidazo[1,2-a]pyridine inhibitors targeting M. tuberculosis QcrB. Plos One 2012;7:e52951.
54. Esfahanizadeh M, Omidi K, Kauffman J, et al. Synthesis and evaluation of new fluorinated anti-tubercular compounds. Iran J Pharm Res 2014;13:115–26.
55. Dulla B, Wan BJ, Franzblau SG, et al. Construction and functionalization of fused pyridine ring leading to novel compounds as potential antitubercular agents. Bioorg Med Chem Lett 2012;22:4629–35.
56. Black PA, Warren RM, Louw GE, et al. Energy metabolism and drug efflux in Mycobacterium tuberculosis. Antimicrob Agents Chemother 2014;58:2491–503.
57. Campoli-Richards DM, Monk JP, Price A, et al. Ciprofloxacin. Drugs 1988;35:373–447.
58. Peterson LR. Quinolone molecular structure-activity relationships: what we have learned about improving antimicrobial activity. Clin Infect Dis 2001;33:S180–6.
59. Stover CK, Warrener P, VanDevanter DR, et al. A small-molecule nitroimidazopyran drug candidate for the treatment of tuberculosis. Nature 2000;405:962–6.
60. Thompson AM, Sutherland HS, Palmer BD, et al. Synthesis and structure–activity relationships of varied ether linker analogues of the antitubercular drug (6S)-2-Nitro-6-{[4-(trifluoromethoxy)benzyl]oxy}-6,7-dihydro-5H-imidazo[2,1-b][1,3]oxazine (PA-824). J Med Chem 2011;54:6563–85.
61. Baldi P, Nasr R. When is chemical similarity significant? The statistical distribution of chemical similarity scores and its extreme values. J Chem Inf Model 2010;50:1205–22.
62. Sheridan RP, Feuston BP, Maiorov VN, et al. Similarity to molecules in the training set is a good discriminator for prediction accuracy in QSAR. J Chem Inf Comput Sci 2004;44:1912–28.
63. Macchi FS, Pissinate K, Villela AD, et al. 1H-benzo[d]imidazoles and 3,4-dihydroquinazolin-4-ones: design, synthesis and antitubercular activity. Eur J Med Chem 2018;155:153–64.
64. Gao F, Yang H, Lu TY, et al. Design, synthesis and anti-mycobacterial activity evaluation of benzofuran-isatin hybrids. Eur J Med Chem 2018;159:277–81.
