Identification of Active Molecules Against Mycobacterium Tuberculosis Through Machine Learning
https://doi.org/10.1093/bib/bbab068
Problem Solving Protocol
Abstract
Tuberculosis (TB) is an infectious disease caused by Mycobacterium tuberculosis (Mtb) and it has been one of the top 10 causes
of death globally. Drug-resistant tuberculosis (XDR-TB), extensively resistant to the commonly used first-line drugs, has
emerged as a major challenge to TB treatment. Hence, it is quite necessary to discover novel drug candidates for TB
treatment. In this study, based on different types of molecular representations, four machine learning (ML) algorithms,
including support vector machine, random forest (RF), extreme gradient boosting (XGBoost) and deep neural networks
(DNN), were used to develop classification models to distinguish Mtb inhibitors from noninhibitors. The results demonstrate
that the XGBoost model exhibits the best prediction performance. Then, two consensus strategies were employed to
integrate the predictions from multiple models. The evaluation results illustrate that the consensus model by stacking the
RF, XGBoost and DNN predictions offers the best predictions with area under the receiver operating characteristic curve of
0.842 and 0.942 for the 10-fold cross-validated training set and external test set, respectively. Besides, the association
between the important descriptors and the bioactivities of molecules was interpreted by using the Shapley additive
explanations method. Finally, an online webserver called ChemTB (http://cadd.zju.edu.cn/chemtb/) was developed, and it
offers a freely available computational tool to detect potential Mtb inhibitors.
Qing Ye is now a postgraduate in the College of Pharmaceutical Sciences at Zhejiang University under the supervision of Prof. Tingjun Hou. His research
interests mainly lie in the area of computer-aided drug design.
Xin Chai is currently a PhD student in the College of Pharmaceutical Sciences at Zhejiang University under the supervision of Prof. Tingjun Hou. His
research interests mainly lie in the area of computer-aided drug design.
Dejun Jiang is currently a PhD student in the College of Pharmaceutical Sciences at Zhejiang University under the supervision of Prof. Tingjun Hou. His
research interests mainly lie in the area of computer-aided drug design.
Liu Yang is now a postgraduate in the College of Pharmaceutical Sciences at Zhejiang University under the supervision of Prof. Tingjun Hou. His research
interests mainly lie in the area of computer-aided drug design.
Chao Shen is currently a PhD student in the College of Pharmaceutical Sciences at Zhejiang University under the supervision of Prof. Tingjun Hou. His
research interests mainly lie in the area of computer-aided drug design.
Xujun Zhang is now a postgraduate in the College of Pharmaceutical Sciences at Zhejiang University under the supervision of Prof. Tingjun Hou. His
research interests mainly lie in the area of computer-aided drug design.
Dan Li is now an associate professor in the College of Pharmaceutical Sciences, Zhejiang University, China. Her research interests include design and
discovery of novel drug candidates through computer-aided drug design and biological evaluation.
Dongsheng Cao is currently a professor in the Xiangya School of Pharmaceutical Sciences at Central South University. More information can be found at
the website of his group: http://www.scbdd.com.
Tingjun Hou is currently a professor in the College of Pharmaceutical Sciences at Zhejiang University. More information can be found at the website of
his group: http://cadd.zju.edu.cn.
Submitted: 14 December 2020; Received (in revised form): 23 January 2021
© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
2 Ye et al.
Key words: Mycobacterium tuberculosis; machine learning; extreme gradient boosting; deep learning; model fusion
Introduction
Tuberculosis (TB) is a contagious disease in humans mainly caused by Mycobacterium tuberculosis (Mtb), and it is one of the
top 10 causes of death [1]. Globally, an estimated 1.7 billion people are infected with Mtb and therefore at risk for
developing active

have been used for the prediction of Mtb inhibition activities, no single algorithm clearly outperformed the others.
Moreover, the new ensemble learning algorithm, extreme gradient boosting (XGBoost), which has been successfully employed
in ADMET predictions and QSPR modeling, has never been used in Mtb
and 3306 negative samples), the scaffold test set contains 3123 compounds (877 positive and 2246 negative samples), and the
time test set contains 726 compounds (184 positive and 542 negative samples).

Generation of molecular descriptors and fingerprints
The selection of suitable molecular representations plays an important role in the development of acceptable and robust pre-

Random forest
RF is a highly efficient decision tree-based ensemble algorithm [32, 33]. Compared with a single decision tree,
perturbations such as random feature subset selection are introduced, which increase the bias of the forest but decrease
its variance, with the net effect of enhancing the predictive accuracy and generalization ability of RF. In the
construction of RF, three main hyperparameters were optimized, including the number of trees in the forest (n_estimators,
from 50 to 300, interval = 10), the maximum depth of the tree (max_depth, from 1 to 30, interval = 1), and the randomness
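The two search ranges that survive in the text above (n_estimators and max_depth) can be written down as a discrete
search space. The following is only a minimal sketch: the third hyperparameter is cut off in the source and is therefore
omitted, and the names SEARCH_SPACE and sample_configurations, the number of draws and the seed are illustrative
assumptions rather than the authors' settings.

```python
import random

# Ranges quoted in the text; the third hyperparameter is truncated in the
# source, so it is deliberately left out of this sketch.
SEARCH_SPACE = {
    "n_estimators": list(range(50, 301, 10)),  # 50, 60, ..., 300
    "max_depth": list(range(1, 31)),           # 1, 2, ..., 30
}


def sample_configurations(space, n_iter, seed=0):
    """Draw n_iter random hyperparameter settings from the discrete grid."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


# Hypothetical budget of five random draws
candidates = sample_configurations(SEARCH_SPACE, n_iter=5)

# Number of distinct settings an exhaustive grid search would have to visit
grid_size = len(SEARCH_SPACE["n_estimators"]) * len(SEARCH_SPACE["max_depth"])
```

Each sampled dictionary could then be passed to whichever RF implementation is in use and scored by cross-validation;
random sampling is attractive here because even these two ranges alone already span several hundred combinations.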
study, the logistic regression algorithm was used in the stacked generalization.

Model evaluation
Here, the 10-fold cross-validation on the training set was used for internal validation. To ensure that the percentage of
the samples for each class is approximately preserved in each training and validation fold, the ‘stratified shuffle split’
strategy was used to split the dataset for cross-validation. The random search method, where each parameter is sampled
from a distribution

than their neighbors are considered to be outliers. According to the definition of LOF, a value of approximately 1
indicates that the molecule is comparable to its neighbors (and thus not an outlier), a value below 1 indicates a denser
region (which would be an inlier), whereas values significantly higher than 1 indicate outliers. LOF is defined by the
following equations:

$$\text{reachability-distance}_k(A, B) = \max\{k\text{-distance}(B),\ d(A, B)\} \tag{6}$$

$$\mathrm{lrd}_k(A) = 1 \Big/ \left( \frac{\sum_{B \in N_k(A)} \text{reachability-distance}_k(A, B)}{|N_k(A)|} \right) \tag{7}$$
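To make the two definitions concrete, the following is a minimal pure-Python sketch of LOF on toy 2D points. It assumes
Euclidean distance and uses the standard final step (the mean ratio of the neighbors' lrd to the point's own lrd), which
is not shown in the excerpt above; the function names and the toy data are illustrative and are not taken from the
authors' implementation.

```python
import math


def k_nearest(points, i, k):
    """Return (k-distance of point i, indices of its k nearest neighbours)."""
    dists = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points)) if j != i
    )
    k_dist = dists[k - 1][0]
    # N_k contains every point within the k-distance (ties are included)
    neighbours = [j for d, j in dists if d <= k_dist]
    return k_dist, neighbours


def local_outlier_factor(points, k):
    """Compute the LOF score of every point (toy version, no duplicate points)."""
    n = len(points)
    info = [k_nearest(points, i, k) for i in range(n)]

    def reach_dist(a, b):
        # Equation (6): max{k-distance(B), d(A, B)}
        return max(info[b][0], math.dist(points[a], points[b]))

    # Equation (7): inverse of the mean reachability distance to the neighbours
    lrd = []
    for a in range(n):
        neigh = info[a][1]
        mean_reach = sum(reach_dist(a, b) for b in neigh) / len(neigh)
        lrd.append(1.0 / mean_reach)

    # Standard LOF: mean ratio of the neighbours' lrd to the point's own lrd
    return [
        sum(lrd[b] for b in info[a][1]) / (len(info[a][1]) * lrd[a])
        for a in range(n)
    ]


# Four points on a unit square plus one distant point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = local_outlier_factor(points, k=2)
```

In this toy set the four corner points receive a score of 1, while the distant point scores far above 1, matching the
interpretation of LOF values given in the text.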
Performance comparison of different ML algorithms
Firstly, 24 models were developed based on the six different types of molecular representation (RDKitDes, MACCSFP,
MorganFP, PairsFP, PubChemFP and RDKitFP) using four ML algorithms (SVM, RF, XGBoost and DNN). The model performance
(AUC) of the 10-fold cross-validation for the training set is summarized in Figure 3A. Generally, most models present good
discrimination capability with the AUC values ranging from 0.801 to 0.832. The XGBoost model based on MorganFP and RDKitFP
performs the best with AUC = 0.832. Then, to further improve model performance, RDKitDes and a fingerprint set (MACCSFP,
MorganFP, PairsFP, PubChemFP or RDKitFP) were combined as the molecular features for model building since they are
different types of molecular representations. Based on the five types of combined features, 20 models were constructed
using four ML algorithms (SVM, RF, XGBoost and DNN). As shown in Figure 3B, compared with the models based on a single set
of descriptors or fingerprints, those based on the combination of descriptors and fingerprints offer more accurate
predictions (AUC = 0.812–0.843). For the five combined molecular features, the models based on RDKitDes + MorganFP perform
the best, with an average AUC of 0.832. The performance of the classification models based on RDKitDes + MorganFP is
summarized in Table 1. For the models based on RDKitDes + MorganFP, the XGBoost model performs the best (AUC = 0.843).
Here, it can be recognized that the predictive ability of each model toward noninhibitors (specificity > 90%) is much
better than that toward inhibitors (sensitivity < 60%), which was potentially caused by the imbalance of the training set.
In addition, it is observed that the tree-based models have higher specificity, whereas the SVM and DNN models provide
higher sensitivity. Therefore, to achieve more accurate predictions, we combined different ML classifiers by two model
fusion methods (voting and stacking). The ensemble models were built based on RDKitDes + MorganFP. For each fusion method,
we tried 11 different combination strategies by changing the number and type of ML algorithms. As shown in Figure 4, the
stacking models are slightly better than the voting models. For the fusion models built with different algorithms, the
stacking model based on the combination of SVM, RF, XGBoost and DNN performs the best, indicated by an AUC of 0.846.
However, it should be noted that the improvement is not significant.
To evaluate the stability and predictive ability of the models, the scaffold test set was randomly extracted from the
prepared dataset based on the molecule scaffolds 100 times and the remaining data was used as the training set. The 100
evaluation results of the five models based on RDKitDes + MorganFP are shown in Figure 5. Overall, all the models perform
well, with the AUC values all higher than 0.91. Based on the average AUC values and the other evaluation metrics, it is
clear that the stacking model outperforms the other four individual models, with an average AUC = 0.935 and ACC = 0.878
for the scaffold test set. Interestingly, the performance of the DNN model varies dramatically with the variation of the
scaffold test set, indicating that the DNN model is quite sensitive to the structural composition of the dataset.

Model interpretation
To gain a deeper insight into the built models, the contributions of the molecular descriptors in each model were
calculated by the SHAP method [44]. Due to its good interpretation ability and relatively high predictive performance, the
XGBoost model based on RDKitDes + MorganFP was analyzed. The top 10 features and the corresponding SHAP values are shown
in Figure 6. More detailed descriptions of the 2D molecular descriptors in the top 10 features are listed in Table 2. As
shown in Figure 6A, the descriptors that characterize the number of pyridine rings (fr_pyridine) and the number of
halogens (fr_halogen) have high contributions to the XGBoost model, suggesting that a higher number of pyridine rings and
halogens will increase the predicted probability of a molecule being an Mtb inhibitor. For the inhibitors in the training
set, nearly 41% (554 of 1363) and 59% (798 of 1363) of the molecules contain pyridine rings and halogens, respectively.
These interpretations explained by SHAP are consistent with experimental observations [53–55]. The pyridine fused systems
were reported to show promising anti-TB activity against replicating Mtb H37Rv and the substitution of a halogen group,
especially fluorine, on the phenyl ring in
Figure 4. Performance (AUC_ROC) of different types of fusion models for the 10-fold cross-validated training set. All the fusion models were built on the combination
of Des and Morgan.
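The count of 11 combination strategies is consistent with taking every subset of at least two of the four algorithms
(6 pairs, 4 triples and 1 quadruple). The sketch below enumerates these subsets and fuses them by soft voting, that is,
by averaging the predicted probabilities. The averaging rule and the probability values are illustrative assumptions;
the source does not specify the voting scheme, and a stacking model would instead train a meta-learner on the same
per-model probabilities.

```python
from itertools import combinations

ALGORITHMS = ["SVM", "RF", "XGBoost", "DNN"]


def fusion_combinations(algorithms):
    """All subsets with at least two members, i.e. every candidate fusion."""
    return [
        combo
        for size in range(2, len(algorithms) + 1)
        for combo in combinations(algorithms, size)
    ]


def soft_vote(probabilities):
    """Average the per-model probabilities of the inhibitor class."""
    return sum(probabilities) / len(probabilities)


combos = fusion_combinations(ALGORITHMS)

# Hypothetical predicted probabilities for a single query molecule
predicted = {"SVM": 0.62, "RF": 0.41, "XGBoost": 0.55, "DNN": 0.70}
fused = {combo: soft_vote([predicted[m] for m in combo]) for combo in combos}
```

Comparing the fused outputs across all subsets on held-out folds is one way to reproduce the kind of comparison
summarized in Figure 4.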
thiacetazone can enhance anti-TB activity of this compound [53–55]. Indeed, a relatively high proportion (538 of 1363) of
the inhibitors in the training set contain fluorine atoms. However, it should be noted that 614 noninhibitors in the
training set also contain fluorine atoms and 144 have the substructure of bitvector 43. This phenomenon suggests that the
impact of a certain substructure is not absolute and that global molecular properties still need to be taken into
consideration for predicting Mtb inhibitors. For 2D molecular descriptors, it can be seen that the EState-VSA descriptors,
which are intended to encode the topology and electronic environment of molecular fragments, play important roles in the
prediction of the XGBoost models. Specifically, a higher value of an EState-VSA descriptor increases the probability of a
molecule being predicted as an inhibitor, indicating the importance of energy metabolism in Mtb. To our knowledge, Mtb
contains a diverse range of multidrug transporters, many of which are dependent on the proton motive force (PMF) or the
availability of ATP [56]. This result further illustrates that energy metabolism and ATP
Figure 5. Performance of the classification models on 100 test sets. Each test set was randomly extracted from the dataset according to the molecular scaffold and the
remaining data was used as the training set. Sampling of the same data is allowed in the dataset extraction.
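The repeated scaffold-based extraction described in this caption can be sketched as follows. This is a hedged
illustration only: the scaffold labels are assumed to have been computed beforehand (for example with a cheminformatics
toolkit), and the function name, the test fraction of 0.25 and the toy data are hypothetical rather than the authors'
settings.

```python
import random
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction, rng):
    """Move whole scaffold groups into the test set until it is large enough.

    `scaffolds` maps a molecule id to its (precomputed) scaffold label.
    """
    groups = defaultdict(list)
    for mol_id, scaffold in scaffolds.items():
        groups[scaffold].append(mol_id)

    keys = sorted(groups)
    rng.shuffle(keys)  # random order of scaffolds for this repetition

    target = test_fraction * len(scaffolds)
    test = []
    for key in keys:
        if len(test) >= target:
            break
        test.extend(groups[key])

    test_ids = set(test)
    train = [m for m in scaffolds if m not in test_ids]
    return train, test


# Hypothetical toy data: 12 molecules spread evenly over 4 scaffolds
toy = {f"mol{i}": f"scaffold{i % 4}" for i in range(12)}

rng = random.Random(42)
# 100 independent repetitions; the same molecule may land in several test sets
splits = [scaffold_split(toy, test_fraction=0.25, rng=rng) for _ in range(100)]
```

Because whole scaffold groups are moved together, no scaffold is shared between the training and test molecules of a
given repetition, which is what makes this evaluation harder than a purely random split.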
production through the PMF that is established by the electron transport chain are critical in determining the drug
susceptibility of Mtb.
Besides molecular descriptors, molecular fingerprints also play an important role in the prediction of an XGBoost model. To
illustrate the relationship between molecular fingerprints and model output, two representative compounds, ciprofloxacin
and analogs of PA-824, were carefully analyzed. The contribution of each feature was calculated and the top-ranked
features with specific value are shown in Figure 6B. Ciprofloxacin is a synthetic broad-spectrum fluoroquinolone
antibiotic, which binds to and inhibits bacterial DNA gyrase, an enzyme essential for DNA replication [57]. Two
substructures, the cyclopropyl (Morgan 338) and the fluorine atom (Morgan 904) shown in Figure 6D, make important
contributions in the prediction of ciprofloxacin and are considered to be potent modifications on the quinolone
Table 2. Important descriptors identified by SHAP for the XGBoost model based on Des + Morgan
Table 3. Performance of the ML models developed based on Des + Morgan for the time set
Table 4. Impact of molecular scaffolds on the predictive performance of the stacking model
CHEMBL4173210 may be caused by the molecular scaffolds of hybridized 1H-benzo[d]imidazoles and
3,4-dihydroquinazolin-4-ones since these substructures do not exist in the training set [63]. The compounds
CHEMBL4289962 and CHEMBL4285907 may be misclassified due to the substructure of
(3Z)-3-(hydroxyimino)-2,3-dihydro-1H-indol-2-one [64]. As for the false negatives, the misclassification of the four
noninhibitors may be caused by their special spatial configuration that cannot be well characterized by molecular
fingerprints.

Webserver for the identification of Mtb inhibitors
To share our models with other chemists and pharmacologists, we developed a webserver called ChemTB
(http://cadd.zju.edu.cn/chemtb/) to detect potential Mtb inhibitors. ChemTB was developed in Python using the Django
framework. The webserver mainly includes two functions: similarity search and bioactivity prediction (Figure 9). For the
similarity search, the similarity can be calculated via two metrics (Tanimoto and Dice) based on three types of molecular
fingerprints (MACCS, Morgan and AtomPair
fingerprints). As a result, the three most similar compounds in our datasets and the related information are presented.
For the bioactivity prediction, the stacking model with excellent performance was employed in the webserver. To broaden
the AD of our model, the model implemented by ChemTB was retrained based on the Morgan fingerprints using all the
available datasets used in our study (including 2424 inhibitors and 6094 noninhibitors). According to the 10-fold
cross-validation (AUC = 0.90 and ACC = 0.85), the retrained model shows excellent predictive performance. ChemTB accepts a
query molecule for bioactivity prediction by drawing its structure from the JSME molecular editor or by inputting its
SMILES. The open-source cheminformatics tool, RDKit, was used to process molecules and calculate molecular descriptors.
Then, the LOF was used to test whether a query molecule is located in the AD. Finally, the webserver returns the AD
detection results and the bioactivity prediction. Besides the functions introduced above, the datasets used in this study
and the pretrained models are also available on this website.

Conclusion
In our study, we collected a bioactivity dataset toward Mtb H37Rv from the ChEMBL database and built a set of models using
four ML algorithms (i.e. SVM, RF, XGBoost and DNN) based on six types of molecular descriptors and fingerprints. The
stacking model that combines the predictions from the SVM, RF, XGBoost and DNN models based on the 2D descriptors and
Morgan fingerprints performs the best, with AUC of 0.846 for the 10-fold cross-validation, AUC of 0.942 for the scaffold
set and AUC of 0.750
and accuracy of 0.795 for the external test set (the time set). Given the applicability domain value and prediction
accuracy requirements, credible prediction results need to satisfy the following requirements for the query molecule:
(i) having a similarity value higher than 0.7 compared with the molecules in the training set; (ii) having an LOF value
higher than 1.5; (iii) avoiding molecules with stereoisomerism. The important molecular features for classification were
analyzed by the SHAP method. The essential molecular descriptors and fingerprints highlighted by SHAP are generally
consistent with expert experience, suggesting the reliability of the built models. Finally, a freely accessible
webserver, ChemTB, was developed for the implementation of the online prediction by the well-trained models.

Key Points
• Four machine learning algorithms were used to develop the classification models to identify potential Mtb inhibitors.
• The evaluation results illustrate that the XGBoost model exhibits the best prediction performance.
• Two consensus strategies were employed to further improve the predictive ability and robustness of the classification
models.
• We developed a freely available computational webserver to predict potential Mtb inhibitors.

Funding
Key R&D Program of Zhejiang Province (2020C03010), National Natural Science Foundation of China (21575128, 81773632) and
Natural Science Foundation of Zhejiang Province (LZ19H300001).

References
1. Orme I, Secrist J, Anathan S, et al. Search for new drugs for treatment of tuberculosis. Antimicrob Agents Chemother
2001;45:1943–6.
2. Reid MJA, Arinaminpathy N, Bloom A, et al. Building a tuberculosis-free world: the Lancet Commission on tuberculosis.
Lancet 2019;393:1331–84.
3. World Health Organization (2019). Global Tuberculosis Report 2019. Geneva: World Health Organization, 2019.
4. Abubakar I, Zignol M, Falzon D, et al. Tuberculosis 2013:5 drug-resistant tuberculosis: time for visionary political
leadership. Lancet Infect Dis 2013;13:529–39.
5. Zumla AI, Gillespie SH, Hoelscher M, et al. New antituberculosis drugs, regimens, and adjunct therapies: needs,
advances, and future prospects. Lancet Infect Dis 2014;14:327–40.
6. Halsey NA, Coberly JS, Desormeaux J, et al. Randomised trial of isoniazid versus rifampicin and pyrazinamide for
prevention of tuberculosis in HIV-1 infection. Lancet 1998;351:786–92.
7. Goble M, Iseman MD, Madsen LA, et al. Treatment of 171 patients with pulmonary tuberculosis resistant to isoniazid and
rifampin. N Engl J Med 1993;328:527–32.
8. Zhang Y, Wade MM, Scorpio A, et al. Mode of action of pyrazinamide: disruption of Mycobacterium tuberculosis membrane
transport and energetics by pyrazinoic acid. J Antimicrob Chemother 2003;52:790–5.
9. Telenti A, Philipp WJ, Sreevatsan S, et al. The emb operon, a gene cluster of Mycobacterium tuberculosis involved in
resistance to ethambutol. Nat Med 1997;3:567–70.
10. Opromolla DV, Lima Lde S, Caprara G. Rifamycin SV in the treatment of lepromatous leprosy. Lepr Rev 1965;36:123–31.
11. Lewis RA, Wood D. Modern 2D QSAR for drug discovery. Wiley Interdisciplinary Reviews-Computational Molecular Science
2014;4:505–22.
12. Prathipati P, Ma NL, Keller TH. Global Bayesian models for the prioritization of antitubercular agents. J Chem Inf Model

31. Camps-Valls G, Bruzzone L. Kernel-based methods for hyperspectral image classification. IEEE Trans Geosci Remote Sens
2005;43:1351–62.
32. Breiman L. Random forests. Mach Learn 2001;45:5–32.
33. Svetnik V, Liaw A, Tong C, et al. Random forest: a classification and regression tool for compound classification and
QSAR modeling. J Chem Inf Comput Sci 2003;43:1947–58.
34. Mitchell R, Frank E. Accelerating the XGBoost algorithm using GPU computing. Peerj Comput Sci 2017;3:e127.
35. Attali JG, Pages G. Approximations of functions by a