Computational Chemistry
AutoPH4: An automated method for generating
pharmacophore models from protein binding pockets
Siduo Jiang, Miklos Feher, Chris I Williams, Brian Cole, and David E. Shaw
J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.0c00121 • Publication Date (Web): 08 Jul 2020

Journal of Chemical Information and Modeling is published by the American Chemical Society, 1155 Sixteenth Street N.W., Washington, DC 20036.
Published by American Chemical Society. Copyright © American Chemical Society.
However, no copyright claim is made to original U.S. Government works, or works
produced by employees of any Commonwealth realm Crown government in the course
of their duties.
Page 1 of 47 Journal of Chemical Information and Modeling

AutoPH4: An Automated Method for Generating Pharmacophore Models from Protein Binding Pockets

Siduo Jiang,1 Miklos Feher,1,† Chris Williams,2 Brian Cole,1 and David E. Shaw1,3,†

1 D. E. Shaw Research, New York, NY 10036, USA.
2 Chemical Computing Group, Montreal, QC H3A 2R7, Canada.
3 Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA.

† To whom correspondence should be addressed.

Miklos Feher
E-mail: Miklos.Feher@DEShawResearch.com
Phone: (212) 478-0788
Fax: (212) 845-1788

David E. Shaw
E-mail: David.Shaw@DEShawResearch.com
Phone: (212) 478-0260
Fax: (212) 845-1286

ACS Paragon Plus Environment

Abstract

Pharmacophore models are widely used in computational drug discovery (e.g., in the virtual screening of drug molecules) to capture essential information about interactions between ligands and a target protein. Generating pharmacophore models from protein structures is typically a manual process, but there has been growing interest in automated pharmacophore generation methods. Automation makes feasible the processing of large numbers of protein conformations, such as those generated by MD simulations, and thus may help achieve the longstanding goal of incorporating protein flexibility into virtual screening workflows. Here we present AutoPH4, a new automated method for generating pharmacophore models based on protein structures; we show that a virtual screening workflow incorporating AutoPH4 ranks compounds more accurately than any other pharmacophore-based virtual screening workflow for which results on a public benchmark have been reported. The strong performance of the virtual screening workflow indicates that its AutoPH4 component generates high-quality pharmacophores, making AutoPH4 promising for use in future virtual screening workflows as well, such as ones that use conformations generated by MD simulations.

Introduction

Pharmacophore models describe the three-dimensional arrangement of generalized features (such as hydrogen bond donors and acceptors) that are thought to be important for the binding of ligands to a protein of interest. Such models are widely recognized to be useful in various computational approaches to drug discovery.1,2 In a pharmacophore-based virtual screen, for example, the pharmacophore model serves as a template against which prospective ligands are matched, and facilitates ligand positioning within the binding pocket. It is possible to derive pharmacophore models either from a set of known active ligands or directly from a structure of the target protein.3 The latter models, so-called “structure-based pharmacophore models,” have the potential to be particularly important for binding sites to which few or no ligands are known to bind, such as sites on novel targets or previously unexplored allosteric sites on established targets. The protein structures used to generate these pharmacophore models are generally single conformations obtained experimentally or from computational modeling, but there is growing interest in generating pharmacophore models using multiple conformations, such as from the large number of conformations produced by molecular dynamics (MD) simulations (which is a promising approach to incorporating protein flexibility into virtual screening).4

Structure-based pharmacophore models can be generated manually in many popular modeling suites.5 In cases in which there is no cognate ligand (so-called “apo” cases) this can be very difficult,2 and even when there is a cognate ligand (so-called “holo” cases), it is a slow process, which makes it infeasible to generate pharmacophore models for large numbers of structures, and it introduces subjectivity. Partially automated approaches, such as MCSS-derived site points,6 GRID-based methods,7,8 and methods that use hydration-site analysis9 or interaction field maps,10,11 are potentially faster and less subjective, but still involve multiple manual steps and do not escape the problems inherent to manual methods. Fully automated methods include ePharmacophores,12,13 Pocket,14 LigandScout,15,16 IChem/Shaper2,17 SILCS,18 and T2F.19 Although these methods have shown promise, only LigandScout and IChem/Shaper2 have been tested on a publicly available benchmark dataset, and their overall virtual screening performance was lower than that of established docking-based methods.15–17

Here we present AutoPH4, a new, fully automated method for generating structure-based pharmacophore models, and evaluate its performance in a virtual screening workflow. For a given target structure, AutoPH4 first generates pharmacophore features using information from fields (such as electrostatic potential20 and contact preference maps21) calculated around protein residues in the binding pocket. In the second stage of computation, these putative features are then validated with molecular probes, which are small fragments positioned on the predicted features to evaluate whether the expected interactions can form (and are not, for instance, prevented by steric clashes). Any putative features not confirmed by the molecular probes are removed.

To test the quality of the resulting pharmacophore models, we used these models for ligand placement and scoring in a virtual screening workflow and evaluated how well the workflow could distinguish between true active molecules and decoy molecules (which have similar physical properties to the actives but dissimilar chemical structures) in the Enhanced Directory of Useful Decoys (DUD-E),22,23 a standard benchmark dataset. The performance of AutoPH4 on this test was substantially better than that of LigandScout,16 which reported results for the holo case, and IChem/Shaper2,17 which reported results for the apo case. The high-quality, automatically generated pharmacophore models produced by AutoPH4 make it an appealing method for use in future virtual screening workflows as well, such as those that use large numbers of conformations from MD simulations.

Results and Discussion

We present our results in two main parts: First we show the results of AutoPH4’s internal validation process. Next, we present the results obtained on the DUD-E benchmark test set for a virtual screening workflow we developed in which AutoPH4 is used in ligand placement and scoring.

Pharmacophore model generation and validation

The first step of AutoPH4 (Table 1) uses physics-based fields (e.g., Poisson-Boltzmann electrostatics), knowledge-based fields (e.g., contact statistics of residues from the PDB), and empirical fields (e.g., spherical field from a point charge) to propose pharmacophore features. Choices of and settings for these fields were optimized using a training set (Methods; Tables S1 and S2) and validated against alternative metrics on 9,654 receptor complexes covering eight target classes (listed on the x-axes of Figure 1). This dataset is substantial in size, and the target classes exhibit different levels of diversity in binding-site interactions. (Carbonic anhydrases share similar principal interactions near the zinc atom, for example, whereas kinases and nuclear receptors are more diverse.) We describe the results of the validation process for this dataset using both AutoPH4’s holo and apo options (which make use of or ignore information about a bound ligand, respectively). AutoPH4 also has a “fragment” option, which is relevant when the binding site is partially occupied by a small ligand, and we also describe results illustrating this option.

Validation of the field-based feature generation step in AutoPH4 for holo cases

In holo cases a structure is available of the target protein with a cognate ligand in its binding pocket. AutoPH4’s holo option is designed for such cases, and uses structural information about both the receptor and its cognate ligand to derive a pharmacophore model.

We assessed the quality of the pharmacophore models proposed by the field-based feature generation step (with the holo option) using several criteria (Figure 1A). We first asked whether, for each pharmacophore model, the cognate ligand matched the derived pharmacophore in a pharmacophore search performed in place (i.e., without translations, rotations, or conformational changes). Encouragingly, 9,630 out of 9,654 models passed this so-called “hit test” (most of the 24 failures were from a class of trypsin-like serine proteases containing ligands with extended carbon chains, and resulted from incorrect placement of hydrophobic features).

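The in-place hit test can be pictured with a minimal Python sketch. It is an illustration under stated assumptions rather than the AutoPH4 or MOE implementation: the `Feature` and `LigandPoint` types, the tolerance radii, and the rule that every feature must be matched are all hypothetical.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Feature:
    """Hypothetical pharmacophore feature: a typed sphere in the pocket."""
    kind: str        # e.g. "Don", "Acc", "Aro", "Hyd"
    center: tuple    # (x, y, z) in angstroms
    radius: float    # tolerance radius in angstroms (assumed value)


@dataclass(frozen=True)
class LigandPoint:
    """Hypothetical annotation point on the ligand in its crystal pose."""
    kind: str
    position: tuple  # (x, y, z) in angstroms


def hit_in_place(features, ligand_points):
    """Return True if every feature sphere contains a ligand point of the
    same type, using the ligand coordinates exactly as given (no
    translation, rotation, or conformational change)."""
    return all(
        any(p.kind == f.kind and dist(p.position, f.center) <= f.radius
            for p in ligand_points)
        for f in features
    )


model = [Feature("Don", (0.0, 0.0, 0.0), 1.0),
         Feature("Aro", (5.0, 0.0, 0.0), 1.5)]
ligand = [LigandPoint("Don", (0.3, 0.0, 0.0)),
          LigandPoint("Aro", (5.5, 0.5, 0.0))]
print(hit_in_place(model, ligand))
```

Because the ligand is never moved, a failure of this test points at the feature placement itself, which is consistent with the misplaced hydrophobic features reported for the failing cases.
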

We further validated pharmacophore features proposed by the field-based method by assessing the rate of false positives (i.e., features appearing when in fact there is no interaction) and false negatives (i.e., features missing from the pharmacophore model despite an identifiable interaction with the protein), ultimately finding them to be low. We determined these rates by comparing features generated by the field-based method with either probe scans or EHT-based interaction detection (Table 1). The false-positive rate was determined by using probe scans to establish whether an acceptor, donor, or aromatic feature placed within the pharmacophore sphere could indeed interact with the protein. We found that for only 2% of complexes did the pharmacophore model contain a non-interacting acceptor or donor, and in only 1% did the model contain a non-interacting aromatic ring. The false-negative rate was determined by assessing how often an interaction identified by EHT-based interaction detection (as defined by MOE’s hydrogen bond and H-π detection algorithm using the cognate ligand) was missing from field-based generation. We found that the rate was on average 4% for acceptors and donors (measured together) and 3% for aromatic interactions. We saw no major differences among the eight target classes. Although the rates of false positives and false negatives were found to be low, when such discrepancies do appear, AutoPH4 uses them to refine the pharmacophore models, removing features determined to be false positives and adding those that were false negatives to create a final pharmacophore model.
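
The refinement logic described above amounts to two set differences, as the following Python sketch suggests. The function names and the representation of features as plain labels are assumptions made for illustration; in particular, the sketch adds missed interactions without re-validating them, which may differ from what AutoPH4 does.

```python
def refine(proposed, probe_confirmed, detected_interactions):
    """Drop proposed features that the probe scan did not confirm (false
    positives) and add detected interactions that were not proposed
    (false negatives). All arguments are sets of hashable feature labels."""
    false_pos = proposed - probe_confirmed
    false_neg = detected_interactions - proposed
    final = (proposed - false_pos) | false_neg
    return final, false_pos, false_neg


def complex_rate(per_complex_errors):
    """Fraction of complexes whose model has at least one erroneous feature.
    per_complex_errors holds one collection of flagged features per complex."""
    flagged = sum(1 for errors in per_complex_errors if errors)
    return flagged / len(per_complex_errors)


final, fp, fn = refine(
    proposed={"Acc1", "Don1", "Aro1"},
    probe_confirmed={"Acc1", "Aro1"},
    detected_interactions={"Acc1", "Don2"},
)
print(sorted(final), sorted(fp), sorted(fn))
```

Counting a complex once, however many features are flagged in it, mirrors the per-complex percentages quoted in the text.
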

Validation of the field-based feature generation step in AutoPH4 for apo cases

Apo cases lack a cognate ligand to guide pharmacophore generation, so the binding-site location and size are often unknown. The apo option in AutoPH4 accommodates such cases. In a real-world application of AutoPH4, the extent of the binding site would be determined computationally using a site-finding method.24–26 In order to facilitate direct comparison of the apo and holo options in AutoPH4, however, we used the same set of complexes as for the holo option, and defined the binding site as all residues within 4.5 Å of the cognate ligand, so the set of pocket atoms was identical to that in our test of the holo option. This choice also allows us to avoid problems associated with choosing proteins that were crystallized without a ligand (such as large conformational rearrangements and hydrophobic collapse).27 Beyond determining the location of the pocket, no ligand features or observed ligand-receptor interactions were used for pharmacophore feature generation with the apo option.
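
The 4.5 Å pocket definition can be sketched in a few lines of Python. The input layout (residue identifier plus coordinates per atom) and the reading of "within 4.5 Å" as "any residue atom within 4.5 Å of any ligand atom" are assumptions; the text does not state which atoms are compared.

```python
from math import dist


def pocket_residues(protein_atoms, ligand_coords, cutoff=4.5):
    """Select residues with at least one atom within `cutoff` angstroms of
    any ligand atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) pairs.
    ligand_coords: list of (x, y, z) ligand atom positions.
    """
    selected = set()
    for residue_id, xyz in protein_atoms:
        if residue_id in selected:
            continue  # residue already accepted through another atom
        if any(dist(xyz, lig) <= cutoff for lig in ligand_coords):
            selected.add(residue_id)
    return selected


atoms = [("A:10", (0.0, 0.0, 0.0)), ("A:10", (1.5, 0.0, 0.0)),
         ("A:55", (10.0, 0.0, 0.0)), ("A:72", (4.0, 2.0, 0.0))]
ligand = [(0.0, 4.0, 0.0)]
print(sorted(pocket_residues(atoms, ligand)))
```

Only the residue selection depends on the ligand here, which matches the statement that no ligand features or observed interactions enter apo feature generation.
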

As expected, the proposed pharmacophore features were somewhat worse with the apo option (Figure 1B) than with the holo option. According to probe scans, in 15% of the studied complexes the initial pharmacophore models contained at least one donor feature that could not interact with the receptor, although false-positive rates for acceptors and aromatic features were low, at 4.2% and 0.6%, respectively. All such false positives, however, were eliminated from the final pharmacophore model.

The false-negative rate (that is, the fraction of pharmacophore models with missing features that could be potentially utilized by ligands) is difficult to determine without enumerating all possible ligands and their interactions with the binding site, but we obtained a lower bound by calculating the fraction of pharmacophore models that were missing features of the cognate ligand. (It is important to remember that features that are not present in the cognate ligand but validated by probe scans often represent unprecedented receptor interactions and provide a basis for novel chemistry.) On average, 22% of the pharmacophore models missed aromatic rings on the cognate ligand that were expected to interact with the protein, and 14% missed acceptor or donor features. Some molecules also had more than one missing feature of the same type. The field-based feature generation step, for example, had a higher acceptor/donor false-negative rate for some target classes with highly polar binding sites, such as ionotropic glutamate receptors, in which some neighboring features were erroneously combined by the clustering algorithm. (Although performance for individual target classes could potentially be improved by fine-tuning parameters of the applied fields and changing the clustering criteria, we used the default settings throughout this work.) In summary, although the performance for apo pharmacophore generation was somewhat worse than for holo, it still appeared sufficiently promising to evaluate it in our virtual screening workflow.

Application of AutoPH4 to a fragment case

In addition to its holo and apo options, AutoPH4 also has a “fragment” option, designed for use when only small fragments (which typically have only a few features) are known to bind in the pocket and the aim is to find lead- or drug-size molecules that fill the pocket. This tool may be of use in fragment-based drug discovery, which has become a routine method in the pharmaceutical industry to provide leads for previously intractable biological targets.28 Our fragment option in AutoPH4 is a hybrid approach that divides the site into two regions, treating the region with the fragment using the holo option and the remainder using the apo option.

Here we give a brief illustrative example of the generation of a pharmacophore model with the fragment option. We applied AutoPH4 to an example fragment case, protein kinase B, for which both a fragment (Figure 2A) and a larger, more active inhibitor (Figure 2B) have been crystallized.29 We used the fragment option to derive a pharmacophore model for a crystal structure of protein kinase B with the fragment bound (PDB code 2UW3), and found that the model included an acceptor and donor interaction corresponding to the hinge binding of the pyrazole group, as well as two aromatic groups and a hydrophobe (Figure 2C). When we applied AutoPH4 to this same structure using the apo option rather than the fragment one, we found that although the correct donor, acceptor, and aromatic features were positioned on the pyrazole group, the aromatic feature expected at the benzene ring of the fragment was instead modeled as a more generic hydrophobe (Figure 2D). The aromatic feature was obtained from the interacting fragment and thus was captured using the fragment option. For comparison, we also show the pharmacophore model derived with the holo option (Figure 2E), which misses the features away from the fragment.

Our fragment-derived pharmacophore model correctly predicted important features of the 18-nM inhibitor ligand (PDB code 2UW7), which contains the 80-μM fragment as its core interacting group (Figure 2F),29 including a donor that can interact with the carboxylate of Glu127, the aromatic feature at the benzene ring, and both hydrophobes, one of which is centered on the chlorine atom. The pharmacophore model also predicted two features not exhibited by the crystallized ligand: an acceptor that can interact with the Lys72 side chain, and a donor that can interact with the backbone carbonyl of Gly55. Such features could point to opportunities for further ligand optimization.

Performance of an AutoPH4-based virtual screening workflow on the DUD-E test set

To evaluate the performance of AutoPH4 in a virtual screening context, we embedded it in a virtual screening workflow and applied that workflow to 94 complexes in DUD-E, a widely used dataset of active molecules and decoys for evaluating the performance of virtual screening methods22,23 (Methods). We applied the workflow with the pharmacophore models automatically generated using the holo option in AutoPH4, then again with models generated using the apo option. Table 2 summarizes the overall performance on the 94 DUD-E complexes; further details for individual complexes and target families are provided in the SI.

In the following sections, we also make comparisons to other methods. Such comparisons pose certain difficulties. It is well known, for example, that DUD-E and other benchmarking datasets contain certain biases that may affect the performance of certain virtual screening methods more than others,30–32 and we describe an additional bias in the SI. In some studies the DUD-E-selected PDB structure is replaced with a different one for a subset of complexes, and in many studies, including ours (see Methods), only a subset of the DUD-E complexes is studied, further complicating comparison. It is also worth bearing in mind that holo pharmacophores utilize some information about cognate binding that apo pharmacophores (and most docking programs) do not, potentially at a cost of reduced diversity of hits. Clearly, making rigorous comparisons between methods is very difficult, but we believe that such comparisons can be valuable in giving a rough sense of how well different methods perform.

Workflow performance with the holo option in AutoPH4

For our virtual screening workflow using AutoPH4 with the holo option, we found area under the curve (AUC) values greater than 0.5 for all of the DUD-E complexes except one (gcr); 81.9% of the complexes had an AUC > 0.7, while 13.8% had an AUC > 0.9. (Note that random performance on the DUD-E dataset yields an expected AUC that is somewhat in excess of 0.5, rather than exactly 0.5; see the section “Potential source of bias in DUD-E datasets due to the higher number of forms for actives than decoys” in the SI.) The median enrichment (see Methods) for early (0.5%) recovery was 34.4, indicating that the workflow performed particularly well in identifying actives in the early stages of screening.
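
For readers less familiar with these metrics, the Python sketch below computes a ROC AUC and a conventional enrichment factor from scored actives and decoys. It is a sketch under assumptions: higher scores are taken to mean better-ranked compounds, and the enrichment factor shown is the common "hit rate in the top fraction divided by the overall hit rate" form, which may differ from the definition in the Methods section (not reproduced here).

```python
def roc_auc(scores, labels):
    """ROC AUC as the probability that a randomly chosen active outscores a
    randomly chosen decoy, with ties counted as one half. This pairwise form
    is quadratic in the number of compounds, which is fine for a sketch."""
    actives = [s for s, is_active in zip(scores, labels) if is_active]
    decoys = [s for s, is_active in zip(scores, labels) if not is_active]
    wins = sum((a > d) + 0.5 * (a == d) for a in actives for d in decoys)
    return wins / (len(actives) * len(decoys))


def enrichment_factor(scores, labels, fraction):
    """Hit rate among the top `fraction` of the ranked list divided by the
    hit rate of the whole list."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    top_actives = sum(1 for _, is_active in ranked[:n_top] if is_active)
    total_actives = sum(1 for is_active in labels if is_active)
    return (top_actives / n_top) / (total_actives / len(ranked))


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(roc_auc(scores, labels), enrichment_factor(scores, labels, 0.2))
```

An AUC summarizes the whole ranked list, whereas the enrichment factor looks only at the top of it, which is why the two can diverge for the same workflow.
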

Our holo virtual screening results are superior to those published for LigandScout.16 (We compared our results to the LigandScout method based on X-ray structures, denoted as “PDB” in Wieder et al.16) LigandScout was evaluated on DUD-E receptor systems;16 on the 38 of these that were also in our evaluation set, the median AUC was 0.51 for LigandScout and 0.79 with our workflow on the same set. Of the other two automated holo methods, Pocket and ePharmacophores, only the latter has been tested in a virtual screening workflow, with its enrichment factors (EF 1%) for 30 targets reported to be higher than those of GlideSP.12 An apples-to-apples comparison of ePharmacophores results to AutoPH4 is not possible, however, because the set of actives used in the published ePharmacophores tests is not publicly available.

Our workflow with the holo option also performs well in comparison to docking (superior as judged by some metrics, and comparable by others, based on comparisons of published results on the DUD-E benchmark dataset; see Table 3). The median Boltzmann-enhanced discrimination of the receiver operating characteristic (BEDROC) score at α = 80.5 (see Methods) was 0.40 on the same 94 DUD-E complexes, compared to 0.40 with GOLD, 0.37 with Glide, 0.21 with Surflex, and 0.22 with FlexX.30 By this metric, our AutoPH4-based workflow thus performed similarly to GOLD and Glide, and better than FlexX and Surflex. Median enrichment data indicate that our workflow outperformed all four docking programs at 0.5–8% recoveries. Glide and AutoDock Vina have shown median AUCs of 0.79 and 0.68, respectively, on a set of 22 targets.33 Our AutoPH4-based workflow produced a median AUC of 0.76 on the same target set. On another set of 34 DUD-E complexes, a median AUC of 0.72 has been obtained with AutoDock Vina using the MM-GBSA binding energy.34 Our workflow, which includes an equivalent MM-GBVI component, produced a median AUC of 0.78 on this 34-complex set. Our workflow results were also better than published DOCK 3.7 results.35 Further details, including confidence intervals, are provided in Table 3 and Figures S1–S7.
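
The BEDROC score used in this comparison can be illustrated with a short Python sketch. It follows the commonly used definition, in which each active at rank r contributes a weight exp(-αr/N) and the sum is rescaled between its worst and best possible values; this is assumed to match the definition in the Methods section, which is not reproduced here. Only α = 80.5 is taken from the text.

```python
from math import exp


def bedroc(active_ranks, n_total, alpha=80.5):
    """BEDROC as the exponentially weighted rank sum of the actives,
    min-max normalized between the worst and best possible rankings.

    active_ranks: 1-based ranks of the actives (1 = best-scored compound).
    n_total: total number of ranked compounds (actives plus decoys).
    """
    n_actives = len(active_ranks)

    def weight_sum(ranks):
        return sum(exp(-alpha * r / n_total) for r in ranks)

    observed = weight_sum(active_ranks)
    best = weight_sum(range(1, n_actives + 1))                      # actives first
    worst = weight_sum(range(n_total - n_actives + 1, n_total + 1))  # actives last
    return (observed - worst) / (best - worst)


print(bedroc([1, 2, 500], n_total=1000))
```

With α = 80.5 the weights decay so quickly that only actives near the very top of the list contribute appreciably, which is why BEDROC at this setting behaves as an early-recognition metric.
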

These comparisons indicate that our workflow with AutoPH4 using the holo option performs as well as or better than established virtual screening tools on the DUD-E test set. In agreement with other findings,36 a major advantage of the pharmacophore-based workflow over docking appears to be in early enrichment: The median enrichment at 0.5% recovery for the AutoPH4-based workflow was nearly twice as high as that of the best-performing docking methods, Glide and GOLD, and about four times as high as that of FlexX and Surflex. Moreover, compound rankings obtained using the AutoPH4-based workflow had little or no correlation with rankings obtained using Glide, GOLD, and ECFP4 fingerprint similarity, as tested for three of the DUD-E targets (Tables S4–S6). This suggests that docking and fingerprint similarity are almost orthogonal to pharmacophore models, and using them (especially the former) in combination with pharmacophores should improve the diversity of detected active compounds.
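
One way to quantify how strongly two methods agree in their compound rankings is a rank correlation coefficient. The Python sketch below computes Spearman's coefficient from two score lists; the choice of Spearman's coefficient is an assumption made for illustration, since the measure used for Tables S4–S6 is not stated in this section.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the rank-transformed scores."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


print(spearman([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
```

A coefficient near zero for two methods that both enrich actives would indicate that they favor different compounds, which is the basis for the argument that combining them should broaden the set of actives found.
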

Workflow performance with the apo option in AutoPH4

Our workflow with the apo option for AutoPH4, in which pharmacophore models are generated using no ligand information other than the position of the binding site, yielded a median AUC of 0.735 on the 94-complex DUD-E set (Table 2 and SI), compared to 0.786 with the holo option. There is a considerable drop in early enrichment metrics compared to the holo option, but despite this, the apo-option early enrichment results are still quite comparable to commercial docking methods (Table 3). In particular, the median BEDROC scores30 at α = 80.5 indicate that apo pharmacophores perform worse than Glide or GOLD, and comparably to FlexX and Surflex, whereas enrichment factors at 0.5% of the database show equivalent performance to Glide and GOLD.

AutoPH4 performed better than IChem/Shaper2, the only other apo pharmacophore program with published virtual screening performance data: On a dataset of 20 targets, the two methods have similar AUCs, but the BEDROC and enrichment results were significantly better with AutoPH4 (see Figure S7). (We used the best-performing IChem/Shaper2 method for this comparison, referred to in Table 2 of Tran-Nguyen et al.17 as “Shaper2 TotE PLP.”)

For a few particular target structures, our workflow performed better with the apo option in AutoPH4 than with the holo option; this occurred when the cognate ligand was missing a feature exhibited by most other actives. The majority of known inhibitors of Factor Xa (PDB code 3KL6, DUD-E code fa7), for example, have a positively charged group to interact with the carboxylate of Asp189,37 but this feature is missing from the cognate ligand. As a result, the pharmacophore model generated by AutoPH4 with the holo option lacked a cation/donor feature that was present in the model generated with the apo option, and we found that the latter outperformed the former in the virtual screening workflow (the AUC was 0.767 for the holo option and 0.883 for the apo option).

The above case (and other similar cases) indicates that information from the apo option in AutoPH4 may sometimes add value to pharmacophore models generated with the holo option. Although we do not pursue this avenue here, it may thus be possible to construct a scheme in which the holo and apo methods are used in combination, and which provides performance superior to that achievable with either individual option.

Conclusions

We have described AutoPH4, an automated method for generating pharmacophore models from the structure of a protein or protein-ligand complex. The overall performance of the virtual screening workflow incorporating AutoPH4 on the DUD-E benchmark test set of actives and decoys was better than that of other pharmacophore-based workflows for which direct comparison was possible, and the workflow gave particularly good results in terms of the key metric of early enrichment. As expected, performance was better when AutoPH4 considered the cognate ligand in the pharmacophore model generation process than when it ignored it. Nevertheless, comparison to several state-of-the-art docking-based virtual screening methods in the literature showed that even in the latter (apo) case, AutoPH4-based screening is comparable to these methods.

It is worth noting that it takes tens of seconds on a single core for AutoPH4 to generate a pharmacophore model for a single protein-ligand complex, and that the process of generating pharmacophores for multiple complexes is trivially parallelizable; using 1500 cores of our commodity cluster (see Methods), for example, we generated pharmacophore models for the full validation dataset (9,654 complexes) in minutes with both the holo and apo options. Its ability to rapidly generate high-quality pharmacophore models makes AutoPH4 highly appropriate for future virtual screening workflows that make use of the large numbers of protein conformations generated from molecular dynamics trajectories to account for protein flexibility.
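
The throughput claim can be checked with a back-of-the-envelope estimate, sketched here in Python. The 9,654 complexes and 1500 cores are taken from the text; the figure of 30 s per model is an assumed stand-in for "tens of seconds", and the model ignores scheduling and I/O overhead.

```python
from math import ceil


def wall_clock_seconds(n_jobs, seconds_per_job, n_cores):
    """Idealized wall-clock time for independent, equal-length jobs: each
    core works through ceil(n_jobs / n_cores) jobs one after another."""
    return ceil(n_jobs / n_cores) * seconds_per_job


# 30.0 s per model is an assumption, not a measured value from the paper.
estimate = wall_clock_seconds(n_jobs=9654, seconds_per_job=30.0, n_cores=1500)
print(estimate / 60.0, "minutes")
```

Under these assumptions each core processes at most seven complexes, giving about 3.5 minutes in total, which is consistent with the statement that the full dataset was processed in minutes.
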

Materials and Methods

Generation of pharmacophore models in AutoPH4

The pharmacophore features used in AutoPH4 correspond to a subset of the types defined in the Unified scheme in the Molecular Operating Environment (MOE) drug discovery software platform.38 The types we use are hydrogen bond donor (Don); hydrogen bond acceptor (Acc); cation (Cat); anion (Ani); aromatic center or non-aromatic π-system ring in which each atom is sp2 hybridized (Aro); and hydrophobic atoms and hydrophobic centers (Hyd), as well as their Boolean combinations using the “and” and “or” operators. We treated metal ligators as hydrogen bond acceptors in AutoPH4, and the normal projections from aromatic rings, normally considered in MOE, were neglected in AutoPH4. In addition, we modeled the protein volume that is not accessible by the ligands as part of the pharmacophore model using an excluded volume. In pharmacophore searches we used partial matching, meaning that only a subset of features had to be matched in order to recover the ligand.
32
33
34
The pharmacophore generation process has three main components: field-based feature generation, probe-based feature validation, and the identification of missing features in the holo case. A schematic diagram of the process is shown in Figure 3.

The use of fields to propose pharmacophore features in AutoPH4

AutoPH4 uses fields, which are three-dimensional grids representing the spatial distribution of different properties (Table 1), to propose pharmacophore features. The fields available to AutoPH4 are as follows.

Electrostatic maps20 combine non-linear Poisson-Boltzmann electrostatics with van der Waals probe energies to produce potential energy iso-surfaces for mobile oxygen and hydrogen particles carrying partial charges q_O and q_H, subject to spatial Lennard-Jones van der Waals potentials u_O and u_H. The electrostatic potential is solved for using the equation

$$\nabla \cdot \nabla \varphi + C q_O e^{-[q_O \varphi + u_O]/kT} + 2 C q_H e^{-[q_H \varphi + u_H]/kT} + \rho = 0,$$

where φ denotes the electrostatic field, C the bulk concentration of water, and ρ the charge distribution of a macromolecule under consideration. The function q_O φ + u_O describes the potential energy landscape of the implicit “oxygen” particle, and q_H φ + u_H the potential energy landscape of the implicit “hydrogen” particle. These two potentials represent the free energy–minimizing energy landscapes of the screening solvent particles. An additional “hydrophobic” potential, −(q_O + q_H)φ + u_C, is defined (where u_C is the Lennard-Jones potential for carbon). The resulting maps are in kcal mol−1.

Non-bonded contact preference maps21 are probability densities that indicate the likelihood of a contact between a protein receptor and either a polar or hydrophobic ligand atom. The method is knowledge-based and similar to X-Site39 and SuperStar.40 Non-bonded contact data from protein-ligand crystallographic structures are used to fit statistical spatial distribution functions that encode the probabilities of observing non-bonded atoms contacting one another. The distributions are used to compute the likelihood of observing polar and hydrophobic ligand atoms in regions in and around the protein pocket. The contact preference levels are reported as percent probabilities. Regions of high probability suggest regions with potentially important protein-ligand contacts.

Interaction potential maps in MOE38 represent the potential energy surface between a united-atom probe and the receptor, similar to GRID.41,42 The probes represent different chemical functionalities. The probe potential is a combination of electrostatic, van der Waals, and geometric terms (for hydrogen bond donors and acceptors), which differ depending on the chosen probe. The probes used here were Na+ and N1+ to represent charged groups or donors, and O to represent acceptors.

Receptor annotation points represent geometrically favorable locations in the binding pocket for interacting ligand pharmacophore features. In MOE,38 acceptor points are generated by projecting vectors along polar hydrogen bonds to a point at the appropriate distance for a ligand acceptor heavy atom. Donor points are generated by projecting vectors along ideal lone pair angles to a point at the ideal distance for a hydrogen bond–donating group on the ligand. Annotation points residing inside the protein’s van der Waals volume are discarded. The receptor annotation point field is represented by a collection of spheres around each annotation point. The acceptor receptor annotation field is augmented with a spherical field around metal cations (the omni-metal field) that predicts whether ligand acceptor groups will interact with metals.

The omni-cation field is a combination of the polar contact preferences around the N+ group in lysine and the polar and hydrophobic contact preferences21 around the guanidinium group in arginine. The field is meant to help detect cation-π interactions. The omni-acceptor field is an omnidirectional distance field around carbonyl, carboxylate, and hydroxyl oxygens in the receptor structure.

Ligand annotation points are points in space with attached labels that reflect the presence of ligand pharmacophore features such as donors, acceptors, aromatic rings, and hydrophobic groups. We used the Unified scheme in MOE to determine the types and locations of ligand annotation points. In this scheme, atom typing is accomplished using pattern-matching rules, and point locations are either atom-centered, functional group–centered (e.g., ring centroids or alkyl chain centroids), or projected into space along polar hydrogens and lone pair vectors. The ligand annotation point field is represented by a sphere around each ligand annotation point. We considered only the atom-centered and functional group–centered annotation points here.

The ligand distance and dummy distance fields are spherical regions around the ligand atoms or dummy atoms. (Dummy atoms are used to indicate the extent of available space in the pocket; they are obtained from, for example, site-finding methods or a set of docked ligands, and are used by AutoPH4 with the apo and fragment options.) The distance cutoff for each field was set to 1 Å. The atoms used for the ligand distance field are the same as those used for the ligand annotation field.

The FFT map uses a fast Fourier transform (FFT) modified from protein-protein docking in MOE,38 which was developed to explore protein conformations and poses efficiently. For each benzene probe, a set of rotations is generated, and for each rotation, a grid representing the electrostatic and van der Waals fields is created. The protein interaction energy for all translations of a given rotation is computed using an FFT convolution. The minimum energy of all rotations at each protein grid-point is recorded, and the resulting field is the minimum interaction energy between the receptor and probe.

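The translation scan behind the FFT map can be illustrated with a minimal NumPy sketch. It assumes that the receptor field and each rotated probe are already sampled on the same periodic grid and that the interaction energy is a plain sum of products of the two grids; the function names and the random toy grids are hypothetical and are not taken from MOE or AutoPH4.

```python
import numpy as np


def translation_energies(receptor_grid, probe_grid):
    """Energy of the probe at every grid translation (circular cross-correlation).

    E[t] = sum_x receptor_grid[x + t] * probe_grid[x], evaluated for all
    translations t at once through the correlation theorem.
    """
    f_receptor = np.fft.fftn(receptor_grid)
    f_probe = np.fft.fftn(probe_grid)
    return np.fft.ifftn(f_receptor * np.conj(f_probe)).real


def min_energy_map(receptor_grid, rotated_probe_grids):
    """Lowest energy over all probe rotations at each grid point."""
    maps = [translation_energies(receptor_grid, p) for p in rotated_probe_grids]
    return np.minimum.reduce(maps)


# Toy data (assumption): random grids stand in for real receptor and probe fields.
rng = np.random.default_rng(0)
receptor = rng.normal(size=(8, 8, 8))
probes = [rng.normal(size=(8, 8, 8)) for _ in range(3)]
fft_map = min_energy_map(receptor, probes)
```

One FFT-based pass replaces an explicit loop over all translations, which is what makes scanning many rotations affordable.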
Placement and size of pharmacophore features

In AutoPH4, pharmacophore features are initially placed where the volumes of selected fields overlap, but their final placement can change after clustering. With the holo option, the features of the model are determined from the protein structure, with ligand information used only in their placement or removal (features are placed to coincide with corresponding ligand features). The default feature size is 1.2 Å for Acc and Don features, 1.4 Å for aromatic centers, and 1.6 Å for hydrophobes. Hydrophobic features closer than 1 Å to each other are clustered into a single feature with an increased radius but a maximum size below 3 Å. A Boolean “and” connection is automatically defined for certain overlapping features (e.g., “Don and Cat” or “Don and Acc”) if the detected fields indicate that a charged hydrogen bond donor or simultaneous acceptor-donor feature is required.

Apo pharmacophore models are potentially problematic, because any real molecule can interact with at most a subset of the dozens of potential donor and acceptor groups in a druggable protein pocket, and it may be impractical to use pharmacophore models with a large number of features in a pharmacophore search due to the computational cost.43 Clustering similar features is thus key in this case, to reduce the overall complexity of the query. With the apo option, features are created by covering contiguous overlapping regions arising from different fields with spheres whose radii depend on the feature type (1.4 Å for Acc and Don, and 1.6 Å for Aro and Hyd). Contiguous field regions tend to be small, and in most cases a single sphere centered on the region can encompass the entire region. For larger and/or irregular field regions that require more than one sphere to encompass the region, features are clustered together. Overlapping Don and Acc spheres are combined into one feature using the “or” Boolean operation. Hydrophobe spheres within 1.5 Å of each other are clustered into a single Hyd feature with an increased radius (though not exceeding 3 Å).

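The hydrophobe clustering rule can be sketched as follows. The text specifies only the distance cutoff (1 Å with the holo option, 1.5 Å with the apo option), the default 1.6 Å radius, and the 3 Å cap; the single-linkage grouping, the centroid placement, and the way the radius grows with the cluster spread are assumptions made here for illustration, and `cluster_hydrophobes` is a hypothetical name.

```python
import math


def cluster_hydrophobes(centers, cutoff=1.0, base_radius=1.6, max_radius=3.0):
    """Merge hydrophobe centers closer than `cutoff` (single linkage).

    Returns a list of (centroid, radius) tuples, one per cluster.
    """
    n = len(centers)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(centers[i], centers[j]) < cutoff:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(centers[i])

    features = []
    for members in groups.values():
        centroid = tuple(sum(axis) / len(members) for axis in zip(*members))
        spread = max(math.dist(centroid, m) for m in members)
        # Assumption: grow the default radius by the cluster spread, capped at max_radius.
        features.append((centroid, min(base_radius + spread, max_radius)))
    return features


holo_features = cluster_hydrophobes([(0, 0, 0), (0.8, 0, 0), (5, 0, 0)], cutoff=1.0)
apo_features = cluster_hydrophobes([(0, 0, 0), (1.4, 0, 0), (2.8, 0, 0), (4.2, 0, 0)], cutoff=1.5)
```

In the first call the two nearby centers merge into one larger sphere while the distant one keeps the default radius; in the second, a chain of four centers collapses into a single feature whose radius is limited by the 3 Å cap.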
We note that although for consistency all results presented here include hydrophobes, there are reasons to leave out hydrophobic features altogether when using the apo option, especially for targets with a small binding site and a large number of features. First, the relatively large number of features in some apo applications will reduce the relative weight of important features, and removing hydrophobes should thus lead to some improvement in virtual screening performance in such cases. Second, in binding sites crowded with pharmacophore features, hydrophobes can potentially lead to worse ligand placement. Finally, search speeds are also greatly improved if hydrophobes are left out.

Optimization of field settings for pharmacophore generation

We determined the fields available to AutoPH4 (in Table 1 and described above) based on their expected usefulness for pharmacophore discovery. The choices of fields and settings governing them were subsequently optimized on a 126-compound diverse subset of the kinase database. The adjustable parameters (e.g., the size and distribution of the fields) were varied to simultaneously minimize the false-negative rate, as determined by MOE’s non-bonded interaction detection, and the false-positive rate, as determined by our probes. For each varied parameter, five to ten possible values were considered on a grid. This optimization was performed for the acceptor, donor, and aromatic pharmacophore features, each for both the holo and apo options. We then validated the best one or two parameter settings (Table S2) using a diverse set of 9,654 targets as detailed below.

Validation of pharmacophore feature generation on a diverse set of targets

The optimized parameters for pharmacophore generation were validated on 9,654 protein complexes spanning eight target classes. The protein structures were taken from MOE project databases, and structures with no cognate ligands were removed. These databases required no further preparation aside from the calculation of force field charges using the MMFF94x force field in MOE. The two metalloenzyme target complexes were an exception; these were processed with MOE’s Protonate3D algorithm using an artificially high pKa value for histidine residues in order to produce the correct tautomers around metal ions.

To validate the field-based feature generation step in AutoPH4, we first determined whether the generated pharmacophore models could recover the cognate ligands without varying the ligand position and conformation; this corresponds to a pharmacophore search using absolute positions in MOE. This test is only meaningful with the holo option, because with the apo option a number of pharmacophore features are generated that are not covered by the cognate ligand. Next, the false-negative rate was assessed using non-bonded contact detection in MOE, whereby we determined whether all detected hydrogen bonds between the cognate ligand and the protein originated from atoms for which an Acc or Don feature was generated, and whether all H-π interactions originated from Aro centers.

Hydrogen bond and H-π detection was performed in MOE by determining the acceptor and donor strengths from extended Hückel calculations44 (the cutoffs were −0.2 kcal mol−1 for hydrogen bonds and −0.5 kcal mol−1 for arene interactions). The false-positive rate was then determined by translating and rotating fragment probes fully in the pharmacophore sphere and checking whether they were capable of hydrogen bonding or making H-π interactions with the protein. Positions in which the probe clashed with the receptor (>2 kcal mol−1 repulsion using MOE’s ContactEnergy function) were ignored. An NH4+ probe was used for Don features (7 × 100 positions tested), formaldehyde for Acc features (7 × 1000 positions tested), and a benzene ring for Aro centers (7 × 200 positions tested). We expected the probe method to detect false positives equally well for holo and apo options, because it does not depend on information from the cognate ligand.

The final pharmacophore models were produced by reconciling the field and probe results. This was done by removing any features generated from the fields that the probe results indicated were false positives and, in the holo case, adding false negatives found by H-bond and H-π interaction detection. All pharmacophore models for the later benchmark test of our AutoPH4-based virtual screening workflow were automatically generated using this procedure.

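The reconciliation step amounts to a set difference followed, with the holo option, by a union. A minimal sketch is given below; the feature labels are hypothetical placeholders, and real features would also carry positions and radii.

```python
def reconcile_features(field_features, false_positives, missed_features=(), holo=True):
    """Combine field-derived features with the probe and interaction-detection results.

    field_features  : features proposed from the fields
    false_positives : field features that the probe scan could not confirm
    missed_features : interactions detected for the cognate ligand but absent
                      from the field features (used only with the holo option)
    """
    rejected = set(false_positives)
    final = [f for f in field_features if f not in rejected]
    if holo:
        final += [f for f in missed_features if f not in final]
    return final


# Hypothetical labels, for illustration only.
holo_model = reconcile_features(["Acc1", "Don1", "Aro1"], ["Don1"], ["Don2"], holo=True)
apo_model = reconcile_features(["Acc1", "Don1", "Aro1"], ["Don1"], ["Don2"], holo=False)
```

With the apo option no cognate ligand information is available, so only the removal of probe-rejected features applies.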
The fragment use-case

The fragment (partially filled site) option in AutoPH4 is a mixture of the holo and apo options: The higher-quality holo settings are applied in the vicinity of the existing fragment or ligand, and apo settings are applied in other parts of the site. This option is most useful when the binding site of a fragment is confirmed and a pharmacophore model is desired to help with growing the fragment to enhance complementarity to the receptor. In our example application to protein kinase B, the extent of the binding site was determined using SiteFinder in MOE and passed to AutoPH4 in the form of dummy atoms.

Execution of automated pharmacophore model generation with AutoPH4

Automated pharmacophore model generation was performed using command line scripts to allow full parallelization on a computing cluster. Operation times given here are for a single core of a 3.60 GHz Intel Xeon X5687 processor. Each operation was measured as an average over the 126 kinase receptors used in the optimization of the field generation step; this is representative of the typical speed per complex. With the holo option, pharmacophore model generation required an average of 22.4 seconds per complex; this increased to 76.3 seconds when combined with the removal of false-positive features detected by probe scans and addition of missed cognate ligand features. With the apo option, pharmacophore model generation required an average of 27.8 seconds per complex; this increased to 92.7 seconds when combined with probe-based feature removal. Generating pharmacophores for multiple complexes is trivially parallelizable on a cluster. Although the fragment option was not validated on the diverse set, we expect that it would take approximately the sum of the time required for the holo and apo options.

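As a rough consistency check, the per-complex timings above can be combined with the 9,654-complex validation set and the 1500 cores mentioned earlier to estimate the wall-clock time. The estimate assumes that every cluster core performs like the benchmarked Xeon core, that each complex takes the kinase-set average, and that there is no scheduling or I/O overhead; `wall_clock_seconds` is a hypothetical helper.

```python
import math


def wall_clock_seconds(n_complexes, seconds_per_complex, n_cores):
    """Idealized wall time: complexes run in batches of n_cores, one complex per core."""
    batches = math.ceil(n_complexes / n_cores)
    return batches * seconds_per_complex


# Timings include probe-based feature removal (76.3 s holo, 92.7 s apo).
holo_minutes = wall_clock_seconds(9654, 76.3, 1500) / 60
apo_minutes = wall_clock_seconds(9654, 92.7, 1500) / 60
```

Under these assumptions the 9,654 complexes need seven batches, giving roughly 9 minutes for the holo option and roughly 11 minutes for the apo option, which is in line with the statement that the full set was processed in minutes.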
Benchmark test of our AutoPH4-based virtual screening workflow using the DUD-E dataset

A virtual screening workflow based on AutoPH4

We developed a virtual screening workflow that used AutoPH4 to generate pharmacophore models for the placement and scoring steps. The pharmacophore models included pharmacophore features and an excluded volume from the binding-site residues. AutoPH4 was embedded in the workflow such that the pharmacophore model generation was entirely automated with both the holo and apo options; no entry contained any manual modifications.

Molecules first went through a pharmacophore search in MOE that required at least one polar or aromatic feature of the model to be satisfied; the few molecules that failed this test were discarded (ranked bottom) and the remainder placed in the pocket. We used a normalized pharmacophore score representing the proportion of features hit (with a value of 20 corresponding to all features being hit and 0 corresponding to none being hit). (The root mean square distance (RMSD) between the pharmacophore model and features of the placed molecule was not used in ranking or scoring.) Molecule poses that hit the highest number of pharmacophore features were selected, and similar poses were then removed using detection of hydrogen bond and hydrophobic interaction patterns implemented in MOE’s docking engine.38 The top poses (at most 20) were selected and refined with molecular mechanics minimization inside the rigid protein binding site using the MMFF94x force field. R-field solvation terms and their energies were evaluated at the minimized position using the MM/GBVI method45 in MOE. The MM/GBVI energy and pharmacophore score were summed to give the final score for each pose. When comparing molecules, only the top-scoring pose for each was considered.

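The normalized pharmacophore score and the pose pre-selection can be sketched as below. This is a simplification under stated assumptions: the removal of similar poses is omitted, poses are represented as hypothetical `(pose_id, n_features_hit)` pairs, and the MM/GBVI term is left out because the sign convention used when summing the two terms is not given in the text.

```python
def pharmacophore_score(n_features_hit, n_features_total, scale=20.0):
    """Proportion of model features hit, scaled so that all features hit gives `scale`."""
    if n_features_total <= 0:
        raise ValueError("the model must contain at least one feature")
    return scale * n_features_hit / n_features_total


def preselect_poses(poses, max_poses=20):
    """Order poses by the number of features hit (most first) and keep at most max_poses."""
    return sorted(poses, key=lambda pose: pose[1], reverse=True)[:max_poses]


# Hypothetical poses of one molecule against a six-feature model.
poses = [("pose_a", 3), ("pose_b", 5), ("pose_c", 4)]
selected = preselect_poses(poses, max_poses=2)
scores = [pharmacophore_score(hits, 6) for _, hits in selected]
```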
It is worth noting that in terms of the studied performance measures, the workflow performed sufficiently well using just the pharmacophore score (Table S7). If needed, the workflow could thus be run without minimizations and MM/GBVI scoring, which would conserve resources. Additionally, although all tests discussed here were performed using the MMFF94x force field, the overall performance was similar with the alternative Amber14-EHT force field in MOE (Table S7).

Preparation of DUD-E complexes

Receptor and ligand structures were downloaded from the DUD-E website.23 Active and decoy sets were downloaded and used as .sdf files (except for fgfr1 and fa10, which were reconstructed from .ism files) for consistency and to take advantage of stereochemistry and protomer expansions of the compounds available on the DUD-E website. Omega46,47 was used for conformer generation, creating 250 conformers per ligand; otherwise, default settings were used. Compounds that failed due to strict stereo settings were enumerated with the OEFlipper API and de-duplicated as part of the same 250-conformer ensemble using the OESliceEnsemble API.

From the 102 targets, we eliminated eight prior to the test for the following reasons: There were major clashes between the ligand and the receptor that could not easily be relieved (aofb, pgh1, and hivinit, corresponding to PDB codes 1S3B, 2OYU, and 3NF7, respectively); the ligand was a strong Michael acceptor that would be likely to bind covalently (mp2k1, PDB code 3EQH); the ligand had a 21-carbon alkyl (heneicosan) chain in an unlikely conformation (rxra, PDB code 1MV9); the ligand was an unstable cyclic ketal (mcr, PDB code 2AA2); or the only hydrogen bond between the ligand and the protein was water mediated (ital and pygm, corresponding to PDB codes 2ICA and 1C8K, respectively). The remaining 94 crystal structures were used in the test. Most receptor complexes (except ace, adrb1, akt2, ampc, cah2, casp3, cdk2, dyr, fabp4, fkb1a, hivpr, hs90a, and parp1) were then re-downloaded from the PDB48 and prepared using MOE’s structure preparation and Protonate3D tools.

Although it was not the purpose of this study, the quality of pharmacophore placement for the DUD-E cognate ligands was high overall. In the majority of the cases in our test set, the placed cognate ligand exhibited a binding mode similar to that in the X-ray structure, with a median RMSD of 0.76 Å (distribution in SI). Six of the 91 ligands had an RMSD above 2 Å. The ligand with the highest RMSD (3.3 Å, DUD-E code xiap, PDB code 3HL5) was a typical ligand in that its buried portion was placed in almost perfect agreement with the X-ray structure, and the two poses showed the same hydrogen bonding pattern. The other half of the ligand, however, is fully solvated, and this was where the two conformations differed, giving rise to the high RMSD value (more detail in SI).

Metrics for workflow performance on the DUD-E benchmark set

Area under the curve (AUC) values were obtained from receiver operating characteristic (ROC) curves49 using the corresponding Scikit-learn functions.50,51 Enrichment52 (the ratio of actives to all compounds in a selected fraction of the ranked dataset divided by the ratio of actives to all compounds in the entire dataset, such that an enrichment value of one corresponds to random selection) was calculated using in-house Python tools and reported for different fractions of the ranked dataset. An additional measure of early enrichment is the Boltzmann-enhanced discrimination of the receiver operating characteristic (BEDROC) score,53 which contains an adjustable parameter, α. For consistency of comparison with the literature, we used α values of 321, 80.5, and 20, calculated with RDKit’s Scoring Module.54 With these values, 80% of the BEDROC scores are accounted for by 0.5%, 2%, and 8% of the top-ranked molecules, respectively. Confidence intervals for AUC, enrichment values, and BEDROC scores were calculated using bootstrapping55 and are presented here as notched box-plots,55 with significance evaluated according to recent guidelines.56

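The enrichment definition and the link between α and the top-ranked fraction can be made concrete with a short sketch. It is not the in-house tool or the RDKit implementation referred to above: `enrichment_factor` follows the definition given in the text, and `bedroc_weight_in_top` assumes the standard exponential BEDROC weighting, proportional to exp(−αx) over the normalized rank x in [0, 1]. Both function names are hypothetical.

```python
import math


def enrichment_factor(ranked_is_active, fraction):
    """Active rate in the top `fraction` of the ranking divided by the overall active rate.

    ranked_is_active: sequence of booleans, best-ranked compound first.
    """
    n_total = len(ranked_is_active)
    n_selected = max(1, int(round(fraction * n_total)))
    rate_selected = sum(ranked_is_active[:n_selected]) / n_selected
    rate_overall = sum(ranked_is_active) / n_total
    return rate_selected / rate_overall


def bedroc_weight_in_top(alpha, fraction):
    """Share of the exponential weight exp(-alpha * x) carried by the top `fraction`."""
    return (1.0 - math.exp(-alpha * fraction)) / (1.0 - math.exp(-alpha))


# Toy ranking: 2 actives among 10 compounds, both ranked at the top.
ranking = [True, True] + [False] * 8
ef_top20 = enrichment_factor(ranking, 0.2)

# The three alpha values from the text, paired with their top fractions.
weights = [bedroc_weight_in_top(a, z) for a, z in [(321, 0.005), (80.5, 0.02), (20, 0.08)]]
```

For each of the three pairs the weight share is close to 0.8, because α multiplied by the fraction is about ln 5 ≈ 1.61 in every case.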
Rank correlations were calculated using Kendall’s τ_B, a measure developed to handle ties,57,58 using the corresponding SciPy module.59 (Ties are arguably important when comparing virtual screening results, because compounds that do not dock into the binding pocket do not receive a score and hence will all have the same rank.) We calculated rank correlations both for the entire dataset (actives and decoys together) and for actives only. All reported rank correlations were significant (p < 0.05) and the majority were highly significant (p << 0.001).

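To show how ties enter the rank correlation, the following is a plain-Python sketch of Kendall's tau-b. It is an illustrative O(n²) implementation, not the SciPy routine used in the study (`scipy.stats.kendalltau` reports the tau-b variant by default); `kendall_tau_b` is a hypothetical name.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)).

    C and D count concordant and discordant pairs; Tx and Ty count pairs tied
    only in x or only in y. Pairs tied in both variables are ignored.
    """
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = math.sqrt((untied + ties_x) * (untied + ties_y))
    if denominator == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / denominator


# Toy example: the second method gives tied scores to pairs of compounds.
tau = kendall_tau_b([1, 2, 3, 4], [1, 1, 2, 2])
```

Tied pairs shrink the denominator rather than counting as disagreements, so two methods that leave many compounds unscored are not penalized for those shared ties.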
Note

The script to generate pharmacophore models using the AutoPH4 method is available on the SVL Exchange (https://svl.chemcomp.com/).

Supporting Information

Supporting Information includes the settings used for automated pharmacophore generation, rank correlations between our pharmacophore method and three virtual screening methods, comparison of results using MMFF94x and Amber14-EHT force fields using boxplots, comparison to docking methods using boxplots, the distribution of RMSD values for cognate placement, and the description of the potential bias in DUD-E due to multiple forms of actives.

Acknowledgments

The authors thank Michael Eastwood for helpful discussions, Jessica McGillen and Berkman Frank for editorial assistance, and Peter Skopp for pointing out the potential for bias in DUD-E tests due to multiple actives.

This study was conducted and funded internally by D. E. Shaw Research, of which D.E.S. is the sole beneficial owner and Chief Scientist, and with which S.J., M.F., B.C., and D.E.S. are affiliated.

References

1. Leach, A.R.; Gillet, V.J.; Lewis, R.A.; Taylor, R. Three-dimensional pharmacophore methods in drug discovery. J. Med. Chem. 2010, 53, 539–558.
2. Yang, S.Y. Pharmacophore modeling and applications in drug discovery: challenges and recent advances. Drug Disc. Today 2010, 15, 444–450.
3. Sanders, M.P.A.; McGuire, R.; Roumen, L.; de Esch, I.J.P.; de Vlieg, J.; Klomp, J.P.G.; de Graaf, C. From the protein’s perspective: the benefits and challenges of protein structure-based pharmacophore modeling. Med. Chem. Comm. 2012, 3, 28–38.
4. Śledź, P.; Caflisch, A. Protein structure-based drug design: from docking to molecular dynamics. Curr. Opinion Struct. Biol. 2018, 48, 93–102.
5. Sanders, M.P.A.; Barbosa, A.J.; Zarzycka, B.; Nicolaes, G.A.; Klomp, J.P.; de Vlieg, J.; Del Rio, A. Comparative analysis of pharmacophore screening tools. J. Chem. Inf. Model. 2012, 52, 1607–1620.
6. Joseph-McCarthy, D.; Alvarez, J.C. Automated generation of MCSS-derived pharmacophoric DOCK site points for searching multiconformation databases. Proteins 2003, 51, 189–202.
7. Ortuso, F.; Langer, T.; Alcaro, S. GBPM: GRID-based pharmacophore model: concept and application studies to protein-protein recognition. Bioinform. 2006, 22, 1449–1455.
8. Tintori, C.; Corradi, V.; Magnani, M.; Manetti, F.; Botta, M. Targets looking for drugs: a multistep computational protocol for the development of structure-based pharmacophores and their applications for hit discovery. J. Chem. Inf. Model. 2008, 48, 2166–2179.
9. Hu, B.; Lill, M.A. Protein pharmacophore selection using hydration-site analysis. J. Chem. Inf. Model. 2012, 52, 1046–1060.
10. Kaalia, R.; Kumar, A.; Srinivasan, A.; Ghosh, I. An ab initio method for designing multi-target specific pharmacophores using complementary interaction field of aspartic proteases. Mol. Inf. 2015, 34, 380–393.
11. Hu, B.; Lill, M.A. Exploring the potential of protein-based pharmacophore models in ligand pose prediction and ranking. J. Chem. Inf. Model. 2013, 53, 1179–1190.
12. Salam, N.K.; Nuti, R.; Sherman, W. Novel method for generating structure-based pharmacophores using energetic analysis. J. Chem. Inf. Model. 2009, 49, 2356–2368.
13. Loving, K.; Salam, N.K.; Sherman, W. Energetic analysis of fragment docking and application to structure-based pharmacophore hypothesis generation. J. Comput. Aided Mol. Des. 2009, 23, 541–554.
14. Chen, J.; Lai, L. Pocket v. 2: further developments on receptor-based pharmacophore modeling. J. Chem. Inf. Model. 2006, 46, 2684–2691.
15. Wolber, G.; Langer, T. LigandScout: 3D pharmacophores derived from protein-bound ligands and their use as virtual screening filters. J. Chem. Inf. Model. 2005, 45, 160–169.
16. Wieder, M.; Garon, A.; Perricone, U.; Boresch, S.; Seidel, T.; Almerico, A.M.; Langer, T. Common hits approach: combining pharmacophore modeling and molecular dynamics simulations. J. Chem. Inf. Model. 2017, 57, 365–385.
17. Tran-Nguyen, V.K.; Da Silva, F.; Bret, G.; Rognan, D. All in one: cavity detection, druggability estimate, cavity-based pharmacophore perception, and virtual screening. J. Chem. Inf. Model. 2019, 59, 573–585.
18. Yu, W.; Lakkaraju, S.K.; Raman, E.P.; Fang, L.; MacKerell, A.D. Pharmacophore modeling using site-identification by ligand competitive saturation (SILCS) with multiple probe molecules. J. Chem. Inf. Model. 2015, 55, 407–420.
19. Mortier, J.; Dhakal, P.; Volkamer, A. Truly target-focused pharmacophore modeling: a novel tool for mapping intermolecular surfaces. Molecules 2018, 23, 1959.
20. Labute, P. Electrostatic maps. Molecular Operating Environment (MOE), version 2015.10. Chemical Computing Group, Montreal, Canada, 2015.
21. Labute, P. Contact preference maps. Molecular Operating Environment (MOE), version 2015.10. Chemical Computing Group, Montreal, Canada, 2015.
22. Mysinger, M.M.; Carchia, M.; Irwin, J.J.; Shoichet, B.K. Directory of Useful Decoys, Enhanced (DUD-E): better ligands and decoys for better benchmarking. J. Med. Chem. 2012, 55, 6582–6594.
23. DUD-E, a database of useful decoys. http://dude.docking.org/ (accessed February 23, 2017).
24. Schmidtke, P.; Souaille, C.; Estienne, F.; Baurin, N.; Kroemer, R.T. Large-scale comparison of four binding site detection algorithms. J. Chem. Inf. Model. 2010, 50, 2191–2200.
25. Pérot, S.; Sperandio, O.; Miteva, M.A.; Camproux, A.C.; Villoutreix, B.O. Druggable pockets and binding site centric chemical space: a paradigm shift in drug discovery. Drug Disc. Today 2010, 15, 656–67.
26. Soga, S.; Shirai, H.; Kobori, M.; Hirayama, N. Use of amino acid composition to predict ligand-binding sites. J. Chem. Inf. Model. 2007, 47, 400–406.
27. Davis, A.M.; Teague, S.J. Hydrogen bonding, hydrophobic interactions, and failure of the rigid receptor hypothesis. Angew. Chem. 1999, 38, 736–749.
28. Murray, C.W.; Rees, D.C. The rise of fragment-based drug discovery. Nat. Chem. 2009, 1, 187–192.
29. Saxty, G.; Woodhead, S.J.; Berdini, V.; Davies, T.G.; Verdonk, M.L.; Wyatt, P.G.; Boyle, R.G.; Barford, D.; Downham, R.; Garrett, M.D.; Carr, R.A. Identification of inhibitors of protein kinase B using fragment-based lead discovery. J. Med. Chem. 2007, 50, 2293–2296.

30. Chaput, L.; Martinez-Sanz, J.; Saettel, N.; Mouawad, L. Benchmark of four popular virtual screening programs: construction of the active/decoy dataset remains a major determinant of measured performance. J. Cheminformatics 2016, 8, 56–72.

31. Lagarde, N.; Zagury, J.F.; Montes, M. Benchmarking data sets for the evaluation of virtual ligand screening methods: review and perspectives. J. Chem. Inf. Model. 2015, 55, 1297–1307.

32. Réau, M.; Langenfeld, F.; Zagury, J.F.; Lagarde, N.; Montes, M. Decoys selection in benchmarking datasets: overview and perspectives. Front. Pharmacol. 2018, 9, 11.

33. Ruiz-Carmona, S.; Alvarez-Garcia, D.; Foloppe, N.; Garmendia-Doval, A.B.; Juhos, S.; Schmidtke, P.; Barril, X.; Hubbard, R.E.; Morley, S.D. rDock: a fast, versatile and open source program for docking ligands to proteins and nucleic acids. PLoS Comput. Biol. 2014, 10, e1003571.

34. Zhang, X.; Wong, S.E.; Lightstone, F.C. Toward fully automated high performance computing drug discovery: a massively parallel virtual screening pipeline for docking and molecular mechanics/generalized Born surface area rescoring to improve enrichment. J. Chem. Inf. Model. 2014, 54, 324–337.

35. Coleman, R.G.; Sterling, T.; Weiss, D.R. SAMPL4 & DOCK3.7: lessons for automated docking procedures. J. Comput. Aided Mol. Des. 2014, 28, 201–209.

36. Chen, Z.; Li, H.L.; Zhang, Q.J.; Bao, X.G.; Yu, K.Q.; Luo, X.M.; Zhu, W.L.; Jiang, H.L. Pharmacophore-based virtual screening versus docking-based virtual screening: a benchmark comparison against eight targets. Acta Pharmacol. Sin. 2009, 30, 1694–1708.
37. Perzborn, E.; Roehrig, S.; Straub, A.; Kubitza, D.; Misselwitz, F. The discovery and development of rivaroxaban, an oral, direct factor Xa inhibitor. Nat. Rev. Drug Discov. 2011, 10, 61–75.

38. Molecular Operating Environment (MOE), version 2015.10. Chemical Computing Group, Montreal, Canada, 2015.

39. Laskowski, R.A.; Thornton, J.M.; Humblet, C.; Singh, J. X-SITE: Use of empirically derived atomic packing preferences to identify favorable interaction regions in the binding sites of proteins. J. Mol. Biol. 1996, 259, 175–201.

40. Nissink, J.W.M.; Verdonk, M.L.; Klebe, G. Simple knowledge-based descriptors to predict protein-ligand interactions. Methodology and validation. J. Comput. Aided Mol. Des. 2000, 14, 787–803.

41. Goodford, P.J. A computational procedure for determining energetically favorable binding sites on biologically important macromolecules. J. Med. Chem. 1985, 28, 849–857.

42. Boobbyer, D.N.; Goodford, P.J.; McWhinnie, P.M.; Wade, R.C. New hydrogen-bond potentials for use in determining energetically favorable binding sites on molecules of known structure. J. Med. Chem. 1989, 32, 1083–1094.

43. Kolossvary, I.; Guida, W.C. Compare Conformer: a program for the rapid comparison of molecular conformers based on interatomic distances and torsion angles. J. Chem. Inf. Comput. Sci. 1992, 32, 191–199.

44. Gerber, P.R. Charge distribution from a simple molecular orbital type calculation and non-bonding interaction terms in the force field MAB. J. Comput. Aided Mol. Des. 1998, 12, 37–51.
45. Labute, P. The generalized Born/volume integral implicit solvent model: estimation of the free energy of hydration using London dispersion instead of atomic surface area. J. Comput. Chem. 2008, 29, 1693–1698.

46. Hawkins, P.C.D.; Skillman, A.G.; Warren, G.L.; Ellingson, B.A.; Stahl, M.T. OMEGA, version 2.5.1.4, OpenEye Scientific Software, Santa Fe, NM.

47. Hawkins, P.C.; Skillman, A.G.; Warren, G.L.; Ellingson, B.A.; Stahl, M.T. Conformer generation with OMEGA: algorithm and validation using high quality structures from the Protein Databank and Cambridge Structural Database. J. Chem. Inf. Model. 2010, 50, 572–584.

48. Berman, H.M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.N.; Weissig, H.; Shindyalov, I.N.; Bourne, P.E. The protein data bank. Nucleic Acids Res. 2000, 28, 235–242.

49. Triballeau, N.; Acher, F.; Brabet, I.; Pin, J.P.; Bertrand, H.O. Virtual screening workflow development guided by the “receiver operating characteristic” curve approach. Application to high-throughput docking on metabotropic glutamate receptor subtype 4. J. Med. Chem. 2005, 48, 2534–2547.

50. SciKit-learn, version 0.19.2. http://scikit-learn.org/stable/ (accessed April 27, 2017).

51. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.

52. Pearlman, D.A.; Charifson, P.S. Improved scoring of ligand-protein interactions using OWFEG free energy grids. J. Med. Chem. 2001, 44, 502–511.

53. Truchon, J.-F.; Bayly, C.I. Evaluating virtual screening methods: good and bad metrics for the “early recognition” problem. J. Chem. Inf. Model. 2007, 47, 488–508.
54. RDKit: Open-source cheminformatics, module scoring. http://www.rdkit.org/Python_Docs/rdkit.ML.Scoring.Scoring-module.html (accessed May 12, 2017).

55. Nicholls, A. Confidence limits, error bars and method comparison in molecular modeling. Part 1: the calculation of confidence intervals. J. Comput. Aided Mol. Des. 2014, 28, 887–918.

56. Nicholls, A. Confidence limits, error bars and method comparison in molecular modeling. Part 2: comparing methods. J. Comput. Aided Mol. Des. 2016, 30, 103–126.

57. Gibbons, J.D.; Fielden, J.D.G. Nonparametric Measures of Association. SAGE Publications: California, 1993; pp. 3–15.

58. Pett, M.A. Nonparametric Statistics for Health Care Research: Statistics for Small Samples and Unusual Distributions. 2nd Ed. SAGE Publications: Singapore, 2016; pp. 3–15.

59. SciPy: Open source scientific tools for Python, version 1.0.0, scipy.stats.kendalltau. www.scipy.org (accessed April 10, 2019).

60. Rohatgi, A. WebPlotDigitizer, version 3.12. http://arohatgi.info/WebPlotDigitizer (accessed June 15, 2017).
Figures and Tables

Figure 1. Validation of field-based pharmacophore models generated by AutoPH4. Percentage of complexes satisfying a given criterion (y-axis) for different target classes (x-axis) with A) holo options and B) apo options. Criteria shown are whether hydrogen bond donor (Don), hydrogen bond acceptor (Acc), and aromatic (Aro) features could be validated by probes; whether all hydrogen bonds and aromatic interactions were found; and (in the holo case) whether the cognate satisfied the pharmacophore model.
Figure 2. Application of the fragment option in AutoPH4 to protein kinase B. Crystal structures are available for protein kinase B with A) a fragment and B) a full cognate ligand. A pharmacophore model was derived from a crystal structure of the fragment using the C) fragment, D) apo, and E) holo option in AutoPH4; also shown is F) an overlay of the fragment-based pharmacophore (from C) on the full active ligand. Pharmacophore features shown are hydrogen bond acceptors (blue), hydrogen bond donors (purple), aromatics (orange), and hydrophobes (green).
Figure 3. Schematic diagram of the AutoPH4 pharmacophore generation process.
Table 1. Fields and validation methods used in AutoPH4 (see Methods for more detailed descriptions of the fields and validation methods, and Tables S1 and S2 for associated conditions and settings).

| Name | Description | Use |
|---|---|---|
| Electrostatic map (PB)^a | Poisson-Boltzmann equation used to predict electrostatically preferred locations of hydrophobes and hydrogen bond acceptors and donors | apo and holo feature generation |
| Non-bonded contact preference map^a (CST) | Contours of probability densities indicating a percentage likelihood of a non-bonded contact with a particular ligand atom type, based on the Protein Data Bank (PDB) | apo and holo feature generation |
| Receptor annotation points^a | Points corresponding to pharmacophore features projected from the receptor | apo and holo feature generation |
| Omni-cation field | Field around the lysine and arginine side-chain N+ atoms derived from CST | apo and holo feature generation |
| Omni-acceptor field | Field around carbonyls and ether/alcohol oxygens, which is similar to CST distribution but limited to the central atoms of these groups | apo and holo feature generation |
| Omni-metal field | Spherical fields around metal ions | apo and holo feature generation |
| Ligand annotation points^a | Spherical fields around ligand pharmacophore annotation points as defined in MOE | holo feature generation |
| Ligand distance field | Corresponds to a Connolly-type surface that defines the ligand volume | holo feature generation |
| Dummy distance field | Corresponds to a Connolly-type surface that defines the volume around dummy atoms, generated by SiteFinder^b algorithms | apo feature generation |
| Interaction potential map^a | Use of probes to calculate the three-term interaction energy on a rectilinear grid^41,42 | apo feature generation |
| FFT map | Coarse-grained, grid-based contact mapping of the relevant part of the protein pocket using a fast Fourier transform to rapidly evaluate all translations; the technique is derived from protein-protein docking^b | apo feature generation |
| Interaction detection^a | Detection of hydrogen bond acceptors and donors with the MOE extended Hückel (EHT) model, and of H-π interactions with a geometric model | validation for missing interactions in holo |
| Probes | Ammonium ion (for donors), formaldehyde (for acceptors), or benzene (for aromatics) is rotated and translated in the pharmacophore feature sphere; a feature is accepted if the probe has at least one pose in which it does not clash with the receptor and can establish a non-bonded contact (H-bond or H-π) between the probe and protein | feature validation for both holo and apo |

^a Fields present in MOE.^38

^b As implemented in MOE.^38
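The acceptance rule in the Probes row of Table 1 can be illustrated with a deliberately simplified sketch. It assumes a single-point probe on a coarse grid rather than a rotated and translated molecular probe, and the names `feature_is_validated`, `clash_dist`, `contact_dist`, and `step`, together with their default values, are hypothetical choices made for illustration; they are not settings taken from AutoPH4 or MOE.

```python
import math
from itertools import product


def feature_is_validated(center, radius, receptor_atoms, partner_atoms,
                         clash_dist=2.5, contact_dist=3.5, step=0.5):
    """Return True if at least one probe position inside the feature sphere
    is clash-free and within contact distance of a partner atom.

    Assumptions: the probe is a single point, and the cutoffs (in angstroms)
    are illustrative values, not AutoPH4 parameters.
    """
    n = int(round(radius / step))
    ticks = [i * step for i in range(-n, n + 1)]
    for dx, dy, dz in product(ticks, repeat=3):
        if math.sqrt(dx * dx + dy * dy + dz * dz) > radius:
            continue  # grid point lies outside the feature sphere
        pos = (center[0] + dx, center[1] + dy, center[2] + dz)
        # Reject positions that overlap any receptor atom.
        if any(math.dist(pos, atom) < clash_dist for atom in receptor_atoms):
            continue
        # Accept the first position that can reach an interaction partner.
        if any(math.dist(pos, atom) <= contact_dist for atom in partner_atoms):
            return True
    return False


# Toy receptor: one atom at the origin that is also the interaction partner.
receptor = [(0.0, 0.0, 0.0)]
print(feature_is_validated((3.0, 0.0, 0.0), 1.0, receptor, receptor))
print(feature_is_validated((10.0, 0.0, 0.0), 1.0, receptor, receptor))
```

The early return mirrors the "at least one pose" wording: a single clash-free, contact-forming position is enough to keep the feature, so the search stops as soon as one is found.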
Table 2. Overall virtual screening performance of our AutoPH4-based workflow on DUD-E complexes (see Table S7 for performance details for each complex, as well as individual area under the curve (AUC) values and enrichment curves). Confidence intervals were obtained from bootstrapping.^55 BEDROC: Boltzmann-enhanced discrimination of the receiver operating characteristic; α: an adjustable parameter in the BEDROC score.

| Mode | Number of complexes | Median AUC^a | Targets exceeding AUC values | Median enrichment^b | Median BEDROC^c | Targets with BEDROC > 0.5 |
|---|---|---|---|---|---|---|
| holo | 94 | 0.786 | AUC > 0.5: 93 targets (98.9%) | 34.40 at 0.5% recovery | 0.59 with α = 321 | 57 (60.6%) with α = 321 |
| | | | AUC > 0.7: 79 targets (84.0%) | 24.09 at 1% recovery | 0.40 with α = 80.5 | 32 (34.0%) with α = 80.5 |
| | | | AUC > 0.9: 13 targets (13.8%) | 3.19 at 20% recovery | 0.43 with α = 20 | 33 (35.1%) with α = 20 |
| apo | 94 | 0.735 | AUC > 0.5: 91 targets (96.8%) | 17.35 at 0.5% recovery | 0.29 with α = 321 | 25 (26.6%) with α = 321 |
| | | | AUC > 0.7: 57 targets (60.6%) | 13.66 at 1% recovery | 0.24 with α = 80.5 | 14 (14.9%) with α = 80.5 |
| | | | AUC > 0.9: 5 targets (5.3%) | 2.70 at 20% recovery | 0.29 with α = 20 | 16 (17.0%) with α = 20 |

^a Medians and 95% confidence intervals from bootstrapping are 0.786 (0.770–0.809) for the holo option and 0.735 (0.704–0.754) for the apo option.

^b For the three enrichment levels, medians and 95% confidence intervals from bootstrapping are 34.40 (30.21–41.24), 24.09 (21.39–29.50), and 3.19 (3.09–3.39), respectively, for the holo option; and 17.35 (11.55–21.24), 13.66 (8.88–15.42), and 2.71 (2.48–2.99), respectively, for the apo option.

^c For the three enrichment levels, medians and 95% confidence intervals from bootstrapping are 0.40 (0.37–0.44), 0.59 (0.51–0.65), and 0.43 (0.39–0.48), respectively, for the holo option; and 0.24 (0.16–0.29), 0.29 (0.21–0.38), and 0.29 (0.24–0.35), respectively, for the apo option.
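As a rough illustration of how the median BEDROC values and bootstrapped confidence intervals in Table 2 could be computed, the following sketch implements BEDROC as a rescaled robust initial enhancement (RIE) and a percentile bootstrap for the median. This is a plain-Python sketch under stated assumptions, not the authors' code: the function names, the number of resamples (`n_boot`), the seed, and the toy rankings are all hypothetical, and the bootstrap settings actually used are not given in this section.

```python
import math
import random
import statistics


def bedroc(ranked_labels, alpha=20.0):
    """BEDROC for a ranked list (best score first); 1 = active, 0 = decoy."""
    n_total = len(ranked_labels)
    n_act = sum(ranked_labels)
    if n_act == 0 or n_act == n_total:
        raise ValueError("need at least one active and one decoy")
    ratio = n_act / n_total
    # RIE: exponentially weighted ranks of the actives, divided by the
    # weight expected for actives placed uniformly at random.
    weight_sum = sum(math.exp(-alpha * rank / n_total)
                     for rank, label in enumerate(ranked_labels, start=1)
                     if label)
    expected = (1.0 / n_total) * (1.0 - math.exp(-alpha)) / (
        math.exp(alpha / n_total) - 1.0)
    rie = weight_sum / (n_act * expected)
    # Rescale RIE between its worst and best attainable values.
    rie_min = (1.0 - math.exp(alpha * ratio)) / (ratio * (1.0 - math.exp(alpha)))
    rie_max = (1.0 - math.exp(-alpha * ratio)) / (ratio * (1.0 - math.exp(-alpha)))
    return (rie - rie_min) / (rie_max - rie_min)


def bootstrap_median_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = medians[int(tail * n_boot)]
    upper = medians[min(n_boot - 1, int((1.0 - tail) * n_boot))]
    return lower, upper


# Hypothetical per-target rankings (not data from the paper).
rankings = [
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
]
scores = [bedroc(r, alpha=20.0) for r in rankings]
print(statistics.median(scores), bootstrap_median_ci(scores))
```

Larger α values concentrate the weight on the very top of the ranked list, which is why Table 2 pairs α = 321, 80.5, and 20 with the 0.5%, 1%, and 20% recovery levels.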
Table 3. Comparison of the virtual screening performance of our AutoPH4-based workflow with other methods. Values were calculated for the same set of targets unless otherwise indicated. (See SI for additional boxplots.) AUC: area under the curve; BEDROC: Boltzmann-enhanced discrimination of the receiver operating characteristic; α: adjustable parameter in the BEDROC score; EF: enrichment factor.

| Method | Median AUC^a (n = 38) | Median BEDROC scores at α = 80.5^b (n = 94) | % of targets where BEDROC of AutoPH4 is highest of the five methods (n = 94) | Median EF at 0.5 / 2 / 8% recovery^c (n = 94) | Median AUC^d (n = 22) | Median EF at 1 / 20% recovery^d (n = 22) | Median AUC^e (n = 34) | Median AUC^f (n = 94) | Median EF at 1% recovery^f (n = 94) | Median AUC^g (n = 10) | Median BEDROC^g (α = 20) | Median EF at 1% recovery^g |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AutoPH4 holo | 0.79 | 0.40 | 33.0 | 34.4 / 15.8 / 6.0 | 0.76 | 16.1 / 2.9 | 0.78 | 0.785 | 24.1 | 0.748 | 0.36 | 17.70 |
| AutoPH4 apo | 0.75 | 0.24 | | 17.4 / 9.4 / 4.5 | 0.74 | 9.3 / 2.8 | 0.73 | 0.735 | 13.7 | 0.660 | 0.20 | 5.73 |
| LigandScout holo | 0.51 | | | | | | | | | | | |
| GOLD | | 0.40 | 28.7 | 18.4 / 11.5 / 5.7 | | | | | | | | |
| Glide | | 0.37 | 25.5 | 18.5 / 10.2 / 5.2 | 0.79 | 23.5 / 3.3 | | | | | | |
| FlexX | | 0.22 | 7.4 | 9.1 / 6.1 / 4.2 | | | | | | | | |
| Surflex | | 0.21 | 5.3 | 8.3 / 5.9 / 3.6 | | | | | | 0.760 | 0.29 | 11.57 |
| Autodock Vina | | | | | 0.68 | 9.3 / 2.3 | 0.69 / 0.72 | | | | | |
| DOCK^h | | | | | | | | 0.667 / 0.738 | 7.9 / 11.3 | 0.720 | | |
| IChem/Shaper2 apo | | | | | | | | | | 0.635 | 0.14 | 1.75 |

^a Complexes common to both this study and the paper by Wieder et al.^16 were evaluated, and the data based on X-ray structures (marked “PDB” in Wieder et al.) were used. In the latter, three DUD-E complexes were replaced by other complexes of the same target from the PDB to improve performance, whereas we used the original DUD-E complexes.

^b For docking method data, see the SI of the paper by Chaput et al.^30

^c Enrichment values were calculated by digitizing Figure 1 of the paper by Chaput et al.^30 using WebPlotDigitizer.^60 Published enrichment values for the docking programs were calculated using a formula^30 different from the common definition^52 used here; to obtain the commonly used enrichment values, the published data were divided by the recovery rate expressed as a percentage (i.e., 0.5%, 2%, and 8%). Note that the enrichment data quoted from Chaput et al.^30 relate to 102 targets, not 94.

^d See the paper by Ruiz-Carmona et al.^33 for details on the complexes evaluated.

^e The first value for Autodock Vina is without minimization and the MM/GBSA calculation, and the second value is with both.^34

^f The first value listed refers to the default optimization and the second value to the optimized Chemgrid/sampling combination.^35

^g See the paper by Tran-Nguyen et al.^17 for the complexes evaluated. The two quoted AutoPH4 values refer to holo/apo performance. For IChem/Shaper2, the best-performing method (referred to in Table 2 of the reference as “Shaper2 TotE PLP”) is quoted. The BEDROC and EF values were calculated by digitizing Figure 5 of the paper^17 using WebPlotDigitizer.^60

^h Values reported are for DOCK version 3.7 except for the “Median AUC (n = 10)” column, for which DOCK version 3.6 was used.
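The enrichment factor and the rescaling described in footnote c can be sketched as follows. The sketch assumes the widely used EF definition (active rate in the top-ranked fraction divided by the active rate in the whole library); the function names and the numbers in the usage lines are hypothetical and are not data from this paper or from Chaput et al.

```python
def enrichment_factor(ranked_labels, fraction):
    """Common EF: active rate in the top `fraction` of a ranked list
    (best score first; 1 = active, 0 = decoy) divided by the active
    rate in the whole list."""
    n_total = len(ranked_labels)
    n_top = max(1, round(fraction * n_total))
    actives_top = sum(ranked_labels[:n_top])
    actives_all = sum(ranked_labels)
    return (actives_top / n_top) / (actives_all / n_total)


def to_common_ef(published_value, recovery_percent):
    """Rescaling from footnote c: divide the published value by the
    recovery rate expressed as a percentage (e.g. 0.5, 2, or 8)."""
    return published_value / recovery_percent


# Hypothetical library: 2 actives ranked first among 100 compounds.
labels = [1, 1] + [0] * 98
print(enrichment_factor(labels, 0.10))

# Illustrative (made-up) published value of 10.0 at 2% recovery.
print(to_common_ef(10.0, 2.0))
```

Keeping the rescaling in its own function makes explicit that values digitized from another paper were converted before being compared with EF values computed by the common definition.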
Table of Contents Graphic