
REVIEWS Drug Discovery Today  Volume 21, Number 8  August 2016

Reviews  INFORMATICS

a better understanding of the data in machine learning. Descriptor selection is an important step for several reasons [10]: (i) using only a few descriptors increases the interpretability and understanding of the resulting models; (ii) it can reduce the risk of overfitting from noisy, redundant molecular descriptors; (iii) it can provide faster and more cost-effective models; and (iv) it removes the activity cliff. However, noisy, redundant, or irrelevant descriptors should be removed in such a way that the dimension of the input space is reduced without any loss of significant information [3]. In this review, we provide an update on, and a brief explanation of, commonly used descriptors, with a particular emphasis on their selection approaches for the development of more reliable, predictable, and generalized QSAR models.

Molecular descriptors
Despite great advances in the field of drug design, the use of descriptors to define the molecular structure of biologically active compounds is the main method utilized to discover new lead molecules. Descriptors are the chemical characteristics of a molecule in numerical form, used for QSAR/QSPR studies. Fig. 1 depicts the basic definition of these descriptors. The mathematical representation of these descriptors has to be invariant to the size of the molecule and the number of atoms it contains to enable model building with statistical methods. Molecular descriptors have become the most significant features used in QSAR/QSPR modeling. The information encoded by descriptors generally depends on the kind of molecular representation and the defined algorithm for its calculation. Descriptor classes include topological indices and geometrical, constitutional, and physicochemical descriptors.

Constitutional descriptors are simple, commonly used descriptors reflecting the molecular composition of a compound without any information about its topology. The most common constitutional descriptors are number of atoms, bond count, atom type, ring count, and molecular weight (MW). These descriptors are insensitive to any conformational change and, thus, do not distinguish among isomers.

Topological and geometrical descriptors
Recent advances in strategies of lead discovery, drug design, virtual screening, combinatorial library design, and discrimination

[Figure 1 panels, recovered from the image text]
Topological: based on molecular graphs; represent the connectivity of atoms in molecules; used for modeling biological, physicochemical, and pharmacokinetic properties; e.g., Wiener, Zagreb, connectivity indices, etc.
Geometrical: calculated from the 3D coordinates of the atoms; capture the 3D information regarding the molecular size, shape, and atom distribution; e.g., WHIM, MoRSE, GETAWAY, etc.
Constitutional: simple and commonly used descriptors reflecting the chemical information of a molecule without any information on atom connectivity; e.g., atom and bond counts, MW, etc.
Thermodynamic: used to relate chemical structure to observed chemical behavior; e.g., HF (heat of formation), molRef (molar refractivity), AlogP, etc.
Electronic: used to describe electronic aspects of the molecule or atoms, bonds, and molecular fragments; e.g., dipole moment, HOMO, LUMO energy, etc.
FIGURE 1
Representation of molecular descriptors used in quantitative structure–activity relationship (QSAR) modeling.
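As an illustration of the constitutional class in Fig. 1, the following minimal Python sketch derives atom counts and MW from a condensed formula. The function name `constitutional_descriptors`, the small `ATOMIC_MASS` table, and the flat formula syntax (no parentheses) are assumptions made for this example and are not taken from any of the programs discussed in this review. The two inputs are the isomers ethanol and dimethyl ether, which receive identical values because no connectivity information is used.

```python
import re

# Standard atomic masses (g/mol) for a few common elements; an illustrative subset.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def constitutional_descriptors(formula):
    """Return per-element atom counts, total atom count, and molecular weight.

    Only flat condensed formulas such as 'CH3CH2OH' are handled (no parentheses).
    """
    counts = {}
    # Each match is an element symbol followed by an optional multiplier.
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    mw = sum(ATOMIC_MASS[element] * n for element, n in counts.items())
    return {
        "atom_counts": counts,
        "n_atoms": sum(counts.values()),
        "mw": round(mw, 3),
    }


ethanol = constitutional_descriptors("CH3CH2OH")
dimethyl_ether = constitutional_descriptors("CH3OCH3")

print(ethanol)
print(dimethyl_ether)
print("Isomers indistinguishable:", ethanol == dimethyl_ether)
```

Both molecules reduce to the same composition, so any model built only on such descriptors cannot separate them; topological or geometrical descriptors are needed for that.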


between database searching emphasize the role of topological descriptors in drug discovery. Topological indices (TIs) are 2D descriptors that consider the internal atomic arrangement of compounds [11]. They are derived from the topological representation of molecules and can be considered as structure-explicit descriptors. These indices encode information about molecular size, shape, branching, presence of heteroatoms, and multiple bonds in numerical form. TIs represent the connectivity of atoms in a molecule through the nature of its chemical bonds. They have a significant role in the modeling of different physicochemical properties, biological activities, and pharmacokinetic properties. A topological representation of a molecule is presented as a molecular graph. In mathematical terms, this graph is denoted as G = (V,E), where V is a set of vertices that correspond to the atoms of the molecule and E is a set of elements representing the binary relationship between pairs of vertices. These chemical graphs represent a non-numeric form of the molecular structure; however, numerical translation of the graph is essential for the calculation of topological descriptors [12]. A variety of descriptors are available, among which the most commonly used are the Wiener index [13], connectivity indices [14], Kier shape indices [15], Balaban J index [16], and Zagreb indices [17]. The main applications of these indices are to differentiate molecules based on their size, degree of branching, flexibility, and overall shape. The Wiener index (W) is the oldest molecular graph-based structure descriptor and has become one of the most frequently used descriptors in QSAR/QSPR studies [13]. Among the TIs, the most successful descriptors are the so-called ‘connectivity indices’. These descriptors are based on graph-theoretical invariants, which were introduced to compute the branching index of alkanes [14]. Kier and Hall extended these indices and introduced valence connectivity indices to differentiate heteroatoms. These have now been applied successfully to a variety of physicochemical and biological activities [18]. Randic [19] suggested some desirable features for TIs: (i) they must have good correlation with at least one property; (ii) they should have a structural interpretation; (iii) they should be simple and independent; (iv) they should be easy to apply to a local structure; (v) they should be independent of experimental properties; and (vi) they should be independent of other descriptors. Fortunately, most of these features are found in the topological descriptors. Hence, they have been prolifically applied in QSAR/QSPR modeling to characterize the structural similarity or dissimilarity of chemical compounds.

Over the past few decades, advances in computational approaches combined with contemporary methods have led to an array of novel descriptors. In 2007, Topological Maximum Cross Correlation (TMACC) descriptors were generated from atomic properties determined by molecular topology [20]. These descriptors are based on concepts that are derived from autocorrelation descriptors. Spowage et al. illustrated the interpretability of the TMACC descriptors by QSAR modeling of angiotensin-converting enzyme (ACE) and dihydrofolate reductase (DHFR) inhibitors. Overall, TMACC revealed features that were specific for C domain-selective ACE inhibition, which was an improvement on previous QSAR studies [21].

Geometrical descriptors are calculated from the 3D coordinates of atoms in the given molecule. Compared with topological descriptors, these descriptors are rich in information and discrimination power for similar chemical structures and molecule conformations [22]. Moreover, they also include information obtained from atomic van der Waals areas and their overlap on the molecular surface. Despite their high information content, these descriptors also have disadvantages. Geometrical descriptors require geometry optimization and, therefore, incur computational overhead. For flexible molecules that can adopt several conformations, new information is available and can be exploited. However, this leads to the problem that complexity can increase significantly. Moreover, most of these descriptors (grid-based descriptors) need alignment rules to achieve molecule comparability. Several classes of descriptor can be distinguished within the set of geometrical descriptors [22]. We have listed some of the commonly used topological and geometrical descriptors along with their formal description in Table 1. Most of the computational resources have gained significant popularity among researchers for the definition of these descriptors and are considered as simple means of viewing molecular structures. Several programs for calculating different molecular descriptors are listed in Table 2.

Physicochemical descriptors
These are physical and chemical properties of a molecule that can be estimated by examination of its 2D structure. These properties have a major role in determining the concentration of drug in the body. Appropriate properties of a drug can increase its efficacy and, hence, its market value. Thus, studying these properties of a drug not only supports the safety profile of the drug, but also has a major role in assisting drug discovery by optimizing the compounds selected. Therefore, there is a need to draw attention to properties such as lipophilicity, solubility, and permeability that can ensure optimal potency, in addition to selecting the candidate compounds with appropriate physicochemical properties.

The lipophilicity of a drug molecule refers to its affinity for a lipophilic environment. It is a key property in the transport of drugs in the body, including intestinal absorption, membrane permeability, protein binding, and distribution among different tissues [23]. A calculated log P is used as an assessment of lipophilicity in vivo, which reflects the key event of molecular desolvation during transfer from aqueous phases to cell membranes and protein-binding sites [24]. Lipophilicity requirements for drug candidates are general to a degree, with c log P values <5 being specifically favored. Comparing the available oral drugs with compounds at earlier stages of development showed that high lipophilicity (>5) increased the chance of poor solubility and in vivo toxicity. By contrast, a drug generally shows poor ADMET properties at low lipophilicity [24]. The accurate and efficient calculation of log P is crucial for virtual screening because inaccurate values lead to imprecise results that can cause potentially promising compounds to be discarded. Various studies have been performed to show the relation between lipophilicity and pharmacokinetic properties [24,25]. Waring [26] surveyed the relevant literature for methods predicting lipophilicity with ADMET parameters. The author showed that the optimal range of lipophilicity lies between 1 and 3. Generally, two methods have been used for calculating log P values: (i) the fragment method, which is based on the contribution of functional groups and fragments attached to a nucleus of the molecule for the estimation of the c log P value [27]. The most widely used fragment method program is c log P, which contains a relatively small set of


TABLE 1
Descriptors used in QSAR/QSPR model building.
Year Index Author(s) Description Refs
Topological (2D) descriptors
1947 Wiener (W) Wiener Defined as the sum of edges in the shortest path in a chemical graph [11]
between all pairs of nonhydrogen atoms in a molecule
1971 Hosoya (Z) Hosoya The number of sets of nonadjacent bonds in a molecule, useful for [67]
building the QSAR and QSPR models describing the physical properties
1972 Zagreb indices Gutman The first topological indices used for the total π-energy of conjugated [17]
molecules
1975 Connectivity Randic The first-order connectivity index is a bond additive molecular structural [14]
invariant. Such indices have wide applications in structure–property and
structure–activity studies
1975 Higher order connectivity Kier et al. Higher order connectivity indices are weight paths, where the higher [68]
weight has been assigned to terminal bonds and a lesser weight to the
less exposed inner bonds
1980 Autocorrelation indices Moreau-Broto First introduced to define a relation between atoms as a function of their [69]
spatial separation; have certain advantages for any QSAR/QSPR study:
fragment independent; invariant to roto-translation; encode the identity
and electronic attributes, including atom types, partial atomic charges,
electronegativity, and polarizability
1982 Balaban J Balaban Distance connectivity index, or average distance sum connectivity, is also [16]
one of the most discriminating molecular descriptors. Its value is
independent of molecular size or number of rings
1985 Kappa shape Kier Shape of molecules defined in terms of number of atoms and their [70]
bonding pattern. Present in various orders (e.g., first, second, third order,
etc.); first order index involves a count of single bonds, whereas second
order index involves a count of a two bond path, and so on
1993 Hyper-Wiener index WW Randic Hyper-Wiener index WW of a chemical tree T is defined as the sum of the [71]
products n1n2, over all pairs u, v of vertices of T
1994 Szeged (Sz) Gutman Obtained as a bond additive quantity in which bond contributions are [72]
given as the product of the number of atoms closer to each of the two
points of each bond
2001 Modified Hosoya index Z* Randic Frequency of occurrence of individual CC bonds in the patterns of disjoint [73]
bonds are considered
2001 Modified Wiener index W* Nikolic et al. Bond contributions determined using the reciprocal of the product of the [74]
number of atoms on each side of a bond
2003 Novel Wiener index Li et al. Obtained as a bond additive quantity in which bond contributions are [75]
given as the product of the number of atoms closer to each of the two
points of each bond
2007 Topological Maximum Melville et al. Generated from atomic properties determined by molecular topology. [20]
Cross Correlation (TMACC) These descriptors are based on concepts derived from autocorrelation
descriptors
2010 Augmented Zagreb index (AZI) Furtula et al. Based on the atom-bond connectivity (ABC) index, used for obtaining the [76]
extreme AZI values for chemical trees; applicable for tight upper and
lower bounds of chemical trees
2011 Superaugmented eccentric Gupta et al. Fourth-generation topological indices (TIs) exhibiting high discriminating [77]
distance sum connectivity power, low degeneracy, and high sensitivity toward both the presence
and relative position of heteroatom(s)
Geometrical (3D) descriptors
1977 3D-Molecular representation of Soltzberg and Wilkins MoRSE descriptors have been shown to have good modeling power for [78]
Structures based on Electron different biological and physicochemical properties and can be used for
diffraction (MoRSE) the simulation of infrared spectra
1994 Weighted Holistic Invariant Todeschini et al. WHIM descriptors are used to capture the relevant 3D information [22]
Molecular (WHIM) regarding the molecular size, shape, symmetry, and atom distribution;
have been used for modeling toxicology and various physicochemical
properties. At least ten different WHIM types (e.g., WSIZ, WSHA, WDEN) of
descriptors with different molecular features have been developed
1995 3D Autocorrelation Wagener et al. These descriptors are calculated by applying the autocorrelation function [79]
at distinct points on molecular surface. They are unique for a given
geometry, sensitive to conformational change, and are invariant to roto-
translation
2002 GEometry, Topology, and Consonni et al. GETAWAY descriptors are based on spatial correlation formulas that give [22]
Atom-Weights AssemblY weight to the atom to count atomic mass, van der Waals volume, and
(GETAWAY) electronegativity along with 3D information. Based on matrix operator
and information indices, seven GETAWAY descriptors have been reported
so far
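
The Wiener index definition in Table 1 (sum of shortest-path edge counts over all pairs of nonhydrogen atoms) can be sketched in a few lines of Python. This is a minimal illustration, assuming the molecule is supplied as a hydrogen-suppressed adjacency list; the function name `wiener_index` and the atom numbering are hypothetical choices for this example. The two inputs, n-butane and isobutane, share the same constitutional descriptors but differ in branching.

```python
from collections import deque


def wiener_index(adjacency):
    """Wiener index of a connected, hydrogen-suppressed molecular graph.

    `adjacency` maps each heavy atom to the list of its bonded heavy atoms.
    """
    total = 0
    for source in adjacency:
        # Breadth-first search gives shortest path lengths (in bonds) from `source`.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if neighbor not in distance:
                    distance[neighbor] = distance[atom] + 1
                    queue.append(neighbor)
        total += sum(distance.values())
    # Every unordered pair was counted twice (once from each end).
    return total // 2


# n-butane: C1-C2-C3-C4 (a path graph)
n_butane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
# isobutane: central C1 bonded to C2, C3, and C4
isobutane = {1: [2, 3, 4], 2: [1], 3: [1], 4: [1]}

w_n_butane = wiener_index(n_butane)
w_isobutane = wiener_index(isobutane)
print(w_n_butane, w_isobutane)
```

The linear isomer gives W = 10 and the branched isomer W = 9, showing how a topological index separates structures that atom counts and MW cannot.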


TABLE 2
Software for calculating the descriptors and fingerprints.
Software Descriptors Type of descriptors Web address Status
ACD/labs – log P, log S, log D, pKa www.acdlabs.com Commercial
ADAPT 260 Topological, geometrical, electronic, www.research.chem.psu.edu Freeware
physicochemical
ADAPT 260 Topological, geometrical, electronic, http://research.chem.psu.edu/pcjgroup/adapt.html Freeware
physicochemical

ADMET predictor 297 Constitutional, functional group counts, www.simulations-plus.com Commercial
topological, E-state, 3D descriptors,
molecular patterns, acid-base ionization,
empirical estimates of quantum
ADRIANA.Code 1244 Constitutional, functional group counts, www.molecularnetworks.com Commercial
topological, E-state, Moriguchi, Meylan
flags, 3D descriptors, molecular patterns,
etc.
ALOGPS2.1 – log P, log S www.vcclab.org Freeware
CDK – Topological, geometrical, electronic, http://cdk.github.io Freeware
constitutional
ChemDes – Molecular descriptors www.scbdd.com/chemdes Webserver
(Freeware)
CODESSA 1500 Constitutional, topological, geometrical, www.codessa-pro.com Commercial
charge-related, semi-empirical,
thermodynamical
DRAGON 4885 Constitutional, topological, 2D- www.talete.mi.it Commercial
autocorrelations, geometrical, WHIM,
GETAWAY, RDF, functional groups, etc.
E-DRAGON – Molecular descriptors www.vcclab.org/lab/edragon/ Freeware
JOELib 40 Counting, topological, geometrical www.ra.cs.uni-tuebingen.de Freeware
properties, etc.
MODEL 3778 Molecular descriptors http://jing.cz3.nus.edu.sg/cgi-bin/model/model.cgi Webserver
(Freeware)
MOE 300 Topological, physical properties, structural www.chemcomp.com Commercial
keys, etc.
MOLCONN-Z 40 Topological www.edusoft-lc.com/molconn Commercial
MOLD2 779 1D, 2D www.fda.gov Freeware
MOLGEN-QSPR 707 Constitutional, topological, geometrical, www.molgen.molgenqspr.html Commercial
etc.
OEChem TK – 166-bit MACCS, LINGO, Circular, Path www.eyesopen.com –
(Daylight-like)
OpenBabel – MOLPRINT2D, 166-bit MACCS, Daylight www.openbabel.org Freeware
fingerprint (FP2), structural key
fingerprints
PADEL 1875 1D, 2D, 3D descriptors, molecular www.padel.nus.edu.sg Freeware
fingerprints
PowerMV 1000 Constitutional, atom pairs, fingerprints, www.niss.org/PowerMV Freeware
BCUT
PreADMET 955 Constitutional, topological, geometrical, http://preadmet.bmdrc.org Freeware
physicochemical, etc.
Sarchitect 1084 Constitutional, 2D, 3D www.strandls.com/sarchitect/index.html Commercial

fragment values obtained from accurately measured experimental log P data for a small set of simple molecules; and (ii) an atom-based approach, which involves estimation of log P based on the classification of atoms into chemically distinct types and fitting of the contributions to a data set of experimentally determined log P values [23,27]. Several software packages are available for the calculation of log P, and mainly rely on fragment-based approaches (Table 1).

Given the complexity of the underlying physiological mechanisms, the accurate prediction of drug permeability is a challenge for in silico models. Several studies have been performed on in vitro cellular permeability that showed its dependency on lipophilicity


along with other factors, such as molecular size, hydrogen bonds, hydrophilicity, and degree of ionization. These factors are known to have a significant role in the intestinal absorption of a molecule. In the Biopharmaceutics Classification System (BCS), high permeability is defined as intestinal absorption of >90%. If the absorption is less than this defined value, then the drug is considered to have low permeability. Molecular size is an important factor affecting intestinal absorption and biological activity. Compounds with hydrophobicity (LogDpH7.4 = 0–2) and intermediate size (MW = 500) are known to permeate the intestinal wall by passive transcellular diffusion [6,28]. A model for predicting intestinal permeability was developed using different physicochemical descriptors. It showed that hydrogen bond donors and lipophilicity are significant in the prediction of human intestinal permeability [28]. In 2011, Alex et al. studied the impact of intramolecular hydrogen bonding on membrane permeability and absorption in beyond rule of five chemical space. It was hypothesized that intramolecular hydrogen bonds in drug molecules shield polarity, facilitating improved membrane permeability and intestinal absorption of cyclic as well as noncyclic beyond rule of 5 (BRo5) compounds [29]. MW has been correlated with decreased permeability. Increased lipophilicity (c log P) typically improves permeability. However, increasing the c log P decreases the solubility and increases the promiscuity and toxicity of a drug [30]. Therefore, there is a need for permeability measurements that can be used for early screening sets or library designs, lead optimization, and preclinical candidate selection. To measure the permeability, new experimental and computational models are needed that are able to handle a large amount of data. Numerous in silico-based algorithms and multivariate tools (i.e., PLS, ANN, SVR, and RF) are currently used for the prediction of permeability, as for solubility [31]. A review by Matsson et al. summarizes recent developments in the computational prediction of permeability [30].

Poor aqueous solubility is a major cause of failure for any drug development process. It is estimated that around 70% of drugs in development are poorly soluble, with 40% of those currently approved also being poorly soluble. The solubility of a substance is the amount that can be dissolved in a given volume of solvent at a specified temperature. A compound is considered to be highly soluble if its immediate dose dissolves in 250 ml over the pH range 1–7.5. Aqueous solubility has an important role in the release of drugs, their permeability through biological membranes, and their transport and absorption. A QSAR study was conducted on 191 drug-like compounds from the AQUASOL database using a set of simple structural and physicochemical properties to predict the aqueous solubility [32]. The study concluded that solubility decreases with increasing molecular size, rigidity, and lipophilicity. Several methods have been reported for the prediction of aqueous solubility [33–35]. A review by Johnson and Zheng [36] discussed a significant role for, and progress in, computational models for the prediction of aqueous solubility and absorption of drug compounds.

Descriptor selection methods
To remove irrelevant descriptors, a selection criterion is required that can measure the relevance of each selected descriptor with the output of any classifier. Fig. 2 depicts the procedure that can be used for the selection of the descriptors. Such descriptor selection techniques have become an obvious requirement in the development of robust QSAR models. The interpretability and generality of the models obtained by these methods are highly dependent on the statistical relations among the descriptors and target properties. Thus, expert knowledge in the selection process is required to gain user confidence in the selected set of descriptors. Recently, Martínez et al. developed a tool named Visual and Interactive DEscriptor ANalysis (VIDEAN) that combines statistical methods with interactive visualizations for choosing a set of descriptors [37]. They used the expertise of chemists in an interactive visual exploration of different pieces of data generated from statistical and information theory-based metrics. The significant outcomes of this method are: descriptors with discriminative information, redundant descriptors, and descriptors whose knowledge helps reduce the uncertainty about the value of the target property. In another study, Haggarty et al. applied metric space maps for molecular descriptors of deacetylase inhibitors by using principal component analysis (PCA) and 3D visualization and revealed that subspaces have different densities of biological activity [38]. The results provide evidence that certain structural features are significant for activities of deacetylase inhibitors with respect to their biological targets. Three main strategies that can be used for the selection of a relevant descriptor subset are the Filter, Wrapper, and Embedded (or hybrid) methods. Fig. 3 provides the characteristics, advantages, and disadvantages along with examples of these three strategies.

Filters select subsets of descriptors as a preprocessing step, independently of an induction algorithm. In this method, a feature relevance score is calculated, and low-scoring features are removed [39]. Filter methods consider each feature independently, or in terms of the dependent variable. The main advantages of this method are that it is simple, quick, requires little computational time, and is independent of the classifier used; however, it has the disadvantage that it does not interact with the classifier. Several techniques, such as Euclidean distance, mutual information (MI), and correlation methods, can be used as filter methods. MI is the most widely used concept from information theory and has been used to capture the relevance and redundancy among features. MI in combination with genetic algorithms has also been successfully applied to descriptor selection for QSAR [40]. The disadvantage of this method is that it depends on the estimation of the probabilities (estimated from training samples) for activity [41].

The Wrapper method selects the best feature subset guided by the error of the classifier function for a given subset. The wrapper approach generally performs better than the filter approach. However, it is expensive in terms of computational complexity and time because each feature subset is evaluated with the classifier algorithm used. Wrapper methods only use a single classifier. Given that each classifier has its own biases, each will select different feature subsets. To address the problem of classifier bias, a study investigating the effects of different classifiers for wrapper feature selection revealed that the number and nature of classifiers greatly affect feature selection results [42]. A range of strategies, such as a random hill-climbing algorithm, or heuristics, such as forward selection, backward elimination, and stepwise regression, can be used to add and
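
The filter strategy described in this section (score each descriptor, then drop the low-scoring ones without consulting any classifier) can be sketched as follows. This is a minimal illustration that uses the absolute Pearson correlation with the activity as the relevance score; the function names, the threshold of 0.5, and the toy data are all assumptions for the example and do not come from the studies cited here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def filter_descriptors(descriptors, activity, threshold=0.5):
    """Keep descriptors whose |correlation| with the activity reaches the threshold."""
    scores = {
        name: abs(pearson(values, activity))
        for name, values in descriptors.items()
    }
    kept = [name for name, score in scores.items() if score >= threshold]
    return kept, scores


# Hypothetical toy data: three descriptors for five compounds.
descriptors = {
    "logP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "ring_count": [1, 1, 1, 1, 1],          # constant, carries no information
    "noise": [2.0, -1.0, 0.5, -1.5, 1.0],   # weakly related to the activity
}
activity = [1.1, 1.9, 3.2, 3.9, 5.1]

kept, scores = filter_descriptors(descriptors, activity)
print(kept)
print(scores)
```

Because the score is computed once per descriptor, the cost grows linearly with the number of descriptors, which is why filters are cheap; the trade-off, as noted above, is that the selection never interacts with the model that is eventually trained.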
