Unexplored Therapeutic Opportunities in the Human Genome

ANALYSIS
Tudor I. Oprea1,2,3,4*, Cristian G. Bologa1, Søren Brunak4, Allen Campbell5,
Gregory N. Gan2, Anna Gaulton6, Shawn M. Gomez7,8, Rajarshi Guha9, Anne Hersey6,
Jayme Holmes1, Ajit Jadhav9, Lars Juhl Jensen4, Gary L. Johnson8, Anneli Karlson6,20,
Andrew R. Leach6, Avi Ma’ayan10, Anna Malovannaya11, Subramani Mani1,
Stephen L. Mathias1, Michael T. McManus12, Terrence F. Meehan6, Christian von
Mering13, Daniel Muthas14, Dac-Trung Nguyen9, John P. Overington6,21,
George Papadatos6,22, Jun Qin11, Christian Reich15, Bryan L. Roth8,
Stephan C. Schürer16, Anton Simeonov9, Larry A. Sklar2,17,18, Noel Southall9,
Susumu Tomita19, Ilinca Tudose6,23, Oleg Ursu1, Dušica Vidović16, Anna Waller17,
David Westergaard4, Jeremy J. Yang1 and Gergely Zahoránszky-Köhalmi1,24
Abstract | A large proportion of biomedical research and the development of therapeutics is
focused on a small fraction of the human genome. In a strategic effort to map the knowledge gaps
around proteins encoded by the human genome and to promote the exploration of currently
understudied, but potentially druggable, proteins, the US National Institutes of Health launched
the Illuminating the Druggable Genome (IDG) initiative in 2014. In this article, we discuss how the
systematic collection and processing of a wide array of genomic, proteomic, chemical and
disease-related resource data by the IDG Knowledge Management Center have enabled the
development of evidence-based criteria for tracking the target development level (TDL) of
human proteins, which indicates a substantial knowledge deficit for approximately one out of
three proteins in the human proteome. We then present spotlights on the TDL categories as well
as key drug target classes, including G protein-coupled receptors, protein kinases and ion
channels, which illustrate the nature of the unexplored opportunities for biomedical research
and therapeutic development.
*e-mail: toprea@salud.unm.edu
doi:10.1038/nrd.2018.14
Published online 23 Feb 2018; corrected online 23 Mar 2018

Drug
Externally administered, possibly endogenous but mostly xenobiotic, substances that are administered to patients in order to influence the outcome of a disease, syndrome or condition.

Target selection and prioritization are common goals for academic and commercial drug research organizations. While motivations differ, in all cases, the target selection task is fundamentally one of resource allocation in the face of incomplete information. Consequently, target selection strategies (and metric-based approaches to assess their success) remain complex1 and are hindered by multiple bottlenecks. Some bottlenecks pertain to the data themselves, such as disjointed, disparate data and metadata standards, data recording errors and accessibility issues; overcoming these issues will require human and computational efforts and coordination across multiple communities. Another set of bottlenecks pertains to the scientists involved. These include a tendency to focus on a small subset of well-known genes2 and the tendency to avoid riskier research paths, driven by poor research funding climates3.

For the purposes of this article, we define knowledge as the consensus of information aggregated from different sources and information as structured data, with a contextual layer that supports a broad range of data analytics. Data have quantity, quality and dimensionality (for example, genomic knowledge is defined in relation to associations with distinct entities such as molecular probes and disease concepts). Data, like facts, may also have an expiration date (Supplementary Box S1), and thus knowledge is subject to change. Yet, within a given time frame, knowledge provides context for interpretation and integration of emergent data, information and models.

Data-driven drug discovery strategies rely on the integration of proprietary and internal data with third-party resources: both public databases, such as PubMed, PubChem4, ChEMBL5 and The Cancer
NATURE REVIEWS | DRUG DISCOVERY VOLUME 17 | MAY 2018 | 317

© 2018 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.
Genome Atlas (TCGA6), and commercial databases, such as Integrity. This integration requires fusion and reconciliation of heterogeneous and sometimes conflicting data sources and types. Although many of these resources are already partially interlinked, data heterogeneity, complexity and incompleteness, as well as contextual information and metadata capture, pose substantial barriers to reliable systematic analyses of all data required to address biomedical research questions, such as target prioritization in drug discovery1.

With the increasing scale and variety of data generation, collection and curation in the biomedical sciences, there is an unmet need for in-depth, accurate and truthful integration of multiple scientific domains across disciplines. Once successful, these data and knowledge integration efforts enable us to ask both global and fundamental questions about genes, proteins and the processes they are involved in. Integrated resources also allow us to address aspects of reproducibility7 via concordance of similar data types from unrelated sources and deficits in our knowledge of biological systems and their function. More generally, data integration facilitates our ability to quantify knowledge using an evidence-based approach.

Illuminating the Druggable Genome. “The reluctance to work on the unknown” (REF. 2) is inherent to the scientific endeavour, partly due to our subconscious tendency to choose research subjects more likely to confirm what we already know or believe8. In a deliberate, strategic attempt to map the knowledge gaps around potential drug targets and to prompt exploration of currently understudied but potentially druggable proteins, the US National Institutes of Health (NIH) launched the Illuminating the Druggable Genome (IDG) initiative in 2014. As part of this broad, multimillion-dollar initiative, the IDG Knowledge Management Center (KMC) aims to systematize general and specific biomedical knowledge by processing a wide array of genomic, proteomic, chemical and disease-related resources (BOX 1), with the explicit goal of supporting target hypothesis generation and subsequent knowledge creation, especially for genes and proteins that are not well studied.

In this article, we first define objective, evidence-based criteria for tracking target development levels (TDLs) for human proteins, using multiple sets of current knowledge. We discuss the data collected by the KMC on TDLs, which show the existence of a substantial knowledge deficit concerning a large portion of the human proteome (one out of three proteins). Reflecting the goal of illuminating the druggable genome, we then present spotlights on the TDL categories, as well as on key target classes, including G protein-coupled receptors (GPCRs), protein kinases and ion channels.

Drug targets
Molecular entities present in living systems that, upon interaction with therapeutic agents or their by-products, result in modified biological responses that lead to therapeutic outcomes. The interaction between a drug and its target leads, directly or indirectly, to observable clinical outcomes.

Druggable genome
Originally defined by Hopkins and Groom as the set of genes that encode proteins that could be modulated by an orally administered small molecule, as estimated by Lipinski’s ‘rule of five’ guidelines.

Author addresses
1 Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, NM, USA.
2 UNM Comprehensive Cancer Center, Albuquerque, NM, USA.
3 Department of Rheumatology and Inflammation Research, Institute of Medicine, Sahlgrenska Academy at University of Gothenburg, Gothenburg, Sweden.
4 Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.
5 IQVIA, Plymouth Meeting, PA, USA.
6 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, UK.
7 Joint Department of Biomedical Engineering, University of North Carolina at Chapel Hill and North Carolina State University, Chapel Hill, NC, USA.
8 Department of Pharmacology, University of North Carolina School of Medicine, Chapel Hill, NC, USA.
9 National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Rockville, MD, USA.
10 Icahn School of Medicine at Mount Sinai, New York, NY, USA.
11 Baylor College of Medicine, Houston, TX, USA.
12 University of California, San Francisco, CA, USA.
13 Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland.
14 Respiratory, Inflammation and Autoimmunity Diseases, Innovative Medicines and Early Development Biotech Unit, AstraZeneca R&D Gothenburg, Mölndal, Sweden.
15 IQVIA, Cambridge, MA, USA.
16 Department of Molecular and Cellular Pharmacology, Miller School of Medicine, University of Miami, Miami, FL, USA.
17 Center for Molecular Discovery, University of New Mexico Cancer Center, University of New Mexico, Albuquerque, NM, USA.
18 Department of Pathology, University of New Mexico, Albuquerque, NM, USA.
19 Yale School of Medicine, Yale University, New Haven, CT, USA.
Present addresses:
20 SciBite Limited, BioData Innovation Centre, Wellcome Genome Campus, Hinxton, Cambridge, UK.
21 Medicines Discovery Catapult, Alderley Edge, UK.
22 GlaxoSmithKline, Stevenage, UK.
23 Google Germany GmbH, München, Germany.
24 NIH-NCATS, Rockville, MD, USA.

Knowledge-based protein classification

Target development levels. Most current protein classification schemes are based on structural and functional criteria. For any given protein, it is also possible to identify associated drugs and chemical or biologic modulators, and many types of experimental data can be associated with the protein, including publications, patents, gene expression data and experimental or modelled 3D structures.

For target prioritization and therapeutic development, it is useful to understand the quantity and diversity of data that are available for a given protein and to assign a qualitative knowledge metric that characterizes the degree to which a target is comparatively well studied or unstudied. To address this, we developed the TDL classification scheme, which categorizes proteins into four groups (Tclin, Tchem, Tbio and Tdark) with respect to the depth of investigation from a clinical, chemical and biological standpoint (FIG. 1; TABLE 1). Except for Tclin, TDL assignments were performed without human curation. Formal definitions for the TDL categories are as follows.

• Tclin (clinic) proteins are drug targets linked to at least one approved drug (that is, an active pharmaceutical ingredient) by mechanism of action (MoA) (this criterion supersedes any of the other parameters). Classification into this TDL category was achieved through exhaustive manual querying of primary literature and drug labels for MoA assignments with respect to molecular (protein) targets9; drug targets annotated as MoA-related proteins are categorized as Tclin (see further discussion below).
• Tchem (chemistry) proteins lack MoA-based links to approved drugs but are known to bind to small molecules with high potency. The interactions between proteins and small molecules (and sometimes approved drugs) are usually studied in the context of a disease and often arise from medicinal chemistry efforts. For inclusion in the Tchem category, we required the bioactivity of at least one small molecule to be above a specific cut-off chosen to include about 90% of the bioactivity values of drugs with a confirmed MoA for a target from that protein family (Supplementary Figure S2). Currently chosen thresholds are ≤30 nM for kinases, ≤100 nM for GPCRs and nuclear receptors, ≤10 μM for ion channels and ≤1 μM for other target families. Bioactivity values were extracted from ChEMBL5 and DrugCentral10.
• Tbio (biology) refers to those proteins that have a confirmed Mendelian disease phenotype in the Online Mendelian Inheritance in Man (OMIM) database11 (that is, at least two publications), have Gene Ontology (GO)12 leaf term annotations based on experimental evidence or meet two of the following three conditions: a fractional PubMed publications count13 above five; three or more National Center for Biotechnology Information (NCBI) Gene Reference Into Function (RIF) annotations; or 50 or more commercial antibodies, counted from data made available by the Antibodypedia database14. Tbio assignments imply that these proteins are not MoA-related drug targets (these are Tclin proteins). However, it does not follow that these proteins lack associations with bioactive molecules, including approved small-molecule drugs and biologics. It does, however, imply that given current levels of evidence, associated bioactivity values and clinical observations did not meet Tchem or Tclin criteria, respectively.
• Tdark (dark genome) refers to the remaining proteins that have been manually curated at the primary sequence level in UniProt15 yet do not meet any of the criteria for Tclin, Tchem or Tbio. Even for this category, evidence may be available concerning genome-wide association studies (GWAS), tissue location, dysregulation, inferred function via homology, etc. Many proteins in the Tdark category are not contextless sequences. However, these are proteins for which there is the least current knowledge and a low number of specific molecular probes available, and some represent unexplored opportunities within the druggable human genome. While evidence that approved drugs interact with some Tdark proteins may be available, the above criteria were observed for all Tdark assignments (Supplementary Table S3).

The knowledge deficit. FIGURE 2a summarizes the varying degree of available data (represented using a normalized count of occurrence) for seven different data types associated with individual targets and grouped by TDL. The first three groups illustrate category differences for three TDL defining criteria discussed above, namely the fractional count of protein and/or gene mentions in PubMed abstracts, NCBI Gene RIF counts and antibody counts per protein. ‘GO terms’ examines the distribution of GO12 annotation counts per protein using data from UniProt15. ‘R01 grants’ examines the distribution of text-mined R01 grant counts detected for each protein using NIH RePORTER data (see below for further discussion). ‘Patents’ examines the distribution of text-mined granted patents for each protein using SureChEMBL16 data. Finally, the data availability score summarizes experimental information density per protein obtained from Harmonizome17 data, a resource developed independently for the KMC that provides an abstract representation of the many types of data associated with all human genes and proteins (BOX 1).

Whereas the first three data types were used to assign the TDL category for proteins in the Tbio and Tdark categories, the other four data types (derived from separate text corpora and repositories) provide independent validation of our criteria for categorization overall. Distribution trends within TDL categories are consistently reproduced across all data types in FIG. 2a and have statistically significant differences (Supplementary Table S4). Tdark proteins have the least amount of data associated with them regardless of source.

Increasing amounts of data are observed for proteins when progressing through the categories from Tdark to Tbio, Tchem and Tclin. For example, Tdark proteins tend not to be the object of study for many funded NIH R01 grants and are significantly less discussed in patents compared with proteins in other TDLs. Statistical significance breaks down when comparing Tclin and Tchem, but because successful clinical trials are required for the Tchem-to-Tclin progression, this evidence may not be well captured by the four data types highlighted in Supplementary Table S4. However, this is less surprising from a knowledge management perspective, since on average, the biochemistry and pharmacology of a protein are likely to be well studied upon reaching the Tchem development stage. It is important to note that the Tchem stage can be completely bypassed for targets of therapeutic antibodies and other biologics.

In summary, all the data, information and knowledge aggregated and processed within the IDG KMC archive (partially illustrated in FIG. 2a) confirm the existence of a knowledge deficit about many proteins, some of which could have therapeutic relevance. The bias towards well-described proteins2 is confirmed not only with respect to publications but also with respect to patents, NIH funding patterns, GWAS and mouse phenotype data (data not shown), availability of molecular probes such as antibodies and small molecules, and even queries in the STRING18 database (see below and FIG. 2b). Because of this bias, one out of three human proteins (Tdark) have been largely unstudied. Although the NIH acknowledged that illumination should directly target understudied proteins, scientists engaged in target selection are likely to remain risk-averse and perhaps systematically less inclined to study Tdark proteins.

Our classification provides overall insight into the current illumination levels and sizes the opportunity for drug targets from well-established and precedented druggable protein families. The natural progression is for

proteins of potential therapeutic interest to migrate from Tdark to Tclin over time, and TDL monitors knowledge accumulation using multiple types of clinical, chemical and biological evidence, while providing an easily interpretable ranking scheme. We argue that proteins in Tdark and Tbio are understudied and more in need of illumination, and we discuss approaches for achieving this later in the article, after first overviewing knowledge on proteins in the Tclin and Tchem categories.

Mode of action
Referred to as ‘mechanism of action’ when the molecular interactions are well understood; describes the way in which drugs exert their intended therapeutic action, resulting in the intended therapeutic outcome.

Spotlight on Tclin and Tchem

Evaluating protein target druggability (the ability of a protein to be therapeutically modulated by medicines) can involve complex assessments of a range of protein characteristics. Structural biology and computational and medicinal chemistry assessments of druggability largely focus on forecasting whether a target protein can bind to drug-like small molecules with high affinity and specificity19. However, druggability literature rarely mentions biologics, antibodies and other protein therapeutics, radiotherapy (Supplementary Box S5), gene therapy or stem cells. In this section, we discuss Tclin and Tchem proteins in the context of small-molecule drug discovery.

Tclin proteins. Ideally, unequivocal Tclin assignment (that is, identification of molecular drug targets) would require several layers of evidence: a full matrix of in vitro bioactivity for all prodrugs, drugs and active metabolites (active ingredients) assayed against all relevant human and non-human (for example, bacterial and viral) targets
Box 1 | Overview of the Illuminating the Druggable Genome Knowledge Management Center

Knowledge management implies the ability to structure data into information88 while combining low-volume, high-quality data, such as thorough analyses of experimental data (for example, high-resolution X-ray crystallographic structures) or evidence-based systematic reviews (for example, the Cochrane Collaboration), with high-volume (and perhaps lower quality) data such as genome-wide association studies (GWAS) or high-throughput screening data sets. As the overall scientific process requires the archiving, evaluation and re-interpretation of sometimes conflicting data, the Illuminating the Druggable Genome Knowledge Management Center (IDG KMC) faces similar challenges. Consensus emerges based on repeated independent experiments, robustness of the results (for example, modified reagents or conditions, or model organisms), increased domain expertise and qualitative judgement. To this end, the IDG KMC automates algorithmic processing of structured data by extracting and processing expression and functional data related to proteins and genes, molecular probes such as small molecules and antibodies, small-molecule bioactivities, GWAS, disease associations and launched drug information (among other data types) into the Target Central Resource Database (TCRD)89. TCRD content is presented via Pharos, a multimodal web interface89 (see below).

TCRD–Pharos is not unique in providing integrated content: ChEMBL, DrugBank90 and UniProt91 are excellent examples of drug discovery integration systems, for example, for chemical structure and drug bioactivity data and protein and disease information, largely focused on a specific knowledge domain. CiViC92 combines multiple resources with a specific goal, for example, to enable clinical interpretation of gene variants. The only resource that parallels the scope of IDG KMC is OpenTargets93, a consortium focused on disease-specific target validation efforts. The KMC collates evidence about all human proteins from multiple domains, supporting research on understudied proteins and new biology, and includes the following resources.

Target Central Resource Database
TCRD is the central open-access data repository for the IDG KMC and is the primary data source for the IDG KMC project-wide web portal Pharos89. TCRD integrates 55 heterogeneous data sets, with over 85 million gene and/or protein attributes. Special emphasis is placed on four families that were of interest to the pilot phase of the IDG programme: G protein-coupled receptors, ion channels, kinases and nuclear receptors (TABLE 1). The focus on this fraction of the proteome is justified by historical evidence, which indicates that these four protein families are among the most consistently successful druggable target classes (see also TABLE 2). TCRD is available under the CC-BY-SA 4.0 licence. Programmatic access to TCRD is also available via a REST application programme interface (API).

Pharos
Access to TCRD content is via the web portal Pharos89, which is a Java platform that supports efficient and intuitive search queries and browsing of all TCRD data. Features include search filters to reduce lists of targets, query-saving capability for sharing, and dossier functionality to collate data during searching or browsing. Pharos provides an extensive REST API to support programmatic access and inclusion in pipelining tools.

Harmonizome
Given the wide variety of experimental data that is generated on individual proteins, it is useful to characterize the total availability of data types around individual targets. The Harmonizome is a resource developed for KMC17 that contains a collection of processed data sets from 70 major online resources, abstracted and organized into ~72 million functional associations between genes and proteins and their attributes. Such attributes could be physical relationships with other biomolecules, expression in cell lines and tissues, genetic associations with knockout mouse or human phenotypes or changes in expression after drug treatment.

These associations are stored in a relational database along with rich metadata for genes and proteins, their attributes and the original sources. To report overall levels of knowledge for each target, the Harmonizome computes a cumulative probability of a protein occurring within a given data set. With appropriate normalization, this results in an association score for a protein–data source pair, with values ranging from 0 to 1. When a source has no data associated with a target, its score is set to 0. Currently, 110 individual data sources (including supplementary files from publications and public repositories of omics data) are made available through the Harmonizome, resulting in a 110-element vector representation for each target. From this vector, we compute the data availability score as the sum of the 110 association scores.

The Harmonizome is available through a web portal, a web service and a mobile app for querying, browsing and downloading all data. The Harmonizome visualizes gene–gene and attribute–attribute similarity networks for all processed data sets.

DrugCentral
This online compendium provides chemical, pharmacological and regulatory information for active pharmaceutical ingredients and pharmaceutical products by linking chemical entities, multiple drug identification codes, drug mode of action and pharmacological action at the target level, and pharmaceutical formulation and product-specific information, as well as indications, contraindications and off-label indications10. DrugCentral links 4,509 active ingredients to 93,084 pharmaceutical products and is available under the CC-BY-SA 4.0 licence.

Drug Target Ontology
This is an interactive framework to integrate, navigate and analyse drug discovery data, based on formalized and standardized classifications and annotations of human proteins94, available under the CC-BY-SA 4.0 licence.
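
The data availability score described in Box 1 reduces to a sum over a fixed-length vector of association scores. The following is a minimal Python sketch of that arithmetic only; it is not the Harmonizome implementation. The source names (`source_000`, `source_001`, ...) and both function names are hypothetical, and the association scores are assumed to have been normalized to the range 0 to 1 beforehand.

```python
from typing import Dict, List, Sequence

# Hypothetical placeholder names for the 110 data sources mentioned in Box 1.
SOURCE_NAMES: List[str] = [f"source_{i:03d}" for i in range(110)]


def association_vector(scores_by_source: Dict[str, float],
                       source_names: Sequence[str] = SOURCE_NAMES) -> List[float]:
    """Build the per-target vector; a source with no data for the target scores 0."""
    return [scores_by_source.get(name, 0.0) for name in source_names]


def data_availability_score(vector: Sequence[float]) -> float:
    """Sum the association scores, checking that each lies in [0, 1]."""
    for score in vector:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"association score out of range: {score}")
    return sum(vector)


if __name__ == "__main__":
    # Illustrative values only: a target with data in two of the 110 sources.
    vector = association_vector({"source_000": 0.5, "source_007": 0.25})
    print(len(vector), data_availability_score(vector))
```

Under these assumptions the score for any target lies between 0 (no data in any source) and 110 (maximal association in every source).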

(such as the half-maximal inhibitory concentration (IC50), effector concentration for half-maximum response (EC50), inhibitory constant (Ki) and the dissociation constant (Kd)); on–off rate constants and other kinetic measurements performed at appropriately relevant concentrations in the tissue or tissues relevant for that particular disease context, preferably with matching in vivo data in humanized animal models (although human data are preferable); and phenotypic confirmation supported by pharmacodynamic data. In animal disease models lacking the gene or genes responsible for the MoA of the drug, the drug should lack therapeutic effect. Meeting these criteria would be needed in order to attribute the desired clinical outcome to a specific drug–target interaction mechanism.
[Figure 1a: ring chart of TDL categories (Tclin, Tchem, Tbio, Tdark); protein family legend: GPCR, nuclear receptor, ion channel, transporter, kinase, transcription factor, enzyme, epigenetic, other. Figure 1b: bar chart; y-axis: percentage of proteins in family (0 to 100); x-axis: protein family; bars coloured by Tbio, Tchem and Tclin.]

Figure 1 | Target development level categories applied to the human proteome. a | Percentages of the whole proteome are shown in the inner ring. Percentages of each target development level (TDL) category for selected major protein families are shown in the outer ring, with the Tclin category expanded. Inner ring colours are as follows: Tdark, black; Tbio, red; Tchem, green; and Tclin, blue. b | TDL distribution across protein families, coloured by TDL category. Data show 3,644 proteins that have a confirmed disease association according to the Online Mendelian Inheritance in Man (OMIM) database. The enzyme category excludes kinases, which are considered separately. GPCR, G protein-coupled receptor.
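
The TDL categories shown in Figure 1 follow the decision rules given in the formal definitions earlier in the article. As a rough illustration, those rules can be sketched in Python as below. This is a simplified reading of the published criteria, not the KMC pipeline: the `ProteinEvidence` record, its field names and the `classify_tdl` function are hypothetical, every input is assumed to be a UniProt-curated human protein, and the Tclin flag is assumed to come from the manual MoA curation described in the text.

```python
from dataclasses import dataclass
from typing import Optional

# Potency cut-offs in nM, as quoted in the Tchem definition.
TCHEM_CUTOFF_NM = {
    "kinase": 30.0,
    "gpcr": 100.0,
    "nuclear receptor": 100.0,
    "ion channel": 10_000.0,   # 10 uM
}
DEFAULT_CUTOFF_NM = 1_000.0    # 1 uM for all other target families


@dataclass
class ProteinEvidence:
    """Hypothetical evidence record for one protein (field names are assumptions)."""
    family: str
    has_moa_approved_drug: bool = False       # manually curated MoA link to an approved drug
    best_potency_nm: Optional[float] = None   # most potent small-molecule bioactivity, in nM
    has_omim_phenotype: bool = False          # confirmed Mendelian disease phenotype in OMIM
    has_experimental_go_leaf: bool = False    # GO leaf term with experimental evidence
    pubmed_fractional_count: float = 0.0
    gene_rif_count: int = 0
    antibody_count: int = 0


def classify_tdl(p: ProteinEvidence) -> str:
    # Tclin supersedes every other criterion.
    if p.has_moa_approved_drug:
        return "Tclin"
    # Tchem: at least one small molecule at or below the family-specific cut-off.
    cutoff = TCHEM_CUTOFF_NM.get(p.family.lower(), DEFAULT_CUTOFF_NM)
    if p.best_potency_nm is not None and p.best_potency_nm <= cutoff:
        return "Tchem"
    # Tbio: OMIM phenotype, or experimental GO leaf term, or two of three counts.
    conditions_met = sum([
        p.pubmed_fractional_count > 5,
        p.gene_rif_count >= 3,
        p.antibody_count >= 50,
    ])
    if p.has_omim_phenotype or p.has_experimental_go_leaf or conditions_met >= 2:
        return "Tbio"
    # Everything else is Tdark.
    return "Tdark"


if __name__ == "__main__":
    print(classify_tdl(ProteinEvidence(family="kinase", best_potency_nm=25.0)))
```

The order of the checks mirrors the precedence stated in the definitions: a curated MoA link wins outright, potency is tested next, and the literature and reagent counts are consulted only for proteins that reach neither Tclin nor Tchem.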

Table 1 | Current distribution of TDL categories by protein family for the druggable genome

Target class              All      Tclin   Tchem   Tbio     Tdark
GPCRs (non-olfactory)     406      96      113     145      52
Olfactory GPCRs           421      0       0       8        413
Kinases                   634      50      390     163      31
Ion channels              355      126     44      150      35
Nuclear receptors         48       18      19      11       0
Transporters              473      26      46      287      114
Transcription factors     1,400    0       27      866      507
Epigenetic proteins (a)   280      12      53      178      37
Enzymes (b)               4,146    186     493     2,607    860
Others                    11,957   87      217     6,671    4,982
Total                     20,120   601     1,402   11,086   7,031

GPCR, G protein-coupled receptor; TDL, target development level. (a) Includes 40 transcription factors not already counted with those in the transcription factor category. (b) Excludes kinases.
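
The proportions quoted in the text (for example, Tdark as roughly one out of three proteins) can be recomputed directly from the counts in Table 1. The short Python sketch below does only that arithmetic; the dictionary layout and the function names are illustrative choices, not part of any IDG resource.

```python
TDL_NAMES = ("Tclin", "Tchem", "Tbio", "Tdark")

# Counts per target class, copied from Table 1 in the order of TDL_NAMES.
TABLE_1 = {
    "GPCRs (non-olfactory)": (96, 113, 145, 52),
    "Olfactory GPCRs": (0, 0, 8, 413),
    "Kinases": (50, 390, 163, 31),
    "Ion channels": (126, 44, 150, 35),
    "Nuclear receptors": (18, 19, 11, 0),
    "Transporters": (26, 46, 287, 114),
    "Transcription factors": (0, 27, 866, 507),
    "Epigenetic proteins": (12, 53, 178, 37),
    "Enzymes (excluding kinases)": (186, 493, 2607, 860),
    "Others": (87, 217, 6671, 4982),
}


def tdl_totals(table):
    """Column totals: number of proteins in each TDL category."""
    return {name: sum(row[i] for row in table.values())
            for i, name in enumerate(TDL_NAMES)}


def tdl_fractions(table):
    """Share of the whole proteome in each TDL category."""
    totals = tdl_totals(table)
    grand_total = sum(totals.values())
    return {name: count / grand_total for name, count in totals.items()}


def family_fractions(table, family):
    """Share of one target class in each TDL category."""
    row = table[family]
    family_total = sum(row)
    return {name: count / family_total for name, count in zip(TDL_NAMES, row)}


if __name__ == "__main__":
    for name, fraction in tdl_fractions(TABLE_1).items():
        print(f"{name}: {fraction:.1%}")
```

With the Table 1 counts, Tdark accounts for about 35% of the 20,120 proteins and Tclin for about 3%, in line with the proportions stated in the article.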
Because the above criteria are difficult to implement by automation, a previous analysis carefully curated MoA data from approved drug labels as well as primary literature, based on a rigorous definition of a drug target9. This ongoing process, performed in parallel by three teams, is anticipated to improve our ability to link drug responses to genetic variation and to help us understand the molecular basis of clinical efficacy, safety and adverse events. The interplay between target tissue expression under disease-specific conditions and the local concentration of the drug or its active metabolites at the relevant disease site is often difficult to ascertain, which is why we attributed a higher weight of evidence to data derived from multiple drugs belonging to the same therapeutic class. Indeed, we anticipate that efficacy target annotations will become more precise as our capability to colocalize target, disease and drug increases.

From this analysis, Tclin currently consists of ~600 protein targets9, which is at the lower end of the original estimate of between 600 and 1,500 targets for the intersection between proteins in the druggable genome and disease-modifying genes by Hopkins and Groom20 (note, however, that Tclin includes targets of biologic drugs as well as small molecules, while the estimate was for small-molecule drug targets only20). So far, proteins in the Tclin category thus represent only a small fraction (3%) of the human proteome (FIG. 1a). From a commercial perspective, it is also noteworthy that most of the global revenues of the pharmaceutical industry are derived from drugs that target a relatively small number of the proteins in the Tclin category (BOX 2; TABLE 2). The majority (259 or 79.7%) of these targets are single proteins, whereas 39 (12%) are complex multiprotein targets. Only 25 targets (8.3%) consist of multiple proteins for non-selective drugs; these include the muscarinic, α-adrenergic and oestrogen receptors, as well as cyclooxygenases and histone deacetylases.

Among the factors contributing to the small fraction of each major protein family in Tclin so far, one factor is that not all members of a protein family have drug-compatible or ligandable21 binding sites; for example, some nuclear receptors lack an (endogenous) ligand-binding domain or do not appear to be amenable to small-molecule perturbation. Another factor is that not all proteins can (or will) alter the course of disease via therapeutic intervention, perhaps in some cases owing to our lack of understanding of the underlying pathology.

Kubinyi pointed out that single proteins combine in vivo in ways that could lead to many more drug target combinations across multiple pathways (that is, a ‘druggable proteome’; REF. 22), and there is now experimental evidence that alternative splicing, post-translational modification and heterogeneous oligomers produce functional isoforms with different interaction profiles, which may further result in increased diversity of the proteome23. It is also important to note that for many drugs, the precise MoA and contributing molecular targets remain cryptic, especially when polypharmacology (the simultaneous modulation of multiple targets by drugs) occurs. Shedding light on this would require data completeness24, namely, experiments across all proteins, in relevant physiological conditions, for all approved drugs. This remains a resource-intensive and costly task, which was partially accomplished25 by the NIH Molecular Libraries Initiative26.

Tchem proteins. Assignment to Tchem is based on compound activity thresholds originating from binding experiments for small molecules (Supplementary Figure S2). Selectivity, though important both in vivo and in vitro, could not be factored in for all Tchem targets (Supplementary Box S6 and Supplementary Table S7). Because, by definition, Tclin attribution requires supporting evidence for the MoA, many proteins known to interact with approved drugs, even with high affinity, remain in the Tchem category. Additional bioactivity data from, for example, patent literature and papers currently not indexed in ChEMBL may progress more targets to Tchem. Many compounds that have reported activity against Tchem targets are also candidate drugs undergoing clinical trials. Based on an in-depth analysis of clinical trial data combined with data from ChEMBL, PubMed, the

© 2018 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.
IUPHAR Guide to Pharmacology27 and ChemIDplus, we mapped 144 Tchem proteins to 356 clinical
(phase I–III trial) candidates, for a total of 701 unique target–clinical candidate pairs. For
175 (25%) of these pairs, therapeutic indication data extracted from ChEMBL highlight the
different distribution among protein families (TABLE 3). Most targeted proteins are kinases
(93 unique enzymes), followed by GPCRs (31 unique receptors) and ion channels (13), with seven
targets from other families, which is similar to the prior observation that most clinical
candidates target the most druggable target families28. Analysis of the target–clinical
candidate subset in which anticipated therapeutic indications are available shows that most of
the kinase-targeting clinical candidates are aimed at oncology applications, whereas GPCRs and
ion channel-targeting clinical candidates are aimed at central nervous system disorders.

[Figure 2 image not reproduced. Panel a: y axis, normalized counts per score. Panel b: y axis,
access counts, shown for 'Link' and 'Name'.]
Figure 2 | Patterns of target development level distribution across different data: visualizing
the knowledge deficit. a | The three criteria used in establishing the target development level
are to the left, and their independent validation by four other data types are to the right. For
PubMed abstracts, Gene Reference Into Function (RIF) annotations, antibodies, Gene Ontology, R01
grants and patents, the score for each target is the count of those entities associated with the
target, normalized between 0 and 1. The values for the Harmonizome data availability score were
computed differently, as described in the main text. See FIG. 1 for colour codes and
Supplementary Table S4 for further details. b | Patterns of scientific curiosity: STRING
database access counts by target development level (January–December 2016).

As noted above, target druggability is frequently estimated based on the ability to bind to
small molecules20, and expectations of druggability typically diminish as the size of the
binding pocket increases, with affinity and selectivity being major concerns. The challenge is
even greater when the binding pocket is shallow and highly exposed to a solvent or when the
therapeutic strategy involves disrupting the interaction between the targeted protein and other
proteins. One approach to evaluating druggability is to focus on protein domains mined from the
InterPro database29 and then to prioritize proteins that contain domains known to interact with
approved drugs20 or bioactive small molecules30. Others have explored target druggability by
evaluating side-effect similarity for known drugs31 or by performing combined chemical and
target similarity queries32, followed by experimental confirmation of novel drug targets derived
from clinical observation or computation. It is possible that induced binding sites in proteins
in which a druggable pocket is not initially found may enable them to be targeted with drug-like
small molecules, but identifying these binding sites with structural approaches is likely to be
challenging, and phenotypic screens may be more useful.

An emerging approach harnessing so-called PROTACs (proteolysis-targeting chimaeras) may help
substantially in addressing the issue of undruggability33, at least for proteins that are
capable of selectively binding a small-molecule ligand, although not necessarily at a typical
binding site34. Essentially, this strategy harnesses the endogenous ubiquitin–proteasome system
to promote targeted degradation of desired proteins following binding of the PROTAC35; the
mechanism for ternary ligase–PROTAC–target complex activation has been recently elucidated36.
This technology may also be subcellular location-specific, which could be an additional
advantage in some (but not other) cases. However, the oral bioavailability of PROTAC molecules
may be constrained by their size.

Spotlight on Tbio and Tdark
A critical effort in addressing the knowledge deficit about Tbio and Tdark proteins is being
undertaken by the Monarch Initiative37, which relies on informatics methods to identify
phenotypically relevant disease models in research and diagnostic contexts based on integrated
model organism and clinical research data. One of the main sources for the Monarch Initiative is
phenotype data from the International Mouse Phenotype Consortium (IMPC), which was set up to
generate and phenotypically characterize mouse knockout lines. Their recent analysis of 1,751
unique gene knockouts found that human disease genes are enriched for essential genes38.

The IDG KMC incorporates gene-centric mouse phenotype data and maps these data to the respective
human orthologues. IDG coordinated with IMPC production centres to prioritize production of
knockout mouse strains for druggable genes. As of November 2017, 568 new knockout strains had
been produced: 166 GPCRs, 141 ion channels, 238 kinases and 23 nuclear receptors (see
Supplementary Table S8). When ignoring olfactory GPCRs, these represent a little more than
one-third of

Box 2 | Financial spotlight on the human proteome

Analyses of drug sales focus on pharmaceutical products95 and the companies authorized to market
them96. Here, we ask a target-centric question (what are the most financially valuable
therapeutic targets?) by exploring IMS Health (now known as IQVIA) data from their MIDAS™
platform. We used IMS Health MIDAS drug sales data from 75 countries covering Europe, North
America, Australia and Japan, aggregated for the 2011–2015 period. MIDAS tracks products from
most therapeutic classes, estimating product volumes, trends and market share through retail and
non-retail channels. We chose quinquennial aggregation over annual sales data as it diminishes
the importance of factors less relevant to this analysis, such as fluctuations in currency
exchange rates. Because active ingredients lose patent coverage and become generic, annual sales
figures can abruptly drop from one year to the next.

After excluding traditional medicines, including botanicals and animal products, the MIDAS set
comprised 51,095 unique pharmaceutical products, including small molecules and biologics. As
most anti-infective and antiparasitic drugs target non-human proteins (with the notable
exception of maraviroc, which targets the host (human) CC-chemokine receptor type 5 (CCR5)), we
removed these drugs because their targets are outside of the scope of this analysis. The
remainder were mapped to 1,182 active pharmaceutical ingredients (APIs) from DrugCentral10, which
were first normalized by the number of APIs per pharmaceutical formulation, then by the number
of manually curated mechanism of action (MoA) targets9 per API. Thus, we used 581 Tclin proteins
and 1,096 APIs, a subset of the 893 human and pathogenic biomolecules through which 1,578
previously analysed approved drugs act9.

By linking global drug sales data to drug targets, we sought to assess a snapshot of their
commercial value and to evaluate the market value of human MoA targets. The top 20 MoA targets
ranked by aggregated sales data, together with National Institutes of Health (NIH) R01 funding
for the same period, are shown in TABLE 2. The entire set covers 325 drug targets, comprising
581 Tclin proteins, totalling over US$3,417 billion in global drug sales (Supplementary
Table S10). These data indicate that the cytokine tumour necrosis factor (TNF) is the most
valuable target, and cytokines are the only target class comprised entirely of biologic drugs in
this analysis for the 2011–2015 period. G protein-coupled receptors (GPCRs) are the most
valuable class of druggable targets, with total aggregated sales nearing $917 billion over the
5-year period. This spotlight covers 72 of the 108 druggable GPCRs reviewed elsewhere97. Kinases
($263 billion, with 45 drugs acting on 43 targets) and cytokines ($242 billion, with 17 drugs
acting on 12 targets) are the only two target classes with an extremely active ratio of ongoing
versus completed projects, particularly for emerging mechanism–indication pairs98. Finally,
combining financial data with targets organized by family and Anatomical, Therapeutic and
Chemical (ATC) classification system level 2 codes shows that the top revenue categories are
antineoplastics and immunomodulators, followed by nervous system targets (see box figure; a
larger version is available as Supplementary Figure S11; see also Supplementary Table S10).

The commercial outcomes of target selection and product-led validation can also be analysed with
respect to research funding. NIH RePORTER data (see the NIH ExPORTER website) were processed
using the same text-mining methods described earlier for FIG. 2. During the same period,
2011–2015, the NIH funded 42,924 R01 grant applications, at a total cost of $32 billion. These
projects discuss up to 7,851 human proteins (see Supplementary Table S10). For example, R01
grants associated with oestrogen receptors were awarded $101 million, compared with $50 billion
in sales earned by 18 drugs acting through oestrogen receptors during that same period
(TABLE 2). Some targets, having over 30 drugs each, are also top-earning and well funded, for
example, the μ-opioid (OPRM1) and the glucocorticoid (NR3C1) receptors. Other top-earning
targets with over 30 drugs each, such as the β2 adrenergic receptor and cyclooxygenase 1, are
not as well funded. We found no relationship between MIDAS global drug target sales and NIH R01
funding during the 2011–2015 period, even when factoring in the number of APIs per target.
Overall, $4.2 billion was awarded to study 496 Tclin proteins, representing 13% of the R01
budget and 6% of all R01-funded proteins. Another 615 proteins (485 Tbio and 67 Tdark) had just
one funded R01 project dedicated to their study during 2011–2015, and 8,857 proteins were not
associated with any NIH funding for this time frame.

[Box figure image not reproduced. It shows global drug sales (US$ billions) for MoA targets,
arranged by ATC level 2 code and by target family (GPCR, nuclear receptor, ion channel,
cytokine, enzyme, transporter, kinase, other). Labelled targets: β2 and muscarinic receptors,
proton pump, glucocorticoid receptor, insulin receptor, monoamine transporters, DPP4,
coagulation factor X, dopamine D2 receptor, voltage-gated calcium channels, μ-opioid receptor,
COX2, AT1 receptor, TNF, HMG-CoA reductase, tyrosine kinases, CD20 and progesterone receptor.
ATC groups: A - Alimentary tract and metabolism; B - Blood and blood-forming organs;
C - Cardiovascular system; D - Dermatologicals; G - Genitourinary system; H - Hormonal
preparations; J - Anti-infectives; L - Antineoplastics and immunomodulators;
M - Musculoskeletal system; N - Nervous system; R - Respiratory system; S - Sensory organs;
V - Various.]
AT1 receptor, angiotensin II type 1 receptor; COX2, cyclooxygenase 2; DPP4, dipeptidyl
peptidase 4; HMG-CoA, 3-hydroxy-3-methylglutaryl-CoA.
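
The two-step normalization described in this box (first by the number of APIs per formulation, then by the number of MoA targets per API) can be sketched as follows. This is a minimal illustration that assumes sales are split equally at each step; the product records, field names and figures are hypothetical and are not drawn from MIDAS or DrugCentral.

```python
from collections import defaultdict


def apportion_sales(products, moa_targets):
    """Split each product's sales equally across its APIs, then split each
    API's share equally across that API's MoA targets (assumed equal split)."""
    per_target = defaultdict(float)
    for product in products:
        apis = product["apis"]
        api_share = product["sales"] / len(apis)
        for api in apis:
            targets = moa_targets.get(api, [])
            if not targets:
                continue  # an API without a curated MoA target contributes nothing
            target_share = api_share / len(targets)
            for target in targets:
                per_target[target] += target_share
    return dict(per_target)


# Hypothetical inputs, for illustration only
products = [
    {"name": "product_1", "apis": ["api_a"], "sales": 100.0},
    {"name": "product_2", "apis": ["api_a", "api_b"], "sales": 60.0},
]
moa_targets = {"api_a": ["TARGET_X"], "api_b": ["TARGET_X", "TARGET_Y"]}

totals = apportion_sales(products, moa_targets)
print(totals)
```

Under this assumed scheme, the per-target totals add up to the total sales of all products whose APIs have at least one MoA target, so no revenue is counted twice for combination products or multi-target drugs.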
324 | MAY 2018 | VOLUME 17

Nature Reviews | Drug Discovery
www.nature.com/nrd
Table 2 | Financial (sales and NIH funding) activity for the top 20 mechanism-of-action targets
Target | Target class | Top two drugs | Number of APIs | MIDAS global sales, 2011–2015, per target (US$) | NIH R01 funding, 2011–2015, per target (US$) | Target type | Protein count
TNF | Cytokine | Adalimumab and etanercept | 5 | 163 billion | 165 million | Single | 1
INSR | Kinase | Insulin glargine and insulin aspart | 7 | 144 billion | 2 million | Single | 1
NR3C1 | NHR | Fluticasone propionate and budesonide | 36 | 143 billion | 52 million | Single | 1
HMGCR | Enzyme | Rosuvastatin and atorvastatin | 8 | 123 billion | 1 million | Single | 1
H+/K+-ATPase | Transporter | Esomeprazole and omeprazole | 10 | 118 billion | 5 million | Complex | 2
AGTR1 | GPCR | Valsartan and olmesartan medoxomil | 9 | 100 billion | 17 million | Single | 1
ADRB2 | GPCR | Salmeterol and salbutamol | 36 | 90 billion | 8 million | Single | 1
OPRM1 | GPCR | Oxycodone and fentanyl | 34 | 88 billion | 51 million | Single | 1
COX2 | Enzyme | Paracetamol and diclofenac | 40 | 84 billion | 8 million | Single | 1
DRD2 | GPCR | Aripiprazole and quetiapine | 48 | 75 billion | 17 million | Single | 1
Muscarinic acetylcholine receptors | GPCR | Tiotropium bromide and solifenacin | 40 | 64 billion | 65 million | Multiple | 5
SLC6A4 | Transporter | Duloxetine and escitalopram | 26 | 59 billion | 46 million | Single | 1
HTR2A | GPCR | Aripiprazole and quetiapine | 27 | 58 billion | 13 million | Single | 1
L-Type calcium channels | Ion channel | Amlodipine and nifedipine | 23 | 57 billion | 21 million | Complex | 3
SLC6A2 | Transporter | Duloxetine and methylphenidate | 36 | 56 billion | 5 million | Single | 1
VEGFA | Cytokine | Bevacizumab and ranibizumab | 4 | 55 billion | 162 million | Single | 1
HRH1 | GPCR | Olopatadine and cetirizine | 56 | 54 billion | 1 million | Single | 1
Interferon α–β receptor | Membrane receptor | Interferon β-1a and interferon β-1b | 5 | 51 billion | 7 million | Multiple | 2
Voltage-gated sodium channels | Ion channel | Lidocaine and lamotrigine | 39 | 51 billion | 40 million | Multiple, complex | 10
Oestrogen receptors | NHR | Ethinyloestradiol and oestradiol | 18 | 50 billion | 101 million | Multiple | 2
Targets are ranked by aggregated API sales data. The number of drugs is the number of MoA-target-associated APIs used in
this data set. Target type: single, one protein; multiple, more than one target; complex, more than one protein per target.
Protein count, number of proteins associated with that target. ADRB2, β2-adrenoceptor; AGTR1, AT1 receptor; API, active
pharmaceutical ingredient; HMGCR, hydroxymethylglutaryl-CoA reductase; DRD2, dopamine D2 receptor; GPCR, G
protein-coupled receptor; HRH1, histamine H1 receptor; HTR2A, serotonin 5-HT2A receptor; INSR, insulin receptor; MoA,
mechanism of action; NHR, nuclear hormone receptor; NIH, National Institutes of Health; NR3C1, glucocorticoid receptor;
OPRM1, μ-opioid receptor; COX2, cyclooxygenase 2; SLC6A2, sodium-dependent noradrenaline transporter; SLC6A4,
sodium-dependent serotonin transporter; TNF, tumour necrosis factor; VEGFA, vascular endothelial growth factor A.
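
To make the contrast between sales and funding in TABLE 2 concrete, the short sketch below computes the sales generated per dollar of NIH R01 funding for a few rows of the table. The ratio is only an illustrative descriptive quantity and is not a metric used in this analysis; the values are transcribed from TABLE 2, and the function and variable names are hypothetical.

```python
# (sales, funding) in US$ for 2011-2015, transcribed from TABLE 2
table2 = {
    "TNF": (163e9, 165e6),
    "INSR": (144e9, 2e6),
    "HMGCR": (123e9, 1e6),
    "VEGFA": (55e9, 162e6),
    "Oestrogen receptors": (50e9, 101e6),
}


def sales_per_funding_dollar(sales, funding):
    """Return aggregated sales divided by R01 funding for one target."""
    return sales / funding


ratios = {
    target: sales_per_funding_dollar(sales, funding)
    for target, (sales, funding) in table2.items()
}

# List the selected targets from the highest to the lowest ratio
for target, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{target}: {ratio:,.0f}")
```

Among these five rows the ratio spans more than two orders of magnitude, which is consistent with the observation in BOX 2 that sales and R01 funding are not related in any simple way.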

the druggable genes in these protein families. Phenotype data are available for 80% of these
strains, with abnormalities detected in numerous biological systems, including those related to
development, immune function, metabolism and behaviour. These IMPC strains provide evidence for
biological systems that may be affected when a drug targets a gene with little-known function.

Of the 119 Tdark genes (51 GPCRs, 36 ion channels and 32 kinases) submitted by IDG to IMPC,
45 mouse lines were produced, with 41 phenotypes observed. For example, knockouts of the Tdark
kinase gene Alpk3 have increased embryonic and perinatal lethality, with the surviving adults
displaying severe heart defects (see Further information). Of 482 Tbio genes submitted by IDG
(135 GPCRs, 133 ion channels, 200 kinases and 14 nuclear receptors), 184 mouse lines were
produced, with 145 phenotypes observed. For example, knockouts of the Tbio GPCR gene Adgrd1
display reproductive defects, such as female infertility, and skeleton phenotype defects, such
as decreased bone mineral density (see Further information).

Among 2,788 genes phenotyped in mice at the IMPC, 953 have at least one significant behavioural,
neurological or other nervous system-related phenotype observation. Target Central Resource
Database (TCRD) data from the GWAS Catalog39, OMIM11 and text-mined DISEASES13 databases
confirmed human disease phenotypes for 191 (20%) of these 953 genes, ranging from neurological
(for example, seizure disorders) to cognitive (for example, tauopathy) and psychotic affective
disorders. Because only 9 of the 953 genes lack confirmed expression in any of the 34
neuro-related tissues tracked by IDG KMC (for example, GTEx40,41, HPA42 and HPM43), these data
suggest that the remaining 80% of this set have the potential to be associated with human
neurobehavioural phenotypes, paving the way for new research avenues in this direction (see
Supplementary Table S8). Production of IMPC strains is set to continue for several more years,
and so further knockout strains for druggable genes and their phenotype data are anticipated.

To further explore the characteristics of Tdark and Tbio proteins, we analysed their
distribution in the L1000 gene set, as annotated in TCRD (Supplementary Table S3). L1000 is a
set of 978 landmark genes that have been selected for their ability to predict a large portion
of the total variability seen in large sets of microarray experiments. The proportion of Tdark
proteins in the L1000 set (79 of 978; 8%) is substantially smaller than would be anticipated
based simply on the a priori distribution of Tdark proteins (which make up 35% of the proteome),
whereas Tbio targets (671 of 978; 69%) are more common than expected, as these make up 53% of
the proteome. The proportions of Tclin proteins (41 of 978; 4%) and Tchem proteins in the L1000
set (187 of 978; 19%) are also higher than expected, as these make up 3% and 7% of the proteome,
respectively. The L1000 TDL distribution data support the existence of a knowledge deficit.

To some extent, the data on Tbio and Tdark suggest a causality dilemma: are Tdark proteins
underfunded because there is no scientific interest in this category, or is the lack of
knowledge perpetuated by lack of funding? Although our data do not allow us to establish a
causal relationship, we suggest that the absence of high-quality, well-characterized molecular
probes is a root cause for this situation. Lack of tools leads to lack of interest, and lack of
interest diminishes the probability of such tools being developed. A bibliometric evaluation by
Edwards and colleagues2 examined how many newly sequenced proteins from several protein families
were the subject of new studies 10 years after the completion of the human genome sequencing
project. This analysis concluded that the process of druggable target selection is conservative
and incremental and that limited progress has been observed with respect to understanding newly
discovered proteins.

“If you don’t know very much to begin with, don’t expect to learn a lot quickly” (REF. 44).
Anecdotal evidence (summarized in TABLE 4) suggests that it is possible for proteins to migrate
from Tdark to Tclin within 12–20 years. Data on the six protein targets highlighted indicate
that proteins for which little information was available two decades ago (effectively Tdark)
became attractive from a drug discovery perspective following key papers, namely,
deorphanization and protein–disease association studies. Five of these six targets are modulated
by at least one approved drug, which places them in the Tclin category.
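
The comparison between the observed TDL composition of the L1000 set and the a priori composition of the proteome can be expressed as a simple representation ratio. The sketch below is a minimal illustration using the counts and rounded proteome percentages quoted in the text; the function and variable names are hypothetical, and no statistical test is implied.

```python
# Counts of L1000 landmark genes per target development level (from the text)
l1000_counts = {"Tclin": 41, "Tchem": 187, "Tbio": 671, "Tdark": 79}
# Rounded share of the whole proteome per level (from the text)
proteome_share = {"Tclin": 0.03, "Tchem": 0.07, "Tbio": 0.53, "Tdark": 0.35}


def representation_ratio(counts, expected_share):
    """Observed fraction in the gene set divided by the expected fraction."""
    total = sum(counts.values())
    return {tdl: (n / total) / expected_share[tdl] for tdl, n in counts.items()}


enrichment = representation_ratio(l1000_counts, proteome_share)
for tdl, ratio in enrichment.items():
    status = "over-represented" if ratio > 1 else "under-represented"
    print(f"{tdl}: {ratio:.2f} ({status})")
```

A ratio below 1 for Tdark and above 1 for the other three levels restates the pattern described in the text: landmark genes were drawn disproportionately from better-characterized proteins.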
Table 3 | Summary of clinical candidates (phase I–III) with activity against Tchem proteins
Disease category | GPCRs | Ion channels | Kinases | Other
Cancer | – | – | 41 (35) | 1 (1)
Central nervous system disorders | 7 (14) | 8 (5) | – | –
Inflammation and immune disorders | 5 (5) | 1 (1) | 5 (5) | –
Respiratory disorders | 3 (7) | 1 (1) | 1 (1) | –
Metabolic disorders | 3 (2) | – | – | –
Other | 3 (3) | – | 7 (6) | 1 (1)
Unmapped | 28 (58) | 11 (23) | 87 (175) | 5 (14)
Numbers in brackets indicate the number of unique clinical candidates. GPCRs, G protein-coupled receptors.

Table 4 | Examples of successful attempts of targeting the dark genome
Gene name | Relevant study (year) | Study type and reference | Citation count(a) | API name | Therapeutic indication | Market approval (year)
LEPR | 1995 | Receptor deorphanization99 | 4,100 | Metreleptin | Lipodystrophy | 2014
SMO | 1998 | Protein–disease association study100 | 1,195 | Vismodegib | Basal cell carcinoma | 2012
S1PR1 | 1998 | Receptor deorphanization101 | 968 | Fingolimod | Multiple sclerosis | 2010
HCRTR1 and HCRTR2 | 1998 | Receptor deorphanization102 | 4,608 | Suvorexant | Insomnia | 2014
PCSK9 | 2003 | Protein–disease association study103 | 1,840 | Evolocumab | Hypercholesterolaemia | 2015
GHSR | 1999 | Receptor deorphanization104 | 8,248 | Anamorelin(b) | Cachexia | Successful phase III clinical trial105
API, active pharmaceutical ingredient; GHSR, ghrelin receptor; HCRTR1, orexin receptor type 1; LEPR, leptin receptor; PCSK9,
proprotein convertase subtilisin/kexin type 9; S1PR1, sphingosine 1-phosphate receptor 1; SMO, smoothened homologue.
(a) Citation count for the ‘relevant study’ reference, according to Google Scholar, as of 28 December 2017.
(b) The European Medicines Agency refused marketing authorization for anamorelin in September 2017.
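
The interval between the key study and market approval can be read directly from TABLE 4. The short sketch below computes it for the five targets with an approval year; GHSR is left out because anamorelin has no approval year in the table. The dictionary is transcribed from TABLE 4, and the variable names are hypothetical.

```python
# (year of relevant study, year of market approval), transcribed from TABLE 4
table4 = {
    "LEPR": (1995, 2014),
    "SMO": (1998, 2012),
    "S1PR1": (1998, 2010),
    "HCRTR1/HCRTR2": (1998, 2014),
    "PCSK9": (2003, 2015),
}

# Years elapsed between the key paper and the first approved drug
lags = {gene: approval - study for gene, (study, approval) in table4.items()}

for gene, years in lags.items():
    print(f"{gene}: {years} years")
print(f"range: {min(lags.values())}-{max(lags.values())} years")
```

For these five examples the interval falls between 12 and 19 years, which sits within the 12–20 year window mentioned in the text.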
Successful ‘promotions’ across classes, such as those in TABLE 4, are currently rare. We expect
the rate of knowledge accumulation for Tdark proteins to be low, at least initially.
Well-studied proteins require multiple layers of management for diverse, rich sets of data, with
information and knowledge stemming from corpora such as biomedical literature, patents and
clinical trials. A paucity of data and lack of information for understudied proteins (Tbio and
Tdark) affect both knowledge management and the decision-making process with respect to
experimental planning, what research questions need to be asked (and in what order) and which
methods may be better suited for each task. For example, we examined access counts for human
proteins in STRING18 during 2016 (FIG. 2b). ‘Counts by name’ represents users that access the
STRING website and type in a gene symbol. ‘Counts by link’ represents users accessing the
network for a gene in STRING by linking to it from another resource (for example, GeneCards45 or
UniProt15). Whereas ‘Counts by link’ shows a more comprehensive method to access the entire
proteome, it also suggests that Tdark proteins have a lower probability of being recognized
(input) by gene name. These data show a pattern similar to that observed in FIG. 2a: Tdark
proteins are less likely to be the subject of scientific curiosity, which is a reflection of
funding patterns and an overall lack of information and molecular probes. Indeed, the paucity of
antibodies and small molecules (criteria that help define Tbio and Tchem, respectively) that
could be used to interrogate Tdark proteins diminishes our ability to subject Tdark proteins to
scientific inquiry.

Genomic and proteomic responses following radiation therapy are also understudied. One in vitro
study46 suggested that as many as one-third of the 10,174 genes examined in immortalized B cells
following ionizing radiation are radioresponsive (GSE26835 column46 in Supplementary Table S3).
Of the 447 genes with significant fold changes in the GSE26835 set, only 26 are Tclin and 61 are
Tchem, whereas the majority (268 Tbio, 92 Tdark) are understudied (see also Supplementary
Box S5).

As many as 3,644 proteins have significant disease (confirmed OMIM11 phenotype) associations.
Given their TDL assignments (335 Tclin, 543 Tchem and 2,766 Tbio), we examined the distribution
of the TDL in relation to druggable protein family categories (FIG. 1b). It appears that
Tbio–disease associations are quite rare for druggable families such as nuclear receptors, ion
channels and GPCRs, as these families are more likely to be in the Tclin or Tchem category.
Instead, Tbio assignments are quite frequent for transcription factors, epigenetic targets,
transporters and unassigned protein families. The exception among druggable families are
olfactory GPCRs, which appear to attract less interest from drug discovery programmes, despite
some of these GPCRs being linked to metabolism and ageing47.

Concerted efforts focused on an entire target class have sometimes led to new drugs. For
example, GlaxoSmithKline (GSK) had a comprehensive programme aimed at finding new ligands and
characterizing the biology of nuclear receptors48. New insights into bile acid metabolism49 and
xenobiotic transcription of cytochrome P450s50, mediated by nuclear receptors, were described.
A bile acid receptor (FXR) agonist has reached the market since this programme started:
obeticholic acid (Ocaliva), which was discovered by GSK in collaboration with the University of
Perugia51 and subsequently developed by Intercept Pharmaceuticals. Currently, several FXR
agonists are in clinical development52. Choosing the appropriate proteins as drug targets
remains a complex process, where scientific factors need to be balanced against commercial
factors (such as company investors and medical insurance companies) and societal factors (such
as physicians and patients), as well as legal factors (such as the requirements of regulatory
agencies)1.

Spotlight on G protein-coupled receptors
GPCRs are membrane-bound, cell-surface receptors that transduce signals via interactions with
heterotrimeric G proteins, arrestins and other cellular transducers53,54.

Alterations in GPCR signalling are implicated in the pathogenesis and treatment of
neuropsychiatric55, immunological56, gastrointestinal57,58, cardiac59, renal, hormonal,
infectious60 and many other disorders53,54. GPCRs represent the largest family of druggable
targets in the human genome9, with between 20% and 30% of approved drugs acting on them53,54.

The number of publications per GPCR and the number of chemicals associated with that GPCR in
ChEMBL were examined61 to determine which of the druggable, non-olfactory GPCRs are
understudied: less than 100 citations and less than ten ChEMBL compounds define understudied,
uninterrogated GPCRs. The number of publications, similar to the fractional PubMed publications
count13, does not take into account large-scale (many proteins per paper) analyses. Counting
ChEMBL compounds, a quantitative criterion similar to Tchem assignments, does not consider
bioactivity values. However, this independent analysis validates the more general TDL criteria
with respect to GPCR biological functions and corresponding chemical matter.

Tclin and Tchem G protein-coupled receptors. Currently 827 GPCRs (including 421 olfactory GPCRs)
are tracked by IDG KMC; of these, 96 are Tclin and 113 are Tchem (none of which are olfactory).
Slightly more than half of the non-olfactory GPCRs have annotated drugs and small molecules
targeting them53,54,61; see also TABLE 1. A recent analysis indicates, however, that a handful
of GPCRs (mainly biogenic amine, muscarinic and opioid receptors) represent the most abundantly
targeted receptors for FDA-approved medications53,54. GPCRs also represent important off-targets
for kinase inhibitors62,63, ion channel modulators64, anti-infectives65 and other classes of
drug-like molecules53,54. As with other druggable target classes, off-target actions within the
GPCR class can be associated with severe and life-threatening side effects. For example,
valvular heart disease is associated with anorectic agents, such as fenfluramine, and
antimigraine medications, such as ergotamine, via serotonin 5-HT2B receptor agonism66. Recent
successes in structure-guided and cheminformatics-driven drug discovery show promise for
creating safer and more effective medications targeting GPCRs.

Tbio and Tdark G protein-coupled receptors. Although 52 non-olfactory GPCRs are categorized as
Tdark, the availability of new screening platforms to discover chemical matter for these GPCRs
has begun the process of illumination64,67,68. Of 62 GPCRs for which significant phenotype calls
have been reported by IMPC (Supplementary Table S8), 24 are Tbio and 7 are Tdark; of these,
15 are associated with neurological and behavioural phenotypes. Including olfactory GPCRs,
618 proteins are classified as Tdark or Tbio; of these, 126 non-olfactory GPCRs and 51 olfactory
GPCRs have significant associations with human diseases via OMIM, GWAS and text mining. Whereas
the majority of these associations (nearly 59%) stem from text mining69, 48 GPCRs have confirmed
associations from at least two information channels. For example, Adgrb2-mutant mice (Tbio)
showed significant antidepressant-like behaviour compared with wild-type mice70, whereas the
association71 between schizophrenia and Fzd3 mutants (Tbio) remains controversial72,73.

Spotlight on protein kinases
Tclin and Tchem kinases. The ~600-member human kinome (Supplementary Table S3) is made up of
protein kinases, in addition to metabolic and lipid kinases, and is highly druggable using both
competitive and allosteric small-molecule inhibitors. However, the functions of about one-third
of the kinases in this family are poorly defined or unknown. The 634 human kinases were
categorized as follows: Tclin, N = 50; Tchem, N = 390; Tbio, N = 163; and Tdark, N = 31
(TABLE 1). Tclin kinases are not exclusively protein kinases, and the number of FDA-approved
small-molecule kinase inhibitors varies depending on inclusion criteria. Wu and colleagues74
found 38 small-molecule protein kinase inhibitors. Based on DrugCentral, we found 50 approved
kinase inhibitors, including 40 small-molecule protein kinase inhibitors, of which 32 are
FDA-approved, one FDA-approved protein kinase activator (ingenol mebutate) and the
phosphoinositide 3-kinase subunit-δ (PIK3CD) small-molecule inhibitor idelalisib, which is also
FDA-approved. An additional seven FDA-approved antibodies target the receptor tyrosine kinases
human epidermal growth factor receptor 2 (HER2; also known as ERBB2), epidermal growth factor
receptor (EGFR), vascular endothelial growth factor receptor 2 (VEGFR2) and platelet-derived
growth factor receptor-α (PDGFRα), and there is also an FDA-approved HER2-targeting
antibody–drug conjugate, trastuzumab emtansine (see Supplementary Table S9).

Tbio and Tdark kinases. A number of Tbio and Tdark kinases are known to interact with
FDA-approved multikinase inhibitors. According to data in DrugCentral, sorafenib inhibits
114 kinases, of which only 9 are associated MoA-related targets, whereas sunitinib inhibits
263 kinases, of which 9 are MoA-related targets. Given the current state of kinase inhibitor
chemistry, it is very likely that Tbio and Tdark kinases can be effectively therapeutically
targeted with highly selective small-molecule inhibitors. Some of the characteristics shared by
understudied Tbio and Tdark kinases include poorly defined integration of the kinase in
signalling networks, poorly defined function and regulation, lack of activation-loop
phospho-antibodies or immunohistochemistry-grade antibodies, and lack of selective chemical
tools for functional characterization. Primary tools for knockout and/or altered expression are
RNA interference (RNAi) and CRISPR–Cas9, and cDNAs for overexpression; kinase knockout or
altered expression rarely provides readily assayable phenotypes (for example, growth, migration,
apoptosis or in vivo function in mouse organ physiology). Currently, the IMPC has targeted
238 kinases with 114 knockouts having a significant phenotype; of the latter, 22 are of current
interest for phase 2 of the IDG programme.

Many Tchem, Tbio or Tdark kinases are altered in expression or mutated in TCGA. TABLE 5 shows
ten Tbio or Tdark kinases whose amplification is observed in the TCGA

database, together with their RNA expression in triple-negative breast cancer (TNBC) cells. These understudied kinases are frequently altered in breast cancer. The potential increased expression of many kinases in primary human tumours suggests these understudied kinases have important functions for the tumour cell phenotype that have not been characterized to date. These represent unexplored kinases with possible therapeutic utility.

The potential therapeutic importance of Tbio and Tdark kinases in the kinome is highlighted by a recent clinical study that assessed the response to trametinib, a MEK1 and MEK2 inhibitor, in TNBC patients⁷⁵. Pretreatment needle biopsies and surgical tumour resections following 7-day trametinib treatment were used for RNA sequencing (RNA-seq) to analyse tumour transcriptomic changes in response to the drug. Pretreatment biopsies matched to post-treatment surgical specimens showed overall concordance of the transcriptional kinase response to trametinib, with FRK (Tchem) exhibiting the highest mean increase and cytoplasmic BMX (also Tchem) exhibiting the highest mean decrease among patients in response to a 7-day drug treatment. Among the kinases transcriptionally altered in the TNBC tumours were several understudied kinases, including MRCK-γ (also known as CDC42BPG), PRKACB, STK32B and leukocyte tyrosine kinase receptor (LTK). These findings demonstrate that in TNBC tumours in patients, members of the understudied Tbio and Tdark kinome are co-regulated transcriptionally with kinases from the Tclin and Tchem categories, in a dynamic adaptive response to targeted inhibition.

Spotlight on ion channels
Ion channels mediate signalling within cells, between cells, and between cells and their environment. Defects in ion channels underlie many major disorders in humans, also known as channelopathies, including neuronal disorders⁷⁶, diabetes⁷⁷ and heart failure⁷⁸. This makes ion channels an attractive target class for drug development. Ion channels are mostly heteromeric complexes that require optimal interactions with ligands at specific locations. Currently, 355 ion channel pore-forming and auxiliary subunits are tracked by IDG KMC (TABLE 1; Supplementary Table S3). About 100 ion channel modulators, including auxiliary subunits, are reported, but to our knowledge, a systematic list of cell type-specific auxiliary subunits for all ion channels is not available.

Tclin and Tchem ion channels. Many drugs are known to bind to ion channels. There are 217 drugs annotated in DrugCentral¹⁰ as acting through 125 (Tclin) ion channels for the MoA⁹. The number of drugs increases to 497 when querying how many drugs are known to interact with ion channels outside of the MoA-related constraint. Some of these interactions are likely to be responsible for side effects such as cardiac toxicity. An accurate understanding of MoA and side-effect assignment at the target (molecular) level is required if we are to improve upon available drugs. For example, the anaesthetic ketamine, which has been postulated to act as a noncompetitive N-methyl-D-aspartate (NMDA) antagonist⁷⁹, has been used off-label as an antidepressant⁸⁰. However, in-depth analysis of the antidepressant effects of ketamine found that its active metabolite (2R,6R)-hydroxynorketamine (HNK) does not block the NMDA receptor. Instead, HNK displays sustained activation of α-amino-3-hydroxy-5-methyl-4-isoxazole propionic acid (AMPA) receptors and lacks ketamine-related side effects⁸¹. This may pave the way for the development of novel, rapid-acting antidepressants. It is therefore conceivable that some ion channels currently categorized as Tclin or Tchem are in need of further illumination with respect to MoA and drug specificity. Indeed, the low bioactivity cut-off criterion for ion channels (≤10 μM) in Tchem (see also Supplementary Figure S2) may need revision, given that older drugs continue to reveal unexpected modes of action.
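As a rough illustration of the ion channel counts and the Tchem cut-off discussed above, the following Python sketch separates MoA-annotated drugs from the remaining ion channel binders and applies the ≤10 μM criterion to a potency value. The counts and the cut-off are those quoted in the text; the function names, the example potencies and the assumption that the 497 drugs include the 217 MoA-annotated ones are illustrative only and do not describe the IDG KMC pipeline.

```python
# Counts quoted in the text for ion channels (DrugCentral annotations).
MOA_DRUGS = 217   # drugs acting through Tclin ion channels for the MoA
ALL_DRUGS = 497   # drugs known to interact with ion channels at all

# Tchem bioactivity cut-off for ion channels quoted in the text (micromolar).
ION_CHANNEL_CUTOFF_UM = 10.0


def non_moa_drugs(all_drugs: int, moa_drugs: int) -> int:
    """Number of drugs that bind ion channels outside the MoA annotation.

    Assumption: the larger count includes the MoA-annotated drugs.
    """
    return all_drugs - moa_drugs


def meets_cutoff(potency_um: float, cutoff_um: float = ION_CHANNEL_CUTOFF_UM) -> bool:
    """Return True if a potency (in micromolar) satisfies the cut-off."""
    return potency_um <= cutoff_um


if __name__ == "__main__":
    extra = non_moa_drugs(ALL_DRUGS, MOA_DRUGS)
    print(f"{extra} of {ALL_DRUGS} drugs bind ion channels outside the MoA annotation")
    # Hypothetical potencies in micromolar, for illustration only.
    for potency in (0.5, 10.0, 25.0):
        print(potency, meets_cutoff(potency))
```

Keeping the cut-off as a parameter rather than a constant inside the function reflects the point made in the text that this criterion may need revision.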
Table 5 | Understudied kinases that are frequently altered in breast cancer

Gene name | Protein name | TDL | SUM159 averageᵃ | Alteration frequencyᵇ (%) | Kinase family
TRIB1 | Tribbles homologue 1 | Tbio | 342 | 24 | CAMK
RPS6KC1 | Ribosomal protein S6 kinase δ1 | Tdark | 463 | 23 | Other
UHMK1 | Serine/threonine-protein kinase Kist | Tbio | 1,308 | 23 | Other
NRBP2 | Nuclear receptor-binding protein 2 | Tdark | 350 | 21 | Other
PIP5K1A | Phosphatidylinositol 4-phosphate 5-kinase type 1-α | Tbio | 2,246 | 19 | Metabolic
CDK12 | Cyclin-dependent kinase 12 | Tbio | 3,148 | 14 | CMGC
MAP3K1 | Mitogen-activated protein kinase kinase kinase 1 | Tbio | 455 | 11 | STE
STRADA | STE20-related kinase adapter protein-α | Tbio | 464 | 9 | STE
BCKDK | (3-methyl-2-oxobutanoate dehydrogenase (lipoamide)) kinase, mitochondrial | Tbio | 1,087 | 6 | Atypical
EEF2K | Eukaryotic elongation factor 2 kinase | Tbio | 1,176 | 6 | Atypical

ᵃExpression levels in SUM159PT claudin-low triple-negative breast cancer (TNBC) cells. ᵇAlteration frequency in breast cancer from The Cancer Genome Atlas (TCGA). TDL, target development level.
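For readers who wish to work with Table 5 programmatically, a minimal Python sketch follows. The values are transcribed from the table; the `Kinase` record, its field names and the two helper functions are hypothetical names introduced here for illustration.

```python
from typing import List, NamedTuple


class Kinase(NamedTuple):
    gene: str
    tdl: str               # target development level
    sum159_average: int    # expression level in SUM159PT cells
    alteration_pct: int    # alteration frequency in TCGA breast cancer (%)
    family: str


# Values transcribed from Table 5.
TABLE_5: List[Kinase] = [
    Kinase("TRIB1", "Tbio", 342, 24, "CAMK"),
    Kinase("RPS6KC1", "Tdark", 463, 23, "Other"),
    Kinase("UHMK1", "Tbio", 1308, 23, "Other"),
    Kinase("NRBP2", "Tdark", 350, 21, "Other"),
    Kinase("PIP5K1A", "Tbio", 2246, 19, "Metabolic"),
    Kinase("CDK12", "Tbio", 3148, 14, "CMGC"),
    Kinase("MAP3K1", "Tbio", 455, 11, "STE"),
    Kinase("STRADA", "Tbio", 464, 9, "STE"),
    Kinase("BCKDK", "Tbio", 1087, 6, "Atypical"),
    Kinase("EEF2K", "Tbio", 1176, 6, "Atypical"),
]


def by_tdl(rows: List[Kinase], tdl: str) -> List[Kinase]:
    """Select the rows with a given target development level."""
    return [row for row in rows if row.tdl == tdl]


def most_expressed(rows: List[Kinase]) -> Kinase:
    """Return the row with the highest SUM159 expression value."""
    return max(rows, key=lambda row: row.sum159_average)


if __name__ == "__main__":
    print([k.gene for k in by_tdl(TABLE_5, "Tdark")])
    print(most_expressed(TABLE_5).gene)
```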

Tdark ion channels. A relatively small number of ion channels (31) are categorized as Tdark (TABLE 1). Part of the difficulty in illuminating dark ion channels is the replication of physiological context and expression of proteins in the appropriate heteromeric, pore-forming functional complexes. Currently, there are no scalable systems available to study the localization of functional complexes. Moreover, most ion channels have paralogues that function redundantly. Gene redundancy increases the difficulty of revealing phenotype and precise localization, both important elements for understanding physiological functions other than ion channel activity. This considerably delays our progress in understanding ion channel function in vivo and their role in human health. Unlike GPCRs or kinases, neither pore-forming subunits nor auxiliary subunits share characteristic motifs. Lack of specific protein sequence motifs makes it difficult to flag candidate genes for further study, even with computational assertions. There could be other ion channels, which perhaps should be categorized as Tdark.dark, to reflect our complete lack of knowledge, even by computational means, regarding these proteins.

The list of ion channel pore-forming subunits, as well as auxiliary subunits, continues to grow. For example, leucine-rich repeat-containing protein 8 (LRRC8) heteromers form⁸² volume-regulated anion channels (VRACs), and ORAI proteins assemble to form⁸³ calcium-release-activated calcium channels (CRACs), whereas anoctamins are olfactory calcium-activated chloride channels (CACCs)⁸⁴. Currently, LRRC8B, LRRC8C and LRRC8D subunits are classified as Tdark, with the exception being the subunit LRRC8A (Tbio); ORAI1 is annotated as Tchem, whereas the ORAI2 and ORAI3 proteins are annotated as Tbio. With the exception of anoctamin 1 (Tclin), all other anoctamins are labelled Tbio. These, and all other Tdark proteins that lack computational assertions, are in need of systematic genomic-scale studies.

Conclusions
Modern medicine often employs artificial distinctions in terms of what and how biological systems are studied: segregated by organ (for example, ophthalmology and cardiology) or by disease (for example, oncology and infectious diseases), medical specialty separations carry over into the research arena, both in academia and industry. This distinction breaks down in nature, as we are likely to observe the interplay between the same genes and pathways regardless of organ, albeit in a context-specific manner. These artificial divisions can prevent scientists from achieving a translational, integrative view of gene and protein function. We suspect this to be another reason why funding to study Tdark proteins is scarce: for functionally enigmatic proteins, or the ‘ignorome’ (REF. 85), anticipating which organ, disease or phenotype is relevant may be far from trivial. To address this limitation, the NIH launched a series of high-risk programmes via the Common Fund resource, aimed at catalysing transdisciplinary research. The IDG is one such Common Fund programme. The IDG programme’s ostensible goal is to encourage and track the illumination of relatively understudied and unstudied parts of the genome. This implicitly requires the construction of a knowledge base objectively and in an unbiased manner, asserting what is currently believed to be true (a process that is explored metaphorically in a classic book by Italo Calvino, ‘The Castle of Crossed Destinies’; see Supplementary Box S1). The IDG KMC enables us to quantitatively demonstrate the existence of a knowledge deficit with respect to dark and understudied proteins, which underscores the need for basic science and its major role in illuminating gene functions and roles in human disease. The TDL classification scheme provides a convenient way to partition human targets that highlights the focus (or lack thereof) of science and drug discovery efforts on different targets. Through the use of the TDL groupings, we can highlight knowledge accumulation, as well as deficits, for a variety of target families, with a common theme being that while much is known, there remains a large fraction of the proteome that is understudied. The IDG KMC, by collating and linking a plethora of disparate and diverse data sources and data types, aims to shed light on these dark regions with the hope that researchers will be empowered to use the data and knowledge presented by the KMC to jumpstart research programmes on these targets.

Confirmed associations with a specific disease, or receptor deorphanization (TABLE 4), remain major incentives to allocate resources and further study of Tdark proteins. As mentioned above, the only other deliberate targeted effort to study Tdark proteins in addition to IDG is the IMPC. As of March 2017, mouse lines corresponding to 4,165 human genes have been produced, with phenotypes available, 2,788 of which have resulted in statistically significant phenotype calls. Of these 2,788 proteins, 827 (436 Tdark) are not associated with any NIH-funded grants between 2000 and 2015 (Supplementary Table S8). By contrast, only 120 of the 1,961 proteins with significant IMPC phenotype calls and that are associated with NIH funding are Tdark. It was Edgar Allan Poe who once said, “the enormous multiplication of books in every branch of knowledge is one of the greatest evils of this age, since it presents one of the most serious obstacles to the acquisition of correct information”. Poe’s 19th century line of thought is remarkably apt in the context of current KMC activities, since the “acquisition of correct information” remains the largest challenge.

Another challenge relates to an area of knowledge largely neglected in the scientific literature: the large-scale capture of negative results. Due to confirmation bias⁸, scientists have a tendency to primarily publish successful accounts of research. Although there are attempts to overcome this problem⁸⁶, we are not aware of the existence of an unbiased, easy mechanism to capture negative results. The aphorism “absence of evidence is not evidence of absence” illustrates practical limitations of knowledge management systems: does lack of evidence imply that the study was conducted, but nothing was found, or does it imply (more often) that the measurement was not carried out? Proper archiving of negative results (for example, “protein P is not expressed in cell type CT” or “gene mutation Gm does not play a role in disease D”) would benefit the community at large and would improve our specific knowledge about proteins. However, such non-positive
facts fit poorly to current publishing and citation paradigms. One possibility for archiving such statements could be nanopublications⁸⁷, which would be amenable to large-scale integration into systems such as TCRD–Pharos.

Finally, a key challenge faced by IDG KMC, and perhaps by other data analysts working in drug discovery, is that of reliable predictions: when examining Tdark proteins in particular, experimentalists would like to know what experiment to do next, what phenotypic changes should be examined first and which pathway is relevant in a specific disease. These, and other similar questions, have yet to find a computer-driven, reliable answer. To paraphrase William Gibson, “The truth is already here – it’s just not very evenly distributed”.
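The distinction drawn in these Conclusions, between a study that was conducted and found nothing and a measurement that was never carried out, can be made explicit in a data model. The Python sketch below is one possible way to do so and is not a description of TCRD–Pharos or of nanopublications; the class names, the three evidence states and the example statements are hypothetical.

```python
from enum import Enum
from typing import Dict, Tuple


class Evidence(Enum):
    POSITIVE = "tested and observed"
    NEGATIVE = "tested and not observed"
    NOT_MEASURED = "no experiment recorded"


class AssertionStore:
    """Keeps explicit positive and negative results apart from missing data."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], Evidence] = {}

    def record(self, subject: str, context: str, observed: bool) -> None:
        """Archive the outcome of an experiment that was actually performed."""
        state = Evidence.POSITIVE if observed else Evidence.NEGATIVE
        self._records[(subject, context)] = state

    def lookup(self, subject: str, context: str) -> Evidence:
        """Absence of a record means 'not measured', not 'negative'."""
        return self._records.get((subject, context), Evidence.NOT_MEASURED)


if __name__ == "__main__":
    store = AssertionStore()
    # Hypothetical statements modelled on the examples in the text.
    store.record("protein P", "expressed in cell type CT", observed=False)
    print(store.lookup("protein P", "expressed in cell type CT"))
    print(store.lookup("gene mutation Gm", "role in disease D"))
```

Under this design, a query for an unrecorded pair returns a distinct state rather than a negative result, which is the ambiguity the aphorism “absence of evidence is not evidence of absence” points to.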
1. Knowles, J. & Gromo, G. Target selection in drug discovery. Nat. Rev. Drug Discov. 2, 63–69 (2003).
2. Edwards, A. M. et al. Too many roads not taken. Nature 470, 163–165 (2011).
3. Alberts, B., Kirschner, M. W., Tilghman, S. & Varmus, H. Rescuing US biomedical research from its systemic flaws. Proc. Natl Acad. Sci. USA 111, 5773–5777 (2014).
4. Kim, S. et al. PubChem Substance and Compound databases. Nucleic Acids Res. 44, D1202–D1213 (2016).
5. Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Res. 45, D945–D954 (2017).
6. Tomczak, K., Czerwińska, P. & Wiznerowicz, M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp. Oncol. 19, A68–A77 (2015).
7. Munafò, M. R. et al. A manifesto for reproducible science. Nat. Hum. Behav. 1, 0021 (2017).
8. Nickerson, R. S. Confirmation bias: a ubiquitous phenomenon in many guises. Rev. Gen. Psychol. 2, 175–220 (1998).
9. Santos, R. et al. A comprehensive map of molecular drug targets. Nat. Rev. Drug Discov. 16, 19–34 (2017).
10. Ursu, O. et al. DrugCentral: online drug compendium. Nucleic Acids Res. 45, D932–D939 (2017).
11. Amberger, J., Bocchini, C. A., Scott, A. F. & Hamosh, A. McKusick’s Online Mendelian Inheritance in Man (OMIM). Nucleic Acids Res. 37, D793–D796 (2009).
12. Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).
13. Pletscher-Frankild, S., Pallejà, A., Tsafou, K., Binder, J. X. & Jensen, L. J. Diseases: text mining and data integration of disease-gene associations. Methods 74, 83–89 (2015).
14. Kiermer, V. Antibodypedia. Nat. Methods 5, 860–861 (2008).
15. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).
16. Papadatos, G. et al. SureChEMBL: a large-scale, chemically annotated patent document database. Nucleic Acids Res. 44, D1220–1228 (2016).
17. Rouillard, A. D. et al. The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. Database 2016, baw100 (2016).
18. Szklarczyk, D. et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, D447–D452 (2015).
19. Hajduk, P. J., Huth, J. R. & Tse, C. Predicting protein druggability. Drug Discov. Today 10, 1675–1682 (2005).
20. Hopkins, A. L. & Groom, C. R. The druggable genome. Nat. Rev. Drug Discov. 1, 727–730 (2002).
21. Surade, S. & Blundell, T. L. Structural biology and drug discovery of difficult targets: the limits of ligandability. Chem. Biol. 19, 42–50 (2012).
22. Kubinyi, H. Drug research: myths, hype and reality. Nat. Rev. Drug Discov. 2, 665–668 (2003).
23. Yang, X. et al. Widespread expansion of protein interaction capabilities by alternative splicing. Cell 164, 805–817 (2016).
24. Mestres, J., Gregori-Puigjané, E., Valverde, S. & Solé, R. V. Data completeness – the Achilles heel of drug-target networks. Nat. Biotechnol. 26, 983–984 (2008).
25. Schreiber, S. L. et al. Advancing biological understanding and therapeutics discovery with small-molecule probes. Cell 161, 1252–1265 (2015).
26. Austin, C. P., Brady, L. S., Insel, T. R. & Collins, F. S. NIH molecular libraries initiative. Science 306, 1138–1139 (2004).
27. Southan, C. et al. The IUPHAR/BPS guide to pharmacology in 2016: towards curated quantitative interactions between 1300 protein targets and 6000 ligands. Nucleic Acids Res. 44, D1054–D1068 (2016).
28. Waring, M. J. et al. An analysis of the attrition of drug candidates from four major pharmaceutical companies. Nat. Rev. Drug Discov. 14, 475–486 (2015).
29. Hunter, S. et al. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 40, D306–312 (2012).
30. Kruger, F. A., Gaulton, A., Nowotka, M. & Overington, J. P. PPDMs-a resource for mapping small molecule bioactivities from ChEMBL to Pfam-A protein domains. Bioinformatics 31, 776–778 (2015).
31. Campillos, M., Kuhn, M., Gavin, A.-C., Jensen, L. J. & Bork, P. Drug target identification using side-effect similarity. Science 321, 263–266 (2008).
32. Keiser, M. J. et al. Predicting new molecular targets for known drugs. Nature 462, 175–181 (2009).
33. Huang, X. & Dixit, V. M. Drugging the undruggables: exploring the ubiquitin system for drug development. Cell Res. 26, 484–498 (2016).
34. Lai, A. C. & Crews, C. M. Induced protein degradation: an emerging drug discovery paradigm. Nat. Rev. Drug Discov. 16, 101–114 (2017).
35. Sakamoto, K. M. et al. Protacs: chimeric molecules that target proteins to the Skp1–Cullin–F box complex for ubiquitination and degradation. Proc. Natl Acad. Sci. USA 98, 8554–8559 (2001).
36. Gadd, M. S. et al. Structural basis of PROTAC cooperative recognition for selective protein degradation. Nat. Chem. Biol. 13, 514–521 (2017).
37. Mungall, C. J. et al. The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 45, D712–D722 (2017).
38. Dickinson, M. E. et al. High-throughput discovery of novel developmental phenotypes. Nature 537, 508–514 (2016).
39. MacArthur, J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 45, D896–D901 (2017).
40. GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
41. GTEx Consortium et al. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
42. Uhlén, M. et al. Proteomics. Tissue-based map of the human proteome. Science 347, 1260419 (2015).
43. Kim, M.-S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014).
44. Lenat, D. B. & Feigenbaum, E. A. On the thresholds of knowledge. Artif. Intell. 47, 185–250 (1991).
45. Fishilevich, S. et al. Genic insights from integrated human proteomics in GeneCards. Database 2016, baw030 (2016).
46. Smirnov, D. A. et al. Genetic variation in radiation-induced cell death. Genome Res. 22, 332–339 (2012).
47. Garrison, J. L. & Knight, Z. A. Linking smell to metabolism and aging. Science 358, 718–719 (2017).
48. Kliewer, S. A., Lehmann, J. M. & Willson, T. M. Orphan nuclear receptors: shifting endocrinology into reverse. Science 284, 757–760 (1999).
49. Willson, T. M., Jones, S. A., Moore, J. T. & Kliewer, S. A. Chemical genomics: functional analysis of orphan nuclear receptors in the regulation of bile acid metabolism. Med. Res. Rev. 21, 513–522 (2001).
50. Moore, L. B. et al. Orphan nuclear receptors constitutive androstane receptor and pregnane X receptor share xenobiotic and steroid ligands. J. Biol. Chem. 275, 15122–15127 (2000).
51. Pellicciari, R. et al. 6alpha-ethyl-chenodeoxycholic acid (6-ECDCA), a potent and selective FXR agonist endowed with anticholestatic activity. J. Med. Chem. 45, 3569–3572 (2002).
52. Hambruch, E., Kinzel, O. & Kremoser, C. On the pharmacology of farnesoid X receptor agonists: give me an ‘A’, like in ‘acid’. Nucl. Recep. Res. 3, 101207 (2016).
53. Wacker, D., Stevens, R. C. & Roth, B. L. How ligands illuminate GPCR molecular pharmacology. Cell 170, 414–427 (2017).
54. Roth, B. L., Irwin, J. J. & Shoichet, B. K. Discovery of new GPCR ligands to illuminate new biology. Nat. Chem. Biol. 13, 1143–1151 (2017).
55. Roth, B. L., Sheffler, D. J. & Kroeze, W. K. Magic shotguns versus magic bullets: selectively non-selective drugs for mood disorders and schizophrenia. Nat. Rev. Drug Discov. 3, 353–359 (2004).
56. Hernandez, P. A. et al. Mutations in the chemokine receptor gene CXCR4 are associated with WHIM syndrome, a combined immunodeficiency disease. Nat. Genet. 34, 70–74 (2003).
57. Sternini, C. Receptors and transmission in the brain-gut axis: potential for novel therapies. III. Mu-opioid receptors in the enteric nervous system. Am. J. Physiol. Gastrointest. Liver Physiol. 281, G8–15 (2001).
58. Sternini, C. Taste receptors in the gastrointestinal tract. IV. Functional implications of bitter taste receptors in gastrointestinal chemosensing. Am. J. Physiol. Gastrointest. Liver Physiol. 292, G457–461 (2007).
59. Rockman, H. A., Koch, W. J. & Lefkowitz, R. J. Seven-transmembrane-spanning receptors and heart function. Nature 415, 206–212 (2002).
60. Elphick, G. F. et al. The human polyomavirus, JCV, uses serotonin receptors to infect cells. Science 306, 1380–1383 (2004).
61. Roth, B. L. & Kroeze, W. K. Integrated approaches for genome-wide interrogation of the druggable non-olfactory G protein-coupled receptor superfamily. J. Biol. Chem. 290, 19471–19477 (2015).
62. Elkins, J. M. et al. Comprehensive characterization of the Published Kinase Inhibitor Set. Nat. Biotechnol. 34, 95–103 (2016).
63. Lin, X. et al. Life beyond kinases: structure-based discovery of sorafenib as nanomolar antagonist of 5-HT receptors. J. Med. Chem. 55, 5749–5759 (2012).
64. Huang, X.-P. et al. Allosteric ligands for the pharmacologically dark receptors GPR68 and GPR65. Nature 527, 477–483 (2015).
65. Chan, J. D. et al. The anthelmintic praziquantel is a human serotoninergic G-protein-coupled receptor ligand. Nat. Commun. 8, 1910 (2017).
66. Roth, B. L. Drugs and valvular heart disease. N. Engl. J. Med. 356, 6–9 (2007).
67. Kroeze, W. K. et al. PRESTO-Tango as an open-source resource for interrogation of the druggable human GPCRome. Nat. Struct. Mol. Biol. 22, 362–369 (2015).
68. Lansu, K. et al. In silico design of novel probes for the atypical opioid receptor MRGPRX2. Nat. Chem. Biol. 13, 529–536 (2017).
69. Pafilis, E. et al. The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE 8, e65390 (2013).
70. Okajima, D., Kudo, G. & Yokota, H. Antidepressant-like behavior in brain-specific angiogenesis inhibitor 2-deficient mice. J. Physiol. Sci. 61, 47–54 (2011).
71. Katsu, T. et al. The human frizzled-3 (FZD3) gene on chromosome 8p21, a receptor gene for Wnt ligands, is associated with the susceptibility to schizophrenia. Neurosci. Lett. 353, 53–56 (2003).
72. Wei, J. & Hemmings, G. P. Lack of a genetic association between the frizzled-3 gene and schizophrenia in a British population. Neurosci. Lett. 366, 336–338 (2004).
73. Jeong, S. H., Joo, E. J., Ahn, Y. M., Lee, K. Y. & Kim, Y. S. Investigation of genetic association between human Frizzled homolog 3 gene (FZD3) and schizophrenia: results in a Korean population and evidence from meta-analysis. Psychiatry Res. 143, 1–11 (2006).
74. Wu, P., Nielsen, T. E. & Clausen, M. H. Small-molecule kinase inhibitors: an analysis of FDA-approved drugs. Drug Discov. Today 21, 5–10 (2016).
75. Zawistowski, J. S. et al. Enhancer remodeling during adaptive bypass to MEK inhibition is attenuated by pharmacologic targeting of the P-TEFb complex. Cancer Discov. 7, 302–321 (2017).
76. Kullmann, D. M. The neuronal channelopathies. Brain 125, 1177–1195 (2002).
77. Gloyn, A. L. et al. Large-scale association studies of variants in genes encoding the pancreatic beta-cell KATP channel subunits Kir6.2 (KCNJ11) and SUR1 (ABCC8) confirm that the KCNJ11 E23K variant is associated with type 2 diabetes. Diabetes 52, 568–572 (2003).
78. Marbán, E. Cardiac channelopathies. Nature 415, 213–218 (2002).
79. Berman, R. M. et al. Antidepressant effects of ketamine in depressed patients. Biol. Psychiatry 47, 351–354 (2000).
80. Kirby, T. Ketamine for depression: the highs and lows. Lancet Psychiatry 2, 783–784 (2015).
81. Zanos, P. et al. NMDAR inhibition-independent antidepressant actions of ketamine metabolites. Nature 533, 481–486 (2016).
82. Pedersen, S. F., Klausen, T. K. & Nilius, B. The identification of a volume-regulated anion channel: an amazing Odyssey. Acta Physiol. 213, 868–881 (2015).
83. Niemeyer, B. A. Changing calcium: CRAC channel (STIM and Orai) expression, splicing, and posttranslational modifiers. Am. J. Physiol. Cell Physiol. 310, C701–709 (2016).
84. Dauner, K., Lissmann, J., Jeridi, S., Frings, S. & Möhrlen, F. Expression patterns of anoctamin 1 and anoctamin 2 chloride channels in the mammalian nose. Cell Tissue Res. 347, 327–341 (2012).
85. Pandey, A. K., Lu, L., Wang, X., Homayouni, R. & Williams, R. W. Functionally enigmatic genes: a case study of the brain ignorome. PLoS ONE 9, e88889 (2014).
86. Pfeffer, C. & Olsen, B. R. Editorial: Journal of negative results in biomedicine. J. Negat. Results Biomed. 1, 2 (2002).
87. Groth, P., Gibson, A. & Velterop, J. The anatomy of a nanopublication. Inf. Serv. Use 30, 51–56 (2010).
88. Agarwal, P. & Searls, D. B. Can literature analysis identify innovation drivers in drug discovery? Nat. Rev. Drug Discov. 8, 865–878 (2009).
89. Nguyen, D.-T. et al. Pharos: Collating protein information to shed light on the druggable genome. Nucleic Acids Res. 45, D995–D1002 (2017).
90. Wishart, D. S. et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46, D1074–D1082 (2017).
91. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169 (2017).
92. Griffith, M. et al. CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nat. Genet. 49, 170–174 (2017).
93. Koscielny, G. et al. Open Targets: a platform for therapeutic target identification and validation. Nucleic Acids Res. 45, D985–D994 (2017).
94. Lin, Y. et al. Drug target ontology to classify and integrate drug discovery data. J. Biomed. Semant. 8, 50 (2017).
95. Maggon, K. Best-selling human medicines 2002–2004. Drug Discov. Today 10, 739–742 (2005).
96. Stebbins, S. The world’s 15 top selling drugs. 24/7 Wall St. http://247wallst.com/special-report/2016/04/26/top-selling-drugs-in-the-world/ (2016).
97. Hauser, A. S., Attwood, M. M., Rask-Andersen, M., Schiöth, H. B. & Gloriam, D. E. Trends in GPCR drug discovery: new agents, targets and indications. Nat. Rev. Drug Discov. 16, 829–842 (2017).
98. Shih, H.-P., Zhang, X. & Aronov, A. M. Drug discovery effectiveness from the standpoint of therapeutic mechanisms and indications. Nat. Rev. Drug Discov. 17, 19–33 (2018).
99. Tartaglia, L. A. et al. Identification and expression cloning of a leptin receptor, OB-R. Cell 83, 1263–1271 (1995).
100. Xie, J. et al. Activating Smoothened mutations in sporadic basal-cell carcinoma. Nature 391, 90–92 (1998).
101. Lee, M. J. et al. Sphingosine-1-phosphate as a ligand for the G protein-coupled receptor EDG-1. Science 279, 1552–1555 (1998).
102. Sakurai, T. et al. Orexins and orexin receptors: a family of hypothalamic neuropeptides and G protein-coupled receptors that regulate feeding behavior. Cell 92, 573–585 (1998).
103. Abifadel, M. et al. Mutations in PCSK9 cause autosomal dominant hypercholesterolemia. Nat. Genet. 34, 154–156 (2003).
104. Kojima, M. et al. Ghrelin is a growth-hormone-releasing acylated peptide from stomach. Nature 402, 656–660 (1999).
105. Temel, J. S. et al. Anamorelin in patients with non-small-cell lung cancer and cachexia (ROMANA 1 and ROMANA 2): results from two randomised, double-blind, phase 3 trials. Lancet Oncol. 17, 519–531 (2016).

Acknowledgements
This work was supported by US National Institutes of Health (NIH) grants U54 CA189205 and U24 224370 (Illuminating the Druggable Genome Knowledge Management Center (IDG KMC)) at the University of New Mexico, Novo Nordisk Foundation Center for Protein Research, European Bioinformatics Institute (EBI) and University of Miami, U54 CA189201 and U24 CA224260 (A.M., Mount Sinai), P30 CA118100 (T.I.O., G.N.G. and L.A.S., UNM) and UL1 TR001449 (T.I.O. and L.A.S.), UM1 HG006370 (International Mouse Phenotyping Consortium, T.F.M. and I.T.), U01 MH104974 (B.L.R.), U01 MH104984 (S.T.), U01 MH105028 (M.T.M.), U01 MH105026 (J.Q. and A.M., Baylor) and U01 MH104999, R01 CA177993 and U24 DK116204 (S.G. and G.L.J.) and by the European Molecular Biology Laboratory (EMBL) and Wellcome Trust Strategic Awards WT086151/Z/08/Z and WT104104/Z/14/Z (A.G., A.H., A.R.L., A.K., J.P.O., and G.P.); and by Novo Nordisk Foundation Denmark grant NNF14CC0001 (S.B., L.J.L. and D.W.). R.G., A.J., D.T.N., A.S., N.S., and G.Z.K. were supported by the Intramural Research Program, National Center for Advancing Translational Sciences (NCATS) and by U54 CA189205. Dedicated to Francisc Schneider (1933–2017).

Competing interests
The authors declare competing interests: see Web version for details.

Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

RELATED LINKS
Anamorelin: https://en.wikipedia.org/wiki/Anamorelin
ATC codes: https://www.whocc.no/atc_ddd_index/
Channelopathy: https://en.wikipedia.org/wiki/Channelopathy
ChemIDplus: https://chem.nlm.nih.gov/chemidplus/
ClinicalTrials.gov: https://clinicaltrials.gov/
Cochrane Collaboration: http://www.cochrane.org/
DrugCentral: http://drugcentral.org/
Drug Target Ontology: http://drugtargetontology.org/
Edgar Allan Poe quotes: https://quoteinvestigator.com/2017/07/12/many-books/
Evidence of absence: https://en.wikipedia.org/wiki/Evidence_of_absence
Harmonizome: http://amp.pharm.mssm.edu/Harmonizome/
Illuminating the Druggable Genome: https://commonfund.nih.gov/idg
IMPC (International Mouse Phenotyping Consortium): http://www.mousephenotype.org/
IMPC information about Adgrd1: http://www.mousephenotype.org/data/genes/MGI:3041203
IMPC information about Alpk3: http://www.mousephenotype.org/data/genes/MGI:2151224
Integrity | Available from Clarivate Analytics at: https://clarivate.com/products/integrity/
L1000: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL20573
MIDAS Platform | IQVIA analytics platform for industry-leading sales and medical data available at: https://www.iqvia.com/solutions/commercialization/geographies/midas
Monarch Initiative: http://www.monarchinitiative.org/
NIH Common Fund: https://commonfund.nih.gov/
NIH ExPORTER: https://exporter.nih.gov/
NIH RePORTER: https://projectreporter.nih.gov/reporter.cfm
Pharos: https://pharos.nih.gov
PubMed: https://pubmed.gov
STRING: https://string-db.org/
The Cancer Genome Atlas: https://cancergenome.nih.gov/
Target Central Resource Database: http://juniper.health.unm.edu/tcrd/
William Gibson Wikiquotes: https://en.wikiquote.org/wiki/William_Gibson

SUPPLEMENTARY INFORMATION
See online article: S1 (box) | S2 (figure) | S3 (table) | S4 (table) | S5 (box) | S6 (box) | S7 (table) | S8 (table) | S9 (table) | S10 (table) | S11 (figure)

© 2018 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

NATURE REVIEWS | DRUG DISCOVERY VOLUME 17 | MAY 2018 | 317
