
Article


Annotating Diseases Using Human Phenotype Ontology Improves
Prediction of Disease-Associated Long Non-coding RNAs

Duc-Hau Le 1, 2 and Lan T.M. Dao 2


1 - School of Computer Science and Engineering, Thuyloi University, 175 Tay Son, Dong Da, Hanoi, Vietnam
2 - Vinmec Research Institute of Stem Cell and Gene Technology, 458 Minh Khai, Hai Ba Trung, Hanoi, Vietnam

Correspondence to Duc-Hau Le: School of Computer Science and Engineering, Thuyloi University, 175 Tay Son, Dong Da,
Hanoi, Vietnam. hauldhut@gmail.com
https://doi.org/10.1016/j.jmb.2018.05.006
Edited by Nir Yosef

Abstract
Recently, many long non-coding RNAs (lncRNAs) have been identified and their biological function has been
characterized; however, our understanding of their underlying molecular mechanisms related to disease is still
limited. To overcome the limitation in experimentally identifying disease–lncRNA associations, computational
methods have been proposed as a powerful tool to predict such associations. These methods are usually based
on the similarities between diseases or lncRNAs since it was reported that similar diseases are associated with
functionally similar lncRNAs. Therefore, prediction performance is highly dependent on how well the similarities
can be captured. Previous studies have calculated the similarity between two diseases by mapping each
disease to exactly one Disease Ontology (DO) term and then using a semantic similarity measure to calculate the
similarity between the mapped terms. However, the problem with this approach is that a disease can be described by
more than one DO term. To date, there is no database that annotates diseases with DO terms; such annotations exist
only for genes. In contrast, the Human Phenotype Ontology (HPO) is designed to fully annotate human disease
phenotypes. Therefore, in this study, we constructed disease similarity networks/matrices using HPO instead of DO.
Then, we used these networks/matrices as inputs of two representative machine learning-based and network-based
ranking algorithms, that is, regularized least square and heterogeneous graph-based inference, respectively. The
results showed that the prediction performance of the two algorithms on HPO-based networks/matrices is better than
that on DO-based networks/matrices. In addition, our method can predict 11 novel cancer-associated lncRNAs, which
are supported by literature evidence.
© 2018 Elsevier Ltd. All rights reserved.

Introduction

Recent genome-wide studies have revealed that two-thirds of the genome is being transcribed but only a
minority of the transcriptional output encodes for proteins [1–4]. One class of non-coding RNAs is
termed long non-coding RNAs (lncRNAs), described as non-coding RNA transcripts longer than 200
nucleotides. LncRNAs are characterized by being transcribed by RNA polymerase II but at low levels,
by exhibiting alternative splicing, by being multi-exonic and polyadenylated, and by generally
exhibiting low coding potential [4,5]. Studies of lncRNA at a genome-wide scale have revealed that a
large proportion of long non-coding transcripts can be functionally important for normal and
pathological developmental processes [6]. With the increasing number of identified lncRNAs, many
related databases have been established including GENCODE [7], lncRNAdb [8], lncRBase [9],
LncRNA2Function [10] and LncRNA2Target [11]. More specifically, focusing on their roles in common
diseases and cancer, there are the lncRNADisease [12] and Lnc2Cancer [13] databases, respectively.
Although a certain number of disease–lncRNA associations have been experimentally confirmed, the vast
majority are still unknown. Indeed, approximately 1000 lncRNA–disease entries have been reported in
lncRNADisease [12]. Therefore, a number of computational methods have been recently proposed to
predict novel disease–lncRNA associations [14].

Computational methods to predict novel disease–lncRNA associations fall into two main categories:
machine learning-based and network-based methods.

0022-2836/© 2018 Elsevier Ltd. All rights reserved. J Mol Biol (2018) xx, xxx–xxx

Please cite this article as: D.-H. Le, L. T.M. Dao, Annotating Diseases Using Human Phenotype Ontology Improves Prediction of
Disease-Associated Long Non-coding RNAs, J. Mol. Biol. (2018), https://doi.org/10.1016/j.jmb.2018.05.006

The machine learning-based methods usually use known disease–lncRNA associations as well as unknown
ones to train learning models; the learned models are then used to predict novel associations. This
approach also integrates various biological information to annotate lncRNAs. For example, a naïve
Bayesian model was used to integrate genome, regulome and transcriptome features to identify novel
cancer-related lncRNAs [15]. However, this method required negative training samples (i.e., lncRNAs
which are not associated with diseases) to train the model. Given that there is no experimental
confirmation of such negative associations, unknown lncRNA–disease associations were used in that
study. Recently, a semi-supervised model, for example, regularized least square (RLS) [16], was used
to overcome this limitation. The model does not require negative training samples, since it trains
the model based on the known and the unknown disease–lncRNA associations. In contrast to machine
learning-based methods, of which only a few have been used, many network-based methods have been
proposed to predict novel disease-associated lncRNAs. The network-based methods usually rank
candidate lncRNAs based on their relevance to a disease of interest. The most commonly used
algorithms are label propagation algorithms such as random walk with restart (RWR) [17–21] and KATZ
[22]. The main difference between these studies is the construction of underlying networks on which
the propagation algorithms were applied. For example, Sun et al. [17] applied RWR on a constructed
lncRNA functional similarity network (abbreviated as lncRNA FSN). Liu et al. [18] constructed a
protein-coding gene–lncRNA bipartite network based on lncRNAs and protein-coding gene expression
profiles, and then used RWR to predict cancer-related lncRNAs. Meanwhile, Zhou et al. [19] and
Ganegoda et al. [20] built heterogeneous networks of lncRNAs and diseases by combining a lncRNA
similarity network, a disease similarity network and known disease–lncRNA associations, and then
applied the RWR algorithm on the heterogeneous networks to predict novel disease-associated lncRNAs.
These network-based methods were proposed based on an observation that functionally similar lncRNAs
are usually associated with the same or similar diseases, known as the “disease module” principle. By
additionally integrating disease similarity networks, the “disease module” principle was used more
effectively for methods based on the heterogeneous networks compared to those solely based on the
lncRNA similarity network [19,20,22].

Although network- and machine learning-based methods are different in terms of data representation
(i.e., machine learning-based methods use vector-based representation of lncRNAs and interactions, and
similarity matrices of lncRNAs and diseases; meanwhile, network-based methods are based on similarity
networks of lncRNAs and diseases), they are both based on the similarity networks/matrices of lncRNAs
and diseases. Therefore, the construction of accurate similarity networks/matrices is the most
important step to correctly identify novel disease-associated lncRNAs. Indeed, many efforts have been
made to build effective lncRNA FSNs such as studies based on disease association [23–25], shared
microRNAs [19,26], or a study that combines both lncRNA-related transcriptional and
post-transcriptional information [27]. However, few efforts were made for constructing disease
similarity networks; these were based on sharing of associated lncRNAs [28] and Disease Ontology (DO)
[19]. DO [29] terms describe phenotype characteristics, related medical vocabulary and disease
concepts; thus, a certain disease can be annotated with several DO terms. In other words, a disease
cannot be fully described by a single DO term. For example, breast cancer (OMIM ID: 114480) was
characterized by only one DO term, mammary cancer (DO ID: 1612). Nevertheless, previous studies
directly mapped each disease to only one DO term [17,19]. More specifically, to calculate a
similarity between two diseases, each disease was directly mapped to only one DO term, then a
semantic similarity measure was used to calculate the similarity between the two corresponding DO
terms [17,19]. In addition, until now, there is no database that stores annotations of diseases with
DO terms; such annotations exist only for genes [30]. Therefore, to measure the similarity between
two DO terms, a directed acyclic graph (DAG) for each DO term was used instead of annotation
information. In contrast, the Human Phenotype Ontology (HPO) [31] (i.e., a controlled vocabulary
database) is designed to fully annotate human disease phenotypes. For example, breast cancer is
annotated with three HPO terms, that is, autosomal dominant inheritance (HP:0000006), heterogeneous
(HP:0001425) and breast carcinoma (HP:0003002).

In this study, diseases were first mapped to OMIM records [32,33], which were annotated by HPO [31].
Then, we calculated the similarities between any pair of diseases to construct disease similarity
networks/matrices using the HPO annotation database (i.e., HPO-based method). In addition, we
constructed other disease similarity networks/matrices based on DO, as in previous methods [17,19]
(i.e., DO-based method). For fair comparison, we collected a set of diseases that have at least one
known lncRNA and were exactly mapped to one OMIM record as well as one DO term. Comparing our
HPO-based method with the DO-based one on the same ranking algorithms, we found that our method
achieved better prediction performance, in terms of area under the curve (AUC) values, for two
representative machine learning-based and network-based algorithms, that is, RLS and heterogeneous
graph-based inference (HGBI), respectively. Furthermore, we tested our method in identifying
cancer-associated lncRNAs and found that 11 of the highly ranked lncRNAs were annotated in Lnc2Cancer
as associated with 10 cancers [13].
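As a rough illustration of the HPO-based similarity outlined above, the sketch below computes information content over a toy ontology, takes the most informative common ancestor for a pair of terms, and then the best-matching term pair for two diseases, normalized to [0, 1], following the definitions given in Materials and Methods. The ontology, annotation counts, disease names and function names are all hypothetical, and treating a term as its own ancestor is an assumption made here so that the self-similarity of a disease is well defined.

```python
import math

# Hypothetical toy ontology (child -> parents); not real HPO or DO terms.
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]}
# Hypothetical Annot(t): number of diseases annotated directly with term t.
ANNOT = {"root": 0, "A": 1, "B": 2, "A1": 3, "A2": 2}
# Hypothetical disease -> set of annotated terms, i.e. T(d).
DISEASE_TERMS = {"d1": {"A1", "B"}, "d2": {"A2"}}


def information_content(parents, annot, root="root"):
    """IC(t) = -log(f(t) / f(root)), with f(t) = Annot(t) + sum of f over children."""
    children = {t: [] for t in parents}
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)
    memo = {}

    def f(t):
        if t not in memo:
            memo[t] = annot.get(t, 0) + sum(f(c) for c in children[t])
        return memo[t]

    return {t: -math.log(f(t) / f(root)) for t in parents if f(t) > 0}


def ancestors(term, parents):
    """All ancestors of a term; the term itself is included (an assumption)."""
    seen, stack = {term}, [term]
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def sim_term(ti, tj, parents, ic):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(ti, parents) & ancestors(tj, parents)
    return max(ic[c] for c in common if c in ic)


def sim_dis(di, dj, terms, parents, ic):
    """Best-match similarity over all term pairs of two diseases."""
    return max(sim_term(a, b, parents, ic) for a in terms[di] for b in terms[dj])


def weight(di, dj, terms, parents, ic):
    """Normalized similarity w_ij in [0, 1]."""
    denom = sim_dis(di, di, terms, parents, ic) + sim_dis(dj, dj, terms, parents, ic)
    return 2 * sim_dis(di, dj, terms, parents, ic) / denom if denom > 0 else 0.0


IC = information_content(PARENTS, ANNOT)
print(round(weight("d1", "d2", DISEASE_TERMS, PARENTS, IC), 4))
```

With a single term per disease, `sim_dis` reduces to `sim_term`, which corresponds to the one-term mapping used by the DO-based approach.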


Results and Discussion

Overall comparison of prediction performance

To compare the prediction performance of our
method (i.e., HPO-based) with the DO-based method
on two representative ranking algorithms (i.e., RLS-
based and HGBI-based algorithms), we used cross-
validation methods for each disease in a set of 42
diseases (see Materials and Methods). For the RLS-
based algorithm, we set the best RLS parameter
settings (ηL = ηD = 1 and w = 0.9), as recommended in
[34,35]. Meanwhile, for the HGBI-based algorithm, we
set decay factor α = 0.4 as in the original study [36].
First, we compared the two methods using the
unweighted lncRNA FSN collected from Zhou et al.
[19] on all of 42 diseases using the leave-one-out cross-
validation (LOOCV) method. The result shows that the
prediction performance of the HPO-based method is
better than that of the DO-based method in predicting disease-
associated lncRNAs in terms of AUC value (Fig. 1).
More specifically, the HPO-based method achieved
AUC = 0.946, better than the DO-based method
(AUC = 0.870), with the RLS-based ranking algorithm
(Fig. 1(a)). Similarly, with the HGBI-based ranking
algorithm, the HPO-based method (AUC = 0.837) is
superior to the DO-based one (AUC = 0.592) (Fig. 1(b)).
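To make the AUC comparison concrete, the sketch below computes AUC as the probability that a held-out known lncRNA is scored above a candidate lncRNA, counting ties as one half. This is a generic rank-based formulation with hypothetical scores and a hypothetical function name, not the authors' evaluation code; the cross-validation protocol itself is the one referred to in Materials and Methods.

```python
def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical ranking scores for one disease.
held_out = [0.9, 0.4]          # known associated lncRNAs hidden during training
candidates = [0.5, 0.3, 0.1]   # lncRNAs with no known association
print(auc(held_out, candidates))
```

An AUC of 0.5 corresponds to random ranking, and 1.0 to every held-out lncRNA being ranked above every candidate.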
Second, we compared the two methods on 10 diseases
having no less than three known associated lncRNAs
using a 3-fold cross-validation. The result shows that
the HPO-based method (AUC = 0.975) outperformed
the DO-based method (AUC = 0.904) with the RLS-
based ranking algorithm (Fig. S1(a)). Similarly, with
the HGBI-based algorithm, the HPO-based method
(AUC = 0.906) was also superior to the DO-based
method (AUC = 0.747) (Fig. S1(b)). In addition to the
unweighted lncRNA FSN, we also compared the two
methods on the weighted lncRNA FSN using LOOCV
method. With the RLS-based ranking algorithm, the
HPO-based method achieved an AUC = 0.984, which
was superior to the DO-based method (AUC = 0.919)
(Fig. S1(c)). Meanwhile, with the HGBI-based algorithm, the HPO-based and the DO-based methods
achieved AUC = 0.412 and AUC = 0.382, respectively (Fig. S1(d)). In this case, both methods
performed worse than random ranking, since the AUC values are less than 0.5.
Taken together, these results indicate that the HPO-based method performed better than the DO-based
method irrespective of the ranking algorithm used. This could be because the HPO-based method better
reflects the similarity between two diseases than the DO-based method.

Fig. 1. Performance comparison between HPO-based and DO-based methods. The prediction performance of
the two methods was compared on the unweighted lncRNA FSN by LOOCV method using RLS-based (a) and
HGBI-based (b) ranking algorithms.

Performance comparison with respect to degree of connectivity

The difference in number of DO terms and annotated OMIM records (i.e., 2152 DO terms and 6521 disease
phenotypes, see Materials and Methods) results in different sizes of the disease similarity networks
(i.e., number of diseases and number of associations between them); thus, we further investigated
whether the size of the networks affects the prediction performance. It was reported that the ranking
of a node in a network generated by ranking algorithms is highly dependent on its degree of
connectivity (i.e., number of its neighbors in the network) [37]. Therefore, for a fair performance
comparison, we constructed different networks based on the original DO-based and HPO-based disease
similarity networks (see Materials and


Methods) with a fixed degree of connectivity. More specifically, for each disease in the original
disease similarity networks, we selected only a fixed number of neighboring diseases (e.g., 10) having
the largest similarities with it. These networks were also represented by adjacency matrices as inputs
for the ranking algorithms. Then, we applied the RLS-based ranking algorithm on these networks to
predict disease-associated lncRNAs. First, we combined these disease networks with the unweighted
lncRNA FSN and assessed the prediction performance of each network using LOOCV method on the set of
42 diseases. As shown in Fig. 2, regardless of the degree of connectivity, the HPO-based method always
performed better than the DO-based one. We also achieved similar results with the 3-fold
cross-validation method on the set of 10 valid diseases (Fig. S2(a)). In addition, we combined these
disease similarity networks with the weighted lncRNA FSN and compared the prediction performance of
the two methods on these networks using the LOOCV method on the set of 42 diseases. Figure S2(b) also
shows that the HPO-based method achieved higher performance compared to that of the DO-based method.
Once again, these results indicate that annotating disease with HPO terms better reflects the
similarity between two diseases than directly mapping them to DO terms, and thus, the HPO-based method
was superior to the DO-based method irrespective of the degree of connectivity of the constructed
disease similarity networks.

Prediction of novel cancer-associated lncRNAs

In this experiment, we tried to predict novel associations between lncRNA and cancer. Cancer is a
group of diseases involving out-of-control cell growth with the potential to invade or spread to other
parts of the body. Many intensive studies have been carried out but little is known about the
underlying molecular mechanisms. Recently, a cancer-associated lncRNA database Lnc2Cancer was
introduced [13]. This is a manually curated database that provides comprehensive experimentally
supported associations between lncRNA and human cancers. The current version of Lnc2Cancer documents
1488 entries of associations between 666 human lncRNAs and 97 human cancers through the revision of
more than 2000 published papers. There are 10 cancers in common between Lnc2Cancer and lncRNADisease
[12]. For each of the 10 cancers, we used known disease–lncRNA associations in lncRNADisease as input
of the RLS-based ranking algorithm to rank candidate lncRNAs (remaining ones in the unweighted lncRNA
FSN, which are not known to be associated with the cancer of interest) and then selected the top 100
ranked candidates. Table 1 lists the 11 lncRNAs among these top-ranked candidates that are reported in

Fig. 2. Performance comparison between HPO-based and DO-based methods based on degree of connectivity. The
unweighted lncRNA FSN and LOOCV method were used. By setting the same degree of connectivity for each disease in
HPO-based and DO-based networks, the prediction performance using RLS-based ranking method on HPO-based
networks is higher than that on DO-based networks. Note: 375 is the ratio between the number of associations and the
number of diseases in the original DO-based disease similarity network. Using this ratio, we constructed a network from
the original HPO-based disease similarity matrix.
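The fixed-degree construction described in the caption can be sketched as follows: for each disease, only the k most similar other diseases are kept. The function name `top_k_network`, the toy matrix, and the choice to keep the result directed (row-wise, not symmetrized) are assumptions for illustration; the source does not specify these details.

```python
def top_k_network(sim, k):
    """Keep, for each disease (row), only its k most similar neighbors.

    sim is a square similarity matrix given as a list of lists.
    Self-similarities and zero similarities are never selected.
    """
    n = len(sim)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbors = sorted(
            (j for j in range(n) if j != i and sim[i][j] > 0),
            key=lambda j: sim[i][j],
            reverse=True,
        )[:k]
        for j in neighbors:
            adj[i][j] = sim[i][j]
    return adj


# Hypothetical 4-disease similarity matrix.
SIM = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.5, 0.3],
    [0.2, 0.5, 1.0, 0.8],
    [0.1, 0.3, 0.8, 1.0],
]
for row in top_k_network(SIM, 2):
    print(row)
```

Applying the same k to both the DO-based and the HPO-based matrices gives every disease the same number of neighbors in both networks, which is the condition under which the two are compared in Fig. 2.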


Table 1. List of evidenced disease–lncRNA associations in top 100 ranked candidates for 10 cancers

OMIM ID     Disease                        lncRNA     PubMed ID
MIM109800   Bladder cancer                 PVT1       26517688
MIM114500   Colorectal cancer              H19        11120891, 19926638, 22427002, 26068968, 26989025
                                           PVT1       24196785, 26990997
MIM114550   Hepatocellular carcinoma       GAS5       26109807, 25120813, 26163879, 26404135, 26404135
                                           H19        15736456, 17786216, 24063685, 24761865, 24939300, 23222811
                                           HOTAIR     24663081, 23292722, 21327457, 26024833, 22289527, 27301338
MIM151400   Chronic lymphocytic leukemia   DLEU2      19347735, 9395242, 11161783, 19591824
                                           NEAT1      25971364
MIM211980   Lung cancer                    H19        16707459, 24063685
                                           ZNRD1-AS1  27166266
MIM275355   Oral squamous cell carcinoma   HOTAIR     23292713, 25901533
                                           MALAT1     26522444
                                           MEG3       23292713
MIM603956   Cervical cancer                PVT1       27272214, 27232880
MIM607107   Nasopharyngeal carcinoma       H19        27040767
                                           NEAT1      27020592
MIM607174   Meningioma                     H19        10738131
                                           MEG3       20179190
MIM613659   Gastric cancer                 CCAT1      25561974, 25674211, 23143645, 25755774
                                           HOXA-AS2   26384350
                                           PVT1       25258543, 26096073, 25956062, 26925791

Lnc2Cancer for the 10 cancers. Among these lncRNAs, six (i.e., GAS5, H19, HOTAIR, MALAT1, MEG3 and
PVT1) are known to be associated with at least one disease in the lncRNADisease database (excluding
the disease of interest), whereas the remaining five (i.e., DLEU2, NEAT1, ZNRD1-AS1, CCAT1 and
HOXA-AS2) are not known to be associated with any disease in the lncRNADisease database. More
importantly, newly predicted associations between these 11 lncRNAs and the 10 cancers were supported
by literature evidence. For instance, PVT1 was upregulated in bladder cancer tissues and further
experiments revealed that PVT1 promoted cell proliferation and suppressed cell apoptosis [38]. H19 was
found highly expressed in mesenchymal-like cancer cells and primary colorectal cancer tissues [39].
HOTAIR expression was detected in primary hepatocellular carcinoma in 13 out of 64 patients [40]. The
lncRNAs NEAT1 and lincRNA-p21 were detected as novel elements of the p53-dependent DNA damage response
machinery in chronic lymphocytic leukemia and lymphoma [41]. Higher expression of ZNRD1-AS1 was shown
in lung cancer tissues [42]. MALAT1 was overexpressed in oral squamous cell carcinoma tissues compared
to normal oral mucosa by real-time PCR [43]. PVT1 was upregulated in cervical cancer tissues [44].
NEAT1 lncRNA was significantly upregulated in nasopharyngeal carcinoma cell lines and tissues [45].
MEG3 was not expressed in the majority of human meningiomas or the human meningioma cell lines
IOMM-Lee and CH157-MN [46]. Expression levels of the CCAT2 lncRNA in gastric cancer tissues were
significantly higher than those in adjacent non-tumor tissues [47]. Taken together, these results
indicate that our HPO-based method can not only achieve high prediction performance in terms of AUC
values on known disease–lncRNA associations, but also predict novel disease-associated lncRNAs.

Conclusions

Recent studies have shown the important roles of lncRNAs in the development of a number of diseases.
In addition, they can be potential targets for drug discovery. However, many potential disease–lncRNA
associations have not yet been revealed experimentally. Therefore, computational methods have been
proposed as an alternative approach to reduce the cost and time of such laborious tasks. A large
number of computational methods have been developed for predicting associations between other
non-coding RNAs (e.g., miRNAs) and diseases [48–54]. Computational methods often relied on the
similarity between diseases and lncRNAs using network- and machine learning-based techniques.
Therefore, careful construction of similarity matrices/networks among diseases and lncRNAs is an
important step to identify novel disease-associated lncRNAs. For constructing disease similarity
networks, a previous study built a disease similarity matrix based on the interaction profile of known
disease–lncRNA associations [28]. Obviously, this method limits the size of the disease similarity
networks/matrices to the number of diseases which are known to be associated with at least one lncRNA.
Therefore, other studies constructed the disease similarity networks/matrices by mapping each disease
to exactly one DO term, then the similarity of two diseases was calculated based on the similarity
between the two corresponding DO terms using a


semantic similarity measure. Although DO was recently designed to describe phenotype characteristics
and related medical vocabulary disease concepts, the only available annotations, in the DGA database
[30], relate DO terms to genes but not to diseases. More importantly, only a small set of DO terms
have been used for such annotations (i.e., approximately 2161 DO terms out of 8839 were used). This
also limits the number of diseases that can be investigated. In contrast, HPO is designed to be
supported for a decade [55], with continued updating [56] of disease phenotype annotation. In
addition, most disease phenotypes are annotated by HPO terms [i.e., 6521 disease phenotypes out of
6875 (including 3739 phenotypes with known molecular basis, 1597 with unknown molecular basis, and
mendelian phenotypes or loci with unknown molecular basis) from OMIM were annotated]. Therefore, using
HPO to calculate disease similarity could yield more accurate results than using DO. Indeed,
experimental results show that HPO-based networks/matrices are better than the DO-based ones in the
prediction of disease-associated lncRNAs. Our proposed method can also predict 11 novel lncRNAs
associated with 10 cancers which were not yet reported in lncRNADisease [12]. In summary, using HPO
for the calculation of disease similarity could be a promising approach for the prediction of
disease-associated lncRNAs.

Materials and Methods

Known disease–lncRNA associations

We used the lncRNADisease database [12], which contains experimentally validated disease–lncRNA
associations, as the source of known disease–lncRNA associations. Currently, there are 1028
associations between 321 lncRNAs and 221 diseases collected from ~500 publications.

DO and annotation databases

DO was obtained from http://disease-ontology.org/ [29]. This database describes phenotype
characteristics and related medical vocabulary disease concepts. The DO semantically integrates
disease and medical vocabularies through extensive cross-mapping of DO terms to MeSH [57,58], ICD
[59], NCI's thesaurus [60], SNOMED [61,62] and OMIM [32,33]. DO currently contains approximately
10,878 terms. The DO is presented as a DAG with the root term as “disease” (DO ID: 4). Unfortunately,
there is no database containing DO term annotations to any disease. However, there are 2161 DO terms
used to annotate genes in DGA [30]. Therefore, in this study, this annotation information was used to
calculate the similarity between DO terms.

HPO and annotation databases

HPO and annotation databases were obtained from http://www.human-phenotype-ontology.org [31]. This
database provides a standardized vocabulary of phenotypic abnormalities encountered in human disease.
Each term in the HPO describes a phenotypic abnormality. The HPO is currently being developed using
the medical literature, Orphanet [63], DECIPHER [64] and OMIM [32,33]. HPO currently contains
approximately 11,000 terms and over 115,000 annotations to hereditary diseases for 6521 phenotypes
(including approximately 4000 disease phenotypes). The HPO is presented as a DAG with the root term as
“All” (HP:0000001). Using this graph, ancestors and descendants of a term can be specified.

Construction of disease similarity networks/matrices

To construct a disease similarity matrix, previous studies [17,19] directly mapped each disease to one
DO term [29]. Therefore, similarity between two diseases was calculated based on similarity between
the two mapped DO terms. In this study, we first mapped each disease to one OMIM record and then
annotated the OMIM record with HPO terms. After that, we calculated the similarity between two
diseases based on the annotated HPO terms. In the next section, we will introduce the two methods
including the one used by previous studies [17,19] (abbreviated as “DO-based method”) and our method
(abbreviated as “HPO-based method”).

DO-based method

To construct the DO-based disease similarity matrix, we mapped each disease to one DO term as in
previous studies [17,19], and then calculated the similarity between any pair of mapped DO terms in
the set of 2161 DO terms and annotations in the DGA database [30]. The similarity between two ontology
terms was calculated based on the information content (IC) of each term, which is defined as follows:

$$\mathrm{IC}(t) = -\log\big(p(t)\big)$$

where $p(t)$ is the probability of term $t$ occurring in a corpus (i.e., an annotation database, e.g.,
DGA for DO). More specifically, $p(t) = f(t) / f(\mathrm{root})$ such that
$f(t) = \mathrm{Annot}(t) + \sum_{c \in \mathrm{Children}(t)} f(c)$. In this formula, Annot(t) means
the number of phenotypes annotated with $t$ in the corpus and Children(t) represents the set of
children terms of $t$ in the DO graph. “root” is the root term of the DO graph. Then, the semantic
similarity between the two DO terms, $t_i$ and $t_j$, based on the


most informative common ancestor approach of Resnik [65], is calculated as follows:

$$\mathrm{simTerm}(t_i, t_j) = \max_{c \in P(t_i, t_j)} \mathrm{IC}(c)$$

where $P(t_i, t_j)$ is the set of shared ancestors of $t_i$ and $t_j$. For a pair of diseases $d_i$
and $d_j$ that are directly mapped to $t_i$ and $t_j$, respectively, the similarity between them is
defined as follows:

$$w_{ij} = \mathrm{simDis}(d_i, d_j) = \mathrm{simTerm}(t_i, t_j)$$

We calculated the similarity for every pair of DO terms in the total of 2161 DO terms to construct a
DO-based disease similarity matrix. By selecting pairs having $\mathrm{simDis}(d_i, d_j) > 0$, we also
constructed a DO-based disease similarity network containing 806,505 interactions (also known as an
original DO-based similarity network). Both the disease similarity network and the matrix can be
represented as an adjacency matrix $W_D$, where its element $(W_D)_{i,j}$ was set to $w_{ij}$
representing the similarity between disease $d_i$ and $d_j$.

HPO-based method

To construct the HPO-based disease similarity network, we first mapped each disease to one OMIM
record, then annotated the OMIM record with HPO terms using the HPO annotation database [31]. Then,
the similarity of every pair of terms was calculated using the same formulas as for a pair of DO
terms. More specifically, the similarity between a pair of diseases $d_i$ and $d_j$ is calculated as
the maximum of simTerm values between all possible pairs of terms as follows:

$$\mathrm{simDis}(d_i, d_j) = \max_{t_i \in T(d_i),\, t_j \in T(d_j)} \mathrm{simTerm}(t_i, t_j)$$

where $T(d_i)$ and $T(d_j)$ represent the sets of terms annotating $d_i$ and $d_j$, respectively.
This value is normalized in range [0, 1] to account for an unequal number of HPO terms for both
disease phenotypes as follows:

$$w_{ij} = \frac{2 \times \mathrm{simDis}(d_i, d_j)}{\mathrm{simDis}(d_i, d_i) + \mathrm{simDis}(d_j, d_j)}$$

We calculated the similarity for every pair of diseases in the total of 6521 disease phenotypes to
construct an HPO-based disease similarity matrix. By selecting pairs having
$\mathrm{simDis}(d_i, d_j) > 0$, we also constructed an HPO-based disease similarity network
containing 21,258,460 interactions (also called an original HPO-based disease similarity network).
Similarly, both the HPO-based similarity matrix and network can be represented as an adjacency matrix
$W_D$, where its element $(W_D)_{i,j}$ was set to $w_{ij}$ representing the similarity between disease
$d_i$ and $d_j$.

Construction of lncRNA functional similarity network/matrix

In this study, we first collected an unweighted lncRNA FSN from Zhou et al. [19]. Each interaction in
this network was constructed by significant co-occurrence of shared miRNA response elements on lncRNA
transcripts. This network includes 13,640 interactions between 697 lncRNAs. This network can be
represented as an adjacency matrix $W_L$, where its element $(W_L)_{i,j}$ was set to 1 or 0 with
respect to whether an interaction between lncRNA $i$ and $j$ exists or not. Second, we collected
original data of lncRNA–miRNA associations and then repeated the same procedure introduced in that
study to calculate the degree of significance of co-occurrence of shared miRNAs between lncRNAs. After
setting the threshold of significance to 0.0001 as in that study, only significant lncRNA interactions
were kept. However, instead of setting the same value (i.e., 1) to those interactions, we transformed
the significance values using a logarithm of base 10 and normalized them into a range (0, 1) to form a
weighted lncRNA FSN (see Supplementary Table S2). Similarly, this network can be represented as an
adjacency matrix $W_L$, where its element $(W_L)_{i,j}$ was set to a corresponding weight or 0 with
respect to whether an interaction between lncRNA $i$ and $j$ exists or not.

Construction of adjacency network/matrix of known disease–lncRNA associations

As aforementioned, known disease–lncRNA associations were collected from the lncRNADisease database
[12]. These diseases and lncRNAs were mapped onto disease and lncRNA similarity networks/matrices.
After mapping, a total of 42 diseases and 151 associated lncRNAs remained for assessing the prediction
performance, in which each disease was mapped exactly to one OMIM record and one DO term (see
Supplementary Table S1). These associations can be represented as a bipartite network with an
adjacency matrix $W_{DL}$, where $(W_{DL})_{i,j} = 1$ if disease $i$ is known to be associated with
lncRNA $j$, otherwise $(W_{DL})_{i,j} = 0$.

Ranking algorithms

In this study, we proposed a novel method (i.e., HPO-based method) to construct a disease similarity
network/matrix using HPO for prediction of disease-associated lncRNAs. To compare the prediction
performance of our method (i.e., HPO-based method) with that of the previous one (i.e., DO-based
method), we tested it with different ranking algorithms and showed that the HPO-based method is
superior to the DO-based one, irrespective of the ranking algorithm used. Many ranking algorithms
including machine learning-based and network-based algorithms working on the heterogeneous networks


have been proposed. Here, we used two representative algorithms for predicting disease-associated lncRNAs. Figure 3 illustrates the HPO-based and DO-based methods and the overall framework for the prediction of disease-associated lncRNAs using ranking algorithms.

RLS-based ranking algorithm

RLS is a semi-supervised and global learning method, since it can rank disease–lncRNA associations for all the diseases simultaneously without using negative samples. This method was designed to construct a continuous classification function that determines the association probability between each lncRNA and a given disease (i.e., the higher this probability, the more closely the lncRNA is related to the given disease). To this end, an RLS classifier was constructed by defining and minimizing a cost function. This cost function was trained in the lncRNA FSN and the disease similarity network, and then it was used to optimize the classification function [34].

Fig. 3. Illustration of the HPO-based and DO-based methods and the overall framework for the prediction of disease-associated lncRNAs. (a) Similarity between two diseases was calculated using ontology data. In the HPO-based method (upper panel), each disease was mapped to one OMIM record, which was then annotated with HPO terms. Finally, the similarity between two diseases was calculated based on the similarity of every pair of HPO terms annotating each disease. Meanwhile, in the DO-based method (lower panel), each disease was directly mapped to one DO term, and the similarity between two diseases was then calculated as the similarity of the two annotating DO terms. (b) Overall framework for the prediction of disease-associated lncRNAs. A disease of interest and its known associated lncRNAs (if any), as well as candidate lncRNAs, were mapped onto the similarity networks/matrices. Then, a ranking algorithm was applied to rank all the candidate lncRNAs.
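
The HPO-based disease similarity of panel (a) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `sim_term` stands for the term-level simTerm measure (supplied by the caller, since its formula is defined elsewhere in the paper), and the term identifiers and similarity values in `TOY_SIM` are made up for the example rather than taken from HPO.

```python
from itertools import product


def sim_dis(terms_i, terms_j, sim_term):
    """simDis: maximum simTerm over all pairs of annotating terms."""
    return max(sim_term(a, b) for a, b in product(terms_i, terms_j))


def normalized_similarity(terms_i, terms_j, sim_term):
    """w_ij = 2 * simDis(d_i, d_j) / (simDis(d_i, d_i) + simDis(d_j, d_j))."""
    denom = sim_dis(terms_i, terms_i, sim_term) + sim_dis(terms_j, terms_j, sim_term)
    if denom == 0:
        return 0.0
    return 2.0 * sim_dis(terms_i, terms_j, sim_term) / denom


# Hypothetical term-to-term similarities (symmetric); NOT real HPO values.
TOY_SIM = {
    ("HP:A", "HP:A"): 4.0, ("HP:B", "HP:B"): 2.0, ("HP:C", "HP:C"): 3.0,
    ("HP:A", "HP:B"): 1.0, ("HP:A", "HP:C"): 0.5, ("HP:B", "HP:C"): 1.5,
}


def toy_sim_term(a, b):
    """Look up the toy similarity regardless of argument order."""
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))


d1 = {"HP:A", "HP:B"}  # terms annotating a first (toy) disease
d2 = {"HP:B", "HP:C"}  # terms annotating a second (toy) disease
w12 = normalized_similarity(d1, d2, toy_sim_term)  # 2 * 2.0 / (4.0 + 3.0)
```

Keeping `sim_term` as a parameter means the same two functions could serve both the HPO-based and the DO-based setting, with only the annotation sets and the term-level measure changing.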


Formally, the optimal classifier in these two spaces was defined as follows:

$$F^{*} = w\,(F_L^{*})^{T} + (1 - w)\,F_D^{*}$$

where

- $F_L^{*}$ and $F_D^{*}$ are the optimal classification functions in the lncRNA and disease phenotype spaces, respectively, as follows:

$$F_L^{*} = W_L (W_L + \eta_L I_L)^{-1} W_{DL}^{T}$$

$$F_D^{*} = W_D (W_D + \eta_D I_D)^{-1} W_{DL}$$

- $w$ is the weight between these two spaces. $\eta_L$ and $\eta_D$ are trade-off parameters in the lncRNA and disease phenotype spaces, respectively.
- $I_L$ and $I_D$ are identity matrices with the same sizes as matrices $W_L$ and $W_D$, respectively.

HGBI-based ranking algorithm

In addition to the machine learning-based algorithm in the previous section (i.e., the RLS-based ranking algorithm), in this section we introduce the HGBI algorithm. This algorithm was first proposed to infer novel drug–target interactions [36], and it was then applied to predict disease-associated miRNAs [66]. HGBI is based on the guilt-by-association principle on a heterogeneous network of diseases and lncRNAs. It predicts new disease–lncRNA associations by iteratively updating the measure of strength between unlinked disease–lncRNA pairs, taking all the paths in the network into account.

The potential association probability between a disease in the disease similarity network and a lncRNA in the lncRNA network can be defined as follows:

$$F^{t+1} = \alpha\, W'_D F^{t} W'_L + (1 - \alpha)\, W_{DL}$$

where $\alpha$ is a decay factor. According to the original study [36], the association probability matrix $F$ will converge when $W'_D$ and $W'_L$ are normalized as follows:

$$(W'_D)_{i,j} = \frac{(W_D)_{i,j}}{\sqrt{\sum_{k=1}^{m} (W_D)_{i,k} \sum_{k=1}^{m} (W_D)_{k,j}}}$$

$$(W'_L)_{i,j} = \frac{(W_L)_{i,j}}{\sqrt{\sum_{k=1}^{n} (W_L)_{i,k} \sum_{k=1}^{n} (W_L)_{k,j}}}$$

where $m$ and $n$ are the numbers of diseases and lncRNAs in the disease similarity networks and the lncRNA FSN, respectively.

All disease–lncRNA pairs in the heterogeneous networks are eventually ranked according to the steady-state probability vector $F^{\infty}$, which is obtained by repeating the iterations until convergence is reached (in this study, the number of iterations is set to 10).

Performance assessment

To assess the prediction performance of the ranking algorithms (i.e., the RLS-based and HGBI-based algorithms) on the different DO- and HPO-based disease similarity networks, we used the LOOCV method for each disease in the set of 42 diseases. More specifically, for each disease ($d$) with known associated lncRNAs ($S$), in each round of LOOCV, we held out one known $d$-associated lncRNA. The held-out lncRNA ($s$) and the remaining lncRNAs ($C$) in the lncRNA network/matrix, which were not known to be associated with $d$, were then ranked by the method. After that, we plotted the receiver operating characteristic (ROC) curve and calculated the AUC to compare the performance of the methods. This curve represents the relationship between sensitivity and (1 − specificity), where sensitivity refers to the percentage of known $d$-associated lncRNAs that were ranked above a particular threshold, and specificity refers to the percentage of lncRNAs not known to be associated with $d$ that were ranked below this threshold. More specifically, given a threshold $\tau$, we counted TP (true positives), FN (false negatives), FP (false positives) and TN (true negatives), which were formally defined as follows:

$$TP = \sum_{s \in S} I(\mathrm{rank}(s) \le \tau), \qquad FN = \sum_{s \in S} I(\mathrm{rank}(s) > \tau)$$

$$FP = \sum_{c \in C} I(\mathrm{rank}(c) \le \tau), \qquad TN = \sum_{c \in C} I(\mathrm{rank}(c) > \tau)$$

where $\mathrm{rank}(s)$, $\mathrm{rank}(c)$ and $I(\cdot)$ denote the rank of $s$, the rank of a lncRNA $c$ out of the set $C$, and the indicator function, respectively. Then, we defined sensitivity and (1 − specificity) as follows:

$$\mathrm{sensitivity} = \frac{TP}{TP + FN}, \qquad 1 - \mathrm{specificity} = \frac{FP}{FP + TN}$$

By varying $\tau$ from one to the number of lncRNAs in the set $C \cup \{s\}$, the relationship between sensitivity and (1 − specificity) was plotted. The ROC curve is the curve constructed from those pairs of values, and the AUC is the area under the ROC curve.

In addition to the LOOCV, we assessed the prediction performance of the methods using 3-fold cross-validation. This is because 10 of the 42 diseases have at least 3 known associated lncRNAs that are available on the lncRNA FSN.
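
As an illustration of the RLS-based ranking algorithm section, the closed-form classifier $F^{*} = w\,(F_L^{*})^{T} + (1 - w)\,F_D^{*}$ can be sketched as follows. This is a minimal, dependency-free sketch rather than the implementation used in the study: the helper names are hypothetical, the default values of `eta_d`, `eta_l` and `weight` are placeholders (the study's settings are not given in this section), and the regularized inverse is applied by solving a linear system instead of forming the inverse explicitly.

```python
def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, a[i])) + list(map(float, b[i])) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def rls_space(w, y, eta):
    """Compute W (W + eta * I)^{-1} Y for one similarity space."""
    n = len(w)
    reg = [[w[i][j] + (eta if i == j else 0.0) for j in range(n)] for i in range(n)]
    return matmul(w, solve(reg, y))


def rls_scores(w_d, w_l, w_dl, eta_d=1.0, eta_l=1.0, weight=0.5):
    """Combine both spaces; parameter defaults are assumptions for this sketch."""
    f_l = rls_space(w_l, transpose(w_dl), eta_l)  # lncRNAs x diseases
    f_d = rls_space(w_d, w_dl, eta_d)             # diseases x lncRNAs
    return [[weight * a + (1.0 - weight) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(transpose(f_l), f_d)]


# Toy inputs (made up): 2 diseases, 3 lncRNAs, identity similarity matrices.
W_D = [[1.0, 0.0], [0.0, 1.0]]
W_L = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_DL = [[1, 0, 1], [0, 1, 0]]
F = rls_scores(W_D, W_L, W_DL)  # with identity similarities, F equals W_DL / 2
```

With real data, the similarity matrices are dense and a numerical library would be the more natural choice; plain lists are used here only to keep the sketch self-contained.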

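
The HGBI update $F^{t+1} = \alpha\, W'_D F^{t} W'_L + (1 - \alpha)\, W_{DL}$ from the HGBI-based ranking algorithm section can be sketched in the same way. Two points are assumptions of this sketch rather than statements from the text: the iteration is started from $F^{0} = W_{DL}$, and the default `alpha=0.4` is a placeholder value. The number of iterations (10) follows the text.

```python
import math


def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize(w):
    """(W')_{i,j} = W_{i,j} / sqrt(row_sum_i * col_sum_j); 0 if a sum is 0."""
    row = [sum(r) for r in w]
    col = [sum(c) for c in zip(*w)]
    return [[w[i][j] / math.sqrt(row[i] * col[j]) if row[i] * col[j] > 0 else 0.0
             for j in range(len(w[0]))] for i in range(len(w))]


def hgbi_scores(w_d, w_l, w_dl, alpha=0.4, iterations=10):
    """Iterate the HGBI update; alpha default and F^0 = W_DL are assumptions."""
    wd, wl = normalize(w_d), normalize(w_l)
    f = [list(map(float, r)) for r in w_dl]
    for _ in range(iterations):
        prop = matmul(matmul(wd, f), wl)  # propagate scores through both networks
        f = [[alpha * p + (1.0 - alpha) * a for p, a in zip(rp, ra)]
             for rp, ra in zip(prop, w_dl)]
    return f


# Toy inputs (made up): 2 diseases, 2 lncRNAs, one known association.
W_D = [[1.0, 0.5], [0.5, 1.0]]
W_L = [[1.0, 0.5], [0.5, 1.0]]
W_DL = [[1, 0], [0, 0]]
F = hgbi_scores(W_D, W_L, W_DL)
```

In this toy case the unlinked pairs receive positive scores through their similarity to the single known association, which is the guilt-by-association behaviour the section describes.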
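
The ROC construction in the Performance assessment section can be sketched as follows. The sketch assumes a single ranked list in which the known (held-out) lncRNAs form the positive set $S$ and all other entries form the candidate set $C$; the function names and the toy lncRNA labels are hypothetical. The threshold $\tau$ runs over every rank position, and the AUC is obtained with the trapezoidal rule.

```python
def roc_points(ranked, positives):
    """Return (1 - specificity, sensitivity) pairs for tau = 1..len(ranked).

    ranked: items ordered best first; positives: the set S of known lncRNAs.
    """
    n_pos = sum(1 for item in ranked if item in positives)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative item")
    points = [(0.0, 0.0)]
    tp = fp = 0
    for item in ranked:  # each step raises the threshold tau by one rank
        if item in positives:
            tp += 1      # rank(s) <= tau
        else:
            fp += 1      # rank(c) <= tau
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy ranking (made up): the held-out lncRNA "s" is ranked second of four.
ranking = ["c1", "s", "c2", "c3"]
curve = roc_points(ranking, {"s"})
area = auc(curve)  # one of three candidates outranks "s", so the AUC is 2/3
```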

Figure S3 shows the distribution of known disease-associated lncRNAs.

Acknowledgment

This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.01-2017.14.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.jmb.2018.05.006.

Received 2 November 2017;
Received in revised form 28 April 2018;
Accepted 5 May 2018
Available online xxxx

Keywords:
disease-associated lncRNA;
Human Phenotype Ontology;
Disease Ontology;
semantic similarity;
ranking algorithms

Abbreviations used:
lncRNAs, long non-coding RNAs; DAG, directed acyclic graph; HPO, Human Phenotype Ontology; RLS, regularized least square; HGBI, heterogeneous graph-based inference; LOOCV, leave-one-out cross-validation; ROC, receiver operating characteristic; AUC, area under the curve.

References

[1] P. Bertone, V. Stolc, T.E. Royce, J.S. Rozowsky, A.E. Urban, X. Zhu, J.L. Rinn, W. Tongprasit, M. Samanta, S. Weissman, et al., Global identification of human transcribed sequences with genome tiling arrays, Science 306 (5705) (2004) 2242–2246.
[2] P. Carninci, T. Kasukawa, S. Katayama, J. Gough, M.C. Frith, N. Maeda, R. Oyama, T. Ravasi, B. Lenhard, C. Wells, et al., The transcriptional landscape of the mammalian genome, Science 309 (5740) (2005) 1559–1563.
[3] E. Birney, J.A. Stamatoyannopoulos, A. Dutta, R. Guigó, T.R. Gingeras, E.H. Margulies, Z. Weng, M. Snyder, E.T. Dermitzakis, R.E. Thurman, Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project, Nature 447 (7146) (2007) 799–816.
[4] P. Kapranov, J. Cheng, S. Dike, D.A. Nix, R. Duttagupta, A.T. Willingham, P.F. Stadler, J. Hertel, J. Hackermüller, I.L. Hofacker, et al., RNA maps reveal new RNA classes and a possible function for pervasive transcription, Science 316 (5830) (2007) 1484–1488.
[5] M. Guttman, I. Amit, M. Garber, C. French, M.F. Lin, D. Feldser, M. Huarte, O. Zuk, B.W. Carey, J.P. Cassady, et al., Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals, Nature 458 (7235) (2009) 223–227.
[6] È.-L. Mathieu, M. Belhocine, L.T.M. Dao, D. Puthier, S. Spicuglia, Rôle des longs ARN non codants dans le développement normal et pathologique, Med. Sci. (Paris) 30 (8-9) (2014) 790–796.
[7] T. Derrien, R. Johnson, G. Bussotti, A. Tanzer, S. Djebali, H. Tilgner, G. Guernec, D. Martin, A. Merkel, D.G. Knowles, et al., The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression, Genome Res. 22 (9) (2012) 1775–1789.
[8] X.C. Quek, D.W. Thomson, L.V. Maag Jesper, N. Bartonicek, B. Signal, M.B. Clark, B.S. Gloss, M.E. Dinger, lncRNAdb v2.0: expanding the reference database for functional long noncoding RNAs, Nucleic Acids Res. 43 (D1) (2015) D168–D173.
[9] S. Chakraborty, A. Deb, R.K. Maji, S. Saha, Z. Ghosh, LncRBase: an enriched resource for lncRNA information, PLoS One 9 (9) (2014), e108010.
[10] Q. Jiang, R. Ma, J. Wang, X. Wu, S. Jin, J. Peng, R. Tan, T. Zhang, Y. Li, Y. Wang, LncRNA2Function: a comprehensive resource for functional investigation of human lncRNAs based on RNA-seq data, BMC Genomics 16 (3) (2015) S2.
[11] Q. Jiang, J. Wang, X. Wu, R. Ma, T. Zhang, S. Jin, Z. Han, R. Tan, J. Peng, G. Liu, et al., LncRNA2Target: a database for differentially expressed genes after lncRNA knockdown or overexpression, Nucleic Acids Res. 43 (D1) (2015) D193–D196.
[12] G. Chen, Z. Wang, D. Wang, C. Qiu, M. Liu, X. Chen, Q. Zhang, G. Yan, Q. Cui, LncRNADisease: a database for long-non-coding RNA-associated diseases, Nucleic Acids Res. 41 (D1) (2013) D983–D986.
[13] S. Ning, J. Zhang, P. Wang, H. Zhi, J. Wang, Y. Liu, Y. Gao, M. Guo, M. Yue, L. Wang, et al., Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers, Nucleic Acids Res. 44 (D1) (2016) D980–D985.
[14] X. Chen, C.C. Yan, X. Zhang, Z.-H. You, Long non-coding RNAs and complex diseases: from experimental results to computational models, Brief. Bioinform. 18 (4) (2017) 558–576.
[15] T. Zhao, J. Xu, L. Liu, J. Bai, C. Xu, Y. Xiao, X. Li, L. Zhang, Identification of cancer-related lncRNAs through integrating genome, regulome and transcriptome features, Mol. BioSyst. 11 (1) (2015) 126–136.
[16] X. Chen, G.-Y. Yan, Novel human lncRNA–disease association inference based on lncRNA expression profiles, Bioinformatics 29 (20) (2013) 2617–2624.
[17] J. Sun, H. Shi, Z. Wang, C. Zhang, L. Liu, L. Wang, W. He, D. Hao, S. Liu, M. Zhou, Inferring novel lncRNA–disease associations based on a random walk model of a lncRNA functional similarity network, Mol. BioSyst. 10 (8) (2014) 2074–2081.
[18] Y. Liu, R. Zhang, F. Qiu, K. Li, Y. Zhou, D. Shang, Y. Xu, Construction of a lncRNA-PCG bipartite network and identification of cancer-related lncRNAs: a case study in prostate cancer, Mol. BioSyst. 11 (2) (2015) 384–393.
[19] M. Zhou, X. Wang, J. Li, D. Hao, Z. Wang, H. Shi, L. Han, H. Zhou, J. Sun, Prioritizing candidate disease-related long non-coding RNAs by walking on the heterogeneous lncRNA and disease network, Mol. BioSyst. 11 (3) (2015) 760–769.
[20] G.U. Ganegoda, M. Li, W. Wang, Q. Feng, Heterogeneous network model to infer human disease-long intergenic non-coding RNA associations, IEEE Trans. Nanobiosci. 14 (2) (2015) 175–183.
[21] X. Chen, Z.-H. You, G.-Y. Yan, D.-W. Gong, IRWRLDA: improved random walk with restart for lncRNA–disease association prediction, Oncotarget 7 (36) (2016) 57919.


[22] X. Chen, KATZLDA: KATZ measure for the lncRNA–disease association prediction, Sci. Rep. 5 (2015) 16840.
[23] X. Chen, C. Clarence Yan, C. Luo, W. Ji, Y. Zhang, Q. Dai, Constructing lncRNA functional similarity network based on lncRNA–disease associations and disease semantic similarity, 5 (2015) 11338.
[24] Y.-A. Huang, X. Chen, Z.-H. You, D.-S. Huang, K. Chan, ILNCSIM: improved lncRNA functional similarity calculation model, Oncotarget 7 (18) (2016) 25902–25914.
[25] X. Chen, Y.-A. Huang, X.-S. Wang, Z.-H. You, K. Chan, FMLNCSIM: fuzzy measure-based lncRNA functional similarity calculation model, Oncotarget 7 (29) (2016) 45948–45958.
[26] X. Chen, Predicting lncRNA–disease associations and constructing lncRNA functional similarity network based on the information of miRNA, Sci. Rep. 5 (2015) 13186.
[27] L. Cheng, H. Shi, Z. Wang, Y. Hu, H. Yang, C. Zhou, J. Sun, M. Zhou, IntNetLncSim: an integrative network analysis method to infer human lncRNA functional similarity, Oncotarget 7 (30) (2016) 47864.
[28] X. Yang, L. Gao, X. Guo, X. Shi, H. Wu, F. Song, B. Wang, A network based method for analysis of lncRNA–disease associations and prediction of lncRNAs implicated in diseases, PLoS One 9 (1) (2014), e87797.
[29] W.A. Kibbe, C. Arze, V. Felix, E. Mitraka, E. Bolton, G. Fu, C.J. Mungall, J.X. Binder, J. Malone, D. Vasant, et al., Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data, Nucleic Acids Res. 43 (D1) (2015) D1071–D1078.
[30] K. Peng, W. Xu, J. Zheng, K. Huang, H. Wang, J. Tong, Z. Lin, J. Liu, W. Cheng, D. Fu, et al., The disease and gene annotations (DGA): an annotation resource for human disease, Nucleic Acids Res. 41 (D1) (2013) D553–D560.
[31] S. Köhler, S.C. Doelken, C.J. Mungall, S. Bauer, H.V. Firth, I. Bailleul-Forestier, G.C.M. Black, D.L. Brown, M. Brudno, J. Campbell, et al., The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data, Nucleic Acids Res. 42 (D1) (2014) D966–D974.
[32] A. Hamosh, A.F. Scott, J.S. Amberger, C.A. Bocchini, V.A. McKusick, Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders, Nucleic Acids Res. 33 (Suppl. 1) (2005) D514–517.
[33] J. Amberger, C.A. Bocchini, A.F. Scott, A. Hamosh, McKusick's Online Mendelian Inheritance in Man (OMIM®), Nucleic Acids Res. 37 (Suppl. 1) (2009) D793–D796.
[34] X. Chen, G.-Y. Yan, Semi-supervised learning for potential human microRNA-disease associations inference, Sci. Rep. 4 (2014) 5501.
[35] T. van Laarhoven, S.B. Nabuurs, E. Marchiori, Gaussian interaction profile kernels for predicting drug–target interaction, Bioinformatics 27 (21) (2011) 3036–3043.
[36] W. Wang, S. Yang, J. Li, Drug target predictions based on heterogeneous graph inference, Biocomputing 2013, World Scientific 2013, pp. 53–64.
[37] S. Erten, G. Bebek, R. Ewing, M. Koyuturk, DADA: degree-aware algorithms for network-based disease gene prioritization, BioData Min. 4 (1) (2011) 19.
[38] C. Zhuang, J. Li, Y. Liu, M. Chen, J. Yuan, X. Fu, Y. Zhan, L. Liu, J. Lin, Q. Zhou, Tetracycline-inducible shRNA targeting long non-coding RNA PVT1 inhibits cell growth and induces apoptosis in bladder cancer cells, Oncotarget 6 (38) (2015) 41194.
[39] W.-C. Liang, W.-M. Fu, C.-W. Wong, Y. Wang, W.-M. Wang, G.-X. Hu, L. Zhang, L.-J. Xiao, D.C.-C. Wan, J.-F. Zhang, The lncRNA H19 promotes epithelial to mesenchymal transition by functioning as miRNA sponges in colorectal cancer, Oncotarget 6 (26) (2015) 22513.
[40] M. Ishibashi, R. Kogo, K. Shibata, G. Sawada, Y. Takahashi, J. Kurashige, S. Akiyoshi, S. Sasaki, T. Iwaya, T. Sudo, Clinical significance of the expression of long non-coding RNA HOTAIR in primary hepatocellular carcinoma, Oncol. Rep. 29 (3) (2013) 946–950.
[41] C. Blume, A. Hotz-Wagenblatt, J. Hüllein, L. Sellner, A. Jethwa, T. Stolz, M. Slabicki, K. Lee, A. Sharathchandra, A. Benner, p53-dependent non-coding RNA networks in chronic lymphocytic leukemia, Leukemia 29 (10) (2015).
[42] D. Li, L. Song, Z. Wen, X. Li, J. Jie, Y. Wang, L. Peng, Strong evidence for LncRNA ZNRD1-AS1, and its functional Cis-eQTL locus contributing more to the susceptibility of lung cancer, Oncotarget 7 (24) (2016) 35813.
[43] X. Zhou, S. Liu, G. Cai, L. Kong, T. Zhang, Y. Ren, Y. Wu, M. Mei, L. Zhang, X. Wang, Long non-coding RNA MALAT1 promotes tumor growth and metastasis by inducing epithelial–mesenchymal transition in oral squamous cell carcinoma, 5 (2015) 15972.
[44] S. Zhang, G. Zhang, J. Liu, Long noncoding RNA PVT1 promotes cervical cancer progression through epigenetically silencing miR-200b, APMIS 124 (8) (2016) 649–658.
[45] Y. Lu, T. Li, G. Wei, L. Liu, Q. Chen, L. Xu, K. Zhang, D. Zeng, R. Liao, The long non-coding RNA NEAT1 regulates epithelial to mesenchymal transition and radioresistance in through miR-204/ZEB1 axis in nasopharyngeal carcinoma, Tumor Biol. 37 (9) (2016) 11733–11741.
[46] X. Zhang, R. Gejman, A. Mahta, Y. Zhong, K.A. Rice, Y. Zhou, P. Cheunsuchon, D.N. Louis, A. Klibanski, Maternally expressed gene 3, an imprinted noncoding RNA gene, is associated with meningioma pathogenesis and progression, Cancer Res. 70 (6) (2010) 2350–2358.
[47] C.-Y. Wang, L. Hua, K.-H. Yao, J.-T. Chen, J.-J. Zhang, J.-H. Hu, Long non-coding RNA CCAT2 is up-regulated in gastric cancer and associated with poor prognosis, Int. J. Clin. Exp. Pathol. 8 (1) (2015) 779.
[48] D.-H. Le, L. Verbeke, L.H. Son, D.-T. Chu, V.-H. Pham, Random walks on mutual microRNA–target gene interaction network improve the prediction of disease-associated microRNAs, BMC Bioinforma. 18 (1) (2017) 479.
[49] D.-H. Le, Network-based ranking methods for prediction of novel disease associated microRNAs, Comput. Biol. Chem. 58 (2015) 139–148.
[50] D.H. Le, V.H. Pham, T.T. Nguyen, An ensemble learning-based method for prediction of novel disease-microRNA associations, 2017 9th International Conference on Knowledge and Systems Engineering (KSE): 19-21 Oct. 2017 2017, pp. 7–12.
[51] D.-H. Le, Disease phenotype similarity improves the prediction of novel disease-associated microRNAs, Information and Computer Science (NICS), 2015 2nd National Foundation for Science and Technology Development Conference on: 16–18 Sept. 2015 2015, pp. 76–81.
[52] X. Chen, D. Xie, Q. Zhao, Z.-H. You, MicroRNAs and complex diseases: from experimental results to computational models, Brief. Bioinform. (2017) bbx130–bbx130.
[53] Z.-H. You, Z.-A. Huang, Z. Zhu, G.-Y. Yan, Z.-W. Li, Z. Wen, X. Chen, PBMDA: a novel and effective path-based computational model for miRNA–disease association prediction, PLoS Comput. Biol. 13 (3) (2017), e1005455.
[54] X. Chen, C.C. Yan, X. Zhang, Z.-H. You, L. Deng, Y. Liu, Y. Zhang, Q. Dai, WBSMDA: Within and Between Score for


MiRNA–disease Association prediction, Sci. Rep. 6 (2016) 21106.
[55] P.N. Robinson, S. Köhler, S. Bauer, D. Seelow, D. Horn, S. Mundlos, The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease, Am. J. Hum. Genet. 83 (5) (2008) 610–615.
[56] S. Köhler, N.A. Vasilevsky, M. Engelstad, E. Foster, J. McMurry, S. Aymé, G. Baynam, S.M. Bello, C.F. Boerkoel, K.M. Boycott, et al., The Human Phenotype Ontology in 2017, Nucleic Acids Res. 45 (D1) (2017) D865–D876.
[57] H.J. Lowe, G.O. Barnett, Understanding and using the Medical Subject Headings (MeSH) vocabulary to perform literature searches, JAMA 271 (14) (1994) 1103–1108.
[58] C.E. Lipscomb, Medical Subject Headings (MeSH), Bull. Med. Libr. Assoc. 88 (3) (2000) 265–266.
[59] Organization WH, The ICD-10 Classification of Mental and Behavioural Disorders: Clinical Descriptions and Diagnostic Guidelines, vol. 1, World Health Organization, 1992.
[60] N. Sioutos, Sd Coronado, M.W. Haber, F.W. Hartel, W.-L. Shaiu, L.W. Wright, NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information, J. Biomed. Inform. 40 (1) (2007) 30–43.
[61] M.Q. Stearns, C. Price, K.A. Spackman, A.Y. Wang, SNOMED clinical terms: overview of the development process and project status, Proceedings of the AMIA Symposium 2001, pp. 662–666.
[62] R. Cornet, N. de Keizer, Forty years of SNOMED: a literature review, BMC Med. Inform. Decis. Mak. 8 (1) (2008) S2.
[63] S.S. Weinreich, R. Mangon, J. Sikkens, M. Teeuw, M. Cornel, Orphanet: a European database for rare diseases, Ned. Tijdschr. Geneeskd. 152 (9) (2008) 518–519.
[64] H.V. Firth, S.M. Richards, A.P. Bevan, S. Clayton, M. Corpas, D. Rajan, S.V. Vooren, Y. Moreau, R.M. Pettett, N.P. Carter, DECIPHER: Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources, Am. J. Hum. Genet. 84 (4) (2009) 524–533.
[65] P. Resnik, Using information content to evaluate semantic similarity in a taxonomy, Proceedings of the 14th International Joint Conference on Artificial Intelligence, vol. 1, Morgan Kaufmann Publishers Inc., Montreal, Quebec, Canada, 1995.
[66] X. Chen, C.C. Yan, X. Zhang, Z.-H. You, Y.-A. Huang, G.-Y. Yan, HGIMDA: heterogeneous graph inference for miRNA–disease association prediction, Oncotarget 7 (40) (2016) 65257–65269.
