Community Curation and Expert Curation of Human Long Noncoding RNAs With LncRNAWiki and LncBook
INTRODUCTION
Due to the rapid advancement of next-generation sequencing technology, more and more
long noncoding RNAs (lncRNAs) have been discovered in many species ranging
from mammals to plants (Fang et al., 2018). In particular, the number of known human
lncRNAs has increased exponentially (Derrien et al., 2012; Fang et al., 2018; Iyer et al.,
Ma et al.
Given the large number of human lncRNAs, it would be quite laborious and time con-
suming to rely mainly on expert curation to curate these lncRNAs. However, a wiki
platform enables any user to edit the information at any time and thus features collabo-
rative information integration, up-to-date content, and low-maintenance cost. Based on
MediaWiki, we developed LncRNAWiki (http://lncrna.big.ac.cn/index.php/Main_Page)
to harness collective efforts to collect, edit, and annotate information about human
lncRNAs (Ma et al., 2015). In the past years, LncRNAWiki has been frequently updated
by curating more experimentally validated human lncRNAs, linking lncRNAs to diseases,
and identifying small peptides encoded by lncRNAs (BIG Data Center Members, 2017,
2018, 2019). In LncRNAWiki, users can search lncRNAs and browse basic information,
including genomic location, classification, exon number, and sequence, and obtain the
annotations on function, expression, and disease association, among others. In addition,
registered users can add a newly discovered lncRNA and edit or curate existing lncRNAs.
While LncRNAWiki has been an extremely useful tool for building a database of lncRNAs
and their associated information, it has significant limitations on managing structured data
and providing customized functionalities. In LncRNAWiki, the functional annotations
and sequence data are stored as unstructured text, which makes it difficult to retrieve and
show data items of interest. To organize large-scale annotations in a structured manner
and to provide customized Web functionalities with more friendly interfaces, we con-
structed LncBook (http://bigd.big.ac.cn/lncbook; Ma et al., 2019) as a complement to
the community curation–based LncRNAWiki. In LncBook, users can obtain the systematically
curated function and disease information, which is derived from LncRNAWiki.
Most importantly, users can access large-scale lncRNA-related data including expres-
sion, methylation, variation, and interaction, among others, while also taking advantage
of fully incorporated analysis tools.
Current Protocols in Bioinformatics
Software
Up-to-date Web browser, such as Firefox, Safari, or Internet Explorer
Search lncRNA in LncRNAWiki
Users can search lncRNAs by transcript ID, symbol, or other keywords. On the homepage,
a global search in LncRNAWiki can be performed by entering a transcript ID, symbol,
keyword, or phrase in the search box and then pressing “Enter” or clicking the magnifier
icon in the search box. The results page lists both page title matches and page text
matches; if nothing matches, a message states that no page contains the keywords or
phrases. If a page title is identical to the keyword, pressing “Enter” jumps directly
to that page.
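Because LncRNAWiki is built on MediaWiki, the same keyword search can in principle be scripted through the standard MediaWiki search API (`action=query` with `list=search`). The sketch below only builds the request URL and sends nothing. The `api.php` location is an assumption: it is not given in this protocol, and it may differ or be disabled on the LncRNAWiki server.

```python
from urllib.parse import urlencode

# Assumed endpoint: MediaWiki installations usually expose api.php next to
# index.php, but this protocol does not confirm the path for LncRNAWiki.
ASSUMED_API_URL = "http://lncrna.big.ac.cn/api.php"


def build_search_url(keyword, limit=10, api_url=ASSUMED_API_URL):
    """Return a MediaWiki full-text search URL for the given keyword."""
    params = {
        "action": "query",    # standard MediaWiki query module
        "list": "search",     # full-text search submodule
        "srsearch": keyword,  # transcript ID, symbol, or phrase
        "srlimit": limit,     # maximum number of hits to return
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


url = build_search_url("MALAT1")
print(url)
```

If the endpoint exists, opening the printed URL in a browser would return the matching page titles and text snippets as JSON, which mirrors the title and text matches shown on the results page.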
Figure 1 Search results of “MALAT1.” There is no page titled “MALAT1,” and all page text
matches are displayed. In LncRNAWiki, MALAT1 corresponds to the ID “ENST00000534336.1.”
Figure 2 Screenshot of the annotation page of “MALAT1.” There are two parts: the user-editable part and the
Basic Information table. The user-editable part includes Annotated Information, Labs working on this lncRNA,
and References, which allow users to edit the information; the Basic Information table is not editable.
3. Browse the user-editable section, which appears at the top of the page.
The user-editable section stores various annotations as unstructured text, including
Annotated Information, Labs working on this lncRNA, and References.
4. Browse the Basic Information table, which mainly contains genomic annotations.
This table has ten subsections: Transcript ID, Source, Same with, Classification,
Length, Genomic location, Exon number, Exons, Genome context, and Sequence.
Edit or update the annotations of lncRNAs
The annotations in LncRNAWiki can be edited or annotated, which enables the content
to be kept correct and up to date.
Figure 3 Screenshot of the curation page of “MALAT1.” Users should use wiki markup to format the text.
Download
15. Click on “Downloads” on the left side of the page.
Readme, data sources, basic information, sequence, and small protein are all freely
available. If you use the data, please cite the database and the related publications.
Figure 4 Create a page for a new lncRNA in LncRNAWiki. The keyword “TEST-LncRNA” does not exist in
LncRNAWiki. Clicking the highlighted “TEST-LncRNA” will create a new page titled “TEST-LncRNA.”
Figure 5 Browse lncRNAs in LncBook. Click on the transcript or gene ID of interest in the table of lncRNAs
(A) to access the Transcript page (B) or Gene page (C), respectively, where there are multiple subsections
describing basic information of a lncRNA transcript or gene.
and all the functionally studied lncRNA transcripts, respectively. Alternatively, the two
sections can be accessed in the Resources section in the middle of the homepage. This
protocol demonstrates workflows for browsing lncRNAs by chromosome and GC content,
the content of a lncRNA transcript or gene page, and all the experimentally validated
lncRNAs.
Figure 6 Browse featured lncRNAs. Featured LncRNAs allows users to browse the functionally curated
lncRNAs by gene symbol, transcript ID, gene ID, or synonyms. Basic information of each featured lncRNA is
summarized in the table.
3. Browse lncRNA gene. Click the gene ID to access the Gene page, where users
can browse gene-related information, including basic information (genomic context
and length), transcripts, and multi-omics data (expression, methylation, genome
variation) as shown in Figure 5. The reciprocal links of transcripts are also available
in the Gene page.
4. Browse featured LncRNAs. The Featured LncRNAs section enables users to browse
all the functionally curated lncRNAs by gene symbol, transcript ID, gene ID, or syn-
onyms. For the featured lncRNAs, the primary information—including functional
mechanism, biological process, disease, MeSH ontology, and PMID—is summa-
rized in the results table (Fig. 6).
5. On the homepage, type or paste a keyword into the search box in the middle of the
page, and then click “Search” on the right of the search box to perform a keyword
search (Fig. 7). This will lead to a results page that displays a list of lncRNAs that
match the search.
For example, entering “breast cancer” in the search box and then clicking the search
button will lead to a results page containing all lncRNAs that are associated with breast
cancer. The search results for breast cancer by default only show items including Tran-
script ID, Gene ID, Symbol, Classification, Biological Processes, and Disease. For a
personalized view, click the pull-down menu in the top right corner of the search results
to add or remove items (Fig. 7).
Figure 7 Basic search in LncBook. The example shows the output of “breast cancer.” For a personalized
view, click the pull-down box to add or remove items.
6. LncBook also allows for an individual feature search in different sections; it may
be convenient to search a lncRNA or a specific group of lncRNAs in the resources,
including Featured LncRNAs, Disease, Function, Methylation, and Variation.
Browse multi-omics data
In the Multi-Omics pull-down list in the navigation menu or the Resources section in the
middle of the homepage, hyperlinks provide access to basic browsing of multi-omics data,
including expression, methylation, variation, and interaction (Fig. 8).
Figure 8 Browse multi-omics data. Multi-omics data can be accessed by clicking items in the pull-down list
of the navigation menu or the Resources section in the middle of the homepage. Multi-omics data cover
expression, methylation, variation, and interaction.
Figure 9 Browse lncRNA expression profiles across normal human tissues: (A) filter options, (B) lncRNAs
that match the filter options in (A), (C) expression profile of the lncRNA “HSALNT0001588,” and (D) criteria for
tissue-specific or housekeeping lncRNAs.
b. Click the button in the Chart column to view expression levels of the transcript
in different tissues in a box plot (Fig. 9).
c. Click the hyperlinks in the left corner of the expression page to view results
pages containing tissue-specific lncRNAs and housekeeping lncRNAs, which are
characterized with specific parameters (Fig. 9).
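The parameters used to call tissue-specific and housekeeping lncRNAs are shown in Figure 9D and are not reproduced in the text. As a rough illustration only, the sketch below computes the tissue specificity index tau, a widely used measure introduced by Yanai et al. (2005), which appears in the reference list. Whether LncBook uses exactly this index, and with which cutoffs, is an assumption here, so no thresholds are given.

```python
def tau_index(expression):
    """Tissue specificity index tau for one transcript.

    expression: non-negative expression values, one per tissue.
    Returns a value between 0 (uniform across tissues, housekeeping-like)
    and 1 (expressed in a single tissue, tissue-specific).
    """
    n = len(expression)
    if n < 2:
        raise ValueError("need expression values for at least two tissues")
    x_max = max(expression)
    if x_max <= 0:
        raise ValueError("transcript is not expressed in any tissue")
    # Sum of (1 - normalized expression) over tissues, scaled by n - 1.
    return sum(1 - x / x_max for x in expression) / (n - 1)


# Made-up expression vectors, for illustration only.
uniform = tau_index([5.0, 5.0, 5.0, 5.0])
single_tissue = tau_index([0.0, 0.0, 10.0, 0.0])
print(uniform, single_tissue)
```

Values close to 1 point to expression concentrated in one tissue, while values close to 0 point to broad, even expression.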
Figure 10 Browse methylation profiles across different cancers. Methylation levels of promoters and body
regions can be viewed in dot plots. Methylation allows specific browsing by gene symbol, synonyms, and
transcript ID.
8. Methylation archives DNA methylation levels of both cancer and normal sam-
ples in nine different types of cancer (bladder urothelial carcinoma, glioblastoma
multiforme, lung squamous cell carcinoma, lung adenocarcinoma, breast invasive
carcinoma, colon adenocarcinoma, rectum adenocarcinoma, stomach adenocarci-
noma, uterine corpus endometrial carcinoma). Methylation level is calculated based
on whole genome bisulfite sequencing data.
a. Click “Methylation” to browse the average DNA methylation level of promoter
regions in nine types of cancer (Fig. 10).
b. Click on the chart hyperlinks to view dot plots of methylation levels of the
promoter and body regions in cancer and normal samples (Fig. 10).
The Methylation section also allows specific browsing by gene symbol, synonyms, and
transcript ID (Fig. 10).
9. Variation provides information on single-nucleotide polymorphisms (SNPs) that are
mapped to lncRNA loci in the SNP database (dbSNP; Sherry et al., 2001). For
each SNP, minor allele frequency (MAF) values are annotated based on the 1000
Genomes Project (Abecasis et al., 2010), and pathogenic information is obtained
from ClinVar (Landrum et al., 2016) and COSMIC (Forbes et al., 2017).
a. Click “Variation” to browse SNPs in lncRNA loci by chromosome, posi-
tive/negative strand, genomic location, and dbSNP ID.
For example, choosing the filter options “Chromosome2,” “+,” and genomic region
between “1,000” and “100,000” leads to the results table shown in Figure 11.
b. View lncRNA SNP-related information listed in the summary table. For each
SNP, lncRNA transcript ID, basic information, MAF values obtained from the
1000 Genomes Project, and pathogenic information in ClinVar and COSMIC are
displayed (Fig. 11).
Clicking the dbSNP ID provides access to the dbSNP page, where detailed information
is available.
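The Variation filter in step 9a (chromosome, strand, and genomic region) can be mimicked on downloaded data with a few lines of code. The sketch below is a minimal illustration: the record fields, the chromosome naming, and the inclusive treatment of the region boundaries are assumptions, and the SNP records are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    dbsnp_id: str    # hypothetical identifier, not a real dbSNP entry
    chromosome: str
    strand: str      # "+" or "-"
    position: int    # genomic coordinate of the SNP


def filter_snps(snps, chromosome, strand, start, end):
    """Keep SNPs on the given chromosome and strand within [start, end]."""
    return [
        s for s in snps
        if s.chromosome == chromosome
        and s.strand == strand
        and start <= s.position <= end
    ]


# Made-up records, for illustration only.
records = [
    Snp("rs_example_1", "chr2", "+", 50_000),
    Snp("rs_example_2", "chr2", "-", 60_000),
    Snp("rs_example_3", "chr2", "+", 250_000),
    Snp("rs_example_4", "chr1", "+", 70_000),
]

hits = filter_snps(records, "chr2", "+", 1_000, 100_000)
print([s.dbsnp_id for s in hits])
```

Only the first record passes all three conditions, which corresponds to choosing chromosome 2, the positive strand, and the region between 1,000 and 100,000 in the Web interface.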
Figure 11 Browse variation data. The results table shows SNPs that match the filter options.
Figure 12 Browse lncRNA-miRNA interactions. Interaction allows users to browse lncRNA-miRNA interac-
tions by gene symbol, miRNA ID, synonyms, transcript ID, and experimental evidence.
Figure 14 Browse validated lncRNA-disease associations. Validated allows specific searches by different
items. Alternatively, the lncRNAs associated with specific MeSH ontology terms are available by clicking on
the hyperlink.
15. Selecting “Disease” or “MeSH Ontology” in the search box pull-down menu returns
the group of lncRNAs related to specific diseases (Fig. 14).
Similarly, the MeSH Ontology hyperlink at the top of the page provides access to
the specific list of lncRNAs (Fig. 14).
16. Click the gene symbol to access the annotation page in LncRNAWiki.
17. Click “Predicted” toward the top of the page to access the predicted disease-
associated lncRNAs (Fig. 15), which are obtained based on the pathogenic evidence
from methylation, variation, and interaction.
18. View supportive evidence listed in the summary table (Fig. 15). Supportive evidence
is indicated with a checkmark, and classification category and expression pattern
are also provided.
19. Search a specific predicted disease-associated lncRNA or a group of such lncRNAs
by selecting Transcript ID or Classification in the pull-down search menu (Fig. 15).
20. Click “Methylation Change,” “Pathogenic Variation,” or “All Grades” to browse
the group of lncRNAs supported by methylation evidence, pathogenic variation
evidence, or one or more types of evidence, respectively (Fig. 15).
Download
21. Click “Download” in the navigation menu to access the Download page.
LncBook is an open access database distributed under the terms of the Creative Commons
Attribution Noncommercial License, which permits unrestricted noncommercial use, dis-
tribution, and reproduction in any medium, provided the original work is properly cited.
On the Download page, users can freely download various data, including lncRNA
sequences in FASTA format; annotation files in GTF and GFF formats; lists of featured
lncRNAs; and function, disease, expression, methylation, variation, and interaction data.
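Once the FASTA file has been downloaded, it can be read with a few lines of code. The sketch below is a minimal FASTA reader plus a GC content helper (GC content is one of the browsing criteria in LncBook). The record names and sequences are made up, and the layout of the real header lines in the LncBook file is not assumed.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def gc_content(sequence):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Made-up two-record example; replace with open("<downloaded file>") in practice.
example = io.StringIO(">tx_example_1\nACGT\nGGCC\n>tx_example_2\nAATT\n")
records = dict(read_fasta(example))
for name, seq in records.items():
    print(name, len(seq), gc_content(seq))
```

Reading the file as a stream of records keeps memory use low, which matters for a collection of this size.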
Figure 15 Browse predicted disease-associated lncRNAs. Predicted allows specific searches by differ-
ent items including transcript ID and classification. In addition, lncRNAs associated with different kinds of
pathogenic evidence are available by clicking on the hyperlinks in Methylation Change, Pathogenic Variation,
and All Grades.
COMMENTARY
Background Information
Human lncRNAs in LncRNAWiki and LncBook were integrated from different sources. BLAST
alignment (Altschul, Gish, Miller, Myers, & Lipman, 1990) or Cuffcompare comparison
(Trapnell et al., 2012) was used to identify redundant lncRNAs. However, LncBook used
the updated data of existing databases and also identified novel lncRNAs based on
RNA-seq data in HPA.
LncBook provides the most comprehensive list of human lncRNAs to date. Specifically,
we collected existing human lncRNAs from GENCODE version 27 (Derrien et al., 2012),
NONCODE version 5.0 (Fang et al., 2018), LNCipedia version 4.1 (Volders et al., 2019),
and MiTranscriptome beta (Iyer et al., 2015). To obtain high-confidence lncRNAs in
LncBook, a set of strict criteria was adopted by considering redundancy, background
noise, mapping error, incomplete transcript, length, and coding potential. On the
other hand, we integrated the experimentally validated lncRNAs, which were sourced from
LncRNAWiki. The RefSeq and Ensembl references were obtained from the HUGO Gene
Nomenclature Committee (HGNC) to enable genomic location comparison. However, half
of these lncRNAs are presently unable to be traced back to their genomic locations and
thus are not included.
To date LncBook contains a large collection of 247,246 existing lncRNAs. Also, novel
lncRNAs were identified based on the 122 RNA-seq data from HPA, and 21,815 novel
lncRNAs were identified. Finally, we obtained a total number of 270,044 lncRNAs.
Future Directions
In the future, LncBook will try to improve data quality by using more strict standards
and integrating high-quality annotations. We plan to integrate full-length lncRNAs from
additional databases such as FANTOM CAT (Hon et al., 2017) and BIGTranscriptome
(You, Yoon, & Nam, 2017). Also, we will maintain regular integration of newly discovered
lncRNAs to obtain a comprehensive list of human lncRNAs. Future developments of
LncBook also include incorporation of other omics data, such as RNA N6-methyladenosine
and RNA 5-methylcytosine modifications, and identification of differentially expressed
lncRNAs in normal and disease samples. This would provide users with more information
to find important functional lncRNAs.
Figure 16 Tools for online analysis. Tools can be accessed by clicking items in the pull-down list of the
navigation menu or by clicking “Tools” in the homepage.
Figure 17 BLAST results of the query sequence “ENST00000469225.1_1” against LncBook lncRNAs.
16 of 19
Current Protocols in Bioinformatics
Figure 18 Coding potential prediction. (A) LGC main page. Clicking “Example Sequence” and “Run” will
jump to the waiting page (B). (C) Prediction results.
Figure 19 Classifying a lncRNA based on its positional relationship with a protein-coding gene. In this
example, clicking “Example” and “Run” shows that this is an intergenic lncRNA.
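The idea behind positional classification can be sketched as an interval overlap test between a lncRNA transcript and protein-coding genes. The example below is a simplified illustration and not the in-house tool: the three category names, the 1-based inclusive coordinates, and all intervals are assumptions made for this sketch, and the categories reported by LncBook may be finer grained.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def classify(lncrna, coding_genes):
    """Assign a coarse positional class relative to protein-coding genes."""
    hits = [g for g in coding_genes if overlaps(lncrna, g)]
    if not hits:
        return "intergenic"
    if any(g.strand == lncrna.strand for g in hits):
        return "sense-overlapping"
    return "antisense"


# Made-up gene models, for illustration only.
genes = [
    Interval("chr1", 1_000, 5_000, "+"),
    Interval("chr1", 20_000, 30_000, "-"),
]

print(classify(Interval("chr1", 8_000, 9_000, "+"), genes))
print(classify(Interval("chr1", 4_500, 6_000, "-"), genes))
print(classify(Interval("chr1", 25_000, 26_000, "-"), genes))
```

A transcript that overlaps no protein-coding gene on its chromosome is reported as intergenic, which is the outcome described for the example sequence in Figure 19.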
(Fig. 16), which would help users perform online analysis.
BLAST alignment
BLAST allows sequence similarity searches against the comprehensive list of human
lncRNAs in LncBook (Fig. 17). For users who would like to test if a certain lncRNA
has been included in LncBook or to try to find human lncRNA homologs, BLAST could
be used to perform their analyses.
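The same kind of check can also be run offline, for example with a local NCBI BLAST+ installation against the downloaded FASTA file. The sketch below parses BLAST+ tabular output (`-outfmt 6`, whose twelve default columns are listed in the code) and picks the highest-scoring hit. Running BLAST locally is an assumption of this example rather than part of the protocol, and the two hit lines are made up.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
FLOAT_FIELDS = ("pident", "evalue", "bitscore")
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")


def parse_blast6(handle):
    """Yield one dict per hit from tab-separated BLAST output."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST6_COLUMNS, row))
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        yield hit


def best_hit(hits):
    """Return the hit with the highest bit score, or None if there is none."""
    return max(hits, key=lambda h: h["bitscore"], default=None)


# Made-up output lines, for illustration only.
example = io.StringIO(
    "query_1\tsubject_A\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t360\n"
    "query_1\tsubject_B\t85.00\t120\t18\t0\t5\t124\t40\t159\t2e-30\t150\n"
)
top = best_hit(parse_blast6(example))
print(top["sseqid"], top["pident"], top["bitscore"])
```

A near-identical, full-length top hit suggests that the query is already in the collection, whereas weaker partial hits are candidates for homologs.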
Figure 20 LncRNA ID conversion. Clicking “Example IDs” and “Run” will display IDs in other databases.
Classification
Classification is also an in-house tool which was developed to classify a lncRNA
transcript based on its positional relationship with a protein-coding gene (Fig. 19).
Users could use this tool to annotate the relative genomic location of their own lncRNAs.
LncRNA ID conversion
LncBook integrates a comprehensive list of human lncRNAs from existing databases,
and this conversion tool allows transcript ID conversion among six different databases:
LncBook, GENCODE, RefSeq, NONCODE, LNCipedia, MiTranscriptome (Fig. 20).
Literature Cited
Abecasis, G. R., Altshuler, D., Auton, A., Brooks, L. D., Durbin, R. M., Gibbs, R. A., . . . McVean, G. A. (2010). A map of human genome variation from population-scale sequencing. Nature, 467, 1061–1073. doi: 10.1038/nature09534.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215, 403–410. doi: 10.1016/S0022-2836(05)80360-2.
Betel, D., Wilson, M., Gabow, A., Marks, D. S., & Sander, C. (2008). The microRNA.org resource: Targets and expression. Nucleic Acids Research, 36, D149–153. doi: 10.1093/nar/gkm995.
BIG Data Center Members. (2017). The BIG Data Center: From deposition to integration to translation. Nucleic Acids Research, 45, D18–D24. doi: 10.1093/nar/gkw1060.
BIG Data Center Members. (2018). Database resources of the BIG Data Center in 2018. Nucleic Acids Research, 46, D14–D20. doi: 10.1093/nar/gkx897.
BIG Data Center Members. (2019). Database Resources of the BIG Data Center in 2019. Nucleic Acids Research, 47, D8–D14. doi: 10.1093/nar/gky993.
Chen, G., Wang, Z., Wang, D., Qiu, C., Liu, M., Chen, X., . . . Cui, Q. (2013). LncRNADisease: A database for long-non-coding RNA-associated diseases. Nucleic Acids Research, 41, D983–986. doi: 10.1093/nar/gks1099.
Derrien, T., Johnson, R., Bussotti, G., Tanzer, A., Djebali, S., Tilgner, H., . . . Guigo, R. (2012). The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression. Genome Research, 22, 1775–1789. doi: 10.1101/gr.132159.111.
Fang, S., Zhang, L., Guo, J., Niu, Y., Wu, Y., Li, H., . . . Zhao, Y. (2018). NONCODEV5: A comprehensive annotation database for long non-coding RNAs. Nucleic Acids Research, 46, D308–D314. doi: 10.1093/nar/gkx1107.
Forbes, S. A., Beare, D., Boutselakis, H., Bamford, S., Bindal, N., Tate, J., . . . Campbell, P. J. (2017). COSMIC: Somatic cancer genetics at high-resolution. Nucleic Acids Research, 45, D777–D783. doi: 10.1093/nar/gkw1121.
Hon, C. C., Ramilowski, J. A., Harshbarger, J., Bertin, N., Rackham, O. J., Gough, J., . . . Forrest, A. R. (2017). An atlas of human long non-coding RNAs with accurate 5′ ends. Nature, 543, 199–204. doi: 10.1038/nature21374.
Iyer, M. K., Niknafs, Y. S., Malik, R., Singhal, U., Sahu, A., Hosono, Y., . . . Chinnaiyan, A. M. (2015). The landscape of long noncoding RNAs in the human transcriptome. Nature Genetics, 47, 199–208. doi: 10.1038/ng.3192.
Landrum, M. J., Lee, J. M., Benson, M., Brown, G., Chao, C., Chitipiralla, S., . . . Maglott, D. R. (2016). ClinVar: Public archive of interpretations of clinically relevant variants. Nucleic Acids Research, 44, D862–868. doi: 10.1093/nar/gkv1222.
Lewis, B. P., Burge, C. B., & Bartel, D. P. (2005). Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell, 120, 15–20. doi: 10.1016/j.cell.2004.12.035.
Li, J. H., Liu, S., Zhou, H., Qu, L. H., & Yang, J. H. (2014). starBase v2.0: Decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Research, 42, D92–97. doi: 10.1093/nar/gkt1248.
Ma, L., Cao, J., Liu, L., Du, Q., Li, Z., Zou, D., . . . Zhang, Z. (2019). LncBook: A curated knowledgebase of human long non-coding RNAs. Nucleic Acids Research, 47, D128–D134. doi: 10.1093/nar/gky960.
Ma, L., Li, A., Zou, D., Xu, X., Xia, L., Yu, J., . . . Zhang, Z. (2015). LncRNAWiki: Harnessing community knowledge in collaborative curation of human long non-coding RNAs. Nucleic Acids Research, 43, D187–192. doi: 10.1093/nar/gku1167.
Sherry, S. T., Ward, M. H., Kholodov, M., Baker, J., Phan, L., Smigielski, E. M., & Sirotkin, K. (2001). dbSNP: The NCBI database of genetic variation. Nucleic Acids Research, 29, 308–311. doi: 10.1093/nar/29.1.308.
The GTEx Consortium. (2015). Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348, 648–660. doi: 10.1126/science.1262110.
Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., . . . Pachter, L. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7, 562–578. doi: 10.1038/nprot.2012.016.
Uhlen, M., Fagerberg, L., Hallstrom, B. M., Lindskog, C., Oksvold, P., Mardinoglu, A., . . . Ponten, F. (2015). Proteomics. Tissue-based map of the human proteome. Science, 347, 1260419. doi: 10.1126/science.1260419.
Volders, P. J., Anckaert, J., Verheggen, K., Nuytens, J., Martens, L., Mestdagh, P., & Vandesompele, J. (2019). LNCipedia 5: Towards a reference set of human long non-coding RNAs. Nucleic Acids Research, 47, D135–D139. doi: 10.1093/nar/gky1031.
Volders, P. J., Verheggen, K., Menschaert, G., Vandepoele, K., Martens, L., Vandesompele, J., & Mestdagh, P. (2015). An update on LNCipedia: A database for annotated human lncRNA sequences. Nucleic Acids Research, 43, D174–D180. doi: 10.1093/nar/gku1060.
Wang, G., Yin, H., Li, B., Yu, C., Wang, F., Xu, X., . . . Zhang, Z. (2019). Characterization and identification of long non-coding RNAs based on feature relationship. Bioinformatics, Epub ahead of print. doi: 10.1093/bioinformatics/btz008.
Yanai, I., Benjamin, H., Shmoish, M., Chalifa-Caspi, V., Shklar, M., Ophir, R., . . . Shmueli, O. (2005). Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics, 21, 650–659. doi: 10.1093/bioinformatics/bti042.
You, B. H., Yoon, S. H., & Nam, J. W. (2017). High-confidence coding and noncoding transcriptome maps. Genome Research, 27, 1050–1062. doi: 10.1101/gr.214288.116.