
the evidence of these papers, another curtain appears behind it. To pull this second curtain back, we must develop strategies to isolate cells directly from tissues. What this successful technology will be is not yet clear, but without this information, our understanding of the roles miRNAs play in cellular differentiation and homeostasis remains incomplete. Accurate miRNA expression estimates are even more important as we learn about the importance of the relative abundance of miRNAs to their targets and sponges [13].

All together, these insights and resources greatly advance miRNA research.

Acknowledgments
M.K.H. and M.N.M. are supported by grant 1R01HL137811 from the National Institutes of Health. M.K.H. is also supported by an American Heart Association Grant-in-Aid (17GRNT33670405). M.N.M. is also supported by the University of Rochester CTSA award number UL1 TR002001 from the National Center for Advancing Translational Sciences of the National Institutes of Health. B.F. is supported by the South-Eastern Norway Regional Health Authority (Grant No. 2014041). K.J.P. is supported by NASA-Ames.

1 Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
2 Department of Tumor Biology, Institute for Cancer Research, The Norwegian Radium Hospital, Oslo University Hospital, N-0424 Oslo, Norway
3 Department of Biological Sciences, Dartmouth College, Hanover, NH 03755, USA
4 Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, USA

Twitter: @Marc_Halushka
*Correspondence: mhalush1@jhmi.edu (M.K. Halushka).
URL: http://labs.pathology.jhu.edu/halushka/.
https://doi.org/10.1016/j.tig.2017.12.015

References
1. Kent, O.A. et al. (2014) Lessons from miR-143/145: the importance of cell-type localization of miRNAs. Nucleic Acids Res. 42, 7528–7538
2. McCall, M.N. et al. (2011) MicroRNA profiling of diverse endothelial cell types. BMC Med. Genom. 4, 78
3. Witwer, K.W. and Halushka, M.K. (2016) Toward the promise of microRNAs – enhancing reproducibility and rigor in microRNA research. RNA Biol. 13, 1103–1116
4. de Rie, D. et al. (2017) An integrated expression atlas of miRNAs and their promoters in human and mouse. Nat. Biotechnol. 35, 872–878
5. Juzenas, S. et al. (2017) A comprehensive, cell specific microRNA catalogue of human peripheral blood. Nucleic Acids Res. 45, 9290–9301
6. McCall, M.N. et al. (2017) Toward the human cellular microRNAome. Genome Res. 27, 1769–1781
7. Ludwig, N. et al. (2016) Distribution of miRNA expression across human tissues. Nucleic Acids Res. 44, 3865–3877
8. Fromm, B. et al. (2015) A uniform system for the annotation of vertebrate microRNA genes and the evolution of the human microRNAome. Annu. Rev. Genet. 49, 213–242
9. Pritchard, C.C. et al. (2012) MicroRNA profiling: approaches and considerations. Nat. Rev. Genet. 13, 358–369
10. Baras, A.S. et al. (2015) miRge – a multiplexed method of processing small RNA-seq data to determine microRNA entropy. PLoS One 10, e0143066
11. Kuosmanen, S.M. et al. (2017) MicroRNA profiling reveals distinct profiles for tissue-derived and cultured endothelial cells. Sci. Rep. 7, 10943
12. Schwarz, E.C. et al. (2016) Deep characterization of blood cell miRNomes by NGS. Cell. Mol. Life Sci. 73, 3169–3181
13. Pinzon, N. et al. (2017) microRNA target prediction programs predict many false positives. Genome Res. 27, 234–245

Forum
The Definition of Open Reading Frame Revisited
Patricia Sieber,1 Matthias Platzer,2 and Stefan Schuster1,*

The term open reading frame (ORF) is of central importance to gene finding. Surprisingly, at least three definitions are in use. We discuss several molecular biological and bioinformatics aspects, and we recommend using the definition in which an ORF is bounded by stop codons.

Open reading frame (ORF) is a basic term in molecular genetics and bioinformatics. The detection of ORFs is an important step in finding protein-coding genes in genomic sequences, including analyses based on highly fragmented draft (meta)genome assemblies [1–3]. ORFs can be detected by simple in silico analysis, while proving that a sequence is really protein-coding requires more effort. Surprisingly, in many textbooks not much attention is spent on defining the term ORF, apparently taking its meaning for granted. Moreover, the given definitions are often not perfectly clear-cut. For example, the standard textbook Genes VII by Lewin [4] states on p. 26: ‘A reading frame that consists exclusively of triplets representing amino acids is called an open reading frame or ORF. A sequence that is translated into protein has a reading frame that starts with a special initiation codon (AUG) and that extends through a series of triplets representing amino acids until it ends at one of the three types of termination codon’. The first sentence defines an ORF as bounded by stop codons (stop/stop definition) whereas the second sentence may be (mis)understood as beginning with a start codon (start/stop definition). Currently at least three definitions are in use, which differ in the location of the ORF boundaries [5] (Box 1).

Before going into detail, it is worth recalling the different meanings of the term ‘definition’ itself. A ‘lexical definition’ reports the most common usage of a term [6,7]. It is the definition likely to be found in a dictionary and can change over time. An ‘operational definition’ focuses on a specific objective or application and may differ from the lexical definition [6]. In our case, the main objective is gene finding using bioinformatics software. The question arises of how the term ORF deviates from that of a coding DNA sequence (CDS). A CDS means a nucleotide sequence that is eventually translated into a protein [8]. This implies that the CDS of a particular protein is bounded by translation start and stop codons. In some cases the term ORF is considered equivalent to that of CDS [9]. Other authors describe an ORF as a potential protein-coding sequence which can be determined by sequence features alone [8]. Note that there is a difference between the

Trends in Genetics, March 2018, Vol. 34, No. 3 167


concepts of reading frame and ORF. A reading frame is one of six possibilities for translating a given double-stranded genomic sequence into amino acids. For a particular reading frame, an ORF is a region that is not interrupted by a stop codon and is bounded in accordance with a particular definition (Box 1) [5]. Thus, an ORF is a sequence region that is ‘open’ for translation. It is an indicator for a potential protein-coding gene [3].

We revisit and discuss here the different ORF definitions to finally recommend one definition universally applicable in finding protein-coding genes by bioinformatics tools. All three ORF definitions currently in use (Box 1 and Figure 1) consider stop codons. In the most widely used genetic code, three of 64 triplets encode stop codons (TAG, TGA, and TAA). In DNA sequences not subject to selection and with a G+C content of 50%, the average distance between stop codons is 64/3 ≈ 21 codons. It is even shorter when the G+C content is lower [10]. By contrast, the median length of protein-coding sequences is considerably higher than 21 codons [5,10].

Definition 1 is the current lexical definition because most textbooks use it. This may have historical reasons because the first completely sequenced genomes (except viruses) were prokaryotic, and gene structure in prokaryotes is less complex than in eukaryotes because of the absence of splicing. Definition 1 focuses on identifying potential CDSs in prokaryotic genomes. In eukaryotes, it is applicable only to mature mRNAs, the few genes containing only a single translated exon, and to genes with introns of a length divisible by three and not containing stop codons in the respective reading frame. However, in both prokaryotes and eukaryotes there is the problem that internal methionine codons can be mistaken for start codons [11]. Context information (such as potential promoter sequences or homology search) can be included to find the correct start of the ORF [11]. In the case of multiple ATG triplets, several tools based on Definition 1 only consider the first as the start codon.

In eukaryotic genomes, it is much more complicated to predict CDSs because most introns contain stop codons and/or cause shifts between reading frames (when comparing mature transcripts with the DNA sequence). Furthermore, it is difficult to identify the correct splice sites. Splicing can be considered easily when applying Definition 2, which can deal with stop codons located within introns. An ORF according to this definition does not necessarily contain an entire CDS, but a potential exon or group of exons. In addition, metagenomic or other fragments such as transcript contigs obtained by RNA-seq with missing start or stop codons can be analyzed. In this case, the search for ATG triplets is often meaningless. The definition should then be relaxed by considering a maximal stretch of a nucleotide sequence not interrupted by internal stop codons in the considered reading frame. It can be applied to complete genomic sequences as well, and could be proposed as a general definition. Another point is that the 5′-untranslated region (5′-UTR) frequently contains stop codons such that the ORF according to Definition 2 is not much longer than when beginning with a start codon [9]. Furthermore, it is easier to apply than Definition 1 because stop codons simply need to be found. Among others, Definition 2 was applied in the algorithm of OrfM [3], which has been shown to be significantly faster than methods that search for start and stop codons. By contrast, Definition 3 is limited to eukaryotic internal and (potentially) completely protein-coding exons because they are identified by specific algorithms of eukaryotic gene annotation before determining translation start and stop positions. Although finding splice sites is more complicated than finding start and stop codons, Definition 3 is useful for these algorithms, but is only rarely mentioned in the literature. All three definitions are based on operational rules and are well suited to implementation. All three are employed in ORF prediction software, as shown in Box S1 in the supplemental information online.

In this context, it is also worthwhile comparing the concepts of ORF and exon. First, they obviously differ because stop codons and splice sites are clearly not identical. Second, there may not be a stop codon in the neighboring introns, and an ORF may therefore include more than one exon (legend to Figure 1).

Box 1. Three ORF Definitions Currently in Use
In all definitions, an ORF is regarded as a stretch of nucleotide sequence that is not interrupted by stop codons in a given reading frame(a) [5], while they differ as follows:

Definition 1: an ORF is a sequence that has a length divisible by three and begins with a translation start codon (ATG) and ends at a stop codon [2,8–10].

Definition 2: an ORF is a sequence that has a length divisible by three and is bounded by stop codons [3,5,12].

Definition 3: an ORF is a sequence delimited by an acceptor and a donor splice site [1]. Thus, it refers to a potentially translated eukaryotic internal exon. 5′- and 3′-terminal exons of a putative gene are determined at the end of the gene prediction process and are not considered for the actual ORF detection.

(a) This overarching ‘boundless’ definition is inherent in all three definitions and is necessary when analyzing very short sequence stretches, as in the case of metagenome assemblies.
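
The boundary rules of Definitions 1 and 2 can be illustrated with a short sketch. The following Python code is not taken from any of the cited tools; the function names are hypothetical, the input is assumed to be an uppercase-convertible DNA string, and only one forward reading frame is scanned. For Definition 2 the bounding stop codons are excluded from the reported interval, and for Definition 1 the terminating stop codon is included; both are choices made here for illustration, since Box 1 does not fix them. In line with the text, only the first ATG after a stop codon is taken as the start under Definition 1.

```python
# Illustrative sketch of Definitions 1 and 2 in a single forward reading frame.
# Names and interval conventions are assumptions, not part of the article.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def _codons(seq, frame):
    """Yield (position, codon) for every complete codon in the given frame."""
    for i in range(frame, len(seq) - 2, 3):
        yield i, seq[i:i + 3]


def orfs_stop_to_stop(seq, frame=0):
    """Definition 2: stretches lying between two consecutive in-frame stops.

    Returns half-open (start, end) intervals that exclude both stop codons.
    """
    seq = seq.upper()
    stops = [i for i, codon in _codons(seq, frame) if codon in STOP_CODONS]
    # Skip empty stretches where two stop codons are directly adjacent.
    return [(a + 3, b) for a, b in zip(stops, stops[1:]) if b > a + 3]


def orfs_start_to_stop(seq, frame=0):
    """Definition 1: first in-frame ATG up to and including the next stop."""
    seq = seq.upper()
    orfs = []
    start = None
    for i, codon in _codons(seq, frame):
        if codon == START_CODON and start is None:
            start = i  # later ATGs before the stop are treated as internal
        elif codon in STOP_CODONS:
            if start is not None:
                orfs.append((start, i + 3))
            start = None
    return orfs


if __name__ == "__main__":
    example = "TAACCCATGAAAATGGGGTGACCCTAG"
    print(orfs_stop_to_stop(example))   # [(3, 18), (21, 24)]
    print(orfs_start_to_stop(example))  # [(6, 21)]
```

In the example, the Definition 2 ORF starting at position 3 contains the Definition 1 ORF and extends upstream of the ATG, which is the difference in boundaries that Figure 1 depicts.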



[Figure 1 schematic: ORF spans along the DNA strand under Definitions 1 and 2 for prokaryotes (ATG, stop codons, and 5′UTR + CDS marked) and under Definitions 1, 2, and 3 for eukaryotes (ATG, stop codons, and two exons marked).]
Figure 1. Applying the Three Definitions Leads to Different Open Reading Frames (ORFs) (Indicated by Orange Lines) Concerning Their Boundaries.
The corresponding ORFs vary between prokaryotes and eukaryotes. An ORF is delimited by a start codon and a stop codon (Definition 1; in the case of prokaryotes
practically redundant with CDS), two stop codons (Definition 2), or donor and acceptor splice sites (Definition 3; only for eukaryotes). In all cases the ORFs are not
interrupted by internal stop codons in the considered reading frame. According to Definition 2, the ORFs of a eukaryotic gene need not lie in the same reading frame. An
ORF according to Definitions 1 or 2 may involve more than one exon if there are no stop codons in the intronic region in between and if they lie in the same reading frame.
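
The relaxed form of Definition 2 discussed in the text, in which the ends of a fragment may serve as boundaries, can be sketched for all six reading frames. This is a minimal illustration under stated assumptions, not the OrfM algorithm: the function names and the `min_codons` parameter are hypothetical, the input is assumed to contain only A, C, G, and T, and stop codons are excluded from the reported stretches.

```python
# Illustrative six-frame scan for maximal stop-free stretches in a fragment.
# Sequence ends count as boundaries (relaxed Definition 2). Names are assumed.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def open_stretches(seq, frame, min_codons=1):
    """Maximal runs of complete codons without a stop in one forward frame."""
    runs = []
    run_start = frame
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if (i - run_start) // 3 >= min_codons:
                runs.append(seq[run_start:i])
            run_start = i + 3
    # Close the last run at the final complete codon of the fragment.
    end = frame + ((len(seq) - frame) // 3) * 3
    if (end - run_start) // 3 >= min_codons:
        runs.append(seq[run_start:end])
    return runs


def six_frame_open_stretches(seq, min_codons=1):
    """Map (strand, frame) to the stop-free stretches found in that frame."""
    seq = seq.upper()
    result = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            result[(strand, frame)] = open_stretches(s, frame, min_codons)
    return result


if __name__ == "__main__":
    for key, runs in six_frame_open_stretches("ATGAAATAGCCC").items():
        print(key, runs)
```

Because no start codon is required, a fragment that lacks its true start or stop still yields candidate stretches, which is the situation described for metagenomic fragments and RNA-seq contigs. In practice `min_codons` would be set well above 1 to suppress the short stretches expected by chance.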

Definition 2 distinguishes clearly between ORF, CDS, and exon. It can easily be processed by a computer and is the most general definition. Furthermore, this definition can be applied even in the case of prokaryotes and metagenomic sequences. Overall, we conclude that Definition 2 is to be preferred, and we suggest making it the lexical definition in the future. Definition 2 is preferable from an operational, pragmatic point of view: from stop to stop.

Acknowledgments
We thank Günter Theißen and Martin Hölzer for stimulating discussions. Financial support by the University of Jena and the Deutsche Forschungsgemeinschaft (Transregio 124 FungiNet, project B1) is gratefully acknowledged.



Supplemental Information
Supplemental information associated with this article can be found online at https://doi.org/10.1016/j.tig.2017.12.009.

1 Department of Bioinformatics, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany
2 Leibniz Institute on Aging – Fritz Lipmann Institute (FLI), Beutenbergstraße 11, 07745 Jena, Germany

*Correspondence: stefan.schu@uni-jena.de (S. Schuster).
https://doi.org/10.1016/j.tig.2017.12.009

References
1. Brent, M.R. (2005) Genome annotation past, present, and future: How to define an ORF at each locus. Genome Res. 15, 1777–1786
2. Mir, K. et al. (2012) Predicting statistical properties of open reading frames in bacterial genomes. PLoS One 7, e45103
3. Woodcroft, B.J. et al. (2016) OrfM: a fast open reading frame predictor for metagenomic data. Bioinformatics 32, 2702–2703
4. Lewin, B. (ed.) (1999) Genes VII, Oxford University Press
5. Claverie, J.-M. (1997) Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet. 6, 1735–1744
6. Sevilla, C.G. et al. (2007) Research Methods, Rex Book Store
7. Lau, J.Y.F. (2011) An Introduction to Critical Thinking and Creativity. Think More, Think Better, John Wiley & Sons
8. Andrews, S.J. and Rothnagel, J.A. (2014) Emerging evidence for functional peptides encoded by short open reading frames. Nat. Rev. Genet. 15, 193–204
9. Min, X.J. et al. (2005) OrfPredictor: predicting protein-coding regions in EST-derived sequences. Nucleic Acids Res. 33, W677–W680
10. Pohl, M. et al. (2012) GC content dependency of open reading frame prediction via stop codon frequencies. Gene 511, 441–446
11. Guigo, R. et al. (1992) Prediction of gene structure. J. Mol. Biol. 226, 141–157
12. Fermin, D. et al. (2006) Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics. Genome Biol. 7, R35
