Classification and Function of Small Open Reading Frames



Article  in  Nature Reviews Molecular Cell Biology · July 2017


DOI: 10.1038/nrm.2017.58



ANALYSIS
Classification and function of small open reading frames
Juan-Pablo Couso¹,² and Pedro Patraquim²
Abstract | Small open reading frames (smORFs) of 100 codons or fewer are usually — if arbitrarily
— excluded from proteome annotations. Despite this, the genomes of many metazoans,
including humans, contain millions of smORFs, some of which fulfil key physiological functions.
Recently, the transcriptome of Drosophila melanogaster was shown to contain thousands of
smORFs of different classes that actively undergo translation, which produces peptides of mostly
unknown function. Here, we present a comprehensive analysis of smORFs in flies, mice and
humans. We propose the existence of several functional classes of smORFs, ranging from inert
DNA sequences to transcribed and translated cis-regulators of translation and peptides with a
propensity to function as regulators of membrane-associated proteins, or as components of
ancient protein complexes in the cytoplasm. We suggest that the different smORF classes could
represent steps in gene, peptide and protein evolution. Our analysis introduces a distinction
between different peptide-coding classes of smORFs in animal genomes, and highlights the role
of model organisms for the study of small peptide biology in the context of development,
physiology and human disease.

¹Centro Andaluz de Biologia del Desarrollo, CSIC-UPO, Sevilla 41013, Spain.
²Brighton and Sussex Medical School, University of Sussex, Brighton BN1 9PX, UK.
Correspondence to J.-P.C. jpcou@upo.es
doi:10.1038/nrm.2017.58
Published online 12 Jul 2017; corrected online 1 Aug 2017

Gene sequences were initially identified as containing open reading frames (ORFs) that are potentially translatable into proteins. More recently, the full molecular complexity of genes has been exposed, culminating in the updated concept of a gene that includes both regulatory regions and transcripts¹. Intriguingly, many genes have been discovered that produce transcripts with mRNA-like features, such as capping and polyadenylation, but that are not apparently translated into proteins, and these transcripts are known as long non-coding RNAs (lncRNAs). These genes and their products have revolutionized our understanding of gene regulation and RNA metabolism²,³.

There is another class of genetic elements that also challenges the understanding of the coding potential of the genome: putatively functional small ORFs (smORFs; also known as sORFs) of 10 to 100 codons⁴. Millions of smORF sequences are found in eukaryotic genomes⁵–⁷, and thousands can be mapped to transcripts, in many cases, to putative lncRNAs⁸–¹⁰. It is as if we have a genome within our genome: a hidden genome about which we know very little. smORFs have been deemed non-coding on the basis of their short length, which defeats standard computational detection of protein-coding capacity; the fact that there has been little experimental corroboration of their function; and for convenience, as their very high numbers present a challenge for annotation and curation. Consequently, functional smORFs are often not annotated because they have not been experimentally corroborated, and they have not been corroborated because they are not annotated, a difficulty which is rarely (and only serendipitously) overcome.

The problem with computational annotation is that, as is the case for canonical protein-coding ORFs, it relies fundamentally on sequence similarities, which reveal the conservation of the putative coding sequence, thereby indicating a selective value and hence function; and the similarity to proteins and protein domains with an experimentally corroborated function, thereby suggesting a similar function for the smORF. However, true conservation and homology of smORFs is difficult to establish owing to two fundamental problems: short sequences accrue lower quantitative conservation scores (that is, the sum of the scores from each amino acid) than longer canonical proteins, whereas, reciprocally, the probability of short sequences obtaining such a ‘low conservation score’ by chance is higher. For example, the BLAST search tool penalizes the identification of protein sequences of fewer than 80 amino acids, and fails with those that have fewer than 20 (REF. 6). Given the high number of smORFs in the genome, and the expectation that most of them are not functional (see below), arbitrary cut-offs for a minimal ORF length of 50 or 100 codons are used in genome annotation, discarding shorter ORFs that lack experimental evidence of function.
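
The effect of such length cut-offs can be illustrated with a minimal Python sketch that scans the forward strand of a DNA sequence for ATG-initiated ORFs and then applies a minimum-length filter. This is only an illustration: the function name `find_orfs`, the toy sequence and the way the threshold is applied are assumptions made for this example, not a description of any annotation pipeline.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (start, end, n_codons) for ATG-to-stop ORFs on the forward strand.

    n_codons counts the ATG and all sense codons, excluding the stop codon.
    Only the most upstream in-frame ATG of each ORF is reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a new ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, n_codons))
                start = None                   # close the ORF
    return sorted(orfs)


# Toy sequence (hypothetical): a 12-codon smORF followed by a 120-codon ORF.
toy = "ATG" + "GCT" * 11 + "TAA" + "CC" + "ATG" + "GCT" * 119 + "TGA"

print(find_orfs(toy))                  # [(0, 39, 12), (41, 404, 120)]
print(find_orfs(toy, min_codons=100))  # [(41, 404, 120)]
```

With a 100-codon cut-off the 12-codon ORF is silently discarded, whether or not it is translated, which is the annotation bias discussed above.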

NATURE REVIEWS | MOLECULAR CELL BIOLOGY VOLUME 18 | SEPTEMBER 2017 | 575


© 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

Obtaining experimental evidence for smORF function is also difficult. Standard biochemical methods for protein isolation fail to detect peptides below 10 kDa, which escape a typical gel or filter, and which can be masked by degradation peptides from larger proteins. Genetics also encounters problems, as short sequences such as smORFs offer a small target for random mutagenesis screens and other gene-discovery protocols, and the large number of smORFs in the genome (see below) makes it impractical to carry out systematic directed mutagenesis. In the unlikely event of a smORF mutation being isolated, it is often assigned to adjacent canonical genes, as most smORFs are not annotated.

Nevertheless, smORFs are finally receiving attention and breaking out of this impasse. There is a growing realization that hundreds, if not thousands, of smORFs are translated⁸,⁹,¹¹ and that smORF-encoded peptides (SEPs) can have important functions and can be widely conserved across metazoans¹²,¹³ (reviewed in REFS 14–16). However, the full repertoire of SEP functions is still unknown, and the genomic functions and evolutionary impact of smORF sequences have not been revealed. Experimental characterization of smORFs at a genomic level in bacteria, yeast and plants showed that hundreds of smORFs can produce a phenotype⁵,¹⁷,¹⁸. In metazoans, experimental evidence has also accumulated to support these conclusions, although it is still anecdotal. Some SEPs¹⁶,¹⁹ (sometimes known as micropeptides (REFS 20,21)) are annotated as having biological activity as antibacterial peptides²², cell signalling molecules²³, cytoskeletal regulators²⁴ and other regulators of canonical proteins¹³,²⁵. These functions can be essential for viability, but only a small minority (a few hundred) of the putative smORFs in each genome have a suspected function (inferred by homology)²⁶, and even fewer (tens) have an experimentally corroborated function¹⁴. These functional smORFs tend to be longer (approximately 80 codons) and thus are more amenable to standard homology searches, as well as to biochemical and genetic analyses. However, 90% of smORFs are much shorter (approximately 20 codons), but can display sequence conservation and evidence of being translated, similarly to ‘functional’ smORFs⁶,⁸,¹⁰. Furthermore, even these shorter smORFs can have crucial developmental and physiological functions and homologues across great evolutionary distances¹²–¹⁴. Therefore, there could be many more uncharacterized smORFs with biomedically relevant functions, but until now we have not identified such smORFs, or predicted their functions.

Current experimental evidence shows that, in animal genomes, only approximately 1.2% of smORFs are transcribed, and of these only about one-third seem to be translated (see below). However, even this small percentage of functional smORFs could theoretically produce tens of thousands of uncharacterized peptides in each animal species. Even if only a small proportion of these peptides have biological activity, we could be missing hundreds of peptides that could shed light on many aspects of biology and medicine. Thus, the challenge is how to identify the biologically active smORFs and their peptides, or the “beautiful needles in the haystack” (REF. 4).

In this article, we present an analysis of both previously available and new smORF data. Although there is evidence for the pervasive use of non-AUG start codons and the translation of overlapping ORFs¹⁴,²⁷,²⁸, we focus on the population of non-overlapping ORFs with canonical AUG start codons in the reference genomes of three metazoans (fruit flies, mice and humans). We propose that sufficient information has accrued to approach smORFs not as an undersized ‘discard bin’, but as a group of novel molecular actors with specific characteristics, evolutionary origin and biological functions at both the RNA (non-coding) and the peptide (coding) levels. We present a classification of animal smORFs based on the characteristics of their sequence and the structure of their RNAs and encoded peptides, a classification that provides predictions on the function of not yet fully characterized smORFs. Finally, we show evidence indicating that novel smORFs can randomly and continuously appear in animal genomes, and that different smORF classes represent steps in the evolution of new canonical proteins.

Classification of smORFs

Translated Drosophila melanogaster smORFs have been identified at the genomic level using a combination of ribosome profiling, peptide tagging and bioinformatics analyses. Two main groups of translated smORFs have been identified, according to their profiling metrics and bioinformatics characteristics, with one such group enriched in peptides allocated to cell membranes and organelles⁸. This classification is important, because it seemed to identify smORFs with high chances of being functional. Such a classification could direct research to more promising smORFs by linking their sequence to biochemical properties and molecular functions, thereby greatly reducing the need for exploratory experiments. Accordingly, we subsequently characterized hemotin, which is a fly smORF from the membrane-associated group, and revealed its activity in phagosome maturation and its conservation in vertebrates²⁹. We have also obtained further experimental and bioinformatics data that support a functionally relevant classification of D. melanogaster smORFs. Thus, although this preliminary classification is far from being able to predict smORF function with the kind of precision that we enjoy with most canonical proteins, it offers heuristic value, and therefore we refine and expand it here to include vertebrate smORFs. We propose the existence of at least five types of smORFs with distinct transcript organization, size, conservation, mode of translation, amino acid usage (frequency of each amino acid) and peptide structure properties. We suggest that these five classes are likely to have different cellular and molecular functions, from inert DNA sequences to transcribed and translated cis-regulators of translation, to the expression of functional peptides with the propensity to function as regulators of canonical proteins. Before we elaborate on the basis of this classification and its functional implications, we introduce the five classes below (FIG. 1).

ORF class | RNA type | Median size (codons) | Translation¹⁵ | Conservation | Coding features | Function
Intergenic ORFs | None | 22 | None | None⁶,⁸ | Non-canonical AA | None
uORFs | 5′ UTRs of mRNAs | 22 | Low | None⁸,³⁰ | Nonrandom AA; no domains | Non-coding; translation regulation
lncORFs | lncRNAs | 24 | Low | None⁸,¹⁰ | Nonrandom AA; no domains | Non-coding or coding
Short CDSs | Short mRNAs | 79 | High | Class | Positively charged AA; transmembrane α-helices | Coding; regulators of canonical proteins
Short isoforms | Spliced mRNAs | 79 | High | Kingdom | Canonical AA; protein domain loss | Coding; small interfering peptides
Canonical ORFs | mRNAs | 491 | High | Kingdom | Canonical AA; multiple protein domains | Coding⁴²; structural, enzymatic, regulatory

(In the figure, uORFs, lncORFs, short CDSs and short isoforms are grouped as ‘Transcribed smORFs’. Schematic key: untranslated region; DNA; RNA splicing; ORFs; other coding sequences; ribosome profiling signal.)

Figure 1 | Properties of smORFs in fruitflies and mammals. Most open reading frames (ORFs) in animal genomes are small ORFs (smORFs) in untranscribed regions (intergenic ORFs; light blue) and are considered non-functional. ORFs of 101 codons or more (canonical ORFs; purple) are generally translated and produce annotated proteins with predictable functions. In between these two extremes, our genome also encodes transcribed and putatively functional smORFs of 10–100 codons, which can be divided into different classes according to the type of their encoding transcript: upstream ORFs (uORFs; teal) in the 5′ untranslated regions (5′ UTRs) of canonical mRNAs; long non-coding ORFs (lncORFs), which are present in long non-coding RNAs (green); short coding sequences (short CDSs), which are annotated ORFs of 100 codons or fewer that are present in short mRNAs (yellow); and short isoform ORFs of 100 codons or fewer, which are generated by alternative splicing of canonical mRNAs (pink). We extracted all ORFs from both the annotated transcriptomes and the non-transcribed regions of Drosophila melanogaster, Mus musculus and Homo sapiens, and divided the ORFs into these classes, which differ in several characteristics: size (indicated as the median number of codons per ORF), average rate of translation, average taxonomic level of ORF conservation, and features of the encoded amino acids, such as the prevalence of transmembrane α-helices. We propose that these characteristics correlate with a favoured function for each smORF class and that they are conserved in flies, mice and humans. The data presented are from this work unless indicated. The translation and conservation characteristics of the short isoforms are extrapolated from their long isoforms.

Ribosome profiling
A technique that globally probes RNA molecules that are being actively translated by ribosomes by analysing ribosome-protected RNA fragments (ribosomal footprints).

Translation efficiency
A measure of the rate of translation for a given mRNA feature, obtained in ribosome profiling experiments. It usually consists of the ratio between ribosomal footprints and RNA sequencing reads generated by the mRNA region.

First, intergenic ORFs are the most numerous (96% of smORFs (FIG. 2a)). These intergenic ORFs are stretches of DNA between an ATG and a stop codon, and have a median size of 22 codons (FIG. 2b). Judging from high-throughput data, they do not seem to undergo transcription or translation, and thus it is likely that most are non-functional, and are instead simple, random consequences of nucleotide permutations in regulatory or ‘junk’ DNA. We suggest the name intergenic ORFs in order to distinguish them from the transcribed, putatively functional smORFs that constitute the other classes.

Second, upstream ORFs (uORFs) constitute the second most abundant class, potentially doubling the number of currently annotated coding sequences (FIG. 2a). uORFs are smORFs found in the 5′ untranslated regions (UTRs) of mRNAs that encode canonical proteins, and they also have a median length of 22 codons (FIG. 2b). Almost 50% of annotated animal mRNAs contain uORFs (FIG. 3a; see also REFS 14,30), and the translation of a proportion of these uORFs has been reported in all organisms in which it has been examined, including yeast, flies, zebrafish and mice³¹–³⁴, albeit with low translation efficiency⁸,²⁸,³⁰,³⁵. uORFs are thought to regulate the translation of the downstream, canonical ORFs in their transcripts. uORFs have low average conservation⁸,³⁰ and their amino acid usage is different from random values, but is subtly different from canonical proteins.

Third, long non-coding ORFs (lncORFs) are smORFs that are found in putative lncRNAs and are the third most abundant class (FIG. 2a). They have a median size of 24 codons (FIG. 2b) and their translation characteristics are similar to those of uORFs: low efficiency and only detected in one-third of the lncORFs assessed. A typical lncRNA of approximately 3 Kb could theoretically contain 100 lncORFs of some 100 codons (30 per frame); in fact, 98% of annotated lncRNAs contain at least one smORF in the metazoan species that we have analysed, with a median of six smORFs per lncRNA (FIG. 3b). Even the 40 lncRNAs with a non-coding function that have been characterized in humans³⁶ contain between 1 and 15 lncORFs. Thus, lncORFs are typically found in polycistronic, sometimes overlapping, arrangements, which hinders their experimental characterization. Their amino acid usage is nonrandom, but different from canonical proteins. Their function is currently generally unknown, and there is considerable debate about whether lncORFs are translated, and whether such translation is productive²⁷,³⁴,³⁷–³⁹. lncRNAs do not seem to be conserved³,⁸–¹⁰, but several RNAs that were initially classified as lncRNAs were later shown to encode and translate peptides with biomedically important functions in development and physiology, and to be highly conserved in evolution¹²,¹³,⁴⁰.


Fourth, short coding sequences (short CDSs; previously known as longer smORFs⁸,¹⁵), have a median size of 79 codons (FIG. 2b) and are preferentially found in functionally monocistronic transcripts that have mRNA characteristics, albeit they are shorter and simpler in structure than canonical mRNAs. They seem to be translated as efficiently, and are detected as often, as canonical ORFs⁸. There are hundreds of short CDSs in flies and humans (FIG. 2a), but only a small proportion has been functionally characterized. The characterized examples and the average amino acid features of the class suggest that they have membrane-related functions as regulators of canonical proteins.

Fifth, short isoforms, which are generated as alternative transcripts or as splice forms of longer, canonical protein-coding genes, are the least abundant class of smORFs, according to annotated data (FIG. 2a). Annotated short isoforms are translated and have a median size of 79 codons (FIG. 2b), thereby resembling short CDSs in size and in transcript structure, although their amino acid sequences are closer to those of canonical proteins, as expected. Short isoforms merit separate classification and study on two bases: the certainty of their genetic origins, and their potential for functions that are directly related to their canonical protein isoforms. The number of short isoforms may be higher than currently appreciated because detecting them depends on experimental data, and the unambiguous assignment of proteomic and ribosomal profiling sequences to specific mRNA isoforms remains difficult.

Protein isoforms
Variants of a given protein generated by the translation of alternative mRNA sequences, in distinct mRNAs produced by the same gene.

[Figure 2a: bubble chart of ORF numbers per class (uORFs, lncORFs, short CDSs, short isoforms and canonical ORFs; predicted versus annotated) for Drosophila melanogaster, Mus musculus and Homo sapiens. Intergenic ORFs: 635,402 (D. melanogaster); 24,119,916 (M. musculus); 21,369,201 (H. sapiens). Other bubble values, in extraction order and without a recoverable assignment to class or species: 855; 18,328; 27,988; 50,270; 828; 1,207; 16,904; 20,430; 129; 128,307; 31,954; 43,206; 172,115; 520; 1,651.]

Figure 2b: Median length (codons)
Species | Intergenic ORFs | uORFs | lncORFs | Short CDSs | Short isoforms | Canonical ORFs
D. melanogaster | 23 | 20 | 25 | 79 | 82 | 490
M. musculus | 22 | 22 | 23 | 81 | 79 | 424
H. sapiens | 22 | 23 | 24 | 78 | 75 | 421

Figure 2 | Conservation of smORF numbers and lengths in animals. a | Small open reading frame (smORF) classes are colour-coded as in FIG. 1. Circle sizes are proportional to the number of smORFs. The number of annotated, short coding sequences (short CDSs) is similar in all the animals analysed, and low when compared with both annotated canonical ORFs and non-annotated transcribed smORFs (upstream ORFs (uORFs) and long non-coding ORFs (lncORFs)). The higher number of uORFs and lncORFs in mammals is related to their higher number of mRNAs and long non-coding RNAs (lncRNAs), whereas the number of intergenic, non-transcribed smORFs is related to the genome size in each species. b | The lengths of different smORF classes are conserved across metazoans. Intergenic ORFs, uORFs and lncORFs have a median length of 20–25 codons; short CDSs and short isoforms have a median length of 75–82 codons; and canonical ORFs have a median length of 421–490 codons.

Coding versus non-coding functions

Evidence of transcription and/or translation are two objective criteria for assuming that smORFs are functional, whether coding or non-coding. We analysed the genomes and transcriptomes of flies, mice and humans²⁶,⁴¹ and compared the characteristics of different RNAs that contain smORFs, and the characteristics of the smORFs themselves.

Transcription of smORFs. Extensive RNA-seq studies in metazoans have been carried out, especially in model organisms such as D. melanogaster and Mus musculus. The repertoire of transcribed sequences that these studies provide may not be complete, as not all organs and cell types have been sampled, but in general most transcription should have been detected. The small populations of short CDSs of approximately 79 codons (FIG. 2a) in these data sets are found in polyadenylated monocistronic transcripts that are often annotated as putatively coding²⁶,⁴¹, even though direct experimental corroboration of their translation is lacking in most cases⁸. Short CDS transcripts are shorter and have fewer exons than canonical mRNAs (FIG. 3c,d). This could follow a trend that was observed in eukaryotes: fewer exons are found in shorter coding transcripts⁴². More surprisingly, a large proportion of the transcripts that were detected by RNA-seq lacked a canonical ‘long’ ORF, and thus have been considered lncRNAs²,³, even though they contain lncORFs and display coding-like mRNA features, such as similar length and structure to short CDSs (FIG. 3c,d), transcription by RNA polymerase II, capping, polyadenylation and accumulation in the cytoplasm⁴³–⁴⁵.

As large numbers of intergenic ORFs are found in non-transcribed genomic regions (FIG. 2a), do they also represent smORFs in uncharacterized transcripts? What is their origin and function? Their median size of 23 codons across species is expected by chance: among 64 possible codons, 3 are stop codons, which is a ratio of approximately 0.05. Counting from an ATG start codon, the length of the resulting ORF depends on the probability of encountering a stop codon. This probability is independent at each codon, but the accumulated probability of encountering any stop codon obviously increases with length. Thus, the following exponential decay function

$$f(x) = \lambda e^{-\lambda x}$$

where λ (the decay rate parameter) is 0.05 and x is the length of the ORF in codons, indicates the frequency at which ORFs of each size are expected to occur at random, and generates a size distribution that closely fits that observed for intergenic ORFs (FIG. 3e). Thus, intergenic ORFs, unlike short CDSs (FIG. 3f), seem to be randomly generated by our genomes, so it would be expected that most are not functional (just as most mutations are not advantageous).
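
The random-expectation argument can be checked numerically. The following Python sketch evaluates the exponential decay function with λ = 0.05 and computes the median length expected for ORFs restricted to the smORF range of 10–100 codons. It is a sketch under stated assumptions: it treats ORF length as continuous (the underlying process, one independent stop-codon trial per codon, is discrete), and the function names are chosen for this example only.

```python
import math

LAMBDA = 0.05  # decay rate per codon, as given in the text

# 3 stop codons among the 64 codons of the standard genetic code
STOP_FRACTION = 3 / 64  # 0.046875, i.e. roughly 0.05


def expected_frequency(x, lam=LAMBDA):
    """Exponential decay density f(x) = lam * exp(-lam * x)."""
    return lam * math.exp(-lam * x)


def truncated_median(lo, hi, lam=LAMBDA):
    """Median of the exponential distribution restricted to lo <= x <= hi.

    Within the window, the survival function exp(-lam * x) falls from
    exp(-lam * lo) to exp(-lam * hi); the median sits at their midpoint.
    """
    upper = math.exp(-lam * lo)
    lower = math.exp(-lam * hi)
    return -math.log((upper + lower) / 2) / lam


median_smorf = truncated_median(10, 100)
print(f"{median_smorf:.1f}")  # about 23.6 codons
```

Under these assumptions the expected median for randomly generated ORFs of 10–100 codons is about 23.6 codons, close to the median of 22–23 codons reported for intergenic ORFs.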


Other data indicate that we do not have millions of genes in our genomes. Although computational gene-annotation protocols are biased against smORFs⁶,⁴⁶, classical estimates of gene numbers that were obtained from biochemical and genetic studies⁴⁷ are compatible with the annotated numbers of genes that are obtained with computational methods, but differ by several orders of magnitude from the high numbers of intergenic ORFs. Therefore, it is likely that most of the intergenic ORF sequences are not active genes and thus, in order to avoid inflating the estimates of functional smORFs, they should not be considered functional. A fundamental filter for considering a smORF as ‘genic’, or as putatively functional, must be the existence of solid transcriptional data. For example, computational and RNA-seq evidence that was used to re-examine previous computational estimates of putatively functional non-annotated smORFs in various species could only corroborate a small percentage of these smORFs²¹. This transcriptional filter does not contradict the possibility that some intergenic ORFs could be transcribed and even translated in tissues that have not yet been subjected to RNA-seq, but focuses experimental and computational efforts towards more likely functional targets.
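
The transcriptional filter, combined with the ORF classes proposed in this article, can be sketched as a simple decision rule. The Python snippet below is an illustrative assumption rather than the procedure used for the analysis: the context labels, the function names and the choice to test transcript evidence before ORF length are all hypothetical.

```python
from typing import Optional

# ORF classes as named in the article; the context strings used as keys are
# hypothetical identifiers chosen for this sketch.
CLASS_BY_CONTEXT = {
    "5'UTR": "uORF",
    "lncRNA": "lncORF",
    "short mRNA": "short CDS",
    "spliced mRNA isoform": "short isoform",
}


def classify_orf(n_codons: int, context: Optional[str]) -> str:
    """Assign an ORF to a class from its length and transcript context.

    context is None when no transcript evidence covers the ORF.
    """
    if context is None:
        return "intergenic ORF"   # fails the transcriptional filter
    if n_codons > 100:
        return "canonical ORF"    # ORFs of 101 codons or more
    if context not in CLASS_BY_CONTEXT:
        raise ValueError(f"unknown transcript context: {context!r}")
    return CLASS_BY_CONTEXT[context]


def putatively_functional(n_codons: int, context: Optional[str]) -> bool:
    """True when the ORF passes the transcriptional filter."""
    return classify_orf(n_codons, context) != "intergenic ORF"


print(classify_orf(22, None))             # intergenic ORF
print(classify_orf(24, "lncRNA"))         # lncORF
print(putatively_functional(22, "5'UTR")) # True
```

Testing for transcript evidence first mirrors the argument above: an untranscribed ORF is set aside regardless of its other properties, and only transcribed ORFs are sorted into the putatively functional classes.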

[Figure 3: six panels. a: relative frequency (%) against uORFs per mRNA (0–20). b: relative frequency (%) against lncORFs per lncRNA (0 to ≥21), median marked. c: relative frequency (%) against transcript length (0–3,000 bp) for canonical, short isoform, short CDS and lncRNA transcripts. d: relative frequency (%) against exons (absolute number per mRNA, 1–15) for the same four transcript classes. e: relative frequency (%) against length (20–100 codons) for intergenic ORFs, uORFs and lncORFs, with the exponential distribution λe^(−λx) (λ = 0.05) and the median marked. f: relative frequency (%) against length (20–100 codons) for short isoforms and short CDSs, with the same exponential distribution and the median marked.]

Figure 3 | RNA characteristics of transcribed smORFs. The mean and standard error of the mean (SEM) of fruit fly, mouse and human values are plotted in dark and light shades, respectively. a | Number of predicted upstream ORFs (uORFs) per annotated mRNA. Nearly 50% of the animal mRNAs analysed contain one uORF or more. b | Most annotated long non-coding RNAs (lncRNAs) contain one or more small open reading frames (smORFs) (median = 6). c | Metazoan RNA transcript length according to ORF classes. mRNAs containing short coding sequences (short CDSs) and lncRNAs are similar in size (on average, 400 bp in length). Short isoform transcripts are, on average, 600 bp in length, whereas canonical ORF-containing mRNAs are larger on average (3.1 kb). d | Short CDS-containing mRNAs and lncRNAs have lower transcript complexity, as measured by the number of different exons in annotated transcripts of each class, than mRNAs that contain shorter isoforms; canonical mRNAs are, on average, more complex than all of the other classes. e | The size distributions of uORFs and long non-coding ORFs (lncORFs) in animal genomes are similar to that of intergenic ORFs. Interestingly, their distributions fit an exponential decay distribution, suggesting that intergenic ORFs, uORFs and lncORFs are randomly generated in animal genomes. The median (23 codons) of the exponential decay distribution for smORFs of 10–100 codons is indicated (compare with FIG. 2b). f | The size distributions of short CDSs and short isoforms. The median length (79 codons) of short CDSs is indicated. Short CDS and short isoform smORFs have similar size distributions, which are different from the exponential distribution of other smORFs (dotted line, compare with part e).
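
Translation efficiency, described in this article as the ratio between ribosomal footprints and RNA sequencing reads generated by an mRNA region, can be illustrated with a short calculation. The Python sketch below is a minimal example under stated assumptions: both read counts are normalized to reads per kilobase per million mapped reads before the ratio is taken (a common convention, not one specified in the text), and the function names, read counts and library sizes are hypothetical.

```python
def rpkm(read_count, feature_length_nt, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    if feature_length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    per_kilobase = read_count / (feature_length_nt / 1_000)
    return per_kilobase / (total_mapped_reads / 1_000_000)


def translation_efficiency(footprint_reads, rna_reads, length_nt,
                           footprint_library, rna_library):
    """Ratio of normalized ribosome footprints to normalized RNA-seq reads."""
    rna_density = rpkm(rna_reads, length_nt, rna_library)
    if rna_density == 0:
        return float("nan")  # no RNA-seq support: the ratio is undefined
    footprint_density = rpkm(footprint_reads, length_nt, footprint_library)
    return footprint_density / rna_density


# Hypothetical counts for one 240-nt ORF region (illustrative values only)
te = translation_efficiency(footprint_reads=30, rna_reads=60, length_nt=240,
                            footprint_library=10_000_000,
                            rna_library=20_000_000)
print(te)
```

In this toy case the footprint and RNA-seq densities are equal, so the ratio is 1. An ORF producing half as many normalized footprints for the same RNA abundance would score 0.5, the kind of difference reported between smORF classes and canonical ORFs.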


Two modes of smORF translation: lncRNAs and short CDS mRNAs. Ribosome profiling provides quantitative and qualitative measures of translation, both at the single gene scale and at the genomic scale²⁷,²⁸. It offers a direct readout of ribosome occupancy on mRNAs at the single nucleotide level, and can be compared with RNA-seq data of the same biological sample to provide quantitative metrics, such as translation efficiency, which are directly related to the rate of translation at each ORF (see REF. 48 for a review). Results obtained from a wide variety of species show that translation occurs in a more pervasive manner than expected, with numerous ribosome footprints detected in lncRNAs and in the UTRs of annotated transcripts; many of these newly identified translated regions coincide with smORFs⁸,⁹,²⁷,⁴⁹. Although non-coding functions of several lncRNAs are well established³,⁵⁰, the non-coding functions of the vast majority are currently unknown, and it is plausible that some of these lncRNAs actually encode translated smORFs.

Using a variation of ribosomal profiling, smORF translation in D. melanogaster was shown to occur in two different modes, which correlated with smORF class⁸. There were 220 annotated short CDSs transcribed in a fly cell line, which were translated at a similar efficiency, and were detected as often (about 80% of transcripts were detected as translated in both cases), as canonical ORFs. Short CDS translation correlated with mRNA abundance and followed canonical models, with multiple ribosomes covering the ORF with regular spacing. Unlike short CDSs, approximately 2,000 uORFs and lncORFs did not follow this canonical mode of translation, but were detected as translated in one-third of the cases and showed about half the translation efficiency of canonical ORFs⁸. These differences have also been observed in vertebrates (reviewed in REF. 15). Ribosomal profiling of zebrafish embryos validated the translation of 302 previously annotated smORFs, and identified 190 novel smORFs in previously uncharacterized transcripts and putative lncRNAs, as well as 311 uORFs and 93 ORFs in 3′ UTRs⁹. These smORFs tended to be more than 50 codons in length and showed conservation across vertebrates, which classifies them as short CDSs. However, the authors noted the existence of a class of ORFs of fewer than 20 codons with a ribosomal profiling signal in lncRNAs, which would belong to the lncORF class. Similarly, up to 50% of lncRNAs in mouse embryonic stem cells had a ribosome profiling signal²⁸, which could potentially give rise to thousands of peptides⁴⁹.

There has been a rather technical debate on whether

there is a linear correlation between the levels of canonical mRNAs in polysomes and their ribosome profiling signal⁸,¹¹,⁵¹, such a positive correlation could not be observed with lncRNAs. Many lncRNAs were present in high quantities in polysomes, but only some lncRNAs produced a ribosomal profiling signal⁸,¹¹, suggesting that the ribosome profiling signal of lncRNAs is not produced by generic background noise, but by the specific translation of a subset of lncRNAs, even if this occurs at a modest efficiency. Furthermore, lncORF translation has been corroborated by ORF tagging and proteomics⁸,⁹,¹¹,⁴⁶ (see below). Finally, smORFs in transcripts that were initially annotated as lncRNAs can produce biologically active peptides with important functions¹²–¹⁴,²⁰,²⁵,⁴⁰,⁵².

An alternative high-throughput method to detect translation is proteomics, which matches mass spectrometry signatures of digested peptides to expected protein sequences⁵³. Improvements in proteomics methods have allowed the detection of SEPs, but, in general, proteomics has lagged behind ribosome profiling in smORF detection, and the detection of lncORF peptides has been lacking. Even with the use of specific and new size fractionation methods and custom libraries that include non-annotated smORFs, ‘peptidomic’ studies have only corroborated 8 new SEPs in D. melanogaster brains⁵⁴ and 23 SEPs in human cell lines⁴⁶. Peptides smaller than 80 amino acids are preferentially degraded by endogenous proteases⁵⁵, which hampers their detection, and they also have fewer amino acids from which to generate two non-overlapping peptides after in vitro trypsinization, which is required by standard proteomics validation. Two ribosomal profiling studies failed to obtain parallel mass-spectrometry evidence for any peptide smaller than 50 amino acids⁸,⁹, which is a common limitation of proteomic studies of smORF translation. These factors could explain the low SEP detection, but it could also be possible that lncORFs only produce unstable peptides in small quantities, which would be in agreement with their low ribosome profiling metrics.

Could lncORF-encoded peptides in general have a biological role or, in other words, are lncORFs functional and do they have a function that is conveyed by the peptides that they encode? Or are the peptides irrelevant to lncORF function? Long ‘non-coding’ RNAs that actually produce biologically active peptides could simply be misclassified and the annotated lncRNAs could actually comprise two subpopulations: true non-coding lncRNAs and protein-coding short CDS mRNAs. Alternatively, some, many or all lncRNAs could have dual coding and non-coding functions, as in the case of the plant
a low level of ribosome profiling signal represents pri‑MIR171b (which produces both a microRNA
prod­uctive translation27,34,37–39. It has been suggested and peptides 56), or the mammalian humanin and
that some lncRNAs could be associated with the trans­ MOTS‑c peptides that are produced by mito­chondrial
lation machinery to regulate the translation of canon- ribosomal RNA (rRNA) 57,58; or they could simply
ORF tagging ical mRNAs, without actually being translated; or that produce inactive peptides that are quickly degraded.
A technique to probe the the detected ribosome binding is incidental and non-­ lncORF translation may be the functionally relevant pro-
translation of a specific open productive; or that the footprints detected are not cess, whereas the peptides themselves may convey little
reading frame (ORF), whereby
a reporter sequence without a
gener­ated by ribosomes. In summary, it has been sug- or no function; that is, lncORFs could have a non-coding
start codon is cloned in‑frame gested that the ribosome profiling signals in ­lncRNAs function that involves ribosomes. A precedent for such a
with the assessed ORF. are noise, which yields false positives. However, whereas scenario is presented by uORFs (see below).
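The proteomics constraint described above, that a small peptide must yield two non-overlapping tryptic fragments to be validated, can be illustrated with a small in silico digest. This is a minimal sketch, not the pipeline of any cited study: it assumes the common simplified trypsin rule (cleavage after K or R unless the next residue is P), and the 7–30 residue detectability window, the function names and the example sequence are all hypothetical.

```python
def tryptic_peptides(protein):
    """Split after every K or R that is not followed by P (simplified trypsin rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):  # trailing fragment that does not end in K or R
        peptides.append(protein[start:])
    return peptides


def detectable_peptides(protein, min_len=7, max_len=30):
    """Keep fragments inside an assumed (hypothetical) mass-spectrometry length window."""
    return [p for p in tryptic_peptides(protein) if min_len <= len(p) <= max_len]


# Hypothetical 28-residue peptide, for illustration only.
example = "MSGLLTAVIVLLAGAKWFLSRPETVDEK"
fragments = detectable_peptides(example)
print(fragments, "meets the two-peptide requirement:", len(fragments) >= 2)
```

Under these assumptions, the shorter a peptide is, the fewer K or R sites it is likely to contain, so fewer fragments fall inside the window; this is one way to see why very short SEPs rarely meet the two-peptide requirement.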

580 | SEPTEMBER 2017 | VOLUME 18 www.nature.com/nrm


© 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

uORFs are cis-regulators of translation in canonical genes. The function of uORFs is connected to two fail-safe features of eukaryotic translation: re-initiation and leaky scanning31,59. In eukaryotic translation, the small ribosomal subunit (40S) binds to the mRNA at the 5ʹ cap complex and scans the transcript until it encounters an AUG codon that is preceded by a 4 nt CA-rich Kozak sequence. The 60S ribosomal subunit then joins to form a complete ribosome (80S) and a tRNAMet complex joins to initiate translation. Once translation is terminated at the stop codon, the ribosome dissociates from the mRNA, but the 40S subunit can re-initiate scanning for downstream ORFs to translate. Such re-initiation of translation can occur if the initially translated ORF is no longer than 30 amino acids, and if an additional ORF is found approximately 100–200 bp downstream of the stop codon of the initially translated ORF. In addition, weak Kozak sequences are occasionally not recognized, which leads to leaky scanning in which ORFs may be scanned but not translated, thereby allowing continued scanning and translation of downstream ORFs.

Both processes are stochastic, but they can facilitate the translation of polycistronic eukaryotic genes, and can act as a canonical translation regulatory mechanism31,32. The classical model of uORF function is the yeast gene GCN4, in which four uORFs repress the translation of the Gcn4 protein59,60. uORF translation precludes the translation of the downstream Gcn4-encoding ORF, but in conditions of starvation, the uORFs are scanned and not translated, allowing the translation of the Gcn4 protein. In this case, the sequences of the peptides that are produced by the Gcn4 uORFs are irrelevant, and the physical presence of the uORFs themselves acts as the regulatory mechanism; in other genes, however, the uORF-encoded peptides can stall the ribosome upstream of the main ORFs in a sequence-dependent manner61. This inhibitory cis-regulatory uORF function does not necessarily preclude that some uORF peptides could have functions in trans that are independent of their locus62. A pure cis-regulatory role for uORFs fits with their low translation levels8,30,39, their low sequence conservation8,30 and their lack of propensity to form known protein domains (see below). A prediction of having a repressive cis-regulatory function is that there should be a negative correlation between the translation of the uORFs and that of the main ORFs. Such a genome-wide negative correlation has been found in mammals32 and zebrafish30, but not in D. melanogaster8 or yeast35,63, perhaps hinting at undiscovered uORF functions.

In summary, there is evidence for the translation of all types of transcribed smORFs, albeit with different translation efficiencies and chance of detection, which correlate with their proposed class; in other words, their size and type of transcript-of-origin. There are also well-established smORF functions, which can be separated into ‘coding’ functions (the production of a biologically active peptide) and ‘non-coding’ regulatory effects of engaging the translation machinery. We explore the functions of SEPs below.

SEPs regulate canonical proteins

The small size of SEPs does not support the typical, multidomain structure of canonical proteins, but instead accommodates only one or, at most, two simple domains (considering the simplest domain is a transmembrane α-helix (TMH) that is 30 amino acids in length, and the need for an unstructured spacer region between domains)64. Interestingly, protein domains in isolation or incomplete protein domains can have functions that are unrelated to those observed in their native configuration, as part of large multidomain proteins65. For example, artificially expressed peptides that contain the Antennapedia homeodomain or the HIV TAT domain are cell-penetrating peptides, a function that is unrelated to that of their endogenous counterparts66. It follows that even the function of smORFs that contain known protein domains cannot be easily predicted; in fact, in most cases, it is still unknown. A bioinformatics examination of smORF peptide sequences, informed by examples for which the functions have been experimentally characterized, might clarify their roles.

Characterized SEPs. The molecular and organismal functions of several SEPs have been characterized in metazoans and in unicellular organisms. We have identified a group of approximately 80 short CDS-encoded peptides, which are conserved from flies to humans and, in some cases, even in yeasts and plants41. The most common function of these ancient SEPs is as positive regulators of cytoplasmic processes (FIG. 4). These processes include ubiquitin-like post-translational modification67, cytoskeleton dynamics24, translation68 and cyclin function in mitosis69 (FIG. 4a). The second most common function is related to mitochondria, such as apoptosis-related functions70 and mitochondrial respiration16,71 (FIG. 4b). These peptides offer concrete examples of crucial functions that are conserved for hundreds of millions of years, but they constitute a minority (8–9%) of short CDSs in each species. As most smORFs are not as widely conserved, it is unclear whether these ancient smORFs offer a functional blueprint for all short CDSs and, even less so, for lncORFs and uORFs.

Other functionally and molecularly well-characterized smORFs are perhaps not as ancient but still show conservation comparable to canonical proteins, and have a wider range of functions, as negative regulators. Plants have short CDSs and short isoforms that encode small interfering peptides (also known as microproteins72) that function predominantly as transcription repressors73,74. These small interfering peptides are short, dominant-negative isoforms or duplications of canonical transcription factors, and they usually contain one known protein domain. They interfere with the function of canonical transcription factors, either by sequestering them into unproductive dimers (when the small interfering peptide contains a dimerization domain but not a DNA-binding domain) or by competing with them for binding to DNA (when the small interfering peptide contains a DNA-binding domain but lacks other domains that are required for its activity)72–74. No small interfering peptides have been characterized in animals,

NATURE REVIEWS | MOLECULAR CELL BIOLOGY VOLUME 18 | SEPTEMBER 2017 | 581



but there is no reason why they should not exist. There are examples in humans of dominant-negative peptides that are close to 100 amino acids in length. The inhibitor of DNA-binding (Id) family of helix–loop–helix (HLH)-like peptides, which are 119–168 amino acids in length, sequesters basic HLH proteins into inactive complexes (FIG. 4c). They thereby regulate development, the cell cycle and the circadian rhythms in organisms from flies to humans, and these peptides have also been implicated in stem cell renewal and cancer75. Small interfering peptides could be as prevalent in animals as they are in plants, but these peptides and the short RNAs that encode them may have been discarded as artefacts or as non-functional in studies, even if they are detected experimentally.

Helix–loop–helix (HLH)
A DNA-binding domain that characterizes members of a transcription factor family. It is composed of two α-helices connected by a short loop.

[Figure 4, panels a–f: image not reproduced. Panel topics: a, eukaryotic translation and cell cycle progression; b, oxidative phosphorylation and apoptosis; c, Id and HLH dimers at promoters; d, PLN or SclA and SERCA in muscle contraction; e, phagocytosis and endosome maturation; f, defensin and drosocin acting on bacterial cells.]

Figure 4 | Functions of smORF-encoded peptides. Ancient small open reading frame (smORF)-encoded peptides (SEPs; blue ellipses) are conserved across eukaryotes (yeast, insects and vertebrates) and act as positive regulators of
canonical proteins. Evolutionarily more-recent SEPs (pink ellipses) act as negative regulators. a | Ancient cytoplasmic SEPs
promote the activity of canonical protein complexes. 40S ribosomal protein S21 (RPS21; 83 amino acids in length) is a
component of the small ribosomal subunit (40S) and aids in mRNA translation; cyclin-dependent kinase subunit 85A
(CKS85A; 96 amino acids in length) interacts with cyclin-dependent kinase 1 (CDK1) and CDK2 and promotes cell cycle
progression. Its activity is opposed by the negative regulator Z600, which is an SEP that represses CDK1. b | Ancient SEPs
that are produced in the cytoplasm can promote mitochondrial processes. BCL-2 peptides are localized to the
mitochondrial outer membrane to promote mitochondrial outer membrane permeabilization (MOMP) and the subsequent
release of apoptotic signals through caspase activation. Other SEPs, such as ubiquinol-cytochrome c reductase complex III
subunit 10 (UQCR10), are involved in electron transport at the inner mitochondrial membrane. The evolutionarily recent
humanin peptide is produced by a mitochondrial ribosomal RNA (rRNA) and represses BCL-2 function. c | Small interfering
SEPs act as dominant-negative repressors of transcription factors. The Id (inhibitor of DNA-binding) peptides contain a
helix–loop–helix (HLH) domain, but lack a basic DNA-binding domain. Id peptides bind to HLH transcription factors
and sequester them in inactive complexes that cannot bind to DNA. d | Sarcolamban A (SclA; 28 amino acids in length)
is a Drosophila melanogaster SEP. Similarly to its human orthologue phospholamban (PLN), SclA regulates calcium uptake
in the endoplasmic reticulum (ER) of muscle cells through repression of SERCA, which is the Ca2+ pump that transfers Ca2+
from the cytosol to the ER to terminate muscle contraction. e | SEPs can repress the activity of canonical proteins in
phagocytic membranes and organelles. Hemotin29 represses the phosphoinositide 3‑kinase (PI3K) activator 14‑3‑3ζ, thereby
slowing endosomal maturation to allow phagocytic digestion, whereas small effector of CDC42 protein 2 (SPEC2) represses
the small GTPase CDC42 to regulate the formation of phagocytic pockets. f | Antimicrobial peptides penetrate the cell
membranes of microorganisms. Defensin creates membrane pores that lead to the leakage of cytoplasmic content;
drosocin binds to and interferes with cytoplasmic proteins, such as the chaperone DnaK, to slow bacterial metabolism.


Small interfering and regulatory peptides could be a general feature of SEP function, as dominant-negative interference does not need to be limited to transcription factors. This fits well with the small size of SEPs, which cannot form the large globular proteins with buried active sites that are characteristic of enzymes, or the large multidomain structural proteins64,76. However, SEPs could be perfect for interfering with larger proteins. Indeed, some short CDS-encoded peptides have a demonstrated negative regulatory role in mitosis77 (FIG. 4a), apoptosis78 (FIG. 4b), ubiquitylation52,79, muscle contraction13,20 (FIG. 4d), phagocytosis29,80 (FIG. 4e) and as antimicrobial peptides22,81,82 (FIG. 4f). Interestingly, most of these functions involve cell membranes.

Protein structure and amino acid usage of smORFs without annotated functions. In baker’s yeast (Saccharomyces cerevisiae), 299 non-annotated smORFs with evidence of transcription, translation or sequence conservation were tested in genetic experiments5. Of these, 247 were required for cell growth during starvation and stress conditions. In Escherichia coli, 217 smORFs were identified by bioinformatics criteria, of which 24 were tested and 18 were found to be required for bacterial growth17. Of these, 10 SEPs were observed at membranes and were predicted to encode TMHs, similarly to 65% of annotated bacterial proteins of fewer than 50 amino acids17. These genetic studies indicate that many uncharacterized SEPs may not be absolutely required for survival, but may improve overall fitness, similarly to the genetic requirement for the negative regulatory SEPs (discussed above)14,22. These studies also suggest that many uncharacterized SEPs may be located at cellular membranes (thereby complicating their biochemical detection).

In D. melanogaster, a bioinformatics analysis revealed more predicted TMHs in short CDSs than in canonical proteins, lncORFs and uORFs8. This trend is shared by vertebrate smORFs (FIG. 5a). TMHs could easily be related to putative functions in cell membranes and organelles13,17,20,29. Indeed, the limited gene ontology data available in flies displays an enrichment for membrane-related terms8, and tagging a sample of short CDS peptides revealed a tendency for these peptides to localize to cell membranes, including mitochondrial membranes8. Finally, the characterization of hemotin29 (FIG. 4e), as well as that of other short CDSs (E. Magny and J. I. Pueyo, personal communication) corroborates the prediction of membrane- and organelle-related functions for TMH-containing short CDS-encoded peptides.

The amino acid composition of smORFs could also be a source of information pertaining to their potential functionality. In flies, we observed putatively different amino acid usage among canonical proteins, short CDSs, lncORFs, uORFs and random RNA sequences8. These differences could underlie different molecular functions and thus could be an indicator of coding potential83. We analysed the amino acid usage of mouse and human smORFs and randomized RNA sequences in comparison to flies. We found a remarkable degree of similarity between the fly and the vertebrate amino acid frequencies in each smORF class (FIG. 5b). Pooling the data from flies, mice and humans reveals significant correlations and differences in the amino acid usage of different smORF classes, and when compared with canonical proteins and randomized RNA. As observed in flies, the amino acid compositions of metazoan canonical proteins, short isoforms and short CDSs resemble each other and differ significantly from those of randomized RNA sequences and intergenic smORFs (FIG. 5c,d).

Despite their overall similarity to canonical proteins, short CDSs have an increased frequency of some positively charged amino acids, and that of some negatively charged amino acids is decreased, thereby producing an overall positive charge bias (FIG. 5d). Artificial cell-penetrating peptides have an overall positive charge and can cross plasma membranes, which is of great interest to the pharmaceutical industry84,85. Given the prevalence of TMHs in short CDS peptides (FIG. 5a) and the membrane localization of short CDS peptides (FIG. 4), in particular to the negatively charged mitochondria8,16, it is tempting to speculate that this charge bias similarly favours their traffic across membranes and organelles. Given that amphipathic cations (molecules that contain both positively charged and hydrophobic regions) can cross the mitochondrial outer membrane86, short CDS peptides may also be able to cross membranes.

The amino acid usage of short CDSs would also fit with a role as antimicrobial peptides. Antimicrobial peptides belong to the humoral (macromolecule) branch of innate immunity, and their production is regulated by a signalling mechanism that is conserved from flies to humans81,87. Antimicrobial peptides tend to be amphipathic, approximately 50–150 amino acids in length and have a propensity to form TMHs88,89. Their amphipathic nature confers solubility and the ability to bind to and integrate into microbial membranes. These characteristics, which are identical to those found for short CDS peptides (FIG. 5a,d), are sufficient for the design of artificial antimicrobial peptides that act even more efficiently than natural peptides88. Antimicrobial peptides can form pores in the membranes of microorganisms, leading to cell leakage and death81, but can also behave as cell-penetrating peptides that, once inside the cells, can interfere with vital cellular processes81,82,89. In this regard, antimicrobial peptides could also be seen as negative regulators89. As the overall molecular characteristics of antimicrobial peptides are identical to those of short CDS peptides, some of the hundreds of short CDS peptides with currently unknown function might act as antimicrobial peptides. Indeed, several well-characterized antimicrobial peptides are encoded by short CDSs22,81,82 (FIG. 4f).

Another possible function of positively charged peptides is to bind to nucleic acids. DNA and RNA are negatively charged, and transcription factors and other DNA-binding proteins, such as histones, bind to DNA and to RNA through positively charged domains. As discussed above, plant small interfering peptides are transcription regulators, and animal SEPs with possible DNA-binding activities have been described, such as the human MRI-2, which is a regulator of the


DNA end-joining protein KU90; the fly SEP Pgc, which epigenetically represses transcription25; and, interestingly, those produced by the dual-function (coding and non-coding) RNA pri-MIR171b56.

The amino acid usage of lncORFs and uORFs is intriguing. Overall uORF amino acid propensities correlate highly with those of canonical ORFs, short isoforms and short CDSs (FIG. 5c), but they also correlate highly with those of lncORFs, and somewhat less with those of randomized RNA (FIG. 5c). uORFs resemble short CDSs in having a slight positive charge bias (FIG. 5d). If uORF-derived peptides have no function, their ‘coding-like’ amino acid usage might be an irrelevant consequence of their location near canonical ORFs. It is possible that uORFs are derived ‘nonsense’ fragments of nearby canonical ORFs. An interesting, alternative possibility is that uORF peptide sequences could reflect a specific but as-yet undiscovered function62.
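The charge bias discussed in this section can be made concrete with a crude residue count. The sketch below is illustrative only and is not the analysis behind FIG. 5d: it assumes that Lys and Arg each contribute +1 and Asp and Glu each contribute -1 at neutral pH, ignores His and the peptide termini, and uses hypothetical function names.

```python
POSITIVE = set("KR")  # Lys, Arg: counted as +1 (assumption; His is ignored)
NEGATIVE = set("DE")  # Asp, Glu: counted as -1 (assumption)


def net_charge(peptide):
    """Approximate net charge as (K + R) minus (D + E)."""
    positives = sum(aa in POSITIVE for aa in peptide)
    negatives = sum(aa in NEGATIVE for aa in peptide)
    return positives - negatives


def charge_per_100(peptides):
    """Net charge per 100 residues, pooled over a set of peptides."""
    total_length = sum(len(p) for p in peptides)
    return 100 * sum(net_charge(p) for p in peptides) / total_length


# Toy input; real comparisons would use the peptide sets of each ORF class.
print(charge_per_100(["KKAA", "DAAA"]))
```

Comparing this pooled value between ORF classes would give a rough view of a positive or negative charge bias, although a pKa-based calculation would be more accurate.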

[Figure 5, panels a–d: image not reproduced. a: TMHs per 100 AAs (axis 0.0–0.5) for each ORF class, annotated P = 0.0065, P = 0.0133 and ns. b: table below. c: pairwise Pearson’s r (scale 0–1) among canonical ORFs, short CDSs, short isoforms, lncORFs, uORFs, intergenic ORFs and random sequences. d: difference from canonical AA frequencies (%), axis –3 to 3, for the amino acids M C K R H E D Q T S N W Y F I V P G A L; significant AAs: short CDSs, M C K E D S; lncORFs, M C E D A; uORFs, M C R E D.]

b
ORF class        Fly–mouse   Fly–human   Mouse–human
uORFs            0.77        0.76        0.99
lncORFs          0.92        0.92        0.98
Short CDSs       0.96        0.96        0.99
Short isoforms   0.94        0.95        0.99
Canonical ORFs   0.97        0.97        0.99

Figure 5 | Coding features of smORFs. a | Number of putative transmembrane α-helix domains (TMHs; predicted by
TMHMM 2.0) encoded per 100 amino acids (AAs) in each open reading frame (ORF) class. Short coding sequences
(short CDSs) are significantly enriched in TMHs compared with canonical ORFs (p values for t‑tests indicated), whereas
upstream ORFs (uORFs) are significantly depleted. Mean and standard error of the mean (SEM) of fly, mouse and human
values are plotted. b | Correlation of amino acid composition across flies, mice and humans by small ORF (smORF) class,
quantified as Pearson’s r values for each pairwise correlation. All correlations are significant (P < 0.05). c | Correlation of
overall amino acid composition among smORF classes using pooled values from flies, mice and humans, quantified as in
part b. Short CDSs and short isoforms most resemble canonical ORFs and differ from random sequences; long non-coding
ORFs (lncORFs) and uORFs display an amino acid composition that is intermediate between canonical ORFs and
intergenic ORFs and random values. d | Amino acid propensities of smORFs, normalized to canonical propensities
(data as in part a). Some short CDS amino acid frequencies differ significantly from canonical ORFs (multiple t‑tests with
Bonferroni corrections, P < 0.05). Some short CDSs encode more sulfur-containing amino acids, as well as being enriched
in positively charged and depleted in negatively charged amino acids. lncORFs and uORFs resemble each other and
display a pattern related to short CDSs, but with more extreme variations. Short isoforms are not plotted, as their only
significant difference from canonical ORFs is Met usage, the higher frequency of which is attributable to their shorter
length. Amino acids with significantly different frequencies (enrichment or depletion) from canonical ORFs are indicated.
ns denotes correlations that are not statistically significant (P > 0.05). +, positively charged amino acids; –, negatively
charged amino acids; H, hydrophobic; P, polar; S, sulfur-containing amino acids.
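The comparison in part b, amino acid composition per ORF class correlated between species, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' analysis: the peptide lists are toy data, the function names are hypothetical, and Pearson's r is computed directly from its textbook definition.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(peptides):
    """Fraction of each of the 20 standard amino acids, pooled over a set of peptides."""
    counts = Counter(aa for p in peptides for aa in p if aa in AMINO_ACIDS)
    total = sum(counts.values())
    return [counts[aa] / total for aa in AMINO_ACIDS]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Toy peptide sets standing in for one ORF class in two species (hypothetical data).
fly_peptides = ["MKLAVLLK", "MSRRLK"]
human_peptides = ["MKLAALLR", "MSKRLK"]
print(pearson_r(aa_frequencies(fly_peptides), aa_frequencies(human_peptides)))
```

Repeating the calculation for each ORF class and each species pair would fill a table with the layout of part b; the significance tests reported in the caption are not reproduced here.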


[Figure 6: diagram not reproduced. It links intergenic ORFs, lncORFs, short CDSs, canonical ORFs, short isoforms, transcribed pseudogenes and pseudogenes through the labelled steps random smORF generation, transcription, translation, ‘elongation’, duplication, truncation, pseudogenization, loss of transcription and random ORF loss. The lower axis runs from coding potential through translation efficiency, coding function and protein structure to canonical protein content.]

Figure 6 | Stepwise evolution of smORFs towards proteins. Stepwise model for small open reading frame (smORF) evolution towards new canonical ORFs. It has been suggested that protein-coding genes emerge from a continuum of
non-functional sequences (‘protogenes’ (REF. 100)). We propose that the various classes of smORFs (FIG. 1) represent
different steps in this process. Intergenic ORFs appear at random in non-transcribed DNA, but during evolution they can
become part of a transcription unit, giving rise to long non-coding ORFs (lncORFs). lncORFs (and upstream ORFs (uORFs))
can also appear randomly in transcribed sequences, and have been shown to be translated in up to a third of the cases and
with a low translation efficiency. The main functional outcome of this low level of translation may not be to produce
biologically active peptides, but to provide the organism with a reservoir of peptides that are translated at a low level,
which can be subjected to natural selection. The translation efficiency of these coding lncORFs could increase, giving rise
to short coding sequences (short CDSs). In turn, short CDSs could grow or integrate into canonical proteins (‘elongation’).
The creation of new canonical ORFs is counterbalanced by their conversion into shorter isoforms, pseudogenes and
perhaps lncRNAs111, and also by the random disappearance of intergenic ORFs through the loss of ATG codons.
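The random appearance of intergenic ORFs can be given a rough quantitative feel with a simple null model. The sketch below is an illustration under explicit assumptions that are not taken from the source: nucleotides are independent and equally frequent, and 3 of the 64 codons of the standard genetic code are stop codons, so each codon ends an ORF with probability 3/64. The function names are hypothetical.

```python
P_STOP = 3 / 64  # TAA, TAG and TGA among 64 equally likely codons (assumption)


def prob_orf_at_least(n_codons, p_stop=P_STOP):
    """Probability that n_codons consecutive random codons contain no stop codon."""
    return (1 - p_stop) ** n_codons


def mean_codons_before_stop(p_stop=P_STOP):
    """Expected number of sense codons before the first stop (geometric distribution)."""
    return (1 - p_stop) / p_stop


print(mean_codons_before_stop())
print(prob_orf_at_least(100))
```

Under this model, ORF lengths decay geometrically, which is the discrete counterpart of an exponential distribution; real genomes deviate from it because base composition is neither uniform nor independent.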

The amino acid usage of lncORFs resembles that of short CDSs and uORFs in its higher proportion of sulfur-containing amino acids (Met and Cys) and lower proportion of the negatively charged residues Asp and Glu (FIG. 5d). However, lncORFs are overall most similar to random amino acid usage (FIG. 5d). It is not clear that these propensities would confer an overall positive or amphipathic nature to lncORF-derived peptides, and thus we cannot speculate on a putative membrane function for lncORFs. Nevertheless, the sarcolamban (Scl) family of SEPs, which function in the endoplasmic reticulum (ER) of muscle cells (FIG. 4d), are encoded by transcripts that were previously annotated as lncRNAs, and are approximately 30 amino acids in length13,20, which is typical of lncORFs. Similarly, the Tal SEPs are 11–32 amino acids in length and influence adjacent cells23,91, implying that they diffuse across membranes. Translated lncORF peptides were observed at mitochondria and other organelles8, and human mitochondrial rRNAs produce SEPs, such as humanin (24 amino acids in length) (FIG. 4b), which is a generic inhibitor of cell death of biomedical importance57, and MOTS-c (16 amino acids in length), which acts outside the mitochondria to regulate insulin activity58, also implying that it crosses membranes. However, the short lncORF peptides might be avidly degraded55, which would hinder a possible cell-penetrating function92. There are too few examples of characterized lncORF-encoded peptides, and more functional studies of lncORFs are needed. Similarly to uORFs, lncORFs might be a ‘mixed bag’, containing coding smORFs (thus actually being short CDSs) and non-coding smORFs, or else they might represent a group of sequences that are poised for coding function but that have not yet done so. We explore this possibility below.

The genomic role of smORFs in the birth of genes

There are indications that smORFs have a general genomic function as a source of new protein-coding genes. Despite the marked differences in conservation, average lengths, amino acid usage, translation efficiency, protein structure and functionality of the different classes of smORFs, there is a degree of overlap among the classes (FIG. 3), which may suggest the existence of an evolutionary continuum among classes (FIG. 6). We examine below


the two possibilities for the generation of smORFs: from existing protein-coding sequences, or de novo formation from previously non-coding sequences.

smORFs can emerge as fragments of longer protein-coding genes through alternative RNA processing, intron retention and premature stop codons. Short protein isoforms can be generated by alternative transcription, splicing and polyadenylation (or a combination of the three) from canonical proteins. It seems that higher eukaryotes have leaky splicing mechanisms93, and this can lead to the production of short isoforms with proper mRNA and translation features. We have seen that short isoforms can produce dominant-negative peptides, which could have deleterious consequences, in a similar manner to amyloid-β peptides of 36–43 amino acids in length, which form plaques in Alzheimer disease94. However, if the deleterious consequences are small, late-onset (past reproductive age), or pleiotropically linked to positive traits95, even such ‘deleterious isoforms’ could be temporarily carried by our genomes. A short isoform could be selected as a negative regulator and could eventually become an independent gene following a retrotransposition or a gene duplication event. After further evolution and divergence, this duplicated isoform could become a pseudogene or, alternatively, a new short CDS (FIG. 6). The similar sizes (FIG. 3f) and amino acid usage (FIG. 5) of short CDSs and short isoforms could indicate that short CDSs have been generated in this manner, especially those with protein domains or with clear homology to canonical proteins. Alternatively, such paralogue short CDSs could also arise directly from canonical ORFs. Splice junction mutations and intron retention can introduce intronic sequences into a protein, but introns contain stop codons either by chance, or by selection, presumably to stop the production of abnormal, long proteins93. Thus, intron translation would lead to the production of truncated proteins with new carboxyl termini, and long 3ʹ UTRs. In this scenario, evolution would be expected to favour

moderating elongation to 25-amino-acid-long steps. Alternatively, amino-terminal elongation (as observed in the PLN peptides of the Scl family (FIG. 7b)) could also occur, although this necessitates a new in-frame ATG in an appropriate Kozak context. Either way, if such elongated peptides preserve their original function, then they could be positively selected owing to the higher stability that results from increased length55. In time, the elongated ends of the peptide would be subjected to selection to improve the peptide function or to add new functions.

The size distribution of lncORFs and uORFs fits closely the exponential random distribution of intergenic ORFs (FIG. 3e), suggesting that they also appear randomly in the genome. lncORFs and uORFs may evolve from intergenic ORFs that become transcribed (ORF first), or appear directly in non-coding RNA sequences (RNA first). The possible de novo generation of proteins from previously non-coding sequences is increasingly debated100,101. Up to 1% of protein-coding genes could be species-specific (without homologues) and therefore of recent origin102–105, but this idea is controversial106 and strongly depends on the computational ability to detect homologues. The mechanism for the emergence and the spread of such de novo genes has not been clarified, although a role for lncORFs has been proposed10,107. Most lncORFs and de novo genes seem devoid of distant homologues, appearing and disappearing in the genome of different species in the same Order10,104,107,108, suggesting both recent origins and dynamic evolutionary behaviour. However, the Tal and Scl families of short CDSs have been conserved over hundreds of millions of years of evolution12,13 (FIG. 7a,b), showing that lncORFs can become short CDSs and can be fixed in the genome, perhaps in correlation with increased coding potential. Further observations are compatible with the evolution of lncORFs and uORFs to short CDSs and canonical coding genes (FIG. 6). First, short CDSs have an intermediate evolutionary age, as they are more conserved
smORF-producing transcripts, if the encoded peptides than uORFs8,30 and lncORFs8,10, but less conserved than
are advantageous. Altogether, paralogue short CDSs canonical ORFs (FIG. 7c). Second, smORFs reveal a pro-
that emerge from canonical proteins could be part of gression towards canonical amino acid usage, starting
canonical protein evolution, and represent an opposing from intergenic ORFs to lncORF and uORFs to short
mechanism to processes that usually inactivate proteins CDSs, and then to short isoforms and canonical pro-
and that result in the formation of pseudogenes (FIG. 6). teins (FIG. 5c,d). Finally, although the size of lncORFs and
Other short CDSs, such as hemotin 29,96 or toddler uORFs suggest a random origin from intergenic ORFs
(also known as elabela)97,98, seem to have no paralogues or non-coding RNA sequences (FIG. 3e), their amino acid
in the genome, so we cannot assign their origin to a pre-­ sequences do not fully reflect such an origin, but instead
existing canonical protein-coding gene. Where do these more closely resemble coding ORFs (FIG. 5c). In conclu-
‘singletons’ come from? Short CDSs could evolve from sion, lncORFs and uORFs may appear at random, but
shorter lncORFs and uORFs, by mechanisms favouring once they appear their nucleotide sequences are sub-
‘ORF extension’. In principle, the extension of any ORF jected to selection. Whether this selection is initially
only necessitates changing the stop codon to the codon due to a coding or a non-coding function needs to be
Pseudogene
for an amino acid, which is the most likely outcome ascertained; however, it can result in the development of
A paralogue of a functional
protein-coding gene, which has (95%) in the event of stop codon mutation, as observed amino acid sequences with full peptide function12,13,20,40,52.
lost its gene expression and/or in the Tal genes12 (FIG. 7a). Stop codon readthrough, In summary, some smORFs might have emerged
protein-coding capacities. an event that is readily detected in vivo by ribosome from canonical genes, and others from non-coding
profiling 99, could be an alternative means, or an inter­ sequences. Either way, the processes that drive the
Paralogue
Homologous gene within
mediate step, to stop codon mutation during ORF exten- evolution of canonical proteins (gene duplication and
a given species, usually sion. In either case, such elongated ORFs will again be protein neofunctionalization109) should also act on
generated by gene duplication. subjected to the exponential function λe–λx (see above), smORFs and give rise to smORF ‘families’, offering yet

586 | SEPTEMBER 2017 | VOLUME 18 www.nature.com/nrm


© 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

more raw materials for smORF evolution. Indeed, short CDSs and short isoforms can undergo duplication, both inside their transcript and as part of gene families12,13,72–75 (FIG. 7a,b), and can be generally classified according to their sequence similarity (J.-P.C. and P.P., unpublished observations). Finally, smORFs might not only appear from or develop into new canonical proteins, but might also be added to existing canonical proteins by exon shuffling110, thereby driving the development of new protein domains. In our view, smORFs could represent a genomic protein factory, which utilizes materials both new (lncORFs and intergenic ORFs) and recycled (canonical genes), constantly producing putative new peptides and protein domains.

Conclusions and future perspectives
Bioinformatic and experimental limitations have hampered the study of smORFs and the identification of functional smORFs. Overcoming these limitations and increasing the pool of experimentally characterized smORFs is the foremost challenge in the field. CRISPR–Cas-based gene editing should herald a new phase of faster progress, by allowing the targeted manipulation of individual smORFs. This is especially important in the case of lncORFs, uORFs and small isoforms, in which specific ORFs within the transcript or gene must be mutated individually. Nonetheless, our analysis of the available data suggests some emerging principles of smORF classification and function.

[Figure 7, panels a–c]

a  tal gene with four short CDSs (1A, 2A, 3A, AA); internal duplications and START and STOP codon degeneracy.

tal-1A
RNA      atggcagcctacttggatcccactggccagtactaa
Protein  M A A Y L D P T G Q Y *

tal-AA
RNA      atgctggatcccactggaacataccggcgaccacgcgacacgcaggac
Protein  M L D P T G T Y R R P R D T Q D
RNA      tcccgccaaaagaggcgacaggactgcctggatccaaccgggcagtactag
Protein  S R Q K R R Q D C L D P T G Q Y *

b  Structure (disordered region, TMH), multiple sequence alignment and conservation level of Scl family peptides.

Human PLN       MEKVQYLTRSAIRRASTIEMPQQARQKLQNLFIN--FCLILICLL-LICIIVMLL--
Fruit fly SclA                          MSEARNLFTT--FCILAICLF-FLYLIYAVL--
Human SLN                              MGINTRELFIN--FTIVLITVI-LMWLLVRSYQY
Human MRLN      -----------MTGKNWILISTTTPKSLEDEIVGRLLKILFVIFVDLISIIYVVITS
Human DWORF     ---------------------MAKGSTFSHLLVP--ILLLIGWIVGCIIMIYVVFS-

c  Plot of % of ORFs (0–100) against conservation level (Domain, Kingdom, Phylum, Class, Order) for canonical ORFs and short CDSs.

Figure 7 | Evidence of smORF evolution. a | Evolution of tarsal-less (tal) small open reading frames (smORFs) in
Drosophila melanogaster. tal is a polycistronic protein-coding gene, which had been initially annotated as a long
non-coding RNA (lncRNA) but which contains four short coding sequences (short CDSs). Three of these short CDSs are
11 and 12 codons in length (Tal‑1A, Tal‑2A and Tal‑3A) and include a conserved functional heptapeptide LDPTGXY,
indicating that they were formed through tandem duplications, a process corroborated by the tal phylogenetic tree12.
The smORF tal‑AA (32 codons in length) is the most evolutionarily recent Tal short CDS12 and encodes a sequence with
two heptapeptides, separated by sequence that is 17 amino acids in length; it includes degenerate STOP (orange shading)
and START codons (green shading), which is consistent with the elongation and fusion of two shorter tal‑1A‑like smORFs
by STOP codon loss. * denotes extant STOP codons. b | The sarcolamban (Scl) family of short CDSs show low amino acid
sequence conservation, but their structure and molecular function are conserved13,20. The family includes two fly peptides
(only SclA is shown) and four mammalian peptides. They all repress the activity of SERCA (FIG. 4d), except DWORF, which,
on the contrary, promotes SERCA activity by acting as a competitive inhibitor of the other Scl family smORF-encoded
peptides40. Both myoregulin (MRLN) and phospholamban (PLN) show amino‑terminal extension of their sequences.
c | The evolutionary conservation of canonical ORFs and short CDSs, indicating the proportion of ORFs that are conserved
at each taxonomic level (data extracted from Ensembl and Flybase). Solid graph lines indicate the mean and coloured
areas indicate the standard error of the mean (SEM), for data from fruit flies, mice and humans. One-half of canonical ORFs
are conserved across the Kingdom, whereas short CDSs display a 50% conservation at the Class level. However, note that
approximately 75 (8%) of ancient short CDSs are conserved across the domain Eukarya. Conservation in the following
species was used to indicate conservation at each taxonomic level. D. melanogaster ORFs: Kingdom if conserved in
Mus musculus or Homo sapiens; Phylum if in Daphnia pulex or Ixodes scapularis; Class if in Tribolium castaneum;
Order if in Anopheles gambiae. M. musculus ORFs: Kingdom conservation if in D. melanogaster; Phylum if in Danio rerio;
Class if in H. sapiens; Order if in Rattus norvegicus. H. sapiens ORFs: Kingdom if in D. melanogaster; Phylum if in D. rerio;
Class if in M. musculus; Order if in Pan troglodytes. For all ORFs, conservation in Saccharomyces cerevisiae indicated
conservation across the domain Eukarya. TMH, transmembrane α‑helix.
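
The statements in panel a can be checked against the sequences printed in the figure. The following Python sketch is an illustration and not part of the original analysis: it translates the tal-1A and tal-AA sequences with the standard genetic code and locates the heptapeptide LDPTGXY, treating X as any amino acid. The names `translate` and `motif_starts` are hypothetical helpers introduced here, and the two tal-AA sequence lines from the figure are assumed to be contiguous.

```python
import re

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate a DNA/RNA sequence in frame 0; '*' marks a stop codon."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore any trailing partial codon
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


# Sequences as printed in Figure 7a.
TAL_1A = "atggcagcctacttggatcccactggccagtactaa"
TAL_AA = (
    "atgctggatcccactggaacataccggcgaccacgcgacacgcaggac"
    "tcccgccaaaagaggcgacaggactgcctggatccaaccgggcagtactag"
)

# Conserved heptapeptide LDPTGXY, with X matching any residue.
MOTIF = re.compile(r"LDPTG.Y")


def motif_starts(protein):
    """Return the 0-based start positions of every LDPTGXY occurrence."""
    return [m.start() for m in MOTIF.finditer(protein)]


tal_1a = translate(TAL_1A)
tal_aa = translate(TAL_AA)
starts = motif_starts(tal_aa)
spacer = starts[1] - (starts[0] + 7)  # residues between the two heptapeptides

print(tal_1a, motif_starts(tal_1a))
print(tal_aa, starts, spacer)
```

Under these assumptions, tal-1A yields an 11-residue peptide with a single heptapeptide, whereas tal-AA yields a 32-residue peptide with two heptapeptides separated by 17 residues, in agreement with the legend.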


smORFs across animal species can be classified according to sequence length and transcript structure. These features correlate with other characteristics, such as their evolutionary conservation and amino acid usage (FIG. 1). Furthermore, these smORF classes seem to have specific cellular and molecular functions, facilitating more detailed studies and an understanding of the genomic role of smORFs. Non-transcribed intergenic ORFs probably have no function; uORFs act as cis-regulators of the translation of downstream canonical ORFs; lncORFs can give rise to novel biologically active peptides; short CDSs produce peptide regulators of canonical proteins in the cytoplasm and membranes (FIG. 4); and short isoforms can produce peptides that interfere with homologous proteins (FIG. 4c). Altogether, smORFs may generate new protein sequences during evolution, with the different smORF classes representing steps in the evolution of proteins from inert intergenic sequences (FIG. 6).

Understanding the origin, evolution and function of smORFs would be needed to clarify this crucial and, so far, underappreciated function of the genome; a far more dynamic and living genome than we currently contemplate. In addition, considering the current interest in artificial peptides as new clinical agents, smORFs could provide an unexplored reservoir of peptides that either already function as new drugs, delivery vectors and antimicrobial agents (such as short CDS), or are currently inactive but naturally primed for such functions by virtue of their amino acid composition (such as uORF and lncORF peptides). The conservation of individual smORFs and smORF classes across animal species, which we discuss in this Analysis, offers experimental model systems to explore this fascinating field.

1. Gerstein, M. B. et al. What is a gene, post-ENCODE? History and updated definition. Genome Res. 17, 669–681 (2007).
2. Derrien, T. et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res. 22, 1775–1789 (2012).
3. Guttman, M. & Rinn, J. L. Modular regulatory principles of large non-coding RNAs. Nature 482, 339–346 (2012).
4. Basrai, M. A., Hieter, P. & Boeke, J. D. Small open reading frames: beautiful needles in the haystack. Genome Res. 7, 768–771 (1997).
   This seminal work effectively establishes the field of smORF studies by arguing that smORFs exist in large numbers and can encode functional peptides.
5. Kastenmayer, J. P. et al. Functional genomics of genes with small open reading frames (sORFs) in S. cerevisiae. Genome Res. 16, 365–373 (2006).
   The only genome-wide assessment of smORF function, demonstrating smORF function in approximately 5% of baker’s yeast genes.
6. Ladoukakis, E., Pereira, V., Magny, E. G., Eyre‑Walker, A. & Couso, J. P. Hundreds of putatively functional small open reading frames in Drosophila. Genome Biol. 12, R118 (2011).
7. Frith, M. C. et al. The abundance of short proteins in the mammalian proteome. PLoS Genet. 2, 515–528 (2006).
8. Aspden, J. L. et al. Extensive translation of small open reading frames revealed by Poly-Ribo-Seq. eLife 3, e03528 (2014).
9. Bazzini, A. A. et al. Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. EMBO J. 33, 981–993 (2014).
   References 8 and 9 represent the first two studies of smORF translation using ribosome profiling in animals. Reference 8 introduces the concept that smORFs can be divided into different categories according to sequence features and translation efficiency.
10. Ruiz-Orera, J., Messeguer, X., Subirana, J. A. & Alba, M. M. Long non-coding RNAs as a source of new peptides. eLife 3, e03523 (2014).
    This computational study shows that the conservation and translation metrics of lncORFs resemble those of evolutionarily young proteins.
11. Smith, J. E. et al. Translation of small open reading frames within unannotated RNA transcripts in Saccharomyces cerevisiae. Cell Rep. 7, 1858–1866 (2014).
12. Galindo, M. I., Pueyo, J. I., Fouix, S., Bishop, S. A. & Couso, J. P. Peptides encoded by short ORFs control development and define a new eukaryotic gene family. PLoS Biol. 5, 1052–1062 (2007).
13. Magny, E. G. et al. Conserved regulation of cardiac calcium uptake by peptides encoded in small open reading frames. Science 341, 1116–1120 (2013).
    This study finds that smORFs can be conserved across hundreds of millions of years of evolution at the levels of peptide structure and function.
14. Andrews, S. J. & Rothnagel, J. A. Emerging evidence for functional peptides encoded by short open reading frames. Nat. Rev. Genet. 15, 193–204 (2014).
    This work further confirms the existence of functional smORFs and reviews smORF functions and current testing techniques.
15. Pueyo, J. I., Magny, E. G. & Couso, J. P. New peptides under the s(ORF)ace of the genome. Trends Biochem. Sci. 41, 665–678 (2016).
16. Saghatelian, A. & Couso, J. P. Discovery and characterization of smORF-encoded bioactive polypeptides. Nat. Chem. Biol. 11, 909–916 (2015).
17. Hemm, M. R., Paul, B. J., Schneider, T. D., Storz, G. & Rudd, K. E. Small membrane proteins found by comparative genomics and ribosome binding site models. Mol. Microbiol. 70, 1487–1501 (2008).
18. Hanada, K. et al. Small open reading frames associated with morphogenesis are hidden in plant genomes. Proc. Natl Acad. Sci. USA 110, 2395–2400 (2013).
19. Ma, J. et al. Discovery of human sORF-encoded polypeptides (SEPs) in cell lines and tissue. J. Proteome Res. 13, 1757–1765 (2014).
20. Anderson, D. M. et al. A micropeptide encoded by a putative long noncoding RNA regulates muscle performance. Cell 160, 595–606 (2015).
21. Mackowiak, S. D. et al. Extensive identification and analysis of conserved small ORFs in animals. Genome Biol. 16, 179 (2015).
    This study uses a stringent computational approach to identify hundreds of conserved smORFs in lncRNAs and UTRs in several model animals, and re‑evaluates previous computational studies.
22. Lemaitre, B., Reichhart, J. M. & Hoffmann, J. A. Drosophila host defense: differential induction of antimicrobial peptide genes after infection by various classes of microorganisms. Proc. Natl Acad. Sci. USA 94, 14614–14619 (1997).
23. Pueyo, J. I. & Couso, J. P. The 11‑aminoacid long Tarsal-less peptides trigger a cell signal in Drosophila leg development. Dev. Biol. 324, 192–201 (2008).
24. Djakovic, S., Dyachok, J., Burke, M., Frank, M. J. & Smith, L. G. BRICK1/HSPC300 functions with SCAR and the ARP2/3 complex to regulate epidermal cell shape in Arabidopsis. Development 133, 1091–1100 (2006).
25. Hanyu-Nakamura, K., Sonobe-Nojima, H., Tanigawa, A., Lasko, P. & Nakamura, A. Drosophila Pgc protein inhibits P‑TEFb recruitment to chromatin in primordial germ cells. Nature 451, 730–733 (2008).
26. FlyBase Consortium. The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res. 27, 85–88 (1999).
27. Ingolia, N. T. et al. Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes. Cell Rep. 8, 1365–1379 (2014).
28. Ingolia, N. T., Lareau, L. F. & Weissman, J. S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147, 789–802 (2011).
    This work uses ribosome profiling in mouse embryonic stem cells to show pervasive translation from alternative start sites, non-canonical start codon usage, and uORF and lncRNA translation.
29. Pueyo, J. I. et al. Hemotin, a regulator of phagocytosis encoded by a small ORF and conserved across metazoans. PLoS Biol. 14, e1002395 (2016).
30. Johnstone, T. G., Bazzini, A. A. & Giraldez, A. J. Upstream ORFs are prevalent translational repressors in vertebrates. EMBO J. 35, 706–723 (2016).
31. Wang, X. Q. & Rothnagel, J. A. 5ʹ‑untranslated regions with multiple upstream AUG codons can support low-level translation via leaky scanning and reinitiation. Nucleic Acids Res. 32, 1382–1391 (2004).
32. Calvo, S. E., Pagliarini, D. J. & Mootha, V. K. Upstream open reading frames cause widespread reduction of protein expression and are polymorphic among humans. Proc. Natl Acad. Sci. USA 106, 7507–7512 (2009).
33. Fritsch, C. et al. Genome-wide search for novel human uORFs and N‑terminal protein extensions using ribosomal footprinting. Genome Res. 22, 2208–2218 (2012).
34. Ji, Z., Song, R., Regev, A. & Struhl, K. Many lncRNAs, 5’UTRs, and pseudogenes are translated and some are likely to express functional proteins. eLife 4, e08890 (2015).
35. Duncan, C. D. & Mata, J. The translational landscape of fission-yeast meiosis and sporulation. Nat. Struct. Mol. Biol. 21, 641–647 (2014).
36. Pegueroles, C. & Gabaldon, T. Secondary structure impacts patterns of selection in human lncRNAs. BMC Biol. 14, 60 (2016).
37. Guttman, M., Russell, P., Ingolia, N. T., Weissman, J. S. & Lander, E. S. Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins. Cell 154, 240–251 (2013).
38. Banfai, B. et al. Long noncoding RNAs are rarely translated in two human cell lines. Genome Res. 22, 1646–1657 (2012).
39. Chew, G. L. et al. Ribosome profiling reveals resemblance between long non-coding RNAs and 5ʹ leaders of coding RNAs. Development 140, 2828–2834 (2013).
40. Nelson, B. R. et al. A peptide encoded by a transcript annotated as long noncoding RNA enhances SERCA activity in muscle. Science 351, 271–275 (2016).
41. Yates, A. et al. Ensembl 2016. Nucleic Acids Res. 44, D710–D716 (2016).
42. Rubin, G. M. et al. Comparative genomics of the eukaryotes. Science 287, 2204–2215 (2000).
43. Li, Z. et al. Detection of intergenic non-coding RNAs expressed in the main developmental stages in Drosophila melanogaster. Nucleic Acids Res. 37, 4308–4314 (2009).
44. van Heesch, S. et al. Extensive localization of long noncoding RNAs to the cytosol and mono- and polyribosomal complexes. Genome Biol. 15, R6 (2014).


45. Wang, H., Wang, Y., Xie, S., Liu, Y. & Xie, Z. Global and cell-type specific properties of lincRNAs with ribosome occupancy. Nucleic Acids Res. 45, 2786–2796 (2017).
46. Slavoff, S. A. et al. Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nat. Chem. Biol. 9, 59–64 (2013).
    This work applies a novel proteomics approach to discover SEPs in human cells.
47. Miklos, G. L. G. & Rubin, G. M. The role of the genome project in determining gene function: insights from model organisms. Cell 86, 521–529 (1996).
48. Mumtaz, M. A. & Couso, J. P. Ribosomal profiling adds new coding sequences to the proteome. Biochem. Soc. Trans. 43, 1271–1276 (2015).
49. Crappe, J. et al. Combining in silico prediction and ribosome profiling in a genome-wide search for novel putatively coding sORFs. BMC Genomics 14, 648 (2013).
50. Fatica, A. & Bozzoni, I. Long non-coding RNAs: new players in cell differentiation and development. Nat. Rev. Genet. 15, 7–21 (2013).
51. Kronja, I. et al. Widespread changes in the posttranscriptional landscape at the Drosophila oocyte-to‑embryo transition. Cell Rep. 7, 1495–1508 (2014).
52. Zanet, J. et al. Pri sORF peptides induce selective proteasome-mediated protein processing. Science 349, 1356–1358 (2015).
53. Kessler, M. M. et al. Systematic discovery of new genes in the Saccharomyces cerevisiae genome. Genome Res. 13, 264–271 (2003).
54. Baggerman, G., Cerstiaens, A., De Loof, A. & Schoofs, L. Peptidomics of the larval Drosophila melanogaster central nervous system. J. Biol. Chem. 277, 40368–40374 (2002).
55. Loose, C. R., Langer, R. S. & Stephanopoulos, G. N. Optimization of protein fusion partner length for maximizing in vitro translation of peptides. Biotechnol. Prog. 23, 444–451 (2007).
56. Lauressergues, D. et al. Primary transcripts of microRNAs encode regulatory peptides. Nature 520, 90–93 (2015).
57. Paharkova, V., Alvarez, G., Nakamura, H., Cohen, P. & Lee, K. W. Rat Humanin is encoded and translated in mitochondria and is localized to the mitochondrial compartment where it regulates ROS production. Mol. Cell. Endocrinol. 413, 96–100 (2015).
58. Lee, C. et al. The mitochondrial-derived peptide MOTS‑c promotes metabolic homeostasis and reduces obesity and insulin resistance. Cell Metab. 21, 443–454 (2015).
59. Kozak, M. Regulation of translation via mRNA structure in prokaryotes and eukaryotes. Gene 361, 13–37 (2005).
60. Szamecz, B. et al. eIF3a cooperates with sequences 5ʹ of uORF1 to promote resumption of scanning by post-termination ribosomes for reinitiation on GCN4 mRNA. Genes Dev. 22, 2414–2425 (2008).
61. Ebina, I. et al. Identification of novel Arabidopsis thaliana upstream open reading frames that control expression of the main coding sequences in a peptide sequence-dependent manner. Nucleic Acids Res. 43, 1562–1576 (2015).
62. Combier, J. P., de Billy, F., Gamas, P., Niebel, A. & Rivas, S. Trans-regulation of the expression of the transcription factor MtHAP2‑1 by a uORF controls root nodule development. Genes Dev. 22, 1549–1559 (2008).
63. Zhang, Z. & Dietrich, F. Identification and characterization of upstream open reading frames (uORF) in the 5ʹ untranslated regions (UTR) of genes in Saccharomyces cerevisiae. Curr. Genet. 48, 77–87 (2005).
64. Abrusan, G. Integration of new genes into cellular networks, and their structural maturation. Genetics 195, 1407–1417 (2013).
65. Kelley, L. A. & Sternberg, M. J. Partial protein domains: evolutionary insights and bioinformatics challenges. Genome Biol. 16, 100 (2015).
66. Joliot, A. & Prochiantz, A. Transduction peptides: from technology to physiology. Nat. Cell Biol. 6, 189–196 (2004).
67. Schulman, B. A. & Harper, J. W. Ubiquitin-like protein activation by E1 enzymes: the apex for downstream signalling pathways. Nat. Rev. Mol. Cell Biol. 10, 319–331 (2009).
68. Alonso, J. & Santaren, J. F. Characterization of the Drosophila melanogaster ribosomal proteome. J. Proteome Res. 5, 2025–2032 (2006).
69. Ghiglione, C., Perrimon, N. & Perkins, L. A. Quantitative variations in the level of MAPK activity control patterning of the embryonic termini in Drosophila. J. Dev. Biol. 205, 181–193 (1999).
70. Vaux, D. L. & Korsmeyer, S. J. Cell death in development. Cell 96, 245–254 (1999).
71. Itoh, K., Nakamura, K., Iijima, M. & Sesaki, H. Mitochondrial dynamics in neurodegeneration. Trends Cell Biol. 23, 64–71 (2012).
72. Staudt, A. C. & Wenkel, S. Regulation of protein function by ‘microProteins’. EMBO Rep. 12, 35–42 (2011).
73. Seo, P. J., Hong, S. Y., Kim, S. G. & Park, C. M. Competitive inhibition of transcription factors by small interfering peptides. Trends Plant Sci. 16, 541–549 (2011).
74. Graeff, M. et al. Microprotein-mediated recruitment of CONSTANS into a TOPLESS trimeric complex represses flowering in Arabidopsis. PLoS Genet. 12, e1005959 (2016).
75. Ling, F., Kang, B. & Sun, X.‑H. Id proteins: small molecules, mighty regulators. Curr. Top. Dev. Biol. 110, 189–216 (2014).
76. Au, Y. The muscle ultrastructure: a structural perspective of the sarcomere. Cell. Mol. Life Sci. 61, 3016–3033 (2004).
77. Gawliński, P. et al. The Drosophila mitotic inhibitor Frühstart specifically binds to the hydrophobic patch of cyclins. EMBO Rep. 8, 490–496 (2007).
78. Guo, B. et al. Humanin peptide suppresses apoptosis by interfering with Bax activation. Nature 423, 456–461 (2003).
79. Weinmaster, G. & Fischer, J. A. Notch ligand ubiquitylation: what is it good for? Dev. Cell 21, 134–144 (2011).
80. Ching, K. H., Kisailus, A. E. & Burbelo, P. D. Biochemical characterization of distinct regions of SPEC molecules and their role in phagocytosis. Exp. Cell Res. 313, 10–21 (2007).
81. Zasloff, M. Antimicrobial peptides of multicellular organisms. Nature 415, 389–395 (2002).
82. Brogden, K. A. Antimicrobial peptides: pore formers or metabolic inhibitors in bacteria? Nat. Rev. Microbiol. 3, 238–250 (2005).
83. D’Onofrio, G., Mouchiroud, D., Aissani, B., Gautier, C. & Bernardi, G. Correlations between the compositional properties of human genes, codon usage, and amino acid composition of proteins. J. Mol. Evol. 32, 504–510 (1991).
    This study correlates the overall nucleotide and amino acid compositions of protein-coding sequences, highlighting and attempting to explain the biased nonrandom amino acid usage of canonical proteins.
84. Hansen, M., Kilk, K. & Langel, U. Predicting cell‑penetrating peptides. Adv. Drug Deliv. Rev. 60, 572–579 (2008).
85. Jones, S. W. et al. Characterisation of cell-penetrating peptide-mediated peptide delivery. Br. J. Pharmacol. 145, 1093–1102 (2005).
86. Murphy, M. P. Targeting lipophilic cations to mitochondria. Biochim. Biophys. Acta 1777, 1028–1031 (2008).
87. Hoffmann, J. A., Kafatos, F. C., Janeway, C. A. & Ezekowitz, R. A. Phylogenetic perspectives in innate immunity. Science 284, 1313–1318 (1999).
88. Fan, L. et al. DRAMP: a comprehensive data repository of antimicrobial peptides. Sci. Rep. 6, 24482 (2016).
89. Wenzel, M. et al. Small cationic antimicrobial peptides delocalize peripheral membrane proteins. Proc. Natl Acad. Sci. USA 111, E1409–E1418 (2014).
90. Slavoff, S. A., Heo, J., Budnik, B. A., Hanakahi, L. A. & Saghatelian, A. A human short open reading frame (sORF)-encoded polypeptide that stimulates DNA end joining. J. Biol. Chem. 289, 10950–10957 (2014).
91. Pueyo, J. I. & Couso, J. P. Tarsal-less peptides control Notch signalling through the Shavenbaby transcription factor. Dev. Biol. 355, 183–193 (2011).
92. Palm, C., Jayamanne, M., Kjellander, M. & Hallbrink, M. Peptide degradation is a critical determinant for cell-penetrating peptide uptake. Biochim. Biophys. Acta 1768, 1769–1776 (2007).
93. Jaillon, O. et al. Translational control of intron splicing in eukaryotes. Nature 451, 359–362 (2008).
94. Pimplikar, S. W. Reassessing the amyloid cascade hypothesis of Alzheimer’s disease. Int. J. Biochem. Cell Biol. 41, 1261–1268 (2009).
95. Waxman, D. & Peck, J. R. Pleiotropy and the preservation of perfection. Science 279, 1210–1213 (1998).
96. Billingsley, M. L. et al. Functional and structural properties of stannin: roles in cellular growth, selective toxicity, and mitochondrial responses to injury. J. Cell. Biochem. 98, 243–250 (2006).
97. Chng, S. C., Ho, L., Tian, J. & Reversade, B. ELABELA: a hormone essential for heart development signals via the apelin receptor. Dev. Cell 27, 672–680 (2013).
98. Pauli, A. et al. Toddler: an embryonic signal that promotes cell movement via Apelin receptors. Science 343, 1248636 (2014).
    References 97 and 98 characterize the 32‑amino-acid-long SEP toddler, which acts as a hormone in the zebrafish heart.
99. Dunn, J. G., Foo, C. K., Belletier, N. G., Gavis, E. R. & Weissman, J. S. Ribosome profiling reveals pervasive and regulated stop codon readthrough in Drosophila melanogaster. eLife 2, e01179 (2013).
100. Carvunis, A. R. et al. Proto-genes and de novo gene birth. Nature 487, 370–374 (2012).
     This study proposes a model for the de novo emergence of protein-coding genes from proto-genes or sequences, forming a continuum between noncoding DNA and fully coding genes.
101. McLysaght, A. & Guerzoni, D. New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation. Phil. Trans. R. Soc. B http://dx.doi.org/10.1098/rstb.2014.0332 (2015).
102. Zhao, L., Saelao, P., Jones, C. D. & Begun, D. J. Origin and Spread of de novo genes in Drosophila melanogaster populations. Science 343, 769–772 (2014).
103. Zhou, Q. et al. On the origin of new genes in Drosophila. Genome Res. 18, 1446–1455 (2008).
104. Neme, R. & Tautz, D. Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution. BMC Genomics 14, 117 (2013).
105. Reinhardt, J. A. et al. De novo ORFs in Drosophila are important to organismal fitness and evolved rapidly from previously non-coding sequences. PLoS Genet. 9, e1003860 (2013).
106. Moyers, B. A. & Zhang, J. Evaluating phylostratigraphic evidence for widespread de novo gene birth in genome evolution. Mol. Biol. Evol. 33, 1245–1256 (2016).
107. Xie, C. et al. Hominoid-specific de novo protein-coding genes originating from long non-coding RNAs. PLoS Genet. 8, e1002942 (2012).
108. Schlotterer, C. Genes from scratch – the evolutionary fate of de novo genes. Trends Genet. 31, 215–219 (2015).
109. Sommer, R. J. The future of evo-devo: model systems and evolutionary theory. Nat. Rev. Genet. 10, 416–422 (2009).
110. Yang, S. & Bourne, P. E. The evolutionary history of protein domains viewed by species phylogeny. PLoS ONE 4, e8378 (2009).
111. Milligan, M. J. et al. Global intersection of long non‑coding RNAs with processed and unprocessed pseudogenes in the human genome. Front. Genet. 7, 26 (2016).

Acknowledgements
The authors thank their colleagues J. Pueyo, E. Magny, S. Bishop and F. Casares for helpful suggestions about the manuscript. This work was funded by grants from the Spanish Ministerio de Economía, Industria y Competitividad (MINECO; ref. BFU/2016-77793-P) and the British Biotechnology and Biological Sciences Research Council (BBSRC; ref. BB/N001753/1) to J.-P.C.

Competing interests statement
The authors declare no competing interests.

Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

CORRECTION

Classification and function of small open reading frames


Juan-Pablo Couso & Pedro Patraquim
Nature Reviews Molecular Cell Biology doi:10.1038/nrm.2017.58 (2017)
The original online version of this article contained four errors, which have now been corrected. The corrections included
two typos in the main text, the addition of a missing point in the X axis in Figure 3b, and the exchange of the position of two
column headers in Figure 5c.
