Pan 08 Nature Genet

BRIEF COMMUNICATIONS

Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing

Qun Pan1, Ofer Shai1,2, Leo J Lee1,2, Brendan J Frey1,2 & Benjamin J Blencowe1,3

© 2008 Nature Publishing Group http://www.nature.com/naturegenetics

We carried out the first analysis of alternative splicing complexity in human tissues using mRNA-Seq data. New splice junctions were detected in ~20% of multiexon genes, many of which are tissue specific. By combining mRNA-Seq and EST-cDNA sequence data, we estimate that transcripts from ~95% of multiexon genes undergo alternative splicing and that there are ~100,000 intermediate- to high-abundance alternative splicing events in major human tissues. From a comparison with quantitative alternative splicing microarray profiling data, we also show that mRNA-Seq data provide reliable measurements for exon inclusion levels.

Alternative splicing is considered to be a key factor underlying increased cellular and functional complexity in higher eukaryotes1–3. From analyses of microarray profiling and EST-cDNA sequence data, it has been estimated that two-thirds of human genes contain one or more alternatively spliced exon4. However, because of the limited depth of coverage and sensitivity afforded by conventional sequencing and microarray profiling methods, the extent of human alternative splicing is not known5. High-throughput or next generation sequencing technologies offer the potential to address this question6, and several very recent studies have applied analyses of short cDNA read (mRNA-Seq) data from these technologies to survey alternative splicing in mouse tissues and in human and mouse cell lines7–10. In this study, we used the Genome Analyzer system of Illumina to survey splicing complexity in diverse, normal human tissues using mRNA-Seq datasets consisting of 17–32 million 32-nucleotide-long reads. We also assessed the potential of these datasets to provide quantitative measurements for alternative splicing levels.

To assess human tissue alternative splicing complexity, we used mRNA-Seq datasets from whole brain, cerebral cortex, heart, skeletal muscle, lung and liver to search libraries of splice junction sequences that represent known splicing events and candidate new splicing events. Junction sequences designated as known below are those supported by the analysis of aligned EST and cDNA sequences, and candidate new splice junction sequences are those corresponding to all hypothetical, additional 5′ to 3′ pairings of splice sites in the same set of genes (Fig. 1a; and Supplementary Methods online). Mining of a dataset of 15,702 multiexon UniGene clusters, each containing one or more locus-specific RefSeq cDNA, resulted in the compilation of 257,257 known splice junctions and 2,459,306 candidate new junctions. These junction libraries were searched using reads from the six tissues, and estimates for true-positive junctions were derived as ranges (see below) that exclude or include repeat junction reads, that is, those reads that map to more than one splice junction in transcripts from the 15,702 surveyed genes.

In order to assess which reads represent true splice junctions, we trained a logistic regression classifier to discriminate between known junction sequences and a set of control reverse junction sequences, in which the 3′ half of each detected known junction sequence is located upstream of the 5′ half. These reverse junction sequences were used as controls to maintain inherent codon, dinucleotide and possible other compositional biases when discriminating between true and false junctions. The classifier was trained using five features that reflect important parameters when discriminating true- and false-positive alignments between mRNA-Seq reads and splice junction regions (Supplementary Methods). The classifier achieves 94% sensitivity at a specificity of 95%, as determined by tenfold cross-validation (Supplementary Fig. 1 online). The parameters learned by the classifier were applied to the known and new junctions on the basis of statistics obtained from sequence reads from each tissue, and the numbers of true versus false junctions were determined in the known and new junction datasets for all six tissues.

When combining the mRNA-Seq data from the six tissues we detected between 128,395 (49.9%) and 130,854 (50.9%) of the 257,257 known junctions (Fig. 1b), whereas only 121 (0.04%) to 135 (0.052%) of the corresponding control junctions were detected, respectively. Each tissue dataset contributed between 18% and 31% of the detected known junctions. Thus, from profiling only six human tissues, we were able to detect approximately half of the splice junctions represented in EST-cDNA databases. This observation could reflect previous results from EST-cDNA analyses and microarray profiling studies indicating that most tissues, including those analyzed in the present study, express ~6,000 to 10,000 mRNA genes, and that brain and liver show relatively high frequencies of alternative splicing compared to other tissues11–13.

From the mRNA-Seq data, we also detected between 4,294 and 11,099 new splice junctions (Fig. 1b), which corresponds to a detection rate of one or more new splice junction in 2,948 (18.8%) to 3,788 (24.1%) of the surveyed genes. When combining EST, cDNA and mRNA-Seq data, we observed that more than 85% of the multiexon genes analyzed contain at least one alternative splicing event.
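The junction libraries and the reverse junction controls described above can be illustrated with a small sketch. This is not the authors' pipeline: it assumes one 5′ and one 3′ splice site per exon, uses made-up exon sequences, and takes a hypothetical flank length of 4 bases per junction half (the actual junction length is given only in the Supplementary Methods). The function names are illustrative.

```python
from itertools import combinations


def candidate_pairs(n_exons, known_pairs):
    """Split all upstream->downstream exon pairs into known and candidate new.

    A pair (i, j) with i < j stands for joining the 5' splice site at the
    end of exon i to the 3' splice site at the start of exon j.
    """
    all_pairs = set(combinations(range(n_exons), 2))
    known = set(known_pairs) & all_pairs
    return known, all_pairs - known


def junction_halves(exons, i, j, flank):
    """Return the upstream (5') and downstream (3') halves of junction (i, j)."""
    return exons[i][-flank:], exons[j][:flank]


def junction_sequence(exons, i, j, flank):
    """Junction sequence: end of exon i followed by start of exon j."""
    upstream, downstream = junction_halves(exons, i, j, flank)
    return upstream + downstream


def reverse_junction_sequence(exons, i, j, flank):
    """Control sequence: the 3' half is placed upstream of the 5' half."""
    upstream, downstream = junction_halves(exons, i, j, flank)
    return downstream + upstream


# Toy gene with four exons (made-up sequences, hypothetical flank of 4 bases).
exons = ["AAAACCCC", "GGGGTTTT", "ACGTACGT", "TTTTGGGG"]
known, candidates = candidate_pairs(len(exons), [(0, 1), (1, 2), (2, 3)])
print(sorted(candidates))
print(junction_sequence(exons, 0, 2, flank=4))
print(reverse_junction_sequence(exons, 0, 2, flank=4))
```

Because the control reuses exactly the same bases as the true junction, its nucleotide composition is preserved while the true junction point is removed, which is the property the text relies on when using these sequences as negatives.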

1Banting and Best Department of Medical Research, University of Toronto, Toronto M5S 3E1, Canada. 2Department of Electrical and Computer Engineering, University of Toronto, Toronto M5S 3G4, Canada. 3Department of Molecular Genetics, University of Toronto, Toronto M5S 3E1, Canada. Correspondence should be addressed to B.J.B. (b.blencowe@utoronto.ca).
Received 17 July; accepted 19 August; published online 2 November 2008; addendum published after print 28 April 2009; doi:10.1038/ng.259

NATURE GENETICS | VOLUME 40 | NUMBER 12 | DECEMBER 2008 | 1413



Figure 1 Assessing human alternative splicing complexity using mRNA-Seq data. (a) Diagram showing a gene with known splice junctions (blue lines) supported by cDNA-EST evidence. Dashed pink lines represent all hypothetically possible new splice junctions, and the solid pink lines indicate a new junction detected using Illumina mRNA-Seq data. Alternative exons are indicated in red. (b) Numbers of known and new splice junctions detected using mRNA-Seq data from human tissues. Each point in the four plots indicates the mean number of junctions detected when comparing data from all possible combinations of the specified numbers of tissues. The light blue and dark blue plots show the numbers of detected known junctions when junction sequences that are repeated elsewhere in the surveyed genes are either included or excluded, respectively. The pink and purple plots show the numbers of new junctions detected when including or excluding repeated sequences. (c) Histograms of the tissue distribution of known and new alternatively spliced junctions. We detected 7,917 known and 2,368 new splice junctions representing evidence for skipping of one or more alternative cassette exons in mRNA-Seq read alignments. The tissue distribution of these junction reads was plotted as the percentage of junctions that appear in one to all six tissues. (d) Tissue distributions of new splice junctions detected in pairs of tissues. The size of each blue box indicates the number of junctions shared between a given pair of tissues, with the highest number of shared junctions corresponding to the largest box. Br, whole brain; Ce, cerebral cortex; He, heart; Li, liver; Lu, lung; Sk, skeletal muscle.

[Figure 1 graphic not reproduced. Panel b: x-axis, number of tissues (1–6); y-axes, number of known junctions and number of new junctions; series, Known (repeats), Known (non repeats), New (repeats), New (non repeats). Panel c: x-axis, number of tissues (1–6); y-axis, AS junctions (%); series, Known AS junctions and New AS junctions. Panel d: pairwise grid over Br, Ce, He, Li, Lu, Sk.]
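The summaries plotted in panels b and c can be sketched as follows. This is a minimal illustration on made-up junction identifiers, assuming that the junctions detected in each tissue are available as a set; the function names and the toy data are not from the paper.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def mean_detected_by_subset_size(detected):
    """Mean number of distinct junctions over all combinations of k tissues.

    `detected` maps a tissue name to the set of junction ids found in it.
    Returns {k: mean size of the union over every k-tissue combination}.
    """
    tissues = sorted(detected)
    means = {}
    for k in range(1, len(tissues) + 1):
        sizes = [
            len(set().union(*(detected[t] for t in combo)))
            for combo in combinations(tissues, k)
        ]
        means[k] = mean(sizes)
    return means


def tissue_count_percentages(detected):
    """Percentage of junctions seen in exactly 1, 2, ... tissues."""
    per_junction = Counter(j for junctions in detected.values() for j in junctions)
    histogram = Counter(per_junction.values())
    total = sum(histogram.values())
    return {k: 100.0 * v / total for k, v in sorted(histogram.items())}


# Toy data: three tissues and five junction ids (all values are made up).
toy = {"Br": {1, 2, 3}, "He": {2, 3, 4}, "Li": {5}}
print(mean_detected_by_subset_size(toy))
print(tissue_count_percentages(toy))
```

Averaging over every combination of a given size, rather than adding tissues in one fixed order, keeps the curve from depending on which tissue happens to be listed first.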

However, at increased levels of read coverage (that is, 16 to >500 reads per 100 nucleotides), alternative splicing events can be detected in 92–97% of multiexon genes (Supplementary Methods). This represents a substantial increase over a previous estimate (74%)4 for the proportion of multiexon genes that contain one or more alternative splicing event. Given that our analysis of mRNA-Seq data detects approximately half of known junctions, and that there is an almost linear increase in the detection rate of new junctions as data from each tissue is added (Fig. 1b), we predict that with full coverage the numbers of new junctions would be at least twice those detected in the present data.

To assess the degree to which known and new junctions detected in the mRNA-Seq data may represent tissue-dependent splicing events, we investigated the frequencies at which detected splice junctions formed by skipping of one or more exons are unique to individual tissues, or are common to two or more tissues (Fig. 1c). In each case, there were significantly more junctions that were detected in only one tissue, and these could represent tissue-specific or tissue-restricted splice variants, as well as possible lower-abundance splice variants that are widely expressed but only detected in a single tissue. We also examined the tissue specificity of all (n = 439) new splice junctions that were detected in two of the six tissues. In the plot shown in Figure 1d, the sizes of the blue squares indicate the proportions of the overlaps between new splice junctions when comparing pairs of tissues. This shows that a greater proportion of new splice junctions are commonly detected in whole brain and cerebral cortex, or in skeletal muscle and heart, than the proportion of new junctions commonly detected in any of the other pairs of tissues. This indicates that many of the new splice junctions reflect the physiological origin of the tissues analyzed and therefore likely represent examples of new tissue-specific alternative splicing events, in addition to new alternative splicing events in transcripts with tissue-restricted expression patterns. Supplementary Table 1 online lists genes with more than five new splice junctions. Many of these genes encode giant and other muscle-specific proteins, thus revealing a previously unappreciated degree of alternative splicing complexity in transcripts from muscle-specific genes. These findings are consistent with previous proposals that alternative splicing of transcripts encoding some of these proteins has an important role in controlling fundamental mechanical properties of muscle, such as tension and contractility14.

Figure 2 Assessing alternative splicing frequency using mRNA-Seq data. (a) Box plots showing the number of alternative splicing events per exon (AS frequency, upper panel) and mRNA-Seq read coverage (lower panel) for genes with different numbers of exons. The alternative splicing frequency is calculated on the basis of mRNA-Seq data only. Each box shows the lower and upper quartile values, and the white line indicates the median value. The error bars indicate the variation for the rest of the data, and outliers are indicated as black pluses. (b) Alternative splicing frequency (number of alternative splicing events per exon) for genes with different mRNA-Seq read coverage. cDNA, EST and mRNA-Seq data were combined to calculate the number of alternative splicing events in each gene. The median alternative splicing frequency was determined for each gene group and a scale factor was applied to new junctions detected in mRNA-Seq data to account for missing new junctions expected when surveying additional tissues.

[Figure 2 graphic not reproduced. Panel a: x-axis, number of exons per gene; y-axes, number of AS events per exon (upper) and number of reads per 100 bases (lower). Panel b: x-axis, number of reads per 100 bases; y-axis, number of AS events per exon.]

As mRNA-Seq data affords the detection of alternative splicing events in transcripts irrespective of their length and associated splicing
complexity, the relationship between alternative splicing frequency relative to exon number in genes can be accurately assessed. Accordingly, we determined the median number of alternative splicing events per exon for genes with different numbers of total exons and with similar overall levels of Illumina read coverage (Fig. 2a). Despite the theoretical possibility of a quadratic (n²) increase in the number of alternative splicing possibilities as the number of exons per gene increases, our results indicate that the number of alternative splicing events per gene increases in a near linear fashion (Fig. 2a). Thus, notably, the frequency of alternative splicing detection per exon does not rise in genes with increasing numbers of exons, and this observation suggests that selection pressure may act to generally limit splicing complexity in large genes. This observation facilitates assessment of the total number of alternative splicing events in human tissues.

When considering that the frequency of new alternative splicing events detected in the six different tissues can be extrapolated to other human tissues, it is possible to derive an estimate for the total number of alternative splicing events that can be detected by comparable methods. By combining the rates of detection of new and known alternative splicing events afforded by mRNA-Seq and EST-cDNA data, respectively, we observed that the median number of alternative splicing events per exon is between 0.5 and 0.75 for genes with intermediate to high levels of Illumina sequence coverage (32–256 reads per 100 bases; see Fig. 2b). Given that 175,944 exons were mined from 15,702 multiexon human genes, we predict that on the order of 88,000–132,000 alternative splicing events of comparable abundance as those detected in the present study are expressed in major human tissues. This estimate further suggests that, on average, there are at least seven alternative splicing events per multiexon human gene.

An important question concerning high-throughput sequencing technologies is their capacity to generate reliable quantitative measurements for alternative splicing levels. To address this question, we compared estimates for percent exon inclusion from the mRNA-Seq data described above with percent inclusion estimates generated from profiling ~5,000 cassette-type alternative exons in the same six human tissues using our previously validated15, quantitative alternative splicing microarray system (unpublished data, Supplementary Fig. 2a and Supplementary Methods online). When applying a threshold of 20 or more reads per tissue that match any one of the three splice junction sequences representing inclusion and skipping of a cassette exon, there is a high correlation (r = 0.80, n = 1,548) between the alternative splicing microarray- and mRNA-Seq-derived predictions for percent inclusion. The correlation increases (r = 0.85, n = 546) when a threshold of 50 or more junction matching reads is used. Predictions for percent inclusion levels from the two systems also agree well for tissue-regulated alternative exons (Supplementary Fig. 2b and Supplementary Information). Together, the results described above show that mRNA-Seq data can be used to reliably measure alternative splicing levels, in addition to revealing important new insights into alternative splicing complexity in the human transcriptome.

Note: Supplementary information is available on the Nature Genetics website.

ACKNOWLEDGMENTS
We thank S. Luo, I. Khrebtukova and G. Schroth of Illumina Inc. for providing some of the mRNA-Seq datasets used in this analysis. We also thank M. Brudno, Y. Barash, J. Calarco and S. Ahmad for helpful suggestions and comments on the manuscript. B.J.B. and B.J.F. acknowledge support from the Canadian Institutes of Health Research and from Genome Canada through the Ontario Genomics Institute.

AUTHOR CONTRIBUTIONS
Q.P. created the exon and splice junction libraries and performed analyses of the mRNA-Seq, cDNA-EST and microarray data. O.S., L.J.L. and B.J.F. designed and implemented the logistic regression classifier and contributed to the analyses of tissue-specific alternative splicing events. The study was coordinated by B.J.B. The manuscript was prepared by B.J.B. and Q.P., with the participation of O.S., L.J.L. and B.J.F.

Published online at http://www.nature.com/naturegenetics/
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/

1. Matlin, A.J., Clark, F. & Smith, C.W. Nat. Rev. Mol. Cell Biol. 6, 386–398 (2005).
2. Blencowe, B.J. Cell 126, 37–47 (2006).
3. Ben-Dov, C., Hartmann, B., Lundgren, J. & Valcarcel, J. J. Biol. Chem. 283, 1229–1233 (2008).
4. Johnson, J.M. et al. Science 302, 2141–2144 (2003).
5. Sorek, R., Dror, G. & Shamir, R. BMC Genomics 7, 273 (2006).
6. Calarco, J.A. et al. Adv. Exp. Med. Biol. 623, 64–84 (2007).
7. Bainbridge, M.N. et al. BMC Genomics 7, 246 (2006).
8. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Nat. Methods 5, 621–628 (2008).
9. Cloonan, N. et al. Nat. Methods 5, 613–619 (2008).
10. Sultan, M. et al. Science 321, 956–960 (2008).
11. Su, A.I. et al. Proc. Natl. Acad. Sci. USA 101, 6062–6067 (2004).
12. Zhang, W. et al. J. Biol. 3, 21 (2004).
13. Yeo, G., Holste, D., Kreiman, G. & Burge, C.B. Genome Biol. 5, R74 (2004).
14. Schiaffino, S. & Reggiani, C. Physiol. Rev. 76, 371–423 (1996).
15. Pan, Q. et al. Mol. Cell 16, 929–941 (2004).
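The extrapolation given in the main text (175,944 exons, a median of 0.5 to 0.75 alternative splicing events per exon, 15,702 multiexon genes) can be checked with a few lines of arithmetic. The sketch below uses only those three published inputs; the variable and function names are illustrative, and the per-gene values are simply what these inputs imply, not an additional result from the paper.

```python
EXONS = 175_944           # exons mined from the surveyed genes
GENES = 15_702            # multiexon genes surveyed
RATE_LOW, RATE_HIGH = 0.5, 0.75   # median AS events per exon (Fig. 2b range)


def event_range(exons, rate_low, rate_high):
    """Lower and upper estimate of total alternative splicing events."""
    return exons * rate_low, exons * rate_high


low, high = event_range(EXONS, RATE_LOW, RATE_HIGH)
print(round(low), round(high))                   # 87972 131958
print(round(low / GENES, 1), round(high / GENES, 1))  # 5.6 8.4
print(round((low + high) / 2 / GENES, 1))        # 7.0
```

Rounded to the nearest thousand, the totals reproduce the stated range of 88,000 to 132,000 events, and the midpoint of that range corresponds to about seven events per multiexon gene.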



addendum

Addendum: Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing
Qun Pan, Ofer Shai, Leo J Lee, Brendan J Frey & Benjamin J Blencowe
Nat. Genet. 40, 1413–1415 (2008), published online 2 November 2008; addendum published after print 28 April 2009

The GEO accession number for the mRNA-Seq datasets is GSE13652.


2009 Nature America, Inc. All rights reserved.
