Downloaded from genome.cshlp.org on May 21, 2018 - Published by Cold Spring Harbor Laboratory Press

Method

MMAPPR: Mutation Mapping Analysis Pipeline for Pooled RNA-seq

Jonathon T. Hill, Bradley L. Demarest, Brent W. Bisgrove, Bushra Gorsi, Yi-Chu Su, and H. Joseph Yost

Department of Neurobiology and Anatomy, University of Utah Molecular Medicine Program, University of Utah, Salt Lake City, Utah 84112, USA

Forward genetic screens in model organisms are vital for identifying novel genes essential for developmental or disease processes. One drawback of these screens is the labor-intensive and sometimes inconclusive process of mapping the causative mutation. To leverage high-throughput techniques to improve this mapping process, we have developed a Mutation Mapping Analysis Pipeline for Pooled RNA-seq (MMAPPR) that works without parental strain information or a preexisting SNP map of the organism, and adapts to differential recombination frequencies across the genome. MMAPPR accommodates the considerable amount of noise in RNA-seq data sets, calculates allelic frequency by Euclidean distance followed by Loess regression analysis, identifies the region where the mutation lies, and generates a list of putative coding region mutations in the linked genomic segment. MMAPPR can exploit RNA-seq data sets from isolated tissues or whole organisms that are used for gene expression and transcriptome analysis in novel mutants. We tested MMAPPR on two known mutant lines in zebrafish, nkx2.5 and tbx1, and used it to map two novel ENU-induced cardiovascular mutants, with mutations found in the ctr9 and cds2 genes. MMAPPR can be directly applied to other model organisms, such as Drosophila and Caenorhabditis elegans, that are amenable to both forward genetic screens and pooled RNA-seq experiments. Thus, MMAPPR is a rapid, cost-efficient, and highly automated pipeline, available to perform mutant mapping in any organism with a well-assembled genome.
[Supplemental material is available for this article.]

Forward genetic screens in zebrafish have identified a large number of genes essential for organogenesis (Driever et al. 1996; Haffter et al. 1996), laterality (Chen et al. 2001), axon guidance (Xiao et al. 2008), and cancer development (Moore et al. 2006), many of which have been linked to human disease. Similar large-scale forward-genetic screens have been and continue to be performed in mice (Yu et al. 2004; García-G…), Drosophila (Nüsslein-Volhard and Wieschaus 1980; Medina et al. 2006), and Caenorhabditis elegans (Brenner 1974; Hughes et al. 2011). However, mapping mutants in many species traditionally has been labor intensive and often inconclusive, especially in organisms with relatively complex genomes.

Several methods exist to expedite genetic mapping. For example, genotyping DNA pooled from phenotype-sorted individuals (bulk segregant analysis) has long been a standard method for low-resolution genetic mapping. Bulk segregant analysis provides a qualitative estimate of the linkage between a given marker and the mutant locus, while greatly reducing the time and expense of genotyping. However, this method is still labor intensive because it requires that each marker be analyzed individually. The development of techniques using genotyping arrays (labemero et al.
2012), genomic resequencing of individuals (Warren et al. 2012), and exome-capture sequencing (Lin et al. 2012) has made mapping mutations much more rapid in human populations by allowing multiple markers to be analyzed simultaneously, but these techniques have been less widely adopted in many model organisms because of incomplete genomic annotation, high polymorphism rates, and the cost associated with performing these analyses on large numbers of individuals. Recently, several methods that apply whole-genome sequencing techniques to model organisms have been proposed for Arabidopsis thaliana (Schneeberger et al. 2009; Cuperus et al. 2010; Austin et al. 2011; Uchida et al. 2011), zebrafish (Bowen et al. 2012; Leshchiner et al. 2012; Voz et al. 2012), mice (Arnold et al. 2011), and C. elegans (Doitsidou et al. 2010; Zuryn et al. 2010).

An alternative to whole-genome sequencing (WGS) is RNA-seq, which is less expensive because the transcriptome is smaller than the genome, allowing greater read depth to be achieved with fewer reads. The utility of RNA-seq analysis for mapping has been demonstrated recently in self-pollinated individuals derived from inbred mapping strains in maize (Liu et al. 2012), but this has not been tested in noisier data sets from outbred animal populations. In addition to mapping, RNA-seq is becoming a standard analysis method in model organisms for determining the gene expression and splicing changes underlying phenotypes derived from both forward and reverse genetics (Aanes et al. 2011; Rosel et al. 2011; Vesterlund et al. 2011).

Because RNA-seq of pooled individuals can be used for differential expression analysis to further understand the phenotypes of novel mutants from forward genetic screens, we sought to develop a method to use these data to identify the causative mutation underlying the observed phenotype, thus creating an inexpensive and rapid alternative to traditional mapping procedures. We have designed our method, which we call MMAPPR (Mutation Mapping Analysis Pipeline for Pooled RNA-seq), to use the data and experimental design typical in RNA-seq-based transcriptome experiments directly. Although this study goes through the principles and optimization of MMAPPR analysis as well as the details of successfully mapping four mutants, the average user will not be required to have this level of expertise and can simply process their data sets through our program (available at http://yost.genetics.utah.edu/software.php) in order to identify their mutant genes.

We have validated MMAPPR on two known mutants, nkx2.5 (KV Targoff, unpubl.) and tbx1 (Piotrowski et al. 2003), and two unknown mutant lines, zy13 and zy14, identified in an ENU screen performed in our laboratory. MMAPPR was then used to identify a genomic region containing the mutation and generate a list of nonsynonymous mutations that serve as candidates for the gene encoding the causative mutation. In each case, the identified causative mutation was <1 cM from the maximum score generated by MMAPPR, indicating that MMAPPR is able to identify mutations derived from a forward genetic screen in zebrafish successfully and accurately.

Corresponding author. E-mail: yost@genetics.utah.edu. Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/…. Freely available online through the Genome Research Open Access option. Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051; www.genome.org.

In addition to zebrafish, MMAPPR can be directly applied to other organisms, such as Drosophila melanogaster and C.
elegans, in which both forward genetic screens and pooled RNA-seq experiments are common, thus removing a significant barrier for performing mutagenesis screens in model organisms.

Results

We developed a novel method, MMAPPR, for identifying recessive mutations identified in forward genetic screens (outlined in Fig. 1). Briefly, MMAPPR uses RNA-seq data from F2 embryos separated by phenotype into two pools: wild-type phenotype (which includes homozygous wild-type and heterozygotes), and mutant phenotype (which includes homozygous mutants). Candidate mutations are then identified based on three criteria: physical location in the linked region, expression at the time of tissue collection, and effect on protein amino acid sequence.

RNA-seq data contain many thousands of single nucleotide polymorphic markers (SNPs) spread across the entire genome, making them an ideal source for high-throughput mapping. However, these data sets are extremely noisy due to the variable expression levels of individual genes across the genome at the time of RNA collection, PCR amplification artifacts, sequencing errors, mapping errors, and genome annotation errors. MMAPPR compensates for the noise inherent in RNA-seq data sets to map mutations. MMAPPR encompasses five steps: (1) RNA sequencing and mapping, (2) allele frequency distance calculation, (3) signal processing, (4) candidate single nucleotide polymorphism (SNP) identification, and (5) candidate confirmation. We have created the MMAPPR software package to perform steps 2-4, while steps 1 and 5 involve bench work and preexisting software packages; however, because all five steps are integral to the process, each one is covered in more detail below.

Figure 1. Schematic overview of MMAPPR. (A) Mating scheme. Fish from the mutant line are outcrossed and then F1 progeny, heterozygous for the mutation, are crossed. Pools of phenotypically mutant and phenotypically wild-type fish are then sorted. (B) Schematic representation of allelic segregation between wild-type and mutant pools. (Black) Genome regions inherited from the mutant carrier line; (gray) regions inherited from the outcross line. (Bottom panel) A plot of the expected Euclidean distances calculated from the allele frequencies in the two pools. (C) Flowchart of the analysis steps incorporated in the MMAPPR algorithm. Each phase of the pipeline is shown by gray boxes, and the portion of the pipeline processed by the MMAPPR software package is shown by the vertical gray bar.

RNA sequencing and mapping

Any pool of individuals derived from a cross between two heterozygous carriers of a SNP will contain a Mendelian distribution of genotypes (expected frequencies: 0.25 AA, 0.5 Aa, and 0.25 aa) at every SNP locus where both parents were heterozygous. Similarly, the expected allele frequencies of such a SNP are f_A = 0.5 and f_a = 0.5. However, when the pool of individuals is subdivided into two pools based on a mutant phenotype, the expected allele frequencies for any SNP depend on its linkage to the mutation causing the phenotype. For example, a SNP located 10 …

… data generated for the linkage analysis. Second, gene expression analysis can be done using a number of available tools (USeq, Cufflinks, Bioconductor) to identify putative mutations that reduce mRNA levels, due either to mutations in the coding regions that lead to nonsense-mediated decay or to mutations in gene regulatory regions. Finally, differential splicing can be analyzed using several tools (USeq, SpliceGrapher, SpliceSeq, ISSPLICE). By finding the intersection between the lists generated by these tools and the MMAPPR-identified region, one can generate a robust list of candidate mutations underlying the phenotype in question. The use of these lists will differ on a case-by-case basis.

MMAPPR can be used for a wide variety of model and nonmodel organisms. For any organism, the criteria are a moderately well-assembled genome, a sufficient level of sequence polymorphism, … a sufficient number (~20) of F2 offspring that can be pooled by phenotype for RNA-seq. These F2 offspring do not have to be siblings, but can be generated from crosses of multiple F1 carrier siblings, as we did with the nkx2.5 and tbx1 lines. In organisms that have a genome assembly but not a strong transcriptome annotation, tools are available to build a transcriptome from RNA-seq (Grabherr et al. 2011).

Figure 5. Results for MMAPPR mapping of the zy13 line. (A,B) Genome-wide (A) and chromosome 7 (B) Loess fit results. (*) Location of the mutation. (C) RAD-seq mapping results for chromosome 7. (D) Location of SNP and microsatellite genetic map markers on the Zv9 genome, relating the genetic map position (bottom axis) to the physical map position (top). Axes: chromosome 7 base position (Mb); chromosome 7 genetic position (cM).

Although MMAPPR worked well for the four examples shown here, it has several limitations. As currently implemented, it is unable to directly identify indels that are small enough not to affect overall gene expression levels. MMAPPR is capable of mapping larger deletions (data not shown). It is unable to identify genes that are missing from the reference build or are incorrectly annotated. Finally, it is unable to directly identify the causative lesion if the pooled samples are collected after the gene is no longer expressed or the mutation lies in untranscribed genomic regions. Nonetheless, it is important to note that in each of these cases, MMAPPR will identify the genomic region containing the lesion, and in some cases the affected gene can be identified using differential gene expression or splicing analysis, as described above. Any RNA-seq-based mapping method also cannot be used in cases in which the ability to isolate RNA for library construction is destroyed by tissue fixation or other processes necessary to identify the mutant phenotype.

We suggest several experimental design decisions to help increase one's odds of success. First, RNA should be extracted as soon as possible after onset of the phenotype to increase the likelihood that the causative gene is expressed and captured in the RNA-seq libraries. Second, we found that MMAPPR works with RNA isolated from whole animals or from tissue. Of note, the peaks for tbx1 and zy14, which both fall near each other on chromosome 5, were similar in size, even though tbx1 was identified from isolated tissue RNA-seq and zy14 was identified from whole-embryo RNA-seq. However, we recommend when feasible that RNA isolated from the cells or tissues of interest be used for RNA-seq. In this case, MMAPPR will provide a smaller candidate pool
because it only identifies candidates in the genomic region that are expressed at the right time in the right tissue. Here, the nkx2.5 and tbx1 data sets were derived from hearts dissected from embryonic zebrafish. This tissue still provided a sufficient number of expressed SNPs for accurate mapping and, concurrently, a very small list of candidates within the mapped region (Table 1). Finally, increasing the read depth will increase the likelihood of detecting causative mutations in low-expressed genes. It is commonly recommended to use at least three replicates for accurate differential gene analysis. Based on our results, three biological replicates containing at least 10-20 individuals each should be sufficient for this analysis. These replicates can be barcoded and multiplexed to reduce sequencing costs.

In conclusion, MMAPPR is a robust and effective way to map mutations generated from forward genetic screens or that spontaneously arise in a population. Adoption of this method might remove many of the barriers for researchers hesitant to conduct large-scale mutagenesis screens due to the labor-intensive and tedious mapping process.

Figure 6. MMAPPR results for zy14. (A,B) Genome-wide (A) and chromosome 5 (B) Loess results from mapping the zy14 line. (@) Centromere location; (*) location of the mutation. (C,D) Sanger sequencing traces from wild-type sibling and zy14 mutant, respectively. (Arrow) A single base changed in the mutant; (underlining) the resulting stop codon. (E-H) Fluorescent images of the vascular phenotype visualized using an EGFP transgenic line that marks vasculature. (E) Wild-type zy14 sibling. (F) zy14 mutant. (G) zy14 mutant rescued by injection of wild-type cds2 mRNA. (H) Failure of cds2 mutant mRNA injection to rescue zy14 mutant embryos. In these experiments, uninjected embryos from heterozygote crosses had wild-type phenotypes in 39/51, whereas 109/108 embryos injected with wild-type cds2 mRNA had a wild-type phenotype, indicative of rescue. In contrast, mutant cds2 mRNA with the single-base change seen in zy14 was not able to rescue: 18/26 injected embryos had a wild-type phenotype, compared with 14/20 uninjected siblings.

Methods

Animal care and mating

All fish were kept in the University of Utah Centralized Zebrafish Animal Resource facility or in the Yost Laboratory Zebrafish facility according to IACUC-approved protocols. For the nkx2.5 and tbx1 lines, fish were received from the Yelon laboratory and the Trede laboratory, respectively, and mated into the cmlc2:GFP line maintained on an AB background. Offspring carrying both the cmlc2:GFP and the appropriate mutation were selected and mated. The zy13 and zy14 lines were identified from a standard F2 mutagenesis screen carried out in the AB strain. Mutant lines were subsequently maintained by outcrossing to the WIK zebrafish line (zy13, zy14) and the Tg(fli1:EGFP) and Tg(kdrl:EGFP) lines (zy14) (Lawson and Weinstein 2002; Beis et al. 2005).

RNA collection and sequencing

Offspring from zebrafish matings were raised to 30 hours post-fertilization (hpf) (zy13 and zy14), 48 hpf (nkx2.5), or 72 hpf (tbx1) under standard conditions. These time points represent the earliest stage at which we could confidently identify the phenotypes. Embryos were then segregated into mutant and phenotypically wild-type groups based on morphological phenotype as follows: nkx2.5, enlarged atrium and diminished ventricle; tbx1, loss of heart looping; zy13, pericardial edema; and zy14, loss of intersegmental vessels.
Pools of 20 whole embryos were collected for the zy13 and zy14 lines. For the other two lines, ~500 hearts were isolated as previously described (Geoffrey Burns and MacRae 2008) and placed in TRIzol. Both whole embryos and isolated hearts were processed using TRIzol extraction followed by the QIAGEN RNeasy Mini kit (QIAGEN). Isolated RNA was run on a Bioanalyzer 2100 Pico Chip (Agilent) to confirm RNA quantity and quality and then used to generate cDNA libraries as previously published (Christodoulou et al. 2011) at the Harvard Biopolymers Facility or using the Illumina TruSeq kit (Illumina) at the University of Utah Microarray and Genomic Analysis Shared Resource. Samples were barcoded as wild-type (WT) versus mutant pairs (two barcodes per lane), and single-end 50-bp reads were generated on a HiSeq 2000 machine at the University of Utah Microarray and Genomic Analysis Shared Resource, followed by processing using the Casava 1.6 pipeline. Mapping was done using Novoalign (Novocraft) with default parameters, except that output was set to SAM format and FASTQ scoring to ILMFQ. Mapping was done using the Zv9 zebrafish build with splice junctions derived from the UCSC RefSeq refFlat gene table. Data on the number of reads obtained from the RNA-seq data sets are summarized in Table ….

Data processing

A software implementation of MMAPPR was created using a combination of Python 3 and R and is available at http://yost.genetics.utah.edu/software.php. Briefly, the software package performs the following steps. First, BAM files are passed through the pileup tool in the SAMtools package (Li et al. 2009) to create a pileup file.
The pileup format converts the file to a position-based format showing the bases sequenced at each position. Reads at each position are filtered by the minimum base quality and minimum mapping quality set by the user, and then the frequency of each allele is calculated. These data are subsequently passed to R for signal processing and peak identification. First, the Euclidean distance is calculated at each SNP location using the equation:

$$ED = \sqrt{(A_{mut} - A_{wt})^2 + (C_{mut} - C_{wt})^2 + (G_{mut} - G_{wt})^2 + (T_{mut} - T_{wt})^2}$$

where the letters (A, C, G, T) represent the frequencies of their corresponding bases in the mutant (mut) and wild-type (wt) pools. This distance is then raised to a power set by the user. Next, the data are fit using a Loess curve with a polynomial exponent of … and a span parameter determined by minimizing the AICc. Peak regions are defined as regions where the Loess fitted values are greater than three standard deviations above the genome-wide median. R then plots the Loess fits and returns a list of SNPs within the identified region(s) that are enriched in the mutant pool (an allele frequency >0.75 and a Euclidean distance >0.5). The identified SNPs are then passed to the Alleler program (part of the USeq package) to filter for nonsynonymous SNPs using the provided gene annotation. The position and effect of these SNPs are finally exported to an output file. MMAPPR uses the optimized values reported here as defaults but allows the following variables to be modified by the user: mapping quality (default = 30), base quality (default = 20), minimum read depth (default = 10), power that EDs are raised to (default = 4), and whether repetitive regions are masked (default = not masked).

Causative allele confirmation

ctr9 was chosen from the list of zy13 candidates based on its position relative to the peak of the mapped region and its known function and expression pattern.
Because it was identified as a mutation that might lead to nonsense-mediated decay, it was sequenced to identify a G-A mutation resulting in a nonsense mutation at amino acid 580. The functional significance of the mutation was confirmed by phenotypic rescue using wild…
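
The per-SNP calculations described under Data processing can be illustrated with a minimal Python sketch. It is not the MMAPPR source code: the function names, the dictionary-of-counts input, and the reading of "allele frequency >0.75" as the most frequent allele in the mutant pool are all assumptions made for illustration. The Loess fit itself is not implemented here; `peak_mask` expects values that have already been smoothed. The defaults (power = 4, minimum depth = 10, three standard deviations, 0.75 and 0.5 cutoffs) are the values quoted in the text.

```python
import math
import statistics

BASES = ("A", "C", "G", "T")


def depth(counts):
    """Total number of A/C/G/T reads at one position (hypothetical helper)."""
    return sum(counts.get(base, 0) for base in BASES)


def allele_frequencies(counts):
    """Convert per-base read counts at one position into frequencies."""
    total = depth(counts)
    if total == 0:
        raise ValueError("no reads at this position")
    return {base: counts.get(base, 0) / total for base in BASES}


def euclidean_distance(mut_counts, wt_counts):
    """Distance between the allele-frequency vectors of the two pools."""
    mut = allele_frequencies(mut_counts)
    wt = allele_frequencies(wt_counts)
    return math.sqrt(sum((mut[b] - wt[b]) ** 2 for b in BASES))


def powered_distance(mut_counts, wt_counts, power=4, min_depth=10):
    """ED raised to `power`; None if either pool is below `min_depth`."""
    if min(depth(mut_counts), depth(wt_counts)) < min_depth:
        return None
    return euclidean_distance(mut_counts, wt_counts) ** power


def peak_mask(fitted_values, n_sd=3.0):
    """Flag fitted values more than `n_sd` SDs above the median."""
    threshold = (statistics.median(fitted_values)
                 + n_sd * statistics.stdev(fitted_values))
    return [value > threshold for value in fitted_values]


def is_candidate(mut_counts, wt_counts, min_freq=0.75, min_ed=0.5):
    """Assumed reading of the enrichment filter for SNPs inside a peak."""
    mut = allele_frequencies(mut_counts)
    return (max(mut.values()) > min_freq
            and euclidean_distance(mut_counts, wt_counts) > min_ed)


# A SNP fixed in the mutant pool but at 50% in the wild-type pool.
mutant_pool = {"A": 20}
wild_type_pool = {"A": 10, "G": 10}
print(euclidean_distance(mutant_pool, wild_type_pool))
print(powered_distance(mutant_pool, wild_type_pool))
```

Raising the distance to a power shrinks small, noise-driven distances much faster than large ones, which is presumably why the power is exposed as a user-set parameter.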
