PP-604 assignment 1

Unit – I : Gene Discovery
Indira Gandhi Krishi Vishwavidyalaya Raipur

(Chhattisgarh)
An Assignment on
Unit - I
“Gene Discovery: Finding Genes in Complex Plant System, Constructing Gene-
Enriched Plant Genomic Libraries, In Silico Prediction of Plant Gene Function,
Quantitative Trait Locus Analysis as a Gene Discovery Tool”
Course No : PP-604
Credit : 3(1+2)
Session : 2017-18
“Functional Genomics and Gene Associated With Few Physiological Processes”

SUBMITTED TO: Dr. Alice Tirkey and Dr. Arti Guhey
Faculty, Dept. of Plant Physiology, College of Agriculture, IGKV, Raipur
SUBMITTED BY:
Mr. Pawankumar S. Kharate
(PhD Scholar)
Department of Plant Molecular
Biology and Biotechnology
College of Agriculture, Raipur (C.G.)
Functional Genomics and Gene Associated With Few Physiological Processes Page 1
Index
Page
S. No. Content
No.
1. Finding genes in complex plant systems 1
2. Constructing Gene-Enriched Plant Genomic Libraries 2
3. In silico Prediction of Plant Gene Function 6
4. Quantitative Trait Locus Analysis as a Gene Discovery 12
Tool
5. Construction of linkage maps 12
6. Production of a Mapping populations 14
7. Identification of polymorphism 16
8. Linkage analysis of markers 17
9. Genetic distance and mapping functions 18
10. Identification of QTLs 19
11. Methods to detect QTLs 21
12. Understanding interval mapping results 22
13. Reporting and describing QTLs detected from interval 23

mapping
14. Factors influencing the detection of QTLs 23
15. QTL Information – Utility and Prospects 25
16. References 26
FINDING GENES IN COMPLEX PLANT SYSTEMS

Constructing Gene-Enriched Plant Genomic Libraries
Constructing Gene-Enriched Plant Genomic Libraries Using Methylation Filtration
Technology
Full genome sequencing in higher plants is a very difficult task, because their
genomes are often very large and repetitive. For this reason, gene targeted partial genomic
sequencing becomes a realistic option. The method reported here is a simple approach to
generate gene enriched plant genomic libraries called methylation filtration. This technique
takes advantage of the fact that repetitive DNA is heavily methylated and genes are
hypomethylated. Then, by simply using an Escherichia coli host strain harboring a wild-type
modified cytosine restriction (McrBC) system, which cuts DNA containing methylcytosine,
repetitive DNA is eliminated from these genomic libraries, while low copy DNA (i.e., genes)
is recovered. To prevent cloning significant proportions of organelle DNA, a crude nuclear
preparation must be performed prior to purifying genomic DNA. Adaptor mediated cloning
and DNA size fractionation are necessary for optimal results.
Highly accurate full genomic sequencing like that performed for example in
Saccharomyces cerevisiae and Caenorhabditis elegans has proven to be an invaluable
resource to accelerate all areas of biological research. In particular in plants, the Arabidopsis
thaliana genome sequence has been deciphered, meeting the highest standards of accuracy.
Undoubtedly, the availability of this information had an immense impact not only in the
Arabidopsis community, but in research in all other plant systems as well. Unfortunately, the
production of such a high quality genomic resource is not an easy task. It implies a significant
amount of sequence redundancy only achievable by producing a huge number of sequence
reads. Such reads are assembled and processed to produce as long contiguous stretches as
possible, called contigs. In order to link these contigs in the right order and orientation, a
large insert genomic library (using bacterial artificial chromosome [BAC] or P1-derived
artificial chromosome [PAC] vectors) needs to be constructed, at least partially sequenced,
and physically mapped. A major obstacle to obtain the complete and accurate sequence of a
complex (i.e., eukaryote) genome is the presence of large amounts of repetitive DNA. This
DNA is composed of satellite DNA, transposons and retrotransposons, among other repeats,
which often show a high degree of sequence conservation. For this reason, the computer
software designed to assemble random sequence reads fails to build correct contigs of
repetitive sequences, usually assembling most members of a repeat family in a single contig,
regardless of their actual location in the genome. In the early 1980s by the time the idea of
sequencing the human genome was opened to discussion for the first time, Putney et al.
reported a method that allowed to discover new genes simply by cDNA sequencing, later
called expressed sequence tag (EST) sequencing.
This widely used technique allows obtaining gene sequence information getting
around the problem of sequencing repetitive DNA. However, the EST approach has two main
limitations. The first is the redundancy of cDNA libraries. Some cDNAs are often
overrepresented and will be sequenced many times before a cDNA corresponding to a weakly
expressed gene is found. The second limitation is the partial representation due to the tissue-
specific and developmental regulation of gene expression. Some genes are expressed only in
certain tissues or cells, and some are developmentally regulated. In order to recover the
corresponding ESTs, libraries from several different tissues and developmental stages need to
be constructed. Another although minor, disadvantage of EST sequencing is that repetitive
elements are often transcribed and thus included in EST collections. One way to solve the
problem of the redundancy is to use normalized libraries. Normalization techniques are based
on reassociation kinetics and have been improved to avoid the elimination of members of
gene families. However, it is not trivial to obtain a normalized library where representation is
acceptable. Regardless of these limitations, EST projects are being conducted for many
organisms and are a key tool for gene discovery, annotation of genes, cross-species
comparative analysis, and definition of intron–exon boundaries among many other uses. In
particular for plants, ESTs have been the alternative to full genome sequence, because the
genomes of many plants, often important crop species, are very large and repetitive. Usually,
the genome size (or subgenome size in the case of polyploids) correlates with the proportion
of repetitive DNA. It has been proposed that all diploid higher plant genomes share
essentially the same set of genes, called the “gene space”. Then, the bigger the genome, the
higher sequencing cost per gene, due to the amount of nongenic (e.g., repetitive) DNA that
needs to be sequenced before reaching a gene. The conservation of coding sequences across
different species allows identifying genes simply by comparing two different genomes.
Frequently, gene modeling software fails to identify genes that can be spotted with this
comparative genomics approach.
Furthermore, once the complete genomic sequence is obtained for one organism, it
can be compared to a draft (lowly redundant and discontinuous) sequence of a related
organism. This approach yields a lot of new information for both species under analysis. The
additional advantage of genomic vs cDNA sequencing in terms of representation makes the
lowly redundant genomic sequencing a cost-effective process. In the case of plants however,
the large genome sizes prevent the pursuit of full or even draft genomic sequencing projects.
For these reasons, alternatives to obtain genomic sequences enriched in genes avoiding the
repetitive DNA have been developed. In maize for example, the very active transposon
Mutator shows a strong bias to insert in low copy DNA (i.e., genes). By generating large
Mutator-induced insertional mutagenesis, it is possible to collect genomic sequences flanking
transposon insertion sites, which will mainly correspond to genes. Although Mutator
insertions may not be completely at random in the genome, it can be a good complement to
an EST project. Another alternative for gene enriched genomic sequencing of plants is the
methylation filtration technique, which takes advantage of the fact that most of the repetitive
elements in plants are heavily methylated, while genes are hypomethylated. Because of their
methylation status, repeats are sensitive to bacterial restriction-modification systems, in
particular the Mcr system, which includes two restriction enzymes: McrA and McrBC.
McrBC recognizes DNA containing 5-methylcytosine preceded by a purine. Restriction
requires two of these sites separated by 40–2000 nucleotides. Such recognition sites are very
frequent in any methylated genomic DNA. Thus, by the selecting a mcrBC+ Escherichia coli
host strain, repetitive DNA can be largely excluded from genomic shotgun libraries,
preserving the low copy DNA. Basically, methylation filtration consists in shearing and size
fractionation of genomic DNA to select fragments smaller than the estimated size of the
genes. Larger fragments have a high probability of including some portion of repetitive DNA,
which would be methylated and thus counter-selected in the filtered library. On the other
hand, if fragments are too small, there are more chances to recover small fragments of
repetitive DNA with low GC content. Such fragments may be poor in methylated sites
susceptible to restriction by McrBC and then can be frequently recovered in filtered libraries.
The selected fragments are then end-repaired and cloned into a standard sequencing vector.
Subsequently, the ligation is introduced in a mcrBC+ E. coli host. The recombinant clones
isolated after plating are picked for automatic sequencing. The same ligation mixture can be
transformed into a mcrBC- E. coli strain to obtain an unfiltered control library. The technique
works very well for maize, and there is evidence that it works for many other plants. The
advantage of methylation-filtered libraries vs cDNA and transposon insertion libraries is that
there is no bias towards a certain region of the genome or a given fraction of the genes. It is
possible though, that methylated genes are not recovered in filtered libraries. However, gene
methylation is often restricted to defined regions of the gene, mainly the ends. This would
allow tonclone at least most of the coding sequence of methylated genes. Furthermore, genes
that are regulated by methylation may become demethylated during different developmental
stages. In these cases, the construction of methylation filtered libraries from a couple of
developmental stages of a given plant would likely overcome the problem. For larger scale
projects, another problem is posed by the cloning efficiency. In plants with very large
genomes, repetitive DNA may account for more than 90% of the nuclear DNA. Then, most of
the DNA is likely to be methylated leaving a very small fraction of the genome to be
recovered in methylation-filtered libraries. As a result, the number of recombinant clones
recovered after plating a filtered library may be <10% of the number of clones obtained in the
corresponding unfiltered control library. Furthermore, the proportion of nonrecombinant
background (blue colonies) may become significant. The use of adaptors often improves the
cloning efficiency in addition to reduce the formation of chimerical clones.
The cloning protocol presented here uses three-nucleotide overhang adaptors and a
compatible sticky-end vector made by filling in one nucleotide in the fournucleotide 5'
overhang generated by a restriction nuclease. The advantage of using three- vs four-
nucleotide overhang is that the nonrecombinant background is highly reduced because the
vector ends become incompatible.
Methods
Nuclear DNA Preparation: Plastids are very abundant, not only in green tissues, and
their DNA is unmethylated. Thus, if chloroplast DNA is present in a DNA sample, it will be
selected during the filtering process. For this reason, it is important to purify nuclei from the
rest of the cell organelles before purifying the genomic DNA- DNA Shearing and End-
Repairing- Adaptor Ligation- Vector Preparation- Preparation of Electrocompetent
JM107 or JM107MA2 Cells- Ligation- Electroporation: Electroporation cuvettes 0.1 cm,
Electroporator (Bio-Rad)- Checking the Average Library Insert Size by Colony PCR.
RescueMu Protocols for Maize Functional Genomics

RescueMu is a modified Mu1 transposon transformed into maize to permit
mutagenesis and subsequent recovery of mutant alleles by plasmid rescue. RescueMu
elements insert late in the germline as well as in terminally dividing somatic (e.g., leaf) cells.
Germinal insertions may result in a mutant phenotype, and RescueMu permits recovery of 5–
25 kb of transposon-flanking genomic DNA without having to construct and screen genomic
DNA libraries. Late somatic insertions of RescueMu do not result in a visible phenotype, but
they are instead used to construct plasmid libraries of gene-enriched maize genomic DNA to
facilitate the identification and sequencing of the euchromatic portion of the maize genome.
This is because maize leaves contain abundant independent RescueMu somatic insertions, and
70–90% of these insertions occur preferentially into genes and not repetitive DNA. This
chapter describes detailed protocols on how to obtain, generate, and use RescueMu for maize
genomics, including resources developed by the Maize Gene Discovery Project (MGDP)
consortium available online at ZmDB.
Precious Cells Contain Precious Information
Expression analysis, often encompassed in the term “functional genomics,” is the link
between physiology and molecular biology. Often, specific physiological changes in plant
development are due to a limited number of genes, expressed exclusively in very few cells of
an organ or organism. Compounding the situation, these physiological changes may also be
transient. Therefore, searching for the responsible genes, though exciting and necessary to
understand important processes is hindered primarily by the scarcity of “precious” cells in the
desired physiological state. Used judiciously, molecular methods such as reverse transcription
polymerase chain reaction (RT-PCR), microarray analysis, or subtractive hybridization allow
analysis of rare or special cells. Each of these methods has its advantages and pitfalls.
Working with precious cells entails special biological strategies to avoid excessive work in
obtaining the data and misinterpretation of it. To illustrate the logic and methods involved in
working with precious cells–tissues, we describe how subtractive hybridization followed by
expressed sequence tag (EST) sequencing can be used to search for a few genes specific to a
few available cells.
IN SILICO PREDICTION OF PLANT GENE FUNCTION

Computer Software to Find Genes in Plant Genomic DNA
Gene finding is the most important phase of genome annotation. Eukaryotic genomes
contain thousands of protein coding genes, and computational gene prediction would rapidly
increase the pace of experimental confirmation of expressed genes at the bench. The purpose
of this information is to discuss the use of different computer programs that identify protein-
coding genes in large genomic sequences. We describe most commonly used gene prediction
programs that are available on the World Wide Web and demonstrate the use of some of
these programs by an example. We provide a list of these programs along with their Web
uniform resource locators (URLs) and suggest guidelines for successful gene finding.
The human and Arabidopsis genome projects and the advancement of sequencing
technologies within the last decade are driving many other genome projects. The complete
genome sequences of more than 800 organisms (many microbes, fungi, plants, and animals)
are either complete or being sequenced (http://www.ncbi.nlm.nih.gov). One of the primary

goals of any genome project is to provide a single continuous sequence for each of the
chromosomes and demarcate the positions of all genes (Fig. 1A), along with the annotation of
each component of a gene (Fig. 1B). Furthermore, recent advances in high-throughput
technologies, such as genome-wide micro-array expression analysis, are starting to provide
greater insights into the transcript tional regulation of eukaryotic cell. Integrating the genome
sequence information (e.g., gene promoters) and microarray expression data would provide
an initial link to functional genomics. The identification and annotation of genes at genome
level will contribute to the understanding of genome-wide gene expression studies. The major
focus of this chapter is to introduce different bioinformatics tools that identify genes in
genomic sequences. Gene, defined as a transcribed unit, is usually split into pieces (called
exons) that are separated by intervening sequences (called introns) in the eukaryotic genomes
(Fig. 1B). The identification of genes by computational approaches is relatively
straightforward for organisms with compact genomes (such as bacteria and yeast), because
exons tend to be large, and the introns are either nonexistent or short. The challenge is much
greater for larger genomes (such as those of rice or maize), because the exonic “signal” is
buried under nongenic “noise.” In the past few years, the accuracy and reliability of
computational gene finding programs have improved to a reasonable extent, such that gene
predictions within a genomic region can give valuable guidance to more detailed
experimental studies. Computational sequence analysis methods, which detect genes in
genomic DNA, can be broadly classified into two main categories: homology-based methods,
and ab initio method.
Materials
User must have access to a computer with Internet access, e.g., a personal computer
(PC) running Microsoft® Windows™ or Linux, an Apple® Macintosh®, or a UNIX®
workstation. The user should be familiar with the use of Netscape Navigator or Microsoft
Internet Explorer. The list of commonly used gene finding and sequence alignment programs
and their Web uniform resource locators (URLs) are provided in Table 1.
Gene Prediction by Homology-Based Methods

Sequence homology is a very powerful type of evidence used to detect functional
elements in genomic sequences. The homology-based methods to detect genes use either
intraspecies or interspecies sequence comparison in at least four different ways, as
summarized below. Comparison with Expressed Sequence Tags/cDNA Database A direct
comparison of a genomic sequence (query) with expressed sequence tags (ESTs) or cDNA
(Fig. 2) can identify regions of the query sequence that correspond to processed mRNA.
BLASTN (6) is a common program that iden tifies similar nucleotide sequences that exist in
the databases (nr/EST) to the query sequence. BLASTN algorithm finds similar sequences by
generating an indexed table or dictionary of short subsequences called words for both the
query and the database (see Basic Local Alignment Search Tool [BLAST] help at
[http://www.ncbi.nlm.nih.gov/BLAST] for further details). For identification of gene regions
in the query sequence, choose low complexity repeat filter and select expected value as 0.1. If
the query sequence is very long MegaBLAST is a better choice, as it is specifically designed
to efficiently find long alignments between very similar sequences. MegaBLAST is also
optimized for aligning sequences that differ slightly as a result of sequencing errors. The user
can select different options. We suggest the use of expected value (e-value) of 0.1 and choose
filter for low complexity repeats. When larger word size is used (default value is 28), it
increases the search speed and limits the number of database hits. For BLASTN, the word
size can be reduced from the default value of 11 to a minimum of 7 to increase sensitivity.
BLASTN is mainly used to pull out similar sequences from the database, and most of the
times it is hard to interpret the exon boundaries. After finding a cDNA or EST match to the
query sequence, one can use spliced alignment programs such as SIM4, which efficiently
aligns an EST or cDNA with the genomic sequence. RiceHMM is another program that
predicts gene domains in rice genome sequence, based on a Hidden Markov Model using a
database of rice ESTs, composed of nearly 15,000 cDNAs.
Comparison with Protein Sequence Databases

Comparison of genomic sequence with protein sequence database by programs, such
as BLASTX, can identify probable protein coding regions. Subsequently, spliced alignment
programs such as Genewise, GeneSeqer, or PROCRUSTES can be used to find the gene
structure by comparing the genomic DNA sequence to the target protein sequences. These
programs derive an optimal alignment based on sequence similarity score of the predicted
gene product to the protein sequence and intrinsic splice site strength of the predicted introns.
Comparison of a Translated Genomic Sequence with Translated Nucleotide Database

A comparison of a translated genomic sequence with nucleotide database, which has
been translated in all six reading frames, using TBLASTX can identify similarities among
protein coding regions. TBLASTX can be run by selecting “Nucleotide query—Translated db
[tblastx]” option from the BLAST Web page. TBLASTX takes a nucleotide query sequence,
translates it in all six frames, and compares the translations to a nucleotide database (e.g., nr,
est, est_human, est_others, etc.) sequences that are dynamically translated in all six frames.
Comparison of Genomic Sequence with Homologous Genomic Sequences from Related Species
Protein coding DNA from closely related plant species, such as sorghum and maize,
show considerable sequence similarity. With the availability of genomes of many different
organisms, comparative genomic approaches are gaining importance. VISTA/AVID and
PipMaker can be used to compare large genomic sequences to find orthologous genomic
sequences from closely related species. For example, sequence analysis of orthologous genes
from rice, maize, and sorghum showed that the exons are more conserved than introns. The
degree of sequence conservation, in terms of sequence identity, nacross species has been
shown to be consistent with the divergence timesn of the respective species. The rice genes
are considerably more diverged than their counterparts in maize and sorghum. For gene
prediction programs, it would be best to compare two genomes that are very closely related,
but distant enough that their intergenic repeat elements differ significantly. As a rule of
thumb, consider two species as closely related, if those two are diverged within the last 25
million yr. For example, maize and sorghum are closely related species as they were diverged
15–20 million yr ago. If homologous genomic sequences from two species are known, then a
recently developed gene prediction tool called SGP-1 can be used to find protein-coding
genes.
Worked Example
We discussed various gene-finding strategies in the previous sections. Now let us
discuss which programs to choose and how to use those programs in a real practical scenario.
Given a large genomic sequence, we suggest the following steps in arriving at probable exons
that the sequence may contain.
1) Blast the sequence against nr and EST databases by using BLASTN (Megablast in
case of very long sequence) program. Note the list of accession numbers of cDNAs or
ESTs with “% identity” score ≥ 99, from the blast output.
2) Use SIM4 program to align each of the cDNA/ESTs with the genomic sequence so as
to identify exons with canonical splice sites.
3) Blast the sequence against nr database by using the BLASTX program. From the
output, note down the BLASTX matches that may belong to genes.
4) Submit the sequence to at least 4 different gene prediction programs and select the
consensus predictions (exons). We consider a prediction as consensus prediction if it
is predicted by at least half of the programs either fully (both ends of the predicted
exons are same) or partially (there exists an overlapping region among the predicted
exons).
To demonstrate the above steps, we use the genomic sequence in rice bacterial artificial
chromosome (BAC) in GenBank® with Accession no. AP005190, which has not yet been
annotated at the time writing of this review. Since the length of the sequence is very large
(138,893 bp), we used Megablast to identify the homologous sequences from the GenBank.
The program was run twice by choosing nr and EST databases. We use the list of high
scoring segment pairs (HSPs) from the Megablast output. As BLAST is mainly a sequence
similarity program, it helps us to identify the regions in the input sequence (query sequence)
that are similar to known sequences (subject sequences) in the database. As the output
suggests, it is hard to interpret the gene structure (exon–intron boundaries) from the output.
Hence, we ran SIM4 program to align each of the EST/cDNA sequences (from the output of
Megablast) with the genomic sequence AP005190. The list of exons inferred by combining
various EST/cDNA alignments with AP005190 using SIM4. BLASTX output used. The
values in columns query start and query end would give the regions in the genomic sequence
AP005190 that may belong to probable genes. Finally, we submitted the genomic sequence
AP005190 to four gene-finding programs Genscan with Arabidopsis model, Genscan with
maize model, GeneMark.hmm with rice model, and MZEF with Arabidopsis model. Default
values were selected for other parameters for each of the programs used. As none of the
programs is good enough to predict the complete gene structure, we considered only the exon
predictions. We compiled the list of all consensus exons that were predicted by at least two
programs. We consider an exon as a consensus prediction if there exists an overlapping
region among the predictions of at least two different programs.
Quantitative Trait Locus Analysis as a Gene Discovery Tool

Introduction
Many agriculturally important traits such as yield, quality and some forms of disease
resistance are controlled by many genes and are known as quantitative traits (also
‘polygenic,’ ‘multifactorial’ or ‘complex’ traits). The regions within genomes that contain
genes associated with a particular quantitative trait are known as quantitative trait loci
(QTLs). The identification of QTLs based only on conventional phenotypic evaluation is not
possible. A major breakthrough in the characterization of quantitative traits that created
opportunities to select for QTLs was initiated by the development of DNA (or molecular)
markers in the 1980s. One of the main uses of DNA markers in agricultural research has been
in the construction of linkage maps for diverse crop species. Linkage maps have been utilised
for identifying chromosomal regions that contain genes controlling simple traits (controlled
by a single gene) and quantitative traits using QTL analysis. The process of constructing
linkage maps and conducting QTL analysis–to identify genomic regions associated with
traits–is known as QTL mapping (also ‘genetic,’ ‘gene’ or ‘genome’ mapping). DNA
markers that are tightly inked to agronomically important genes (called gene ‘tagging’) may
be used as molecular tools for marker-assisted selection (MAS) in plant breeding. MAS
involves using the presence/absence of a marker as a substitute for or to assist in phenotypic
selection, in a way which may make it more efficient, effective, reliable and cost-effective
compared to the more conventional plant breeding methodology. The use of DNA markers in
plant (and animal) breeding has opened a newrealm in agriculture called ‘molecular
breeding’.
Construction of linkage maps

A linkage map may be thought of as a ‘road map’ of the chromosomes derived from
two different parents. Linkage maps indicate the position and relative genetic distances
between markers along chromosomes, which is analogous to signs or landmarks along a
highway. The most important use for linkage maps is to identify chromosomal locations
containing genes and QTLs associated with traits of interest; such maps may then be referred
to as ‘QTL’ (or ‘genetic’) maps. ‘QTL mapping’ is based on the principle that genes and
markers segregate via chromosome recombination (called crossing-over) during meiosis (i.e.
sexual reproduction), thus allowing their analysis in the progeny. Genes or markers that are
close together or tightly-linked will be transmitted together from parent to progeny more
frequently than genes or markers that are located further apart (Figure 3).
Figure 3. Diagram indicating cross-over or recombination events between homologous

chromosomes that occur during meiosis
In a segregating population, there is a mixture of parental and recombinant

genotypes. The frequency of recombinant genotypes can be used to calculate recombination
fractions, which may by used to infer the genetic distance between markers. By analyzing the
segregation of markers, the relative order and distances between markers can be determined–
the lower the frequency of recombination between two markers, the closer they are situated
on a chromosome (conversely, the higher the frequency of recombination between two
markers, the further away they are situated on a chromosome). Markers that have a
recombination frequency of 50% are described as ‘unlinked’ and assumed to be located far
apart on the same chromosome or on different chromosomes. For a more detailed explanation
of genetic linkage, the reader is encouraged to consult basic textbooks on genetics or
quantitative genetics. Mapping functions are used to convert recombination fractions into
map units called centi- Morgans (cM) (discussed later). Linkage maps are constructed from
the analysis of many segregating markers. The three main steps of linkage map construction
are:
1. Production of a mapping population
2. Identification of polymorphism
3. Linkage analysis of markers
1. Production of a Mapping populations

The construction of a linkage map requires a segregating plant population (i.e. a
population derived from sexual reproduction). The parents selected for the mapping
population will differ for one or more traits of interest. Population sizes used in preliminary
genetic mapping studies generally range from 50 to 250 individuals, however larger
populations are required for high-resolution mapping. If the map will be used for QTL
studies, then an important point to note is that the mapping population must be
phenotypically evaluated (i.e. trait data must be collected) before subsequent QTL mapping.
Generally in self-pollinating species, mapping populations originate from parents that are
both highly homozygous (inbred). In cross pollinating species, the situation is more
complicated since most of these species do not tolerate inbreeding. Many cross pollinating
plant species are also polyploid (contain several sets of chromosome pairs). Mapping
populations used for mapping cross pollinating species may be derived from a cross between
a heterozygous parent and a haploid or homozygous parent (Figure 4). F2 populations,
derived from F1 hybrids, and backcross (BC) populations, derived by crossing the F1 hybrid
to one of the parents, are the simplest types of mapping populations developed for self-
pollinating species. Their main advantages are that they are easy to construct and require only
a short time to produce. Inbreeding from individual F2 plants allows the construction of
recombinant inbred (RI) lines, which consist of a series of homozygous lines, each containing
a unique combination of chromosomal segments from the original parents. The length of time
needed for producing RI populations is the major disadvantage, because usually six to eight
generations are required. Doubled haploid (DH) populations may be produced by
regenerating plants by the induction of chromosome doubling from pollen grains, however,
the production of DH populations is only possible in species that are amenable to tissue
culture (e.g. cereal species such as rice, barley and wheat). The major advantages of RI and
DH populations are that they produce homozygous or ‘true-breeding’ lines that can be
multiplied and reproduced without genetic change occurring. This allows for the conduct of
replicated trials across different locations and years. Thus both RI and DH populations
represent ‘eternal’ resources for QTL mapping. Furthermore, seed from individual RI or DH
lines may be transferred between different laboratories for further linkage analysis and the
addition of markers to existing maps, ensuring that all collaborators examine identical
material.
Figure 4. Diagram of main types of mapping populations for self-pollinating species
2. Identification of polymorphism
The second step in the construction of a linkage map is to identify DNA markers that
reveal differences between parents (i.e. polymorphic markers). It is critical that sufficient
polymorphism exists between parents in order to construct a linkage map. In general, cross
pollinating species possess higher levels of DNA polymorphism compared to inbreeding
species; mapping in inbreeding species generally requires the selection of parents that are
distantly related. In many cases, parents that provide adequate polymorphism are selected
on the basis of the level of genetic diversity between parents.
The choice of DNA markers used for mapping may depend on the availability of
characterized markers or the appropriateness of particular markers for a particular species.
Once polymorphic markers have been identified, they must be screened across the entire
mapping population, including the parents (and F1 hybrid, if possible). This is known as
marker ‘genotyping’ of the population. Therefore, DNA must be extracted from each
individual of the mapping population when DNA markers are used. Examples of DNA
markers screened across different populations are shown in Figure 5.
Figure 5. Gel photos representing segregating codominant markers (left side) and
dominant markers (right side) for typical mapping populations
3. Linkage analysis of markers

The final step of the construction of a linkage map involves coding data for each
DNA marker on each individual of a population and conducting linkage analysis using
computer programs (Figure 6). Missing marker data can also be accepted by mapping
programs Although linkage analysis can be performed manually for a few markers, it is not
feasible to manually analyze and determine linkages between large numbers of markers that
are used to construct maps; computer programs are required for this purpose. Linkage
between markers is usually calculated using odds ratios (i.e. the ratio of linkage versus no
linkage). This ratio is more conveniently expressed as the logarithm of the ratio, and is called
a logarithm of odds (LOD) value or LOD score. LOD values of >3 are typically used to
construct linkage maps.
Figure 6. Construction of a linkage map based on a small recombinant inbred

population
A LOD value of 3 between two markers indicates that linkage is 1000 times more
likely (i.e. 1000:1) than no linkage (null hypothesis). LOD values may be lowered in order to
detect a greater level of linkage or to place additional markers within maps constructed at
higher LOD values. Commonly used software programs include Mapmaker/ EXP and
MapManager QTX and JoinMap.
A typical output of a linkage map is shown in Figure 7. Linked markers are grouped
together into ‘linkage groups,’ which represent chromosomal segments or entire
chromosomes. Referring to the road map analogy, linkage groups represent roads and
markers represent signs or landmarks. A difficulty associated with obtaining an equal number
of linkage groups and chromosomes is that the polymorphic markers detected are not
necessarily evenly distributed over the chromosome, but clustered in some regions and absent
in others. In addition to the non-random distribution of markers, the frequency of
recombination is not equal along chromosomes. The accuracy of measuring the genetic
distance and determining marker order is directly related to the number of individuals studied
in the mapping population. Ideally, mapping populations should consist of a minimum of 50
individuals for constructing linkage maps.
Genetic distance and mapping functions

Genetic distance and mapping functions The importance of the distance between
genes and markers has been discussed earlier. The greater the distance between markers,
the greater the chance of recombination occurring during meiosis. Distance along a
linkage map is measured in terms of the frequency of recombination between genetic
markers. Mapping functions are required to convert recombination fractions into centi
Morgans (cM) because recombination frequency and the frequency of crossing-over are not
linearly related. When map distances are small (<10 cM), the map distance equals the
recombination frequency. However, this relationship does not apply for map distances that
are greater than 10 cM. Two commonly used mapping functions are the Kosambi mapping
function, which assumes that recombination events influence the occurrence of adjacent
recombination events, and the Haldane mapping function, which assumes no interference
between crossover events. It should be noted that distance on a linkage map is not directly
related to the physical distance of DNA between genetic markers, but depends on the genome
size of the plant species. Furthermore, the relationship between genetic and physical distance
varies along a chromosome. For example, there are recombination ‘hot spots’ and ‘cold
spots,’ which are chromosomal regions in which recombination occurs more frequently or
less frequently, respectively.
Figure 7. Hypothetical ‘framework’ linkage map of five chromosomes (represented by linkage

groups) and 26 markers
Identification of QTLs
Principle of QTL analysis Identifying a gene or QTL within a plant genome is like
finding the proverbial needle in a haystack. However, QTL analysis can be used to divide the
haystack in manageable piles and systematically search them. In simple terms, QTL analysis
is based on the principle of detecting an association between phenotype and the genotype of
markers. Markers are used to partition the mapping population into different genotypic
groups based on the presence or absence of a particular marker locus and to determine
whether significant differences exist between groups with respect to the trait being measured.
A significant difference between phenotypic means of the groups (either 2 or 3), depending
on the marker system and type of population, indicates that the marker locus being used to
partition the mapping population is linked to a QTL controlling the trait (Figure 8).
Figure 8. Principle of QTL mapping. Markers that are linked to a gene or QTL
controlling a particular trait (e.g. plant height)
A logical question that may be asked at this point is why does a significant P value
obtained for differences between mean trait values indicate linkage between marker and
QTL?’ The answer is due to recombination. The closer a marker is from a QTL, the
lower the chance of recombination occurring between marker and QTL. Therefore, the
QTL and marker will be usually be inherited together in the progeny, and the mean of the
group with the tightly-linked marker will be significantly different (P < 0.05) to the mean of
the group without the marker. When a marker is loosely-linked or unlinked to a QTL, there is
independent segregation of the marker and QTL. In this situation, there will be no significant
difference between means of the genotype groups based on the presence or absence of the
loosely linked marker. Unlinked markers located far apart or on different chromosomes to the
QTL are randomly inherited with the QTL; therefore, no significant differences between
means of the genotype groups will be detected.
Examples of quantitative traits are:

• Days to 50 % flowering • Grain length (mm)
• Plant height (cm) • Grain Width (mm)
• Number of tillers • Yield (kg/ha)
• Number of effective tillers • Insect resistance (BPH resistance)
• Panicle length (cm) • Abiotic stresses ( Drought tolerance, salinity
• 1000 Grain weight (g) tolerance and Sub mergence Tolerance)
Methods to detect QTLs

Three widely-used methods for detecting QTLs are single-marker analysis, simple
interval mapping and composite interval mapping. Single-marker analysis (also ‘single-
point analysis’) is the simplest method for detecting QTLs associated with single markers.
The statistical methods used for single-marker analysis include t-tests, analysis of variance
(ANOVA) and linear regression. Linear regression is most commonly used because the
coefficient of determination (R2) from the marker explains the phenotypic variation arising
from the QTL linked to the marker. This method does not require a complete linkage map
and can be performed with basic statistical software programs. However, the major
disadvantage with this method is that the further a QTL is from a marker, the less likely it
will be detected. This is because recombination may occur between the marker and the QTL.
This causes the magnitude of the effect of a QTL to be underestimated. The use of a large
number of segregating DNA markers covering the entire genome (usually at intervals less
than 15 cM) may minimize both problems. The results from single-marker analysis are
usually presented in a table, which indicates the chromosome (if known) or linkage group
containing the markers, probability values, and the percentage of phenotypic variation
explained by the QTL (R2). Sometimes, the allele size of the marker is also reported. QGene
and MapManager QTX are commonly used computer programs to perform single-marker
analysis. The simple interval mapping (SIM) method makes use of linkage maps and analyses
intervals between adjacent pairs of linked markers along chromosomes simultaneously,
instead of analyzing single markers. The use of linked markers for analysis compensates for
recombination between the markers and the QTL, and is considered statistically more
powerful compared to single-point analysis. Many researchers have used MapMaker/QTL.
and QGene to conduct SIM. More recently, composite interval mapping (CIM) has become
popular for mapping QTLs. This method combines interval mapping with linear regression
and includes additional genetic markers in the statistical model in addition to an adjacent pair
of linked markers for interval mapping. The main advantage of CIM is that it is more precise
and effective at mapping QTLs compared to single-point analysis and interval mapping,
especially when linked QTLs are involved. Many researchers have used QTL Cartographer,
MapManager QTX and PLABQTL to perform CIM.
Understanding interval mapping results

Interval mapping methods produce a profile of the likely sites for a QTL between
adjacent linked markers. In other words, QTLs are located with respect to a linkage map.
The results of the test statistic for SIM and CIM are typically presented using a logarithmic
of odds (LOD) score or likelihood ratio statistic (LRS). There is a direct one-to-one
transformation between LOD scores and LRS scores (the conversion can be calculated by:
LRS = 4.6 × LOD). These LOD or LRS profiles are used to identify the most likely position
for a QTL in relation to the linkage map, which is the position where the highest LOD value
is obtained. A typical output from interval mapping is a graph with markers comprising
linkage groups on the x axis and the test statistic on the y axis (Figure 9).
Figure 9. Hypothetical output showing a LOD profile for chromosome 4. The dotted line
represents the significance threshold determined by permutation tests
The peak or maximum must also exceed a specified significance level in order for the
QTL to be declared as ‘real’ (i.e. statistically significant). The determination of significance
thresholds is most commonly performe using permutation tests. Briefly, the phenotypic
values of the population are ‘shuffled’ whilst the marker genotypic values are held constant
(i.e. all marker-trait associations are broken) and QTL analysis is performed to assess the
level of false positive marker-trait associations. This process is then repeated (e.g. 500 or
1000 times) and significance levels can then be determined based on the level of false
positive marker-trait associations. Before permutation tests were widely accepted as an
appropriate method to determine significance thresholds, a LOD score of between 2.0 to 3.0
was usually chosen as the significance threshold.
Reporting and describing QTLs detected from interval mapping

The most common way of reporting QTLs is by indicating the most closely linked
markers in a table and/or as bars (or oval shapes or arrows) on linkage maps. The
chromosomal regions represented by rectangles are usually the region that exceeds the
significance threshold (Figure 10). Usually, a pair of markers – the most tightly-linked
markers on each side of a QTL – is also reported in a table; these markers are known as
‘flanking’ markers. The reason for reporting flanking markers is that selection based on two
markers should be more reliable than selection based on a single marker. The reason for the
increased reliability is that there is a much lower chance of recombination between two
markers and QTL compared to the chance of recombination between a single marker and
QTL.
Factors influencing the detection of QTLs

There are many factors that influence the detection of QTLs segregating in a
population. The main ones are genetic properties of QTLs that control traits, environmental
effects, population size and experimental error. Genetic properties of QTLs controlling traits
include the magnitude of the effect of individual QTLs. Only QTLs with sufficiently large
phenotypic effects will be detected; QTLs with small effects may fall below the significance
threshold of detection. Another genetic property is the distance between linked QTLs. QTLs
that are closely-linked (approximately 20 cM or less) will usually be detected as a single QTL
in typical population sizes (<500).
Figure 10. QTL mapping of height and disease resistance traits. Hypothetical QTLs
were detected on chromosomes 2, 3, 4 and 5. A major QTL for height was detected on
chromosome 2. Two major QTLs for disease resistance were detected on chromosomes
4 and 5 whereas a minor QTL was detected on chromosome 3
Environmental effects may have a profound influence on the expression of

quantitative traits. Experiments that are replicated across sites and over time (e.g.different
seasons and years) may enable the researcher to investigate environmental influences on
QTLs affecting trait(s) of interest. As discussed previously, RI or DH populations are ideal
for these purposes. The most important experimental design factor is the size of the
population used in the mapping study. The larger the population, the more accurate the
mapping study and the more likely it is to allow detection of QTLs with smaller effects. An
increase in population size provides gains in statistical power, estimates of gene effects and
confidence intervals of the locations of QTLs. The main sources of experimental error are
mistakes in marker genotyping and errors in phenotypic evaluation. Genotyping errors and
missing data can affect the order and distance between markers within linkage maps. The
accuracy of phenotypic evaluation is of the utmost importance for the accuracy of QTL
mapping. A reliable QTL map can only be produced from reliable phenotypic data.
Replicated phenotypic measurements or the use of clones (via cuttings) can be used to
improve the accuracy of QTL mapping by reducing background ‘noise’. Some thorough
studies include those where phenotypic evaluations have been conducted in both field and
glasshouse trials, for ascochyta blight resistance in chickpea, bacterial brown spot in common
bean, and downy mildew resistance in pearl millet.
DTY1.1-A Major QTL for rice grain yield under reproductive stage drought stress
QTL Information – Utility and Prospects

Despite lack of precise information about the molecular nature of the QTL,
introgression of QTLs into elite lines or germplasm, and marker-assisted selection (MAS) for
QTLs in breeding could be undertaken in some crop plants such as maize, tomato and rice,
with reasonable success. Plant breeders may not need to know the precise locations of the
QTL, so long as the QTL has large effect, and can be introgressed using marker-assisted
backcrossing. The methods available will enable them to pick such useful QTL, which could
well have been missed by conventional phenotypic selection. Also, another important
advantage of the markers is in the reduction of linkage drag during the introgression of QTL
by backcrossing. At IARI, we have mapped and validated QTLs conferring resistance to
downy mildews of maize and have recently transferred two major QTLs for downy mildew
resistance into CM139, an elite but downy mildew- susceptible inbred line.
There are still some important caveats regarding QTL analysis. Only the QTLs of
largest effect and those closest to a marker locus, will show statistically reliable associations.
Particularly important is fine-mapping or high-resolution mapping of the QTL, if the QTL
information has to be effectively applied in basic/applied research. Once fine-mapped, QTLs
can also serve as useful tools for comparative genomics, functional genomics and
evolutionary studies.
While the methods for QTL analysis provide us with information about the putative
locations and effects of QTLs influencing a quantitative trait, these do not tell us anything
about the molecular nature of the QTL. In other words, are these QTL coding for specific
enzymes involved in a particular pathway, do they act as regulators of gene expression, or are
they non-coding regions that have some influence in the expression of the trait of interest?
Identification of putative QTL locations and DNA markers linked to QTLs have opened up
opportunities for isolation and molecular characterization of QTLs via map based cloning.
During the last decade, considerable progress has been made in terms of positional cloning of
several QTLs, including Brix 9-2-5, fw2.2, Hd1, Hd6, FRI in tomato, rice and Arabidopsis.
References
 Anderson, J., G. Churchill, J. Autrique, S. Tanksley & M. Sorrells, 1993. Optimizing parental
Tanksley, S.D., N.D. Young, A.H. Paterson & M. Bonierbale, 1989.
 RFLP mapping in plant breeding: New tools for an old science. Biotechnology 7: 257–
264.election for genetic linkage maps. Genome 36: 181–186.
 Young, N.D., 1996.QTLmapping and quantitative disease resistance in plants. Annu Rev
Phytopathol 34: 479–501.
 Pilet-Nayel, M.L., F.J. Muehlbauer, R.J. McGee, J.M. Kraft, A. Baranger & C.J. Coyne,
2002. Quantitative trait loci for partial resistance to Aphanomyces root rot in pea. Theor Appl
Genet 106: 28–39.
 B.C.Y. Collard, M.Z.Z. Jahufer, J.B. Brouwer & E.C.K. Pang, 2005. An introduction to
markers, quantitative trait loci (QTL) mapping and marker-assisted selection for crop
improvement: The basic concepts. Euphytica: 142: 169–196.
 Winter, P., T. Pfaff, S. Udupa, B. Huttel, P. Sharma, S. Sahi, R. Arreguin-Espinoza, F.
Weigand, F.J. Muehlbauer & G. Kahl, 1999. Characterisation and mapping of sequence-
tagged microsatellite sites in the chickpea (Cicer arietinum L.) genome. Mol Gen Genet 262:
90–101.
 Rafalski, J. & S. Tingey, 1993. Genetic diagnostics in Plant Breed: RAPDs, microsatellites
and machines. Trends Genet 9: 275–280.

PP-604 assignment 1

Uploaded by

Copyright:

Available Formats

You might also like

PP-604 assignment 1

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

PP-604 assignment 1

Uploaded by

Copyright:

Available Formats

Unit – I : Gene Discovery

Indira Gandhi Krishi Vishwavidyalaya Raipur

“Functional Genomics and Gene Associated With Few Physiological Processes”

Faculty, Dept. of Plant Physiology, College of Agriculture, IGKV, Raipur

College of Agriculture, Raipur (C.G.)

13. Reporting and describing QTLs detected from interval 23

FINDING GENES IN COMPLEX PLANT SYSTEMS

RescueMu Protocols for Maize Functional Genomics

IN SILICO PREDICTION OF PLANT GENE FUNCTION

are either complete or being sequenced (http://www.ncbi.nlm.nih.gov). One of the primary

Gene Prediction by Homology-Based Methods

Comparison with Protein Sequence Databases

Comparison of a Translated Genomic Sequence with Translated Nucleotide Database

Quantitative Trait Locus Analysis as a Gene Discovery Tool

Construction of linkage maps

Figure 3. Diagram indicating cross-over or recombination events between homologous

In a segregating population, there is a mixture of parental and recombinant

3. Linkage analysis of markers

1. Production of a Mapping populations

Figure 4. Diagram of main types of mapping populations for self-pollinating species

3. Linkage analysis of markers

Figure 6. Construction of a linkage map based on a small recombinant inbred

Genetic distance and mapping functions

Figure 7. Hypothetical ‘framework’ linkage map of five chromosomes (represented by linkage

Examples of quantitative traits are:

Methods to detect QTLs

Understanding interval mapping results

Reporting and describing QTLs detected from interval mapping

Factors influencing the detection of QTLs

Environmental effects may have a profound influence on the expression of

QTL Information – Utility and Prospects

You might also like