Professional Documents
Culture Documents
PP-604 assignment 1
PP-604 assignment 1
PP-604 assignment 1
An Assignment on
Unit - I
“Gene Discovery: Finding Genes in Complex Plant System, Constructing Gene-
Enriched Plant Genomic Libraries, In Silico Prediction of Plant Gene Function,
Quantitative Trait Locus Analysis as a Gene Discovery Tool”
Course No : PP-604
Credit : 3(1+2)
Session : 2017-18
SUBMITTED BY:
Mr. Pawankumar S. Kharate
(PhD Scholar)
Department of Plant Molecular
Biology and Biotechnology
Functional Genomics and Gene Associated With Few Physiological Processes Page 1
Unit – I : Gene Discovery
Index
Page
S. No. Content
No.
1. Finding genes in complex plant systems 1
2. Constructing Gene-Enriched Plant Genomic Libraries 2
3. In silico Prediction of Plant Gene Function 6
4. Quantitative Trait Locus Analysis as a Gene Discovery 12
Tool
5. Construction of linkage maps 12
6. Production of a Mapping populations 14
7. Identification of polymorphism 16
8. Linkage analysis of markers 17
9. Genetic distance and mapping functions 18
10. Identification of QTLs 19
11. Methods to detect QTLs 21
12. Understanding interval mapping results 22
Functional Genomics and Gene Associated With Few Physiological Processes Page 2
Unit – I : Gene Discovery
Functional Genomics and Gene Associated With Few Physiological Processes Page 3
Unit – I : Gene Discovery
regardless of their actual location in the genome. In the early 1980s by the time the idea of
sequencing the human genome was opened to discussion for the first time, Putney et al.
reported a method that allowed to discover new genes simply by cDNA sequencing, later
called expressed sequence tag (EST) sequencing.
This widely used technique allows obtaining gene sequence information getting
around the problem of sequencing repetitive DNA. However, the EST approach has two main
limitations. The first is the redundancy of cDNA libraries. Some cDNAs are often
overrepresented and will be sequenced many times before a cDNA corresponding to a weakly
expressed gene is found. The second limitation is the partial representation due to the tissue-
specific and developmental regulation of gene expression. Some genes are expressed only in
certain tissues or cells, and some are developmentally regulated. In order to recover the
corresponding ESTs, libraries from several different tissues and developmental stages need to
be constructed. Another although minor, disadvantage of EST sequencing is that repetitive
elements are often transcribed and thus included in EST collections. One way to solve the
problem of the redundancy is to use normalized libraries. Normalization techniques are based
on reassociation kinetics and have been improved to avoid the elimination of members of
gene families. However, it is not trivial to obtain a normalized library where representation is
acceptable. Regardless of these limitations, EST projects are being conducted for many
organisms and are a key tool for gene discovery, annotation of genes, cross-species
comparative analysis, and definition of intron–exon boundaries among many other uses. In
particular for plants, ESTs have been the alternative to full genome sequence, because the
genomes of many plants, often important crop species, are very large and repetitive. Usually,
the genome size (or subgenome size in the case of polyploids) correlates with the proportion
of repetitive DNA. It has been proposed that all diploid higher plant genomes share
essentially the same set of genes, called the “gene space”. Then, the bigger the genome, the
higher sequencing cost per gene, due to the amount of nongenic (e.g., repetitive) DNA that
needs to be sequenced before reaching a gene. The conservation of coding sequences across
different species allows identifying genes simply by comparing two different genomes.
Frequently, gene modeling software fails to identify genes that can be spotted with this
comparative genomics approach.
Furthermore, once the complete genomic sequence is obtained for one organism, it
can be compared to a draft (lowly redundant and discontinuous) sequence of a related
organism. This approach yields a lot of new information for both species under analysis. The
additional advantage of genomic vs cDNA sequencing in terms of representation makes the
Functional Genomics and Gene Associated With Few Physiological Processes Page 4
Unit – I : Gene Discovery
lowly redundant genomic sequencing a cost-effective process. In the case of plants however,
the large genome sizes prevent the pursuit of full or even draft genomic sequencing projects.
For these reasons, alternatives to obtain genomic sequences enriched in genes avoiding the
repetitive DNA have been developed. In maize for example, the very active transposon
Mutator shows a strong bias to insert in low copy DNA (i.e., genes). By generating large
Mutator-induced insertional mutagenesis, it is possible to collect genomic sequences flanking
transposon insertion sites, which will mainly correspond to genes. Although Mutator
insertions may not be completely at random in the genome, it can be a good complement to
an EST project. Another alternative for gene enriched genomic sequencing of plants is the
methylation filtration technique, which takes advantage of the fact that most of the repetitive
elements in plants are heavily methylated, while genes are hypomethylated. Because of their
methylation status, repeats are sensitive to bacterial restriction-modification systems, in
particular the Mcr system, which includes two restriction enzymes: McrA and McrBC.
McrBC recognizes DNA containing 5-methylcytosine preceded by a purine. Restriction
requires two of these sites separated by 40–2000 nucleotides. Such recognition sites are very
frequent in any methylated genomic DNA. Thus, by the selecting a mcrBC+ Escherichia coli
host strain, repetitive DNA can be largely excluded from genomic shotgun libraries,
preserving the low copy DNA. Basically, methylation filtration consists in shearing and size
fractionation of genomic DNA to select fragments smaller than the estimated size of the
genes. Larger fragments have a high probability of including some portion of repetitive DNA,
which would be methylated and thus counter-selected in the filtered library. On the other
hand, if fragments are too small, there are more chances to recover small fragments of
repetitive DNA with low GC content. Such fragments may be poor in methylated sites
susceptible to restriction by McrBC and then can be frequently recovered in filtered libraries.
The selected fragments are then end-repaired and cloned into a standard sequencing vector.
Subsequently, the ligation is introduced in a mcrBC+ E. coli host. The recombinant clones
isolated after plating are picked for automatic sequencing. The same ligation mixture can be
transformed into a mcrBC- E. coli strain to obtain an unfiltered control library. The technique
works very well for maize, and there is evidence that it works for many other plants. The
advantage of methylation-filtered libraries vs cDNA and transposon insertion libraries is that
there is no bias towards a certain region of the genome or a given fraction of the genes. It is
possible though, that methylated genes are not recovered in filtered libraries. However, gene
methylation is often restricted to defined regions of the gene, mainly the ends. This would
allow tonclone at least most of the coding sequence of methylated genes. Furthermore, genes
Functional Genomics and Gene Associated With Few Physiological Processes Page 5
Unit – I : Gene Discovery
that are regulated by methylation may become demethylated during different developmental
stages. In these cases, the construction of methylation filtered libraries from a couple of
developmental stages of a given plant would likely overcome the problem. For larger scale
projects, another problem is posed by the cloning efficiency. In plants with very large
genomes, repetitive DNA may account for more than 90% of the nuclear DNA. Then, most of
the DNA is likely to be methylated leaving a very small fraction of the genome to be
recovered in methylation-filtered libraries. As a result, the number of recombinant clones
recovered after plating a filtered library may be <10% of the number of clones obtained in the
corresponding unfiltered control library. Furthermore, the proportion of nonrecombinant
background (blue colonies) may become significant. The use of adaptors often improves the
cloning efficiency in addition to reduce the formation of chimerical clones.
The cloning protocol presented here uses three-nucleotide overhang adaptors and a
compatible sticky-end vector made by filling in one nucleotide in the fournucleotide 5'
overhang generated by a restriction nuclease. The advantage of using three- vs four-
nucleotide overhang is that the nonrecombinant background is highly reduced because the
vector ends become incompatible.
Methods
Nuclear DNA Preparation: Plastids are very abundant, not only in green tissues, and
their DNA is unmethylated. Thus, if chloroplast DNA is present in a DNA sample, it will be
selected during the filtering process. For this reason, it is important to purify nuclei from the
rest of the cell organelles before purifying the genomic DNA- DNA Shearing and End-
Repairing- Adaptor Ligation- Vector Preparation- Preparation of Electrocompetent
JM107 or JM107MA2 Cells- Ligation- Electroporation: Electroporation cuvettes 0.1 cm,
Electroporator (Bio-Rad)- Checking the Average Library Insert Size by Colony PCR.
Functional Genomics and Gene Associated With Few Physiological Processes Page 6
Unit – I : Gene Discovery
This is because maize leaves contain abundant independent RescueMu somatic insertions, and
70–90% of these insertions occur preferentially into genes and not repetitive DNA. This
chapter describes detailed protocols on how to obtain, generate, and use RescueMu for maize
genomics, including resources developed by the Maize Gene Discovery Project (MGDP)
consortium available online at ZmDB.
Precious Cells Contain Precious Information
Expression analysis, often encompassed in the term “functional genomics,” is the link
between physiology and molecular biology. Often, specific physiological changes in plant
development are due to a limited number of genes, expressed exclusively in very few cells of
an organ or organism. Compounding the situation, these physiological changes may also be
transient. Therefore, searching for the responsible genes, though exciting and necessary to
understand important processes is hindered primarily by the scarcity of “precious” cells in the
desired physiological state. Used judiciously, molecular methods such as reverse transcription
polymerase chain reaction (RT-PCR), microarray analysis, or subtractive hybridization allow
analysis of rare or special cells. Each of these methods has its advantages and pitfalls.
Working with precious cells entails special biological strategies to avoid excessive work in
obtaining the data and misinterpretation of it. To illustrate the logic and methods involved in
working with precious cells–tissues, we describe how subtractive hybridization followed by
expressed sequence tag (EST) sequencing can be used to search for a few genes specific to a
few available cells.
Functional Genomics and Gene Associated With Few Physiological Processes Page 7
Unit – I : Gene Discovery
Functional Genomics and Gene Associated With Few Physiological Processes Page 8
Unit – I : Gene Discovery
Materials
User must have access to a computer with Internet access, e.g., a personal computer
(PC) running Microsoft® Windows™ or Linux, an Apple® Macintosh®, or a UNIX®
workstation. The user should be familiar with the use of Netscape Navigator or Microsoft
Internet Explorer. The list of commonly used gene finding and sequence alignment programs
and their Web uniform resource locators (URLs) are provided in Table 1.
Functional Genomics and Gene Associated With Few Physiological Processes Page 9
Unit – I : Gene Discovery
Functional Genomics and Gene Associated With Few Physiological Processes Page 10
Unit – I : Gene Discovery
Comparison of Genomic Sequence with Homologous Genomic Sequences from Related Species
Protein coding DNA from closely related plant species, such as sorghum and maize,
show considerable sequence similarity. With the availability of genomes of many different
organisms, comparative genomic approaches are gaining importance. VISTA/AVID and
PipMaker can be used to compare large genomic sequences to find orthologous genomic
sequences from closely related species. For example, sequence analysis of orthologous genes
from rice, maize, and sorghum showed that the exons are more conserved than introns. The
degree of sequence conservation, in terms of sequence identity, nacross species has been
shown to be consistent with the divergence timesn of the respective species. The rice genes
are considerably more diverged than their counterparts in maize and sorghum. For gene
prediction programs, it would be best to compare two genomes that are very closely related,
but distant enough that their intergenic repeat elements differ significantly. As a rule of
thumb, consider two species as closely related, if those two are diverged within the last 25
million yr. For example, maize and sorghum are closely related species as they were diverged
15–20 million yr ago. If homologous genomic sequences from two species are known, then a
recently developed gene prediction tool called SGP-1 can be used to find protein-coding
genes.
Functional Genomics and Gene Associated With Few Physiological Processes Page 11
Unit – I : Gene Discovery
Worked Example
We discussed various gene-finding strategies in the previous sections. Now let us
discuss which programs to choose and how to use those programs in a real practical scenario.
Given a large genomic sequence, we suggest the following steps in arriving at probable exons
that the sequence may contain.
1) Blast the sequence against nr and EST databases by using BLASTN (Megablast in
case of very long sequence) program. Note the list of accession numbers of cDNAs or
ESTs with “% identity” score ≥ 99, from the blast output.
2) Use SIM4 program to align each of the cDNA/ESTs with the genomic sequence so as
to identify exons with canonical splice sites.
3) Blast the sequence against nr database by using the BLASTX program. From the
output, note down the BLASTX matches that may belong to genes.
4) Submit the sequence to at least 4 different gene prediction programs and select the
consensus predictions (exons). We consider a prediction as consensus prediction if it
is predicted by at least half of the programs either fully (both ends of the predicted
exons are same) or partially (there exists an overlapping region among the predicted
exons).
To demonstrate the above steps, we use the genomic sequence in rice bacterial artificial
chromosome (BAC) in GenBank® with Accession no. AP005190, which has not yet been
annotated at the time writing of this review. Since the length of the sequence is very large
(138,893 bp), we used Megablast to identify the homologous sequences from the GenBank.
The program was run twice by choosing nr and EST databases. We use the list of high
scoring segment pairs (HSPs) from the Megablast output. As BLAST is mainly a sequence
similarity program, it helps us to identify the regions in the input sequence (query sequence)
that are similar to known sequences (subject sequences) in the database. As the output
suggests, it is hard to interpret the gene structure (exon–intron boundaries) from the output.
Hence, we ran SIM4 program to align each of the EST/cDNA sequences (from the output of
Megablast) with the genomic sequence AP005190. The list of exons inferred by combining
various EST/cDNA alignments with AP005190 using SIM4. BLASTX output used. The
values in columns query start and query end would give the regions in the genomic sequence
AP005190 that may belong to probable genes. Finally, we submitted the genomic sequence
AP005190 to four gene-finding programs Genscan with Arabidopsis model, Genscan with
maize model, GeneMark.hmm with rice model, and MZEF with Arabidopsis model. Default
values were selected for other parameters for each of the programs used. As none of the
programs is good enough to predict the complete gene structure, we considered only the exon
predictions. We compiled the list of all consensus exons that were predicted by at least two
programs. We consider an exon as a consensus prediction if there exists an overlapping
region among the predictions of at least two different programs.
Functional Genomics and Gene Associated With Few Physiological Processes Page 12
Unit – I : Gene Discovery
Functional Genomics and Gene Associated With Few Physiological Processes Page 13
Unit – I : Gene Discovery
2. Identification of polymorphism
Functional Genomics and Gene Associated With Few Physiological Processes Page 14
Unit – I : Gene Discovery
Functional Genomics and Gene Associated With Few Physiological Processes Page 15
Unit – I : Gene Discovery
Functional Genomics and Gene Associated With Few Physiological Processes Page 16
Unit – I : Gene Discovery
2. Identification of polymorphism
The second step in the construction of a linkage map is to identify DNA markers that
reveal differences between parents (i.e. polymorphic markers). It is critical that sufficient
polymorphism exists between parents in order to construct a linkage map. In general, cross
pollinating species possess higher levels of DNA polymorphism compared to inbreeding
species; mapping in inbreeding species generally requires the selection of parents that are
distantly related. In many cases, parents that provide adequate polymorphism are selected
on the basis of the level of genetic diversity between parents.
The choice of DNA markers used for mapping may depend on the availability of
characterized markers or the appropriateness of particular markers for a particular species.
Once polymorphic markers have been identified, they must be screened across the entire
mapping population, including the parents (and F1 hybrid, if possible). This is known as
marker ‘genotyping’ of the population. Therefore, DNA must be extracted from each
individual of the mapping population when DNA markers are used. Examples of DNA
markers screened across different populations are shown in Figure 5.
Figure 5. Gel photos representing segregating codominant markers (left side) and
dominant markers (right side) for typical mapping populations
Functional Genomics and Gene Associated With Few Physiological Processes Page 17
Unit – I : Gene Discovery
Functional Genomics and Gene Associated With Few Physiological Processes Page 18
Unit – I : Gene Discovery
A LOD value of 3 between two markers indicates that linkage is 1000 times more
likely (i.e. 1000:1) than no linkage (null hypothesis). LOD values may be lowered in order to
detect a greater level of linkage or to place additional markers within maps constructed at
higher LOD values. Commonly used software programs include Mapmaker/ EXP and
MapManager QTX and JoinMap.
A typical output of a linkage map is shown in Figure 7. Linked markers are grouped
together into ‘linkage groups,’ which represent chromosomal segments or entire
chromosomes. Referring to the road map analogy, linkage groups represent roads and
markers represent signs or landmarks. A difficulty associated with obtaining an equal number
of linkage groups and chromosomes is that the polymorphic markers detected are not
necessarily evenly distributed over the chromosome, but clustered in some regions and absent
in others. In addition to the non-random distribution of markers, the frequency of
recombination is not equal along chromosomes. The accuracy of measuring the genetic
distance and determining marker order is directly related to the number of individuals studied
in the mapping population. Ideally, mapping populations should consist of a minimum of 50
individuals for constructing linkage maps.
Functional Genomics and Gene Associated With Few Physiological Processes Page 19
Unit – I : Gene Discovery
spots,’ which are chromosomal regions in which recombination occurs more frequently or
less frequently, respectively.
Identification of QTLs
Principle of QTL analysis Identifying a gene or QTL within a plant genome is like
finding the proverbial needle in a haystack. However, QTL analysis can be used to divide the
haystack in manageable piles and systematically search them. In simple terms, QTL analysis
is based on the principle of detecting an association between phenotype and the genotype of
markers. Markers are used to partition the mapping population into different genotypic
groups based on the presence or absence of a particular marker locus and to determine
whether significant differences exist between groups with respect to the trait being measured.
A significant difference between phenotypic means of the groups (either 2 or 3), depending
Functional Genomics and Gene Associated With Few Physiological Processes Page 20
Unit – I : Gene Discovery
on the marker system and type of population, indicates that the marker locus being used to
partition the mapping population is linked to a QTL controlling the trait (Figure 8).
Figure 8. Principle of QTL mapping. Markers that are linked to a gene or QTL
controlling a particular trait (e.g. plant height)
A logical question that may be asked at this point is why does a significant P value
obtained for differences between mean trait values indicate linkage between marker and
QTL?’ The answer is due to recombination. The closer a marker is from a QTL, the
lower the chance of recombination occurring between marker and QTL. Therefore, the
QTL and marker will be usually be inherited together in the progeny, and the mean of the
group with the tightly-linked marker will be significantly different (P < 0.05) to the mean of
the group without the marker. When a marker is loosely-linked or unlinked to a QTL, there is
independent segregation of the marker and QTL. In this situation, there will be no significant
difference between means of the genotype groups based on the presence or absence of the
loosely linked marker. Unlinked markers located far apart or on different chromosomes to the
QTL are randomly inherited with the QTL; therefore, no significant differences between
means of the genotype groups will be detected.
Functional Genomics and Gene Associated With Few Physiological Processes Page 21
Unit – I : Gene Discovery
Functional Genomics and Gene Associated With Few Physiological Processes Page 22
Unit – I : Gene Discovery
of linked markers for interval mapping. The main advantage of CIM is that it is more precise
and effective at mapping QTLs compared to single-point analysis and interval mapping,
especially when linked QTLs are involved. Many researchers have used QTL Cartographer,
MapManager QTX and PLABQTL to perform CIM.
Figure 9. Hypothetical output showing a LOD profile for chromosome 4. The dotted line
represents the significance threshold determined by permutation tests
Functional Genomics and Gene Associated With Few Physiological Processes Page 23
Unit – I : Gene Discovery
The peak or maximum must also exceed a specified significance level in order for the
QTL to be declared as ‘real’ (i.e. statistically significant). The determination of significance
thresholds is most commonly performe using permutation tests. Briefly, the phenotypic
values of the population are ‘shuffled’ whilst the marker genotypic values are held constant
(i.e. all marker-trait associations are broken) and QTL analysis is performed to assess the
level of false positive marker-trait associations. This process is then repeated (e.g. 500 or
1000 times) and significance levels can then be determined based on the level of false
positive marker-trait associations. Before permutation tests were widely accepted as an
appropriate method to determine significance thresholds, a LOD score of between 2.0 to 3.0
was usually chosen as the significance threshold.
Functional Genomics and Gene Associated With Few Physiological Processes Page 24
Unit – I : Gene Discovery
that are closely-linked (approximately 20 cM or less) will usually be detected as a single QTL
in typical population sizes (<500).
Figure 10. QTL mapping of height and disease resistance traits. Hypothetical QTLs
were detected on chromosomes 2, 3, 4 and 5. A major QTL for height was detected on
chromosome 2. Two major QTLs for disease resistance were detected on chromosomes
4 and 5 whereas a minor QTL was detected on chromosome 3
Functional Genomics and Gene Associated With Few Physiological Processes Page 25
Unit – I : Gene Discovery
accuracy of phenotypic evaluation is of the utmost importance for the accuracy of QTL
mapping. A reliable QTL map can only be produced from reliable phenotypic data.
Replicated phenotypic measurements or the use of clones (via cuttings) can be used to
improve the accuracy of QTL mapping by reducing background ‘noise’. Some thorough
studies include those where phenotypic evaluations have been conducted in both field and
glasshouse trials, for ascochyta blight resistance in chickpea, bacterial brown spot in common
bean, and downy mildew resistance in pearl millet.
DTY1.1-A Major QTL for rice grain yield under reproductive stage drought stress
Functional Genomics and Gene Associated With Few Physiological Processes Page 26
Unit – I : Gene Discovery
advantage of the markers is in the reduction of linkage drag during the introgression of QTL
by backcrossing. At IARI, we have mapped and validated QTLs conferring resistance to
downy mildews of maize and have recently transferred two major QTLs for downy mildew
resistance into CM139, an elite but downy mildew- susceptible inbred line.
There are still some important caveats regarding QTL analysis. Only the QTLs of
largest effect and those closest to a marker locus, will show statistically reliable associations.
Particularly important is fine-mapping or high-resolution mapping of the QTL, if the QTL
information has to be effectively applied in basic/applied research. Once fine-mapped, QTLs
can also serve as useful tools for comparative genomics, functional genomics and
evolutionary studies.
While the methods for QTL analysis provide us with information about the putative
locations and effects of QTLs influencing a quantitative trait, these do not tell us anything
about the molecular nature of the QTL. In other words, are these QTL coding for specific
enzymes involved in a particular pathway, do they act as regulators of gene expression, or are
they non-coding regions that have some influence in the expression of the trait of interest?
Identification of putative QTL locations and DNA markers linked to QTLs have opened up
opportunities for isolation and molecular characterization of QTLs via map based cloning.
During the last decade, considerable progress has been made in terms of positional cloning of
several QTLs, including Brix 9-2-5, fw2.2, Hd1, Hd6, FRI in tomato, rice and Arabidopsis.
References
Anderson, J., G. Churchill, J. Autrique, S. Tanksley & M. Sorrells, 1993. Optimizing parental
Tanksley, S.D., N.D. Young, A.H. Paterson & M. Bonierbale, 1989.
RFLP mapping in plant breeding: New tools for an old science. Biotechnology 7: 257–
264.election for genetic linkage maps. Genome 36: 181–186.
Young, N.D., 1996.QTLmapping and quantitative disease resistance in plants. Annu Rev
Phytopathol 34: 479–501.
Pilet-Nayel, M.L., F.J. Muehlbauer, R.J. McGee, J.M. Kraft, A. Baranger & C.J. Coyne,
2002. Quantitative trait loci for partial resistance to Aphanomyces root rot in pea. Theor Appl
Genet 106: 28–39.
B.C.Y. Collard, M.Z.Z. Jahufer, J.B. Brouwer & E.C.K. Pang, 2005. An introduction to
markers, quantitative trait loci (QTL) mapping and marker-assisted selection for crop
improvement: The basic concepts. Euphytica: 142: 169–196.
Winter, P., T. Pfaff, S. Udupa, B. Huttel, P. Sharma, S. Sahi, R. Arreguin-Espinoza, F.
Weigand, F.J. Muehlbauer & G. Kahl, 1999. Characterisation and mapping of sequence-
tagged microsatellite sites in the chickpea (Cicer arietinum L.) genome. Mol Gen Genet 262:
90–101.
Rafalski, J. & S. Tingey, 1993. Genetic diagnostics in Plant Breed: RAPDs, microsatellites
and machines. Trends Genet 9: 275–280.
Functional Genomics and Gene Associated With Few Physiological Processes Page 27