
ESSENTIALS OF MOLECULAR GENETICS

Prepared by Faculty of the Albert Einstein College of Medicine


(September, 1993; revised September, 2002)

CONTENTS

What Is Molecular Genetics?
Classical Genetics and the Definition of the Gene
Classical Genetics Defines the Gene by the Study of Mutations
Mutations Can Be Dominant Or Recessive
The Complementation Test Identifies the Gene as a Unit of Activity
A Complementation Test Sometimes Gives the "Wrong" Answer
Transmission Genetics
Classical Genetics Defined the Rules Governing Genetic Transmission
Cytologists Discovered the Cellular Structures That Contained the Genes
Genetic Recombination between Genes in Single Linkage Groups Results from Exchange of Material between Homologous Chromosomes
The Frequency of Genetic Recombination Can Be Used to Map Genes on Chromosomes
Construction of a Genetic Map Is an Important Step in the Definition of Genes
Organisms Being Studied Today
Genetic Mapping Techniques in Various Organisms
The Physical Characteristics of Genomes
Genomes Consist of DNA Molecules, and Vary Widely in Size
Bacterial Genomes Contain Some 4300 Genes, Higher Organisms May Have As Many As 30,000 or More
Genome Projects
Current Methods Make the Sequencing of Whole Genomes Possible
Construction of Physical Maps: Overlapping Clones
The Physical Map Is Correlated with the Genetic Map
Eukaryotic Genomes Contain a Large Amount of Repetitive DNA
There Are Several Kinds of Repeated Sequences
Maintenance and Transmission of the Genetic Material
Special Sequences Control the Replication and Transmission of the Genetic Material
Enzymatic Mechanisms Repair DNA Damage and Recombine the DNA Strands
Recombinant DNA and the Construction of Transgenic Organisms
Genes May Be Amplified in Pure Form by "Cloning" Them in Microorganisms
The Polymerase Chain Reaction (PCR) Is a Way to "Clone" DNA Directly In Vitro
Genes Are Cloned by Isolating Them from Clone Libraries or Clone Banks
A Variety of Vectors Provide a Range of Options for the Generation of a Clone Library
Clone Libraries May Be Screened in a Number of Ways
Constructing Transgenic Organisms
Basic Elements of Bacterial Genetics
The Genetics of Bacteria Has Several Unique Features
Bacterial Cells Exchange Genetic Material in a Process Known as Conjugation
The Bacterial Genetic Map Is Defined by the Time of Transfer During Conjugation
The Bacterial Genetic Map and the Bacterial Chromosome Are Circular
The F Plasmid Encodes Genetic Functions Required for Transfer of DNA
Integration of the F Plasmid into the Bacterial Chromosome Can Result in Mobilization of the Chromosome for Transfer
Plasmids Can Be Used to Construct Partially-Diploid Bacterial Strains
Plasmids Play an Important Role in the Transmission of Drug Resistance
In Transformation, Bacterial Cells Take Up DNA Directly
Bacterial Viruses Play a Role in Genetic Exchange Between Bacteria
Study of Bacteriophages Has Played a Central Role in the Development of Molecular Biology
Bacterial Viruses May Kill the Host Cell or Coexist with It
Inferring Wild Type Gene Function from Mutant Phenotype
To Infer Wild Type Gene Function, It Is First Necessary to Determine How the Mutation Affects Gene Activity
Types of Mutations Are Defined by Structure and by Affects on Gene Activity
Rare Spontaneous Mutations Are of All Types
Chemical Mutagens Tend to Induce Point Mutations, Radiation Tends to Produce Rearrangements
Null Mutations Are Important in the Determination of the Biological Process in which a Gene Participates
In Some Organisms the Null Phenotype Is Best Determined by Gene Knockout
Null Mutations Can Be Identified As Mutations That Behave Genetically Like a Deficiency of the Gene
Null Mutations Have Several Characteristics That Distinguish Them from Non-Null Mutations
New Null Alleles May Be Isolated by a Non-Complementation Screen
Hypomorphic Mutations Lower But Do Not Eliminate Gene Activity
Gene Activity Is Raised by Hypermorphic Mutations
Antimorphic Mutations Produce a Poison Gene Product
Neomorphic Mutations Result in a Novel Gene Activity
A Gain-of-function Mutant Phenotype May Be Eliminated by Introducing a Loss-of-function Mutation at the Same Locus
Determining the Time and Place of Gene Action
The Time and Location of Gene Expression Can Be Determined by a Number of Biochemical Means
Reporter Genes Provide a Sensitive and Versatile Assay of Gene Expression
Gene Knockout Frequently Reveals That a Gene's Activity Is Not Required Everywhere It Is Expressed
The Tissue Where Gene Activity Is Required May Be Determined by Mosaic Analysis
Gene Product Synthesis and Gene Product Action Need Not Take Place in the Same Generation
Parental Effects May Be Identified by Genetic Tests
Temperature-Sensitive Mutations Can Be Used to Determine the Time of Gene Action
Analyzing Complex Processes by Genetics
Genetic Analysis Allows the Probing of Complex Biological Processes Involving Multiple Genes
Some Genes Involved in a Biological Process May Be Identified As Genetic Modifiers
Information About the Order of Gene Action in a Pathway Can Be Obtained by Epistasis Analysis

What Is Molecular Genetics?

Molecular genetics is an approach to understanding the functions of genes. It combines classical genetic analysis with molecular biology to probe the nature of both gene action and gene transmission. The essential characteristic of molecular genetics is that gene products are studied through the genes that encode them. This contrasts with a biochemical approach, in which the gene products themselves are purified and their activities studied in vitro. All aspects of cell and organismal structure and function are potentially amenable to a molecular genetic approach. Because genes are similar in all organisms, this approach has many essential aspects in common whether the organism being studied is a bacterium, a fungus, or a mammal. The purpose of this booklet is to define and describe these common aspects, and to point out how they are applied in practice in the diverse organisms that are being studied today.

Gene cloning, that is, the isolation of a gene so that its nucleotide sequence may be determined, is central to molecular genetics. Genes identified through a classical genetic analysis of mutations may be cloned to ascertain the structure of the gene product and to permit biochemical studies of gene activity. Alternatively, genes may be defined first by the biochemical identification of their gene product. In this case gene cloning allows the isolation and study of mutant forms. In either approach, starting with a mutation or starting with a cloned gene, the techniques of classical genetic analysis are used to draw conclusions about gene function from the phenotype of mutations.

In addition to gene function, molecular genetics is also concerned with the transmission of the genetic material. Genes are carried by chromosomes, whose function is to maintain the integrity of each cell's complement of genetic information through cell division, and from one generation to the next. Chromosomes contain specialized sequences whose function is to control chromosome replication, recombination, and distribution to daughter cells. The understanding of such sequences can also be approached by cloning, sequencing, and the identification of mutations. A long-term goal of molecular genetics is to understand gene function in the context of the life, development, and reproduction of the individual, as well as the evolution of the species.

Classical Genetics and the Definition of the Gene

Classical Genetics Defines the Gene by the Study of Mutations

Long before it was known that genes consisted of strings of nucleotides that determined the structure of proteins, it was possible to infer their existence and many of their properties. Different forms of genes, called alleles or mutations, were recognized by their effects on the phenotype of the organism, that is, the organism's form and function. The complete set of allelic forms of an organism's genes is termed its genotype. Classical genetic studies involving crosses between organisms with differing genotypes and phenotypes, beginning with Mendel, revealed that higher plants and animals are diploid, that is, they have two copies of each gene, one derived from each parent. Gametes, on the other hand, as well as the genomes of some higher organisms and most prokaryotes, have only one copy of each gene and are said to be haploid. With respect to a particular gene, a diploid organism is said to be homozygous if both copies of the gene are the same, and heterozygous if two different allelic forms of the gene are present. A heterozygote is also known as a hybrid of the two parental forms. An otherwise diploid organism is said to be hemizygous for any gene present in only one copy, for example, genes on the X chromosome of Drosophila males.

Mutations Can Be Dominant Or Recessive

Since there are usually two copies of each gene per cell, it is possible to ask what will be the result if the two copies are different. Through the analysis of such heterozygotes, it has been possible to infer a great deal about the properties of genes and gene products. Consider two alleles of a single gene, a and b. Suppose the homozygote a/a has the phenotype A, and the homozygote b/b has the phenotype B. If the a/b heterozygote has the phenotype A, then a is said to be dominant with respect to b, and b is said to be recessive with respect to a. If A is the most common phenotype found in nature, then A is called the wild type, and a is the wild type allele. In this case, b would be considered a recessive mutant allele of the gene, whose mutant phenotype is observed only in homozygotes. However, the wild type need not be the dominant form, and it is possible to have mutant forms that are dominant over wild type.

Another alternative is that the phenotype of a/b is a mixture of A and B characteristics, or is intermediate between A and B; for example, if A is "large" and B is "small", the phenotype of the a/b organism might be "medium sized", or if A is red and B is white, the phenotype of the a/b organism might be pink. In this case, each of the allelic forms is said to be incompletely dominant or semidominant with respect to the other. If the phenotypes respectively characteristic of each allele are both expressed in the hybrid, then the two alleles are said to be codominant. This is the case, for example, with different allelic forms of blood group antigens.
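The dominance relationships described above amount to a small decision rule, which can be sketched in Python. This is an illustration only: the function name classify_dominance and the representation of a phenotype as a set of trait labels are assumptions made for the example, not notation used in this booklet.

```python
def classify_dominance(pheno_aa, pheno_bb, pheno_ab):
    """Classify alleles a and b from the phenotypes of a/a, b/b and a/b.

    Each phenotype is given as a set of trait labels (an assumption of
    this sketch), e.g. {"red"} or {"A antigen", "B antigen"}.
    """
    pheno_aa, pheno_bb, pheno_ab = set(pheno_aa), set(pheno_bb), set(pheno_ab)
    if pheno_aa == pheno_bb:
        raise ValueError("the two homozygotes must differ in phenotype")
    if pheno_ab == pheno_aa:
        return "a is dominant to b"
    if pheno_ab == pheno_bb:
        return "b is dominant to a"
    if pheno_ab == pheno_aa | pheno_bb:
        # Both parental traits are expressed side by side in the hybrid.
        return "a and b are codominant"
    # Anything else is treated as an intermediate or mixed phenotype.
    return "a and b are incompletely dominant"


print(classify_dominance({"red"}, {"white"}, {"red"}))
print(classify_dominance({"red"}, {"white"}, {"pink"}))
print(classify_dominance({"A antigen"}, {"B antigen"}, {"A antigen", "B antigen"}))
```

In practice the distinction between an intermediate phenotype and the joint expression of both parental phenotypes is a judgment made by the experimenter; the set comparison here only stands in for that judgment.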
The Complementation Test Identifies the Gene as a Unit of Activity

In addition to making it possible to determine whether one allelic form of a gene is dominant or recessive with respect to another, diploidy makes possible a fundamental genetic test to determine whether two mutations with the same or similar phenotypes are in the same gene: the complementation test. A determination of the number of genes involved is essential to begin unraveling the role of genes in a particular process. Suppose, for example, the genetic basis of fruit fly eye color is being studied. If wild type fruit fly eyes are red, and two mutant strains of flies have white eyes, it will be important to know whether the two mutations are in the same gene, or define two separate genes, both of which are necessary to make red eyes. It is by means of the complementation test that the gene as a unit of function is defined.

In a complementation test, an organism that is heterozygous in trans for two mutations with similar phenotypes is constructed by genetic crosses, and its phenotype is observed. Heterozygous in trans means that one mutant allele has been obtained from one parent, and the other mutant allele has been obtained from the other parent. It is necessary that both mutations be recessive, so that the phenotype of a heterozygote for each mutant allele singly is wild type. If the trans-heterozygote is also found to be wild type, then the two mutations are said to "complement" one another. If the trans-heterozygote is found to be mutant in phenotype, then the two mutations are said to "fail to complement" one another.
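The bookkeeping for a series of complementation tests can be sketched in a few lines of Python. The sketch assumes that every mutation is recessive and that failure to complement is read as evidence that two mutations lie in the same gene, which is the usual interpretation but not an infallible one. The mutation names (w1 to w4) and the function complementation_groups are hypothetical.

```python
def complementation_groups(mutations, fails_to_complement):
    """Group recessive mutations into complementation groups.

    mutations: iterable of mutation names.
    fails_to_complement: pairs (m1, m2) whose trans-heterozygote was mutant.
    Pairs not listed are taken to complement (wild type trans-heterozygote).
    """
    parent = {m: m for m in mutations}

    def find(m):
        # Follow parent links to the representative of m's group.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for m1, m2 in fails_to_complement:
        # Non-complementing mutations are merged into one group.
        parent[find(m1)] = find(m2)

    groups = {}
    for m in parent:
        groups.setdefault(find(m), []).append(m)
    return sorted(sorted(g) for g in groups.values())


# Four white-eyed strains; w1/w2 and w2/w3 trans-heterozygotes were white-eyed.
results = complementation_groups(
    ["w1", "w2", "w3", "w4"],
    [("w1", "w2"), ("w2", "w3")],
)
print(results)  # [['w1', 'w2', 'w3'], ['w4']]
```

Each resulting group is a candidate gene. In this invented data set, three of the mutations would be assigned to one gene and the fourth to a second gene.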

GENETIC NOMENCLATURE IN VARIOUS ORGANISMS

Organism     Phenotype     Gene          Recessive allele   Dominant allele   Ts allele       Wild type allele
E. coli      Gal-, Lac+    galK, lacZ    galK13, lacZ23     same              same            not written
yeast        Ade-, Cdcts   ade2, cdc28   ade2-1             ADE2-27           --              ADE2
C. elegans   Dpy, Unc      dpy-5         dpy-5(bx27)        dpy-5(bx27d)      dpy-5(bx27ts)   dpy-5(+)
Drosophila   white         white         w, wa              Ubx               wts             w+, Ubx+
mouse        agouti        A             a, ab              A, Ay, Avy        --              +, A+

The table gives one or two examples of a gene or mutation name. Notice that among the differences in usage between the organisms, there are some consistencies: Phenotypes are written non-italicized, usually as three letters with only the first letter capitalized. Gene names, alleles, and genotypes, on the other hand, are generally italicized. In several systems, capitals denote dominance and small letters recessiveness.

How are these two different results to be interpreted? If the trans-heterozygote has wild type phenotype, that is, if the mutations complement one another, this implies that the trans-heterozygote has all the genetic functions needed for expression of the wild type phenotype. In other words, the chromosomes from each mutant parent make up for the deficiency present on the chromosomes of the other. If one parent is mutant in, say, gene a, the second parent must carry a wild type copy of gene a. Since the mutation in a is recessive, this gives wild type gene a function. If the second parent has a wild type copy of gene a, its own mutation must be in a different gene from a. Evidently, the mutations carried by the two parents are in different genes.

The same kind of reasoning applies to non-complementation. In this case, neither parent makes up for the deficiency of the other; evidently they must be deficient in the same gene. Thus, the general interpretation of the complementation test is as follows: if two mutations complement, then they are likely to lie in different genes; if two mutations fail to complement, then they are likely to lie in the same gene.

Note that a complementation test cannot be carried out with a dominant mutation. In order to determine the gene in which a dominant mutation lies, it is usually first necessary to isolate a recessive allele at the same locus. This is discussed further in a later section.

In diploid organisms the trans-heterozygote required by the complementation test is easily constructed by mating together two single mutant strains. However, there are other ways of determining the result of having multiple allelic forms in the same cell, including methods applicable to haploid organisms. For example, in bacteria a so-called merodiploid can be constructed by putting one copy of the gene being tested on a plasmid. Upon introduction of the plasmid, the organism becomes diploid over just that short segment of the chromosome carried by the plasmid. This technique is used in yeast as well. In both bacteria and yeast, complementation is useful in determining whether a cloned DNA segment carries the wild type copy of a mutated gene. If it does, the cloned DNA segment will complement the mutation when the DNA segment is introduced into the cell; this is often termed "complementation rescue". Complementation rescue is also used to identify wild type genes in C. elegans, into which DNA may be introduced by microinjection.

A Complementation Test Sometimes Gives the "Wrong" Answer

Although the reasoning used above to interpret the complementation test is valid for the majority of cases, it is not universally applicable. In some instances, the trans-heterozygote may have a mutant phenotype even though the two mutations being tested are in different genes. This is called second-site non-complementation, or intergenic non-complementation (these terms are equivalent). This can occur due to a cumulative effect on the trans-heterozygote of having only one wild type copy of each of two genes, or of having two mutant alleles, even though mutations in either gene are recessive when heterozygous singly.

Likewise, in some instances the trans-heterozygote may have a wild type phenotype even though the two mutations are in the same gene. This is known as intragenic complementation. This comes about if each of the two mutant genes produces a mutant gene product (as opposed to no gene product), and the two mutant gene products, when present in the same cell, can each supply the deficiency or remedy the defect of the other. Acting together, the two mutant gene products provide wild type gene function.

Because of the possibility of intergenic non-complementation and intragenic complementation, the complementation test is always combined with genetic mapping to provide a less ambiguous determination of whether two mutations define one or two genes.

Transmission Genetics

Classical Genetics Defined the Rules Governing Genetic Transmission

When Mendel, and later Morgan and other geneticists, discovered that there were genetic entities termed genes that could mutate to different forms, they also discovered how those genes were transmitted from generation to generation. Mendel realized that pea plants carried two copies of each gene. To maintain this number, each gamete had to contain one copy. The diploid condition was restored when two gametes joined at fertilization.
Evidently, during formation of the gametes in the gonad, one of the two copies of each gene had to be selected to be incorporated into each sperm cell or egg cell. The separation of the two alleles during formation of the gametes is termed segregation. Mendel wondered how this process occurred. By studying plants carrying mutations in more than one gene, he determined that the allelic forms of the two genes underwent independent assortment when they were segregated to the gametes. That is, the particular allelic form of one gene that went into a gamete did not affect which allelic form of the other gene went into that gamete. The result was that in the next generation of plants new combinations of the allelic forms could be found in predictable ratios.

When additional mutations in other organisms were studied, examples that appeared to violate this rule were soon found. In those examples, particular allelic forms of two different genes tended to stay together when gametes were formed. Such genes were said to be linked. After many examples were studied, it was shown that genes could be placed into linkage groups. Genes in one linkage group tended to stay together in the gametes, and to assort independently of genes in other linkage groups. The first genes that Mendel had studied happened all to fall into different linkage groups.

Cytologists Discovered the Cellular Structures That Contained the Genes

The foundation of genetics was consolidated when it was discovered that chromosomes behaved in the same way that Mendel's hypothetical genes did. At the same time that geneticists were defining the properties of the abstract entities they called genes (at the end of the 19th and beginning of the 20th centuries), cytologists were discovering the components of cells visible with a microscope. In examining the nucleus, they found it contained multiple chromosomes ("colored bodies", so called because they accepted certain stains) present as morphological pairs. Copies of each pair were faithfully allocated to daughter cells at cell division in a process termed mitosis. During development of gametes, there was a reduction division at which only one member of each pair entered each gamete, in a process similar to the segregation of Mendel's alleles. This unique form of cell division was termed meiosis. It was further shown that all of the different pairs of chromosomes were necessary for normal development of the organism. Thus chromosomes were essential and behaved like genes.

However, it was found that there were many fewer chromosomes than there were genetically-definable genes. Thus each chromosome would have to be associated with many genes. Eventually it became apparent that the correct correlation was not between genes and chromosomes, but between linkage groups and chromosomes. Organisms had the same number of chromosome pairs as genetic linkage groups. Linked genes went together into gametes because they were present on a single chromosome, whereas unlinked genes were on different chromosomes which assorted independently.

The two cellular copies of each chromosome are known as homologs and together constitute a homologous pair. Each member of a pair generally carries the same genes, although the allelic forms of these genes may differ. Thus the presence in the cell of two homologous chromosomes corresponds to the diploid genetic condition found by Mendel.
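Independent assortment of genes carried on different chromosomes can be illustrated by enumerating gametes. The Python sketch below assumes two unlinked genes, complete dominance of the allele written in upper case, and equal viability of all gametes and progeny; the function names are hypothetical. Under those assumptions a self-cross of an Aa Bb plant gives the well-known 9:3:3:1 ratio of phenotypes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def gametes(genotype):
    """Return gamete frequencies for unlinked genes.

    genotype: list of (allele, allele) pairs, one pair per gene,
    e.g. [("A", "a"), ("B", "b")]. One allele of each pair is chosen
    independently for every gamete (segregation plus independent assortment).
    """
    combos = ["".join(c) for c in product(*genotype)]
    counts = Counter(combos)
    return {g: Fraction(n, len(combos)) for g, n in counts.items()}


def phenotype_ratio(genotype):
    """Phenotype frequencies among progeny of a self-cross.

    Assumes the upper-case allele of each gene is completely dominant.
    """
    freq = gametes(genotype)
    ratio = Counter()
    for (g1, p1), (g2, p2) in product(freq.items(), repeat=2):
        # min() picks the upper-case letter whenever one is present,
        # because upper-case letters sort before lower-case ones.
        phenotype = "".join(min(x, y) for x, y in zip(g1, g2))
        ratio[phenotype] += p1 * p2
    return dict(ratio)


dihybrid = [("A", "a"), ("B", "b")]
print(gametes(dihybrid))          # four gamete types, each 1/4
print(phenotype_ratio(dihybrid))  # AB 9/16, Ab 3/16, aB 3/16, ab 1/16
```

For linked genes the four gamete types would no longer be equally frequent, which is the departure from Mendel's rule that led to the idea of linkage groups.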

Genetic Recombination between Genes in Single Linkage Groups Results from Exchange of Material between Homologous Chromosomes

When two marked (mutated) genes are present in a genetic cross, there is a possibility of both parental and non-parental combinations of alleles among the gametes. Suppose the two genes a and b are marked in a cross, such that one parent has the alleles A and B (genotype AB/AB) and the other parent has the alleles a and b (genotype ab/ab). The genotypes of all the F1 hybrid progeny are AB/ab. (In a cross such as this, following Mendel's nomenclature, the parental generation is known as the P0 generation, and the progeny of the cross constitute the F1 generation [for first filial generation]. The next generation is the F2 generation, and so forth.)

Let the F1 hybrid be back crossed to the ab/ab parent. In this back cross, also known as a test cross, the ab/ab parent supplies only one type of gamete, ab. But for the F1 hybrid parent, there are several possibilities. The possibilities for the genotypes of the progeny are AB/ab, ab/ab, Ab/ab, or aB/ab, where the alleles written before the slash are from the F1 hybrid parent, and the alleles written after the slash are from the ab/ab parent. Regarding the alleles from the F1 hybrid parent, progeny with genotypes AB/ab and ab/ab are derived from F1 gametes with the parental (P0) configurations of alleles (AB and ab), whereas progeny with the genotypes Ab/ab and aB/ab are derived from gametes with non-parental configurations (Ab and aB). During meiosis in the F1 hybrid parent, the genes a and b are said to have recombined to give these non-parental combinations.

By definition, unlinked genes recombine at a frequency of 50%. They are assorted randomly to the gametes, half of which get the parental combination and half of which get the non-parental (recombinant) combination. Linked genes are genes for which the frequency of recombination is less than 50%. Genes on different chromosomes undergo random assortment and hence recombine at a frequency of 50%.

Genes on the same chromosome also recombine. This is because, during meiosis, homologous chromosomes pair and undergo a physical exchange of material. In this way, non-parental combinations of alleles can be made even for linked genes. The frequency of the physical exchange event varies greatly from organism to organism and from chromosome to chromosome. It may be so high that two genes on the same chromosome become genetically unlinked, assorting randomly. (If the frequency of exchange is very high, the frequency of genetic recombination rises to a maximum of only 50%. This is because double, quadruple, etc., exchange events restore the parental configuration.) At the other extreme, it may be so low that two genes virtually never recombine and are said to be tightly linked.

The Frequency of Genetic Recombination Can Be Used to Map Genes on Chromosomes

The frequency of physical exchange, and hence of genetic recombination, between genes on single chromosomes depends not only on the organism and chromosome, but also on the physical distance between the genes on the chromosome. The probability of an exchange is higher if the genes are farther apart, and lower if they are closer together. This provides the basis for constructing a genetic map. By determining the frequency of the non-parental, that is, recombinant, combination of alleles among the progeny of a cross, a recombination frequency is calculated. Genes are then arrayed along a linear map depending on their recombinational "distances" from each other. A genetic map gives the linear order of genes on a chromosome determined by genetic studies.
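As a worked illustration of the calculation described above, the following Python sketch computes a recombination frequency from test cross progeny and orders three linked genes from their pairwise frequencies. The progeny counts, the third gene c, and the function names are invented for the example.

```python
def recombination_frequency(progeny_counts, parental=("AB", "ab")):
    """Fraction of test cross progeny that received a recombinant gamete.

    progeny_counts maps the allele combination contributed by the F1
    hybrid parent (e.g. "AB", "Ab") to the number of progeny observed.
    """
    total = sum(progeny_counts.values())
    recombinant = sum(n for g, n in progeny_counts.items() if g not in parental)
    return recombinant / total


def order_three_genes(rf):
    """Order three linked genes from pairwise recombination frequencies.

    rf maps frozenset({gene1, gene2}) to a recombination frequency.
    The pair with the largest frequency is taken to be the outer pair,
    so the remaining gene lies between them.
    """
    outer = max(rf, key=rf.get)
    genes = set().union(*rf)
    middle = (genes - outer).pop()
    left, right = sorted(outer)
    return [left, middle, right]


# Hypothetical test cross: AB/ab F1 hybrid crossed to ab/ab.
counts = {"AB": 430, "ab": 420, "Ab": 80, "aB": 70}
rf_ab = recombination_frequency(counts)
print(rf_ab)  # 0.15, i.e. 15% recombination between a and b

pairwise = {
    frozenset({"a", "b"}): 0.15,
    frozenset({"b", "c"}): 0.10,
    frozenset({"a", "c"}): 0.23,
}
print(order_three_genes(pairwise))  # ['a', 'b', 'c']
```

In the invented data the a-c frequency (0.23) is slightly less than the sum of the two shorter intervals (0.25). A shortfall of this kind is what double exchanges between the outer genes would be expected to produce.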
Because of the general correlation between the amount of DNA between two genes and the probability of the occurrence of an exchange event, the genetic map resembles the physical array of the genes along the chromosome. However, the resemblance is far from perfect. While the order of the genes should be correct, the relative distances between them may not reflect the actual relative physical distances between them. The probability of exchange per nucleotide is not constant, and in fact can vary a great deal from region to region. Some regions are hotspots of recombination where exchange occurs frequently, and likewise there are regions where exchange is suppressed. Genes on opposite sides of a hotspot, though physically close together, will appear far apart on the genetic map. Genes in regions of little recombination, though physically far apart, will appear close together on the genetic map.

A physical map displays where genes are physically located along a chromosome or molecule of DNA, as determined by molecular as opposed to genetic studies. Correlation of genetic maps and physical maps is an important component of genome projects, as discussed further below.

Construction of a Genetic Map Is an Important Step in the Definition of Genes

As discussed earlier, mutations can be assigned to the same or different genes by a complementation test. This test rests on the gene as a unit of biochemical activity. However, the possibilities of intergenic non-complementation and intragenic complementation make this test not absolutely reliable. Additional information as to whether two mutations lie in the same or different genes can be readily obtained if they are genetically mapped relative to one another. Mutations that map to different linkage groups, or that lie far apart within a single linkage group, must be in separate genes. Likewise, mutations that are tightly linked and have similar phenotypes could well lie in a single gene even if they complement each other.

Organisms Being Studied Today

Many organisms are currently being studied using molecular genetic techniques. A few of the more commonly studied include the bacterium Escherichia coli, the yeast Saccharomyces cerevisiae, the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster, the flowering plant Arabidopsis thaliana, the mouse Mus musculus, and the human Homo sapiens. These organisms each have special features that permit study of important aspects of biology.

Other organisms are used as well, often to study some particular problem. For example, the molecular genetics of the small tropical aquarium fish, the zebrafish, is being developed. It is hoped that this organism will serve as a vertebrate amenable to the same kind of in-depth analysis as is focused on Drosophila, C. elegans, and Arabidopsis. Embryogenesis of the frog Xenopus is studied because of its large, rapidly-developing eggs, while the slime mold Dictyostelium serves as a model to study cell mobility, cell-cell signaling, and pattern formation. Ciliated protozoans have proven to be excellent for the analysis of telomeres, because their macronuclei contain a large number of small chromosomes.

For many organisms, classical genetic analysis is not possible, because the sexual cycle is either too long (e.g. Xenopus), non-existent (e.g. Dictyostelium), or uncontrollable (H. sapiens).
This limitation is becoming less and less of a drawback as an ever-expanding arsenal of molecular genetic techniques is developed for isolating genes, modifying them in vitro, and placing them back into the genome. Some of the special features of the important organisms follow. E. coli and the related Salmonella typhimurium were the first organisms to be studied in molecular detail and remain the best understood on a molecular level (although this is changing). Special advantages are extremely fast growth (cells can divide every 20 minutes), very small genome size, about 1/1000 that of humans with about 1/10 the number of genes. Mutations in about 1,500 genes out of a predicted total of 4,300 are already known. E. coli lacks a true sexual cycle but the technology for moving genes between different E. coli strains is very well developed and is technically simple. E. coli is good for studying detailed molecular function of proteins. Prokaryotes like E. coli perform many functions on a molecular level quite differently from eukaryotes. The eukaryote S. cerevisiae serves as a useful microorganism that has many of the advantages of E. coli, but with much greater similarity to higher organisms. S. cerevisiae also has a sexual cycle and Mendelian genetics. Gene replacement is simple in yeast and permits rapid reverse genetic as well as genetic studies. Though yeast is good for studying cellular processes, obviously it does not permit studies of how multicellular organisms develop and function. Two organisms used to study animal development are C. elegans and Drosophila (known affectionately as worms and flies). Both organisms

are small, develop quickly, and boast a large catalog of developmental mutations and sophisticated classical and molecular genetics. The small plant Arabidopsis provides an organism for studying higher plant development. There is great interest in human biology and the mouse serves as a convenient and similar (!) mammal. The development of gene replacement technology for the mouse means that the role of genes in mammals can be tested directly. It has become much easier now to create a "knockout" mouse or a conditional knockout mouse that lacks any gene of interest. Human genetics offers special opportunities and difficulties. Unlike with the other organisms, it is not ethical to experimentally manipulate humans. On the other hand, the earth has about 10^10 humans who notice even subtle developmental problems and often report them to those aware of genetic diseases (doctors). Molecular pedigree analysis permits the study of human genetics.

Genetic Mapping Techniques in Various Organisms

While the underlying principles are the same, the approach taken to mapping mutations and constructing genetic maps varies from organism to organism. Obviously, the techniques available to the experimenter for mapping mutations in yeast, growing as a colony on a plate, will differ from those available for mapping human genes. Below are summarized briefly the steps employed for various popular experimental eukaryotes. Techniques employed with bacteria are presented in the next section.

Yeast

Mapping of genes in the yeast Saccharomyces cerevisiae is generally done by cloning the relevant gene, determining the DNA sequence of only a short segment, and comparing that sequence to the yeast genomic database for identical sequences with known chromosomal locations. When the cloned gene is not available, the genetic technique known as tetrad analysis is typically used to determine the map position.

Genetic Mapping in Yeast

Tetrad analysis involves crossing a haploid mutant strain to a series of tester strains of the opposite mating type containing marked chromosomes. Following meiosis, four haploid spores, the meiotic products of the cross, are contained as a tetrad within a single ascus, enabling accurate analysis of a single meiotic event. The segregation of the mutant phenotype from markers specific to a given chromosome can be followed. Distribution of the mutant gene (x) and a given marker (m) to different chromosomes or to distant locations on the same chromosome yields predominantly random segregation of the two genes (X, M) within a tetrad, i.e. a tetratype with XM, Xm, xm and xM progeny. (Even though yeast chromosomes are small, the frequency of recombination is comparatively high.) If the mutant gene (x) is linked to the marker (m), then tetrads of the parental ditype are predominant, i.e. Xm, Xm, xM, xM progeny within a single tetrad. This analysis is then repeated with strains containing markers at intervals scattered along all the 16 yeast chromosomes until linkage is observed. For recessive mutations, the mapping process can be simplified by using strains carrying marked, unstable chromosomes. Loss of a specifically marked chromosome is induced by the cross to the mutant strain. A recessive mutant can exhibit its phenotype upon loss of the homologous chromosome, thereby permitting its chromosomal assignment. The location of the mutant gene along this chromosome can then be determined by the frequency of its recombination with known markers along the chromosome.
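
As a rough illustration of how tetrad data of the kind described above are scored, the following Python sketch classifies each tetrad as parental ditype (PD), nonparental ditype (NPD), or tetratype (T), and converts the counts to a map distance. The distance formula used, 50 x (T + 6 NPD) / total, is the standard Perkins approximation; it is not given in this text and is an assumption added here for illustration. The function names and the example counts are hypothetical.

```python
from collections import Counter

def classify_tetrad(spores, parental=("Xm", "xM")):
    """Return 'PD', 'NPD', or 'T' for one tetrad of four two-locus spore genotypes.

    `parental` lists the two parental genotypes of the cross (here Xm and xM,
    following the notation used in the text).
    """
    if len(spores) != 4:
        raise ValueError("a tetrad has exactly four spores")
    n_parental = sum(1 for s in spores if s in parental)
    if n_parental == 4:
        return "PD"    # all four spores parental
    if n_parental == 0:
        return "NPD"   # all four spores recombinant
    return "T"         # two parental and two recombinant spores

def tally(tetrads, parental=("Xm", "xM")):
    """Count PD, NPD and T tetrads in a list of tetrads."""
    return Counter(classify_tetrad(t, parental) for t in tetrads)

def map_distance_cm(pd, npd, t):
    """Perkins approximation (an assumption, not from the text), in centiMorgans."""
    total = pd + npd + t
    if total == 0:
        raise ValueError("no tetrads scored")
    return 50.0 * (t + 6 * npd) / total

# Hypothetical data: 80 PD, 0 NPD, 20 T tetrads
print(map_distance_cm(80, 0, 20))
```

A large excess of PD over NPD tetrads suggests linkage, whereas roughly equal numbers of PD and NPD tetrads suggest that the two genes are unlinked. The approximation is intended only for linked genes at modest distances.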

Mitotic cross-over mapping, resulting from reciprocal exchange of genes located distally to the cross-over point, is a rapid method to determine the arm of the chromosome on which the gene resides and can be performed in sectored colonies. The frequency of cosegregation of genes that are far apart on the same arm of the chromosome is indicative of the localization of the mutated gene to a defined region of the chromosome. Fine mapping can be achieved by meiotic mapping (tetrad analysis) with markers known to reside in the vicinity of this chromosomal region.

Caenorhabditis elegans

The nematode C. elegans has six linkage groups, all of about the same size. There are two sexes: hermaphrodites and males. Hermaphrodites are morphologically similar to females, but make sperm as well as oocytes. They can fertilize their own eggs internally, or they can be fertilized by males. Hermaphrodites are XX, males are XO, and there are five pairs of autosomes. Genetic analysis in C. elegans is greatly aided by the possibility of storing frozen mutant stocks indefinitely in liquid nitrogen refrigerators. Genetics in C. elegans is somewhat unusual in having the possibility of examining the self progeny of a single hermaphrodite. This simplifies certain operations. In general, genetic mapping in C. elegans consists of constructing a hermaphrodite heterozygous for mutations of interest, and then observing the self progeny of that hermaphrodite for recombination between the mutations. To map a new mutation, first the linkage group containing the mutation is determined. This is done by determining its linkage to known marker mutations. First, a hermaphrodite is constructed that is heterozygous for the mutation of interest and a morphological or behavioral mutation of known linkage. For example, a male carrying the new mutation may be mated to a marked hermaphrodite. The heterozygous hermaphrodite cross progeny are then allowed to self.
The frequency with which the double homozygote is present among the self progeny reveals whether the two mutations are linked. If they are unlinked (meaning probably on different chromosomes) the frequency of the double homozygote is 1/16 (1/4 of the animals homozygous for the marker mutation will also be homozygous for the unknown mutation). If the two mutations are linked, the frequency of the double homozygote is much lower. This test is carried out with markers for each of the six linkage groups until linkage is found. Once the linkage group of the new mutation is known, its position on the linkage group is determined. In a three factor cross, segregation from a hermaphrodite carrying two known mutations on one chromosome and the unknown mutation on the homologous chromosome is analyzed. Animals carrying a chromosome recombinant for the known mutations are isolated, and the presence or absence of the unknown mutation on the recombinant chromosome is established. In this way, the location of the unknown mutation is determined to be to the left of, inside of, or to the right of the interval defined by the known mutations. If it lies inside the interval, its position within the interval can be determined from the ratios of genotypes among the recombinants. In a two factor cross, the recombination distance between the new mutation and a known mutation is determined. This is done by analyzing the frequency of recombinants among the progeny of a hermaphrodite that is heterozygous for a cisdouble mutant chromosome, that is, a chromosome bearing both the unknown mutation and a known mutation. The cisdouble is conveniently obtained as a segregant from a three-factor cross.
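
The linkage test described above can be made concrete with a short calculation. The Python sketch below assumes a hermaphrodite carrying the two mutations in trans (a + / + b), a recombination fraction p that is the same in sperm and oocytes, and fully viable progeny; these are simplifying assumptions for illustration, and the function names are hypothetical. Under these assumptions the double homozygote arises only from two recombinant "ab" gametes, so its expected frequency is (p/2)^2, which gives the 1/16 quoted in the text when p = 1/2.

```python
from fractions import Fraction
import math

def gamete_frequencies(p):
    """Gamete classes from a trans heterozygote (a + / + b) with recombination fraction p."""
    p = Fraction(p)
    return {
        "a+": (1 - p) / 2,  # parental
        "+b": (1 - p) / 2,  # parental
        "ab": p / 2,        # recombinant
        "++": p / 2,        # recombinant
    }

def double_homozygote_fraction(p):
    """Expected fraction of self progeny that are ab/ab."""
    g = gamete_frequencies(p)
    return g["ab"] ** 2

def estimate_recombination_fraction(n_double, n_total):
    """Invert the relation above: p = 2 * sqrt(observed fraction of ab/ab)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 2.0 * math.sqrt(n_double / n_total)

print(double_homozygote_fraction(Fraction(1, 2)))   # unlinked case
print(double_homozygote_fraction(Fraction(1, 10)))  # linked case, p = 0.1
```

With p = 1/10 the expected frequency falls to 1/400, illustrating why a marked shortage of double homozygotes indicates linkage.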

It is possible to determine the genetic map position of a cloned gene or other DNA segment by taking advantage of the C. elegans physical map. A set of overlapping cosmid and YAC clones is available covering the entire C. elegans genome. YAC grids are available consisting of a single nitrocellulose filter onto which DNA of a representative set of YAC clones has been spotted, in order, representing the six C. elegans chromosomes. The DNA fragment to be mapped is labelled and hybridized to this filter, and the subset of overlapping YACs to which it hybridizes reveals its genetic location. The physical position of the DNA may be further refined by locating its position on available cosmids. Its genetic function may be determined in a transgenic animal constructed by microinjection of the DNA.

Drosophila melanogaster

D. melanogaster has only 4 pairs of chromosomes: 1st (or X), 2nd, 3rd, and 4th. Determining where on a linkage group a gene maps is not usually difficult. Crosses with known markers are employed and linkage or independent assortment observed among progeny. An unusual feature is the lack of meiotic recombination in males. In practice this simplifies genetic mapping, because one can breed a mutation only from the male parent and be certain no recombination has occurred, or from the female parent and be certain all the recombination occurred in one generation. Successful freezing and thawing of Drosophila is only just being developed, and most mutations are maintained in continuous culture. Special chromosomes called Balancer chromosomes have been developed, which suppress recombination and chromosome segregation such that the progeny are always genetically identical to their parents. One very important feature is the giant polytene chromosomes of the larval salivary gland cells. These are thousands of times larger than normal chromosomes and make it routine to see chromosome rearrangements under the microscope.
Labelled DNA probes can easily be hybridized to the polytene chromosomes and this allows determination of the position of a cloned sequence in the genome within a day.

Mouse

Gene mapping in the mouse may be carried out by the use of three different test populations. These are (1) conventional crosses, i.e., backcross (F1 x parent) or F2 (F1 x F1) populations, (2) recombinant inbred (RI) strains, or (3) interspecific backcrosses (ISB). If the gene has not been cloned and one must rely on a phenotype demonstrable only in protein gels, cells, or individual mice, mapping can be extremely tedious. If the phenotypic differences occur among mice of different inbred strains, particularly those involved in RI strains or ISBs (see below), then all three of the types of test populations may be usable for mapping purposes. If, however, the mutation is a newly detected one present only in progeny of the original mutant mouse, and if there are no hints of map location from existing experimental data, all options for mapping can be extremely costly in time and research funds. If highly specific DNA probes are available for the gene to be mapped, the first step is to seek a restriction enzyme that reveals an RFLP (restriction fragment length polymorphism) in tests with genomic DNA from mice of various inbred strains. This RFLP should permit the use of one or more of the three approaches. If no RFLP can be identified, it is possible to analyze a set of clones of interspecific hybrid cells. Progeny of the fusion of a hamster and a mouse cell begin with complete chromosomal complements from both parents, but they gradually lose most of the mouse chromosomes. There exist sets of

clones derived from such hamster/mouse hybrids in which each clone retains only one or two mouse chromosomes. The probe will hybridize only with DNA from clones with the mouse chromosome bearing the gene to be mapped, and the gene is said to be syntenic with that chromosome. Note that the homologous hamster gene may also hybridize with the probe, but it will usually produce a restriction fragment of a different size from that of the mouse fragment. While synteny can usually be established in this way, it is only rarely possible to place the gene on a particular portion of the chromosome by this method, e.g., when one of the clones contains a chromosome with a translocation. A more technically demanding method is physical mapping by hybridization of a radioactive probe to spreads of banded chromosomes. This procedure allows identification of the chromosome carrying the gene and gives a rough indication of its position on that chromosome. The procedure is much more difficult than with Drosophila, and less accurate, because mouse chromosomes are not polytene and have fewer bands.

Conventional crosses. Backcross: P1 (AB/AB) x P2 (ab/ab) gives F1 (AB/ab); F1 x P2 results in 4 phenotypic combinations, AB, Ab, aB and ab, in frequencies ranging from 1:1:1:1 (no linkage) to 2:0:0:2 (tight linkage); the recombination frequency is the percentage of mice with recombinant phenotypes (Ab and aB) in the total backcross population. F2: P1 (Ab/Ab) x P2 (aB/aB) gives F1 (Ab/aB); F1 x F1 results in the same four phenotypic combinations in frequencies ranging from 9:3:3:1 (no linkage) to 8:4:4:0 (tight linkage, assuming complete dominance); the recombination frequency is still a function of these ratios, but they must be converted into the recombination frequency using mathematical formulae.

RI strains. Several sets of RI strains are available. Existing RI sets have been typed for an enormous number of allelic differences.
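
The backcross calculation described under conventional crosses above amounts to simple counting. The Python sketch below is a minimal illustration, assuming the four phenotype classes are labelled AB, Ab, aB and ab as in the text and that the F1 was AB/ab, so that Ab and aB are the recombinant classes. The function name and the progeny counts are hypothetical.

```python
def backcross_recombination_percent(counts):
    """Percent recombinants among backcross progeny of an AB/ab F1 x ab/ab parent.

    `counts` maps each phenotype class ('AB', 'Ab', 'aB', 'ab') to a number of mice.
    """
    classes = ("AB", "Ab", "aB", "ab")
    total = sum(counts[c] for c in classes)
    if total == 0:
        raise ValueError("no progeny scored")
    recombinant = counts["Ab"] + counts["aB"]  # recombinant phenotypes
    return 100.0 * recombinant / total

# Hypothetical progeny counts
linked = {"AB": 45, "Ab": 5, "aB": 5, "ab": 45}
unlinked = {"AB": 25, "Ab": 25, "aB": 25, "ab": 25}
print(backcross_recombination_percent(linked))
print(backcross_recombination_percent(unlinked))
```

A 1:1:1:1 distribution yields 50% recombinants (no linkage), while the value approaches 0% as the distribution approaches 2:0:0:2.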
Careful comparison of the strain distribution patterns (SDP) for the gene to be mapped with other known SDPs often produces a quite precise ordering of the gene with respect to nearby genes.

ISB. In any given chromosomal segment, RFLPs and other types of DNA polymorphisms are more likely to occur between individuals of different species than between individuals of the same species. Although interspecies hybrids are often sterile, hybrid females from crosses of the laboratory mouse (Mus musculus) and a related species (M. spretus) are fertile, and backcrosses of the hybrid to M. musculus males can readily be obtained in large numbers. There exist sets of genomic DNAs from each of >100 individual mice of such an ISB that have already been typed for many DNA polymorphisms. If it has not been possible to identify a suitable polymorphism among mice of different inbred strains, chances are good that one can be found between the two mouse species. Testing these DNAs with the new RFLP makes it possible to compare its segregation pattern among the ISB DNAs with those of other markers in essentially the same manner used for RI strains.

Humans

A major effort, the Human Genome Project, was undertaken to obtain detailed physical and genetic maps and the complete nucleotide sequence of the human genome. Analysis and annotation of this sequence will eventually identify all of the estimated 50,000 human genes. Such an accomplishment will enhance investigators' ability to isolate distinct genes, particularly those in which mutations are responsible for human diseases. Many of the techniques described above for physical and genetic mapping in lower organisms are applicable

to humans, with the obvious exception of experimental crosses. Traditionally, human genes have been cloned by isolating the encoded protein and using this information to screen libraries with antibodies or oligonucleotide probes. When cloned nucleic acid probes are then available, standard approaches toward physical mapping may be carried out, including somatic cell hybrid analyses and, more recently, in situ hybridization techniques. Genetic mapping, as in other organisms, relies upon the frequency of recombination between various genetic loci, i.e., genetic linkage analysis. The human genome comprises approximately 3000 centiMorgans, where 1 cM is defined as the genetic length over which one observes recombination 1% of the time. Assuming a haploid genome of ~3 x 10^9 bp, 1 cM corresponds to approximately 1 million base pairs. A genetic linkage map allows one to clone genes by virtue of a distinct phenotype or trait resulting from a mutation, even if nothing at all is known about the protein encoded by the gene. As opposed to physical mapping, this approach, known as positional cloning, requires only that the phenotype be linked to some polymorphic marker. As in lower organisms, the creation of a useful genetic linkage map depends upon the existence of polymorphic loci distributed throughout the genome. Historically, the first polymorphisms which provided a suitable approach for large scale genetic mapping in humans were based upon restriction fragment length polymorphisms (RFLPs). However, RFLPs are not found with sufficient frequency to saturate the human genome. More recently, other types of polymorphisms have become popular, including mini-satellite DNAs or variable number of tandem repeats (VNTRs), and micro-satellites, particularly "CA" repeats. Regions of DNA containing (CA)n, where the number of repeats (n) is highly polymorphic, are dispersed throughout the genome.
These show a high degree of heterozygosity and are inherited in typical Mendelian fashion. By identifying the sequences which flank various "CA" repeats, polymerase chain reaction (PCR) primers can be designed which amplify fragments of differing sizes, depending upon "n". The number of PCR primer pairs which uniquely amplify distinct "CA" repeats is constantly growing. Using these and other polymorphic loci distributed throughout the genome, a highly detailed genetic linkage map of the human genome is being compiled. There are now several thousand such highly polymorphic loci which are distributed throughout the human genome, with markers spaced at less than 5 cM. Thus, finding tight linkage between a phenotypic trait and some "CA" repeat or other polymorphic locus becomes increasingly probable. In addition to identifying polymorphic loci, PCR primers that amplify distinct segments of genomic DNA also provide an approach to physical mapping and eventually isolation of the gene of interest. Once a PCR primer pair is found which identifies a polymorphism that is tightly linked to the phenotype of interest, genomic DNA libraries can be screened using the same PCR primer pair. Clones which are identified by definition contain genomic DNA which is also tightly linked to the gene of interest. There are now several methods which permit isolation of genes or parts of genes from genomic DNA. Most prominent among these methods are conventional methods of screening cDNA libraries and newer methods such as cDNA selection by affinity hybridization and exon-amplification. Each of the genes identified in this fashion would be considered a candidate for the disease locus being studied. Based upon the properties of various candidate genes, such as their patterns of expression, the nature of the encoded proteins, or identifiable mutations, where the phenotype is some disease state, the gene of interest can be unambiguously identified. As the density of genetic and physical markers increases, maps which incorporate all types of markers (integrated maps) are emerging. These maps are facilitating isolation of genes for diseases which are inherited in a simple Mendelian fashion. These maps are expected to help in identifying genes in complex human genetic diseases.

The Physical Characteristics of Genomes

Genomes Consist of DNA Molecules, and Vary Widely in Size

The holy grail of classical geneticists was to understand the physical structure of a gene and how this structure allowed it to carry out its two functions: to determine the characteristics of the organism and to transmit those characteristics to the next generation. By the time the molecular structure of the genetic molecule, DNA, was determined by Watson and Crick in 1953, so much was known about the properties of genes from classical genetic studies that it was immediately apparent from the DNA structure how in a general way these two functions were carried out: the information for the organism was present in the form of a code, and the information was replicated by base pair complementarity. Since that time many biologists have concentrated their efforts on determining the precise code for particular organisms, and on understanding how the code is read out, implemented, and transmitted. The total information for an organism is contained in its genome, comprising the nuclear and organellar (mitochondrial, chloroplast) chromosomes. Each nuclear chromosome consists of a single DNA molecule held within a protein scaffolding. There may be from one to over a hundred chromosomes in the nucleus, depending on the organism. The genomes of organisms that live independently range in size from approximately 10^6 base pairs in bacteria to over 10^11 base pairs in some amphibians.
The genomes of "quasi" organisms, such as viruses, that utilize the cellular machinery of another organism to replicate, can be much smaller. Small viral genomes, such as those of retroviruses, certain tumor viruses such as SV40, or bacteriophages such as φX174, are as small as 5,000 base pairs. Some biologists even view transposable elements as a kind of organism, termed "selfish DNA", that perpetuate their own existence within host genomes. Transposable elements may be as short as 1,000 bases and encode just a single gene.

Bacterial Genomes Contain Some 4300 Genes, Higher Organisms May Have As Many As 30,000 or More

Because genes may be tightly packed, even overlapping, on the DNA molecule, or widely separated by "junk" sequences, the number of genes in the genome does not necessarily correlate with the amount of DNA. In general, genes are more densely packed in the genomes of prokaryotes than in those of eukaryotes. The bacteriophage λ genome, 50,000 base pairs, contains some 50 genes, or about one gene per 1000 base pairs (kilobase pairs, or kb). The determination of the complete nucleotide sequence of the λ chromosome, in 1982, was a landmark achievement in the analysis of genome structure. The frequency of genes in the E. coli genome, which is estimated to contain 4,300 genes in 4.7 Mb of DNA, is somewhat lower. The number of genes in the genomes of higher organisms has been the subject of much debate and speculation. Prior to the

direct sequencing of large genomes, two approaches were taken to estimate the gene number. A genetic approach was to estimate the total number of genes from the frequency of lethal mutations obtained upon mutagenesis. Estimates obtained by this approach were too low for at least two reasons: many genes may encode non-essential products, and many essential products may be redundantly encoded. A biochemical approach to determine the number of genes measured the rate of annealing of mRNA to DNA. By this method it is possible to make an estimate of the complexity of the mixture. Complexity is defined as the total number of nonrepeating sequences within a mixture of nucleic acids. By means of such kinetic measurements, it was estimated that mammalian genomes contained some tens of thousands of expressed genes.

Genome Projects

Current Methods Make the Sequencing of Whole Genomes Possible

Until the mid-1990s, the genetic and physical study of the genetic make-up of organisms proceeded in a piecemeal fashion. Genes and genetic loci were studied one at a time, as they became relevant to a particular research topic or project. However, with the advent of cloning vectors that can contain much larger inserts of chromosomal DNA and improved sequencing technologies, the sequencing of entire genomes has become feasible. In this approach, the complete DNA sequence of an organism is determined, and all of the genes potentially identified by computer analysis of the sequence. By carrying out all of the cloning and sequencing at once in a unified project, the goal of obtaining a complete sequence is reached much more rapidly, allowing investigators to concentrate on analyzing the integrated function of the genes in the life of the organism.

Construction of Physical Maps: Overlapping Clones.

The complete DNA sequence of genomes is obtained by the automated determination and analysis of vast quantities of DNA sequence.
However, before this can be done, the genome must first be obtained in fragments of sequencable length (a few hundred to a few thousand base pairs) whose relationship to one another is known. For this purpose, a complete physical map is constructed. This consists of overlapping cloned fragments of the genome, usually in cosmid, P1, BAC, or YAC vectors. As such a map is being constructed, overlapping groups of contiguous fragments, termed contigs, are built up. Contigs are progressively joined to each other as more and more fragments are mapped, until, when the physical map is completed, the number of contigs equals the number of chromosomes. The Physical Map Is Correlated with the Genetic Map The availability of complete genomic sequences allows correlation of the physical sequence with genetic markers and the use of mutants to understand the function of the sequence. Physical and genetic maps are correlated in several ways. Physical sequences that differ between strains (or in lineages, e.g. of humans), resulting in "restriction fragment length polymorphisms" (RFLPs), are genetically mapped in the same way that any other genetic difference is mapped. (An RFLP results whenever a particular, detectable (eg. by hybridization to a probe) restriction fragment differs in size between two organisms. This can come about because one or

both of the restriction sites that define the fragment are mutated, because a new restriction site arises between them, because DNA has been deleted or inserted between the two sites, or because some other rearrangement has separated them.) Cloned DNA fragments may also be mapped to chromosomes and parts of chromosomes by in situ hybridization techniques. This approach is particularly powerful in Drosophila, where the genetic map is already correlated in detail with the polytene chromosome banding pattern. A third approach is the identification of functional genes on cloned DNA fragments by the complementation of known mutations (complementation rescue). This approach is particularly powerful in organisms that are easily transformed, such as yeast and C. elegans.

Eukaryotic Genomes Contain a Large Amount of Repetitive DNA.

In spite of the fact that eukaryotic genomes may have more genes than originally thought, it remains true that these genomes contain a great deal of non-coding sequence. Some of this "extra DNA" appears simply to be non-functional unique sequence. Unique sequence is DNA sequence that occurs only once in the haploid genome. Some extra DNA is accounted for by introns. Other sequences make up distinct classes of repetitive DNA. Repetitive DNA is DNA sequence present more than once per haploid genome. Repetitive DNA can make up anywhere from a small fraction to a majority of the genomic DNA of eukaryotic organisms. Typically it represents some 20% to 50%. The first evidence that genomes contained DNA apart from unique sequences came from analysis of reannealing kinetics. When genomic DNA was denatured (e.g. by heating) to cause the strands to separate, and then allowed to reanneal to the double-stranded form (at a lower temperature), the rate of reannealing was not consistent with a single kinetic component.
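
To illustrate what a single kinetic component looks like, and how a mixture departs from it, the Python sketch below uses the idealized second-order rate law for reannealing, in which the fraction of DNA remaining single-stranded is 1 / (1 + k x C0 x t), where C0 is the initial DNA concentration, t is the time of annealing, and k is a rate constant. This rate law is the standard textbook idealization and is an assumption added here, as are the function names, the rate constants, and the component fractions, all of which are hypothetical.

```python
def fraction_reannealed(c0t, k):
    """Fraction of a single sequence class that has reannealed at a given C0*t.

    Ideal second-order kinetics: the single-stranded fraction is 1 / (1 + k * c0t).
    """
    return k * c0t / (1.0 + k * c0t)

def mixture_reannealed(c0t, components):
    """Reannealed fraction for a mixture of (genome_fraction, k) components.

    Each k is taken as an effective rate constant with respect to the total C0*t.
    """
    return sum(f * fraction_reannealed(c0t, k) for f, k in components)

# A single component is half reannealed when C0*t = 1/k.
print(fraction_reannealed(0.5, 2.0))

# Hypothetical genome: 30% highly repeated (fast), 70% unique (slow).
genome = [(0.3, 1000.0), (0.7, 0.001)]
for c0t in (0.001, 1.0, 1000.0, 1e6):
    print(c0t, round(mixture_reannealed(c0t, genome), 3))
```

In this sketch the repeated fraction reanneals almost completely before the unique fraction begins to, so the curve shows two well-separated steps rather than the single sigmoid expected for one component.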
When double-stranded DNA is denatured and renatured, the rate of reannealing, like the rate of other bimolecular reactions, is dependent on concentration. In the case of double-stranded DNA, the relevant concentration is the concentration of similar or identical DNA sequences, since only these can interact to anneal. The concentration of a pair of similar sequences in a mixture of nucleic acids depends on the complexity of the mixture, that is, the number of different sequences in the mixture. When the kinetics of renaturation of eukaryotic genomic DNA was measured, it was found that much of the DNA reannealed at a rate higher than expected for unique sequences. This indicated that these sequences were repeated within the genome. In fact, there were several kinetic components, indicating sequences present from 10 times to millions of times in the genome. This kind of kinetic analysis is called Cot analysis, because the kinetic data were typically presented in a plot of percent DNA annealed versus the product of the initial DNA concentration (C0) and the time of annealing (t).

There Are Several Kinds of Repeated Sequences

The fastest kinetic component in a Cot analysis of eukaryotic DNA typically annealed essentially instantaneously, and in a concentration-independent manner. This component consists of inverted repeats. These are similar sequences joined close together and in inverted orientation, so that they reanneal in a so-called "snap-back" or "foldback" reaction. Inverted repeats are often members of other repetitive sequence

families elsewhere present as isolated repeats. The second fastest kinetic component consists of sequences present millions of times in the genome. These are simple sequences consisting of long stretches of a short repeat, such as ...ATATATATAT... (from crab) or ...AAGAGAAGAG... (from Drosophila). Such sequences are also known as satellite sequences. This stems from their behavior during density analysis of DNA. When the density of eukaryotic DNA is analyzed by buoyant density centrifugation in a CsCl density gradient, it is found to have several components of different density. The gradient profile consists of main band DNA, containing the unique sequences, including most of the genes, and satellite bands, so-called because they lie alongside the main band on the profile. The anomalous, repetitive structure of the simple sequence DNA accounts for its variant buoyant density. In some eukaryotic genomes there is little or no satellite or simple-sequence DNA, whereas in others such sequences may make up over 50% of the total. The function of satellite DNA is not known. Speculation focuses on a possible role during pairing of homologous chromosomes. The next slowest kinetic component, lying between the satellite sequences and the unique sequences in rate of annealing, consists of the so-called middle repetitive sequences. There are a great variety of such sequences. Some are genes present in multiple copies in the genome. Genes for common cellular components such as ribosomal RNA or histone proteins are often present in multiple copies. The multiple copies may be dispersed in the genome, or may be present in tandem arrays at a single locus. Non-functional, corrupted (mutated) copies of genes, called pseudogenes, make up another component of the middle repetitive DNA. These sequences may have arisen in a duplication event, or by reverse transcription of an RNA copy of the gene, followed by insertion of the DNA copy into the genome.
DNA copies of mRNAs, known as processed pseudogenes, are characterized by the presence of polyA tails and the absence of introns. This indicates their origin from reverse transcription of cellular mRNA, followed by insertion of the DNA copy into the genome. Between 5% and 10% of the human genome is made up of a large pseudogene family known as the Alu family (named after the restriction enzyme, AluI, that was first used to identify it). These 300 bp repeats, present hundreds of thousands of times in the genome, probably originated as DNA copies of the short cellular RNA known as 7SL RNA. 7SL RNA functions normally as a component of the cellular mechanism that translocates newly synthesized proteins across membranes of the rough endoplasmic reticulum. Short repeats such as the Alu repeats have been dubbed SINEs, for "short, interspersed sequences".

Other families of middle repetitive sequences consist of transposable elements. These are present in all genomes and have a great variety of structures and modes of transposition. They make up the LINEs, or "long interspersed sequences", in mammalian genomes, and are in some cases related to the genomes of retroviruses. Endogenous retroviral genomes themselves are another component of the middle repetitive DNA of mammals. Transposable elements make up some 20% of the Drosophila genome.

Finally, there is a class of middle repetitive sequences that so far have eluded explanation. These sequences typically consist of a few hundred base pairs, interspersed among other sequences around the genome. They have been given the general name interspersed repeats. They make up families of anywhere from a few to hundreds of thousands of members, and there are typically hundreds to thousands of families in eukaryotic genomes. They usually account for a large proportion of the middle repetitive DNA. In spite of their prevalence and ubiquity, the origin and function of these interspersed repeats remain a mystery. While they certainly have an origin, the suspicion is that they have no function. They are the ultimate junk DNA.

Maintenance and transmission of the genetic material

Special Sequences Control the Replication and Transmission of the Genetic Material

Most organisms use DNA as their genetic material. The exceptions are some viruses that use RNA. The symmetry of DNA permits replication by polymerases to create two exact copies of the genetic material. One mechanism of replication involves initiation of synthesis at a single point, the origin of replication, and replication to completion. Many bacteria, plasmids and viruses replicate in this fashion. Another mechanism involves initiation of DNA synthesis at many points on the genome and synthesis until the replication forks meet. There may or may not be origins of replication that are used during every round of replication. Eukaryotes use multiple origins on a single DNA molecule. Eukaryotes also have linear genomes, which require ends with special structures, called telomeres, both for protection of the DNA and to permit the end to be correctly replicated. Telomeres have a unique physical structure that includes multiple short DNA repeats with nicks and a capping hairpin structure.

Once a genome has been replicated, each copy must be accurately partitioned into the two daughter cells. For the bacterial circular genome and for some plasmids this is accomplished by having a partition sequence in the DNA near the origin of replication. These sequences attach to regions of the cell wall that grow apart during cell division, dragging the two newly replicated genomes apart.
For some plasmids and for the plasmid-like DNA of mitochondria and chloroplasts the genome is maintained in multiple copies and the cell depends at least partly on statistics to ensure that each daughter cell or organelle gets at least one copy of the genome. Other mechanisms then ensure amplification of the genome.

Eukaryotes generally have their genomes distributed on several chromosomes and thus have special problems in assuring that each daughter cell gets exactly the right set of chromosomes after replication. A special structure, the centromere, and attached cytoskeletal machinery, the mitotic apparatus (mitotic spindle), ensure accurate segregation of chromosomes. During meiosis, in which a diploid cell undergoes reductive divisions to yield haploid cells, synapsis, or pairing of homologous chromosomes, and a unique meiotic apparatus are required to ensure that haploid gametes get exactly one of each chromosome.

Enzymatic Mechanisms Repair DNA Damage and Recombine the DNA Strands

As the genetic material, DNA is precious and must be protected from damage. Ultraviolet light, ionizing radiation, and DNA modifying chemicals can damage DNA. Many mechanisms exist to repair damage that occurs. Excision repair pathways exploit the fact that two copies of genetic information are stored in the two strands of DNA. Damaged bases can be removed on one strand and then recopied from the other. Recombinational repair mechanisms work by shuffling damaged and undamaged segments that are present in more than one copy in the cell to try to put together a 'good' genome.

Even in the absence of detectable DNA damage, DNA sequences may 'recombine'. Homologous recombination is at the heart of both classical genetics and modern "gene-targeting". The mechanism of such recombination or cross-over events is controversial and probably varies according to the organism, but involves breaks in DNA, unwinding of strands, hybridization to homologous segments of DNA, new DNA synthesis, and endonuclease strand cleavage. The net result is equivalent to a physical cleavage of DNA and rejoining to a different partner. Some transposons and viruses catalyze recombination events that involve specific DNA sequences that may not be homologous (or are homologous for only a few bases).

Recombinant DNA and the Construction of Transgenic Organisms

Genes May Be Amplified in Pure Form by "Cloning" Them in Microorganisms

Early genetics was dependent on naturally occurring mechanisms for the study of genetic function. In the 1970's techniques were developed to manipulate DNA in vitro and move it across species boundaries. These cloning techniques rely on enzymes that work on DNA. Restriction endonucleases (commonly called restriction enzymes) cut DNA at specific sequences, often palindromic sequences. (A "palindrome" is a word or sentence that reads the same forwards or backwards, like "A man, a plan, a canal, Panama.", or "Madam, I'm Adam.".) For example the restriction enzyme BamHI cuts at GGATCC, which reads the same 5' to 3' on either strand. BamHI is called a 6-cutter because its recognition sequence is six bases long. On average one expects a specific six-base sequence like GGATCC to occur once every 4 kb of DNA (4^6 = 4,096 bp), but of course some fragments are much bigger or smaller. Furthermore, the average size depends on the GC/AT content of the DNA being cut, and the relative numbers of G or C vs A or T nucleotides in the restriction site. The size of DNA fragments can be determined on agarose gels. About 150 restriction enzymes with different recognition sequences are available commercially.
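The arithmetic behind the "once every 4 kb" estimate, and its dependence on base composition, can be sketched in a few lines of Python. This is only an illustration under simple assumptions: bases are treated as independent, G and C are assumed equally frequent, as are A and T, and the function names and the short test sequence are hypothetical. The cut position used for BamHI (after the first G of GGATCC) is the standard one.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def is_palindromic_site(site: str) -> bool:
    """A site is 'palindromic' in the DNA sense if it equals its reverse complement."""
    return site.upper() == reverse_complement(site)


def expected_spacing(site: str, gc_content: float = 0.5) -> float:
    """Average distance in bp between occurrences of `site` in random DNA.

    Assumes independent bases, with G and C each at gc_content / 2 and
    A and T each at (1 - gc_content) / 2.
    """
    p_gc = gc_content / 2.0
    p_at = (1.0 - gc_content) / 2.0
    probability = 1.0
    for base in site.upper():
        probability *= p_gc if base in "GC" else p_at
    return 1.0 / probability


def digest_linear(seq: str, site: str, cut_offset: int) -> list:
    """Fragment lengths after cutting a linear molecule at every `site`.

    `cut_offset` is the number of bases from the start of the site to the cut.
    """
    seq, site = seq.upper(), site.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    print(is_palindromic_site("GGATCC"))
    print(expected_spacing("GGATCC"))        # equal base frequencies
    print(expected_spacing("GGATCC", 0.4))   # AT-rich DNA: site is rarer
    print(digest_linear("AAAGGATCCTTTTGGATCCAA", "GGATCC", 1))
```

With equal base frequencies the expected spacing is 4^6 bp. In AT-rich DNA a site with four G or C positions becomes rarer, so the average fragment gets longer.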
The position of restriction sites in a piece of DNA can be determined, giving a restriction map useful for subsequent manipulations. Fragments of DNA can be joined to one another by another enzyme, DNA ligase. Together, the ability to cut DNA, separate fragments by size, and then rejoin them in a new combination in vitro forms the basis for the powerful cloning technologies.

Though new DNA molecules can be made in vitro, the yield is usually low. However, by cloning DNA into a vector capable of replication, the recombinant DNA can be amplified in vivo. Furthermore, by placing the recombinant DNA into a microorganism, a single defined segment of a large genome can be separated from the remainder of the genome simply by selecting a clone of organisms, like a bacterial colony or a phage plaque. This is the origin of the term "cloning". Depending on the vector being used, a variety of methods are then available for separating the vector plus insert from the host microorganism's own DNA.

Common sources of vector DNA are viruses and plasmids that are capable of replication in E. coli. E. coli is a useful host for amplifying DNA since it is easy to grow to high density (2 × 10^9 cells/ml) and has relatively little DNA of its own. Some virus vectors (e.g. the filamentous phage M13) only infect certain strains of E. coli. Vectors such as yeast YACs and mammalian retroviral expression vectors are shuttle vectors that can replicate in both E. coli and eukaryotic cells. Often bulk recombinant DNA is made in E. coli and an experiment is done in another organism.

An example of vector cloning follows. Bacteriophage λ has a genome of about 50 kb. A region containing about 1/3 of the genome serves no function (the so-called financial district) and can be replaced by other DNA, including E. coli or foreign DNA. These clones can replicate just like the original bacterial virus, but now whenever they duplicate they also duplicate the inserted DNA.

For the most part DNA is DNA and can be moved from organism to organism without problems. However there are three common problems in transferring DNA that we will discuss briefly. 1) DNA can be modified in ways that affect function. 2) DNA can contain sequences that get rearranged in some organisms. 3) A cloned sequence may make a protein toxic to some cells.

DNA modifications are common and include sequence-specific methylation of bases. These methylations can affect gene function and resistance to digestion with specific restriction enzymes. Strains of E. coli that lack many of the offending DNA methylases (e.g. mcrA) have been constructed. Also, some strains of E. coli make restriction enzymes that destroy unmodified DNA. Take care! Often a specific cloning project requires a specific host strain that modifies or does not modify DNA (see the New England Biolabs catalog or Molecular Cloning for details).

E. coli does not like DNA containing short direct repeats or inverted repeats, both of which tend to get deleted from cloned fragments by the very active E. coli recombination pathway. This can be a problem when cloning eukaryotic DNA, in which such structures are common. E. coli host strains that are defective in recombinases (like recA) are helpful but do not completely solve the problem. E. coli vectors do not tolerate more than about 20 kb of DNA. Yeast artificial chromosomes (YACs) are useful for cloning up to 400 kb of DNA. As the name implies, YACs are grown in yeast.
Eukaryotic cells such as yeast are tolerant of repeated DNA, and hence repetitive sequences that cannot be cloned in E. coli can often be cloned in a YAC.

Cloned DNA may express proteins that kill a specific host. For example, even though eukaryotic promoters and introns do not function in E. coli, often a polypeptide derived from one exon will be expressed in E. coli. E. coli is especially sensitive to hydrophobic proteins that interfere with secretion (a secA strain may tolerate such clones) and to DNA binding proteins.

The Polymerase Chain Reaction (PCR) Is a Way to "Clone" DNA Directly In Vitro

Instead of amplifying a defined DNA segment by ligating it to a vector and introducing it into a microorganism, it is possible to amplify it enzymatically by the polymerase chain reaction (PCR). In PCR, the DNA segment between two short (15 to 30 nucleotides long) single-stranded oligonucleotide primers is copied by a primer-dependent DNA polymerase. The polymerase used is from a thermophilic bacterium. This makes it possible to carry out many cycles of synthesis automatically by alternately heating the reaction mixture to melt all DNA strands (the polymerase is not inactivated by the high temperature required for this), and then cooling it to allow the primers to anneal and the polymerase to function by extending them. With each successive cycle of melting and replication, the amount of the DNA segment between the two primers increases exponentially, as the product of each synthetic round serves as template in the next.

PCR can be extremely specific and sensitive. Specificity is provided if each of the primers anneals to only the single, intended sequence. In 30 cycles of polymerization, biochemically detectable and useful quantities of a sequence from 50 to 5000 bases in length can be amplified from tiny amounts of complex mixtures, such as the genomic DNA of vertebrates. The synthetic product can subsequently be sequenced, used as a labeled probe, or cloned for further in vitro modification.

Genes Are Cloned by Isolating Them from Clone Libraries or Clone Banks

The fundamental importance of gene cloning is that it allows the purification of a single gene out of the thousands or tens of thousands present in the genomes of complex organisms. To accomplish this feat, it is first necessary to introduce all the genes of the organism under study into a culture of microorganisms. The task is then to identify a clone of the microorganism that contains the single gene of interest. The mixed culture of microorganisms is termed a clone library or clone bank.

Clone libraries are generally one of two types. Genomic libraries carry fragments representing the entire genomic DNA of an organism. cDNA libraries contain DNA copies of the organism's RNA. cDNA (complementary DNA) is made with a retroviral reverse transcriptase enzyme. This enzyme makes a DNA copy of an RNA template by extending an annealed primer. cDNA copies of cellular mRNAs are typically made by reverse transcription from an oligo dT primer, which anneals to the polyA tails of mRNA molecules.

A Variety of Vectors Provide a Range of Options for the Generation of a Clone Library

There are several types of libraries that can be constructed, depending on the type of vector being used. Vectors differ in the amount of DNA they conveniently carry, whether or not they express any foreign genes they carry, and with respect to many diverse technical characteristics that govern the way they can be manipulated in the lab. The most widely used host microorganism is E. coli, although recombinant DNA experiments need not be confined to this host.
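As a numerical aside on the polymerase chain reaction described earlier in this section, the exponential growth of the amplified segment can be sketched as follows. The model is idealized: it assumes every template is copied in every cycle unless a lower per-cycle efficiency is supplied. The `efficiency` parameter and the function name are illustrative assumptions, not part of the original text.

```python
def pcr_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized number of copies of the target segment after `cycles` rounds.

    efficiency = 1.0 means perfect doubling in each cycle; values between
    0 and 1 model incomplete extension (an assumption for illustration).
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    for n in (10, 20, 30):
        print(n, pcr_copies(1, n))
    # A less-than-perfect reaction still amplifies enormously over 30 cycles.
    print(pcr_copies(1, 30, efficiency=0.8))
```

Under perfect doubling, 30 cycles multiply a single template roughly a billion-fold (2^30), which is why a sequence can be recovered from a tiny amount of a complex mixture such as vertebrate genomic DNA.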
Plasmid vectors conveniently carry inserts up to 5 kb or 10 kb in length. They typically contain a polylinker cloning site containing a number of restriction enzyme sites where foreign DNA can be inserted. They contain a selectable marker, usually a drug resistance gene, so that bacterial cells containing them can be selected. The vector must also contain an origin of replication, typically from the colicin plasmid ColE1, which has the property that it allows the plasmid to replicate in high copy number within the cell. Present-day plasmid vectors are also usually constructed so that when an insert of foreign DNA is present, the expression of a plasmid-borne gene, usually the gene for E. coli β-galactosidase (lacZ), is disrupted. The activity of β-galactosidase is easily assayed in a colony of cells with a colorimetric assay. Colonies containing a recombinant plasmid lack the enzyme and so are white under conditions where colonies containing the vector alone are blue.

Phage vectors can be based on phage λ, M13, f1, P1, and so forth. Phage vectors carry up to 20 kb inserts and are often used to construct clone libraries, because plaques are easier to screen in large numbers than bacterial colonies. Engineered vectors are available that allow quick recovery of the insert as a plasmid.

The size of insert conveniently cloned in a plasmid vector is limited only by the technical difficulty encountered in ligating a large DNA fragment into the vector cloning site. The amount of DNA clonable in a phage vector is limited by the amount of DNA that can be stuffed into a phage head.

A form of plasmid vector designed to increase these limits is the cosmid. A cosmid is a plasmid vector which carries, in addition to a polylinker site, a selectable marker, and an origin of replication, the cos site of phage λ. The cos site of phage λ is the site on the intracellular, circular form of the chromosome which is cut to produce the cohesive ends of the mature, linear phage chromosome. In generating recombinant DNA molecules, advantage is taken of the cos site as follows. First, a mixture of vector and insert fragments is ligated together in the usual way. The process is random, and a mixture of homo- and hetero-dimers and higher multimers of vector and insert fragments is generated. Next, the ligation mixture is treated with an extract that contains all of the components phage λ uses to package its chromosome into phage particles. In this packaging process, approximately 50 kb DNA segments lying between cos sites are cut out and assembled into infectious phage particles, which are then used to inject the recombinant molecules into cells. Upon injection, the molecules circularize and replicate as plasmids. Since the cosmid vector itself is only about 5 kb, clones containing inserts of between 30 kb (the minimum necessary amount to package) and 45 kb are readily isolated. A similar system based on phage P1 has a cloning capacity of 75 kb to 100 kb.

A third type of vector, known as a phagemid, which also has both plasmid and phage properties, is based on the filamentous bacteriophage M13. The advantage of an M13-based vector is that from the double-stranded, plasmid form single-stranded copies can be continuously produced and harvested in phage particles. Single-stranded DNA makes an excellent template for DNA sequencing.

The largest recombinant DNA clones are generated in yeast. Yeast is a microorganism like E. coli and has the same technical advantages for cloning: it can be easily grown in enormous numbers, and clones are simply made by selecting a single colony on a plate. However, yeast is a eukaryote. Its chromosomes have centromeres and telomeres like the chromosomes of other eukaryotes. Yeast artificial chromosomes, termed YACs, can be constructed that contain inserts of foreign DNA of many megabases. A typical YAC vector contains a yeast centromere and telomeres and a gene that allows selection in yeast. Yeast has the additional advantage that, as a eukaryote, it is sensitive to a different set of foreign DNA sequences than E. coli. Often, a segment of DNA that is lethal to E. coli may be harmless to yeast. Likewise, repeats common in eukaryotic DNA are often unstable in E. coli but are tolerated by yeast.

Clone Libraries May Be Screened in a Number of Ways

Once a clone library containing the genes of an organism is constructed, a single clone containing the gene of interest must be found. A number of techniques are available for doing this. One of the most direct is to use a radioactive nucleic acid probe. DNA in colonies of cells or in plaques is denatured and transferred to a nitrocellulose filter. This is the same technique by which whole genomic DNA is immobilized on a filter for a Southern hybridization. (In a Southern hybridization, genomic DNA cut with a restriction enzyme is fractionated on an electrophoretic gel, transferred to nitrocellulose, and hybridized to a probe.) The DNA on the filter is then allowed to anneal to a single-stranded radioactive probe similar or identical in sequence to the desired gene. Only colonies or plaques containing sequences similar to the probe sequence become radioactive.

Of course, this screening technique requires the availability of a probe for the sequence of interest. Possibly a similar gene from another organism may be available. Or perhaps a fragment of the gene may have been isolated, by reverse transcription of mRNA (cDNA), or from a PCR reaction. A probe may be generated if information is available about the amino acid sequence of the protein product of the gene. A segment of the amino acid sequence is "back translated" into a sequence of nucleotides, and an oligonucleotide with the desired sequence is synthesized and used as probe. Unfortunately, because of the redundancy of the genetic code, a unique nucleic acid sequence is not specified by a given amino acid sequence. Usually a mixture of oligonucleotides is generated with various nucleotides at the ambiguous positions. Fortunately, most eukaryotes exhibit considerable bias in their use of codons, and hence oligonucleotides with the most likely codons can be synthesized.

A second approach for isolating a clone of microorganisms containing a gene of interest is to select for activity of the gene. This is particularly applicable for isolating genes of the host microorganism itself. By transforming an E. coli or yeast mutant with wild type E. coli or yeast recombinant DNA, it is straightforward to identify wild type colonies in which the mutation is complemented by the cloned DNA. Sometimes such complementation works even with DNA from a different organism. Thus, for example, one of the first eukaryotic genes cloned in E. coli was a gene for a histidine biosynthetic enzyme of yeast, which was found to allow growth of an E. coli his- auxotroph in the absence of histidine (1976).

A third approach for screening a clone library takes advantage of antibodies against the protein product of a gene. Vectors are available that result in the transcription and translation of any open reading frames that may happen to be present in the cloned insert DNA.
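The back-translation step described above for oligonucleotide probes can be made concrete with a short sketch. It counts how many different nucleotide sequences could encode a given peptide under the standard genetic code, and picks the stretch of a protein that requires the least-degenerate oligonucleotide mixture. The function names and the example protein are hypothetical. Codon bias, which the text notes can reduce the mixture further, is not modeled here.

```python
# Number of codons for each amino acid in the standard genetic code
# (one-letter amino acid symbols).
CODONS_PER_AMINO_ACID = {
    "M": 1, "W": 1,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2, "N": 2, "Q": 2, "Y": 2,
    "I": 3,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "L": 6, "R": 6, "S": 6,
}


def degeneracy(peptide: str) -> int:
    """Number of distinct nucleotide sequences that encode `peptide`."""
    total = 1
    for amino_acid in peptide.upper():
        total *= CODONS_PER_AMINO_ACID[amino_acid]
    return total


def best_probe_window(protein: str, length: int):
    """Return (start, peptide, degeneracy) for the least-degenerate window.

    `length` is in amino acids, so the oligonucleotide is 3 * length bases.
    Ties are resolved in favor of the earliest window.
    """
    protein = protein.upper()
    if length < 1 or length > len(protein):
        raise ValueError("window length must fit within the protein")
    best = None
    for start in range(len(protein) - length + 1):
        window = protein[start:start + length]
        score = degeneracy(window)
        if best is None or score < best[2]:
            best = (start, window, score)
    return best


if __name__ == "__main__":
    example_protein = "LSRMWKHFLA"  # hypothetical sequence, for illustration only
    print(degeneracy("LSR"))
    print(best_probe_window(example_protein, 5))
```

Stretches rich in methionine and tryptophan, which each have a single codon, give the smallest mixtures. Stretches rich in leucine, serine, or arginine, with six codons each, are the worst choices for a probe.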
If cDNA copies of the mRNAs of an organism are cloned into such an expression vector, then the plaques or colonies of the clone library will express a representation of the expressed protein products of the organism. This technique works even with cloned genomic DNA inserts. These are randomly expressed by the E. coli host, and antibodies are used to find the colony or plaque expressing a cross-reacting exon or even a fragment of an exon.

Constructing Transgenic Organisms

Central to the technique of recombinant DNA is the capability of introducing purified DNA into a suitable microbial host, such as E. coli or yeast. In fact, techniques for introducing foreign DNA into a wide range of organisms have been developed. The possibility of studying genes in transgenic organisms (organisms containing DNA of another species artificially introduced) has opened the way to analysis of many important questions about gene function, not to mention the possibility of gene therapy in humans.

Perhaps the most remarkable result is ultimately how easy it is to introduce foreign DNA into the nucleus in a functional state. Cells evidently have mechanisms for recognizing DNA, and for ensuring that this DNA is nuclear, is replicated, is integrated into chromosomes, and so forth. As a result, a great variety of approaches turn out to be successful in producing transformed organisms. Many bacteria have mechanisms for taking up DNA from the medium (transformation). Others can be treated in various ways, for example with calcium and heat, and made competent to take up DNA. Mammalian cells take up a calcium precipitate of DNA. Yeast cells are transformable as spheroplasts, which are yeast cells with the cell wall removed. DNA may be introduced into cells by a process called electroporation, or by incorporation into liposomes which are fused to cells. Plants are transformed by infecting cells with a plasmid, known as a Ti plasmid, which integrates into the plant genome. A particularly powerful approach with plants is to transform a so-called callus culture of plant cells, which are then allowed to regrow into a plant.

DNA can be introduced into organisms such as C. elegans and Drosophila by microinjection. C. elegans is transformed by injecting DNA into the syncytial ovary. Such DNA is incorporated into nuclei, ligated together, and replicated as long extrachromosomal concatemers containing hundreds of copies of the injected sequence. Centromeric sequences, telomeric sequences, and specific origins of replication are not necessary for these concatemers to be replicated and segregated at mitosis and meiosis. If a DNA sequence containing a dominant gene is coinjected along with the sequence of interest, successfully transformed animals can be recognized. One commonly used dominant marker encodes a mutant cuticle collagen which, when assembled into the cuticle, causes a twist in the body, recognizable as a rolling movement when the organism swims forward.

C. elegans chromosomes are holocentric, that is, they do not have a discrete centromeric sequence. In mitosis and meiosis, the spindle microtubules attach all along the chromosomal length. This property probably explains how C. elegans cells can replicate and transmit at cell division apparently any DNA sequence, such as, for example, plasmid sequences from bacteria. Transgenic animals carrying extrachromosomal arrays of transgenes are not stably transformed; they give rise to progeny that have lost the array at high frequency. Occasionally, however, C.
elegans also integrates a transgene into a chromosome, giving rise to stable transformants.

Drosophila is transformed by microinjection of DNA into embryos prior to cellularization. Unlike with C. elegans, such injected DNA is not inherited in the absence of a centromere. Drosophila transformation therefore requires that the injected DNA be integrated into a chromosome. To facilitate this, use is made of the transposable element P. Foreign DNA is ligated to P element sequences and injected. Transposase is expressed in the recipient, causing the injected P element to jump into a chromosome, carrying along with it the foreign sequence. P element transposition only occurs in the germline, so transformed flies are obtained among the progeny of injected flies.

Mammalian cells can also be transformed by microinjection. A micropipette is used to introduce DNA into the nucleus of an early embryonic blastomere, which takes up the foreign DNA and integrates it into a chromosome at high frequency. Another widely used approach is to culture embryonic cells and transform them in culture with a calcium precipitate of DNA. Transformed cells are then reintroduced into an early embryo, where their progeny can ultimately come to populate any tissue, including the germ line. Fully transformed animals are recovered in the next generation.

Basic Elements of Bacterial Genetics

The Genetics of Bacteria Has Several Unique Features

Bacterial genetics differs from that of eukaryotes because bacteria do not have a regular haploid/diploid sexual cycle. Bacterial cells contain a single chromosome or multiple copies of a single chromosome, and divide by simple cell division. There is no meiosis, gametes, or fertilization. Nevertheless, bacteria do have several means by which they can exchange genetic material and recombine their genes. Genetic maps can be constructed from recombination frequencies, just as they are for higher organisms. Short generation time (as little as 20 minutes) and capability of being examined in enormous numbers (up to 10^10) in fact make bacterial genetics extremely powerful.

Starting in the early 1950's, bacterial genetics was exploited to examine the nature of the gene with a depth and resolution not possible with most eukaryotes. This work resulted in the elucidation of the role of the genes in specifying the biosynthesis of proteins. It revealed the nature of the genetic code that determines the order of amino acids in proteins, as well as some of the regulatory mechanisms that govern when and at what level the protein encoded by a particular gene will be synthesized.

Bacterial Cells Exchange Genetic Material in a Process Known as Conjugation

Joshua Lederberg in the early 1950's was the first to observe genetic recombination between bacterial strains. He used E. coli strains carrying multiple mutations resulting in several nutritional requirements. By selecting for wild type growth on minimal media, he could demonstrate the production of cells of recombinant genotypes. Subsequent studies revealed that E. coli cells were of two types, termed F+ and F-, for "fertility". When mixed together, F+ and F- strains exchanged chromosomal material (DNA) in a process termed conjugation. Mixtures of only F+ or only F- cells, on the other hand, produced no recombinants. During conjugation, the chromosome of the F+ cell underwent a one-way transfer from the F+ cell into the F- cell. After transfer, the DNA of the donor cell underwent recombination with the chromosome of the recipient cell, resulting in cells of recombinant genotypes.
Further investigation showed that cultures of F+ cells contained a minority of cells primarily responsible for chromosomal transfer. When isolated in pure form, these cells gave rise to a very high frequency of recombination when mated to F- cells. They were therefore called Hfr strains, for high frequency of recombination.

The Bacterial Genetic Map Is Defined by the Time of Transfer During Conjugation

The process of one-way chromosomal transfer provided a method for constructing a genetic map that would reflect the physical order of genes on the bacterial chromosome. The chromosomal transfer process between an Hfr and an F- cell takes up to an hour to complete. If the mixture of mating cells is shaken violently during the process, the mating pairs break apart after transfer of only a piece of the chromosome of the Hfr parent. If the F- cells are then plated and tested for recombinants, the genetic markers transferred can be determined. For a given Hfr strain, it was found that certain markers were transferred in the first minutes after mixing the cells, while other markers appeared in the F- cells only later. A map could thus be constructed, starting at an origin, and plotting the number of minutes of transfer at which time various genetic markers appeared. The units of the bacterial genetic map are minutes.

The Bacterial Genetic Map and the Bacterial Chromosome Are Circular

When genetic maps were produced for different Hfr strains, it was found that each had an origin at a different place on the map. A marker transferred early from one Hfr strain might be transferred late from another. When the positions of the origins of various Hfr strains were plotted onto the map of another, it was found that the map was in fact circular, the origins of the various Hfr strains being at various points on the circle. Just as equivalence of the number of genetic linkage groups and the number of chromosomes in eukaryotes had earlier provided an important indication of the physical location of the genes and the meaning of the genetic map, so the validity and meaning of the circular bacterial genetic map was confirmed by the finding that the bacterial chromosome is a circular DNA molecule. Evidence for this was first provided by autoradiographic images of the bacterial chromosome after replication in the presence of radioactive nucleotides.

The F Plasmid Encodes Genetic Functions Required for Transfer of DNA

In addition to its major, large, circular genomic DNA molecule, bacterial cells may contain smaller replicating and segregating circles of DNA known as plasmids. Plasmids may carry any types of genes. One important class of plasmids, termed F plasmids, carry the genes necessary for transfer of the bacterial chromosome during conjugation. F stands for fertility factor. F+ strains contain an F plasmid, whereas F- strains do not. The F plasmid also carries genes that prevent the cell from acting as the recipient during conjugation. This explains why no recombinants are obtained when two F+ cultures are mixed.

Integration of the F Plasmid into the Bacterial Chromosome Can Result in Mobilization of the Chromosome for Transfer

When an F+ culture is mixed with an F- culture, in addition to the appearance of rare recombinants among the F- cells, many of the F- cells are converted to F+. This is because the F plasmid has catalyzed its own transfer into the F- cells. Rarely, the F plasmid integrates into the bacterial chromosome. When it does so, transfer of the bacterial chromosomal DNA occurs during conjugation.
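The relation between the circular map and the time of entry observed in interrupted matings, described above, can be sketched as follows. The marker names and positions are hypothetical, and the 100-minute map length is the conventional E. coli value, treated here as an assumption. The sketch shows why a marker transferred early from one Hfr strain is transferred late from another: each strain starts at a different point on the same circle, and may transfer in either direction.

```python
MAP_LENGTH = 100.0  # minutes around the circular map (assumed E. coli convention)


def entry_time(marker_pos: float, origin_pos: float, clockwise: bool = True,
               map_length: float = MAP_LENGTH) -> float:
    """Minutes after the start of transfer at which a marker enters the F- cell."""
    if clockwise:
        return (marker_pos - origin_pos) % map_length
    return (origin_pos - marker_pos) % map_length


def transfer_order(markers: dict, origin_pos: float, clockwise: bool = True):
    """List of (marker, entry time) pairs sorted by time of entry."""
    times = {name: entry_time(pos, origin_pos, clockwise)
             for name, pos in markers.items()}
    return sorted(times.items(), key=lambda item: item[1])


def markers_transferred(markers: dict, origin_pos: float, interrupt_at: float,
                        clockwise: bool = True):
    """Markers that have entered the recipient when mating is interrupted."""
    return [name for name, time in transfer_order(markers, origin_pos, clockwise)
            if time <= interrupt_at]


if __name__ == "__main__":
    # Hypothetical markers and map positions, for illustration only.
    markers = {"a": 10.0, "b": 35.0, "c": 60.0, "d": 85.0}
    print(transfer_order(markers, origin_pos=0.0))
    print(transfer_order(markers, origin_pos=50.0))
    print(transfer_order(markers, origin_pos=50.0, clockwise=False))
    print(markers_transferred(markers, origin_pos=50.0, interrupt_at=40.0))
```

Run in reverse, the same idea is how the map was built: the entry orders from several Hfr strains fit together only if the markers are arranged on a circle.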
An Hfr strain is one carrying an integrated F plasmid. The origin of transfer is the site of integration. During transfer, DNA lying to one side of the F plasmid is transferred first, and the integrated F plasmid itself is transferred last.

Plasmids Can Be Used to Construct Partially-Diploid Bacterial Strains

F plasmids can carry other genes besides those that promote or block transfer of DNA. Occasionally an integrated F factor will be inaccurately excised from the chromosome, and in the process will pick up a segment of the bacterial chromosome. F plasmids carrying chromosomal genes are termed F' (F prime) plasmids. When transferred into an F- strain, they make that strain diploid for the segment of the bacterial genome they carry. Such a partially diploid strain is called a merodiploid.

Plasmids Play an Important Role in the Transmission of Drug Resistance

Another important class of genes often borne by plasmids are genes whose protein products confer antibiotic drug resistance on cells. When combined with transfer genes on a single plasmid, plasmid-borne drug-resistance genes are responsible for the rapid transfer of drug resistance from one bacterial strain to another. The medically disastrous spread of multiple drug resistance genes during the early days of antibiotic use led to research that resulted in the discovery of the class of responsible plasmids termed RTF plasmids, for Resistance Transfer Factors, or simply R factors. Many of the plasmids used for cloning genes in bacterial cells, such as pBR322 or pBS, are descendants of R factors.

In Transformation, Bacterial Cells Take Up DNA Directly

Transfer of DNA between cells via conjugation is not the only means by which genetic exchange can occur between bacterial cells. In a process termed transformation, some bacterial strains can take up naked DNA directly from the culture medium. After uptake, the new DNA recombines with the host chromosome to produce genetic recombinants. Study of this phenomenon in pneumococcus by Avery, MacLeod, and McCarty led to the first demonstration that genes were DNA. They demonstrated that rough-colony-forming, avirulent strains could be transformed into smooth-colony-forming, virulent strains by exposing them to purified DNA extracted from a smooth, virulent strain.

The discovery in the early 1970's that the well-studied laboratory bacterium E. coli could be transformed if it was treated with calcium and heat made the entire field of recombinant DNA possible. In one of the first experiments, in 1973, an E. coli strain was made antibiotic resistant by transformation with purified R factor DNA. This experiment, now commonplace, was considered a remarkable feat when it was first performed.

Bacterial Viruses Play a Role in Genetic Exchange Between Bacteria

Bacterial viruses, termed bacteriophages or simply phages, constitute another agent of genetic exchange between bacteria. Just as F factors and R factors can carry bacterial genes, so too, bacterial viruses can transfer genes from one bacterial host to another. This process is called transduction. Some phages pick up a few host genes and incorporate them along with phage genes as part of their chromosome. The process by which such bacterial genes are transferred during subsequent growth of the phage is termed specialized transduction.
Phage λ is an example of a phage that can bring about specialized transduction. Other types of phages, on the other hand, wrap up long segments of the bacterial chromosome into phage particles, without any phage genes. Transfer of such genes is termed generalized transduction. Phage P1 is an example of a generalized transducing phage.

Study of Bacteriophages Has Played a Central Role in the Development of Molecular Biology

Many types of viruses, or phages, that infect bacterial cells are known and have been studied. Since they rely on the metabolism of the host cell for replication, phages carry very few genes and have tiny genomes by comparison to their bacterial hosts. Furthermore, the encapsulation of the phage chromosome into a nucleoprotein particle of defined size and structure made it possible to purify the phage chromosome from the bacterial chromosome. For these reasons, phages were the choice of the first molecular biologists for studying the physical nature of the gene and its action. In studies of the rII gene of phage T4, Benzer carried genetic analysis to a resolution never before attained when he demonstrated that genetic recombination could occur within the gene between any two base pairs of DNA. In studies of phage λ, Jacob and Monod opened the door to the study of gene regulation when they demonstrated that the function of the protein product of one gene was to repress the expression of other genes.

Bacterial Viruses May Kill the Host Cell or Coexist with It

Among the bacteriophages there is a great variety of lifestyles. Virulent phages completely take over the host cell for the purpose of phage replication. Phage T4, for example, incapacitates the host cell by degrading its chromosome. Lytic phages, a subset of the virulent phages, use the cellular machinery to produce hundreds of new phage particles, which are then released into the medium by lysis of the cell. Filamentous phages such as M13, on the other hand, allow continued growth of the cell, which is caused to continuously extrude new phage particles from its surface. At the other extreme are the lysogenic phages. While these can cause cell lysis (hence the name), they need not. They can instead repress their replication and lytic functions and integrate their chromosome into the bacterial chromosome. In this repressed state the phage genes are replicated along with the bacterial chromosome for an indefinite period. A bacterial cell harboring such a repressed phage chromosome is termed a lysogen. This name derives from the fact that, under appropriate conditions, the repressed state can cease, the phage chromosome is excised, and the phage replicates and lyses the cell. The most thoroughly studied lysogenic phage is phage λ. Analysis in λ of the genetic controls leading to switching between the lytic and lysogenic modes of growth has contributed significantly to our understanding of the regulation of gene expression. It has served as an important model for researchers attempting to understand the genetic switching that occurs during development in higher organisms.

Inferring Wild Type Gene Function from Mutant Phenotype

To Infer Wild Type Gene Function, It Is First Necessary to Determine How the Mutation Affects Gene Activity

As stated at the outset, one aim of molecular genetics is the elucidation of gene function.
To this end, modern molecular genetics takes advantage of the classical understanding of the genetics of organisms, developed over many decades of research. The approach taken in classical genetics to understanding the function of wild type genes in normal organisms is to study the effects of mutant genes in variant organisms. Many properties of gene action can be inferred by this means. The role of the wild type gene during normal development or adulthood is suggested by the aspect of the wild type phenotype altered in a mutant, such as eye color or phenylalanine metabolism. Interactions among gene products may be studied by analyzing changes in the mutant phenotype when multiple mutations are simultaneously introduced into a single organism. The temporal order in which genes act during development or in a genetic pathway may also be determined from the phenotypes resulting from combinations of mutations. Genetic tests have been devised to make these types of determinations, and some of these tests are described below and in the last section of this booklet.

Before these tests can be applied, however, an essential first step is to determine just how the mutation under consideration affects gene activity. In fact, reliable application of these tests is critically dependent on this information, and without it false conclusions may be drawn. A mutation may affect gene activity in any one of a number of different ways. A mutation may lower gene activity or eliminate it altogether. A mutation may raise gene activity. A mutation may result in altered regulation of gene activity, such that the gene is expressed at an inappropriate time or place. A mutation may even result in the expression of a new gene activity not normally found in wild type. Without knowing which of these effects the mutation under study has, it is impossible to make headway in applying genetic tests aimed at understanding the function of the wild type gene.

Types of Mutations Are Defined by Structure and by Effects on Gene Activity

There are a variety of ways in which the structure of the chromosome may be altered to affect gene expression. Alterations that affect single nucleotides are called point mutations. These may involve a change of the identity of a nucleotide at a particular position along the DNA chain, or the addition or deletion of a single nucleotide. Alterations that affect a block of nucleotides are called rearrangements. Loss of a block of nucleotides is called a deletion (generally in E. coli and viruses) or deficiency (generally in C. elegans and Drosophila); addition of a block may be a duplication of existing sequence, or insertion of a foreign sequence. Rearrangements may also consist of the inversion of a DNA segment, or the translocation of a segment to another place in the genome.

Mutations may also be classified according to their effect on gene expression. A mutation that results in a change in the amino acid sequence of a protein is known as a missense mutation, whereas one that introduces a translational stop (nonsense) codon is known as a nonsense mutation. A nonsense codon may result directly from the change of a nucleotide to give rise to one of the three stop codons (UAG, UAA, UGA), or indirectly as a result of a frameshift due to the insertion or deletion of a number of nucleotides that is not a multiple of three. Virtually all genes can only be read in one frame (the so-called open reading frame), the other two reading frames having periodic stop codons.
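The distinction between missense, nonsense, and frameshift mutations can be illustrated with a small sketch. The short mRNA below and the helper names (`translate`, `substitute`, `insert_base`) are made up for illustration; only the standard genetic code and the three stop codons named above are taken as given.

```python
from itertools import product

# Standard genetic code, built compactly with bases in the order U, C, A, G.
# "*" marks the three stop codons UAA, UAG, and UGA.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): AMINO[i] for i, c in enumerate(product(BASES, repeat=3))}

def translate(mrna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODE[mrna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)

def substitute(mrna, pos, base):
    """Point mutation: change the nucleotide at a 0-based position."""
    return mrna[:pos] + base + mrna[pos + 1:]

def insert_base(mrna, pos, base):
    """Point mutation: add one nucleotide before a 0-based position."""
    return mrna[:pos] + base + mrna[pos:]

# Hypothetical wild type message: AUG GCU GAU AAG GAA CUU UGG UAA
WILD_TYPE = "AUGGCUGAUAAGGAACUUUGGUAA"

missense = substitute(WILD_TYPE, 7, "G")    # GAU -> GGU (Asp -> Gly)
nonsense = substitute(WILD_TYPE, 9, "U")    # AAG -> UAG (amber stop)
frameshift = insert_base(WILD_TYPE, 3, "A") # shifts the reading frame

for name, seq in [("wild type", WILD_TYPE), ("missense", missense),
                  ("nonsense", nonsense), ("frameshift", frameshift)]:
    print(f"{name:10s} {translate(seq)}")
```

In this toy message the single-base insertion brings a UGA that lies out of frame in the wild type sequence into the reading frame, so the frameshift produces a truncated product in the same way a direct nonsense mutation does.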
Finally, a mutation that has its effect by altering the expression pattern of an otherwise wild type gene product is known as a regulatory mutation.

Rare Spontaneous Mutations Are of All Types

Mutations occur spontaneously at a frequency of 10⁻⁵ per gene per generation or less in all organisms. The molecular nature of such mutations spans the full range of types described above, from point mutations to large rearrangements. In some organisms (a notable example being Drosophila), the majority of spontaneous mutations are caused by transposable element insertion. Transposable elements are segments of DNA usually between 1 and 10 kb in length. They characteristically encode one or more proteins required for the insertion of the DNA segment into new genomic sites: they are "jumping genes" that induce their own jumps. Such jumps cause mutations by gene disruption when the transposable element inserts within a coding sequence, or by effects on gene expression when the transposable element inserts near a gene and disrupts its regulatory sequences.

Chemical Mutagens Tend to Induce Point Mutations, Radiation Tends to Produce Rearrangements

In most organisms, the frequency of spontaneous mutations is too low to allow their simple recovery. In order to isolate and study mutations more easily, it is necessary to induce new mutations by chemical treatment or radiation. A commonly used chemical mutagen is EMS (ethyl methanesulfonate). This is an alkylating agent which, by alkylating guanine residues, results primarily in G-C to A-T transitions. There are a variety of other chemical mutagens that act by alkylating the DNA (nitrosoguanidine, NMU), by intercalating between the bases and causing frameshifts (acridine orange), or by cross-linking the strands (psoralen). For the induction of rearrangements as opposed to point mutations, radiation (γ-rays or X-rays) is used. High energy photons induce free radicals which attack the phosphate backbone of DNA, causing breaks. Deletions, inversions, and translocations result when the cell repairs these breaks. It should be stressed, however, that neither chemical mutagens nor radiation produces a single type of mutation. On the contrary, most mutagenic agents produce a wide spectrum of effects. Chemical mutagens can cause rearrangements as well as point mutations, and likewise radiation can cause point mutations as well as rearrangements.

Null Mutations Are Important in the Determination of the Biological Process in which a Gene Participates

Mutations that completely eliminate gene function, known as null mutations (also known as amorphic mutations), play a central role in the interpretation of gene activity. One of the first questions that arises in the study of any gene is: what is the biological process in which the gene acts? Mutations help to answer this question by indicating what aspect of the phenotype is altered when the gene is mutated. However, unless the mutation being analyzed is a null mutation, the mutant phenotype may give an incomplete or even misleading indication of gene function.

An example where the phenotype of a mutation gave a misleading indication of gene function may be taken from Drosophila. The Antennapedia mutation in Drosophila causes the transformation of antennae into legs.
It was at first thought that this meant that the wild type Antennapedia gene was required for development of the antenna. When null alleles of the gene were identified, however, they were found to eliminate thoracic segments and to have no effect on the antennae at all. It is now known that the gene is required for development not of the antenna but of the thorax. The first Antennapedia mutation resulted in inappropriate expression of the gene in the head, resulting in the transformation of head parts into thoracic parts.

Mutations that alter gene function rather than eliminating it may have effects that are far removed from the point of primary gene action, or they may even affect, as in the case of the Antennapedia mutations, an aspect of the phenotype unrelated to normal gene function. Mutations that completely eliminate gene activity are less likely to give a misleading indication of gene function. They leave no residual gene activity that might obscure the primary action of the gene by still providing some gene function. Furthermore, null mutations cannot affect an aspect of the phenotype that is not normally related, however indirectly, to the function of the wild type gene. It is logically sound to conclude that wild type gene activity is necessary, directly or indirectly, for any aspect of the phenotype that is altered in a null mutant. Null mutations therefore play a central role in the analysis of gene action, and determination of the null phenotype, the phenotype of an organism lacking the gene function, is an essential first step in the mutational analysis of any gene.

In Some Organisms the Null Phenotype Is Best Determined by Gene Knockout

How can a null mutation be identified so that the null phenotype can be determined? The approach taken depends on the particular experimental characteristics of the organism being studied. In microbial systems such as yeast, where gene isolation is relatively quick and easy, and in higher organisms in which genetic analysis is difficult or impossible, the approach makes use of cloned genes. In organisms in which genetics may be quicker than gene isolation, purely genetic tests are often used.

In yeast, the null phenotype of a gene is determined by constructing and studying a yeast strain in which the gene is deleted. First, the gene is cloned. The cloned gene is engineered in vitro to remove a large central portion. This deleted gene is then put back into the yeast genome. If it integrates by homologous recombination at the site of the wild type gene, it will replace the wild type gene with the deleted copy. In case the wild type gene is essential, the gene replacement can be carried out in a diploid strain, which retains one wild type gene copy. After gene replacement, the diploid is sporulated and the phenotype of the haploid spores carrying the deleted gene is determined. This approach, based on the availability of a cloned copy of the gene of interest and known as gene knockout, is used in a range of organisms, from Dictyostelium to mammals, to determine the biological function of a gene.

Null Mutations Can Be Identified As Mutations That Behave Genetically Like a Deficiency of the Gene

In some organisms, such as C. elegans and Drosophila, it may be faster to carry out a genetic analysis of a gene than to clone it. In such organisms, many mutant alleles of a gene, both spontaneous and mutagen-induced, are often available or may be isolated. To determine the null phenotype, it is necessary to determine which of the available mutations is most likely to be a null. A variety of genetic tests or criteria have been developed by which a null mutation may be identified. Since none of these tests by itself is definitive, several are generally used together. If they give a consistent result indicating a null, this conclusion may be accepted. However, conclusions from a consensus of genetic tests always remain somewhat tentative, and confirmation at the biochemical level, for example by gene knockout, is helpful.

In the strongest test, the genetic characteristics of a known mutation are compared to those of a deficiency covering the gene. If in a series of such tests the allele being tested always behaves identically to the deficiency, then it is itself likely to be a null allele. A deficiency is necessarily a null mutation. However, it is a null mutation for several adjacent deleted genes as well as the gene of interest (a deficiency is genetically defined as a mutation that affects several adjacent genes). Therefore, the phenotype of the homozygous deficiency does not indicate the null phenotype of the gene of interest. However, the phenotype of an organism heterozygous for the deficiency and other alleles of the gene may be examined. For example, if a stands for the allele being tested, and Df for the deficiency, the phenotype of the strain a/a may be compared to that of the strain a/Df. If they are identical, a may be a null mutation. This is the most standard test for a null mutation. Likewise, the phenotypes of the strains b/a and b/Df may be compared, where b is any other allele of the gene. Again, if they are the same, then a is behaving like Df in a heterozygote with b, and a may be a null.

Null Mutations Have Several Characteristics That Distinguish Them from Non-Null Mutations

Null mutations have a number of characteristics in addition to similarity to a deficiency that distinguish them from non-null mutations. These may be used to support the conclusion from the test above.

1) Provided they are viable and not selected against in any way, null mutations are the most frequent type of mutation at a locus; with a standard dose of mutagen, null mutations may be expected to arise with a certain frequency. If new mutations at the locus of interest, all of the same phenotype, arise at this frequency, these are probably null mutations. Mutations that arise at a much lower frequency than expected (rare mutations) are probably not nulls.

2) All null mutations should have the same phenotype; two alleles of a gene that have different phenotypes cannot both be null alleles.

3) Null mutations usually have the most severe phenotype in an allelic series. An allelic series refers to a set of alleles of a gene whose phenotypes may be placed in a series from least to most severe.

4) Null mutations are usually recessive. However, this is not true of mutations at the minority of loci that are haplo-insufficient (this is the definition of haplo-insufficiency: null mutations are dominant).

5) Amber-suppressible alleles (alleles containing a UAG, or amber, stop codon, suppressible by a mutated tRNA that inserts an amino acid at a UAG codon) are likely to be nulls, since the presence of a stop codon should result in a truncated gene product. However, a low level of gene product could result from translational readthrough, and a stop codon towards the 3' end of the open reading frame could result in the synthesis of an incomplete amino-terminal peptide with some gene activity.
6) Likewise, null mutations are usually not temperature sensitive, since temperature sensitive alleles are usually missense mutations that give rise to a temperature-sensitive protein product. However, some null alleles result in temperature sensitivity by uncovering temperature sensitivity at another locus.

New Null Alleles May Be Isolated by a Non-Complementation Screen

Null mutations are so important to determining the biological role of a gene that it may be necessary to isolate one if none can be found among an existing collection of alleles. This may be done by carrying out a non-complementation screen for new alleles. In a non-complementation screen, a mutagenized organism is mated to an organism carrying a recessive mutation in the gene of interest. Among the heterozygous F1 progeny, the mutant phenotype is not expected to appear because of the presence of a wild type allele. However, rarely, a gamete mutated in the gene of interest may give rise to an F1 expressing the mutant phenotype. This is recovered and the new allele outcrossed. Provided it is viable, the F1 heterozygous for the old allele and a newly generated null allele should be the most frequent type of mutant F1. The viability of this genotype may be assessed ahead of time by constructing the heterozygote carrying the existing allele to be used in the non-complementation screen over a deficiency.

Hypomorphic Mutations Lower But Do Not Eliminate Gene Activity

Null mutations are the most severe type of a general class of mutations known as loss-of-function mutations. In loss-of-function mutations, gene activity is lowered or eliminated. In general, loss-of-function mutations are recessive; the exceptions arise at haplo-insufficient loci. Mutations in which gene activity is lowered but not eliminated are known as hypomorphs or hypomorphic mutations. In a hypomorph, there is residual gene activity which is regulated normally. Generally, a hypomorph has a less severe phenotype than a null mutation. This criterion provides a genetic test for a hypomorph: if a hypomorphic mutation is placed over a null allele, the resulting phenotype will be more severe than the phenotype of the homozygous hypomorphic mutation, because gene activity in the heterozygote is lowered even further. Thus, if a is a hypomorphic mutation, then in comparing the phenotypes of a series of strains, we will find that a/Df > a/a, or a/0 > a/a, where > stands for "phenotype is more severe than", and 0 stands for a null allele.

Many mutations causing visible phenotypes known in such organisms as C. elegans, Drosophila, and mouse are found on further examination to be hypomorphic alleles of essential genes. Null mutations at these loci cause lethality, often at a developmental stage much earlier than the stage at which the visible abnormality arises. Such mutations alter, but do not eliminate, the activity of essential cellular functions, sometimes only in a particular tissue or at a particular developmental time, in such a way that only a sublethal abnormality results.

Gene Activity Is Raised by Hypermorphic Mutations

Hypermorphic mutations raise gene activity. Apart from being higher than normal, the gene activity in a hypermorph is otherwise normal. This is one type of gain-of-function mutation. A gain-of-function mutation results in the production of more gene activity, or some kind of novel gene activity, not found in wild type. Because gain-of-function mutations result in the expression of a gene product, they are usually dominant.
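The comparison of a/a with a/Df used above to tell a candidate null from a hypomorph can be written as a small decision rule. This is only a sketch: the numeric severity scores, the `tolerance` parameter, and the function name are assumptions introduced for illustration, and in practice several such comparisons would be combined before drawing a conclusion.

```python
def compare_to_deficiency(severity_a_over_a, severity_a_over_df, tolerance=0):
    """Interpret one deficiency test for an allele a.

    Both arguments are phenotypic severity scores on any consistent scale
    in which a larger number means a more severe phenotype (hypothetical
    scoring; the source describes the comparison only qualitatively).
    """
    difference = severity_a_over_df - severity_a_over_a
    if abs(difference) <= tolerance:
        # a/Df looks like a/a: the allele behaves like the deficiency.
        return "candidate null"
    if difference > 0:
        # a/Df > a/a: removing one copy of a lowers activity further.
        return "hypomorph"
    # a/Df < a/a: the result does not fit a simple loss-of-function allele.
    return "not a simple loss of function"

print(compare_to_deficiency(3, 3))  # a/a and a/Df indistinguishable
print(compare_to_deficiency(1, 3))  # a/Df more severe than a/a
print(compare_to_deficiency(3, 1))  # a/Df less severe than a/a
```

A single matching comparison only makes an allele a candidate null; the same function could be applied to b/a versus b/Df for other alleles b to build up the series of tests described earlier.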
A hypermorphic mutation may be distinguished because it has the opposite effect from a hypomorphic mutation when placed in combination with a null mutation. Whereas a null mutation increases the phenotypic severity of a hypomorph by lowering gene activity still further, it decreases the phenotypic severity of a hypermorph by the same means. Together, the hypermorph and the null could in fact cancel each other out and result in nearly wild type levels of gene activity. If so, the hypermorph/null heterozygote would have a wild type phenotype; that is, the null mutation would suppress the hypermorphic mutation.

Likewise, a hypermorphic mutation has the opposite effect from a hypomorphic mutation when placed in combination with a wild type allele. Whereas a wild type allele alleviates the defect due to a loss-of-function mutation by supplying wild type gene activity, it exacerbates the defect due to a hypermorphic mutation. For example, the phenotype may become more severe if a wild type gene copy is added, perhaps on a duplication, to a strain carrying two copies of the hypermorphic mutation. Thus it may be possible to establish the series a/0 < a/a < a/a/+, in which case a is almost certainly a hypermorphic mutation.

Antimorphic Mutations Produce a Poison Gene Product

Antimorphic mutations (also known as dominant negative mutations) are a second class of normally dominant, gain-of-function mutations. Antimorphic mutations give rise to a gene product that blocks the activity of the wild type gene product. This can happen if the protein product of the gene acts as part of a multimer, for example as a subunit of an oligomeric enzyme or a protomer of a polymer. In this case, a missense mutation may result in the production of a protein product that combines with wild type product to produce a non-functional oligomer. Antimorphic mutations are distinguished from hypermorphs by the fact that their phenotype, rather than being exacerbated, is alleviated by additional wild type alleles. Thus we expect the series a/a > a/+ > a/+/+. By adding additional wild type gene copies we can overcome the block to gene activity caused by the mutant a protein.

Neomorphic Mutations Result in a Novel Gene Activity

Neomorphs are a third class of normally dominant, gain-of-function mutations. A neomorphic mutation results in the production of a gene product with a novel activity not present in wild type. For example, the Antennapedia mutations already cited above, which cause inappropriate expression of the Antennapedia gene product in the head, are neomorphic mutations. The neomorphic character of a mutation may be suspected if a null mutation at the same locus has a strikingly different, unrelated phenotype.

A Gain-of-function Mutant Phenotype May Be Eliminated by Introducing a Loss-of-function Mutation at the Same Locus

A common characteristic of gain-of-function mutations of whatever kind is that they may be overcome by a second mutation in the same gene that knocks out the deleterious gene product they produce. This property may be used to confirm that a particular mutant allele results in the production of some protein product. Typically, a strain carrying a dominant mutation is treated with mutagen and mated to a strain carrying only the wild type allele. The heterozygous F1 progeny are expected to be mutant, because of the dominant effect of the mutation.
However, if a typical recessive loss-of-function mutation (usually a null mutation) has arisen within the dominant allele, eliminating the deleterious protein, this will result in the appearance of wild type organisms in the F1 generation. If such progeny appear at a frequency expected for null mutations, and if further study confirms that they carry new, loss-of-function alleles of the locus, this supports the conclusion that the dominant allele results from a gain-of-function mutation.

Determining the Time and Place of Gene Action

The Time and Location of Gene Expression Can Be Determined by a Number of Biochemical Means

In order to understand the function of a gene, it is essential to know when and where it is expressed. Gene expression patterns are of a great variety, from ubiquitous expression in all cells throughout life, which might indicate a basic metabolic function, to expression in a subset of tissues throughout the body, to expression in multiple tissues in a restricted portion of the body, to patterns extremely restricted in both space and time. A number of methods are available for determining the time and location of gene expression in various organisms. In this connection, it is important to make a distinction between gene expression and gene action. These methods indicate where a gene is expressed, but as discussed more fully in the next section, they do not reveal where a gene acts or where its action is required. The time and location of expression, action, and required action need not be the same, and additional tests are necessary to answer the more restrictive questions.

Gene expression may be assayed at the protein or RNA levels. In a Northern hybridization, RNA is detected by hybridization to a radioactive probe after fractionation on a gel and immobilization on a filter. In a Western blot, fractionated proteins immobilized on a filter are detected with specific antibodies. The time of gene expression during development may be determined by these methods with material taken from isolated developmental stages. In larger animals, such as mice, the tissue or location within the body where the gene is expressed may be determined in a similar manner, taking material from dissected tissues.

To determine the place of gene expression with higher resolution than can be achieved with dissected material, or in small organisms such as Drosophila and C. elegans where dissection is difficult or impossible, detection of RNA and protein is carried out in situ. Radioactive or biotinylated nucleic acid probes are hybridized to whole mounts or sections after RNA is immobilized in position. Immobilized proteins are detected by reaction with specific primary antibodies, followed by reaction of the primary antibody with a secondary antibody labeled with a fluorescent reporter group or with biotin. The secondary antibody reacts with some portion of the primary antibody molecule; for example, a goat antibody that reacts with epitopes on mouse immunoglobulins may be used as a secondary antibody if the primary antibody is from mouse. A limitation of the use of antibodies is the possibility of cross-reaction of the primary antibody with another cellular protein apart from the one against which the antibody was originally raised. This possibility cannot generally be ruled out without showing that reactivity is eliminated by a mutation in the gene.
Reporter Genes Provide a Sensitive and Versatile Assay of Gene Expression

Another approach to determining the time and place of gene expression takes advantage of a reporter gene. A reporter gene encodes a protein product that is easily detected in tissue. Commonly used reporter genes include the β-galactosidase gene of E. coli, whose product is detected after it is allowed to react with the chromogenic substrate Xgal; firefly or bacterial luciferase genes, whose products catalyze a light-producing reaction; and the gene for the green fluorescent protein (GFP) of jellyfish. These gene products may be easier to detect than RNA, and do not require the preparation of gene-specific antibodies.

To use a reporter gene to assay the expression of a known gene of interest, the reporter must be placed under the influence of the sequences that control the expression of that gene. Typically, the protein coding region of the reporter is placed downstream of the DNA sequence lying 5' to the gene, or is placed within the protein coding region of the gene. Caution must be used in interpreting the results of such an experiment, however, because important control sequences may lie almost anywhere, including within introns or in 3' flanking DNA sequences.

An important variation of this approach allows the detection of promoters and enhancer sequences of unknown genes. So-called "promoter trap" or "enhancer trap" vectors carry, respectively, a reporter gene with no promoter sequence or with a promoter but no enhancer. Such a vector is allowed to integrate randomly into genomic DNA, and individual animals are inspected for interesting patterns of reporter gene expression. When such an animal is found, analysis usually reveals a gene at the site of integration that has the interesting expression pattern. This approach has been particularly fruitful in Drosophila and in C. elegans.

Gene Knockout Frequently Reveals That a Gene's Activity Is Not Required Everywhere It Is Expressed

Gene expression need not reflect gene activity or essential function. This possibility is being increasingly tested as the expression patterns are determined of genes known from genetic analysis in such organisms as Drosophila and C. elegans, and as genes with studied expression patterns are subjected to analysis by gene knockout. Results so far indicate that in fact genes are often not needed everywhere they are expressed. For example, in mouse such seemingly fundamental genes as ras and TGF- have been knocked out with either no evident phenotype, or no evident phenotype during important periods of expression, such as during embryogenesis. These results are subject to two interpretations. The gene may be imperfectly regulated and expressed at irrelevant times. Alternatively, the gene product may carry out an essential function that is duplicated by the product of another gene. Further studies of a gene's function are required to distinguish these two possibilities.

The Tissue Where Gene Activity Is Required May Be Determined by Mosaic Analysis

As repeatedly pointed out in this booklet, genetic analysis is required to determine a gene's essential function. Null or gene-knockout mutations give an indication of a gene's wild type activity. By studying organisms in which gene function is lost in only some cells, the site where gene activity is required may be determined. For example, for a gene in which null mutations cause paralysis, we will want to know whether wild type gene activity is required in the muscles, in the peripheral nervous system, or in the central nervous system. The essential site of gene action is termed its focus of action.

Organisms in which all the cells do not have the same genotype are called mosaics. Analysis of mosaic animals is most readily carried out in C. elegans and Drosophila, where genetic markers are available for this purpose. However, mosaics can also be generated in mice by introduction of genetically marked cells into early embryos. Such cells may come to populate only a subset of tissues. Mosaic animals generated in this way are also known as chimeras.

In flies and worms, mosaic animals are generated by causing a subset of cells in an animal of heterozygous genotype to become homozygous. If the animal is heterozygous for a recessive mutation, clones homozygous for the mutant allele are generated. If such clones include the focus of essential gene action, the mosaic will exhibit a mutant phenotype. If only cells outside the focus become homozygous for the mutation, this will have no effect. By analysis of the phenotypes of a number of such mosaics, the site of the focus may be determined.

In C. elegans, homozygous mutant clones are generated in a heterozygote by placing the wild type copy of the gene on an unstable extrachromosomal fragment. At mitosis, the fragment, and hence the wild type gene copy, is lost at a low but appreciable frequency, generating a clone of mutant cells. In Drosophila, homozygous mutant clones are generated by inducing a crossover in mitotic cells with X-rays. In a heterozygote, if the crossover occurs between homologous chromosomes after they have duplicated prior to cell division, mitotic product cells can result in which a portion of a chromosome (the portion distal to the crossover site) derives from only one of the two parental homologs.

If both the focus and the mutant phenotype occur in the same cell, that is, if a gene's activity is required within the cell affected by a mutation in that gene, the gene is said to be cell-autonomous. If the gene's activity is required in another cell, different from the one that shows the mutant phenotype, then the gene's activity is said to be cell-non-autonomous. A gene for a transcription factor would certainly be cell-autonomous for its immediate effect on gene expression. A gene for a hormone would be non-autonomously required for effects brought about on other cells by the hormone. For some genes, the autonomy or non-autonomy of gene action may depend on how close the mutant effect being analyzed is to the primary action of the gene. For example, suppose we analyze the requirement for a gene encoding a transcription factor that regulates a hormone gene. Such a transcription factor gene will be cell-autonomous for transcription, but non-autonomous for the effects of hormone expression.

Gene Product Synthesis and Gene Product Action Need Not Take Place in the Same Generation

Because of the parental contribution of gene products to the egg and sperm cell, mutations need not have their effect in the generation that carries them. During gametogenesis, a considerable amount of gene expression takes place to synthesize the gene products that make up the egg and sperm, both in the maternal and paternal gonads and elsewhere in the body (for example, yolk protein may be synthesized by gut cells [C. elegans] or fat bodies [Drosophila]). Indeed, the entire substance of the earliest embryo is synthesized by the parents, mostly the mother, and this inherited material may participate for a considerable amount of time in the growth and development of the embryo. (The genetic phenomenon of perdurance can arise when there is a persistence of gene product. Perdurance is defined as the persistence of a phenotype in the absence of the conditioning genotype.) When a gene product is synthesized and acts in different generations, the genotype of one organism affects the phenotype of another.
For example, suppose a gene product synthesized by the mother is deposited in the egg and acts later during embryogenesis. Then the mother must carry a wild type allele to produce viable or normal embryos. Conversely, the embryo need not carry the wild type allele, that is, it may be homozygous mutant, yet it will have a wild type phenotype provided the mother synthesized the gene product. Genes whose products can be supplied by the mother in this way are called maternal effect genes, or simply maternal genes. Eventually the embryo's own genome is expressed and directs synthesis of gene products throughout most of life. For genes expressed in this way, the organism's own genotype determines its phenotype. Genes of this type are termed zygotic genes. For some genes, expression may occur in either the mother or the embryo. These genes are both maternal and zygotic. The class of maternal effect genes which must be expressed by the mother and cannot be rescued by gene expression in the embryo is known as strict maternal effect genes. Finally, rare paternal effect genes are known which must be expressed in the father to synthesize components of sperm essential for embryogenesis.

Parental Effects May Be Identified by Genetic Tests

Genetic tests are used to determine in which generation a gene must act. For example, if a recessive mutation lies in a zygotic gene, then in matings between heterozygous parents, the one quarter of offspring that are homozygous for the mutant allele will be mutant in phenotype.
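The logic of these tests can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a single gene with a wild type allele "+" and a recessive null allele "m", and three idealized classes of gene. The function names and the mode labels ("zygotic", "strict_maternal", "maternal_and_zygotic") are invented here and do not come from the booklet.

```python
from collections import Counter
from itertools import product


def cross(mother, father):
    """Return the four equally likely zygote genotypes from one mating."""
    return [tuple(sorted(pair)) for pair in product(mother, father)]


def has_wild_type(genotype):
    """True if the genotype carries at least one wild type ("+") allele."""
    return "+" in genotype


def phenotype(zygote, mother, mode):
    """Predict the embryo's phenotype for an idealized class of gene."""
    if mode == "zygotic":
        # Only the embryo's own genotype matters.
        normal = has_wild_type(zygote)
    elif mode == "strict_maternal":
        # Only the mother's genotype matters; the embryo cannot rescue itself.
        normal = has_wild_type(mother)
    elif mode == "maternal_and_zygotic":
        # Product from either the mother or the embryo is sufficient.
        normal = has_wild_type(mother) or has_wild_type(zygote)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return "wild type" if normal else "mutant"


def phenotype_counts(mother, father, mode):
    """Tally predicted phenotypes among the four zygote classes."""
    return Counter(phenotype(z, mother, mode) for z in cross(mother, father))


het = ("+", "m")
print(phenotype_counts(het, het, "zygotic"))          # one quarter mutant
print(phenotype_counts(het, het, "strict_maternal"))  # no mutant embryos
```

The difference between the two printed tallies is the diagnostic: the same cross gives one quarter mutant progeny for a zygotic gene and none for a maternal effect gene, because the heterozygous mother supplies wild type product to every egg.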

If there is instead a parental effect at work, the offspring homozygous for the mutation will not be mutant in phenotype. For example, in the case of an essential maternal gene, homozygous mutant offspring of a heterozygous mother will be non-mutant, that is, they will be viable, because the wild type gene in the mother will supply the embryo with wild type product. In the case of a strict maternal gene, these homozygous mutant offspring will be sterile (they will give inviable embryos), even if mated to a male carrying a wild type gene, because their embryos will not receive any wild type gene product.

Temperature-Sensitive Mutations Can Be Used to Determine the Time of Gene Action

Temperature-sensitive mutations are a class of conditional mutations, that is, mutations whose expression depends on experimental conditions. Temperature-sensitive mutations may be either heat-sensitive, that is, their mutant phenotype may appear only at temperatures above the normal growth temperature, or cold-sensitive, their mutant phenotype appearing only at temperatures below the normal growth temperature. Heat-sensitive mutations are more common than cold-sensitive mutations, and hence, in normal usage, "temperature-sensitive" is used synonymously with "heat-sensitive". We will adhere to that convention here: for "temperature-sensitive" read "heat-sensitive". Where "cold-sensitive" is meant, that term will be used.

The basis for temperature-sensitive mutations lies in the heat-sensitive nature of all proteins. When proteins are heated they denature, that is, they lose their precise folded structure and assume a variable, random-coil form. This is because protein folding is an enthalpically driven process with an unfavorable decrease in entropy. Increased temperature increases the unfavorable entropic term in the free energy equation, and hence favors the random-coil form.
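The free energy argument can be written out explicitly. The following is a sketch using the standard Gibbs relation; the symbols (the subscript "fold" and the transition temperature T_m) are notation introduced here, and the enthalpy and entropy of folding are treated as independent of temperature, which is a simplification.

```latex
\begin{aligned}
\Delta G_{\mathrm{fold}} &= \Delta H_{\mathrm{fold}} - T\,\Delta S_{\mathrm{fold}},
  \qquad \Delta H_{\mathrm{fold}} < 0,\ \ \Delta S_{\mathrm{fold}} < 0 \\
\Delta G_{\mathrm{fold}} &< 0 \iff T < T_m,
  \qquad T_m = \frac{\Delta H_{\mathrm{fold}}}{\Delta S_{\mathrm{fold}}}
\end{aligned}
```

Because both terms are negative, the folded form is favored only below T_m. On this simplified view, a mutation that makes the enthalpy of folding less negative, without changing the entropy term, lowers T_m.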
For any protein, there is a temperature, called the transition temperature, at which the protein unfolds. Temperature-sensitive mutations are usually missense mutations that result in a destabilized protein with a lower transition temperature. If the transition temperature falls within the normal growth range of the organism, then it is possible to grow the mutant organism at a low temperature, where the protein is folded and active, and at a high temperature, where the protein is unfolded and inactive. Cold-sensitive mutations affect entropically driven reactions, that is, reactions in which there is an increase in entropy. Such reactions are often assembly reactions, in which multi-protein complexes are built. Thus, cold-sensitive mutations often involve disruption of macromolecular complexes, such as microtubules.

Temperature-sensitive mutations are useful in studying essential genes. The null phenotype of an essential gene is lethality, and such genes can be cumbersome to study because null mutations in them must always be maintained in heterozygous form. It may even be impossible to determine the full null phenotype of an essential maternal gene, since the wild type gene product provided by the heterozygous mother will mask an early embryonic function. These problems are avoided with temperature-sensitive alleles. Stocks homozygous for the mutation may be maintained without difficulty at low temperature, where the lethal mutant phenotype is not expressed. To study the mutant phenotype, the rearing temperature is simply raised. For a maternal effect mutant, the rearing temperature of the mother is raised.

Temperature-sensitive mutations may also be used to determine the time when a gene acts. By determining the period during the life cycle when high temperature causes the mutant phenotype, it is possible to determine at least one period when the gene product is present. This period is called the temperature-sensitive period (TSP). The temperature-sensitive period must fall somewhere in the interval between the time the gene product is synthesized and the last time its action is required. Where precisely in this interval it falls may differ for different genes. For example, many gene products become assembled into multiprotein complexes that may keep a temperature-sensitive polypeptide from unfolding. For such genes, the temperature-sensitive period is likely to be shortly after synthesis and before assembly, and may end well before the gene product actually functions. Alternatively, a monomeric enzyme could be temperature-sensitive at all times; the temperature-sensitive period for such an enzyme would cover the entire period over which the enzyme acts.

The temperature-sensitive period is determined by a temperature shift experiment. In a temperature shift experiment, an organism is reared at high temperature and shifted to low temperature at a later time (a down-shift), or is reared at low temperature and shifted to high temperature at a later time (an up-shift). In both the down-shift and up-shift experiments, the effect of changing temperature is observed at some later time after the shift. The beginning of the temperature-sensitive period is the earliest time that organisms can be shifted from high temperature to low temperature and still show the mutant phenotype. The end of the temperature-sensitive period is the latest time that organisms can be shifted from low temperature to high temperature and still show the mutant phenotype.
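The two rules for reading a temperature shift experiment can be expressed as a short Python sketch. The function name, the data layout, and the example time points are all hypothetical; real data would also need replicate animals and some allowance for incomplete penetrance.

```python
def temperature_sensitive_period(down_shifts, up_shifts):
    """Estimate the TSP from temperature shift results.

    down_shifts: list of (shift_time, shows_mutant) for high -> low shifts
    up_shifts:   list of (shift_time, shows_mutant) for low -> high shifts
    Returns (start, end), or None if either series never gives a mutant.
    """
    mutant_after_down = [t for t, mutant in down_shifts if mutant]
    mutant_after_up = [t for t, mutant in up_shifts if mutant]
    if not mutant_after_down or not mutant_after_up:
        return None
    # Beginning: earliest down-shift that still gives the mutant phenotype.
    start = min(mutant_after_down)
    # End: latest up-shift that still gives the mutant phenotype.
    end = max(mutant_after_up)
    return start, end


# Hypothetical results, with shift times in hours after fertilization.
down = [(2, False), (4, False), (6, True), (8, True), (10, True)]
up = [(2, True), (4, True), (6, True), (8, True), (10, False)]
print(temperature_sensitive_period(down, up))
```

The estimate is only as fine as the spacing of the shift times: with shifts every two hours, each boundary is located to within one two-hour interval.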
Analyzing Complex Processes by Genetics

Genetic Analysis Allows the Probing of Complex Biological Processes Involving Multiple Genes

One of the important facets of modern genetics is its use to explore not just the function of individual gene products but whole processes involving many components. Complex processes, like learning and memory, or regulation of the cell cycle, often cannot be reconstituted in vitro and so may be more accessible to genetic analysis than to direct biochemical investigation. In this context the primary goals of genetic analysis are the identification of all (or most) of the genes involved and then the understanding of how together they regulate the process under study.

Some Genes Involved in a Biological Process May Be Identified As Genetic Modifiers

Many of the genes involved in a particular process will be identified by the straightforward methods described elsewhere in this booklet. Mutations affecting the process will be isolated in genetic screens and sorted into complementation groups until no more can be found. Once many alleles have been obtained for each complementation group, and few additional complementation groups are being discovered, the mutagenesis is said to be close to saturation. At saturation, however, there may remain many genes essential to the process that have not yet been identified.

One problem with the straightforward genetic screen is pleiotropy. Pleiotropic mutations have multiple effects because they affect genes involved in more than one process. For example, in Drosophila the gene Notch is involved in the development of both the nervous system and the muscles. Another example is the Antennapedia (Antp) gene, involved in the development of the adult legs. Since

Antp is also involved in the development of the embryo, homozygous null mutants die before they can hatch from the eggs, and Antp mutations could not easily have been found in a screen for mutations affecting legs.

Fortunately, involvement of such problem genes is often discovered through genetic modifier effects. A modifier mutation changes the phenotype of some other mutation. The combination may produce a phenotype closer to wild-type, in which case the modifier is a suppressor mutation, or further from wild-type, in which case it is an enhancer mutation. Such special alleles may reveal the role of a gene in a process, through their modifier effects, when pleiotropy of null alleles would conceal that role.

Consider the case of a tRNA molecule, which has a cloverleaf structure dependent on complementary base-pairing to generate its double helical regions. Mutation of a base in one of these regions (e.g. substituting C for G) may change the structure and disrupt the tRNA function. A second, complementary mutation of the base that normally pairs with the first restores normal structure and function; the doubly mutant tRNA has a C-G base pair instead of G-C. The two mutations are intragenic suppressors of one another.

Extragenic (or second-site) suppression can occur in a similar way. The mutant effect of an amino acid change can be suppressed by a second change in another protein that contacts the first. One example can be found in the progression of the immune response. Some pathogens frequently change the antigens present on their surface. Antibodies selected to bind the original surface are no longer effective. However, the antibody genes also mutate, and mutants may appear that restore binding to the altered antigen. The mutant antibody genes are suppressors of the mutant surface antigens, the mutant phenotype being failure of antigen recognition.
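The tRNA example of intragenic suppression can be made concrete with a small Python sketch. It assumes that a helical region is intact only if every position forms a strict Watson-Crick pair, which ignores the G-U pairs that real RNA helices tolerate; the function name and the three-pair stem are invented for illustration.

```python
WATSON_CRICK = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def stem_intact(pairs):
    """True if every base pair in a helical stem is a Watson-Crick pair."""
    return all(pair in WATSON_CRICK for pair in pairs)


# A hypothetical three-pair stem; the first pair is the one that mutates.
wild_type = [("G", "C"), ("A", "U"), ("C", "G")]
single_mutant = [("C", "C"), ("A", "U"), ("C", "G")]  # G -> C on one strand
double_mutant = [("C", "G"), ("A", "U"), ("C", "G")]  # plus C -> G opposite

for name, stem in [("wild type", wild_type),
                   ("single mutant", single_mutant),
                   ("double mutant", double_mutant)]:
    print(name, "intact" if stem_intact(stem) else "disrupted")
```

Each mutation alone leaves a mismatched pair, so either one disrupts the stem, while the combination restores pairing. That reciprocal relationship is what makes the two mutations suppressors of one another.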
Clearly, if extragenic suppressors can be found for a mutation in one gene, they are likely to affect genes whose products interact with the first gene's product. Thus, a suppressor screen may identify many genes involved in a particular process, starting from a mutation in only one. A crucial advantage is that the suppressor mutations often act dominantly (since the presence of a wild-type product in a heterozygote does not interfere). This means mutations may be identified without encountering any complicating pleiotropies, so long as pleiotropic effects are recessive. The disadvantage of this particular type of screen is that mutating just the amino acid(s) necessary to suppress another mutant is unlikely by random mutagenesis. This kind of suppressor screen is most useful in bacteria and yeast, where very large populations can be screened for these rare mutations. However, even null mutations, which are usually a significant fraction of the mutations obtained, sometimes suppress or enhance other mutations. One example is rudimentary (r) and suppressor of rudimentary (su(r)) in Drosophila. The r gene encodes enzyme activities required for adequate pyrimidine synthesis. In r mutants growth and development are stunted. The phenotype is suppressed in the double mutant r/r su(r)/su(r). The su(r) gene is thought to encode an enzyme involved in pyrimidine catabolism (breakdown). When both gene products are absent, sufficient pyrimidines accumulate to permit more normal growth and development. Clearly the su(r) kind of suppressor mutation identifies a gene with function related to r, but not one that binds the r enzyme itself. How can the su(r) type of suppressor be distinguished experimentally from the direct interaction type? Firstly,

su(r) is recessive, whereas mutations in interacting proteins are expected to act dominantly. Secondly, any su(r) null mutation can suppress any r mutation. But when direct molecular interactions are the basis of suppression, only a very restricted set of mutations affecting only one small part of the protein are likely to be suppressed. Such suppressors are said to be allele-specific. Allele specificity in genetic interactions is frequently interpreted as evidence of direct molecular interaction. Thirdly, null mutations of su(r) suppress r. This means, for example, that a deficiency for the su(r) gene would also suppress r.

One special case of allele specificity is provided by nonsense suppressors. Mutations in the anticodon of tRNA genes will change the amino acid inserted into protein. When an anticodon mutates to a sequence complementary to one of the stop codons (e.g. UUA), an amino acid may be inserted at the UAA stop codon position during translation. Such tRNA mutants suppress mutations in other genes that were caused by mutation to a stop codon. They are allele-specific, since only alleles associated with one type of stop codon are suppressible. Nonsense-suppressor tRNAs have been obtained genetically in only some organisms (E. coli, yeast, C. elegans), but can be introduced into others like Drosophila or mammalian cell lines by means of gene cloning and transformation.

Very powerful screens can be designed around mutations that reduce gene function to threshold levels. The phenotype of such mutations is sensitive to changes in the rate of other steps in the same pathway. Null mutations affecting other steps may be obtained as dominant suppressors or enhancers, if halving the amount of gene product alters the initial mutant phenotype. Recent examples include screens for modifiers of the let-23 gene in C. elegans or of the sevenless gene in D. melanogaster, both encoding receptor tyrosine kinases.
These screens identified many components of tyrosine kinase signal transduction pathways, starting with receptor mutants encoding only marginal activity. Enhancer mutations especially can identify parallel or redundant pathways. Sometimes homozygotes for mutations in two different genes show a new phenotype in addition to that of each locus individually. Phenotypes that result only when multiple genes are mutant simultaneously are called synthetic phenotypes. Synthetic phenotypes indicate redundancy. Two genes are redundant if both fulfill the same function. For instance, two distinct enzymes may catalyze the same reaction. Null mutations in either gene alone have little effect, but mutating both genes blocks the reaction. An example is provided by three cyclin genes in yeast, CLN1, CLN2 and CLN3, which regulate the transition from the G1 to S phases of the cell cycle. Any one of the cyclins is sufficient for a normal cell cycle, and yeast mutant for one or even two CLN genes grow normally. However, the cln1 cln2 cln3 triple mutant is unable to grow and dies. Synthetic phenotypes are often discovered by chance. Alternatively, screens may be performed using strains mutant for gene A to identify mutations in other genes with overlapping function. Such mutations appear as enhancers of mutant A because of the synthetic phenotype. At present it is not certain what percentage of genes are redundant in each organism. Some studies in yeast and C. elegans suggest it may be substantial. There is also evidence from gene knockouts in mice. For example, disrupting the mouse engrailed (en) gene homologue has no effect. Yet since the mouse gene has been conserved since the common ancestor of mammals and flies, where the original en gene was described, it is likely to be important. It is assumed there is at least one other functionally redundant gene in mice which if

mutated would give a synthetic phenotype with en mutants. A similar lack of phenotype has been found in knock-out mutations of several other seemingly important genes of the mouse. In the past, geneticists have tended to think that genes with the most dramatic mutant phenotypes are most important, but where parallel pathways exist, this may not be correct.

To summarize, suppressor and enhancer mutations are a useful way to define more genes involved in any particular process. These terms (like most gene names) reflect how the mutations were discovered as much as they do the real function of the wild-type genes. Thus the su(r) gene obviously does not exist so that its mutants will suppress r; it has a function of its own. Had the gene been discovered first for this function, it would probably have a different name. Interactions between genes can indicate the relationship of gene products when comparison of the individual mutant phenotypes might not, e.g. because of pleiotropy or redundancy. Genetic interactions may be allele-specific and dominant when the gene products interact directly. Other examples are recessive, e.g. enhancers giving synthetic phenotypes, or compensating mutations affecting another branch of the pathway, like su(r). The most efficient screens detect null mutations (which are not uncommon) as dominant modifiers (so their recessive phenotypes do not confuse matters).

Information About the Order of Gene Action in a Pathway Can Be Obtained by Epistasis Analysis

Once mutations in most or all of the important genes have been isolated, the loftier goal may be pursued of demonstrating the specific role of each gene product, and so explaining how something as complex as regulation of cell division or learning and memory occurs. It is the potential of genetics to probe such complex questions in organisms like E. coli, yeast, C. elegans and D. melanogaster that makes these organisms important in research today.
The specific approaches vary in different cases. It is possible to give some examples to illustrate general techniques. The main thing is to be very clear on the nature of individual mutations, as described in the previous section of this booklet, and to build logically on this information when multiple mutations are combined.

The phenomenon of epistasis can allow the ordering of gene action into a pathway. Epistasis means one mutant phenotype cannot be seen in the presence of another mutation. For example, in C. elegans a mutation causing cell death is epistatic to a mutation changing the shape of the cell nucleolus, because if the cell is no longer there it becomes impossible to assess the nucleolar phenotype. Clearly this example does not indicate any functional relationship between the products of these genes, which are not suspected of acting in the same process. Where the mutations are thought to affect related processes, the results are more interesting.

The nature of this analysis is perhaps best introduced by an illustration. Ultrabithorax (Ubx) and Sex combs reduced (Scr) are two Drosophila genes. In Ubx-/Ubx- mutants, the second leg develops as an exact copy of the normal first leg instead of being somewhat different. In Scr-/Scr- mutants, the first leg develops as an exact copy of the normal second leg. The two mutations, both loss-of-function alleles, thus have opposite and equivalent effects. Combining them establishes an order of gene function. In Scr-/Scr- Ubx-/Ubx- double mutants, the first leg develops like a normal second leg, and the second leg is just like a normal second leg (i.e. Scr- is epistatic to

Ubx-, the double mutant has the phenotype of the Scr- single mutant, and the phenotype seen in Ubx-/Ubx- alone can no longer be detected). The double mutant shows that changing second leg into first leg (in Ubx-/Ubx-) requires a functional Scr+ gene. But changing first leg into second (in Scr-/Scr-) does not require a functional Ubx+ gene. The Ubx- phenotype requires Scr+ gene function, but not vice versa. This suggests the Ubx gene regulates Scr gene function. The Ubx gene acts like a repressor or antagonist of the Scr gene in the second leg, preventing first leg development (which requires a functional Scr gene) from occurring.

In this example, just combining the two mutations establishes the order of gene function (Ubx regulates Scr), and suggests sensible molecular models of their function (Ubx might encode a repressor protein for expression of the Scr gene, or an inhibitor of the Scr protein). This model could be developed without cloning the genes, obtaining any molecular information, or reconstituting their activity in vitro. The regulation of quite complex processes can be unravelled as a series of steps like this. The key is that the epistatic gene is always required for the function of the gene it is epistatic to. When genes act in a regulatory pathway, mutations in the later gene are normally epistatic to mutations in the earlier gene.

When establishing a genetic pathway from epistasis it is very important to know whether the mutations are null mutations, unlike in the study of individual genes, where a hypomorphic allele is almost as good. A cell that contains fewer proteins than normal (because it has null mutations) may behave more simply than a wild-type cell. In contrast, a cell that contains several modified proteins (because of other kinds of mutations) may be more complicated than a wild-type cell. Consider a mutation in a cell surface receptor A. Mutations in another gene B are epistatic (i.e. the double mutant AB resembles B).
If the receptor mutant A is a null, gene B must act downstream of the receptor, perhaps on the signal transduction pathway, because seeing the effect of mutant A requires a functional B gene product. But if the receptor mutation has some residual function, one cannot tell whether gene B acts downstream or upstream. For example, ligand concentration might affect activity of the mutant A receptor. Although it might seem obvious, realizing the importance of null mutations has been of major significance to modern genetics. When more and more null mutations are combined, fewer and fewer proteins are present and the process under study should become simpler and simpler. Restoring each wild type gene to a multiply-mutant genotype is like adding one protein at a time to a reaction, readily allowing its properties to be studied. It is significant that the Ubx- and Scr- mutations had different phenotypes. Different phenotypes can result if there is a negative regulatory interaction, as here. If the mutations have the same phenotype, epistasis is difficult to assess. This is the case, for example, with metabolic pathways, which have to be analyzed slightly differently. Dominant mutations are also useful for ordering genes. This is especially true when the null mutations have similar phenotypes, unlike the Scr, Ubx example. Epistasis can still be determined if there is a dominant gain of function mutation with a different phenotype. For example, null mutations for a cell surface receptor, its ligand, or second messenger system may well have similar phenotypes, making epistasis hard to detect. If there is a constitutively activated receptor mutant with the opposite phenotype, these components can be ordered. The activated receptor will be epistatic to null mutations of

the ligand, but null mutations in the signal transduction steps will be epistatic to the receptor (because the constitutive receptor still requires the signal transduction machinery for its effect, but does not require the ligand).

Sometimes redundancy (see earlier) can help order genetic pathways. As an exercise consider a process like this:

    A --> B --\
               >--> E --> F
    C --> D --/

where A - F represent gene products, and null mutations in A, B, C, and D have less effect on the process than null mutations in E and F, because the earlier branches of the pathway are redundant and E can be partly activated by each branch. Although there are no dominant mutants or negative regulation, you will find that much of the pathway can easily be deduced from the phenotypes of the various double mutants.
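One way to work through the exercise is to tabulate the double mutants with a short Python sketch. It assumes the diagram depicts two branches, A then B and C then D, converging on E, which acts before F. The numerical model is also an assumption made only for illustration: each intact branch is taken to contribute half of the activation of E, and the names pathway_output and double_mutant_table are invented here.

```python
from itertools import combinations

GENES = "ABCDEF"


def pathway_output(null_mutants):
    """Relative output of the process when the given genes are null."""
    active = {gene: gene not in null_mutants for gene in GENES}
    branch_ab = active["A"] and active["B"]
    branch_cd = active["C"] and active["D"]
    if not (active["E"] and active["F"]):
        return 0.0  # E and F are each indispensable
    # Assumption: each intact branch supplies half of E's activation.
    return 0.5 * branch_ab + 0.5 * branch_cd


def double_mutant_table():
    """Output for every pairwise combination of null mutations."""
    return {a + b: pathway_output({a, b}) for a, b in combinations(GENES, 2)}


for genotype, output in double_mutant_table().items():
    print(genotype, output)
```

In this model, double mutants within one branch (A with B, or C with D) are no more severe than the single mutants, whereas any combination across branches abolishes the output, as do E and F mutants alone. Sorting the double mutants by that pattern groups A with B and C with D, and places E and F downstream of both branches.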
