Lec 6 BI 317 Lecture 6 (McStay)

BI317 Human Molecular Genetics

Prof Brian McStay


Centre for Chromosome Biology
& Discipline of Biochemistry

Lecture 6

DNA sequencing (past, present and future)

and

How the human genome sequence was obtained

See Strachan and Read Chapter 8

Sanger (dideoxy) sequencing, run on thousands of automated fluorescent
sequencers at centres such as the Sanger Centre in Cambridge, UK, was
responsible for the human genome sequence.
Dideoxy nucleotides can be incorporated into DNA by polymerase but
act as chain terminators: the 3' OH group has been replaced by H, so the
chain cannot form a phosphodiester bond with an incoming dNTP.

[Figure: structures of a deoxy NTP (3' OH) and a dideoxy NTP (chain terminator)]
Fluorescent derivatives of each Dideoxy NTP have been developed
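
The logic of chain termination can be sketched in a few lines of Python. This is a toy model rather than a simulation of the real chemistry: it assumes that a terminator is incorporated at every position in at least one molecule, and the template sequence is made up for illustration.

```python
# Toy model of dideoxy (Sanger) sequencing; the template below is hypothetical.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def terminated_fragments(template):
    """Return each chain-terminated product as (length, terminal base).

    The template is written 3'->5', so the new strand grows left to right.
    Assumption: every position is terminated in at least one molecule.
    """
    new_strand = [COMPLEMENT[base] for base in template]
    return [(i + 1, base) for i, base in enumerate(new_strand)]


def read_sequence(fragments):
    """Order fragments by length (as electrophoresis does) and read the
    labelled terminal base of each one."""
    return "".join(base for _, base in sorted(fragments))


template = "TACGGTCA"
fragments = terminated_fragments(template)
print(read_sequence(fragments))  # the newly synthesised strand
```

Because each fragment ends in a labelled dideoxy base, sorting by length is enough to recover the sequence of the new strand.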
The Sanger method underpinned molecular genetics for three decades. Its main
disadvantage is the need for gel electrophoresis, which makes it difficult
to scale up. Even the most advanced machines could sequence only 96 samples at
a time (30-60 kb every 3-4 hrs).
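
To get a feel for why this throughput limits scale-up, here is a rough back-of-the-envelope calculation. The genome size of 3.2 Gb is an assumption (an approximate haploid human genome), and the calculation covers the genome only once on a single machine, ignoring overlap and repeat runs.

```python
GENOME_BP = 3.2e9  # assumption: approximate haploid human genome size


def years_for_one_pass(kb_per_run, hours_per_run, genome_bp=GENOME_BP):
    """Years one machine would need to read genome_bp bases once."""
    bp_per_hour = kb_per_run * 1000 / hours_per_run
    hours = genome_bp / bp_per_hour
    return hours / (24 * 365)


# Best case and worst case from the figures quoted above.
print(round(years_for_one_pass(60, 3), 1))
print(round(years_for_one_pass(30, 4), 1))
```

Even in the best case a single machine would need well over a decade for one pass, which is why thousands of sequencers were run in parallel.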

New technologies developed in the mid-2000s led to sequencing of DNA
during strand synthesis. No gel was required, and this led to the development
of massively parallel sequencing, also known as next generation sequencing
(NGS).
Illumina workflow (1)

Library preparation

Samples consisting of longer fragments are first sheared into a random library of fragments 100-300
base pairs long. After fragmentation, the ends of the DNA fragments are repaired and an A-overhang
is added at the 3' end of each strand. Adaptors, which are necessary for amplification and
sequencing, are then ligated to both ends of the DNA fragments. These fragments are then size
selected and purified.
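
The shearing and size-selection steps can be mimicked in silico. This is only a sketch: real shearing is physical or enzymatic, the uniform length distribution is an assumption, and the function names are hypothetical.

```python
import random


def shear(sequence, min_len=100, max_len=300, rng=None):
    """Cut sequence into consecutive fragments of random length."""
    rng = rng or random.Random()
    fragments, pos = [], 0
    while pos < len(sequence):
        size = rng.randint(min_len, max_len)
        fragments.append(sequence[pos:pos + size])
        pos += size
    return fragments


def size_select(fragments, min_len=100, max_len=300):
    """Keep only fragments inside the wanted size window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


rng = random.Random(0)  # fixed seed so the example is repeatable
sample = "".join(rng.choice("ACGT") for _ in range(5000))
library = size_select(shear(sample, rng=rng))
print(len(library), min(len(f) for f in library), max(len(f) for f in library))
```

Only the final fragment can fall below the window, and size selection removes it.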
Illumina workflow (2)

Cluster Generation

Cluster generation is performed on the Illumina cBot. Single DNA fragments attach to the flow cell
by hybridizing to oligos on its surface that are complementary to the ligated adaptors. The DNA molecules are
then amplified by so-called bridge amplification, which results in hundreds of millions of unique clusters.
Finally, the reverse strands are cleaved and washed away, and the sequencing primer is hybridized to the
DNA templates.
Illumina workflow (3)

Sequencing
During sequencing, the huge number of clusters generated are sequenced simultaneously. The DNA
templates are copied base by base using the four nucleotides (A, C, G, T), which are fluorescently labeled
and reversibly terminated. After each synthesis step, the clusters are excited by a laser, which causes
fluorescence of the last incorporated base. The fluorescent label and the blocking group are then
removed, allowing the addition of the next base. The fluorescence signal after each incorporation step is
captured by a built-in camera, producing images of the flow cell.
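
The cycle-by-cycle readout described above can be illustrated with a minimal base caller. This is a simplified sketch, not Illumina's actual software: it assumes one intensity value per base per cycle for a single cluster, and the intensity values are invented.

```python
CHANNELS = "ACGT"


def call_bases(cycles):
    """Call one base per cycle by picking the brightest channel.

    cycles: list of 4-tuples of intensities in the order A, C, G, T,
    one tuple per sequencing cycle for a single cluster.
    """
    read = []
    for intensities in cycles:
        brightest = max(range(4), key=lambda i: intensities[i])
        read.append(CHANNELS[brightest])
    return "".join(read)


# Hypothetical intensities for one cluster over four cycles.
cluster = [
    (0.90, 0.10, 0.00, 0.10),
    (0.10, 0.00, 0.80, 0.20),
    (0.00, 0.10, 0.10, 0.95),
    (0.20, 0.70, 0.10, 0.00),
]
print(call_bases(cluster))  # AGTC
```

Real base callers also correct for signal decay and for crosstalk between channels, which this sketch ignores.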
[Figure: Illumina instruments MiSeq, NextSeq 500 and HiSeq 2500; label: 2000X human genome]
Sequencing without an amplification step could overcome the bias introduced
by the amplification step of current NGS platforms (e.g. Illumina).

This prompted the advent of single molecule sequencing, the most
notable platform being that offered by Pacific Biosciences (PacBio).
PacBio currently gives fewer and less
accurate reads than Illumina. The real value
is the length of the reads obtained. This
offers significant advantages when
sequencing difficult regions of genomes.
Sequencing of larger inserts (15-20 kb) to support de novo
assembly applications with HiFi data
By combining longer but possibly less accurate PacBio reads with shorter
Illumina reads we can now sequence through repeated DNA

[Figure: target DNA, PacBio reads, Illumina reads]
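
Why read length matters for repeated DNA can be shown with a small example. The sequences here are invented and the matching is exact, which real aligners do not require; it is a sketch of the idea only.

```python
def match_positions(genome, read):
    """All start positions at which read occurs exactly in genome."""
    return [i for i in range(len(genome) - len(read) + 1)
            if genome.startswith(read, i)]


# Hypothetical target containing the same repeat twice.
repeat = "CAGCAGCAGCAG"
genome = "TTGACCA" + repeat + "GGATC" + repeat + "ACCTGA"

short_read = "CAGCAGCAG"            # lies entirely inside the repeat
long_read = "CCA" + repeat + "GGA"  # spans the repeat plus unique flanks

print(match_positions(genome, short_read))  # several candidate positions
print(match_positions(genome, long_read))   # a single position
```

The short read cannot be placed unambiguously, whereas the long read is anchored by the unique sequence on either side of the repeat.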


Array-based DNA capture can enable targeted resequencing

E.g. exome sequencing captures all exons, for instance in cancer samples


Human genome project
Framework maps are needed for the first-time sequencing of
complex genomes

Genetic maps rely on the principle that if two mutant phenotypes show a tendency
to be inherited together, the underlying loci can be expected to be closely linked
on the same chromosome.
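
The tendency to be inherited together is usually quantified as a recombination fraction. The sketch below uses invented offspring counts and Haldane's mapping function, which assumes no crossover interference; both are illustrative choices and are not taken from the lecture.

```python
import math


def recombination_fraction(recombinants, total):
    """Proportion of offspring in which the two loci were separated."""
    return recombinants / total


def haldane_cm(theta):
    """Map distance in centimorgans under Haldane's mapping function."""
    if not 0 <= theta < 0.5:
        raise ValueError("theta must be in [0, 0.5)")
    return -50 * math.log(1 - 2 * theta)


# Hypothetical cross: 12 recombinant offspring out of 100 scored.
theta = recombination_fraction(12, 100)
print(theta, round(haldane_cm(theta), 2))
```

A fraction near 0 means tight linkage, while a fraction approaching 0.5 is indistinguishable from independent assortment.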

Genetic maps can be constructed in model organisms

For ethical and practical reasons, genetic mapping of mutations
could never be contemplated in humans
The first human genetic maps were of low resolution and
were constructed using polymorphic DNA markers

RFLPs (restriction fragment length polymorphisms)

Assayed by Southern blotting
(slow)
A second generation human genetic map was based on Microsatellite DNA

Microsatellite DNA is also known as short tandem repeats (STRs) (remember 2nd year lectures).

Microsatellite instability arises because of mistakes during replication

Assayed by PCR
(rapid)
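
A PCR assay distinguishes microsatellite alleles by product size, which depends on the number of repeat units. A minimal sketch with an invented CA repeat and made-up flanking sequences:

```python
import re


def longest_repeat_run(sequence, unit):
    """Largest number of back-to-back copies of unit found in sequence."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence)
    return max((len(run) // len(unit) for run in runs), default=0)


# Hypothetical flanking sequences, standing in for the PCR amplicon ends.
LEFT, RIGHT = "GGTACCTT", "TTGGAACC"
allele_1 = LEFT + "CA" * 12 + RIGHT
allele_2 = LEFT + "CA" * 15 + RIGHT

for allele in (allele_1, allele_2):
    print(longest_repeat_run(allele, "CA"), len(allele))
```

The two alleles differ by three repeat units, so their PCR products differ by 6 bp.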
Marker on average every 3.5 Mb

Marker ~ every Mb
Physical Mapping
Somatic cell hybrids

Radiation hybrids contain fragments of human chromosomes, generated by
X-rays, integrated into rodent chromosomes
Physical Mapping
Yeast Artificial Chromosome (YAC) vectors enable cloning of megabase-sized fragments of DNA

• URA3 and TRP1 provide positive selection for yeast containing the YAC

• SUP4 disruption by the insert turns positive colonies from white to red

• ARS is a replication origin

• Telomeres (TEL) and a centromere (CEN4) confer stability on the artificial chromosome
A Clone Contig: The chromosomal sequence from position A to B is represented by
overlapping DNA inserts in a series of genomic clones (YACs or BACs). Clones with
overlapping inserts are generated by the random fragmentation of DNA (usually by partial
digestion with restriction enzymes) when a genomic DNA library is constructed.
Physical Mapping
Bacterial Artificial Chromosome (BAC) vectors

A bacterial artificial chromosome (BAC) is a vector based on the F-plasmid.

BAC vectors are usually in the range of 10-13 kb in size.

The usual insert size of a BAC is 100-300 kb.

The enormous capacity of BAC vectors makes them especially suited to genome
projects (i.e. fewer clones are required to cover the genome).
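
The effect of insert size on library size can be estimated with the Clarke-Carbon formula, N = ln(1 - P) / ln(1 - f), where f is the insert size divided by the genome size and P is the probability that any given sequence is represented. The genome size (3.2 Gb), the 150 kb insert (mid-range of the figures above) and the 40 kb comparison vector are assumptions chosen for illustration.

```python
import math


def clones_needed(insert_bp, genome_bp, probability=0.99):
    """Clarke-Carbon estimate of the number of random clones required."""
    f = insert_bp / genome_bp
    return math.ceil(math.log(1 - probability) / math.log1p(-f))


GENOME_BP = 3.2e9  # assumption: approximate haploid human genome size

print(clones_needed(150_000, GENOME_BP))  # BAC-sized insert
print(clones_needed(40_000, GENOME_BP))   # hypothetical smaller-insert vector
```

The larger the insert, the fewer clones have to be picked, mapped and sequenced.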
Sequence-tagged sites

An important physical mapping aim was to build a map based on sequence-tagged
site (STS) markers.

An STS marker is any known UNIQUE DNA sequence that can easily be assayed by
PCR.

STS markers include polymorphic markers such as microsatellites and many more
non-polymorphic sequences (many obtained from end sequence of BAC clones).
Aligning BAC clones by hybridisation with STS probes
A Clone Contig: An example of human clone contig assembly, showing YAC clones from a
portion of human chromosome 2. Positive typing for an STS marker is indicated by a
closed circle, and brackets indicate the absence of an expected STS (YAC instability).
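
Ordering clones by their STS content can be sketched as follows. The clone names, STS names and the assumption that the STS order is already known from a framework map are all hypothetical; real STS-content mapping also has to cope with false positives and missing markers.

```python
def order_clones(sts_order, clone_hits):
    """Order clones along the map and report gaps between neighbours.

    sts_order: STS names in map order.
    clone_hits: dict mapping clone name -> set of STS names it tests positive for.
    Returns (ordered clone names, list of adjacent pairs sharing no STS).
    """
    position = {sts: i for i, sts in enumerate(sts_order)}

    def span(clone):
        hits = [position[s] for s in clone_hits[clone]]
        return (min(hits), max(hits))

    ordered = sorted(clone_hits, key=span)
    gaps = [(a, b) for a, b in zip(ordered, ordered[1:])
            if not clone_hits[a] & clone_hits[b]]
    return ordered, gaps


sts_order = ["STS1", "STS2", "STS3", "STS4", "STS5"]
clone_hits = {
    "BAC_C": {"STS3", "STS4"},
    "BAC_A": {"STS1", "STS2"},
    "BAC_D": {"STS4", "STS5"},
    "BAC_B": {"STS2", "STS3"},
}
print(order_clones(sts_order, clone_hits))
```

Neighbouring clones that share at least one STS overlap, so an empty gap list means the clones form a single contig.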
Expressed sequence tags

Other markers were obtained from sequenced cDNA clones and these were known
as expressed sequence tags (ESTs)

Library of
cDNA inserts
Two approaches to genome sequencing

The hierarchical approach is best for first-time sequencing, but for
resequencing, whole genome shotgun is the best approach, especially
in the era of NGS
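
The core idea of shotgun assembly, merging reads by their overlaps, can be shown with a toy greedy assembler. This is a sketch only, not the algorithm used by either genome project: it assumes error-free reads from one strand, and the reads are invented.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no overlaps left: the remaining reads are separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical reads sampled from one short sequence.
reads = ["GTACCTTAG", "ATGGCGTAC", "CTTAGGCA"]
print(greedy_assemble(reads))
```

Greedy merging of this kind goes wrong when a repeat is longer than the reads, which is one reason a clone-by-clone hierarchy helps for a first-time assembly.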
Human Genome Project: international consortium, publicly funded, hierarchical approach

Celera Genomics: commercial enterprise, whole genome shotgun approach

The first draft genomes from both were published simultaneously:
15th Feb 2001 (Human Genome Project) and 16th Feb 2001 (Celera Genomics)

Currently we are on draft 19 (Hg19)


The human genome is complete, or is it?

α satellite DNA is present at all human centromeres

Large blocks of β satellite and sat 1, sat 2 and sat 3 are
indicated in red

Blocks of repetitive sequence present a considerable problem in contig assembly


Powerful genome databases and genome browsers
help to store and analyze the enormous amounts of
sequence data available
Genome browsers such as
Ensembl allow users to explore
selected sub-chromosomal regions
using a graphical interface.

Here the subject of the query was the human CFTR (cystic
fibrosis transmembrane conductance regulator) gene on
chromosome 7q31.
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between
sequences. The program compares nucleotide or protein sequences to sequence databases
and calculates the statistical significance of matches. BLAST can be used to infer functional
and evolutionary relationships between sequences as well as help identify members of gene
families.
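
What "local similarity" means can be illustrated with the Smith-Waterman algorithm. Note that BLAST itself uses fast heuristics (seeding and extension) rather than this exact dynamic programme; the sketch below swaps in Smith-Waterman because it is short and exact, and the scoring values and sequences are arbitrary choices.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score never drops below 0, so an alignment can start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two hypothetical sequences sharing the stretch ACGTACG.
print(local_alignment_score("TTTACGTACGTTT", "GGGACGTACGGGG"))  # 14
```

The shared stretch of seven bases scores 7 x 2 = 14, and the unrelated flanks do not reduce the score because the alignment is local.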
[Figures: query as a DNA sequence or a protein sequence; Nucleotide BLAST results;
TBLASTN, protein vs translated nucleotide database of Danio rerio]
1. Sequencing technologies

Sanger sequencing
NGS (Illumina)
Single molecule long read (PacBio, Nanopore)

2. Human Genome Project

3. A very brief intro to analysing genomic data
