Transactions of the Nigerian Society of Biochemistry and Molecular Biology

Vol. 1, No. 1: 2011 pp. 5-12 (Printed in Nigeria) © 2011 Nigerian Society of Biochemistry and Molecular Biology

Biochemistry and Bioinformatics

Clement O. Bewaji
Department of Biochemistry, University of Ilorin, P. M. B. 1515, Ilorin, Kwara State, Nigeria

Introduction

Bioinformatics is a new academic discipline and an offshoot of Biochemistry. The word “bioinformatics” suggests a bridge between the world of biology and that of information technology. It is best defined as the branch of science which applies computer technology and information theory to the study of biological sequence data.

Biochemistry was the first discipline to serve as a bridge between the physical and biological sciences. At the early stage, it was aptly named physiological chemistry, which was later changed to biological chemistry (biology + some chemistry). Other disciplines quickly followed: biophysics, biomathematics and biocomputing, the last being the favourite of crystallographers and those studying the three-dimensional structures of macromolecules such as proteins and nucleic acids. With the advent of large-scale gene sequencing, a new scientific discipline, known as computational molecular biology or bioinformatics, was born.

The origin of bioinformatics can be traced to the development by Sanger of a technique for the sequencing of nucleic acids (Adamo, Filoteo et al. 1995) (Sanger and Coulson, 1975). The original technique, devised in 1975, was subsequently improved upon and automated (Maxam and Gilbert, 1977; Sanger et al., 1977a). For this invention, and for working out the first complete nucleotide sequence for an organism (Sanger et al., 1977b), Sanger won a second Nobel Prize in Chemistry, awarded in 1980 (see Sanger, 1981 for the text of his Nobel lecture). He received his first Nobel Prize in 1958, also for inventing a sequencing technique, in that case for proteins (Sanger and Tuppy, 1961; Sanger and Thompson, 1963). Sanger’s pioneering efforts paved the way for the sequencing of other genomes, starting with the bacterial plasmid pBR322 (Sutcliffe, 1979).

Bioinformatics is still primarily concerned with access to and analysis of the databases of published gene and protein sequences. Increasingly, this is being augmented by similar methods applied to databases of macromolecular structures and genetic maps. At present, over one billion DNA and protein sequences have been determined and deposited in computerised databases. These sequences contain a wealth of information hidden within them, including protein structure, disease mechanisms and drug target sites. The challenge facing researchers in biomedical and pharmaceutical disciplines is how to extract biologically useful information from millions of sequences. This is what bioinformatics is supposed to address, using a multidisciplinary approach which combines computer science, information theory and molecular biology.

Quite recently, attention has shifted to comparative genomics and the study of single nucleotide polymorphisms (SNPs). This has been made possible by advances in genome sequencing technology (Alkan et al., 2009; Cai et al., 2004; Chen et al., 2009; Conrad et al., 2010; Drmanac et al., 2010). Bioinformatics tools have also been deployed to the study of genomic variations among human races. This has been tagged the Human Haplotype Project (International HapMap Consortium 2003, 2005; Kidd et al., 2008; Kim et al., 2009; Park et al., 2010; Redon et al., 2006).

Information Storage and Retrieval

Protein and DNA sequences are being determined in several laboratories around the world on a continuous basis. This calls for the establishment of banks for storing these data. Three big "banks" have emerged and are collaborating to produce computerised DNA sequence databases. These include:

(1) the European Molecular Biology Laboratory (EMBL) in Cambridge, U.K.;
(2) GenBank, based at the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine, located on the National Institutes of Health (NIH) campus in the USA; and
(3) the DNA Data Bank of Japan (DDBJ).

These "banks" exchange information daily and have agreed on common standards for DNA sequence database entries. Each bank is responsible for collecting data from a different geographical area; e.g. EMBL covers Europe while GenBank covers North America. The data contained in these banks can be accessed using the following web addresses:

EMBL – http://www.ebi.ac.uk/ebi
GenBank – http://www.ncbi.nlm.nih.gov
DDBJ – http://www.ddbj.nig.ac.jp

The three banks are holding a tremendous amount of information. EMBL alone has over a billion nucleotide bases, with entries from the Human Genome Project (HGP) constituting 54% of the total. More than 15,500 different species are represented in the database, the most popular being human (Homo sapiens), baker’s yeast (Saccharomyces cerevisiae) and mouse (Mus musculus).

Information from various genomic projects is also deposited in different databases. Organisms whose genomes have been completely sequenced include S. cerevisiae, Plasmodium falciparum, Haemophilus influenzae, Mycoplasma genitalium and Escherichia coli (Table 1).

Table 1: Some genomes that have been completely sequenced.

Organism                      Haploid genome size (bp)   No. of chromosomes   Remarks
Mycoplasma genitalium         580,000                    1                    Human parasite
Rickettsia prowazekii         1,112,000                  1                    Bacterium, causative organism of typhus
Borrelia burgdorferi          1,444,000                  1                    Lyme disease spirochaete
Methanococcus jannaschii      1,665,000                  1                    Thermophilic methanogenic archaeon
Haemophilus influenzae        1,830,000                  1                    Bacterium, human pathogen
Archaeoglobus fulgidus        2,178,000                  1                    Hyperthermophilic, sulphate-reducing archaeon
Synechocystis sp.             3,573,000                  1                    Cyanobacterium
Mycobacterium tuberculosis    4,412,000                  1                    Causative organism of tuberculosis
Escherichia coli              4,639,000                  1                    Bacterium, human symbiotic organism
Schizosaccharomyces pombe     13,800,000                 3                    Fission yeast
Saccharomyces cerevisiae      11,700,000                 16                   Baker’s yeast
Plasmodium falciparum         30,000,000                 14                   Protozoan, causative organism of malaria
Caenorhabditis elegans        97,000,000                 6                    Nematode worm
Drosophila melanogaster       137,000,000                4                    Fruit fly
Arabidopsis thaliana          117,000,000                5                    Flowering plant
Oryza sativa                  430,000,000                12                   Rice
Mus musculus                  2,500,000,000              20                   Mouse
Rattus norvegicus             2,600,000,000              21                   Rat
Homo sapiens                  3,200,000,000              23                   Human

Multiple Sequence Alignment

Bioinformatics is essentially a theoretical discipline which attempts to make predictions about biological functions from sequence data. It is also a powerful tool in experimental design. The eventual goal is to deduce protein structure and function strictly from amino acid sequences. However, all we can do at present is to see whether the new sequences being discovered are similar to sequences whose structure and/or function are already known. This is achieved by the use of sequence alignment tools designed with various algorithms.

The sequence similarity of two polypeptides or DNA molecules can be estimated quantitatively by determining the number of aligned residues that are identical. For example, human and dog cytochrome c differ in 11 out of 104 residues. They are therefore 89% identical, i.e. [(104 – 11)/104] x 100. In the same way, it can be deduced that human and baker’s yeast cytochrome c are [(104 – 45)/104] x 100 = 57% identical. When determining the percentage identity of two protein or DNA molecules, the length of the shorter protein or DNA is, by convention, used as the denominator.

Multiple sequence alignment is the process of aligning two or more sequences with each other so as to bring as many similar sequence characters (nucleotides or amino acids) into register as possible. The resulting alignments can be used for two purposes: first, to find regions of similar sequence in all of the sequences that define a conserved consensus pattern or domain; and second, if the alignment is particularly strong, to use the aligned positions to try to derive the possible evolutionary relationships among the sequences. When dealing with a sequence of unknown function, the presence of similar domains in several similar sequences implies a similar biochemical function or structural fold that may become the basis of further experimental investigation. A group of similar sequences may define a protein family which may share a common biochemical function or evolutionary origin.

Similar proteins have been organized by several laboratories into protein families. The sequence families originally identified by Margaret Dayhoff and colleagues became the basis of the PAM matrices used for sequence comparisons (Dayhoff et al., 1978). PAM stands for ‘point accepted mutation’, often glossed as ‘percent accepted mutations’. Subsequently, Amos Bairoch identified a large number of protein families and prepared a database of amino acid patterns, called the PROSITE catalog, which defines the active sites of these proteins. These patterns are called MOTIFS. Subsequently, Henikoff and Henikoff (1991) aligned all these families, added more members from the databases, and then defined conserved patterns of amino acids called blocks. Blocks are present in all members of the family and are approximately 4-60 amino acids long with no gaps. The BLOSUM (Blocks Substitution Matrix) amino acid comparison tables were derived from these aligned blocks. Common patterns of amino acids arising from multiple sequence alignment have also been called MOTIFS. These motifs may have more than one amino acid at each position and may include gaps. Once motifs have been defined, the sequence databases may be searched for additional sequence entries with the same motif.

Simultaneous alignment of three or more sequences poses a difficult algorithmic problem, but programs to perform such alignments are now available (for review, see Chan et al. 1992) and the methods used have been compared (McClure et al. 1994). The method often used, especially for ten or more sequences, is to first determine sequence similarity between all pairs of sequences in the set. Based on these similarities, various methods are then used to cluster the sequences into the most related groups or into a phylogenetic tree.

In the group approach, a consensus is produced for each group and then used to make further alignments between groups. Examples of programs using the grouping approach are PIMA (Smith and Smith 1990), which utilizes several novel alignment techniques, a program described by Taylor (1990), and MAXHOM (Sander and Schneider 1991).

The tree method uses the distance method of phylogenetic analysis to arrange the sequences. The two closest sequences are aligned first, and the resulting consensus alignment is then aligned with the next best sequence or cluster of sequences, and so on, until an alignment is obtained which includes all the sequences. The programs GCG PILEUP developed by the Genetics Computer Group (GCG), CLUSTAL W (Higgins and Sharp 1988), the ALIGN set of programs (Feng and Doolittle 1987) and the MS/DOS program by Corpet (Corpet 1988) utilize this method. A disadvantage of all these methods is that the algorithm used is greedy, since the alignment is driven by the most alike sequences. An example of using CLUSTAL to align a similar portion of four hypothetical protein sequences is shown in Fig. 1. Note that the CLUSTAL alignment produces a consensus sequence at the bottom which shows absolutely conserved positions with an "*" and almost completely conserved positions with a ":", but the rest of the alignment is also remarkably similar.
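The percentage-identity calculation and the all-pairs comparison that starts the tree method can be illustrated with a short Python sketch. It is not taken from any of the programs cited here. It assumes the sequences are already aligned column by column and contain no gap characters, and the function names and the four toy sequences are hypothetical, introduced only for illustration.

```python
from itertools import combinations


def identity_from_differences(length, differences):
    """Percent identity from a residue count and a count of differing residues."""
    return 100.0 * (length - differences) / length


def percent_identity(seq_a, seq_b):
    """Percent of aligned positions that hold the same residue.

    Assumes both strings are already aligned and gap-free. The length of the
    shorter sequence is the denominator, following the convention in the text.
    """
    shorter = min(len(seq_a), len(seq_b))
    if shorter == 0:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / shorter


def closest_pair(sequences):
    """Return (name_1, name_2, identity) for the most similar pair.

    This is only the first step of the tree method: score every pair,
    then start from the two closest sequences.
    """
    best = None
    for name_1, name_2 in combinations(sorted(sequences), 2):
        score = percent_identity(sequences[name_1], sequences[name_2])
        if best is None or score > best[2]:
            best = (name_1, name_2, score)
    return best


# Hypothetical, equal-length toy sequences (not real proteins).
TOY = {
    "s1": "MAKEHASTE",
    "s2": "MAKEHASTY",
    "s3": "TAKEWASTE",
    "s4": "MILKHOSTS",
}

if __name__ == "__main__":
    # Cytochrome c figures quoted in the text: 11 and 45 differences in 104.
    print(round(identity_from_differences(104, 11)))
    print(round(identity_from_differences(104, 45)))
    print(closest_pair(TOY))
```

A real progressive aligner would go on to align the closest pair, build a consensus and repeat. The sketch stops at the pairwise scoring step.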

A   MAKEHASTEWHILETHEMANSHINES
B   MAKEHASTEWHILETHEMENSHINE
C   TAKEHASTEWHILETHEMANSHINES
D   MAKEHASTEWHILETHEGIRLSSHINE
    :****************. ......

Fig. 1: Alignment of some hypothetical proteins with CLUSTAL W.

A novel algorithm, MSA, that reduces the computational demands of dynamic programming was used to align up to eight protein sequences simultaneously (Lipman et al. 1989). Another method, which runs under the UNIX operating system (Vingron and Argos 1991), aligns all possible pairs of sequences to create a set of dot matrices, and the matrices are then filtered sequentially to find motifs which provide a starting point for sequence alignment. MACAW is an interactive MS/DOS computer program which searches for conserved segments in a set of sequences, and then allows the user to modify the alignments (Schuler et al. 1991). Another set of UNIX programs for interactive multiple sequence alignment by dot matrix analysis and other alignment techniques has been developed (Boguski et al. 1992).

The method most recently applied to multiple sequence alignment is that of Hidden Markov Models (Baldi et al. 1994), which has been shown to be more successful in identifying motifs in protein families than several of the aforementioned methods (McClure, 1995).

Multiple sequence alignments should be done in a fashion that simultaneously minimizes the number of changes needed during evolution to generate the observed sequence variation. The program TREEALIGN utilizes this approach (Hein 1990). In TREEALIGN (also named ALIGN in the program versions), the author has designed a method for performing the alignment and the most parsimonious tree construction simultaneously. Using this program requires the assistance of an experienced UNIX programmer. The method has two important features: (1) novel applications of mathematical graph theory to perform alignments between multiple sequences within a phylogenetic tree, and (2) the use of distance scores between sequences rather than the similarity scores usually used when aligning sequences by dynamic programming. The initial steps are similar to other multiple sequence alignment methods, except for the use of a distance scale, i.e. the sequences are aligned pair-wise and the resulting distance scores are used sequentially to produce a tree, which is rearranged as more sequences are added. The sequences are then re-aligned so that the same tree can be produced by maximum parsimony. Finally, the tree is rearranged to maximize parsimony. The advantage of this method is the use of phylogenetic analysis to improve the multiple sequence alignment. A disadvantage is that use of the maximum parsimony criterion does not produce statistically testable models and favours phylogenetic trees with an increased number of nodes and short branches.

Blocks Substitution Matrices for Protein Sequence Comparisons (BLOSUM)

The BLOSUM substitution matrices are used for scoring protein sequence alignments (Henikoff and Henikoff, 1992). The matrix values are based on the observed amino acid substitutions in a large set of approximately 2000 conserved amino acid patterns, called blocks. These blocks have been found in a database of protein sequences representing over 500 groups of related proteins (Henikoff and Henikoff 1992), and act as signatures of these protein families. The BLOSUM matrices are based on an entirely different principle and a much larger data set than the Dayhoff PAM matrices, which are derived from the observed rate of mutation during the predicted evolutionary changes in a smaller number of protein families. The BLOSUM62 matrix, explained below, detected more distant relationships in a BLAST search, and produced an alignment of diverged proteins more in agreement with three-dimensional structures, than did the corresponding PAM60 matrix (Henikoff and Henikoff 1992, 1993). The BLOSUM62 matrix is, therefore, highly recommended for sequence alignment and database searching.

To prepare the BLOSUM matrices, the sequences of the individual proteins in each of the 500 families were aligned in the regions defined by the blocks. Each column in the aligned sequences then provided a set of possible amino acid substitutions. Note that the probability of change from amino acid A to amino acid B is the same as the probability of the reverse change from B to A in this analysis. The types of substitutions were then scored for all aligned patterns in the database. More common substitutions should represent a closer relationship between two amino acids in related proteins, and thus receive a more favourable score in sequence alignment. Conversely, rare substitutions should be less favoured. This procedure, however, can lead to an over-representation of amino acid substitutions which occur in the most closely related members of the protein families. To reduce this dominant contribution from the most alike proteins, the sequences of these proteins were grouped

together into one sequence before scoring the amino acid substitutions in the aligned blocks. The amino acid changes within these clustered sequences were then averaged. Patterns which were 60% identical were grouped together to make one substitution matrix called BLOSUM60, and those 80% alike to make another matrix called BLOSUM80, and so on.

As the clustering percentage was increased, the ability of the resulting matrix to distinguish actual from chance alignments, defined as the relative entropy or average information content per residue pair, also increased. However, at the same time, the dominance effect of the more alike proteins also increased and biased the matches. BLOSUM62 (Table 2) represents a balance between information content and match bias and is the favoured matrix, the one best able to predict alignments among all protein families.

The amino acids in the table are grouped according to the chemistry of the side group: C (sulfhydryl), STPAG (small hydrophilic), NDEQ (acid, acid amide and hydrophilic), HRK (basic), MILV (small hydrophobic) and FYW (aromatic). Each entry is derived from the actual frequency of occurrence of the amino acid pair in the blocks database, clustered at the 62% level, divided by the expected probability of occurrence. The expected value is calculated from the frequency of occurrence of each of the two individual amino acids in the blocks database, and provides a measure of a chance alignment of the two amino acids. The actual/expected ratio is expressed as a log odds score in so-called half-bit units, obtained by converting the ratio to a logarithm to the base 2, and then multiplying by 2. A zero score means that the frequency of the amino acid pair in the database was as expected by chance, a positive score that the pair was found more often than by chance, and a negative score that the pair was found less often than by chance. The accumulated score of an alignment of several amino acids in two sequences may be obtained by adding up the respective scores of each individual pair of amino acids.

The highest scoring matches are between amino acids which are in the same chemical group, and the very highest scoring matches are for cysteine-cysteine matches and for matches among the aromatic amino acids. A similar variation is seen in the PAM matrices. Compared to the PAM160 matrix, however, the BLOSUM62 matrix gives a more positive score to mismatches with the rare amino acids, e.g. cysteine, a more positive score to mismatches with hydrophobic amino acids, but a more negative score to mismatches with hydrophilic amino acids (Henikoff and Henikoff 1992).

Sequence alignments based on Bayesian Statistics

There is a need for a method of finding sequence alignments that examines all possible alignments and does not depend on the choice of scoring matrix and gap penalties. To solve this problem, the following algorithms/properties have been considered:

• dynamic programming provides many different possible alignments and may have many alignments with about the same score;
• each combination of scoring matrix and gap penalty gives a different alignment and alignment score;
• for protein alignments, gaps should occur only in certain regions corresponding to loops on the outside of the 3D structure; they should not occur in regions corresponding to the core of the 3D structure, a tightly packed hydrophobic environment;
• instead of looking for alignments with gaps, there is another algorithm for finding ungapped regions with matches and mismatches; these are called blocks, and they might correspond to the α helices and β strands in the protein core.

The Need For A Bioinformatics Curriculum

Bioinformatics will continue to influence the way we think and the way we do science. Students of Biochemistry, Genetics or Molecular Biology will need to know more about the new discipline if they intend to get employment in the pharmaceutical or biotechnology industries. We are gradually approaching the era of personalised medicine (Wang et al., 2008; Wheeler et al., 2008; Bodmer and Bonilla, 2008).

This raises the question of developing a curriculum in Bioinformatics. Undergraduate and postgraduate students as well as researchers need a basic understanding of the Internet and how to search for appropriate information on it. This means that universities must invest in making a large number of networked PCs available for this purpose. This is a sine qua non for picking up the skill of browsing the web, which is essential in gaining access to biological sequence databases.

Teaching bioinformatics will be more difficult to achieve with our current educational philosophy. This is because our current practice in curriculum design treats the physical and biological sciences as alternatives. There is a need for all science students to take courses in mathematics, computer science and biology up to 200 level in the universities.

Table 2: The BLOSUM62 Amino Acid Substitution Matrix.

C S T P A G N D E Q H R K M I L V F Y W

C 9 -1 -1 -3 0 -3 -3 -3 -4 -3 -3 -3 -3 -1 -1 -1 -1 -2 -2 -2

S -1 4 1 -1 1 0 1 0 0 0 -1 -1 0 -1 -2 -2 -2 -2 -2 -3

T -1 1 4 1 -1 1 0 1 0 0 0 -1 0 -1 -2 -2 -2 -2 -2 -3

P -3 -1 1 7 -1 -2 -1 -1 -1 -1 -2 -2 -1 -2 -3 -3 -2 -4 -3 -4

A 0 1 -1 -1 4 0 -1 -2 -1 -1 -2 -1 -1 -1 -1 -1 -2 -2 -2 -3

G -3 0 1 -2 0 6 -2 -1 -2 -2 -2 -2 -2 -3 -4 -4 0 -3 -3 -2

N -3 1 0 -2 -2 0 6 1 0 0 -1 0 0 -2 -3 -3 -3 -3 -2 -4

D -3 0 1 -1 -2 -1 1 6 2 0 -1 -2 -1 -3 -3 -4 -3 -3 -3 -4

E -4 0 0 -1 -1 -2 0 2 5 2 0 0 1 -2 -3 -3 -3 -3 -2 -3

Q -3 0 0 -1 -1 -2 0 0 2 5 0 1 1 0 -3 -2 -2 -3 -1 -2

H -3 -1 0 -2 -2 -2 1 1 0 0 8 0 -1 -2 -3 -3 -2 -1 2 -2

R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5 2 -1 -3 -2 -3 -3 -2 -3

K -3 0 0 -1 -1 -2 0 -1 1 1 -1 2 5 -1 -3 -2 -3 -3 -2 -3

M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5 1 2 -2 0 -1 -1

I -1 -2 -2 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4 2 1 0 -1 -3

L -1 -2 -2 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4 3 0 -1 -2

V -1 -2 -2 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4 -1 -1 -3

F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6 3 1

Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7 2

W -2 -3 -3 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11

Source: Henikoff and Henikoff (1992).
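The half-bit log-odds conversion and the additive alignment score described for BLOSUM62 can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the procedure used to build the published matrix. The function names are hypothetical, the input frequencies in the usage lines are made-up numbers, rounding to the nearest integer is assumed because the table holds integers, and the lookup dictionary holds only a handful of entries copied from Table 2.

```python
import math


def half_bit_score(observed, expected):
    """Log-odds score in half-bit units: 2 * log2(observed / expected).

    Rounding to the nearest integer is an assumption made here so that
    the result looks like an entry of an integer scoring table.
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("frequencies must be positive")
    return round(2 * math.log2(observed / expected))


# A few entries copied from Table 2, keyed by residue pair. Only these
# pairs can be scored by this sketch; the full table has 20 x 20 entries.
TABLE2_SUBSET = {
    ("C", "C"): 9,
    ("W", "W"): 11,
    ("S", "S"): 4,
    ("A", "A"): 4,
    ("F", "Y"): 3,
    ("I", "L"): 2,
    ("K", "R"): 2,
    ("C", "S"): -1,
}


def pair_score(a, b, table=TABLE2_SUBSET):
    """Look up a residue pair in either order (raises KeyError if absent)."""
    if (a, b) in table:
        return table[(a, b)]
    return table[(b, a)]


def alignment_score(seq_a, seq_b, table=TABLE2_SUBSET):
    """Sum the pair scores over an ungapped alignment of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(a, b, table) for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    # Made-up frequencies: a pair seen four times as often as chance.
    print(half_bit_score(0.04, 0.01))
    # C-C + W-W + F-Y + K-R using the Table 2 entries above.
    print(alignment_score("CWFK", "CWYR"))
```

Because the scores are logarithms of odds ratios, adding them along an alignment corresponds to multiplying the underlying ratios, which is why the accumulated score is a simple sum.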

Students will need to be exposed to the various sequence databases available, the genomic projects that have been completed and how to perform simple analysis on a sequence, i.e. how to compare a new sequence against a database to determine whether similar sequences exist, and to make predictions about the function of the new sequence. Anyone who can acquire these skills will be in the highest demand in a global marketplace which cuts across the Atlantic.

References

Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, Kitzman JO, Baker C, Malig M, Mutlu O, Sahinalp SC, Gibbs RA, Eichler EE (2009) Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet, 41, 1061-1067.
Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M.A. (1994) Hidden markov models of biological primary sequence information. Proc. Natl. Acad. Sci. (USA) 91: 1059-1063.
Bodmer W, Bonilla C (2008) Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet, 40, 695-701.
Boguski M., Hardison R.C., Schwartz S. and Miller, W. (1992) Analysis of conserved domains and sequence motifs in cellular regulatory proteins and locus control regions using software tools for multiple alignment and visualization. New Biologist 4: 247-260.
Cai, Z., Tsung, E. F., Marinescu, V. D. et al. (2004) Bayesian approach to discovering pathogenic SNPs in conserved protein domains. Human Mutat., 24(2), 178–184.
Chan S.C., Wong A.K. and Chiu D.K. (1992) A survey of multiple sequence alignments methods. Bull. Math. Biol. 54:563-598.
Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP, Shi X, Fulton RS, Ley TJ, Wilson RK, Ding L, Mardis ER (2009) BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 6, 677-681.
Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P, Fitzgerald T, Hu M, Ihm CH, Kristiansson K, Macarthur DG, Macdonald JR, Onyiah I, Pang AW, Robson S, Stirrups K, Valsesia A, Walter K, Wei J, Tyler-Smith C, Carter NP, Lee C, Scherer SW, Hurles ME (2010) Origins and functional impact of copy number variation in the human genome. Nature, 464, 704-712.
Corpet F. (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res. 16:10881-10890.
Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) A model of evolutionary change in proteins. In Dayhoff, M.O. (ed.), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, Vol. 5, suppl. 3, pp. 345–352.
Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, Carnevali P, Nazarenko I, Nilsen GB, Yeung G, Dahl F, Fernandez A, Staker B, Pant KP, Baccash J, Borcherding AP, Brownley A, Cedeno R, Chen L, Chernikoff D, Cheung A, Chirita R, Curson B, Ebert JC, Hacker CR, Hartlage R, Hauser B, Huang S, Jiang Y, Karpinchyk V, et al. (2010) Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science, 327, 78-81.
Feng D.F. and Doolittle R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25:351-360.
Hein J. (1990) Unified approach to alignment and phylogenies. Methods Enzymol. 183: 626-645.
Henikoff S. and Henikoff J.G. (1991) Automated assembly of protein blocks for database searching. Nucleic Acids Res. 19: 6565-6572.
Henikoff S. and Henikoff J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89:10915-10919.
Henikoff S. and Henikoff J.G. (1993) Performance evaluation of amino acid substitution matrices. Proteins 17: 49-61.
Higgins D.G. and Sharp P.M. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237-244.
International HapMap Consortium (2003) The International HapMap Project. Nature, 426, 789-796.
International HapMap Consortium (2005) A haplotype map of the human genome. Nature, 437, 1299-1320.
Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen N, Teague B, Alkan C, Antonacci F, Haugen E, Zerr T, Yamada NA, Tsang P, Newman TL, Tuzun E, Cheng Z, Ebling HM, Tusneem N, David R, Gillett W, Phelps KA, Weaver M, Saranga D, Brand A, Tao W, Gustafson E, McKernan K, Chen L, Malig M, et al. (2008) Mapping and sequencing of structural variation from eight human genomes. Nature, 453, 56-64.
Kim JI, Ju YS, Park H, Kim S, Lee S, Yi JH, Mudge J, Miller NA, Hong D, Bell CJ, Kim HS, Chung IS, Lee WC, Lee JS, Seo SH, Yun JY, Woo HN, Lee H, Suh D, Lee S, Kim HJ, Yavartanoo M, Kwak M, Zheng Y, Lee MK, Park H, Kim JY, Gokcumen O, Mills RE, Zaranek AW, et al. (2009) A highly annotated whole-genome sequence of a Korean individual. Nature, 460, 1011-1015.
Lipman D.J., Altschul S.F. and Kececioglu J.D. (1989) A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA 86: 4412-4415.

Maxam, A. M. and Gilbert, W. (1977) A new method for sequencing DNA. Proc. Natl. Acad. Sci. (USA) 74, 560-564.
McClure M.A. (1995) Parameterization studies of hidden markov chain models representing highly divergent protein sequences, in Proceedings of the 28th Annual Hawaii International Conference on System Sciences, ed. by Lawrence Hunter and Bruce Shriver, vol. 5: Biotechnology Computing, IEEE Computer Society Press.
McClure M.A., Vasi, T.K., and Fitch W.M. (1994) Comparative analysis of multiple protein-sequence alignment methods. Mol. Biol. Evol. 11:571-592.
Park H, Kim JI, Ju YS, Gokcumen O, Mills RE, Kim S, Lee S, Suh D, Hong D, Kang HP, Yoo YJ, Shin JY, Kim HJ, Yavartanoo M, Chang YW, Ha JS, Chong W, Hwang GR, Darvishi K, Kim H, Yang SJ, Yang KS, Hurles ME, Scherer SW, Carter NP, Tyler-Smith C, Lee C, Seo JS (2010) Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing. Nat Genet, 42, 400-405.
Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, et al. (2006) Global variation in copy number in the human genome. Nature, 444, 444-454.
Sander C. and Schneider R. (1991) Database of homology derived protein structures and the structural meaning of sequence alignment. Proteins 9:56-68.
Sanger, F. (1981) Determination of the nucleotide sequences in DNA. Science 214, 1205-1210.
Sanger, F. and Coulson, A. R. (1975) A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J. Mol. Biol. 94, 444–448.
Sanger, F. and Thompson, E. O. P. (1963) The amino acid sequence in the glycyl chain of insulin. Biochem. J. 53, 353-374.
Sanger, F. and Tuppy, H. (1961) The amino acid sequence in the phenylalanyl chain of insulin. Biochem. J. 49, 463-490.
Sanger, F.; Nicklen, S. and Coulson, A. R. (1977a) DNA sequencing with chain terminating inhibitors. Proc. Natl. Acad. Sci. (USA) 74, 5463-5467.
Sanger, F.; Air, G. M.; Barrel, B. G.; Brown, N. L.; Coulson, A. R.; Fiddes, J. C.; Hutchison, C. A. (III); Slocombe, P. M. and Smith, M. (1977b) Nucleotide sequence of bacteriophage φX174. Nature 265, 687–695.
Schuler G.D., Altschul S.F. and Lipman D.J. (1991) A workbench for multiple alignment construction and analysis. Proteins 9:180-190.
Smith R.F., and Smith T.F. (1990) Pattern-induced multiple sequence alignments. Proc. Natl. Acad. Sci. USA 87:118-122.
Sutcliffe, G. (1979) Nucleotide sequence of the E. coli plasmid pBR322. Cold Spring Harbor Symp. Quant. Biol. 43, 77–90.
Taylor, W.R. (1990) Hierarchical method to align large numbers of biological sequences. Methods in Enzymology 183:456-474.
Vingron M. and Argos P. (1991) Motif recognition and alignment for many sequences by comparison of dot matrices. J. Mol. Biol. 218:33-43.
Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J, et al. (2008) The diploid genome sequence of an Asian individual. Nature, 456, 60-65.
Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM (2008) The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872-876.
