Kinza Central Dogma


The ‘Central Dogma’ is the process by which the instructions in DNA
are converted into a functional product. It was first proposed in 1958
by Francis Crick, co-discoverer of the structure of DNA.

- The central dogma of molecular biology explains the flow of genetic
  information from DNA to RNA to a functional product, a protein.
- The central dogma suggests that DNA contains the information needed to
  make all of our proteins, and that RNA is a messenger that carries this
  information to the ribosomes.
- The ribosomes serve as factories in the cell where the information is
  ‘translated’ from a code into the functional product.
- The process by which the DNA instructions are converted into the
  functional product is called gene expression.
- Gene expression has two key stages: transcription and translation.
- In transcription, the information in the DNA of every cell is converted into
  small, portable RNA messages.
- During translation, these messages travel from the cell nucleus, where the DNA is,
  to the ribosomes, where they are ‘read’ to make specific proteins.
- The central dogma states that the patterns of information flow that occur most
  frequently in our cells are:
  o From existing DNA to make new DNA (DNA replication)
  o From DNA to make new RNA (transcription)
  o From RNA to make new proteins (translation)

An illustration showing the flow of information between DNA, RNA and protein.
Image credit: Genome Research Limited

- Reverse transcription is the process by which genetic information from RNA is
  assembled into new DNA. It occurs in retroviruses such as HIV.
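
The information flows listed above can be sketched in a few lines of Python. This is only an illustration of the data flow, not a model of the cellular machinery. It assumes the input is the coding (sense) strand of DNA written 5' to 3' and uses the standard genetic code; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the conventional U, C, A, G ordering.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def replicate(dna):
    """DNA -> DNA: return the complementary strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(coding_dna):
    """DNA -> RNA: the mRNA matches the coding strand with U in place of T."""
    return coding_dna.replace("T", "U")


def translate(mrna):
    """RNA -> protein: read codons from position 0 until a stop codon ('*')."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def reverse_transcribe(rna):
    """RNA -> DNA: the sense-strand DNA equivalent of an RNA sequence."""
    return rna.replace("U", "T")


dna = "ATGGCCTAA"
mrna = transcribe(dna)
print(replicate(dna), mrna, translate(mrna), reverse_transcribe(mrna))
```

The protein is written in one-letter amino acid codes, so the short example sequence yields a two-residue peptide before the stop codon is reached.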

Does the ‘Central Dogma’ always apply?

- Modern research is making it clear that some aspects of the central
  dogma are not entirely accurate.
- Current research is focusing on investigating the function of non-coding RNA.
- Although non-coding RNA does not follow the central dogma, it still has a
  functional role in the cell.


This page was last updated on 2016-01-25


The central dogma of molecular biology is an explanation of the flow of genetic information
within a biological system. ... It states that such information cannot be transferred back from
protein to either protein or nucleic acid. The central dogma has also been described as "DNA
makes RNA and RNA makes protein".

Applied Bioinformatics Computing: An Introduction
 By Bryan Bergeron
 Nov 29, 2002


The Central Dogma


The Central Dogma of Molecular Biology was originally defined by the British physicist Francis Crick who, together
with the American biologist James Watson, first described the now famous right-handed double helix of DNA (deoxyribonucleic
acid) in 1953. The Central Dogma is deceptively simple: DNA defines the synthesis of protein by way of an RNA
intermediary. What isn't so simple is documenting, controlling, and modifying this process (illustrated in Figure 2), which is
the focus of much of bioinformatics.

Figure 2 The Central Dogma: DNA is transcribed to RNA, which is translated to protein.

As shown in Figure 2, the replication of DNA and the transcription of DNA to RNA occur in the cell
nucleus, which houses the DNA in the form of tightly-packed chromosomes. The translation of RNA to
protein (the building block of everything from blood to the muscles and organs of the body) occurs in the cytoplasm.

From a computer science perspective, the biology of this process may be less important than the flow of data it represents.

For example, the process is inherently digital with a four-character alphabet (A, T, C, and G)—with each character

representing a nucleotide or base. Following the message from the original DNA in the nucleus to protein in the cytoplasm,

A combines with T, and C combines with G—at least in theory. In practice, however, the A-T and C-G base pairings aren't

always exact. There is an error rate of about one base pair mistake in every million base pairs. Like a single bit error in a

computer program, the effect of the error might be insignificant or horrific, depending on exactly where the error occurs. A

single error may result in debilitating diseases such as sickle cell anemia or cystic fibrosis, for example.
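
To make the bit-error analogy concrete, the sketch below compares two single-base substitutions in one codon. The GAG to GTG change (glutamate to valine) is the well-known substitution behind sickle cell anemia. The tiny codon lookup and the function names are illustrative assumptions and cover only the codons used here.

```python
# Partial codon lookup (DNA coding-strand codons), just enough for this example.
CODONS = {"GAG": "Glu", "GAA": "Glu", "GTG": "Val"}


def substitute(codon, position, base):
    """Return the codon with a single base replaced at the given position."""
    return codon[:position] + base + codon[position + 1:]


def classify(original, mutated):
    """Label a single-codon change as silent or missense."""
    return "silent" if CODONS[original] == CODONS[mutated] else "missense"


normal = "GAG"
silent = substitute(normal, 2, "A")   # GAG -> GAA, still glutamate
sickle = substitute(normal, 1, "T")   # GAG -> GTG, glutamate becomes valine

for mutated in (silent, sickle):
    print(normal, "->", mutated, CODONS[mutated], classify(normal, mutated))
```

Both changes are one-letter errors, but only the second alters the protein, which is the sense in which the effect depends on exactly where the error occurs.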

Understanding these and other errors so that they can be corrected is one focus of bioinformatics. It's important to note that

in attaining this understanding, modeling the process according to Claude Shannon's theory of communications is necessary

but insufficient. Although Shannon's theory aptly specifies the amount of information that can be transferred from DNA in the

nucleus to the protein synthesis machinery in the cytoplasm as a function of the noise level of the cellular matrix, it ignores

the biology of the system. For example, people who carry one copy of the defective sequence definition that results in sickle

cell anemia are relatively resistant to malaria. As a result, people who live in areas of the world in which mosquitoes carry

the parasite that results in malaria benefit from what we consider a disease in the malaria-free U.S. The point is that

although many biological systems can be reduced to relatively simple and mathematically sound models, knowledge of the

relevant biology is needed to fully appreciate the applicability of particular computer science methods.
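
As a small illustration of the information-theoretic view, the sketch below computes the Shannon entropy of a sequence in bits per base. With a four-letter alphabet the maximum is 2 bits per base. The function name and example sequences are assumptions chosen for illustration, and, as the text notes, a number like this says nothing about the biology.

```python
from collections import Counter
from math import log2


def entropy_bits_per_base(sequence):
    """Shannon entropy of the base frequencies: sum of p * log2(1 / p)."""
    counts = Counter(sequence)
    total = len(sequence)
    return sum((n / total) * log2(total / n) for n in counts.values())


# Equal use of all four bases reaches the 2-bit maximum;
# a skewed composition carries less information per base.
print(entropy_bits_per_base("ACGTACGTACGT"))
print(entropy_bits_per_base("AAAAAAAATTTT"))
```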

Databases
Bioinformatics is characterized by an abundance of data stored in very large databases. Local databases with capacities
measured in the tens of terabytes are common. As such, fluency in data warehousing, data dictionaries, database design,
archiving, and knowledge management techniques is mandatory for the design and maintenance of these systems. Most
of the large biology databases are based on traditional relational database architectures, whereas others, especially
systems dealing with images and other multimedia, are based on object-oriented designs.

A sample of the types of databases available online is listed in Table 1. Readers who aren't familiar with database types are

encouraged to read the ancillary materials that accompany many of the systems. The tutorial materials that accompany the

larger systems, such as the biomedical literature database PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi) or the

image-rich Protein Data Bank or PDB (http://www.rcsb.org/pdb/), are particularly informative.

Table 1 Popular public online databases serving the bioinformatics community.

Database Type                 Example
Nucleotide Sequence           GenBank, DDBJ, EMBL, MGDB, GSX, NDB
Protein Sequence              SWISS-PROT, TrEMBL, TrEMBLnew, PIR
3D Structures                 PDB, MMDB, Cambridge Structural Database
Enzymes and Compounds         LIGAND
Sequence Alignment            PROSITE, BLOCKS, PRINTS, Pfam, ProDOM
Pathways and Complexes        Pathway
Molecular Disease             OMIM
Biomedical Literature         PubMed
Vectors                       UniVec
Protein Mutations             PMD
Gene Expressions              GEO
Amino Acid Indices            AAindex
Protein/peptide Literature    LITDB
Gene Catalog                  GENES

In addition to database architecture and design, computer science professionals in charge of the online biology databases,

as well as those who are charged with developing and maintaining local data warehouses, must be conversant in database

management, data lifecycle management, computer interface development, and implementation of large database projects

on time and on budget. In this regard, the database component of bioinformatics differs little from database projects in the

banking, retail, or medical fields.

Networks
Bioinformatics, like virtually every other knowledge-intensive field, is dependent on a robust information technology
infrastructure that includes the Internet, the World Wide Web, intranets, and wireless systems. These and other network
technologies are applied directly to sharing, manipulating, and archiving genetic sequences and other bioinformatics data.
For example, the majority of resources available for researchers in bioinformatics are Web-based systems such as
GenBank, which is maintained by the National Center for Biotechnology Information (NCBI), the National Institutes of Health
(NIH), and other government agencies.

The issues and challenges associated with providing an adequate network infrastructure are related to selecting and
implementing the appropriate communications models, selecting the best transmission technology, identifying the most
effective protocols, dealing with limited bandwidth, selecting the most appropriate network topologies, and contending with
security and privacy. Because of the computational requirements associated with bioinformatics, the field serves as a
test bed for many of the leading-edge networking technologies, such as the Great Global Grid (GGG), which distributes not only
data but supercomputing-level processing power to PCs and workstations as well.


Search Engines
The exponentially increasing volume of data available in digital form over the Internet (from gene sequences to published
references in the biomedical literature to the experimental methods used to determine specific sequences) is accessible
only through advanced search engine technologies. In this regard, the challenges faced by designers of
bioinformatics-specific search engines are virtually identical to those addressed by computer scientists working in other
areas. These include how to best constrain the search space, how to use hashing and other pre-processing methods to increase
performance, and how to combine search engine technologies in a manner that is not only powerful but also usable.
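
One simple way hashing and pre-processing can speed up sequence search is a k-mer index: every substring of length k is hashed once, and queries are then seeded from the index rather than scanned against the whole sequence. The sketch below is a minimal illustration of that idea and does not describe how any particular search engine works; the function names, the value of k, and the example sequence are assumptions.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Pre-processing step: map each k-letter substring to its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def find(query, sequence, index, k):
    """Return start positions of query, checking only positions seeded by the index."""
    if len(query) < k:
        raise ValueError("query must be at least k letters long")
    seeds = index.get(query[:k], [])
    return [i for i in seeds if sequence.startswith(query, i)]


genome = "ACGTACGTTAGC"
idx = build_kmer_index(genome, 4)
print(find("ACGTT", genome, idx, 4))
```

The index constrains the search space to the handful of positions that share the query's first k letters, at the cost of building and storing the index up front.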

Visualization
Most molecular biologists agree that protein function is related to form. Unfortunately, experimental methods of determining
protein structure, such as X-ray crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy, are typically
painstakingly slow and expensive, hence the interest in predicting protein structure through computational methods.
Exploring the possible configurations of folded proteins has proven to be virtually impossible by simply studying linear
sequences of bases. However, sophisticated 3D visualization techniques allow researchers to use their visual and spatial
reasoning abilities to understand the probable function of proteins. For example, Figure 3 shows a form of
human insulin. The structure is derived from data in the Protein Data Bank (PDB) and is rendered with freely available
software that can be run within a Web browser or downloaded to take advantage of local processing power. By using tools
that allow the protein to be rotated in virtual free space, scientists can experiment with the interaction of protein molecules
and identify potential interactions in lieu of using arduous experimental wet-lab methods.

Figure 3 Rendering of human insulin. From the PDB (Protein Data Bank) Structure Explorer,

based on MolScript and Raster3D. The two superimposed spheres in the center of the figure

represent zinc ions.

Statistics
The randomness inherent in any sampling process, including measuring the reactions of thousands of genes simultaneously

with microarray techniques or assessing the similarity between genetic sequences, necessarily involves probability and

statistical methods. In many cases, the statistical techniques applied to bioinformatics problems are integrated within other

applications and support activities such as the statistical analysis of structural features, gene prediction, and quantifying

uncertainty in sequencing results.

Data Mining
In the early 1980s, gene sequencing worldwide resulted in about four base pairs per day. Today, scientists worldwide are

contributing about 1,000 base pairs per second to the online sequence databases. Given this ever-increasing store of

sequence and protein data from several ongoing genome projects, data mining the sequences has become a field of
research in its own right. Thanks to the ongoing development of data mining applications, many scientists are able to

conduct significant basic research from their Web-connected PC, without ever entering a wet lab or seeing a sequencing

machine.

In addition to mining the sequence databases, many researchers are developing powerful text-mining applications that are
capable of extracting data from online biomedical literature databases such as PubMed. Many areas are still out of reach for
these and other traditional text-mining methods. For example, although algorithms are available to summarize a multi-page
document into a single paragraph that can be quickly reviewed, the content contained in images and tables in the document
is lost in the summarization process.

Pattern Matching
Classical pattern matching through standard AI techniques (such as reasoning under uncertainty, machine learning, image
and pattern recognition, neural networks, and rule-based expert systems) has direct applicability to practical
bioinformatics research and development. For example, real-time microarray analysis lends itself to machine learning, in that
it is humanly impossible to follow tens of thousands of parallel reactions unaided. Similarly, several gene prediction
applications in bioinformatics are based on neural network pattern-matching engines.

One of the most often-used pattern-matching approaches in bioinformatics is dynamic programming, which is essentially
recursive programming with a memory of intermediate results. Dynamic programming is used to align sequences that don't
exactly match, but that are close enough to suggest that the two molecules considered in the alignment are similar in
form (and, by extension, perhaps function). In other words, two molecules of the same general shape and configuration may
be related evolutionarily (homology).

However, even if they aren't related, they likely behave similarly in the body because they share a structure. For example,
the structure of the hemoglobin molecule found in monkey blood is virtually identical to the hemoglobin molecule found in
human blood, and both perform essentially the same function in each system.
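
The description of dynamic programming as recursion with a memory of intermediate results can be sketched directly with a memoized recursive function. The example below computes a global alignment score in the style of the Needleman-Wunsch algorithm. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration, not values taken from the article.

```python
from functools import lru_cache


def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b."""

    @lru_cache(maxsize=None)  # the "memory of intermediate results"
    def best(i, j):
        # Score of the best alignment of the prefixes a[:i] and b[:j].
        if i == 0:
            return j * gap
        if j == 0:
            return i * gap
        pair = match if a[i - 1] == b[j - 1] else mismatch
        return max(
            best(i - 1, j - 1) + pair,  # align a[i-1] with b[j-1]
            best(i - 1, j) + gap,       # gap in b
            best(i, j - 1) + gap,       # gap in a
        )

    return best(len(a), len(b))


print(alignment_score("GATTACA", "GATACA"))
```

For long sequences the same recurrence is normally filled in as a table with loops, since deep recursion would hit Python's recursion limit; the recursive form is used here only because it mirrors the description above.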

Modeling and Simulation


Modeling and simulation are essential for understanding any complex system, and exploring the inner workings of a cell at
the molecular level is no exception. A variety of simulation techniques is used in bioinformatics to model potential
drug-protein interactions, probable protein folding configurations, and potential biological pathways. Modeling and
simulation techniques are most useful when they are linked with visualization techniques.

Collaboration
Bioinformatics is characterized by a high degree of cooperation between the researchers who contribute their part to the

whole knowledge base of genomics and proteomics. This level of collaboration is made possible by technologies that

facilitate multimedia communications, such as real-time videoconferencing, groupware, Web portals for submission of

sequence data, and the Internet, of course.


Protein sequencing
From Wikipedia, the free encyclopedia


Using a Beckman-Spinco Protein-Peptide Sequencer, 1970

Protein sequencing is the practical process of determining the amino acid sequence of
all or part of a protein or peptide. This may serve to identify the protein or characterize
its post-translational modifications. Typically, partial sequencing of a protein provides
sufficient information (one or more sequence tags) to identify it with reference to
databases of protein sequences derived from the conceptual translation of genes.

The two major direct methods of protein sequencing are mass spectrometry and Edman
degradation using a protein sequenator (sequencer). Mass spectrometry methods are
now the most widely used for protein sequencing and identification but Edman
degradation remains a valuable tool for characterizing a protein's N-terminus.

Contents

1 Determining amino acid composition
  1.1 Hydrolysis
  1.2 Separation and quantitation
2 N-terminal amino acid analysis
3 C-terminal amino acid analysis
4 Edman degradation
  4.1 Digestion into peptide fragments
  4.2 Reaction
  4.3 Protein sequencer
5 Identification by mass spectrometry
  5.1 Proteolytic digests
  5.2 De novo sequencing
  5.3 N- and C-termini
  5.4 Post-translational modifications
  5.5 Whole-mass determination
  5.6 Limitations
6 Predicting from DNA/RNA sequences
7 Bioinformatics tools
8 See also
9 References
10 Further reading

Determining amino acid composition

It is often desirable to know the unordered amino acid composition of a protein prior to
attempting to find the ordered sequence, as this knowledge can be used to facilitate the
discovery of errors in the sequencing process or to distinguish between ambiguous
results. Knowledge of the frequency of certain amino acids may also be used to choose
which protease to use for digestion of the protein. The misincorporation of low levels of
non-standard amino acids (e.g. norleucine) into proteins may also be determined.[1] A
generalized method, often referred to as amino acid analysis,[2] for determining amino acid
frequency is as follows:

1. Hydrolyse a known quantity of protein into its constituent amino acids.
2. Separate and quantify the amino acids in some way.

Hydrolysis
Hydrolysis is done by heating a sample of the protein in 6 M hydrochloric acid to 100–
110 °C for 24 hours or longer. Proteins with many bulky hydrophobic groups may
require longer heating periods. However, these conditions are so vigorous that some
amino acids (serine, threonine, tyrosine, tryptophan, glutamine, and cysteine) are
degraded. To circumvent this problem, Biochemistry Online suggests heating separate
samples for different times, analysing each resulting solution, and extrapolating back to
zero hydrolysis time. Rastall suggests a variety of reagents to prevent or reduce
degradation, such as thiol reagents or phenol to protect tryptophan and tyrosine from
attack by chlorine, and pre-oxidising cysteine. He also suggests measuring the quantity
of ammonia evolved to determine the extent of amide hydrolysis.
Separation and quantitation
The amino acids can be separated by ion-exchange chromatography then derivatized to
facilitate their detection. More commonly, the amino acids are derivatized then resolved
by reversed phase HPLC.
An example of the ion-exchange chromatography is given by the NTRC using
sulfonated polystyrene as a matrix, adding the amino acids in acid solution and passing
a buffer of steadily increasing pH through the column. Amino acids are eluted when the
pH reaches their respective isoelectric points. Once the amino acids have been
separated, their respective quantities are determined by adding a reagent that will form
a coloured derivative. If the amounts of amino acids are in excess of 10
nmol, ninhydrin can be used for this; it gives a yellow colour when reacted with proline,
and a vivid purple with other amino acids. The concentration of amino acid is
proportional to the absorbance of the resulting solution. With very small quantities, down
to 10 pmol, fluorescent derivatives can be formed using reagents such as
ortho-phthalaldehyde (OPA) or fluorescamine.
Pre-column derivatization may use the Edman reagent to produce a derivative that is
detected by UV light. Greater sensitivity is achieved using a reagent that generates a
fluorescent derivative. The derivatized amino acids are subjected to reversed phase
chromatography, typically using a C8 or C18 silica column and an
optimised elution gradient. The eluting amino acids are detected using a UV or
fluorescence detector and the peak areas compared with those for derivatised
standards in order to quantify each amino acid in the sample.

N-terminal amino acid analysis

Sanger's method of peptide end-group analysis: (A) derivatization of the N-terminal end with Sanger's reagent (DNFB);
(B) total acid hydrolysis of the dinitrophenyl peptide.

Determining which amino acid forms the N-terminus of a peptide chain is useful for two
reasons: to aid the ordering of individual peptide fragments' sequences into a whole
chain, and because the first round of Edman degradation is often contaminated by
impurities and therefore does not give an accurate determination of the N-terminal
amino acid. A generalised method for N-terminal amino acid analysis follows:

1. React the peptide with a reagent that will selectively label the terminal amino acid.
2. Hydrolyse the protein.
3. Determine the amino acid by chromatography and comparison with standards.
There are many different reagents which can be used to label terminal amino acids.
They all react with amine groups and will therefore also bind to amine groups in the side
chains of amino acids such as lysine. For this reason it is necessary to be careful in
interpreting chromatograms to ensure that the right spot is chosen. Two of the more
common reagents are Sanger's reagent (1-fluoro-2,4-dinitrobenzene) and dansyl
derivatives such as dansyl chloride. Phenylisothiocyanate, the reagent for the Edman
degradation, can also be used. The same questions apply here as in the determination
of amino acid composition, with the exception that no stain is needed, as the reagents
produce coloured derivatives and only qualitative analysis is required. The amino
acid therefore does not have to be eluted from the chromatography column, just compared with a
standard. Another consideration is that, since any amine groups will
have reacted with the labelling reagent, ion exchange chromatography cannot be used,
and thin layer chromatography or high-pressure liquid chromatography should be used
instead.

C-terminal amino acid analysis


The number of methods available for C-terminal amino acid analysis is much smaller
than the number of available methods of N-terminal analysis. The most common
method is to add carboxypeptidases to a solution of the protein, take samples at regular
intervals, and determine the terminal amino acid by analysing a plot of amino acid
concentrations against time. This method is particularly useful for polypeptides
and proteins with blocked N-termini. C-terminal sequencing would greatly help in verifying the
primary structures of proteins predicted from DNA sequences and in detecting any
post-translational processing of gene products from known codon sequences.

Edman degradation

The Edman degradation is a very important reaction for protein sequencing, because it
allows the ordered amino acid composition of a protein to be discovered. Automated
Edman sequencers are now in widespread use, and are able to sequence peptides up
to approximately 50 amino acids long. A reaction scheme for sequencing a protein by
the Edman degradation follows; some of the steps are elaborated on subsequently.

1. Break any disulfide bridges in the protein with a reducing agent like 2-mercaptoethanol.
   A protecting group such as iodoacetic acid may be necessary to prevent the bonds from
   re-forming.
2. Separate and purify the individual chains of the protein complex, if there is more than
   one.
3. Determine the amino acid composition of each chain.
4. Determine the terminal amino acids of each chain.
5. Break each chain into fragments under 50 amino acids long.
6. Separate and purify the fragments.
7. Determine the sequence of each fragment.
8. Repeat with a different pattern of cleavage.
9. Construct the sequence of the overall protein.
Digestion into peptide fragments
Peptides longer than about 50-70 amino acids cannot be sequenced reliably by the
Edman degradation. Because of this, long protein chains need to be broken up into
small fragments that can then be sequenced individually. Digestion is done either
by endopeptidases such as trypsin or pepsin or by chemical reagents such as cyanogen
bromide. Different enzymes give different cleavage patterns, and the overlap between
fragments can be used to construct an overall sequence.
Reaction
The peptide to be sequenced is adsorbed onto a solid surface. One
common substrate is glass fibre coated with polybrene, a cationic polymer. The Edman
reagent, phenylisothiocyanate (PITC), is added to the adsorbed peptide, together with a
mildly basic buffer solution of 12% trimethylamine. This reacts with the amine group of
the N-terminal amino acid.
The terminal amino acid can then be selectively detached by the addition
of anhydrous acid. The derivative then isomerises to give a
substituted phenylthiohydantoin, which can be washed off and identified by
chromatography, and the cycle can be repeated. The efficiency of each step is about
98%, which allows about 50 amino acids to be reliably determined.
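
The link between per-cycle efficiency and usable read length is simple compounding. If each cycle succeeds with the same efficiency, the fraction of peptide chains still in step after n cycles is the efficiency raised to the power n. The sketch below assumes a constant efficiency for every cycle, which is a simplification, and the function name is hypothetical.

```python
def cumulative_yield(step_efficiency, cycles):
    """Fraction of chains still in phase after the given number of cycles."""
    return step_efficiency ** cycles


# At 98% per cycle, the in-phase fraction falls steadily as cycles accumulate.
for n in (10, 30, 50, 70):
    print(n, round(cumulative_yield(0.98, n), 3))
```

At 98% per cycle roughly a third of the chains remain in phase after 50 cycles, so the signal from the correct residue is increasingly hard to distinguish from background as the run continues.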

A Beckman-Coulter Porton LF3000G protein sequencing machine

Protein sequencer
A protein sequenator[3] is a machine that performs Edman degradation in an
automated manner. A sample of the protein or peptide is immobilized in the reaction
vessel of the protein sequenator and the Edman degradation is performed. Each cycle
releases and derivatises one amino acid from the protein or peptide's N-terminus and
the released amino-acid derivative is then identified by HPLC. The sequencing process
is done repetitively for the whole polypeptide until the entire measurable sequence is
established or for a pre-determined number of cycles.

Identification by mass spectrometry

Protein identification is the process of assigning a name to a protein of interest (POI),
based on its amino-acid sequence. Typically, only part of the protein’s sequence needs
to be determined experimentally in order to identify the protein with reference to
databases of protein sequences deduced from the DNA sequences of their genes.
Further protein characterization may include confirmation of the actual N- and C-termini
of the POI, determination of sequence variants and identification of any post-
translational modifications present.
Proteolytic digests
A general scheme for protein identification is described.[4][5]

1. The POI is isolated, typically by SDS-PAGE or chromatography.
2. The isolated POI may be chemically modified to stabilise cysteine residues (e.g.
   S-amidomethylation or S-carboxymethylation).
3. The POI is digested with a specific protease to generate peptides. Trypsin, which
   cleaves selectively on the C-terminal side of lysine or arginine residues, is the most
   commonly used protease. Its advantages include i) the frequency of Lys and Arg
   residues in proteins, ii) the high specificity of the enzyme, iii) the stability of the enzyme
   and iv) the suitability of tryptic peptides for mass spectrometry.
4. The peptides may be desalted to remove ionizable contaminants and subjected
   to MALDI-TOF mass spectrometry. Direct measurement of the masses of the peptides
   may provide sufficient information to identify the protein (see Peptide mass
   fingerprinting) but further fragmentation of the peptides inside the mass spectrometer is
   often used to gain information about the peptides’ sequences. Alternatively, peptides
   may be desalted and separated by reversed phase HPLC and introduced into a mass
   spectrometer via an ESI source. LC-ESI-MS may provide more information than
   MALDI-MS for protein identification but uses more instrument time.
5. Depending on the type of mass spectrometer, fragmentation of peptide ions may occur
   via a variety of mechanisms such as collision-induced dissociation (CID) or post-source
   decay (PSD). In each case, the pattern of fragment ions of a peptide provides
   information about its sequence.
6. Information including the measured mass of the putative peptide ions and those of their
   fragment ions is then matched against calculated mass values from the conceptual
   (in silico) proteolysis and fragmentation of databases of protein sequences. A successful
   match will be found if its score exceeds a threshold based on the analysis parameters.
   Even if the actual protein is not represented in the database, error-tolerant matching
   allows for the putative identification of a protein based on similarity
   to homologous proteins. A variety of software packages are available to perform this
   analysis.
7. Software packages usually generate a report showing the identity (accession code) of
   each identified protein and its matching score, and provide a measure of the relative
   strength of the matching where multiple proteins are identified.
8. A diagram of the matched peptides on the sequence of the identified protein is often
   used to show the sequence coverage (% of the protein detected as peptides). Where
   the POI is thought to be significantly smaller than the matched protein, the diagram may
   suggest whether the POI is an N- or C-terminal fragment of the identified protein.
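
Two steps of the scheme above lend themselves to a short sketch: the conceptual (in silico) tryptic digest of step 3 and the sequence coverage of step 8. The code below applies only the rule stated in the text, cleavage on the C-terminal side of lysine (K) or arginine (R), and ignores refinements that real search software may apply. The function names and the example sequence are hypothetical.

```python
def tryptic_digest(protein):
    """Split a one-letter protein sequence after every K or R."""
    peptides, current = [], ""
    for residue in protein:
        current += residue
        if residue in "KR":
            peptides.append(current)
            current = ""
    if current:  # trailing peptide without a C-terminal K or R
        peptides.append(current)
    return peptides


def sequence_coverage(protein, matched_peptides):
    """Percentage of residues covered by at least one matched peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in matched_peptides:
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return 100.0 * sum(covered) / len(protein)


protein = "MKWVTFISLLRGA"
peptides = tryptic_digest(protein)
print(peptides)
print(round(sequence_coverage(protein, ["WVTFISLLR"]), 1))
```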
De novo sequencing
The pattern of fragmentation of a peptide allows for direct determination of its sequence
by de novo sequencing. This sequence may be used to match databases of protein
sequences or to investigate post-translational or chemical modifications. It may provide
additional evidence for protein identifications performed as above.
N- and C-termini
The peptides matched during protein identification do not necessarily include the N- or
C-termini predicted for the matched protein. This may result from the N- or C-terminal
peptides being difficult to identify by MS (e.g. being either too short or too long), being
post-translationally modified (e.g. N-terminal acetylation) or genuinely differing from the
prediction. Post-translational modifications or truncated termini may be identified by
closer examination of the data (i.e. de novo sequencing). A repeat digest using a
protease of different specificity may also be useful.
Post-translational modifications
Whilst detailed comparison of the MS data with predictions based on the known protein
sequence may be used to define post-translational modifications, targeted approaches
to data acquisition may also be used. For instance, specific enrichment of
phosphopeptides may assist in identifying phosphorylation sites in a protein. Alternative
methods of peptide fragmentation in the mass spectrometer, such as ETD or ECD, may
give complementary sequence information.
Whole-mass determination
The protein’s whole mass is the sum of the masses of its amino-acid residues plus the
mass of a water molecule and adjusted for any post-translational modifications.
Although proteins ionize less well than the peptides derived from them, a protein in
solution may be able to be subjected to ESI-MS and its mass measured to an accuracy
of 1 part in 20,000 or better. This is often sufficient to confirm the termini (thus that the
protein’s measured mass matches that predicted from its sequence) and infer the
presence or absence of many post-translational modifications.
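
The whole-mass rule can be written out as a small calculation: sum the residue masses, add one water, then apply any modification shift. The sketch below uses monoisotopic residue masses for illustration (average masses are often more appropriate for intact proteins). The table values, the acetylation shift, and the function names are assumptions supplied for this example and are not taken from the article.

```python
# Monoisotopic residue masses in daltons (assumed values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
ACETYLATION = 42.01057  # assumed mass shift for N-terminal acetylation


def whole_mass(sequence, modification_shift=0.0):
    """Sum of residue masses plus one water, adjusted for modifications."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + modification_shift


def tolerance(mass, parts=20000):
    """Absolute mass error corresponding to an accuracy of 1 part in `parts`."""
    return mass / parts


peptide = "MKWVTFISLLR"
print(round(whole_mass(peptide), 4))
print(round(whole_mass(peptide, ACETYLATION), 4))
print(whole_mass("L") == whole_mass("I"))
```

The last line shows why mass alone cannot tell leucine from isoleucine: the two residues are isomers and have identical masses.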
Limitations
Proteolysis does not always yield a set of readily analyzable peptides covering the
entire sequence of POI. The fragmentation of peptides in the mass spectrometer often
does not yield ions corresponding to cleavage at each peptide bond. Thus, the deduced
sequence for each peptide is not necessarily complete. The standard methods of
fragmentation do not distinguish between leucine and isoleucine residues since they are
isomeric.
Because the Edman degradation proceeds from the N-terminus of the protein, it will not
work if the N-terminus has been chemically modified (e.g. by acetylation or formation of
pyroglutamic acid). Edman degradation is generally not useful for determining the
positions of disulfide bridges. It also requires peptide amounts of 1 picomole or above
for discernible results, making it less sensitive than mass spectrometry.
Predicting from DNA/RNA sequences
In biology, proteins are produced by translation of messenger RNA (mRNA) with the
protein sequence deriving from the sequence of codons in the mRNA. The mRNA is
itself formed by the transcription of genes and may be further modified. These
processes are sufficiently understood to use computer algorithms to automate
predictions of protein sequences from DNA sequences, such as from whole-genome
DNA-sequencing projects, and have led to the generation of large databases of protein
sequences such as UniProt. Predicted protein sequences are an important resource for
protein identification by mass spectrometry.
Historically, short protein sequences (10 to 15 residues) determined by Edman
degradation were back-translated into DNA sequences that could be used as probes or
primers to isolate molecular clones of the corresponding gene or complementary DNA.
The sequence of the cloned DNA was then determined and used to deduce the full
amino-acid sequence of the protein.
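Back-translation is ambiguous because most amino acids are encoded by more than one codon, so a peptide corresponds to a pool of possible DNA sequences. The sketch below counts how many DNA sequences could encode a short peptide under the standard genetic code; the function name `probe_degeneracy` is hypothetical, and real probe design involves further considerations not modelled here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: amino acid -> list of codons that encode it.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def probe_degeneracy(peptide):
    """Number of distinct DNA sequences that encode the given peptide."""
    count = 1
    for aa in peptide:
        count *= len(CODONS_FOR[aa])
    return count


if __name__ == "__main__":
    for peptide in ("MW", "MFW", "LSR"):
        print(peptide, probe_degeneracy(peptide))
```

Peptides rich in methionine and tryptophan (one codon each) give the least degenerate probes, while leucine, serine and arginine (six codons each) increase the pool quickly.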

Nucleic acid sequence


From Wikipedia, the free encyclopedia

Interactive image of nucleic acid structure (primary, secondary, tertiary, and quaternary) using DNA helices and examples
from the VS ribozyme and telomerase and nucleosome. (PDB: ADNA, 1BNA, 4OCB, 4R4V, 1YMO, 1EQZ)

A nucleic acid sequence is a succession of bases, written as a series of letters drawn from a
set of five, that indicates the order of the nucleotides forming alleles within a DNA (using
G, A, C and T) or RNA (using G, A, C and U) molecule. By convention, sequences are usually presented
from the 5' end to the 3' end. For DNA, the sense strand is used. Because nucleic acids
are normally linear (unbranched) polymers, specifying the sequence is equivalent to
defining the covalent structure of the entire molecule. For this reason, the nucleic acid
sequence is also termed the primary structure.
The sequence has capacity to represent information. Biological deoxyribonucleic acid
represents the information which directs the functions of a living thing.
Nucleic acids also have a secondary structure and tertiary structure. Primary structure is
sometimes mistakenly referred to as primary sequence. Conversely, there is no parallel
concept of secondary or tertiary sequence.

Contents
 1 Nucleotides
o 1.1 Notation
 2 Biological significance
 3 Sequence determination
o 3.1 Digital representation
 4 Sequence analysis
o 4.1 Genetic testing
o 4.2 Sequence alignment
o 4.3 Sequence motifs
o 4.4 Long range correlations
o 4.5 Sequence entropy
 5 See also
 6 References
 7 External links

Nucleotides

Chemical structure of RNA


A series of codons in part of an mRNA molecule. Each codon consists of three nucleotides, usually representing a
single amino acid.

Main article: Nucleotide

Nucleic acids consist of a chain of linked units called nucleotides. Each nucleotide
consists of three subunits: a phosphate group and a sugar (ribose in the case
of RNA, deoxyribose in DNA) make up the backbone of the nucleic acid strand, and
attached to the sugar is one of a set of nucleobases. The nucleobases are important
in base pairing of strands to form higher-level secondary and tertiary structure such as
the famed double helix.
The possible letters are A, C, G, and T, representing the four nucleotide bases of a DNA
strand – adenine, cytosine, guanine, thymine – covalently linked to
a phosphodiester backbone. In the typical case, the sequences are printed abutting one
another without gaps, as in the sequence AAAGTCTGAC, read left to right in the 5' to
3' direction. With regard to transcription, a sequence is on the coding strand if it has
the same order as the transcribed RNA.
One sequence can be complementary to another sequence, meaning that the base at each
position is the complement of the corresponding base (i.e. A to T, C to G) and the two are
in reverse order. For example, the complementary sequence to TTAC is GTAA. If one strand of
the double-stranded DNA is considered the sense strand, then the other strand,
considered the antisense strand, will have the complementary sequence to the sense
strand.
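The complementarity rule above (complement each base, then reverse the order) can be illustrated with a short Python sketch. The function name `reverse_complement` is an illustrative choice, and only the four unambiguous DNA letters are handled.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the complementary sequence, read 5' to 3'."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("TTAC"))        # the example from the text
    print(reverse_complement("AAAGTCTGAC"))
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.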
Notation
Main article: Nucleic acid notation

Comparing and determining % difference between two nucleotide sequences.

 AATCCGCTAG
 AAACCCTTAG
 Given the two 10-nucleotide sequences, line them up and compare the differences
between them. Calculate the percent difference by dividing the number of differing DNA bases
by the total number of nucleotides. In the above case, there are three differences in
the 10-nucleotide sequences (at positions 3, 6 and 7), so the difference is 3/10 = 30%.
Equivalently, 7 of the 10 positions match, giving 70% similarity.
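The comparison above can be reproduced with a small Python sketch. It assumes the two sequences are already aligned and of equal length; the function name `percent_difference` is hypothetical.

```python
def percent_difference(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return 100.0 * mismatches / len(seq_a)


if __name__ == "__main__":
    difference = percent_difference("AATCCGCTAG", "AAACCCTTAG")
    print(f"difference: {difference}%, similarity: {100.0 - difference}%")
```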
While A, T, C, and G represent a particular nucleotide at a position, there are also
letters that represent ambiguity which are used when more than one kind of nucleotide
could occur at that position. The rules of the International Union of Pure and Applied
Chemistry (IUPAC) are as follows: [1]

Symbol[2]  Description                      Bases represented  Count  Complement
A          Adenine                          A                  1      T
C          Cytosine                         C                  1      G
G          Guanine                          G                  1      C
T          Thymine                          T                  1      A
U          Uracil                           U                  1      A
W          Weak                             A T                2      W
S          Strong                           C G                2      S
M          aMino                            A C                2      K
K          Keto                             G T                2      M
R          puRine                           A G                2      Y
Y          pYrimidine                       C T                2      R
B          not A (B comes after A)          C G T              3      V
D          not C (D comes after C)          A G T              3      H
H          not G (H comes after G)          A C T              3      D
V          not T (V comes after T and U)    A C G              3      B
N          any Nucleotide (not a gap)       A C G T            4      N
Z          Zero                             (none)             0      Z

These symbols are also valid for RNA, except with U (uracil) replacing T (thymine). [1]
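One practical use of the ambiguity codes is searching a sequence for a pattern that allows several bases at a position. The sketch below, which covers DNA letters only, expands each IUPAC symbol into a regular-expression character class; the names `iupac_to_regex` and `find_iupac` are illustrative, not part of any standard library.

```python
import re

# Bases represented by each IUPAC symbol (DNA alphabet, gaps not handled).
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Turn an IUPAC pattern such as 'GAWTC' into a regex such as 'GA[AT]TC'."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC_BASES[symbol]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_iupac(pattern, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead lets the regex engine report overlapping occurrences.
    regex = re.compile("(?=" + iupac_to_regex(pattern) + ")")
    return [match.start() for match in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    print(iupac_to_regex("GAWTC"))
    print(find_iupac("GAWTC", "AAGAATCGGATTCC"))
```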

Apart from adenine (A), cytosine (C), guanine (G), thymine (T) and uracil (U), DNA and
RNA also contain bases that have been modified after the nucleic acid chain has been
formed. In DNA, the most common modified base is 5-methylcytidine (m5C). In RNA,
there are many modified bases, including pseudouridine (Ψ), dihydrouridine (D), inosine
(I), ribothymidine (rT) and 7-methylguanosine (m7G). [3][4] Hypoxanthine and xanthine are
two of the many bases created in the presence of mutagens, both of them through
deamination (replacement of the amine group with a carbonyl group). Hypoxanthine is
produced from adenine, and xanthine is produced from guanine. [5] Similarly, deamination
of cytosine results in uracil.

Biological significance

A depiction of the genetic code, by which the information contained in nucleic acids is translated into amino
acid sequences in proteins.

Further information: Genetic code and Central dogma of molecular biology

In biological systems, nucleic acids contain information which is used by a living cell to
construct specific proteins. The sequence of nucleobases on a nucleic acid strand
is translated by cell machinery into a sequence of amino acids making up a protein
strand. Each group of three bases, called a codon, corresponds to a single amino acid,
and there is a specific genetic code by which each possible combination of three bases
corresponds to a specific amino acid.
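The codon-by-codon reading described above can be sketched in Python using the standard genetic code. This is a simplified illustration: it reads a single frame from the first base, stops at the first stop codon, and ignores alternative genetic codes; the function name `translate` is an illustrative choice.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate a coding DNA or mRNA sequence into one-letter amino acids."""
    dna = sequence.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    # Step through complete codons only; trailing bases are ignored.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
    print(translate("AUGUUUUGA"))
```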
The central dogma of molecular biology outlines the mechanism by which proteins are
constructed using information contained in nucleic acids. DNA is transcribed into mRNA
molecules, which travel to the ribosome, where
the mRNA is used as a template for the construction of the protein strand. Since nucleic
acids can bind to molecules with complementary sequences, there is a distinction
between "sense" sequences which code for proteins, and the complementary
"antisense" sequence which is by itself nonfunctional, but can bind to the sense strand.

Sequence determination

Electropherogram printout from automated sequencer for determining part of a DNA sequence

Main article: DNA sequencing

DNA sequencing is the process of determining the nucleotide sequence of a
given DNA fragment. The sequence of the DNA of a living thing encodes the necessary
information for that living thing to survive and reproduce. Therefore, determining the
sequence is useful in fundamental research into why and how organisms live, as well as
in applied subjects. Because of the importance of DNA to living things, knowledge of a
DNA sequence may be useful in practically any biological research. For example,
in medicine it can be used to identify, diagnose and potentially
develop treatments for genetic diseases. Similarly, research into pathogens may lead to
treatments for contagious diseases. Biotechnology is a burgeoning discipline, with the
potential for many useful products and services.
RNA is usually not sequenced directly. Instead, it is copied into complementary DNA by
reverse transcriptase, and this DNA is then sequenced.
Current sequencing methods rely on the discriminatory ability of DNA polymerases, and
therefore can only distinguish four bases. An inosine (created from adenosine
during RNA editing) is read as a G, and 5-methyl-cytosine (created from cytosine
by DNA methylation) is read as a C. With current technology, it is difficult to sequence
small amounts of DNA, as the signal is too weak to measure. This is overcome
by polymerase chain reaction (PCR) amplification.
Digital representation

Genetic sequence in digital format.

Once a nucleic acid sequence has been obtained from an organism, it is stored in
silico in digital format. Digital genetic sequences may be stored in sequence databases,
be analyzed (see Sequence analysis below), be digitally altered and be used as
templates for creating new actual DNA using artificial gene synthesis.
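As one illustration of a digital format, the four DNA bases can each be stored in two bits, so four bases fit in one byte. The sketch below is only an example of the idea, not a description of how any particular database stores sequences; the names `pack` and `unpack` and the bit assignment are assumptions made for this example.

```python
# Two-bit code per base (an arbitrary assignment chosen for this example).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(sequence):
    """Pack a DNA string into bytes, four bases per byte; also return its length."""
    packed = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a final partial chunk
        packed.append(byte)
    return bytes(packed), len(sequence)


def unpack(data, length):
    """Recover the DNA string from packed bytes and the original length."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])  # drop padding from the last byte


if __name__ == "__main__":
    data, length = pack("AAAGTCTGAC")
    print(len(data), "bytes for", length, "bases")
    print(unpack(data, length))
```

The original length has to be stored alongside the bytes, because the padding in the last byte is otherwise indistinguishable from a run of A bases.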

Sequence analysis
Main article: Sequence analysis

Digital genetic sequences may be analyzed using the tools of bioinformatics to attempt
to determine their function.
Genetic testing
Main article: Genetic testing

The DNA in an organism's genome can be analyzed to diagnose vulnerabilities to
inherited diseases, and can also be used to determine a child's paternity (genetic father)
or a person's ancestry. Normally, every person carries two variations of every gene, one
inherited from their mother, the other inherited from their father. The human genome is
believed to contain around 20,000–25,000 genes. In addition to
studying chromosomes to the level of individual genes, genetic testing in a broader
sense includes biochemical tests for the possible presence of genetic diseases, or
mutant forms of genes associated with increased risk of developing genetic disorders.
Genetic testing identifies changes in chromosomes, genes, or proteins. [6] Usually, testing
is used to find changes that are associated with inherited disorders. The results of a
genetic test can confirm or rule out a suspected genetic condition or help determine a
person's chance of developing or passing on a genetic disorder. Several hundred
genetic tests are currently in use, and more are being developed. [7][8]

Sequence alignment
Main article: Sequence alignment

In bioinformatics, a sequence alignment is a way of arranging the sequences
of DNA, RNA, or protein to identify regions of similarity that may be due to
functional, structural, or evolutionary relationships between the sequences. [9] If two
sequences in an alignment share a common ancestor, mismatches can be interpreted
as point mutations and gaps as insertion or deletion mutations (indels) introduced in one
or both lineages in the time since they diverged from one another. In sequence
alignments of proteins, the degree of similarity between amino acids occupying a
particular position in the sequence can be interpreted as a rough measure of
how conserved a particular region or sequence motif is among lineages. The absence
of substitutions, or the presence of only very conservative substitutions (that is, the
substitution of amino acids whose side chains have similar biochemical properties) in a
particular region of the sequence, suggests [10] that this region has structural or functional
importance. Although DNA and RNA nucleotide bases are more similar to each other
than are amino acids, the conservation of base pairs can indicate a similar functional or
structural role. [11]
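To make the idea of mismatches and gaps concrete, the sketch below performs a simple global pairwise alignment by dynamic programming (the Needleman-Wunsch approach). The text does not prescribe a particular algorithm, so this is one common choice shown for illustration; the scoring values (match +1, mismatch -1, gap -1) and the function name `global_align` are assumptions.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])   # gap in b (a deletion relative to a)
            bottom.append("-")
            i -= 1
        else:
            top.append("-")        # gap in a (an insertion relative to a)
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    total, row_a, row_b = global_align("ACGT", "AGT")
    print(total)
    print(row_a)
    print(row_b)
```

In the printed alignment, a column with two different letters corresponds to a candidate point mutation and a column containing "-" to a candidate indel.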

Computational phylogenetics makes extensive use of sequence alignments in the
construction and interpretation of phylogenetic trees, which are used to classify the
evolutionary relationships between homologous genes represented in the genomes of
divergent species. The degree to which sequences in a query set differ is qualitatively
related to the sequences' evolutionary distance from one another. Roughly speaking,
high sequence identity suggests that the sequences in question have a comparatively
young most recent common ancestor, while low identity suggests that the divergence is
more ancient. This approximation, which reflects the "molecular clock" hypothesis that a
roughly constant rate of evolutionary change can be used to extrapolate the elapsed
time since two genes first diverged (that is, the coalescence time), assumes that the
effects of mutation and selection are constant across sequence lineages. Therefore, it
does not account for possible differences among organisms or species in the rates
of DNA repair or the possible functional conservation of specific regions in a sequence.
(In the case of nucleotide sequences, the molecular clock hypothesis in its most basic
form also discounts the difference in acceptance rates between silent mutations that do
not alter the meaning of a given codon and other mutations that result in a
different amino acid being incorporated into the protein.) More statistically accurate
methods allow the evolutionary rate on each branch of the phylogenetic tree to vary,
thus producing better estimates of coalescence times for genes.
Sequence motifs
Main article: Sequence motif

Frequently the primary structure encodes motifs that are of functional importance. Some
examples of sequence motifs are: the C/D [12] and H/ACA boxes [13] of snoRNAs, the Sm
binding site found in spliceosomal RNAs such as U1, U2, U4, U5, U6, U12 and U3,
the Shine-Dalgarno sequence, [14] the Kozak consensus sequence [15] and the RNA
polymerase III terminator. [16]

Long range correlations

Peng et al. [17][18] found the existence of long-range correlations in the non-coding base pair
sequences of DNA. In contrast, such correlations seem not to appear in coding DNA
sequences. This finding has been explained by Grosberg et al. [19] by the global spatial
structure of the DNA.



Sequence entropy
Main article: Sequence entropy

In bioinformatics, a sequence entropy, also known as sequence complexity or
information profile, [20] is a numerical sequence providing a quantitative measure of the
local complexity of a DNA sequence, independently of the direction of processing. The
manipulation of the information profiles enables the analysis of the sequences using
alignment-free techniques, such as in motif and rearrangement detection. [20][21] [2
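A simple way to get a feel for a local complexity profile is to compute the Shannon entropy of the base composition in a sliding window. This is only a rough stand-in: the information profiles in the cited work may be computed differently, and the window size and the names `shannon_entropy` and `entropy_profile` are assumptions made for this sketch.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the base composition of a string."""
    total = len(window)
    counts = Counter(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_profile(sequence, window=4):
    """Entropy of every window of the given size, moving one base at a time."""
    return [
        shannon_entropy(sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    ]


if __name__ == "__main__":
    profile = entropy_profile("AAAAACGT", window=4)
    print([round(value, 3) for value in profile])
```

Because each window value depends only on base counts, reversing the sequence simply reverses the profile, in line with the direction independence mentioned above.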
