
What is bioinformatics?

Interdisciplinary field of science that combines biology, computer science, statistics, physics,
chemistry, mathematics and engineering by developing methods and software tools for
understanding and interpreting biological data in genetics and genomics.
What is DNA? Meaning and key facts

DNA (deoxyribonucleic acid) is the hereditary material in humans and almost all other organisms;
it carries the biological instructions that make each species unique

• DNA was discovered in 1869 by Friedrich Miescher; Watson and Crick proposed the double-helix model in 1953

• Information in DNA is stored as a code made up of four chemical bases: adenine (A), guanine
(G), cytosine (C), and thymine (T)

• DNA bases pair up with each other, A with T and C with G, to form units called base pairs

• Each base is also attached to a sugar molecule and a phosphate molecule

• Base + Sugar + Phosphate = Nucleotide

• Nucleotides are arranged in two long strands to form a spiral called a double helix

All double-stranded DNA follows Chargaff’s rule: “The total number of purines in a DNA molecule is equal to the total
number of pyrimidines”

• Structure of double helix resembles a ladder, with the base pairs forming the ladder’s rungs and
the sugar and phosphate molecules forming the vertical sidepieces of the ladder

• DNA can replicate or make copies of itself

• Each strand of DNA in double helix can function as a pattern for duplicating sequence of bases

• During cell division, each new cell needs to have an exact copy of the DNA present in the old cell
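
The pairing rules and Chargaff's rule above can be checked with a few lines of code. This is a minimal sketch: the example sequence and the function names are made up for illustration, and the partner strand is written in the same left-to-right order as the input for simplicity.

```python
# Watson-Crick pairing as stated in the notes: A with T, C with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}
PURINES = set("AG")      # double-ring bases
PYRIMIDINES = set("CT")  # single-ring bases


def complement(strand):
    """Return the base-by-base partner strand (same left-to-right order)."""
    return "".join(PAIRS[base] for base in strand)


def purine_pyrimidine_counts(strand):
    """Count purines and pyrimidines over both strands of the duplex."""
    duplex = strand + complement(strand)
    purines = sum(base in PURINES for base in duplex)
    pyrimidines = sum(base in PYRIMIDINES for base in duplex)
    return purines, pyrimidines


template = "ATGCGGA"  # hypothetical example sequence
print(complement(template))
print(purine_pyrimidine_counts(template))
```

Because every base pair contains exactly one purine and one pyrimidine, the two counts are always equal for a complete double helix, whatever the sequence.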

Purines (A and G) are bases with a double-ring structure

Pyrimidines (C and T) are bases with a single ring

A pairs with T through two hydrogen bonds; G pairs with C through three
What are proteins? Key facts
Proteins are huge molecules (macromolecules) made up of large
numbers of amino acids (typically from 100 to 500), picked out from a
selection of 20 “flavors” with names such as alanine, glycine,
tyrosine, glutamine, and so on.
• Proteins with similar sequences fold into similar shapes
• Proteins with similar structures tend to be encoded by similar
sequences of amino acids
• The function of a protein is a direct consequence of its
3-D structure
• Structural bioinformatics: the branch of bioinformatics concerned
with the analysis and prediction of the three-dimensional structure of
biological macromolecules such as proteins, RNA, and DNA.

The final 3-D shape of a protein molecule is uniquely dictated by its sequence, because some amino-acid
types (for instance, the hydrophobic residues L, V, I) tend to stay away from the surface, where they would
interact with the surrounding water, while others (for instance, the hydrophilic residues D, S, K) favor
such contact
• The protein chain is also affected by other influences, such as the electric charges carried by some of the
amino acids, or their capacity to fit with their immediate neighbors
• The first 3-D structure of a protein was determined in 1958 by Drs. Kendrew and Perutz, using the
complicated technique of X-ray crystallography

HEMOGLOBIN

Hemoglobin is the protein that makes blood red

• Made up of four protein chains (polypeptide chains), two alpha
chains (141 amino acid residues each) and two beta chains (146 amino
acid residues each), each with a ring-like heme group containing an
iron atom
• There are four binding sites for oxygen on the hemoglobin molecule,
because each chain contains one heme group
• Alpha and beta chains have different sequences of amino acids, but
fold up to form similar three-dimensional structures
• The four chains are held together by noncovalent interactions
• Oxygen binds reversibly to these iron atoms and is transported
through blood

• A protein is a polymer of amino acids linked together by peptide bonds. Primary structure
is the sequence of amino acids in the chain
• The backbone of the protein folds to form regular repeating patterns called secondary structure
• The protein folds upon itself when regions of secondary structure are interrupted by irregularly
folded loops and turns. It helps to visualize the helices as pink and the turns as white. This pattern
repeats for the entire length of the protein chain. The irregular folding of the whole protein into a
compact globular structure is called tertiary structure
• Some proteins are actually a collection of smaller proteins called subunits. Hemoglobin is made
of four subunits. The arrangement of subunits in a protein is its quaternary structure. It helps to
visualize the subunits as different colors

AMINO ACIDS

Amino acids are linked together as a chain, and the true identity of a protein is derived not
only from its composition, but also from the precise order of its constituent amino acids
• The first amino-acid sequence of a protein, that of insulin, was determined in 1951.
• The actual recipe for human insulin, from which all its biological properties derive, is the following
chain of 110 residues (the precursor form, preproinsulin)
insulin=MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTP
KTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
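
Since a protein sequence is just an ordered string of one-letter codes, its length and composition are easy to check. The snippet below is a small sketch using only the Python standard library; the variable names are illustrative.

```python
from collections import Counter

# The insulin chain from the notes, joined across the line break.
insulin = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTP"
    "KTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)

composition = Counter(insulin)  # residue -> number of occurrences

print(len(insulin))              # total number of residues
print(composition["C"])          # number of cysteines
print(composition.most_common(3))
```

Two proteins can share the same composition and still be different molecules, which is why the order of residues, not only the counts, defines the protein.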
NH2 and COOH groups of atoms are used to form peptide bonds between successive residues in
the sequence
• A protein molecule is made when a free NH2 group links chemically with a COOH group, forming
the peptide bond CO-NH
• As a result of this chaining process, the protein molecule is left with an unused
NH2 at one end and an unused COOH at the other end, known as the N-terminus and C-terminus of the
protein chain
• Books, databases, and so on define the sequence of a protein or of a protein fragment as the
succession of its constituent amino acids, listed in order from the N-terminus to the C-terminus.

MAVLD = Met-Ala-Val-Leu-Asp = Methionine-Alanine-Valine-Leucine-Aspartic acid
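
The conversion from one-letter to three-letter codes can be sketched as a simple lookup. The table holds the standard codes for the 20 amino acids; the function name is an illustrative choice.

```python
# Standard one-letter -> three-letter codes for the 20 amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence):
    """Spell out a one-letter sequence, N-terminus first."""
    return "-".join(THREE_LETTER[residue] for residue in sequence)


print(to_three_letter("MAVLD"))
```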

Databases
Primary Database
• Original submissions by the experimentalists who produced the data
• Content controlled by submitters
• Examples: GenBank, dbSNP, GEO...

Secondary Database
• Built from primary data retrieved from primary databases
• Content controlled by a third party (e.g., NCBI)
• Examples: RefSeq, RefSNP, NCBI Structure, NCBI Protein

Abbreviations:

European Nucleotide Archive (ENA)
Protein Information Resource (PIR)
European Molecular Biology Laboratory (EMBL)
UniProtKB/Swiss-Prot Protein Knowledgebase
ExPASy (Expert Protein Analysis System)
Swiss Institute of Bioinformatics (SIB)
National Library of Medicine (NLM)
National Center for Biotechnology Information (NCBI)
DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI
PubMed is a database developed by NCBI at the National Library of Medicine (NLM);
it works as part of the NCBI Entrez retrieval system
PubMed Central (PMC)
Entrez Global Query Cross-Database Search System
Protein Data Bank (PDB)
Worldwide Protein Data Bank (wwPDB)
Sequence Retrieval System (SRS)

Sequence homology vs sequence similarity

• DNA and proteins are products of evolution
• The building blocks of these biological macromolecules, nucleotide bases and amino acids, form
linear sequences that determine the primary structure of the molecules
• The molecular sequences undergo random changes
• Although selected sequences gradually accumulate mutations and diverge over time, traces of evolution
may still remain in certain portions of the sequences to allow identification of the common ancestry
• For example, active site residues of an enzyme family tend to be conserved because they are
responsible for catalytic functions.
• By comparing sequences through alignment, patterns of conservation and variation can be
identified
• Sequence alignment can be used as a basis for prediction of the structure and function of
uncharacterized sequences.

When two sequences are descended from a common evolutionary origin, they are said to have a
homologous relationship or to share homology. A related but different term is sequence similarity,
which is the percentage of aligned residues that are similar in physicochemical properties such as
size, charge, and hydrophobicity.

Sequence homology
• An inference or a conclusion about a common ancestral relationship, drawn from sequence
similarity comparison when the two sequences share a high enough degree of similarity
• Sequences with an identity of 30% or higher can be safely regarded as having close homology; they are
sometimes referred to as being in the “safe zone”
• If the identity level falls between 20% and 30%, determination of homologous relationships
becomes less certain; this range is often regarded as the “twilight zone”
• Below 20% identity, where high proportions of nonrelated sequences are present, homologous
relationships cannot be reliably determined; such pairs fall into the “midnight zone”

Sequence similarity
• A direct result of observation from the sequence alignment
• Can be quantified using percentages. For example, one may say that two sequences share 40%
similarity. It is incorrect to say that the two sequences share 40% homology: they are either
homologous or nonhomologous
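
The three identity ranges can be written as a small helper. This is only a sketch of the rule of thumb quoted above; the function name is made up, and treating exactly 20% as "twilight" is an assumption about the boundary.

```python
def homology_zone(identity_percent):
    """Map a percent identity to the zone names used in the notes."""
    if identity_percent >= 30:
        return "safe zone"
    if identity_percent >= 20:  # boundary handling is an assumption
        return "twilight zone"
    return "midnight zone"


for value in (45, 25, 12):
    print(value, homology_zone(value))
```

The zone only says how reliable an inference of homology is; the sequences themselves are still either homologous or not.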

Sequence similarity and sequence identity are synonymous for nucleotide sequences. For protein
sequences, however, the two concepts are very different. In a protein sequence alignment,
sequence identity refers to the percentage of matches of the same amino acid residues between
two aligned sequences. Similarity refers to the percentage of aligned residues that have similar
physicochemical characteristics and can be more readily substituted for each other.

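Calculation of sequence similarity/identity

S = [(Ls × 2) / (La + Lb)] × 100

where S is the percentage sequence similarity, Ls is the number of aligned residues with similar
characteristics, and La and Lb are the total lengths of each individual sequence. Sequence identity
is calculated the same way, with the number of aligned identical residues in place of Ls.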
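
A short worked example may help here. The sketch below assumes the commonly used formula S = [(Ls × 2) / (La + Lb)] × 100, with identity computed the same way from identical residues. The grouping of "similar" amino acids is a simplified, hypothetical choice made only for illustration; real tools take similarity from a substitution matrix.

```python
# Hypothetical, simplified groups of physicochemically similar residues.
SIMILAR_GROUPS = [
    set("ILVM"), set("FWY"), set("KRH"),
    set("DE"), set("STNQ"), set("AG"),
]


def percent_scores(aligned_a, aligned_b):
    """Return (identity %, similarity %) for two aligned sequences.

    Gaps are written as '-' and both strings must have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    len_a = len(aligned_a.replace("-", ""))  # La: ungapped length
    len_b = len(aligned_b.replace("-", ""))  # Lb: ungapped length
    identical = similar = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    total = len_a + len_b
    return 200 * identical / total, 200 * similar / total


print(percent_scores("MAVLD", "MAILE"))
```

In this example V/I and D/E are not identical but fall in the same group, so similarity comes out higher than identity.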

METHODS: Global Alignment and Local Alignment

Global Alignment
• The two sequences to be aligned are assumed to be generally similar over their entire length
• Alignment is carried out from beginning to end of both sequences to find the best possible
alignment across the entire length between the two sequences
• Fails to recognize highly similar local regions between the two sequences

Local Alignment
• Does not assume that the two sequences in question have similarity over the entire length
• It only finds local regions with the highest level of similarity between the two sequences and aligns
these regions without regard for the alignment of the rest of the sequence regions
• More appropriate for aligning divergent biological sequences containing only modules that are
similar, which are referred to as domains or motifs

Task

ALGORITHMS

Dynamic Programming for Global Alignment

Needleman–Wunsch algorithm
• In this algorithm, an optimal alignment is obtained over the entire lengths of the two sequences.
• One of the few web servers dedicated to global pairwise alignment is GAP. GAP
(http://bioinformatics.iastate.edu/aat/align/align.html) is a web-based pairwise global alignment
program.
• It aligns two sequences without penalizing terminal gaps so similar sequences of unequal
lengths can be aligned.
• To be able to insert long gaps in the alignment, such gaps are treated with a constant penalty.
• This feature is useful in aligning cDNA to exons in genomic DNA containing the same gene.
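
To make the idea of global dynamic programming concrete, here is a minimal Needleman–Wunsch sketch. It is not the GAP program: it uses a simple linear gap penalty and does penalize terminal gaps, and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("ACGT", "AGT")
print(result[0])
print(result[1])
print(result[2])
```

The traceback always starts at the last cell and ends at the first, which is what forces the alignment to cover both sequences from beginning to end.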

The first application of dynamic programming in local alignment is the Smith–Waterman algorithm
• In this algorithm, positive scores are assigned for matching residues, and any cell of the scoring
matrix whose score would fall below zero is reset to zero.
• As a result, no negative scores appear in the matrix, and an alignment can begin and end anywhere
within the sequences.
• This approach may be suitable for aligning divergent sequences or sequences with multiple
domains that may be of different origins
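
A minimal Smith–Waterman sketch is shown below. It follows the common formulation in which mismatches and gaps carry negative scores and any cell that would fall below zero is reset to zero; the scoring values (match +2, mismatch -1, gap -1) are assumptions for illustration and are not taken from any of the servers named in these notes.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                          # never go below zero
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTACGTTT", "ACGT"))
```

Unlike the global case, the traceback starts at the best cell anywhere in the matrix, so only the matching region is reported and the flanking bases are ignored.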
• Most commonly used pairwise alignment web servers apply the local alignment strategy; these
include SIM, SSEARCH, and LALIGN. SIM (http://bioinformatics.iastate.edu/aat/align/align.html) is
a web-based program for pairwise alignment using the Smith–Waterman algorithm that finds the
best-scored nonoverlapping local alignments between two sequences.
• It is able to handle tens of kilobases of genomic sequence.
• The user has the option to set a scoring matrix and gap penalty scores.
• A specified number of best-scored alignments are produced.

SSEARCH (http://pir.georgetown.edu/pirwww/search/pairwise.html) is a simple web-based
program that uses the Smith–Waterman algorithm for pairwise alignment of sequences.
• Only one best-scored alignment is given.
• There is no option for scoring matrices or gap penalty scores.

LALIGN (www.ch.embnet.org/software/LALIGN_form.html) is a web-based program that uses a variant of
the Smith–Waterman algorithm to align two sequences.
• Unlike SSEARCH, which returns the single best-scored alignment, LALIGN gives a specified
number of best-scored alignments.
• The user has the option to set the scoring matrix and gap penalty scores.
• The same web interface also provides an option for global alignment performed by the ALIGN
program.

Major types of RNA

• Messenger RNA (mRNA): RNA molecule that specifies the amino acid sequence of a
protein.

• Ribosomal RNA (rRNA): any one of a number of specific RNA molecules that form part of
the structure of a ribosome and participate in the synthesis of proteins.

• Transfer RNA (tRNA): set of small RNA molecules used in protein synthesis as an interface
(adaptor) between messenger RNA and amino acids.

• DNA encodes hereditary information (genotype) -> decoded into RNA -> protein
(phenotype)

TRANSLATION
Conversion of the mRNA sequence into the amino acid sequence that makes a protein
• The mRNA leaves the nucleus and enters the cytoplasm
• Ribosomes attach to the mRNA
• tRNA (carrying an anticodon) picks up the correct amino acid and carries it to the mRNA
strand, where the protein is formed
Ex:
– tRNA carrying the anticodon GAU looks for the codon CUA on the mRNA
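
The codon/anticodon matching and the reading of mRNA in triplets can be sketched as follows. The codon table is the standard genetic code, built from the usual U, C, A, G ordering. The position-by-position anticodon pairing mirrors the simplified example in these notes and ignores strand orientation; the function names and the example mRNA are made up for illustration.

```python
BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
RNA_PAIRS = {"A": "U", "U": "A", "C": "G", "G": "C"}


def codon_for_anticodon(anticodon):
    """Pair each anticodon base with its partner (simplified, as in the notes)."""
    return "".join(RNA_PAIRS[base] for base in anticodon)


def translate(mrna):
    """Read the mRNA three bases at a time until a stop codon appears."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(codon_for_anticodon("GAU"))
print(translate("AUGGCCGUUCUAGAUUAA"))  # hypothetical example mRNA
```

The example mRNA was chosen so that its fourth codon is CUA, the codon matched by the GAU anticodon in the example above.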

Transcription
• Transcription: the process that makes mRNA from DNA

1. DNA unzips into 2 separate strands
   A. RNA polymerase unwinds the DNA and breaks the hydrogen bonds between the strands (DNA helicase does this job during replication)
2. Free-floating RNA nucleotides in the nucleus pair up with the unzipped DNA nitrogen bases:
   A. Cytosine (C) pairs with Guanine (G), and (G) with (C)
   B. Uracil (U) pairs with Adenine (A): a DNA (A) is matched by an RNA (U)
   C. Adenine (A) pairs with Thymine (T): a DNA (T) is matched by an RNA (A). Remember, (T) is found only in DNA
3. After all the pairing is done:
   • a single strand of RNA has been produced
4. The genetic code from DNA is transferred to mRNA
5. The code carried by the mRNA specifies which amino acids are to be added to the protein:
   • the code is read in sets of 3 nitrogen bases = Codon
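
The pairing rules for transcription can be written as a lookup table. This is a minimal sketch: the template strand is a made-up example, and the RNA is written in the same left-to-right order as the template for simplicity.

```python
# DNA template base -> RNA base, following the pairing rules above.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_strand):
    """Build the mRNA that pairs with a DNA template strand."""
    return "".join(DNA_TO_RNA[base] for base in template_strand)


def codons(mrna):
    """Split an mRNA into complete sets of 3 bases (codons)."""
    return [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]


mrna = transcribe("TACCGGCAA")  # hypothetical template strand
print(mrna)
print(codons(mrna))
```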


Overall process

TRANSCRIPTION VS TRANSLATION

RNA splicing
In molecular biology and genetics, splicing is a modification of the nascent pre-messenger
RNA (pre-mRNA) transcript in which introns are removed and exons are joined. For nuclear-encoded
genes, splicing takes place within the nucleus after or concurrently with transcription.
• Carried out by spliceosomes
• Spliceosomes
– complex of proteins and several small nuclear ribonucleoproteins (snRNPs)
– recognize splice sites (specific RNA sequences)
– cleave out introns and splice together exons (coding regions)
In most eukaryotic genes, coding regions (exons) are
interrupted by noncoding regions (introns). During
transcription, the entire gene is copied into a pre-mRNA,
which includes exons and introns. During the process of
RNA splicing, introns are removed and exons joined to
form a contiguous coding sequence.
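
In sequence terms, splicing amounts to keeping the exon segments and joining them in order. The sketch below assumes the exon positions are already known; the sequences and coordinates are invented for illustration, and real splice-site recognition by the spliceosome is far more involved.

```python
def splice(pre_mrna, exons):
    """Join exon slices given as (start, end) half-open index pairs."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical transcript: exon 1 + intron + exon 2.
exon1, intron, exon2 = "AUGGCC", "GUAAGUCCUCAG", "GUUCUAGAUUAA"
pre_mrna = exon1 + intron + exon2
exon_coords = [
    (0, len(exon1)),
    (len(exon1) + len(intron), len(pre_mrna)),
]

mature_mrna = splice(pre_mrna, exon_coords)
print(mature_mrna)
```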

Function of RNA

Storage/transfer of genetic information
• Genomes
• many viruses have RNA genomes: single-stranded (ssRNA), e.g., retroviruses (HIV), or
double-stranded (dsRNA)
• Transfer of genetic information
• mRNA = "coding RNA" - encodes proteins
A non-coding RNA (ncRNA) is an RNA molecule that is not translated into a protein. Less
frequently used synonyms are non-protein-coding RNA (npcRNA), non-messenger RNA
(nmRNA) and functional RNA (fRNA). The DNA sequence from which a functional non-coding
RNA is transcribed is often called an RNA gene.

Structural
• e.g., rRNA, which is a major structural component of ribosomes
BUT its role is not just structural, also:

Catalytic
• RNA in the ribosome has peptidyltransferase activity
• Enzymatic activity responsible for peptide bond formation between amino acids in the growing
peptide chain
• Also, many small RNAs are enzymes ("ribozymes")

Regulatory
Recently discovered important new roles for RNAs
In normal cells:
• in "defense", especially in plants
• in normal development, e.g., siRNAs, miRNA
As tools:
• for gene therapy or to modify gene expression
• RNAi
• RNA aptamers
