
Introduction to bioinformatics

Subha Narayan Rath


• Nucleotide sequence databases: 6 × 10^11 bases (600 Gbp)
• Human genome: 3 × 10^9 bp
• The databases therefore hold the equivalent of about 200 human genomes.
• The database of macromolecular structures: 100 000 entries with full 3D coordinates of
proteins, nucleic acids, etc.
• Phenotype = genotype + environment + life history + epigenetics
• Alleles are different forms or sequences of the same gene
• Homozygosity and heterozygosity…
• Bioinformatics is defined as the application of tools of
computation and analysis to the capture and interpretation of
biological data. It is an interdisciplinary field, which harnesses
computer science, mathematics, physics, and biology

• (Molecular) bioinformatics is conceptualizing biology in terms of
• molecules (in the sense of physical chemistry) and
• applying “informatics techniques” (derived from disciplines such as applied
maths, computer science and statistics)
• to understand and organize the information associated with these
molecules, on a large scale.
• Outside strands= phosphoric acid+
deoxyribose sugar
• Inside strands= 4 nitrogenous bases
such as purines (AG) or pyrimidines (CT)
bases determining the code of the genes
• Nucleotide= Phosphoric
acid+deoxyribose+Base
• Three successive bases= a code word
• Transcription: DNA > RNA
• 1. GENETIC REGULATION
• Promoter (TATA box or TATAAAA)
• TF: + or – transcription factors
• Enhancers
• Hormones: signals from outside the cells
• 2. ENZYME REGULATION
• Enzyme inhibition (negative feedback control)
• Enzyme activation (e.g. cAMP activates glycogen breakdown to form more ATP)
• E.g. purine and pyrimidine formation uses both of the above mechanisms
• DNA > mRNA > Protein
• Strands in the double helix are anti-parallel; each strand has a direction, 5′ to 3′, named after the carbons of the deoxyribose ring
• Transcription of DNA to RNA and translation of mRNA both proceed in the 5′ to 3′ direction of the RNA
• Protein formation requires splicing, i.e. removal of non-coding regions
• Several proteins from the same gene: by mixing and matching of exons
• Other types of RNA such as siRNA, microRNA, piwi-interacting RNAs: control translation
• Triplets of bases in DNA act as a cipher for the protein sequence (as in the figure)
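The strand and triplet conventions above can be illustrated with a minimal Python sketch. The toy sequence and the function names are made up for illustration; the sketch assumes the input is the coding strand written 5′ to 3′ and contains only A, C, G and T.

```python
# Complement lookup for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the anti-parallel partner strand, also written 5'->3'."""
    return dna.translate(COMPLEMENT)[::-1]

def transcribe(coding_strand):
    """The mRNA carries the coding-strand sequence with U in place of T."""
    return coding_strand.replace("T", "U")

def codons(mrna):
    """Split an mRNA into successive triplets, dropping an incomplete tail."""
    return [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]

coding = "ATGGCCTAA"                    # hypothetical toy sequence, 5'->3'
template = reverse_complement(coding)   # the strand the polymerase reads
mrna = transcribe(coding)
print(template, mrna, codons(mrna))
```

Each triplet returned by `codons` is one "code word" in the sense used above.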
• To understand the function of the entire human genome: Encyclopedia of DNA Elements (ENCODE)
• https://www.nature.com/collections/aghcdefffg/
• 80% of the human genome can be ascribed some function, whereas the roughly 23,000
protein-coding genes previously considered account for only 1.5% of the genome
• The rest was called “junk DNA” by some
• Variable splicing means the number of proteins is not limited to the number of genes
• TWO ways in which non-coding regions can have function
• 1. involved in sequence-dependent physical interactions within chromatin that either expose
it to, or block it from, protein ligands
• 2. if transcribed to RNA: functions like regulation of transcription
• 75% of human genome is transcribed
• Mapping and dictionary of regulatory sites: many to one i.e. many proteins can bind to
same regulatory sites
• A sketch of structure of regulatory network
• Mapping of exposed sites of chromatin, which are unprotected from DNase I cleavage:
these are regulatory sites near genes where regulators of expression bind
• Typical protein length: 200-400 amino acids
• Exons and introns: among introns, some serve regulation and some are considered junk
• Proteins and structural RNAs vary a lot in their 3D structure
• For each protein or peptide sequence: there is a stable native state adopted
spontaneously
• The paradigm is: (and this is the focus of bioinformatics)
• DNA sequence determines protein sequence
• That determines protein structure
• That determines protein function
• Regulatory mechanisms like control of expression patterns
• Cell RNA content: Transcriptome
• DNA methylation patterns
• Splice variants and post-translational modifications of proteins in any cell
• Patterns in protein-protein interaction, DNA-Protein interaction with finding of exact
regions binding for it
• Integration of individual regulatory steps into networks
Genome organization

Subha Narayan Rath


• Genome of a typical bacterium like E. coli: 4.6 × 10^6 bp in a single DNA molecule
• If extended it would be about 2 mm long, yet it fits into a cell about 0.001 mm in diameter
• Human cells contain 23 pairs of chromosomes; the genome size is 3223 × 10^6 bp
• Transcriptome: all the RNA content of the cells
• Proteome: all the protein content of the cells; humans have only about 20,000 protein-coding
genes, but the number of distinct proteins is far larger
• The overall base composition of the E. coli genome is A + T = 49.2%. In a random sequence
of 4 639 221 bp (the size of the E. coli genome) with these proportions, what is the expected
number of occurrences of the sequence CTAG?
• They may appear in either strand of DNA
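One way to approach this exercise is sketched below in Python. It assumes that bases are independent, that A and T each make up half of the 49.2% and G and C each half of the remainder, and that the genome is treated as a linear string; the function name is illustrative only.

```python
GENOME_LENGTH = 4_639_221   # bp, as quoted above
AT_FRACTION = 0.492         # A + T

# Assumption: A = T and G = C, as expected for double-stranded DNA.
FREQ = {
    "A": AT_FRACTION / 2,
    "T": AT_FRACTION / 2,
    "G": (1 - AT_FRACTION) / 2,
    "C": (1 - AT_FRACTION) / 2,
}

def expected_occurrences(pattern, length, freq):
    """Expected count of `pattern` in a random sequence of independent bases."""
    probability = 1.0
    for base in pattern:
        probability *= freq[base]
    start_positions = length - len(pattern) + 1
    return probability * start_positions

print(round(expected_occurrences("CTAG", GENOME_LENGTH, FREQ)))
```

Under these assumptions the estimate is on the order of 18,000 sites. CTAG is its own reverse complement, so a site found on one strand is the same site on the other and the count is not doubled.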
• In bacteria: the functional units of genetic sequence are
• 3N nucleotides representing
• N amino acids of a protein
• In eukaryotes: one gene is split into separate segments in genomic DNA
• EXON: Expressed region
• INTRON: Intervening region
• Cellular machinery splices together the initial mRNA to make the product
• Control mechanism may turn genes ON or OFF
• Or regulate gene expression more finely
• Cascade of controls respond to conc. of nutrients, to stress, or to control cell cycle
• BLAST: rapidly compare a query sequence to a database of subject sequences
• Generate alignments between them; the quality of each is measured by its ALIGNMENT SCORE
• Return alignments that pass user-defined score and statistical significance thresholds
• BLAST uses local alignment to find high-scoring segment pairs (HSPs) between two
sequences
• BLAST HIT: a subject sequence that is aligned to the query
https://blast.ncbi.nlm.nih.gov/Blast.cgi
• Mapping oligonucleotides to genome
• Comparing DNA from closely related species
• Aligning expressed sequence tags to a genome
• Exploring protein function
• Initial discovery for conserved domains
• blastx: the nucleotide query is translated into all 6 reading frames
• 3 reading frames in + strand
• 3 reading frames in – strand
• Each reading frame is compared to a protein database
• It is used for:
• Gene finding in genomic DNA (Annotations)
• Annotating ESTs
• tblastn: the query is a protein sequence
• Nucleotide database is translated into 6 RFs
• The query is then compared to each RF
• It is used for
• Mapping a protein to genome database
• Finding ESTs that map to a protein sequence
• Finding RNA Seq reads that map to a protein sequence
• tblastx: BOTH the query and the subject database (both are nucleotides) are translated into
6 RFs and compared
• It is best used for:
• Comparing the nucleotide sequence from distantly related species
• Identify coding regions in ESTs
• Sensitive but expensive
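The six-reading-frame translation used by the translated searches above can be sketched in Python as follows. This is a minimal illustration, not how BLAST itself is implemented; it assumes the standard genetic code and an input containing only A, C, G and T, and the function names are made up.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate successive triplets; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return the 3 forward (+) and 3 reverse-complement (-) translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

print(six_frames("ATGGCC"))
```

Each of the six strings would then be compared against protein sequences. Ambiguity codes such as N are not handled and would raise a `KeyError` in this sketch.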
Protein structure and how to retrieve information from archives

Subha Narayan Rath


• Exploring protein function
• Initial discovery for conserved domains
• However, can we predict protein structure and function based on amino acid
sequence??
• Each protein folds to a unique 3D structure based on its amino acid sequence
• Protein structure is closely related to its function
• Protein structure prediction is a grand challenge of computational biology
• 1. Protein information resource (PIR) and associated database
• A. PIRSF: Sequence family classification
• B. iProClass: integrated protein knowledgebase
• C. iProLINK: integrated protein literature, information and knowledge
• 2. SWISS-PROT (from the Swiss Institute of Bioinformatics, Geneva, Switzerland)
• Also contains bioinformatics tools and links, called the Expert Protein Analysis System
(ExPASy, www.expasy.org)
• PROSITE is a set of signature patterns characteristic of protein families
• 3. TrEMBL
• The mature protein is not fully inferable from the gene sequence, because of
• A. ambiguity in splicing
• B. information about ligands, disulphide bridges, subunit associations, post-translational
modifications, effects of mRNA editing etc., which can’t be known from the gene
sequence
• E.g. https://www.uniprot.org/
• Patterns of conservation identify features that nature has chosen to retain (PROSITE
signatures are examples)
• R. Doolittle suggested: two full-length protein sequences (>100 residues) that have
25% or more identical residues in an optimal alignment are likely to be related
• < 15%: doubtful similarity
• 18-25%: twilight zone, where there might be similarity, as suggested by the appearance of
PROSITE consensus patterns
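The thresholds above can be applied to an existing alignment with a short Python sketch. It assumes the two sequences are already optimally aligned and of equal length, with `-` marking gaps, and it computes identity over the aligned (non-gap) columns, which is only one of several conventions. The function names are illustrative, and the 15-18% range is left unclassified because the thresholds above do not cover it.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not columns:
        return 0.0
    identical = sum(x == y for x, y in columns)
    return 100.0 * identical / len(columns)

def doolittle_zone(identity):
    """Map a percent identity onto the categories listed above."""
    if identity >= 25:
        return "likely related"
    if identity >= 18:
        return "twilight zone"
    if identity < 15:
        return "doubtful"
    return "not covered by the stated thresholds"

identity = percent_identity("ACDE", "ACDF")
print(identity, doolittle_zone(identity))
```

The rule is stated for full-length sequences of more than 100 residues, so the four-residue toy input only demonstrates the arithmetic.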
• Sequence-oriented databases are: InterPro, Pfam, COG
• Structure-oriented databases are: SCOP, CATH
• They archive, annotate, distribute a set of atomic coordinates
• Worldwide Protein DataBank (wwPDB)
• Others: Protein Data Bank in Europe (at EBI, UK) and Protein Data Bank Japan (based at
Osaka University)
• www.wwpdb.org
• It contains also structures of nucleic acids, carbohydrates in addition to proteins
• CCDC: the Cambridge Crystallographic Data Centre archives the structures of small
molecules
• BioMagResBank at University of Wisconsin: archives protein structures determined by
NMR
• wwPDB keeps the data from X-ray structure determinations
• What protein and from which species
• Who solved the structure with reference
• Experimental details of the structure solution, such as the resolution of the X-ray structure
determination
• Amino acid sequence
• Atomic coordinates (starting with ATOM)
• Additional molecules including cofactors, inhibitors, water molecules (keyword
HETATM identifies coordinates of these)
• Assignment of secondary structure: helices and sheets
• Disulphide bridges
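The ATOM and HETATM coordinate records listed above use fixed columns, so they can be read with plain string slicing. The Python sketch below follows the column layout of the legacy PDB text format. The three sample lines are hypothetical, truncated after the coordinates, and not taken from any real entry; the function and key names are illustrative.

```python
# Hypothetical sample records (made-up coordinates, truncated after z).
SAMPLE = """\
ATOM      1  N   GLY A   1      -8.863  16.944  14.289
ATOM      2  CA  GLY A   1      -9.929  17.026  13.244
HETATM  900  O   HOH A 201      10.000  20.000  30.000
"""

def parse_atoms(text):
    """Extract coordinate records from PDB-format text."""
    atoms = []
    for line in text.splitlines():
        record = line[0:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue                      # skip headers, remarks, etc.
        atoms.append({
            "record": record,
            "name": line[12:16].strip(),      # atom name, columns 13-16
            "res_name": line[17:20].strip(),  # residue name, columns 18-20
            "chain": line[21],                # chain identifier, column 22
            "res_seq": int(line[22:26]),      # residue number, columns 23-26
            "x": float(line[30:38]),          # coordinates in angstroms
            "y": float(line[38:46]),
            "z": float(line[46:54]),
        })
    return atoms

for atom in parse_atoms(SAMPLE):
    print(atom["record"], atom["res_name"], atom["name"], atom["x"], atom["y"], atom["z"])
```

For real work, an established parser is usually a safer choice than hand-written slicing, since actual entries contain alternate locations, insertion codes and multiple models.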
• Protein details can be found on the RCSB homepage, a member of the wwPDB
• https://www.rcsb.org/
• Example: entry 1TRZ and the different parts of its record
• Depending on the R group (side chain) of the amino acids, the properties vary
• 20 amino acids and their properties
• Polymerization of amino acids during synthesis produces the primary structure
• Linear and ordered
• 1D
• Sequence of amino acids
• Written from amino end to carboxyl end
by convention
• This linear structure is neither functional
nor stable, so the chain has to fold
• Non-linear
• 3D
• Localized to regions of amino acid
chain
• Formed and stabilized by
• Hydrogen bonding
• Van der Waals interactions
• Electrostatic forces
• Occurs in cytosol (up to 60% of bulk
water or 40% of water of hydration)
• Due to interaction of secondary structure and
solvent
• It’s non-linear and 3D
• In cytosol due to close proximity to other
folded and packed proteins
• Involve interaction of tertiary structure
elements of separate protein molecules
• Non-linear and 3D
• Class = secondary structure composition
• Could be all α, all β, α/β, α+β
• Motif = small specific combination of
secondary structure elements e.g. β –α – β
loop
• Fold is architecture, the overall shape
and orientation of secondary structures,
ignoring connectivity between the
structures
• E.g. alpha/beta barrel
• Fold families: categorizations that take
into account topology and previous
subsets as well as empirical or biological
properties
• Superfamilies: the above, plus
evolutionary and ancestral properties
• Primary (amino acid sequence)
• Secondary (alpha helix and beta sheet)
• Tertiary (3D structure formed by
assembly of secondary structures)
• Quaternary (more than one polypeptide
chain)
Alignment of pairs of sequences and phylogenetic trees

Subha Narayan Rath


• Given 2 or more sequences:
• Measure their similarity
• Determine the residue-residue correspondence
• Observe the patterns of conservation and variability
• Infer evolutionary relationships
• Sequence alignment is the identification of residue-residue correspondences.
• A mutual alignment of more than 2 sequences is called multiple sequence alignment
• A dot plot is a tool in bioinformatics to compare two sequences and show the regions of
close similarity in a visually understandable way.
• Sliding window size: noisy vs smooth (default is 10)
• Window size changes with goal of analysis
– size of average exon
– size of average protein structural element
– size of gene promoter
– size of enzyme active site
• Cut-off value by statistics and Z score
• Direction of movement through the plot: diagonal, horizontal or vertical…
• Repeated domains
• Conserved domains
• Exons and introns
• Terminators
• Frameshifts
• Low-complexity regions
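The sliding-window idea behind a dot plot can be sketched in a few lines of Python. This is a minimal illustration rather than the algorithm of any particular tool listed here: it counts exact matches within each window (tools such as dotmatcher can score with a substitution matrix instead), and the parameter names `window` and `min_matches` are made up.

```python
def dot_plot(seq_a, seq_b, window=3, min_matches=3):
    """Return a grid of booleans; True where the two windows are similar enough."""
    grid = []
    for i in range(len(seq_a) - window + 1):
        row = []
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches >= min_matches)
        grid.append(row)
    return grid

def render(grid):
    """Draw the grid as text: '*' for a dot, '.' for no dot."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)

# A sequence against itself: the main diagonal is always filled, and
# repeats show up as additional diagonals parallel to it.
print(render(dot_plot("ACGTACGT", "ACGTACGT", window=3, min_matches=3)))
```

A small window with a low threshold gives a noisy plot, while a larger window smooths it at the cost of missing short features, which is the trade-off noted in the window-size bullets.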

https://www.bioinformatics.nl/cgi-bin/emboss/dotmatcher
Arrangement of domains as described in the
Swiss-Prot entry

Drosophila melanogaster SLIT protein against itself


• ANACON – Contact analysis of dot plots.
• D-Genies– Specializes in interactive whole genome dotplots of large genomes
• Dotlet – Provides a program allowing you to construct a dot plot with your own sequences.
• dotmatcher– Web tool to generate dot plots (and part of the EMBOSS suite).
• Dotplot – easy (educational) HTML5 tool to generate dot plots from RNA sequences.
• dotplot – R package to rapidly generate dot plots as either traditional or ggplot graphics.
• Dotter– Stand alone program to generate dot plots.
• JDotter – Java version of Dotter.
• Flexidot – Customizable and ambiguity-aware dotplot suite for aesthetics, batch analyses and printing (implemented in
Python).
• Gepard – Dot plot tool suitable for even genome scale.
• Genomdiff – An open source Java dot plot program for viruses.
• LAST for whole-genome “split-alignment”.
• lastz and laj – Programs to prepare and visualize genomic alignments.
• yass – Web-based tool to generate (both forward and reverse complement) dot plots from genomic alignments.
• seqinr – R package to generate dot plots.
• SynMap – An easy to use, web-based tool to generate dotplots for many species with access to an extensive genome
database. Offered by the comparative genomics platform CoGe.
• UGENE Dot Plot viewer – Opensource dot plot visualizer.
• Given 2 strings, there are two measures of the distance between them.
• A. Hamming distance: defined between two strings of equal length, it is the number of
positions with mismatching characters
• B. Levenshtein, or edit, distance: defined between two strings of equal or unequal length,
it is the minimal number of “edit operations” required to change one string into the other.
• An edit operation can be a deletion, insertion or substitution of a single character in either sequence
• A given sequence of edit operations induces a unique alignment, but not vice versa!!!
• In molecular biology certain changes are more likely to occur than others, e.g. amino acid
substitutions of similar sizes, so different edit operations are given variable weights
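Both distances can be written compactly in Python. The sketch below uses unit cost for every edit operation, which is the unweighted case; the variable weights mentioned above would replace the constant 1 with a cost per operation. The function names are illustrative.

```python
def hamming(a, b):
    """Number of mismatching positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimal number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, x in enumerate(a, start=1):
        current = [i]                        # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

print(hamming("karolin", "kathrin"), levenshtein("kitten", "sitting"))
```

Only two rows of the edit matrix are kept at a time, which is enough for the distance itself; recovering the alignment would need the full matrix or a traceback.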
• What about similarity scoring system??
• Transition mutations vs transversion mutations, with a higher score for the former group.
• To measure the relative probability of any particular substitution, first we find the
relative frequencies of changes in pairs of aligned homologous sequences and based
on that we can make a scoring matrix for substitutions.
• A common change should score HIGHER than a rare one
• A measure of sequence divergence is the PAM: 1 PAM = 1 Percent Accepted Mutation
• Two sequences 1 PAM apart have 99% identical residues; collecting
statistics from such pairs produces the PAM1 substitution matrix
• Powers of this matrix are used for more divergent sequences
The PAM250 level (250% expected change, or 250 substitutions per 100 amino acids)
corresponds to 20% overall sequence identity.

The occurrence of reversions, either directly or via other changes, produces an apparent
slowdown of mutation rates.
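Why 250 PAM does not mean that every residue has changed can be seen by raising a substitution probability matrix to the 250th power. The Python sketch below uses a made-up two-letter alphabet with a 1% change per step, not the real 20 × 20 PAM1 matrix, so the numbers only illustrate the effect of repeated and reverted changes.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    n = len(matrix)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    base = [row[:] for row in matrix]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result

# Hypothetical "PAM1" for a two-letter alphabet: 99% unchanged per step.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

pam250_toy = mat_pow(PAM1_TOY, 250)
print(pam250_toy[0][0])   # probability that a site shows its original letter
```

In this toy model about half of the sites still show their original letter after 250 steps, because later changes overwrite or revert earlier ones. With 20 amino acids the surviving identity is lower, consistent with the roughly 20% quoted above.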
• It expresses scores as log-odds values:
• Score of mutation i → j = log10 (observed i → j mutation rate / mutation rate expected
from amino acid frequencies)
• The numbers are multiplied by 10 to avoid decimals
• The probability of 2 independent mutations is the product of their individual
probabilities, and hence their log-odds scores are added.
• A positive score means the sequences are related or the substitution is conservative
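The log-odds scoring described above reduces to one line of arithmetic. The Python sketch below uses made-up rates purely to show the sign and scale of the result; the function name and the `scale` parameter are illustrative.

```python
import math

def log_odds_score(observed_rate, expected_rate, scale=10):
    """Rounded scale * log10(observed / expected), as described above."""
    return round(scale * math.log10(observed_rate / expected_rate))

# Hypothetical rates: a substitution seen twice as often as chance predicts,
# one seen exactly as often, and one seen half as often.
print(log_odds_score(0.02, 0.01))    # positive: favoured substitution
print(log_odds_score(0.01, 0.01))    # zero: no preference
print(log_odds_score(0.005, 0.01))   # negative: disfavoured substitution
```

Because logarithms turn products into sums, the score of an alignment is the sum of the scores of its aligned pairs.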
• Uses the much larger amount of data available now
• BLOSUM means BLOcks SUbstitution Matrix; it is based on the BLOCKS database
(representing known protein families) of aligned protein sequences
• From families of closely related proteins alignable without gaps, the authors calculated
the ratio of the number of observed pairs of amino acids at any position to the number
expected from overall amino acid frequencies
• Sequences with identities higher than a threshold are clustered so that close relatives are
not over-counted; e.g. the commonly used BLOSUM62 matrix is built using sequences of
no more than 62% identity
• In addition to the substitution matrix, there is a way of weighting gaps too
• Aligning DNA sequences: CLUSTAL-W is recommended
• Aligning protein sequences: BLOSUM62 is recommended

Sequence   Matrix/protocol   Gap          Gap         Match    Mismatch
type       recommended       initiation   extension

DNA        CLUSTAL-W         10           0.1         1        0
Protein    BLOSUM62          11           1           Matrix   Matrix


• An algorithm used for this: dynamic programming, which is very important for molecular biology
• Guarantee: it gives an optimal global alignment
• Problem 1: many alignments may give the same optimal score
• Problem 2 (technical): the time required to align two sequences of lengths n and m is
proportional to n * m, as that is the size of the edit matrix that must be filled in
• Variations of the dynamic programming method:
• 1. entire sequence to entire sequence: global match
• 2. region of one sequence to entire other sequence: local match
• 3. region of one to region of another: motif match
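The global variant of the dynamic programming method can be sketched in Python as follows (it is usually known as the Needleman-Wunsch algorithm). The scoring values are assumptions chosen for illustration: +1 for a match, -1 for a mismatch and a linear gap penalty of -1, rather than a substitution matrix with separate gap initiation and extension costs. The sketch returns only the optimal score, not the alignment.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score, filling an (n+1) x (m+1) matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap            # prefix of a aligned to gaps only
    for j in range(1, cols):
        score[0][j] = j * gap            # prefix of b aligned to gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,        # gap in b
                score[i][j - 1] + gap,        # gap in a
            )
    return score[-1][-1]

print(global_alignment_score("ACGT", "AGT"))
```

The two nested loops make the n * m cost visible. Recovering an alignment requires a traceback through the matrix, and because several cells can tie, more than one alignment may achieve the same optimal score.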
• A typical approximation approach would take a small integer k and find all instances of
each k-tuple of residues in the probe sequence that occur in the database sequences
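The k-tuple lookup can be sketched in Python with a dictionary index. This is a simplified illustration of the idea and not the procedure of any specific program; the tiny database, its sequence names and the function names are hypothetical.

```python
from collections import defaultdict

def index_ktuples(database, k):
    """Map every k-tuple to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index

def find_seeds(probe, index, k):
    """Return (probe position, sequence name, database position) for each shared k-tuple."""
    hits = []
    for q in range(len(probe) - k + 1):
        for name, pos in index.get(probe[q:q + k], []):
            hits.append((q, name, pos))
    return hits

database = {"s1": "ACGTAC", "s2": "TTTT"}   # hypothetical toy database
index = index_ktuples(database, k=3)
print(find_seeds("CGTA", index, k=3))
```

Hits that share the same offset (database position minus probe position) lie on one diagonal, and such runs are natural candidates for extension into full alignments with the slower dynamic programming method.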
