Comp Bio Lab File


EXPERIMENT-1:

AIM : To Explore NCBI

About NCBI :
The National Center for Biotechnology Information (NCBI) is part of the United States National
Library of Medicine (NLM), a branch of the National Institutes of Health. The NCBI is located in
Bethesda, Maryland and was founded in 1988 through legislation sponsored by Senator Claude
Pepper. The NCBI houses genome sequencing data in GenBank and an index of biomedical
research articles in PubMed Central and PubMed, as well as other information relevant to
biotechnology. All these databases are available online through the Entrez search engine.

NCBI was directed until 2017 by David Lipman, one of the original authors of the BLAST sequence
alignment program and a widely respected figure in bioinformatics. He also led an intramural research
program, including groups led by Stephen Altschul (another BLAST co-author), David Landsman,
and Eugene Koonin (a prolific author on comparative genomics).

Screenshot of NCBI :
Q.1. What is PubMed? Write a few lines about it and then show the home page.

PubMed is a free database accessing the MEDLINE database of citations, abstracts and some
full text articles on life sciences and biomedical topics. The United States National Library of
Medicine (NLM) at the National Institutes of Health (NIH) maintains PubMed as part of the Entrez
information retrieval system. Listing an article or journal in PubMed is not endorsement. In
addition to MEDLINE, PubMed also offers access to:

• OLDMEDLINE for pre-1966 citations. This has recently been enhanced, and records for
1951+, even those parts in the printed indexes, are now included within the main portion.
• Citations to all articles (even those that are out-of-scope, e.g., covering plate tectonics or
astrophysics) from certain MEDLINE journals, primarily the most important general
science and chemistry journals, from which the life sciences articles are indexed for
MEDLINE.
• In-process citations, which provide a record for an article before it is indexed with MeSH
and added to MEDLINE or converted to out-of-scope status (PREMEDLINE).
• Citations that precede the date that a journal was selected for MEDLINE indexing (when
supplied electronically by the publisher).
• Some life science journals that submit full text to the PubMed Central digital library and
may not have been recommended for inclusion in MEDLINE although they have
undergone a review by NLM, and some physics journals that were part of a prototype
PubMed in the early to mid-1990s.[1]
Q.2. What is BLAST? Mention the variants of BLAST.

In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing
primary biological sequence information, such as the amino-acid sequences of different proteins
or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query
sequence with a library or database of sequences, and identify library sequences that resemble
the query sequence above a certain threshold. Different types of BLASTs are available according
to the query sequences. For example, following the discovery of a previously unknown gene in
the mouse, a scientist will typically perform a BLAST search of the human genome to see if
humans carry a similar gene; BLAST will identify sequences in the human genome that resemble
the mouse gene based on similarity of sequence. The BLAST program was designed by Eugene
Myers, Stephen Altschul, Warren Gish, David J. Lipman and Webb Miller at the NIH and was
published in J. Mol. Biol. in 1990.

Variants of BLAST

• BLASTN - Compares a DNA query to a DNA database. Searches both strands
automatically. It is optimized for speed, rather than sensitivity.
• BLASTP - Compares a protein query to a protein database.
• BLASTX - Compares a DNA query to a protein database, by translating the query
sequence in the 6 possible frames (3 reading frames from each strand of the DNA) and
comparing each translation against the database.
• TBLASTN - Compares a protein query to a DNA database, in the 6 possible frames of
the database.
• TBLASTX - Compares the protein encoded in a DNA query to the protein encoded in a
DNA database, in the 6*6 possible frames of both query and database sequences (note
that all the combinations of frames may have different scores).
• BLAST2 - Also called advanced BLAST. It can perform gapped alignments.
• PSI-BLAST - (Position Specific Iterated) Performs iterative database searches.
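The "6 possible frames" used by BLASTX, TBLASTN and TBLASTX can be illustrated with a short Python sketch. It is only an illustration of six-frame translation with the standard genetic code, not BLAST's own implementation; the function names (`reverse_complement`, `translate`, `six_frames`) are made up for this example, and the input is assumed to be an upper-case DNA string containing only A, C, G and T.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the translations of frames +1, +2, +3, -1, -2 and -3."""
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCC").items():
        print(name, protein)
```

Each of the six translations would then be compared against the protein database, which is why a translated search does roughly six times the work of a plain protein search.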
Q.3. Give the names of any five databases present in NCBI.

1. PubMed
A service of the National Library of Medicine that provides access to over 17 million citations from
MEDLINE and additional life sciences journals. PubMed includes links to many sites providing full
text articles and other related resources.

2. OMIM
This database is a catalog of human genes and genetic disorders authored and edited by
Dr. Victor A. McKusick and his colleagues at Johns Hopkins and elsewhere and developed for
the Web by NCBI.

3. GenBank
An annotated collection of all publicly available nucleotide and amino acid sequences.

4. HomoloGene
A gene homology tool that compares nucleotide sequences between pairs of organisms in order
to identify putative orthologs.

5. EST database
A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA
(cDNA).

Q.4. What is bookshelf?

The NCBI Bookshelf is a collection of freely available, downloadable online versions of selected
biomedical books, including several in molecular and cell biology. As of March 2006, the Bookshelf
had 55 titles covering aspects of molecular biology, biochemistry, cell biology, genetics,
microbiology, a couple of disease states from a molecular and cellular point of view, research
methods, and virology. Some of the books are online versions of previously published books,
while others, such as Coffee Break, are
written and edited by NCBI staff. The Bookshelf is a complement to the Entrez PubMed repository
of peer-reviewed publication abstracts in that Bookshelf contents provide established
perspectives on evolving areas of study and a context in which many disparate individual pieces
of reported research can be organized.
Q.5. What is an accession number?

An accession number in bioinformatics is a unique identifier given to a DNA or protein sequence
record to allow for tracking of different versions of that sequence record and the associated
sequence over time in a single data repository. Because of their relative stability, accession
numbers can be used as foreign keys for referring to a sequence object, but not necessarily to
a unique sequence. All sequence information repositories implement the concept of "accession
number" but might do so with subtle variations.
EXPERIMENT-2:

AIM : To retrieve a sequence from NCBI.

Q.1. What is Entrez?

The Entrez Global Query Cross-Database Search System is a powerful federated search engine,
or web portal that allows users to search many discrete health sciences databases at the National
Center for Biotechnology Information (NCBI) website. NCBI is part of the National Library of
Medicine (NLM), itself a department of the National Institutes of Health (NIH) of the United States.
"Entrez" also happens to be the second person plural (or formal) form of the French verb "entrer
(to enter)", meaning the invitation "Come in!".

Entrez Global Query is an integrated search and retrieval system that provides access to all
databases simultaneously with a single query string and user interface. Entrez can efficiently
retrieve related sequences, structures, and references. The Entrez system can provide views of
gene and protein sequences and chromosome maps. Some textbooks are also available online
through the Entrez system.
Q.2. Explain these fields in the GenBank or GenPept format of the query AAA40590.
a) What is version?
VERSION is made of the accession number of the database record followed by a
dot and a version number. The VERSION system of identifiers was adopted in
February 1999 by the International Nucleotide Sequence Database Collaboration
(GenBank, EMBL, and DDBJ). Version information can then be used to identify the
latest version of a sequence when it is retrieved by its accession alone. Specific
versions can also be retrieved. For example, in AAB33294.2 the suffix .2 is the version.
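The accession-dot-version layout of the VERSION field can be handled with a few lines of Python. This is a minimal sketch, assuming identifiers are plain strings of the form shown above; the helper names `split_version` and `latest` are hypothetical and are not part of any NCBI tool.

```python
def split_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version number)."""
    accession, sep, version = identifier.rpartition(".")
    if not sep or not version.isdigit():
        raise ValueError(f"no version suffix in {identifier!r}")
    return accession, int(version)


def latest(identifiers):
    """Pick the highest version among identifiers that share one accession."""
    return max(identifiers, key=lambda ident: split_version(ident)[1])


if __name__ == "__main__":
    print(split_version("AAB33294.2"))
    print(latest(["AAB33294.1", "AAB33294.2"]))
```

An identifier without a dot, such as a bare accession, is rejected rather than guessed at, because the accession alone does not say which version is meant.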
b) What is GI?
The GI number (sometimes written in lower case, "gi") is simply a series of digits
assigned consecutively to each sequence record processed by NCBI. The gi
number changes every time the sequence changes.
This number serves three main purposes:

1. NCBI assigns gi numbers to all sequences processed into Entrez,
including nucleotide sequences from DDBJ/EMBL/GenBank and protein sequences
from SWISS-PROT, PIR and many others. The gi number provides a unique
sequence identifier which is independent of the database source.

2. The gi number provides a unique identifier that specifies an exact sequence. If
a sequence in GenBank is modified, even by a single base pair, a new gi number
is assigned to the updated sequence (example: from gi 5868931 to gi 6358754).

3. The gi number is stable and retrievable. NCBI keeps the last version of every
gi number and includes this history in the record. For example, anyone can
examine this history and determine that gi 5868931 was replaced by gi 6358754.

c) What is the source of this query?
Octodon degus (degu)
ORGANISM Octodon degus
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
Hystricognathi; Octodontidae; Octodon.

d) What is the date of submission for this query?
27-APR-1993

e) What is the length of this query?
109 aa
Q.3. How many display options do we have in this segment?

We have ten display options in the display settings, as listed below.

• Summary
• GenPept
• GenPept (full)
• FASTA
• FASTA (text)
• Graphics
• ASN.1
• Revision history
• Accession list
• GI list
Q.4. Retrieve your query in these formats.

a) Genpept
b) FASTA:
c) ASN.1:
EXPERIMENT-3:

AIM: To perform local and global alignment at LALIGN servers.

Q1. What is EMBnet?


Ans: EMBnet stands for European Molecular Biology Network. Since its creation in 1988, EMBnet has
evolved from an informal network of individuals in charge of maintaining biological databases into the
only organization world-wide bringing bioinformatics professionals to work together to serve the
expanding fields of genetics and molecular biology. Although composed predominantly of academic
nodes, EMBnet gains an important added dimension from its industrial members. The success of EMBnet
is attracting increasing numbers of organizations outside Europe to join. EMBnet has a tried-and-tested
infrastructure to organise training courses, give technical help and help its members effectively interact
and respond to the rapidly changing needs of biological research in a way no single institute is able to do.
In 2005 the organization created additional node types so that a country can have more than one
member, in the form of associated nodes.
EMBnet is a science-based group of collaborating nodes throughout Europe and a number of nodes
outside Europe. The combined expertise of the nodes allows EMBnet to provide services to the European
molecular biology community which encompasses more than can be provided by a single node. This site
gives an overview of the organization and of its members. It provides the visitors with news of the
EMBnet community and new links related to bioinformatics.

Q2. Who developed LALIGN program?


Ans: LALIGN has been developed by William Pearson.

Q3. Which algorithm does this program implement?


Ans: The LALIGN program implements the algorithm of Huang and Miller.
Q4. Fetch these two sequences from NCBI and then perform Local & Global alignment between them at
default parameters. Paste the results.
1) AAA29796

2) NP_001157519

a) What are the default values of the following fields:

i) Scoring matrix
ii) Gap opening Penalty
iii) Gap extension Penalty

b) Discuss the output of Local & Global alignment.

ANS:
• LOCAL ALIGNMENT, default values of the following fields:
i) Scoring matrix: BLOSUM50
ii) Gap opening penalty: -12
iii) Gap extension penalty: -2
• GLOBAL ALIGNMENT, default values of the following fields:
i) Scoring matrix: BLOSUM62
ii) Gap opening penalty: -11
iii) Gap extension penalty: -1
Q5. Change Gap opening Penalty/Gap extension Penalty to -8 / -2. Record your Observation. Discuss the
difference between both the results from the perspective of Evolutionary distances.

ANS:
• LOCAL ALIGNMENT:
• GLOBAL ALIGNMENT:

Here, the local alignment shows 21.6% identity and the global alignment shows 16% identity.

The higher identity in the local alignment indicates that the similarity is concentrated in part of
the sequences rather than spread over their full length. The sequences are neither very divergent
nor very close. They are remote homologues.
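The difference between a local and a global alignment score can be shown with a small dynamic-programming sketch in Python. It is deliberately simplified and is not the Huang and Miller algorithm that LALIGN implements: it uses a flat match/mismatch score instead of a BLOSUM matrix and a single linear gap penalty instead of separate opening and extension penalties. The function name `alignment_score` and the scoring values are assumptions chosen for illustration.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming with a linear gap penalty.

    local=False gives a Needleman-Wunsch style (global) score;
    local=True gives a Smith-Waterman style (local) score.
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


if __name__ == "__main__":
    print("global:", alignment_score("GATTACA", "TTAC"))
    print("local: ", alignment_score("GATTACA", "TTAC", local=True))
```

The global score is pulled down by the gaps needed to span both sequences end to end, while the local score reflects only the best-matching region. The same effect explains why percent identity is usually higher in a local alignment than in a global alignment of the same pair.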
EXPERIMENT-4:
Aim: To perform similarity searching using BLAST software.
URL: www.ncbi.nlm.nih.gov/BLAST
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity
between sequences. The program compares nucleotide or protein sequences to sequence
databases and calculates the statistical significance of matches. BLAST can be used to
infer functional and evolutionary relationships between sequences as well as help identify
members of gene families.
Q1. What is BLAST? Mention the other variants of BLAST.

Ans. Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary
biological sequence information, such as the amino-acid sequences of different proteins or the
nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query
sequence with a library or database of sequences, and identify library sequences that resemble
the query sequence above a certain threshold. Different types of BLASTs are available according
to the query sequences.

Variants of BLAST

• BLASTN - Compares a DNA query to a DNA database. Searches both strands
automatically. It is optimized for speed, rather than sensitivity.
• BLASTP - Compares a protein query to a protein database.
• BLASTX - Compares a DNA query to a protein database, by translating the query
sequence in the 6 possible frames (3 reading frames from each strand of the DNA) and
comparing each translation against the database.
• TBLASTN - Compares a protein query to a DNA database, in the 6 possible frames of
the database.
• TBLASTX - Compares the protein encoded in a DNA query to the protein encoded in a
DNA database, in the 6*6 possible frames of both query and database sequences (note
that all the combinations of frames may have different scores).
• BLAST2 - Also called advanced BLAST. It can perform gapped alignments.
• PSI-BLAST - (Position Specific Iterated) Performs iterative database searches.

Q2. Explain BLAST algorithm.


Ans. The BLAST algorithm can be divided into four steps:
1. Seeding: The query sequence is divided into overlapping words of a fixed length (usually 11
for DNA and 3 for protein sequences), taken sequentially until the last letter of the query
sequence is included. For example, with a word length of 3, the query PQGEFG yields the words
PQG, QGE, GEF and EFG.
2. Alignment: BLAST only cares about the high-scoring words. Each query word is compared with
all possible words of the same length, and each residue pair in the comparison is scored using
the scoring matrix (substitution matrix).
3. Extension: BLAST then stretches a longer alignment between the query and the database
sequence in the left and right directions, from the position where the exact match occurred.
The extension does not stop until the accumulated total score of the HSP (high-scoring segment
pair) begins to decrease.

4. HSPs: We list the HSPs whose scores are greater than an empirically determined cutoff score.
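The four steps above can be sketched in Python as a toy version of seed-and-extend. This is a teaching sketch under several stated assumptions, not the real BLAST code: seeds are exact word matches only (real protein BLAST also accepts similar "neighbourhood" words scored with a substitution matrix), scoring is a flat +1/-1, extension is ungapped, and the stopping rule is a simple drop-off threshold. All function names and the `x_drop` and `cutoff` values are hypothetical.

```python
def query_words(query, w=3):
    """Seeding: every overlapping word of length w, with its query offset."""
    return [(query[i:i + w], i) for i in range(len(query) - w + 1)]


def find_seeds(query, subject, w=3):
    """Exact word hits as (query offset, subject offset) pairs."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    return [(i, j) for word, i in query_words(query, w) for j in index.get(word, [])]


def extend(query, subject, i, j, w=3, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of one seed to the right and to the left.

    Extension in each direction stops once the running score falls more than
    x_drop below the best score seen; the best-scoring extent is kept.
    Returns (score, query start, query end, subject start), end exclusive.
    """
    def walk(qi, sj, step):
        score = best = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right, right_len = walk(i + w, j + w, 1)
    left, left_len = walk(i - 1, j - 1, -1)
    return w * match + left + right, i - left_len, i + w + right_len, j - left_len


def hsps(query, subject, w=3, cutoff=5):
    """Keep the distinct extended segments whose score reaches the cutoff."""
    found = {extend(query, subject, i, j, w) for i, j in find_seeds(query, subject, w)}
    return sorted(segment for segment in found if segment[0] >= cutoff)


if __name__ == "__main__":
    print(query_words("PQGEFG"))
    print(hsps("AAPQGEFGTT", "CCPQGEFGCC"))
```

Several seeds on the same diagonal extend to the same segment, so collecting the results in a set reports each HSP once.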

Q3. Fetch this query from NCBI: “AAA40590”.

Q4. Perform BLAST with the above query and record the results.
• Description:
• Summary:

Q5. What is the difference between:


a) Similarity and Identity. Show in output.
• Similarity: The proportion of aligned positions at which the residues are either identical or
are replacements seen in nature between residues with similar biochemical properties.

• Identity: The proportion of aligned positions at which the residues are exactly the same in
the two sequences.

For example, in the output for the above sequence:

Here, identity is 61% and similarity is 70%.
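The two percentages can be computed from any pairwise alignment, as in the Python sketch below. It is an approximation for illustration: BLAST counts a position as similar ("Positives") when the substitution matrix gives it a positive score, whereas this sketch uses a small hand-made grouping of biochemically similar residues. The `SIMILAR_GROUPS` list, the function name and the example alignment are all assumptions, not BLAST output.

```python
# Illustrative grouping of biochemically similar residues (an assumption;
# BLAST itself uses positive substitution-matrix scores instead).
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ"]
GROUP_OF = {aa: g for g, group in enumerate(SIMILAR_GROUPS) for aa in group}


def identity_and_similarity(row_a, row_b):
    """Percent identity and similarity over the columns of a pairwise alignment.

    row_a and row_b are equal-length aligned strings with '-' for gaps.
    """
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("aligned rows must be non-empty and of equal length")
    identical = similar = 0
    for x, y in zip(row_a, row_b):
        if "-" in (x, y):
            continue  # gap columns count toward the length only
        if x == y:
            identical += 1
            similar += 1
        elif x in GROUP_OF and GROUP_OF.get(x) == GROUP_OF.get(y):
            similar += 1
    columns = len(row_a)
    return 100 * identical / columns, 100 * similar / columns


if __name__ == "__main__":
    print(identity_and_similarity("MKLVE-ASTG", "MRLIEGATSG"))
```

Similarity can never be lower than identity, because every identical position is also counted as similar.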

b) Mention the version of BLAST being used.
• BLASTP 2.2.25+
c) Describe which database is used for searching.
• Non-redundant protein sequences (nr).

Q6. Discuss the graphical output of BLAST.

The graphical output shows 100 database sequences aligned against the query. The length of each
bar shows how much of the query that hit covers, and the colour indicates the
alignment score, which here lies between 80 and 200 for all output sequences.
The output also reports the conserved domains found in the query
sequence.
Q7. How is query coverage important for us when using BLAST for similarity
searching?
Ans. Query coverage is the percentage of the query length that is included in the aligned
segments. It tells us how much of our sequence is covered by the hit. For example, a hit can
have 100% coverage but low identity, or 90% identity over only half of the query sequence.
Considering coverage together with identity therefore helps us choose an optimal alignment.
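Query coverage can be worked out from the query positions of the aligned segments. The Python sketch below merges overlapping segments so that no position is counted twice. It assumes 1-based, inclusive (start, end) positions on the query; the function name `query_coverage` and the example numbers are hypothetical.

```python
def query_coverage(hsp_ranges, query_length):
    """Percent of the query covered by the union of the aligned segments.

    hsp_ranges holds (start, end) query positions, 1-based and inclusive.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(hsp_ranges):
        start = max(start, current_end + 1)  # skip the part already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return 100 * covered / query_length


if __name__ == "__main__":
    # Two overlapping segments on a 120-residue query.
    print(query_coverage([(1, 50), (40, 60)], 120))
```

Taking the union matters when one hit has several aligned segments, since simply adding their lengths would overstate the coverage.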
Q8. What do these icons represent on the BLAST output page?
a) G
• "G" refers to Gene. It maintains information about genes from genomes of interest to
the RefSeq group.

b) U
• "U" refers to UniGene. It provides an organized view of the transcriptome. Each UniGene
entry is a set of transcript sequences that appear to come from the same transcription
locus, together with information on protein similarities, gene expression, cDNA clone
reagents, and genomic location.

c) M
• "M" refers to Map Viewer. It provides a wide variety of genome mapping and sequencing
data. Map Viewer allows us to view and search an organism's complete genome, display
chromosome maps, and zoom into progressively greater levels of detail, down to the
sequence data for a region of interest.

d) S
• "S" refers to Structure. It gives free access to macromolecular structures, conserved
domains and protein classification, activity of small molecules and other such related
information.

Q9. How do scores and E-values help in finding an optimal alignment?

Ans. The score, or bit score, is a value calculated from the number of gaps and substitutions
associated with each aligned sequence. The higher the score, the more significant the alignment.
Each score links to the corresponding pairwise alignment between the query sequence and the
output sequence.
The Expect value (E) is a parameter that describes the number of hits one can "expect" to
see by chance when searching a database of a particular size. It decreases exponentially
as the Score (S) of the match increases. Essentially, the E value describes the random
background noise. The lower the E-value, or the closer it is to zero, the more "significant"
the match is. However, keep in mind that virtually identical short alignments have
relatively high E values. This is because the calculation of the E value takes into account
the length of the query sequence. These high E values make sense because shorter
sequences have a higher probability of occurring in the database purely by chance.
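The exponential relationship between score and E-value can be made concrete with the standard bit-score formula E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. The Python sketch below applies it directly. It is a simplification: BLAST uses edge-corrected "effective" lengths, so reported E-values will differ somewhat. The function name and all numbers in the example are hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2**(-S') for bit score S' and search space m * n."""
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    m, n = 109, 5_000_000  # hypothetical query and database lengths
    for bits in (20, 30, 40, 50):
        print(f"bit score {bits}: E = {expect_value(bits, m, n):.3g}")
```

Each additional bit of score halves the E-value, and doubling the database size doubles it, which is why the same alignment can look less significant in a larger database.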
EXPERIMENT-5:

AIM: To explore PROSITE Database.


URL: expasy.org/prosite/

Q1. What is PROSITE?


Ans. PROSITE is a database of protein families and domains. PROSITE consists of
documentation entries describing protein domains, families and functional sites as well as
associated patterns and profiles to identify them. It is based on the observation that, while
there is a huge number of different proteins, most of them can be grouped, on the basis of
similarities in their sequences, into a limited number of families. Proteins or protein
domains belonging to a particular family generally share functional attributes and are
derived from a common ancestor.

Q2. What are patterns and profiles in PROSITE?


Ans. It is apparent, when studying protein sequence families, that some regions have
been better conserved than others during evolution. These regions are generally important
for the function of a protein and/or for the maintenance of its three-dimensional
structure. By analyzing the constant and variable properties of such groups of similar
sequences, it is possible to derive a signature for a protein family or domain, which
distinguishes its members from all other unrelated proteins.
A fingerprint is generally sufficient to identify a given individual. Similarly, a protein
signature can be used to assign a newly sequenced protein to a specific family of proteins
and thus to formulate hypotheses about its function.
These signatures constitute the patterns and profiles present in PROSITE database.
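A PROSITE pattern is written in a compact syntax: residues are separated by "-", "x" stands for any residue, "[ST]" lists the allowed residues, "{P}" lists the forbidden ones, "(n)" or "(n,m)" gives a repeat count, and "<" and ">" anchor the pattern to the N- or C-terminus. The Python sketch below converts that syntax to a regular expression and scans a sequence with it. It covers only these basic elements (for instance, it does not handle ">" inside square brackets), the function names are made up for this example, and the test sequence is invented. The pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern (PS00001).

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE pattern such as 'N-{P}-[ST]-{P}' to a Python regex."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("x", ".")                      # x: any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # {P}: anything but P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4): repetition
    regex = regex.replace("<", "^").replace(">", "$")    # sequence termini
    return regex


def find_sites(pattern, sequence):
    """Return (1-based start, matched text) for every match, overlaps included."""
    lookahead = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_sites("N-{P}-[ST]-{P}", "MNGTANPSA"))
```

A lookahead is used so that overlapping occurrences of a short pattern are all reported, which a plain left-to-right regex search would skip.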
Q3. Fetch these two queries and feed them to PROSITE.
a) Tyrosine kinase (Homo sapiens)
b) Chymotrypsin (Homo sapiens)
Q4. Explain why a few letters are in lower case and some are in upper case in the said
signature.
Ans:

Q5. What are rules? Give some examples.


Ans: The ProRule section of PROSITE is constituted of manually created rules that
increase the discriminatory power of PROSITE motifs (generally profiles) by providing
additional information about functionally and/or structurally critical amino acids and can
automatically generate annotation based on PROSITE motifs. Each rule contains
information used to provide template based annotation associated with the domain or
family detected by the PROSITE motif.
For example; when searched for chymotrypsin, it gives us the propagated annotation that
gives us information about residues involved in catalytic activity apart from other
information.
EXPERIMENT-6

AIM: To predict the 3-D structure of our protein using SWISS-MODEL present on
EXPASY server.

SWISS-MODEL is a fully automated protein structure homology-modeling server,
accessible via the ExPASy web server or from the program DeepView (Swiss-PdbViewer).
The purpose of this server is to make protein modelling accessible to all
biochemists and molecular biologists worldwide.
SWISS-MODEL (http://swissmodel.expasy.org) is a server for automated comparative
modeling of three-dimensional (3D) protein structures. SWISS-MODEL provides several
levels of user interaction through its World Wide Web interface: in the 'first approach
mode' only an amino acid sequence of a protein is submitted to build a 3D model.
Template selection, alignment and model building are done fully automatically by the
server. In the 'alignment mode', the modeling process is based on a user-defined
target-template alignment. Complex modeling tasks can be handled with the 'project mode'
using DeepView.

Q1. What is the primary, secondary and tertiary structure of a protein?


Ans. Primary Structure: The primary structure refers to the sequence of amino acids in the
peptide or protein. The primary structure is held together by covalent peptide bonds.
One end of the chain is known as the carboxyl terminus (C-terminus) and the other as the
amino terminus (N-terminus).
Secondary Structure: It refers to highly regular local sub-structures. The two main types of
secondary structure are the alpha helix and the beta strand; beta turns also come under
this category. These secondary structures are defined by patterns of hydrogen bonds
between the main-chain peptide groups. Several sequential secondary structures may
form a supersecondary structure, such as a motif.
Tertiary Structure: It refers to the three-dimensional structure of a single protein molecule.
The alpha helices and beta sheets are folded into a compact globule or long chains. The
folding of globular proteins is driven by non-specific hydrophobic
interactions. The structure is stabilized by salt bridges, hydrogen bonds, the tight
packing of side chains, and disulfide bonds.

Q2. What is homology?


Ans. Protein homology is biological homology between proteins, meaning that the
proteins are derived from a common ancestor. It is of two types:
• Orthology: The proteins may be in different species, with the ancestral protein being the
form of the protein that existed in the ancestral species. For example, myoglobin in humans
and mice.
• Paralogy: The proteins may be in the same species but have evolved from a single protein
whose gene was duplicated in the genome. For example, haemoglobin and myoglobin.

Q3. Explain homology based approach of protein structure prediction.


Ans. Homology modeling of a protein refers to constructing an atomic-resolution model of
the "target" protein from its amino acid sequence and an experimental three-dimensional
structure of a related homologous protein (the "template"). Homology modeling relies on
the identification of one or more known protein structures likely to resemble the structure
of the query sequence, and on the production of an alignment that maps residues in the
query sequence to residues in the template sequence. It has been shown that protein
structures are more conserved than protein sequences amongst homologues, but
sequences falling below 20% sequence identity can have very different structures.
Evolutionarily related proteins have similar sequences, and naturally occurring
homologous proteins have similar protein structures. Three-dimensional protein structure
is evolutionarily more conserved than would be expected from sequence conservation alone.
Q4. What is ANOLEA? What is its purpose?
Ans. ANOLEA (Atomic Non-Local Environment Assessment) is a server that performs
energy calculations on a protein chain, evaluating the "Non-Local Environment" (NLE)
of each heavy atom in the molecule. The energy of each pairwise interaction in this
non-local environment is taken from a distance-dependent knowledge-based mean force
potential that has been derived from a database of 147 non-redundant protein chains with
a sequence identity below 25% and solved by X-ray crystallography with a resolution
better (numerically lower) than 3 Å.

The calculations involve the non-local interactions between all the heavy atoms of the
twenty standard amino acids in the molecule. The input of the server is a PDB file
containing one or more protein chains. The output is an energy profile, which gives an
energy value for each amino acid of the protein. High energy zones (HEZs) in the profile
correlate with errors or with potential interacting zones of proteins.
It can therefore help us check experimental and theoretical models of protein structures for
errors.
Q5. Record results for your query.
Ans. Protein chosen: Glucagon (Homo sapiens)
Glucagon, a hormone secreted by the pancreas, raises blood glucose levels. Its effect is
opposite that of insulin, which lowers blood glucose levels.
EXPERIMENT-7

AIM: To predict transmembrane regions in the query sequence.


PROGRAM USED: TMHMM
TMHMM -- TransMembrane prediction using Hidden Markov Models -- is a program
for predicting transmembrane helices based on a hidden Markov model. It reads a
FASTA formatted protein sequence and predicts locations of transmembrane,
intracellular and extracellular regions.
Q1. Give examples of TM Proteins.
Ans. T-cell receptor, voltage-gated ion channels, mitochondrial carrier proteins, insulin
receptor, aquaporins, receptor tyrosine kinases etc.
Q2. What is the difference between Transmembrane and Globular protein?
Ans. The major difference between transmembrane and globular proteins is that the
former span the lipid bilayer and mainly mediate contact and transport across membranes,
whereas the latter are compact, mostly water-soluble proteins that perform various
intracellular and extracellular functions.
The membrane-spanning segments of transmembrane proteins share a similar, largely
hydrophobic structure, whereas globular proteins exist in a wide variety of shapes and sizes.
Q3. What is the average length of a transmembrane helix traversing a cell membrane?
Ans. Transmembrane helices are usually about 20 amino acids in length, although they
may be much longer or shorter.
Q4. How many transmembrane helices are predicted in the query?
Ans. It depends on the query sequence. For example, for rhodopsin, which is a type of
G-protein coupled receptor, 7 transmembrane helices were predicted.
Q5. Give examples of globular proteins and record the results using the TMHMM server.
Ans. Globular proteins are highly diverse and include some of the most common proteins,
such as haemoglobin, myoglobin, various secretory immunoglobulins (IgA, IgG etc.), and
various enzymes.

A globular protein, IgA, has been used here as the query sequence.
• Length: The length of the protein sequence. Here it is 131 amino acids.

• Number of predicted TMHs: The number of predicted transmembrane helices. Here it
is zero.

• Expected number of amino acids in transmembrane helices: The expected number
of amino acids in transmembrane helices. If this number is larger than 18, the protein is very
likely to be a transmembrane protein (or to have a signal peptide). Here it is 1.6, which
signifies that it is not a transmembrane protein.

• Expected number, first 60 amino acids: The expected number of amino acids in
transmembrane helices in the first 60 amino acids of the protein. If this number is more than
a few, you should be warned that a predicted transmembrane helix in the N-terminus could be
a signal peptide. Here it is 1.5.

• Total prob of N-in: The total probability that the N-terminus is on the cytoplasmic side of
the membrane. Here it is 0.08, which means that the N-terminus is unlikely to be on the
cytoplasmic side.

In the plot, the highest probability overall is for the "outside" state, shown by the pink
line; that is, the protein is predicted to lie outside the membrane.
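The rules of thumb quoted above can be collected into a small Python helper for reading a TMHMM summary. This is only a sketch of those rules, not part of TMHMM itself. The threshold of 18 comes from the text above; the cutoff used for "more than a few" (`few=10.0`) and the 0.5 cutoff for the N-in probability are arbitrary placeholders chosen for illustration, and the function name is hypothetical.

```python
def interpret_tmhmm(predicted_helices, exp_aas, exp_first60, prob_n_in,
                    tm_threshold=18.0, few=10.0):
    """Apply the rules of thumb for one TMHMM summary.

    tm_threshold follows the quoted rule (more than 18 expected residues);
    few and the 0.5 probability cutoff are illustrative assumptions.
    """
    return {
        "likely_transmembrane": exp_aas > tm_threshold,
        "possible_signal_peptide": predicted_helices > 0 and exp_first60 > few,
        "n_terminus_probably_cytoplasmic": prob_n_in > 0.5,
    }


if __name__ == "__main__":
    # Values recorded above for the IgA query.
    print(interpret_tmhmm(0, 1.6, 1.5, 0.08))
```

For the values recorded for the IgA query, all three flags come out false, which agrees with the conclusion that it is not a transmembrane protein.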
