Nirmal Exercise 1
Aim : To access, explore and understand the representation of biological sequence and
structure data.
Procedure :
A web browser was opened and the following sites were accessed to answer the questions:
NCBI GenBank
EMBL-EBI
ENA
DDBJ
UniProt
PDB
CATH
SCOP2
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly
available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42).
The European Nucleotide Archive (ENA) captures and presents information relating to
experimental workflows that are based around nucleotide sequencing. ENA records this
information in a data model that covers input information (sample, experimental setup,
machine configuration), output machine data (sequence traces, reads and quality scores) and
interpreted information.
5) How are ENA and EMBL-EBI associated? Also outline their services.
EMBL-EBI focuses on providing tools and technology for bioinformatics research.
Its website offers various databases and query tools, such as UniProt, BLAST search,
Ensembl, PDBe and ChEMBL.
The ENA is produced and maintained by the European Bioinformatics Institute and is a
member of the International Nucleotide Sequence Database Collaboration (INSDC)
along with the DNA Data Bank of Japan and GenBank. The EMBL Nucleotide
Sequence Database (also known as EMBL-Bank) is the section of the ENA which
contains high-level genome assembly details, as well as assembled sequences and their
functional annotation.
6) What do you infer about PDB's data format? Also list the features for application.
The Protein Data Bank (PDB) format provides a standard representation for
macromolecular structure data derived from X-ray diffraction and NMR studies.
This representation was created in the 1970s and a large amount of software using it has
been written. A typical PDB-formatted file includes a large "header" section of text that
summarizes the protein, citation information, and the details of the structure solution,
followed by the sequence and a long list of the atoms and their coordinates. Online
tools allow us to search and explore the information in the PDB header.
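As a rough illustration of the "long list of the atoms and their coordinates", the sketch below reads a few fields from a single ATOM line by column position. The function name `parse_atom_line` and the dictionary keys are illustrative choices, not part of any PDB tool; the column ranges are the ones given for coordinate records in the published PDB format description.

```python
def parse_atom_line(line):
    """Extract selected fields from one ATOM/HETATM line of a PDB file.

    Coordinate records are fixed-width, so fields are read by column
    position rather than by splitting on whitespace.
    """
    record = line[0:6].strip()               # columns 1-6
    if record not in ("ATOM", "HETATM"):
        raise ValueError("not a coordinate record: %r" % record)
    return {
        "record": record,
        "serial": int(line[6:11]),           # columns 7-11
        "atom_name": line[12:16].strip(),    # columns 13-16
        "residue": line[17:20].strip(),      # columns 18-20
        "chain": line[21],                   # column 22
        "residue_number": int(line[22:26]),  # columns 23-26
        "x": float(line[30:38]),             # columns 31-38
        "y": float(line[38:46]),             # columns 39-46
        "z": float(line[46:54]),             # columns 47-54
    }
```

Reading by column matters because adjacent fields in a PDB file are not always separated by spaces, so splitting on whitespace can merge two values.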
DDBJ stands for "DNA Data Bank of Japan". The principal purpose of DDBJ operations
is to improve the quality of the International Nucleotide Sequence Database (INSD) as a
public resource. When researchers release their data to the public through INSD so that it is
shared worldwide, the DDBJ Center describes the data as richly as possible, according to the
unified rules of INSD, while keeping submission through DDBJ as easy as possible.
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence
and annotation data. UniProt is a collaboration between the European Bioinformatics Institute
(EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource
(PIR). EMBL-EBI and SIB together used to produce Swiss-Prot and TrEMBL, while PIR
produced the Protein Sequence Database (PIR-PSD). TrEMBL (Translated EMBL Nucleotide
Sequence Data Library) was originally created because sequence data was being generated at a
pace that exceeded Swiss-Prot’s ability to keep up. In 2002 the three institutes decided to pool
their resources and expertise and formed the UniProt consortium.
1) Obtain the nucleotide sequence for any gene of your choice from GenBank-NCBI
(list the steps, including the keyword and filters used), e.g. the nucleotide sequence
for the p53 gene of Bubalus bubalis.
(HEAVILY EDITED FOR COMPATIBILITY PURPOSES)
ORIGIN
1 atggaagaat cacaggcaga actcaatgtg gagccccctc tgagtcagga gacattttcc
61 gacttgtgga acctacttcc tgaaaataac cttctgtcct ccgagctctc tgcacctgtg
...
1141 gaggggcctg actcagactg a
ACCESSION JF792632
VERSION JF792632.2
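The same record can in principle be requested without the browser, through the `efetch` service of NCBI's E-utilities. The sketch below only builds the request URL for the accession shown above; the helper name `efetch_url` is hypothetical, and the download itself needs network access, so it is left as a commented-out usage line.

```python
from urllib.parse import urlencode

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb"):
    """Build an efetch URL for one nucleotide record.

    rettype="gb" asks for the GenBank flat file, rettype="fasta" for FASTA.
    """
    params = {
        "db": "nuccore",      # the NCBI nucleotide database
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_BASE + "?" + urlencode(params)


print(efetch_url("JF792632.2"))

# To download the record (requires network access):
# from urllib.request import urlopen
# record = urlopen(efetch_url("JF792632.2")).read().decode()
```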
3) List all the information retrieved other than the sequence.
/translation="MEESQAELNVEPPLSQETFSDLWNLLPENNLLSSELSAPVDDLL
PYTDVATWLDECPNEVPQMPEPSAPAAPPPATPAPATSWPLSSFVPSQKTYPGNYGFR
LGFLHSGTAKSVTCTYSPSLNKLFCQLAKTCPVQLWVDSPPPPGTRVRAMAIYKKLEH
MTEVVRRCPHHERSSDYSDGLAPPQHLIRVEGNLRAEYLDDRNTFRHSVVVPYESPEI
DSECTTIHYNFMCNSSCMGGMNRRPILTIITLEDSCGNLLGRNSFEVRVCACPGRDRR
TEEENFRKKGQSCPEPPPGSTKRALPTNTSSSPQQKKKRLDEEYFTLQIRGLKRYEMF
RELNDALELKDALDGREPGESRAHSSHLKSKKRPSPSCHKKPMLKREGPDSD"
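The /translation qualifier can be checked against the nucleotide sequence. The sketch below translates the first line of the ORIGIN block with the standard genetic code; the function name `translate` is an illustrative choice, and only the first 60 bases are used because the middle of the sequence is elided in this report.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 1; '*' marks a stop codon."""
    dna = "".join(dna.split()).upper()   # drop the spaces between blocks
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE[dna[i:i + 3]])
    return "".join(protein)


first_line = "atggaagaat cacaggcaga actcaatgtg gagccccctc tgagtcagga gacattttcc"
print(translate(first_line))   # MEESQAELNVEPPLSQETFS
```

The printed peptide matches the first 20 residues of the /translation qualifier above.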
4) Compare and contrast the GenBank and FASTA sequence formats.
FASTA format is a text-based format for representing either nucleotide sequences or peptide
sequences, in which nucleotides or amino acids are represented using single-letter codes. A
sequence in FASTA format begins with a single-line description, followed by lines of sequence
data. A sequence file in GenBank format can contain several sequences.
Each record in GenBank format starts with a line beginning with the word LOCUS, followed by
a number of annotation lines. The start of the sequence is marked by a line containing
"ORIGIN" and the end of the sequence is marked by two slashes ("//").
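The markers described above (LOCUS, ORIGIN and "//") are enough to convert a GenBank file to FASTA. The following is a minimal sketch under that assumption: it ignores all annotation, uses the LOCUS name as the FASTA description, and the function name `genbank_to_fasta` and the two-record demo text are made up for illustration.

```python
def genbank_to_fasta(text, width=60):
    """Convert GenBank flat-file text (one or more records) to FASTA."""
    fasta_lines = []
    name = None
    in_sequence = False
    residues = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line.startswith("//"):
            # End of record: emit the description line and wrapped sequence.
            seq = "".join(residues)
            fasta_lines.append(">" + (name or "unknown"))
            fasta_lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
            name, in_sequence, residues = None, False, []
        elif in_sequence:
            # Drop the leading position number and the spaces between blocks.
            residues.extend(ch for ch in line if ch.isalpha())
    return "\n".join(fasta_lines)


# Made-up demo records, not real database entries.
demo = """LOCUS       DEMO1   12 bp    DNA
ORIGIN
        1 atggaagaat ca
//
LOCUS       DEMO2   4 bp    DNA
ORIGIN
        1 acgt
//
"""
print(genbank_to_fasta(demo))
```

The conversion is lossy in one direction: the annotation lines between LOCUS and ORIGIN have no place in FASTA, which is the main practical difference between the two formats.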
5) Access the EMBL-EBI site and retrieve a nucleotide sequence for any gene of your choice and
give its file format.
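Nucleotide records from ENA are commonly distributed in the EMBL flat-file format, in which every line starts with a two-letter code (for example ID, AC, SQ) and the record ends with "//". The sketch below pulls the accession and the sequence out of such text; the function name `read_embl_sequence` is hypothetical and the parser handles only the line codes it names.

```python
def read_embl_sequence(text):
    """Return (accession, sequence) for the first record in EMBL-format text."""
    accession = None
    in_sequence = False
    residues = []
    for line in text.splitlines():
        code = line[:2]
        if code == "AC" and accession is None:
            # The AC line lists accessions separated by semicolons.
            accession = line[5:].split(";")[0].strip()
        elif code == "SQ":
            in_sequence = True
        elif code == "//":
            break
        elif in_sequence:
            # Sequence lines end with a running base count; keep letters only.
            residues.extend(ch for ch in line if ch.isalpha())
    return accession, "".join(residues)
```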
6) Retrieve the protein sequences of your protein of interest from the UniProt site and give
its file format.
FASTA format is used here.
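UniProtKB FASTA files carry structured information in the description line, of the general form `>db|Accession|EntryName ProteinName OS=Organism OX=TaxonID GN=Gene PE=n SV=n`. The sketch below splits such a line into named fields; the function name `parse_uniprot_header` and the key names are illustrative, and the pattern assumes a UniProtKB entry (`sp` or `tr`), not other FASTA sources.

```python
import re

HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+(?P<rest>.*)$"
)


def parse_uniprot_header(header):
    """Split a UniProtKB FASTA description line into a dict of fields."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError("not a UniProtKB FASTA header")
    fields = match.groupdict()
    rest = fields.pop("rest")
    # The protein name comes first; KEY=value annotations follow it.
    parts = re.split(r"\s(?=(?:OS|OX|GN|PE|SV)=)", rest)
    fields["protein_name"] = parts[0]
    for part in parts[1:]:
        key, _, value = part.partition("=")
        fields[key] = value
    return fields
```

Here `sp` marks a reviewed Swiss-Prot entry and `tr` an unreviewed TrEMBL entry, so the first field already shows which section of UniProtKB the sequence came from.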
1) Access the PDB site, retrieve the structure result page for Human insulin A, obtain
the file in PDB format and view its features by opening it with WordPad.
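Besides reading the file in a text editor, a quick way to see what a PDB file contains is to count its record types, since every line begins with a record name in columns 1-6 (HEADER, TITLE, SEQRES, ATOM and so on). The sketch below does that; the function name `count_record_types` and the local file name in the usage comment are hypothetical.

```python
from collections import Counter


def count_record_types(pdb_text):
    """Tally the record name (columns 1-6) of every non-blank line."""
    return Counter(
        line[:6].strip() for line in pdb_text.splitlines() if line.strip()
    )


# Usage, assuming the downloaded file was saved as "insulin.pdb":
# with open("insulin.pdb") as handle:
#     for name, count in count_record_types(handle.read()).most_common():
#         print(name, count)
```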
CONCLUSION:
Knowledge of how to access various sequence and structure databases was obtained.
Moreover, the representation of sequence and structure data in various file formats was
understood.