
Exercise 1 Date : 2/07/2019

Representation of biological sequence and structure data

Aim : To access, explore and understand the representation of biological sequence and
structure data.

Procedure :
The browser was opened and the following sites were accessed to answer the questions:
NCBI GenBank
EMBL-EBI
ENA
DDBJ
UniProt
PDB
CATH
SCOP2

Questions that were addressed for database:

1) Explore and find out the significance of Genbank.

GenBank ® is the NIH genetic sequence database, an annotated collection of all publicly
available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42)

2) Which 3 organizations pool their data for the GenBank database?

GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC), which
comprises the DNA Data Bank of Japan (DDBJ), the European Nucleotide Archive (ENA), and
GenBank at NCBI.

3) How is GenBank associated with NCBI?

GenBank is a comprehensive public database of nucleotide sequences and supporting
bibliographical and biological annotation. NCBI builds GenBank primarily from the
submission of sequence data from authors and from the bulk submission of expressed sequence
tag (EST), genome survey sequence (GSS) and other high-throughput data from sequencing
centers.

4) Give your idea about ENA.

The European Nucleotide Archive (ENA) captures and presents information relating to
experimental workflows that are based around nucleotide sequencing. ENA records this
information in a data model that covers input information (sample, experimental setup,
machine configuration), output machine data (sequence traces, reads and quality scores) and
interpreted information.

5) How are ENA and EMBL-EBI associated? Also outline its services.

EMBL-EBI focuses on providing tools and technology for bioinformatics research.
Its website offers various databases and query tools, such as UniProt, BLAST search,
Ensembl, PDBe and ChEMBL.
The ENA is produced and maintained by the European Bioinformatics Institute (EMBL-EBI) and
is a member of the International Nucleotide Sequence Database Collaboration (INSDC)
along with the DNA Data Bank of Japan and GenBank. The EMBL Nucleotide
Sequence Database (also known as EMBL-Bank) is the section of the ENA which
contains high-level genome assembly details, as well as assembled sequences and their
functional annotation.

6) What do you infer about PDB's data format? Also list the features for application.

The Protein Data Bank (PDB) format provides a standard representation for
macromolecular structure data derived from X-ray diffraction and NMR studies.
This representation was created in the 1970s and a large amount of software using it has
been written. A typical PDB-formatted file includes a large "header" section of text that
summarizes the protein, citation information, and the details of the structure solution,
followed by the sequence and a long list of the atoms and their coordinates. Online
tools allow us to search and explore the information under the PDB header.

7) Expand DDBJ and explore its functions.

DDBJ stands for "DNA Data Bank of Japan". The principal purpose of DDBJ operations
is to improve the quality of the International Nucleotide Sequence Database (INSD) as a
public-domain resource. When researchers make their data open to the public through INSD
and share it worldwide, the DDBJ Center annotates the data as richly as possible according
to the unified rules of INSD, while keeping submission through DDBJ as stress-free as possible.

8) Give your understanding about UniProtKB, Swiss-Prot, TrEMBL and PIR.

The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence
and annotation data. UniProt is a collaboration between the European Bioinformatics Institute
(EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource
(PIR). EMBL-EBI and SIB together used to produce Swiss-Prot and TrEMBL, while PIR
produced the Protein Sequence Database (PIR-PSD). TrEMBL (Translated EMBL Nucleotide
Sequence Data Library) was originally created because sequence data was being generated at a
pace that exceeded Swiss-Prot's ability to keep up. In 2002 the three institutes decided to pool
their resources and expertise and formed the UniProt consortium.

9) Read the Protein Spotlight on the UniProt site and brief on it.

Protein Spotlight (ISSN 1424-4721) is a monthly review written by the Swiss-Prot
team of the SIB Swiss Institute of Bioinformatics. The protein described for this
month was SETD3 and its relation with the actin protein.

10) Compare and contrast the features of CATH2 and SCOP2.

CATH is a classification of protein structures downloaded from the Protein Data
Bank. Gene3D uses the information in CATH to predict the locations of structural
domains on millions of protein sequences available in public databases.
SCOP sorts protein domains into classes, folds, superfamilies and families, while the four
major levels of CATH are class, architecture, topology and homologous superfamily. The
building process of CATH contains more automatic steps and less human intervention
compared to SCOP.

Questions that were addressed for sequence data

1) Obtain the nucleotide sequence for any known gene of your choice from GenBank-NCBI
(list the steps, including the keyword and filters) (e.g. nucleotide sequence
for p53 gene of Bubalus bubalis)
(HEAVILY EDITED FOR COMPATIBILITY PURPOSES)

ORIGIN
1 atggaagaat cacaggcaga actcaatgtg gagccccctc tgagtcagga gacattttcc
61 gacttgtgga acctacttcc tgaaaataac cttctgtcct ccgagctctc tgcacctgtg
...
1141 gaggggcctg actcagactg a

2) Identify the Genbank accession Id of your sequence.

ACCESSION JF792632

VERSION JF792632.2
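
The same record can also be requested without the browser. The sketch below only builds the
request URL for the NCBI E-utilities efetch endpoint and does not contact the server. The helper
name efetch_url is illustrative, and the parameter choices (db=nuccore, rettype=gb, retmode=text)
are assumptions about what suits a nucleotide flat file.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, rettype="gb"):
    """Build an efetch URL for a nucleotide record (rettype 'gb' or 'fasta')."""
    params = {
        "db": "nuccore",      # nucleotide database (assumed appropriate here)
        "id": accession,      # accession, optionally with version
        "rettype": rettype,   # 'gb' for GenBank flat file, 'fasta' for FASTA
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)

print(efetch_url("JF792632.2"))
```

Passing rettype="fasta" to the same helper would request the FASTA form of the record instead.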
3) List all the information retrieved, other than the sequence.

LOCUS JF792632 1161 bp mRNA linear MAM 17-NOV-2014

DEFINITION Bubalus bubalis p53 mRNA, complete cds.


KEYWORDS .

SOURCE Bubalus bubalis (water buffalo)


ORGANISM Bubalus bubalis
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia;
Pecora; Bovidae; Bovinae; Bubalus.
REFERENCE 1 (bases 1 to 1161)
AUTHORS Singh,M., Dogra,N., Mohanty,A., Mishra,B. and
Mukhopadhyay,T.
TITLE p53 regulation under altered cellular environment
JOURNAL (in) ? (Ed.);
PROCEEDINGS OF THE ALL INDIA CELL BIOLOGY CONFERENCE 34: 109;
(2010)

... (Heavily edited for compatibility purposes)

REMARK Nucleotide and amino acid sequences updated by submitter


COMMENT On Nov 17, 2014 this sequence version replaced
JF792632.1.
FEATURES Location/Qualifiers
source 1..1161
/organism="Bubalus bubalis"
/mol_type="mRNA"
/db_xref="taxon:89462"
gene 1..1161
/gene="p53"
CDS 1..1161
/gene="p53"
/codon_start=1
/product="p53"
/protein_id="AEG21062.2"

/translation="MEESQAELNVEPPLSQETFSDLWNLLPENNLLSSELSAPVDDLL
PYTDVATWLDECPNEVPQMPEPSAPAAPPPATPAPATSWPLSSFVPSQKTYPGNYGFR
LGFLHSGTAKSVTCTYSPSLNKLFCQLAKTCPVQLWVDSPPPPGTRVRAMAIYKKLEH
MTEVVRRCPHHERSSDYSDGLAPPQHLIRVEGNLRAEYLDDRNTFRHSVVVPYESPEI
DSECTTIHYNFMCNSSCMGGMNRRPILTIITLEDSCGNLLGRNSFEVRVCACPGRDRR
TEEENFRKKGQSCPEPPPGSTKRALPTNTSSSPQQKKKRLDEEYFTLQIRGLKRYEMF
RELNDALELKDALDGREPGESRAHSSHLKSKKRPSPSCHKKPMLKREGPDSD"
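
A quick consistency check on this record: translating the ORIGIN lines in frame 1 should
reproduce the start and end of the /translation qualifier. The sketch below assumes the standard
genetic code (the record as shown does not state a translation table), and the helper name
translate is illustrative.

```python
# Standard genetic code, with codons ordered t, c, a, g at each position.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate a DNA string in frame 1; '*' marks a stop codon."""
    dna = "".join(dna.lower().split())   # drop the spaces between 10-base blocks
    usable = len(dna) - len(dna) % 3     # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

# First and last sequence lines of the record (position numbers removed)
first_line = "atggaagaat cacaggcaga actcaatgtg gagccccctc tgagtcagga gacattttcc"
last_line = "gaggggcctg actcagactg a"

print(translate(first_line))  # MEESQAELNVEPPLSQETFS
print(translate(last_line))   # EGPDSD*
```

The first line gives the opening residues of the protein, and the last line ends in the stop
codon tga, which is why the CDS spans 1..1161 while the protein has one residue fewer than
1161/3 codons.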
4) Compare and contrast between Genbank and FASTA sequence formats.

FASTA format is a text-based format for representing either nucleotide sequences or peptide
sequences, in which nucleotides or amino acids are represented using single-letter codes. A
sequence in FASTA format begins with a single-line description starting with ">", followed by
lines of sequence data, and carries no further annotation.
A sequence in GenBank format starts with a line containing the word LOCUS, followed by a number
of annotation lines. The start of the sequence is marked by a line containing "ORIGIN" and the
end of the sequence is marked by two slashes ("//"). A sequence file in GenBank format can
contain several such records.
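
The relationship between the two formats can be sketched as a conversion: keep the identifier and
description, discard the remaining annotation, and strip the position numbers from the sequence
lines. The function name genbank_to_fasta and the tiny demo record are hypothetical, and the
sketch assumes a single-line DEFINITION and one record per input.

```python
def genbank_to_fasta(record, width=60):
    """Convert one GenBank flat-file record (as text) to a FASTA string."""
    accession, definition, seq_parts = "", "", []
    in_origin = False
    for line in record.splitlines():
        if line.startswith("//"):            # two slashes end the record
            break
        if in_origin:
            # sequence lines: a position number followed by blocks of bases
            seq_parts.extend(line.split()[1:])
        elif line.startswith("VERSION"):
            accession = line.split()[1]
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):      # sequence starts on the next line
            in_origin = True
    seq = "".join(seq_parts)
    body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{accession} {definition}\n{body}\n"

# Hypothetical miniature record, for illustration only
record = """LOCUS       DEMO01        21 bp    mRNA    linear   MAM 17-NOV-2014
DEFINITION  Hypothetical demo record.
VERSION     DEMO01.1
ORIGIN
        1 gaggggcctg actcagactg a
//
"""
print(genbank_to_fasta(record), end="")
```

Everything between DEFINITION and ORIGIN (references, features, translation) is lost in the
conversion, which is the main practical difference between the two formats.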

5) Access the EMBL-EBI site and retrieve a nucleotide sequence for any gene of your choice and
give its file format.

>sp|P04637|P53_HUMAN Cellular tumour antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP
PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPG
GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD

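FASTA file format is used here. Note that the entry shown is a UniProtKB/Swiss-Prot protein
record (accession P04637) rather than a nucleotide sequence.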
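
The description line of such an entry packs several fields into one line. A minimal sketch of
splitting it up, assuming the layout seen in the header above (database, accession and entry name
separated by "|", then the protein name and KEY=value tags); the function name
parse_uniprot_header is illustrative.

```python
import re

HEADER = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry>\S+)\s+(?P<rest>.*)$"
)

def parse_uniprot_header(header):
    """Split a UniProtKB-style FASTA header into named parts."""
    m = HEADER.match(header)
    if m is None:
        raise ValueError("not a UniProtKB-style FASTA header")
    fields = m.groupdict()
    rest = fields.pop("rest")
    # The protein name runs up to the first KEY=value tag (OS, OX, GN, PE, SV).
    tags = list(re.finditer(r"\b(OS|OX|GN|PE|SV)=", rest))
    fields["name"] = rest[:tags[0].start()].strip() if tags else rest.strip()
    for tag, nxt in zip(tags, tags[1:] + [None]):
        end = nxt.start() if nxt else len(rest)
        fields[tag.group(1)] = rest[tag.end():end].strip()
    return fields

h = ">sp|P04637|P53_HUMAN Cellular tumour antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4"
info = parse_uniprot_header(h)
print(info["accession"], info["OS"], info["GN"])  # P04637 Homo sapiens TP53
```

Here "sp" marks a Swiss-Prot entry; a TrEMBL entry would carry "tr" in the same position.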

6) Retrieve the protein sequence of your protein of interest from the UniProt site and give its
file format.

FASTA format is used here.

>sp|P56593|CP2AC_MOUSE Cytochrome P450 2A12 OS=Mus musculus OX=10090 GN=Cyp2a12 PE=1 SV=2
MLGSGLLLLAILAFLSVMVLVSVWQQKIRGKLPPGPIPLPFIGNYLQLNRKDVYSSITQL
QEHYGPVFTIHLGPRRVVVLYGYDAVKEALVDHAEEFSGRGEQATFNTLFKGYGVAFSNG
ERAKQLRRFSIATLRDFGMGKRGVEERIQEEAGCLIKMLQGTCGAPIDPTIYLSKTASNV
ISSIVFGDRFNYEDKEFLSLLQMMGQVNKFAASPTGQLYDMFHSVMKYLPGPQQQIIKDS
HKLEDFMIQKVKQNQSTLDPNSPRDFIDSFLIHMQKEKYVNSEFHMKNLVMTSLNLFFAG
SETVSSTLRYGFLLLMKHPDVEAKVHEEIDRVIGRNRQPQYEDHMKMPYTQAVINEIQRF
SNFAPLGIPRRITKDTSFRGFFLPKGTEVFPILGSLMTDPKFFSSPKDFNPQHFLDDKGQ
LKKIPAFLPFSTGKRFCLGDSLAKMELFLFFTTILQNFRFKFPRKLEDINESPTPEGFTR
IIPKYTMSFVP

Questions that were addressed for structural data

1) Access the PDB site, retrieve the structure result page for Human insulin A, obtain
the file in PDB format and view its features by opening it with WordPad.

HEADER HORMONE 01-JUL-09 3I40
TITLE HUMAN INSULIN
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: INSULIN A CHAIN;
COMPND 3 CHAIN: A;
COMPND 4 ENGINEERED: YES;
COMPND 5 MOL_ID: 2;
COMPND 6 MOLECULE: INSULIN B CHAIN;
COMPND 7 CHAIN: B;
COMPND 8 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE 3 ORGANISM_COMMON: HUMAN;
SOURCE 4 ORGANISM_TAXID: 9606;
SOURCE 5 EXPRESSION_SYSTEM: ESCHERICHIA COLI;
SOURCE 6 EXPRESSION_SYSTEM_TAXID: 562;
...
HETATM 448 O HOH B 53 -27.581 8.994 -11.659 1.00 53.89 O
CONECT 43 76
CONECT 49 231
CONECT 76 43
CONECT 162 321
CONECT 231 49
CONECT 321 162
MASTER 279 0 0 4 0 0 0 6 438 2 6 5
END
(Heavily edited for compatibility purposes)
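
In an unedited PDB file, ATOM and HETATM records are fixed-width, so fields are read by column
position rather than by splitting on spaces. The sketch below rebuilds the water-oxygen record
from the listing above with explicit column widths and then slices it back apart. The function
name parse_atom is illustrative, and the column ranges follow the usual PDB layout
(for example x, y, z in columns 31-38, 39-46 and 47-54).

```python
def parse_atom(line):
    """Slice one ATOM/HETATM record by its fixed column positions."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }

# Rebuild the record with explicit widths so every field lands in its column
line = (
    "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
    "{:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}          {:>2}"
).format("HETATM", 448, " O", "", "HOH", "B", 53, "",
         -27.581, 8.994, -11.659, 1.00, 53.89, "O")

atom = parse_atom(line)
print(atom["res_name"], atom["chain"], atom["res_seq"], atom["x"], atom["y"], atom["z"])
```

Splitting on whitespace would work for this particular line, but it breaks on records where
adjacent fields touch, which is why column slicing is the safer choice for this format.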

CONCLUSION:

Knowledge of how to access various sequence and structure databases was obtained.
Moreover, the representation of sequence and structure data in various file formats was
understood.
