
Lecture 2

INTRODUCTION TO BIOLOGICAL
DATABASES
DATA & INFORMATION
DATA
Data is raw, unorganized
facts that need to be
processed.
Example:- Each student's
test score is one
piece of data.

INFORMATION
When data is processed,
organized, structured
or presented in a given
context so as to make it
useful, it is called
information.
Example:- The average
score of a class or of
the entire school is
information.
DATABASE???

A database is a
collection of data
in an organized
manner, which is
accessible in
various ways.

 Databases are a convenient system to properly store, search and retrieve any type of data.
WHAT ARE DATABASES?
 Structured collection of information.

 Consists of basic units called records or entries.

 Each record consists of fields, which hold pre-defined data related to the record.

 For example, a protein database would have protein entries as records and protein properties as fields (e.g., name of protein, length, amino-acid sequence).
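
The record/field idea can be sketched in a few lines of Python. This is only an illustration, not how any real protein database is implemented: the class name `ProteinRecord`, the two example entries and the helper `search_by_min_length` are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class ProteinRecord:
    """One database entry (record); each attribute is a field."""
    name: str
    length: int
    sequence: str


# A tiny in-memory "database": a structured collection of records.
# Both entries are made-up examples, not real proteins.
database = [
    ProteinRecord(name="example peptide A", length=5, sequence="MKTAY"),
    ProteinRecord(name="example peptide B", length=3, sequence="ACD"),
]


def search_by_min_length(records, min_length):
    """Retrieve every record whose 'length' field is at least min_length."""
    return [r for r in records if r.length >= min_length]


field_names = [f.name for f in fields(ProteinRecord)]
hits = search_by_min_length(database, 4)
```

Here `field_names` lists the pre-defined fields every record must carry, and `hits` shows retrieval by a field value.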
WHAT IS A BIOLOGICAL DATABASE?
 Biological databases are libraries of life sciences
information collected from scientific experiments,
published literature, high-throughput experiment
technology and computational analysis.
 They contain information from research areas such as genomics, proteomics and
microarray gene expression.
 Information contained in biological databases includes
gene structure, function, localization (both cellular and
chromosomal), biological sequences and structures.
WHAT ARE THE BIOLOGICAL
DATABASES ???

Minimum redundancy.

Easy retrieval of data.



Biological databases serve a critical purpose in the collation and organization of data related to biological systems.

They provide computational support and a user-friendly interface to a researcher for a meaningful analysis of biological data.
GENBANK/EMBL/DDBJ
INTERNATIONAL NUCLEOTIDE
SEQUENCE DATABASES

DDBJ: DNA Data Bank of Japan
CIB: Center for Information Biology and DNA Data Bank of Japan
NIG: National Institute of Genetics
IAM: International Advisory Meeting
ICM: International Collaborative Meeting
EMBL: European Molecular Biology Laboratory
EBI: European Bioinformatics Institute
NCBI: National Center for Biotechnology Information
NLM: National Library of Medicine
TYPES OF DATABASES
 Primary Databases

 Secondary Databases
PRIMARY VS. SECONDARY SEQUENCE
DATABASES
[Figure: primary vs. secondary sequence databases. GenBank (primary) receives sequences from labs and sequencing centers and is updated ONLY by submitters. RefSeq (curators, genome assembly) and UniGene (algorithms) are secondary databases that are updated continually by NCBI.]
PRIMARY DATABASES
 Contains bio-molecular data in its original form.
 Experimental results are submitted directly into the
database by researchers, and the data are essentially
archival in nature.
 Once given a database accession number, the data in
primary databases are not changed by the database staff;
only the submitters can update them.
GenBank (Genetic Sequence Databank) http://www.ncbi.nlm.nih.gov/genbank/
Database from NCBI (National Center for
Biotechnology Information); includes sequences from publicly
available resources.
It was established in the year 1982.
GenBank (Genetic Sequence Databank)

• DNA sequences can be submitted to GenBank using several different methods.
• It contains publicly available nucleotide sequences for more than 240,000
named organisms, obtained primarily through submissions from individual
laboratories and batch submissions from large-scale sequencing projects.
• It has a flat file structure that is a text file, readable & downloadable by both
humans and computers.
• There are two main ways of making batch sequence submissions to GenBank:
NCBI’s Barcode Submission Tool (BarSTool) and Sequin.
EMBL http://www.ebi.ac.uk/

 European Molecular Biology Laboratory
 Nucleic acid database from EBI (European Bioinformatics
Institute)
 Produced in collaboration with DDBJ and GenBank
 Search engine – SRS (Sequence Retrieval System)
EMBL

• The European Molecular Biology Laboratory (EMBL) is a molecular biology
research institution supported by 27 member states, one prospect and two associate
member states.
• EMBL was created in 1974 and is an intergovernmental organization funded by
public research money from its member states.
• The Laboratory operates from five sites: the main laboratory in Heidelberg, and
outstations in Hinxton (the European Bioinformatics Institute (EBI), in England),
Grenoble (France), Hamburg (Germany), and Monterotondo (near Rome).
• EMBL groups and laboratories perform basic research in molecular biology and
molecular medicine as well as training for scientists, students and visitors.
• Israel is the only Asian state that has full membership.
• The EMBL nucleotide sequence database incorporates and distributes
nucleotide sequences from public sources.
• The database is a part of an international collaboration with DDBJ
(Japan) and GenBank (USA).
• Data are exchanged between the collaborating databases on a daily
basis.
• The web-based tool, Webin, is the preferred system for individual
submission of nucleotide sequences, including Third Party
Annotation (TPA) and alignment data.
DDBJ http://www.ddbj.nig.ac.jp/
 DNA Data Bank of Japan
 Started in 1986 in collaboration with GenBank
 Produced and maintained at NIG (National Institute
of Genetics)
SWISS PROT http://www.ebi.ac.uk/uniprot/
 Annotated sequence database established in 1986
 Consists of sequence entries of different line formats
 Similar format to EMBL
 http://us.expasy.org/sprot/sprot-top.html
PIR http://pir.georgetown.edu/

 Protein Information Resource
 A division of the National Biomedical Research
Foundation (NBRF) in the U.S.
 One can search for entries or do a sequence similarity
search at the PIR site.
TREMBL http://www.ebi.ac.uk/trembl/
 Translated European Molecular Biology Laboratory
 Computer annotated supplement of SWISS PROT.
 Contains all the translations of EMBL nucleotide
sequence entries not yet integrated in SWISS PROT.
SECONDARY DATABASES
 Contains data derived from the results of analysing primary data
 Manually created or automatically generated
 Contains more relevant and useful information
structured to specific requirements
 Example :- PROSITE, PRINTS, BLOCKS, Pfam
SECONDARY DATABASES
SECONDARY DATABASE   PRIMARY SOURCE        INFORMATION STORED
PROSITE              SWISS PROT            Regular expressions
BLOCKS               PROSITE/PRINTS        Aligned motifs (blocks)
PRINTS               OWL (Composite DB)    Aligned motifs
Pfam                 SWISS PROT            Hidden Markov Models
Profile              SWISS PROT            Weighted matrices (profiles)
PROSITE http://ca.expasy.org/prosite/
Families of proteins
Can search using regular expressions
Similar to Unix commands using wildcards, etc.
E.g., [AC]-x-V-x(4)-{ED}
Interpreted as:
[Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
Families exhibit these patterns, so
we can search over families
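
The pattern notation above maps closely onto ordinary regular expressions. The following is a minimal sketch of that mapping, not the PROSITE tool itself: the function name `prosite_to_regex` is hypothetical, and it only handles the elements shown on this slide (`x`, `[...]`, `{...}` and repeat counts such as `x(4)`), not anchors or other features of the full syntax.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as x(4) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            regex = "."                       # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            regex = core                      # [AC] or a single letter such as V
        if repeat:
            regex += "{" + repeat + "}"
        parts.append(regex)
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
hit = re.search(regex, "GGAKVLLLLQ")    # made-up test sequence
miss = re.search(regex, "GGAKVLLLLE")   # ends in Glu, which {ED} forbids
```

For the slide's example the translated expression is `[AC].V.{4}[^ED]`, which can then be run over each sequence in a family.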
BLOCKS
 Motifs/blocks are created by automatically detecting the most conserved regions of each protein family.
PRINTS

 Most protein families are characterized not by one,
but by several conserved motifs
 Fingerprints are groups of conserved motifs excised
from sequence alignments
 Taken together, they provide diagnostic family
signatures. They are the basis of the PRINTS
database, and are stored in the form of aligned
motifs.
 Input about protein families is done manually
PFAM http://pfam.sanger.ac.uk/
Maintained by the Sanger Centre (Cambridge)
Protein families aligned using HMMs (Hidden Markov Models)
Given a new sequence, find families which the sequence might fit into
Sequence coverage: 11912 families
Split into Pfam-A (high quality) and Pfam-B (low quality)
THE NATIONAL CENTER FOR
BIOTECHNOLOGY INFORMATION

Bethesda, MD

Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
NCBI DATABASES AND
SERVICES

• GenBank: primary sequence database
• Free public access to biomedical literature
• PubMed: free Medline (3 million searches per day)
• PubMed Central: full text online access
• Entrez: integrated molecular and literature databases
• BLAST: highest volume sequence search service
(100–200 K searches per day)
FLAT FILE STORAGE DATA
FORMATS

• By the time GenBank, EMBL and DDBJ formed a collaboration (1986),
sequence databases had moved to a defined flat file format with a
shared feature table format and annotation standards.
• The flat file formats from the sequence databases are still used to
access and display sequence and annotation. They are also convenient
for storage of local copies.
TRADITIONAL GENBANK RECORD

ACCESSION U07418
VERSION U07418.1 GI:466461

Accession: stable, reportable, universal
Version: tracks changes in sequence
GI number: NCBI internal use

Well annotated; the sequence is the data
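
As a rough illustration of how the accession, version and GI number sit together on one line, the sketch below splits the VERSION line from this slide into its parts. The function name `parse_version_line` is hypothetical, and the code assumes the whitespace-separated layout shown above rather than the full GenBank flat file specification.

```python
def parse_version_line(line: str) -> dict:
    """Split a line such as 'VERSION U07418.1 GI:466461' into its parts."""
    tokens = line.split()
    if not tokens or tokens[0] != "VERSION":
        raise ValueError("not a VERSION line")
    # 'U07418.1' -> stable accession 'U07418' plus sequence version '1'.
    accession, _, version = tokens[1].partition(".")
    gi = None
    for token in tokens[2:]:
        if token.startswith("GI:"):
            gi = int(token[3:])
    return {
        "accession": accession,
        "version": int(version) if version else None,
        "gi": gi,
    }


parsed = parse_version_line("VERSION U07418.1 GI:466461")
```

The accession stays the same across updates, while the version suffix is the part that changes when the sequence changes.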


EMBL FORMAT

ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
RN   [1]
RX   MEDLINE; 92140371.
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of the
RT   gene product.";
ID - Identification.
AC - Accession number(s).
DT - Date.
DE - Description.
GN - Gene name(s).
OS - Organism species.
OG - Organelle.
OC - Organism classification.
RN - Reference number.
RP - Reference position.
RC - Reference comments.
RX - Reference cross-references.
RA - Reference authors.
RL - Reference location.
CC - Comments or notes.
DR - Database cross-references.
KW - Keywords.
FT - Feature table data.
SQ - Sequence header.
(blanks) - Sequence data.
// - Termination line.

Some entries do not contain all of the line types, and some line types occur many times in a single entry. Each entry must begin with an
identification line (ID) and end with a terminator line (//).
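
Because every line starts with a two-letter code, an entry can be read with very little code. The sketch below groups the lines of one entry by line type and enforces the ID and // rules stated above. It is a simplified illustration, not an official EMBL parser: the name `parse_embl_entry` is hypothetical, the sample entry is an abbreviated copy of the LISOD example, and storing sequence lines under the key "seq" is a choice made here.

```python
from collections import defaultdict

# Abbreviated version of the LISOD example entry.
ENTRY = """\
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
OS   Listeria ivanovii
//
"""


def parse_embl_entry(text: str) -> dict:
    """Group the lines of one flat file entry by their two-letter line code."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("ID"):
        raise ValueError("entry must begin with an ID line")
    if lines[-1].strip() != "//":
        raise ValueError("entry must end with a // terminator line")
    grouped = defaultdict(list)
    for line in lines[:-1]:
        code, content = line[:2], line[2:].strip()
        if code == "XX":          # spacer line, carries no data
            continue
        if not code.strip():      # leading blanks mark sequence data
            code = "seq"
        grouped[code].append(content)   # a line type may occur many times
    return dict(grouped)


parsed = parse_embl_entry(ENTRY)
```

Line types that occur several times, such as DT here, end up as lists with one item per line.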
PubMed
• PubMed is a free search engine accessing primarily the
MEDLINE database of references and abstracts on life sciences
and biomedical topics.
• The PubMed system was offered free to the public in
1997.
• The United States National Library of Medicine (NLM) at the
National Institutes of Health maintains the database as part of the
Entrez system of information retrieval.
• PMID is the unique identifier number used in PubMed.
• PMIDs are assigned to each article record when it enters the
PubMed system.
• The PMID# is always found at the end of a PubMed
citation.
• PubMed Central (PMC) is a free digital system that
archives publicly accessible full-text scholarly articles that
have been published within the biomedical and life
sciences journal literature.
• A "PubMed Mobile" option, providing access to a mobile
REFSEQ
 Curated transcripts and proteins

 Model transcripts and proteins

 Assembled Genomic Regions

 Chromosome records
 Human genome
 microbial
 organelle

ftp://ftp.ncbi.nih.gov/refseq/release/
SELECTED REFSEQ ACCESSION
NUMBERS
mRNAs and Proteins
NM_123456 Curated mRNA
NP_123456 Curated Protein
NR_123456 Curated non-coding RNA
XM_123456 Predicted mRNA
XP_123456 Predicted Protein
XR_123456 Predicted non-coding RNA
Gene Records
NG_123456 Reference Genomic Sequence
Chromosome
NC_123455 Microbial replicons, organelle
genomes, human chromosomes
AC_123455 Alternate assemblies
Assemblies
NT_123456 Contig
NW_123456 WGS Supercontig
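
Since the prefix alone tells you what kind of record an accession points to, the table above can be turned into a simple lookup. This is a sketch built only from the prefixes listed on this slide; the names `REFSEQ_PREFIXES` and `classify_refseq` are hypothetical, and RefSeq defines more prefixes than are shown here.

```python
# Prefix meanings copied from the table on this slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Microbial replicons, organelle genomes, human chromosomes",
    "AC": "Alternate assemblies",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}


def classify_refseq(accession: str) -> str:
    """Return the record type implied by a RefSeq-style accession prefix."""
    prefix, separator, _ = accession.partition("_")
    if not separator or prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"not a recognised RefSeq accession: {accession!r}")
    return REFSEQ_PREFIXES[prefix]


kind = classify_refseq("NM_123456")
```

The underscore is what separates these from GenBank accessions such as U07418, which the function rejects.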
REFSEQ BENEFITS
 Non-redundancy

 Updates to reflect current sequence data and biology

 Data validation

 Format consistency

 Distinct accession series

 Stewardship by NCBI staff and collaborators


SRS

• SRS is the Sequence Retrieval System
• Data retrieval tool developed by EBI
• Integrates 80 molecular biology DBs
• Open-source software (can be installed locally)
• SRS has an associated scripting language called Icarus
• Central resource for molecular biology data
• More than 250 databanks have been indexed, and there are more than 35 SRS
servers on the WWW (World Wide Web)
• Information retrieval
• Easy way to retrieve information from sequence and sequence-related
databases
• Possibility to search for multiple words/other criteria
• Linkage between different databases
• e.g. Find all primary structures with known three-dimensional structures
• Different types of database in SRS
• Sequence & structure
• DNA, protein, three-dimensional structures
• Sequence-related
• Gene-related
• Genome, mapping, mutations, transcription factors
• SNP
• Bibliographic
• SRS main toolbar tabs:
• Top Page: displays databases in different database groups
• Query: displays either the standard or extended query form
• Results or “the query manager”: maintains a history of all the results
obtained during a session
• Projects or “the project manager”: maintains a history of all queries
and views used during a session
• Views: allows a user to define a user specific view for one or more
databases
• Databanks: contains a list and some facts about the databases available
in the system
• Search terms in SRS
• SRS indexed fields can be searched using any of the following:
• Single word search
• Multiple word phrases
• Numbers and dates
• Regular expressions
• Wildcards
ENTREZ

• WWW-based data retrieval system.
• Developed by NCBI (National Center for
Biotechnology Information).
• Integrates information held in different DBs.
ENTREZ: A DISCOVERY SYSTEM
• Pre-computed and pre-compiled data.
• A potential “gold mine” of undiscovered relationships.
• Used less than expected.

[Figure: Entrez neighbors and links. PubMed abstracts (word weight), Taxonomy (phylogeny), 3-D Structure (VAST, related structures), Gene, Nucleotide sequences (BLAST, related sequences) and Protein sequences (BLAST, related sequences, BLink, domains) are connected by neighbors and hard links.]
Databases covered by Entrez

• Nucleic acid: GenBank, RefSeq.
• Protein seqs: SWISS-PROT, PIR, PDB.
• 3D structures: MMDB
• Genomes: Many sources
• PopSet: From GenBank
• OMIM: OMIM
• Taxonomy: NCBI taxonomy database
• Books: Bookshelf
• Literature: PubMed
‘TAKE HOME MESSAGE’ ADVANTAGES OF
DATA INTEGRATION
 More relevant inter-related information in one place
 Makes it easier to find additional relevant information related to your initial query
 Potentially find information indirectly linked, but relevant to your subject of interest
 Uncover non-obvious genetic features that explain phenotype or disease
 Easier to build a ‘story’ based on multiple pieces of biological evidence
