Biological Databases Lec 2,3
INTRODUCTION TO BIOLOGICAL
DATABASES
DATA & INFORMATION
DATA
Data is raw, unorganized
facts that need to be
processed.
Example:- Each student's
test score is one
piece of data.
INFORMATION
When data is processed,
organized, structured
or presented in a given
context so as to make it
useful, it is called
information.
Example:- The average
score of a class or of
the entire school is
information that can be
derived from the data.
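The distinction can be illustrated with a minimal sketch. The scores below are made-up values used only for illustration: each individual score is data, and the class average computed from them is information.

```python
from statistics import mean

# Data: raw, unorganized facts (hypothetical test scores, one per student).
scores = [72, 85, 90, 64, 79]

# Information: the data processed into something useful in context.
class_average = mean(scores)
print("Class average:", class_average)
```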
DATABASE???
A database is a
collection of data
in an organized
manner, which is
accessible in
various ways.
Minimum redundancy.
EMBL:
European Molecular Biology
Laboratory
EBI:
European Bioinformatics Institute
NCBI:
National Center for Biotechnology Information
NLM:
National Library of Medicine
TYPES OF DATABASES
Primary Databases
Secondary Databases
PRIMARY VS. SECONDARY SEQUENCE
DATABASES
[Figure: Labs and sequencing centers submit sequences to GenBank, which is updated ONLY by submitters. RefSeq (curators, genome assembly) and UniGene (algorithms) are updated continually by NCBI.]
PRIMARY DATABASES
GenBank (Genetic Sequence Databank)
Database from NCBI (National Center for
Biotechnology Information); includes sequences from publicly
available resources.
It was established in the year 1982.
Secondary database - Source database - Stored information
PRINTS - OWL (Composite DB) - Aligned motifs
Pfam - SWISS-PROT - Hidden Markov Models
Profile - SWISS-PROT - Weighted matrices (profile)
PROSITE http://ca.expasy.org/prosite/
Families of proteins
Can search using regular
expressions
Similar to unix commands using
wildcards, etc.
E.g., [AC]-x-V-x(4)-{ED}
Interpreted as:
[Ala or Cys]-any-Val-any-any-any-
any-{any but Glu or Asp}
Families exhibit these patterns, so
we can search across families.
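As a rough sketch of how such a pattern maps onto an ordinary regular expression, the function below translates the elements used in the example above. The name prosite_to_regex and the test sequences are hypothetical, and only the syntax shown on this slide is handled (bracket sets, x wildcards, repeat counts, and brace exclusions); it is not the PROSITE tools' own implementation.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as x(4) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                      # x = any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {ED} = any residue but E or D
        # [AC] and single letters are already valid regex syntax.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
# Hypothetical sequences, in one-letter amino acid codes.
hit = re.search(regex, "MKAGVLLLLKQ")
miss = re.search(regex, "AGVLLLLE")
```

Here regex is "[AC].V.{4}[^ED]"; the first sequence matches at "AGVLLLLK", while the second fails because the final position is Glu.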
BLOCKS
Motifs/blocks are created by automatically
detecting the most conserved regions of
each protein family.
PRINTS
Coverage
11912 families
Bethesda, MD
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession: stable, reportable, universal
Version: tracks changes in sequence
GI number: NCBI internal use
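To make the three identifiers concrete, here is a minimal sketch that splits a VERSION line like the one above into its accession, version, and GI number. The function name parse_version_line is hypothetical, and the code assumes the whitespace-separated layout shown on the slide.

```python
def parse_version_line(line: str) -> dict:
    """Split a GenBank-style VERSION line into its identifiers."""
    fields = line.split()
    # fields[0] is the keyword "VERSION"; fields[1] is accession.version
    accession, _, version = fields[1].partition(".")
    gi = None
    for field in fields[2:]:
        if field.startswith("GI:"):
            gi = int(field[3:])  # GI number, if present on the line
    return {"accession": accession, "version": int(version), "gi": gi}

record_ids = parse_version_line("VERSION U07418.1 GI:466461")
```

The accession stays the same across revisions of a record, while the version suffix is what changes when the sequence itself changes.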
well annotated
Listeria ivanovii
Bacteria; Firmicutes; Bacillus/Clostridium group;
Bacillus/Staphylococcus group; Listeria.
[1]
MEDLINE; 92140371.
Haas A., Goebel W.;
"Cloning of a superoxide dismutase gene from Listeria ivanovii by
functional complementation in Escherichia coli and characterization of
gene product.";
ID - Identification.
AC - Accession number(s).
DT - Date.
DE - Description.
GN - Gene name(s).
OS - Organism species.
OG - Organelle.
OC - Organism classification.
RN - Reference number.
RP - Reference position.
RC - Reference comments.
RX - Reference cross-references.
RA - Reference authors.
RL - Reference location.
CC - Comments or notes.
DR - Database cross-references.
KW - Keywords.
FT - Feature table data.
SQ - Sequence header.
// - Termination line.
Some entries do not contain all of the line types, and some line types occur many times in a single entry. Each entry must begin with an
identification line (ID) and end with a terminator line (//).
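The line-code layout described above can be read with a few lines of code. The sketch below groups the lines of one entry by their two-letter code and checks the two structural rules (starts with ID, ends with //). The function name parse_entry and the ID value in the sample are hypothetical; the OS, OC, and RA contents are taken from the entry shown earlier, and sequence data lines are not handled.

```python
from collections import defaultdict

def parse_entry(text: str) -> dict:
    """Group the lines of one flat-file entry by their two-letter code."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith("ID"):
        raise ValueError("entry must begin with an ID line")
    if lines[-1].strip() != "//":
        raise ValueError("entry must end with a // terminator line")
    fields = defaultdict(list)
    for line in lines[:-1]:
        code, content = line[:2], line[2:].strip()
        # A line type may occur many times, so collect contents in a list.
        fields[code].append(content)
    return dict(fields)

# Hypothetical, abbreviated entry for illustration only.
SAMPLE_ENTRY = """\
ID   EXAMPLE_ENTRY
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
RA   Haas A., Goebel W.;
//
"""

entry = parse_entry(SAMPLE_ENTRY)
```

Storing each line type as a list reflects the note above: some codes are absent from an entry, and others (such as OC here) occur more than once.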
PubMed
• PubMed is a free search engine accessing primarily the
MEDLINE database of references and abstracts on life sciences
and biomedical topics.
• The PubMed system was offered free to the public in
1997.
• The United States National Library of Medicine (NLM) at the
National Institutes of Health maintains the database as part of the
Entrez system of information retrieval.
• PMID is the unique identifier number used in PubMed.
• PMIDs are assigned to each article record when it enters the
PubMed system.
• The PMID# is always found at the end of a PubMed
citation.
• PubMed Central (PMC) is a free digital system that
archives publicly accessible full-text scholarly articles that
have been published within the biomedical and life
sciences journal literature.
• A "PubMed Mobile" option, providing access to a mobile
REFSEQ
Curated transcripts and proteins
Chromosome records
Human genome
microbial
organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
SELECTED REFSEQ ACCESSION
NUMBERS
mRNAs and Proteins
NM_123456 Curated mRNA
NP_123456 Curated Protein
NR_123456 Curated non-coding RNA
XM_123456 Predicted mRNA
XP_123456 Predicted Protein
XR_123456 Predicted non-coding RNA
Gene Records
NG_123456 Reference Genomic Sequence
Chromosome
NC_123455 Microbial replicons, organelle
genomes, human chromosomes
AC_123455 Alternate assemblies
Assemblies
NT_123456 Contig
NW_123456 WGS Supercontig
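Because the record type is encoded in the two-letter prefix, an accession can be classified with a simple lookup. The sketch below uses only the prefixes listed above; the names REFSEQ_PREFIXES and classify_refseq are hypothetical.

```python
# Prefix-to-type mapping, limited to the prefixes listed on this slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Microbial replicons, organelle genomes, human chromosomes",
    "AC": "Alternate assemblies",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}

def classify_refseq(accession: str):
    """Return the record type for a RefSeq-style accession, or None."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)

kind = classify_refseq("NM_123456")
```

An accession without an underscore, such as the GenBank accession U07418, returns None because it does not follow the RefSeq prefix convention.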
REFSEQ BENEFITS
Non-redundancy
Data validation
Format consistency
PubMed
abstracts