
INTRODUCTION TO BIOLOGICAL DATABASES
Three publicly accessible databases store large amounts of nucleotide and protein sequence data:
• GenBank at the National Center for Biotechnology Information (NCBI) of the National Institutes of Health (NIH) in Bethesda, USA
• DNA Database of Japan (DDBJ) at the National Institute of Genetics in Mishima, Japan
• European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database at the European Bioinformatics Institute (EBI) in Hinxton, England
GENBANK: DATABASE OF MOST KNOWN NUCLEOTIDE AND PROTEIN SEQUENCES
• database consisting of most known public DNA and protein sequences
• contains bibliographic and biological annotation
• Data from GenBank are available free of charge from the National Center for Biotechnology Information (NCBI) in the National Library of Medicine at the NIH
• URL of GenBank: https://www.ncbi.nlm.nih.gov/genbank/
Organisms in GenBank
• Over 260,000 different species are represented in GenBank, with over 1000 new species added per month
To help organize the available information, each sequence name in a GenBank record is followed by its data file division and primary accession number
• the following codes are used to designate the data file divisions:
Types of Data in GenBank
• Types of sequence data in GenBank and other databases using human beta globin as an example.
• Note that “globin” may refer to a gene or other DNA feature, an RNA transcript (or its corresponding complementary DNA), or a protein.
NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION (NCBI)
• NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information
• URL: https://www.ncbi.nlm.nih.gov/
Accession numbers are labels for sequences
• NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences.
• You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest.
• DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.
What is an accession number?

An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.

Examples (all for beta globin, HBB):

Molecule   Accession      Description
DNA        X02775         GenBank genomic DNA sequence
DNA        NG_000007.3    RefSeqGene
DNA        rs192792910    dbSNP (single nucleotide polymorphism)
RNA        AA970968.1     An expressed sequence tag (1 of 2,345)
RNA        NM_000518.4    RefSeq DNA sequence (from a transcript)
Protein    NP_000509.1    RefSeq protein
Protein    CAA00182.1     GenBank protein
Protein    Q14473         SwissProt protein
Protein    1YE0|B         Protein Data Bank structure record
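
Several of the example accessions end in a dot and a number (for instance NM_000518.4). In GenBank and RefSeq that suffix is the version of the record. A minimal Python sketch of separating the two parts is given here; the function name split_accession is hypothetical and chosen only for illustration.

```python
from typing import Optional, Tuple


def split_accession(accession: str) -> Tuple[str, Optional[int]]:
    """Split 'NM_000518.4' into ('NM_000518', 4).

    Accessions without a numeric '.version' suffix are returned
    unchanged with a version of None.
    """
    base, dot, version = accession.partition(".")
    if dot and version.isdigit():
        return base, int(version)
    return accession, None


print(split_accession("NM_000518.4"))
print(split_accession("X02775"))
```

Identifiers from other resources in the list, such as the dbSNP and Protein Data Bank entries, follow their own conventions and simply pass through with no version.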
NCBI’s important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.

RefSeq identifiers include the following formats:

Complete genome       NC_######
Complete chromosome   NC_######
Genomic contig        NT_######
mRNA (DNA format)     NM_######   e.g. NM_000518
Protein               NP_######   e.g. NP_000509
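
Because the RefSeq prefix encodes the record type, a short lookup is enough to classify an identifier. The sketch below covers only the four prefixes listed above; the names REFSEQ_PREFIXES and refseq_kind are hypothetical, and RefSeq defines further prefixes that are not handled here.

```python
# Prefix-to-type mapping taken from the list of RefSeq formats above.
REFSEQ_PREFIXES = {
    "NC_": "complete genome or chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA (DNA format)",
    "NP_": "protein",
}


def refseq_kind(accession: str) -> str:
    """Return the record type implied by the first three characters."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


print(refseq_kind("NM_000518"))
print(refseq_kind("NP_000509"))
```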
PubMed
• search service from the National Library of Medicine (NLM) that provides access to over 18 million citations in MEDLINE (Medical Literature Analysis and Retrieval System Online) and other related databases, with links to participating online journals.
• URL: https://www.ncbi.nlm.nih.gov/pubmed/
Entrez
• integrates the scientific literature, DNA and protein sequence databases, three-dimensional protein structure data, population study data sets, and assemblies of complete genomes into a tightly coupled system
• PubMed is the literature component of Entrez.
• URL: https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html
Access to sequences: Entrez Gene at NCBI
• Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.
• RefSeq provides a curated, optimal accession number for each DNA (NM_000518.4 for beta globin DNA corresponding to mRNA) or protein (NP_000509.1)
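
Once an accession number is known, the record can also be requested programmatically. The slides do not cover this; the sketch below assumes NCBI's E-utilities efetch endpoint and only builds the request URL, without contacting the server. The helper name efetch_fasta_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities efetch endpoint (not named in the slides).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str, db: str = "nuccore") -> str:
    """Build a URL asking efetch for one record in FASTA text format.

    db is "nuccore" for nucleotide records and "protein" for proteins.
    """
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH_BASE + "?" + urlencode(params)


print(efetch_fasta_url("NM_000518.4"))
print(efetch_fasta_url("NP_000509.1", db="protein"))
```

Opening either URL in a browser, or fetching it with any HTTP client, should return the sequence as FASTA text.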
From the NCBI home page, type “beta globin” and hit “Search”
Fig. 2.5
Page 28

Follow the link to “Gene”

Entrez Gene is in the header
Note the “Official Symbol” HBB for beta globin
Note the “limits” option
Entrez Gene (top of page):
Note a useful summary, and links to other databases
Entrez Protein: accession, organism, literature…
Entrez Protein: …features of a protein, and its sequence in the one-letter amino acid code
You should learn the one-letter amino acid code!

Name            3-Letter  1-Letter    Name            3-Letter  1-Letter
Alanine         Ala       A           Leucine         Leu       L
Arginine        Arg       R           Lysine          Lys       K
Asparagine      Asn       N           Methionine      Met       M
Aspartic acid   Asp       D           Phenylalanine   Phe       F
Cysteine        Cys       C           Proline         Pro       P
Glutamic acid   Glu       E           Serine          Ser       S
Glutamine       Gln       Q           Threonine       Thr       T
Glycine         Gly       G           Tryptophan      Trp       W
Histidine       His       H           Tyrosine        Tyr       Y
Isoleucine      Ile       I           Valine          Val       V
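
The table above translates directly into a lookup. As a minimal sketch, the following Python converts a peptide written in three-letter codes into the one-letter code; the hyphen-separated input format and the name three_to_one are assumptions made for this example.

```python
# Three-letter to one-letter codes, copied from the table above.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Glu": "E", "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def three_to_one(peptide: str, sep: str = "-") -> str:
    """Convert e.g. 'Met-Val-His' to 'MVH'.

    Raises KeyError if a residue is not one of the 20 standard codes.
    """
    residues = (res.strip().capitalize() for res in peptide.split(sep))
    return "".join(THREE_TO_ONE[res] for res in residues)


print(three_to_one("Met-Val-His-Leu"))
```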
Entrez Protein:
You can change the display (as shown)…
FASTA format: versatile, compact with one header line followed by a string of nucleotides or amino acids in the single letter code

Fig. 2.9
Page 32
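
The FASTA layout described above (a header line followed by sequence lines) can be read with a few lines of Python. This is a minimal sketch that assumes header lines start with ">", which is the usual FASTA convention; the function name parse_fasta and the toy record are made up for illustration and are not real database entries.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Return {header: sequence} for every record in a FASTA string."""
    records: Dict[str, List[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Toy input; the header and residues are illustrative only.
example = ">seq1 toy example\nMVHLT\nPEEKS\n"
print(parse_fasta(example))
```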
While FASTA is one file format, there are many others

FASTA   Sequences in one letter DNA or protein code
FASTQ   DNA sequences with quality scores for each base
BAM     compressed binary version of SAM
SAM     Sequence Alignment/Map file (tab-delimited)
VCF     variant call format (genomic variants; indels)

(See genome.ucsc.edu/FAQ/FAQformat.html for the following:)

BED     a table including chromosome, start, end
WIG     wiggle format (displays dense, continuous data)
GFF     General Feature Format (tab separated)

Besides Excel (.xls, .xlsx), spreadsheets can also be:

.txt    tab-delimited text file (or space delimited)
.csv    comma separated text file
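
As one illustration of a tabular format, here is a minimal sketch of reading the three required BED columns (chromosome, start, end). It assumes tab-separated fields and the BED convention that the start is 0-based and the end is exclusive, so the length is simply end minus start. The names BedInterval, parse_bed_line, and interval_length are hypothetical, and the coordinates are invented.

```python
from typing import NamedTuple


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def parse_bed_line(line: str) -> BedInterval:
    """Parse the first three tab-separated columns of one BED line."""
    fields = line.rstrip("\n").split("\t")
    return BedInterval(fields[0], int(fields[1]), int(fields[2]))


def interval_length(interval: BedInterval) -> int:
    """Number of bases covered by the interval."""
    return interval.end - interval.start


# Invented coordinates, for illustration only.
iv = parse_bed_line("chr11\t100\t250\n")
print(iv, interval_length(iv))
```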
BLAST
• BLAST (Basic Local Alignment Search Tool) is NCBI’s sequence similarity search tool designed to support analysis of nucleotide and protein databases
• set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA
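
Which member of the BLAST family applies depends on the type of the query and the type of the database. The slides do not name the individual programs; the sketch below uses the standard program names (blastn, blastp, blastx, tblastn) and a hypothetical helper, choose_blast_program, to make the pairing explicit. A fifth program, tblastx, compares translated nucleotide queries against translated nucleotide databases and is left out of this simple lookup.

```python
# (query type, database type) -> BLAST program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query is translated
    ("protein", "nucleotide"): "tblastn",  # database is translated
}


def choose_blast_program(query_type: str, database_type: str) -> str:
    """Pick a BLAST program for the given query and database types."""
    key = (query_type, database_type)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"unsupported combination: {key}")
    return BLAST_PROGRAMS[key]


print(choose_blast_program("protein", "protein"))
print(choose_blast_program("nucleotide", "protein"))
```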
OMIM
• Online Mendelian Inheritance in Man (OMIM) is a catalog of human genes and genetic disorders.
• created by Victor McKusick and his colleagues and developed for the World Wide Web by NCBI
• contains links to PubMed articles and sequence information
• URL: https://www.ncbi.nlm.nih.gov/omim
Taxonomy
• NCBI taxonomy website includes a taxonomy browser for the major divisions of living organisms (archaea, bacteria, eukaryota, and viruses).
• features taxonomy information such as genetic codes and taxonomy resources and additional information such as molecular data on extinct organisms and recent changes to classification schemes.
• URL: https://www.ncbi.nlm.nih.gov/taxonomy
Structure
• NCBI structure site maintains the Molecular Modelling Database (MMDB), a database of macromolecular three-dimensional structures, as well as tools for their visualization and comparative analysis
• MMDB contains experimentally determined biopolymer structures obtained from the Protein Data Bank (PDB)
• Structure resources at NCBI include PDBeast (a taxonomy site within MMDB), Cn3D (a three-dimensional structure viewer), and a vector alignment search tool (VAST) which allows comparison of structures.
• URL: https://www.ncbi.nlm.nih.gov/Structure/index.shtml
Looking for a gene, protein or sequence?
• Entrez Gene: https://www.ncbi.nlm.nih.gov/gene
• Entrez Protein: https://www.ncbi.nlm.nih.gov/protein/
• Entrez Nucleotide: https://www.ncbi.nlm.nih.gov/nuccore/
• UniGene: http://www.bioinfo.org.cn/relative/NCBI-UniGene.htm
• UniProt: https://www.uniprot.org/ - The Universal Protein Resource (UniProt) is the most comprehensive, centralized protein sequence catalog
