
Fundamentals of Sequence Analysis
Fourie Joubert

FASTA File Format

First line contains > followed by a short descriptor (most parsers
take the text directly after the > as the identifier, so do not
put a space there)
Sequence follows on the next lines, usually 60 or 80 characters
per line
Further records may follow, each starting with its own > line; a
blank line between records is optional

FASTA Example
>mysequence
ACGTCGATCGATCGATGCATCGTGCTAGCTACAGTCGATGCAT
CAGTCGATGCTAGCATGCTAGCTGCATCGATCGATGCTACGTA
CAGTCGATCGATGCAT
>mysequence2
ACCGTACGATGCTAGCTAGCTAGCTACAGTCAGTCGATGCTACG
CAGTCGTAGCATGCTAACGTCGATCGTA
>mysequence3
CAGTCAGTCGTAGCTAGCTAGCTAGCTAGGGGTATCGATGCTAA
CAGTACTTTGCATGCAGCATGCTAGCTAGCTAGCTA
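
The layout above is simple enough to read with a few lines of code. The following is a minimal sketch in Python, not a reference parser: the function name `parse_fasta` is hypothetical, and it assumes every record starts with a > line and that blank lines carry no meaning.

```python
def parse_fasta(text):
    """Return a list of (descriptor, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            # Tolerate an optional space after the '>' marker.
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined in order
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>mysequence
ACGTCGATCG
CAGTCGATGC
> mysequence2
ACCGTACGAT
"""
records = parse_fasta(example)
for name, sequence in records:
    print(name, len(sequence))
```

The sequence of each record is returned as one string, so the 60 or 80 character line wrapping does not affect downstream code.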

Genbank File Format

File Header
The first line in the file must have "GENETIC SEQUENCE DATA BANK" in spaces
20 through 46.
The next 8 lines may contain arbitrary text. They are ignored but are required to
maintain the GenBank format.
Sequence Data Entries
Each sequence entry in the file should have the following format:
1st line: Must have LOCUS in the first 5 spaces. The genetic locus name or
identifier must be in spaces 13 - 22. The length of the sequence is right
justified in spaces 23 through 29.
2nd line: Must have DEFINITION in the first 10 spaces. Spaces 13 - 80 are free
form text to identify the sequence.
3rd line: Must have ACCESSION in the first 9 spaces. Spaces 13 - 18 must hold
the primary accession number.
4th line: Must have ORIGIN in the first 6 spaces. Nothing else is required on this
line; it indicates that the nucleic acid sequence begins on the next line.
5th line: Begins the nucleotide sequence. The first 9 spaces of each sequence
line may either be blank or may contain the position in the sequence of the first
nucleotide on the line. The next 66 spaces hold the nucleotide sequence in six
blocks of ten nucleotides. Each of the six blocks begins with a blank space
followed by ten nucleotides. Thus the first nucleotide is in space eleven of the
line while the last is in space 75.
Last line: Must have // in the first 2 spaces to indicate termination of the
sequence.
NOTE: Multiple sequences may appear in each file. To begin another sequence,
go back to the LOCUS line (1st line) and start again.
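
The fixed-column rule for sequence lines (9 spaces for the position, then six blocks of a blank plus ten nucleotides) can be checked with a short sketch. The helper names below are hypothetical, and the code only covers the sequence lines as described above, not a full GenBank record.

```python
def format_sequence_line(position, nucleotides):
    """Lay out up to 60 nucleotides as one fixed-column sequence line."""
    if len(nucleotides) > 60:
        raise ValueError("at most 60 nucleotides fit on one line")
    blocks = [nucleotides[i:i + 10] for i in range(0, len(nucleotides), 10)]
    # Columns 1-9: right-justified position; each block is led by a blank.
    return f"{position:>9}" + "".join(" " + block for block in blocks)


def parse_sequence_line(line):
    """Recover the nucleotides from columns 10-75 of a sequence line."""
    return line[9:75].replace(" ", "")


row = "".join(["ttaatctcga", "atctgggaaa", "aatctgagtg",
               "gaaaagtcga", "cggcgagcct", "ccagtcatcg"])
line = format_sequence_line(1, row)
print(line)
print(len(line), line[10], line[74])
```

A full line is 75 characters long, with the first nucleotide in column 11 and the last in column 75, which matches the description.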

Genbank Example
LOCUS       NM_079846 1190 bp mRNA linear INV 15-DEC-2001
DEFINITION  Drosophila melanogaster Triose phosphate isomerase (Tpi), mRNA.
ACCESSION   NM_079846
VERSION     NM_079846.1 GI:17864111
KEYWORDS    .
SOURCE      fruit fly.
  ORGANISM  Drosophila melanogaster
            Eukaryota; Metazoa; Arthropoda; Tracheata; Hexapoda; Insecta;
            Pterygota; Neoptera; Endopterygota; Diptera; Brachycera;
            Muscomorpha; Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1 (bases 1 to 1190)
  AUTHORS   Shaw-Lee,R.L., Lissemore,J.L. and Sullivan,D.T.
  TITLE     Structure and expression of the triose phosphate isomerase (Tpi) gene of
            Drosophila melanogaster
  JOURNAL   Mol. Gen. Genet. 230 (1-2), 225-229 (1991)
  MEDLINE   92079900
   PUBMED   1720860
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI
            review. The reference sequence was derived from AE003772.1.
FEATURES             Location/Qualifiers
     source          1..1190
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
                     /chromosome="3"
                     /map="99E1-99E2"
     gene            1..1190
                     /gene="Tpi"
                     /note="TPI; TPIS; CG2171; CT6334"
                     /db_xref="FLYBASE:FBgn0003738"
                     /db_xref="LocusID:43582"
     CDS             181..924
                     /gene="Tpi"
                     /EC_number="5.3.1.1"
                     /note="Nucleotide sequence of the Celera sequence differs from the published
                     sequence for this transcript."
                     /codon_start=1
                     /db_xref="FLYBASE:FBgn0003738"
                     /db_xref="LocusID:43582"
                     /product="Triose phosphate isomerase"
                     /protein_id="NP_524585.1"
                     /db_xref="GI:17864112"
                     /translation="MSRKFCVGGNWKMNGDQKSIAEIAKTLSSAALDPNTEVVIGCPA
                     IYLMYARNLLPCELGLAGQNAYKVAKGAFTGEISPAMLKDIGADWVILGHSERRAIFG
                     ESDALIAEKAEHALAEGLKVIACIGETLEEREAGKTNEVVARQMCAYAQKIKDWKNVV
                     VAYEPVWAIGTGQTATPDQAQEVHAFLRQWLSDNISKEVSASLRIQYGGSVTAANAKE
                     LAKKPDIDGFLVGGASLKPEFVDIINARQ"
     misc_feature    187..921
                     /note="TIM; Region: Triosephosphate isomerase"
BASE COUNT 279 a 368 c 323 g 220 t
ORIGIN
        1 ttaatctcga atctgggaaa aatctgagtg gaaaagtcga cggcgagcct ccagtcatcg
       61 agttacccac ttgaaattat cagttccaaa cactctaata gcagtcccct tgttttgtcc
      121 cccgatccgc agttctacgc caatttcagc accgattgca ccgacagcaa cagcaacaac
      181 atgagccgaa agttctgcgt gggaggcaac tggaagatga acggcgacca gaagtccatc
      241 gccgagatcg ccaagaccct gagctcggcc gccctcgacc ccaacacgga ggtggtcatc
      301 ggctgcccgg ccatctacct gatgtacgcc cgcaacctgc tgccctgcga gctgggtctg
      361 gccggccaga atgcctacaa ggtggccaag ggcgcattca ccggcgagat ctcccctgcg
      421 atgctgaagg
//

EMBL File Format

Unlike the GenBank file format the EMBL file format does not require a series
of header lines. Thus the first line in the file begins the first sequence entry
of the file.
The first line of each sequence entry contains the two letters ID in the first
two spaces. This is followed by the EMBL identifier in spaces 6 through 14.
The second line of each sequence entry has the two letters AC in the first two
spaces. This is followed by the accession number in spaces 6 through 11.
The third line of each sequence entry has the two letters DE in the first two
spaces. This is followed by a free form text definition in spaces 6 through 72.
The fourth line in each sequence entry has the two letters SQ in the first two
spaces. This is followed by the length of the sequence beginning at or after
space 13. After the sequence length there is a blank space and the two
letters BP.
The nucleotide sequence begins on the fifth line of the sequence entry. Each
line of sequence begins with four blank spaces. The next 66 spaces hold the
nucleotide sequence in six blocks of ten nucleotides. Each of the six blocks
begins with a blank space followed by ten nucleotides. Thus the first
nucleotide is in space 6 of the line while the last is in space 70.
The last line of each sequence entry in the file is a terminator line which has
the two characters // in the first two spaces.
Multiple sequences may appear in each file. To begin another sequence, go
back to the ID line and start again.
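
As a rough illustration of the line layout just described, the sketch below reads the ID, AC, DE and SQ lines by their two-letter codes and column positions. It is a simplified reader written for this description only (the function name `parse_embl` is hypothetical); real EMBL entries carry many more line types, which this code silently skips.

```python
def parse_embl(text):
    """Collect identifier, accession, definition, length and sequence."""
    entries, entry = [], None
    for line in text.splitlines():
        code = line[:2]
        if code == "ID":
            entry = {"id": line[5:14].strip(), "accession": None,
                     "definition": None, "length": None, "sequence": ""}
        elif entry is None:
            continue  # ignore anything before the first ID line
        elif code == "AC":
            entry["accession"] = line[5:11].strip()   # spaces 6-11
        elif code == "DE":
            entry["definition"] = line[5:72].strip()  # spaces 6-72
        elif code == "SQ":
            entry["length"] = int(line[12:].split()[0])  # at or after space 13
        elif code == "//":
            entries.append(entry)
            entry = None
        elif code.strip() == "" and line.strip():
            # Sequence line: four blanks, then blocks of blank + 10 bases.
            entry["sequence"] += line[4:70].replace(" ", "")
    return entries


sample = "\n".join([
    "ID   DMTPIG",
    "AC   X57576",
    "DE   D.melanogaster Tpi gene for Triosephosphate isomerase",
    "SQ" + " " * 10 + "20 BP",
    " " * 4 + " gatctcgagc gagaaatgtg",
    "//",
])
entries = parse_embl(sample)
print(entries[0]["id"], entries[0]["length"], entries[0]["sequence"])
```

Comparing the stated length on the SQ line with the number of nucleotides actually read is a cheap consistency check when converting files.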

EMBL Example
ID   DMTPIG     standard; DNA; INV; 3419 BP.
XX
AC   X57576; S70377;
XX
SV   X57576.1
XX
DT   20-JAN-1992 (Rel. 30, Created)
DT   19-AUG-1996 (Rel. 49, Last updated, Version 10)
XX
DE   D.melanogaster Tpi gene for Triosephosphate isomerase
XX
KW   glycolytic enzyme; tpi gene; triosephosphate isomerase.
XX
OS   Drosophila melanogaster (fruit fly)
OC   Eukaryota; Metazoa; Arthropoda; Tracheata; Hexapoda; Insecta; Pterygota;
OC   Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Ephydroidea;
OC   Drosophilidae; Drosophila.
XX
RN   [1]
RP   1-3419
RA   Sullivan D.T.;
RT   ;
RL   Submitted (07-FEB-1991) to the EMBL/GenBank/DDBJ databases.
RL   D.T. Sullivan, Biological Research Laboratories, 130 College Pl, Syracuse
RL   University, Syracuse, NY 13244, USA
XX
RN   [3]
RX   MEDLINE; 92079900.
RA   Shaw-Lee R.L., Lissemore J.L., Sullivan D.T.;
RT   "Structure and expression of the triose phosphate isomerase (Tpi) gene of
RT   Drosophila melanogaster.";
RL   Mol. Gen. Genet. 230:225-229(1991).
XX
DR   FLYBASE; FBgn0003738; Tpi.
DR   SWISS-PROT; P29613; TPIS_DROME.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..3419
FT                   /db_xref="taxon:7227"
FT                   /germline
FT                   /organism="Drosophila melanogaster"
FT                   /strain="Oregon-R"
FT                   /clone_lib="EMBL-4"
FT   CDS             join(2237..2773,2830..3036)
FT                   /db_xref="FLYBASE:FBgn0003738"
FT                   /db_xref="SWISS-PROT:P29613"
FT                   /gene="Tpi"
FT                   /EC_number="5.3.1.1"
FT                   /product="triosephosphate isomerase"
FT                   /protein_id="CAA40804.1"
FT                   /translation="MSRKFCVGGNWKMNGDQKSIAEIAKTLSSAALDPNTEVVIGCPAI
FT                   YLMYARNLLPCELGLAGQNAYKVAKGAFTGEISPAMLKDIGADWVILGHSERRAIFGES
FT                   DALIAEKAEHALAEGLKVIACIGETLEEREAGKTNEVVARQMCAYAQKIKDWKNVVVAY
FT                   EPVWAIGTGKTATPDQAQEVHASLRQWLSDNISKEVSASLRIQYGGSVTAANAKELAKK
FT                   PDIDGFLVGGASLKPEFLDIINARQ"
FT   mRNA            join(2004..2028,2186..2773,2830..3036)
FT                   /gene="Tpi"
FT   prim_transcript 2004..3296
FT   exon            2008..2032
FT                   /number=1
FT   exon            2189..2773
FT                   /number=2
FT   exon            2830..3296
FT                   /number=3
FT   intron          2033..2188
FT                   /number=1
FT   intron          2774..2829
FT                   /number=2
FT   misc_feature    2147..2151
FT                   /note="intron 1 lariat sequence"
FT   misc_feature    2789..2793
FT                   /note="intron 2 lariat sequence"
FT   polyA_signal    3258..3262
XX
SQ   Sequence 3419 BP; 855 A; 933 C; 849 G; 778 T; 4 other;
     gatctcgagc gagaaatgtg gaacatagtg gaggcctcca gtggcgccga gctgggtgaa        60
     accagctacg agttcccttc ccccgctccg gttcccagcg cagcagtgaa cgaaatagca       120
     gttccacagt cccaccagct cctcctgctc ctgcgaagcc ctcagttccg tccgcctcct       180
     atgacaacca caactacagt ttcagccagg atgaggacga agatgatgat gatctggagt       240
     ttgaggacgt attcgtgccg gccagctctg ttccaaatcc cgttcagcct ggcatagatc       300
     ccgtggaact gcgtcgctcc ctggctttgg tcatgaggga gaaattgcga tcggatgaca       360
     cggactccag gccaatgggc aacaatcagg atcttcccat agatgaacag tccagggaga       420
     gaccgctctc cactcaaaca tctcccacaa atggcccact tccggctctt ctgagggcca       480
     aactgcttgc tgggcaactc nnnncaatag cgctcactgc ctgccaggat ccacggcgag       540
     tcctgctccc caggagcaat ccggtatctt tgtgatcgat agtgaggcga gtcccggctc       600
     aaatgggcac aagcctaagt atcgaaaggg cacggcattc actcggagtt cgctgaagaa       660
     gagccgatcc tgcaactgta gctccatcgc taagggacga ggggtccacg acgagcccag       720
     cagtaatctc tgcagggatc aggagtcctc tgtacttcca cagcatccgc agccagccaa       780
     ccatcccaca gagaactttt
//

PHYLIP File Format

Interleaved and Sequential formats


The sequences can continue over multiple lines;
when this is done the sequences must be
either in "interleaved" format, similar to the
output of alignment programs, or "sequential"
format. These are described in the main
document file. In sequential format all of one
sequence is given, possibly on multiple lines,
before the next starts. In interleaved format the
first part of the file should contain the first
part of each of the sequences, then possibly a line
containing nothing but a carriage-return
character, then the second part of each
sequence, and so on. Only the first parts of the
sequences should be preceded by names.
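
The difference between the two layouts is easiest to see by reading an interleaved file and writing it back out sequentially. The sketch below is illustrative only: the function names are hypothetical, and it assumes the name occupies the first 10 characters of each line in the first block, as in the examples that follow.

```python
def read_interleaved(text, name_width=10):
    """Return (names, sequences, nchar) from interleaved PHYLIP text."""
    lines = text.splitlines()
    ntax, nchar = (int(field) for field in lines[0].split()[:2])
    rows = [line for line in lines[1:] if line.strip()]  # drop separators
    names, seqs = [], []
    for i, row in enumerate(rows):
        if i < ntax:
            # Only the first part of each sequence is preceded by a name.
            names.append(row[:name_width].strip())
            seqs.append(row[name_width:].replace(" ", ""))
        else:
            seqs[i % ntax] += row.replace(" ", "")
    return names, seqs, nchar


def write_sequential(names, seqs, name_width=10):
    """Write each sequence in full before the next one starts."""
    out = [f"{len(names)} {len(seqs[0])}"]
    for name, seq in zip(names, seqs):
        out.append(name.ljust(name_width) + seq)
    return "\n".join(out)


interleaved = "\n".join([
    "2 20",
    "a121".ljust(10) + "MNTTNCFIAL",
    "a241".ljust(10) + "MNTTDCFIAL",
    "",
    "VHAIREIRAF",
    "VTAIREIRAF",
])
names, seqs, nchar = read_interleaved(interleaved)
print(write_sequential(names, seqs))
```

Checking that every sequence has exactly `nchar` characters after reading is a useful guard, since a miscounted block silently shifts an alignment.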

Interleaved
18 206
a121      MNTTNCFIAL VHAIREIRAF FLSRATG-KM EFTLYNGERK TFYSRPNNHD
a241      MNTTDCFIAL VTAIREIRAF FLPRATG-RM EFTLHNGERK VFYSRPNNHD
c-s8c1    MNTTDCFIAV VNAIKEVRAL FLPRTAG-KM EFTLHDGEKK VFYSRPNNHD
c1nov     MNTTDCFIAV VNAIREIRAL FLPRTTG-KM EFTLHDGEKK VFYSRPNNHD
o1brazl   MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
o1campos  MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
o1kauf    MNTTDCFIAL VQAIREIKAL FLSRTTG-KM ELTLYNGEKK TFYSRPNNHD
ken1-76   MNTTDCFIAL LRAFREIKTL FLSRVRG-KM EFTLYNGEKK TFYSRPNNHD
ken34-84  MNTTDCFIAL VRAIREFKIL FSLRPLARKM EFTLYNGIKK TFYSRPNKHD
ken       MNTTDCFIAL VQAIREIKLL FKG--IR-KM KLTLYNGEKK TFYSRPNSHD
uga97-1   MNTTDCFIAL VQAIREIKSL FRS--SR-KM EFTLYNGEKK TFYSRPNNHD
bec1-65   MKTTDCFNVL FEIFHRFGQT FKA--DR-KM EFTLYNGEKK TFYSRPNTHG
zim88-3   MKTTDCFDVL LEIFHRFRQT FKT--DR-KM EFTLYNGEKK TFYSRPNTHG
knp10-90  MKTTDCFNVL LETFHRFRNV FKT--DR-KM EFTLYNGDKK TFYSRPNTHG
zim96-3   MKTTGCFDVL IEIAHRLRQL NKT--DR-KM EFTLYNGEKK TFYSRPNTHG
zim7-83   MKTTDCFNVL LEIIYRFRHT FKT--DR-KM EFTLYNGEKK TFYSRPNKHG
knp196-9  MKTTDCFSVL FEIFHRLRHT LKT--ER-KM EFTLYNGERK TFYSRPNKHG
zam4-96   MKTTDCFDAL LEAFHRLRQT FKT--DR-KM EFTLYNGEKK TFYSRPNRHG

          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LAAIKQLEEL TGLELHEGGP
          NCWLNTILQL FRYVGEPFFD WVYDSPENLT LEAIEQLEEL TGLELHEGGP
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          NCWLNAILQL FRYVEEPFFD WVYSSPENLT LEAIKQLEDL TGLELHEGGP
          NCWLNAILQL FRYVDEPFFE WVYDSPENLT VEAIRQLEEL TGLELHEGGP
          NCWLNAILQL FRYVDEPFFD WVYESPENLT IQAIGQLEEL TGLDLREGGP
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LRAIEQLEEL TGLELREGGP
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LQAIEQLEEL TGLELHEGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKRLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP

          PALVIWNIKH LLQTGIGTAS RPAR-CMVDG TNMCLADFHA GIFLKEQEHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TNMCLADFHA GIFLKGQEHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGREHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGQEHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFLKGQEHA

Sequential
18 206 YF
a121      MNTTNCFIAL VHAIREIRAF FLSRATG-KM EFTLYNGERK TFYSRPNNHD
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LAAIKQLEEL TGLELHEGGP
          PALVIWNIKH LLQTGIGTAS RPAR-CMVDG TNMCLADFHA GIFLKEQEHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN GGWKANVQRK
          LK---
a241      MNTTDCFIAL VTAIREIRAF FLPRATG-RM EFTLHNGERK VFYSRPNNHD
          NCWLNTILQL FRYVGEPFFD WVYDSPENLT LEAIEQLEEL TGLELHEGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TNMCLADFHA GIFLKGQEHA
          VFACVTSNGW YAIDDDDFYP WTPDPSDVLV FVPYDQEPLN GEWKTKVQQK
          LK---
c-s8c1    MNTTDCFIAV VNAIKEVRAL FLPRTAG-KM EFTLHDGEKK VFYSRPNNHD
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGREHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN EGWKASVQRK
          LKGAGQ
c1nov     MNTTDCFIAV VNAIREIRAL FLPRTTG-KM EFTLHDGEKK VFYSRPNNHD
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGQEHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN EGWKANVQRK
          LKGAGQ
o1brazl   MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFLKGQEHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN GEWKAKVQRK
          LK---
o1campos  MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFLKGQEHA
          VFAC

PDB File Format


COLUMNS    DATA TYPE      FIELD        DEFINITION
--------------------------------------------------------------------------------
 1 -  6    Record name    "ATOM  "
 7 - 11    Integer        serial       Atom serial number.
13 - 16    Atom           name         Atom name.
17         Character      altLoc       Alternate location indicator.
18 - 20    Residue name   resName      Residue name.
22         Character      chainID      Chain identifier.
23 - 26    Integer        resSeq       Residue sequence number.
27         AChar          iCode        Code for insertion of residues.
31 - 38    Real(8.3)      x            Orthogonal coordinates for X in Angstroms.
39 - 46    Real(8.3)      y            Orthogonal coordinates for Y in Angstroms.
47 - 54    Real(8.3)      z            Orthogonal coordinates for Z in Angstroms.
55 - 60    Real(6.2)      occupancy    Occupancy.
61 - 66    Real(6.2)      tempFactor   Temperature factor.
73 - 76    LString(4)     segID        Segment identifier, left-justified.
77 - 78    LString(2)     element      Element symbol, right-justified.
79 - 80    LString(2)     charge       Charge on the atom.
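
Because every field has a fixed column range, an ATOM record is read by slicing rather than by splitting on whitespace. The sketch below applies the column table above directly; the function names are hypothetical, and the code handles only the ATOM fields listed, with no validation of other record types.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z,
                occupancy, temp_factor, element):
    """Build an ATOM line with each field in its fixed columns."""
    # Atom names shorter than four characters conventionally start in column 14.
    name_field = name.ljust(4) if len(name) == 4 else (" " + name).ljust(4)
    return (f"{'ATOM':<6}{serial:>5} {name_field} {res_name:>3} {chain}"
            f"{res_seq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}"
            f"{occupancy:6.2f}{temp_factor:6.2f}{'':10}{element:>2}")


def parse_atom(line):
    """Slice an ATOM line by column (0-based slices of 1-based columns)."""
    return {
        "serial": int(line[6:11]),          # columns 7-11
        "name": line[12:16].strip(),        # columns 13-16
        "altLoc": line[16].strip(),         # column 17
        "resName": line[17:20].strip(),     # columns 18-20
        "chainID": line[21],                # column 22
        "resSeq": int(line[22:26]),         # columns 23-26
        "iCode": line[26].strip(),          # column 27
        "x": float(line[30:38]),            # columns 31-38
        "y": float(line[38:46]),            # columns 39-46
        "z": float(line[46:54]),            # columns 47-54
        "occupancy": float(line[54:60]),    # columns 55-60
        "tempFactor": float(line[60:66]),   # columns 61-66
        "element": line[76:78].strip(),     # columns 77-78
    }


line = format_atom(1, "N", "ASP", "A", 35, 34.731, -5.686, 15.000,
                   1.00, 98.44, "N")
atom = parse_atom(line)
print(line)
print(atom["resName"], atom["resSeq"], atom["x"], atom["y"], atom["z"])
```

Slicing matters in practice: neighbouring numeric fields can touch with no separating blank when values are large, so whitespace splitting is not reliable for PDB records.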

PDB Example
HEADER    LYASE                                   06-JUL-99   1QU4
TITLE     CRYSTAL STRUCTURE OF TRYPANOSOMA BRUCEI ORNITHINE
TITLE    2 DECARBOXYLASE
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: ORNITHINE DECARBOXYLASE;
COMPND   3 CHAIN: A, B, C, D;
COMPND   4 EC: 4.1.1.17;
COMPND   5 ENGINEERED: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: TRYPANOSOMA BRUCEI;
SOURCE   3 EXPRESSION_SYSTEM: ESCHERICHIA COLI;
SOURCE   4 EXPRESSION_SYSTEM_COMMON: BACTERIA;
SOURCE   5 EXPRESSION_SYSTEM_STRAIN: B21/DG3;
SOURCE   6 EXPRESSION_SYSTEM_VECTOR_TYPE: PLASMID
KEYWDS    POLYAMINE METABOLISM, PYRIDOXAL 5'-PHOSPHATE, ALPHA-BETA
KEYWDS   2 BARREL, LYASE
EXPDTA    X-RAY DIFFRACTION
AUTHOR    N.V.GRISHIN,A.L.OSTERMAN,H.B.BROOKS,M.A.PHILLIPS,
AUTHOR   2 E.J.GOLDSMITH
REVDAT   2   29-DEC-99 1QU4    1       JRNL   COMPND REMARK
REVDAT   1   17-NOV-99 1QU4    0
JRNL        AUTH   N.V.GRISHIN,A.L.OSTERMAN,H.B.BROOKS,M.A.PHILLIPS,
JRNL        AUTH 2 E.J.GOLDSMITH
JRNL        TITL   X-RAY STRUCTURE OF ORNITHINE DECARBOXYLASE FROM
JRNL        TITL 2 TRYPANOSOMA BRUCEI: THE NATIVE STRUCTURE AND THE
JRNL        TITL 3 STRUCTURE IN COMPLEX WITH
JRNL        TITL 4 ALPHA-DIFLUOROMETHYLORNITHINE
JRNL        REF    BIOCHEMISTRY                  V.  38 15174 1999
JRNL        REFN   ASTM BICHAW  US ISSN 0006-2960
REMARK   1
REMARK   2
REMARK   2 RESOLUTION. 2.90 ANGSTROMS.

DBREF  1QU4 A    1   425  SWS    P07805   DCOR_TRYBB      21    445
DBREF  1QU4 B    1   425  SWS    P07805   DCOR_TRYBB      21    445
DBREF  1QU4 C    1   425  SWS    P07805   DCOR_TRYBB      21    445
DBREF  1QU4 D    1   425  SWS    P07805   DCOR_TRYBB      21    445
SEQRES   1 A  425  GLY ALA MET ASP ILE VAL VAL ASN ASP ASP LEU SER CYS
SEQRES   2 A  425  ARG PHE LEU GLU GLY PHE ASN THR ARG ASP ALA LEU CYS
SEQRES   3 A  425  LYS LYS ILE SER MET ASN THR CYS ASP GLU GLY ASP PRO
SEQRES   4 A  425  PHE PHE VAL ALA ASP LEU GLY ASP ILE VAL ARG LYS HIS
SEQRES   5 A  425  GLU THR TRP LYS LYS CYS LEU PRO ARG VAL THR PRO PHE
SEQRES   6 A  425  TYR ALA VAL LYS CYS ASN ASP ASP TRP ARG VAL LEU GLY
SEQRES   7 A  425  THR LEU ALA ALA LEU GLY THR GLY PHE ASP CYS ALA SER
SEQRES   8 A  425  ASN THR GLU ILE GLN ARG VAL ARG GLY ILE GLY VAL PRO
SEQRES   9 A  425  PRO GLU LYS ILE ILE TYR ALA ASN PRO CYS LYS GLN ILE
SEQRES  10 A  425  SER HIS ILE ARG TYR ALA ARG ASP SER GLY VAL ASP VAL
SEQRES  11 A  425  MET THR PHE ASP CYS VAL ASP GLU LEU GLU LYS VAL ALA
SEQRES  12 A  425  LYS THR HIS PRO LYS ALA LYS MET VAL LEU ARG ILE SER
SEQRES  13 A  425  THR ASP ASP SER LEU ALA ARG CYS ARG LEU SER VAL LYS
SEQRES  14 A  425  PHE GLY ALA LYS VAL GLU ASP CYS ARG PHE ILE LEU GLU
SEQRES  15 A  425  GLN ALA LYS LYS LEU ASN ILE ASP VAL THR GLY VAL SER
SEQRES  16 A  425  PHE HIS VAL GLY SER GLY SER THR ASP ALA SER THR PHE
SEQRES  17 A  425  ALA GLN ALA ILE SER ASP SER ARG PHE VAL PHE ASP MET
SEQRES  18 A  425  GLY THR GLU LEU GLY PHE ASN MET HIS ILE LEU ASP ILE
SEQRES  19 A  425  GLY GLY GLY PHE PRO GLY THR ARG ASP ALA PRO LEU LYS
SEQRES  20 A  425  PHE GLU GLU ILE ALA GLY VAL ILE ASN ASN ALA LEU GLU
SEQRES  21 A  425  LYS HIS PHE PRO PRO ASP LEU LYS LEU THR ILE VAL ALA
SEQRES  22 A  425  GLU PRO GLY ARG TYR TYR VAL ALA SER ALA PHE THR LEU
SEQRES  23 A  425  ALA VAL ASN VAL ILE ALA LYS LYS VAL THR PRO GLY VAL
SEQRES  24 A  425  GLN THR ASP VAL GLY ALA HIS ALA GLU SER ASN ALA GLN
SEQRES  25 A  425  SER PHE MET TYR TYR VAL ASN ASP GLY VAL TYR GLY SER
SEQRES  26 A  425  PHE ASN CYS ILE LEU TYR ASP HIS ALA VAL VAL ARG PRO
SEQRES  27 A  425  LEU PRO GLN ARG GLU PRO ILE PRO ASN GLU LYS LEU TYR
SEQRES  28 A  425  PRO SER SER VAL TRP GLY PRO THR CYS ASP GLY LEU ASP
SEQRES  29 A  425  GLN ILE VAL GLU ARG TYR TYR LEU PRO GLU MET GLN VAL
SEQRES  30 A  425  GLY GLU TRP LEU LEU PHE GLU ASP MET GLY ALA TYR THR
SEQRES  31 A  425  VAL VAL GLY THR SER SER PHE ASN GLY PHE GLN SER PRO
SEQRES  32 A  425  THR ILE TYR TYR VAL VAL SER GLY LEU PRO ASP HIS VAL
SEQRES  33 A  425  VAL ARG GLU LEU LYS SER GLN LYS SER

HET    PLP  A 600      15
HET    PLP  B 600      15
HET    PLP  C 600      15
HET    PLP  D 600      15
HETNAM     PLP PYRIDOXAL-5'-PHOSPHATE
HETSYN     PLP VITAMIN B6 COMPLEX
FORMUL   5  PLP    4(C8 H10 N1 O6 P1)
HELIX    1   1 LEU A   45  LEU A   59  1                                  15
HELIX    2   2 LYS A   69  ASN A   71  5                                   3
HELIX    3   3 ASP A   73  GLY A   84  1                                  12
HELIX    4   4 SER A   91  ILE A  101  1                                  11
HELIX    5   5 PRO A  104  GLU A  106  5                                   3
HELIX    6   6 GLN A  116  SER A  126  1                                  11
HELIX    7   7 CYS A  135  HIS A  146  1                                  12
HELIX    8   8 LYS A  173  GLU A  175  5                                   3
HELIX    9   9 ASP A  176  LEU A  187  1                                  12
HELIX   10  10 ALA A  205  LEU A  225  1                                  21
HELIX   11  11 LYS A  247  PHE A  263  1                                  17
HELIX   12  12 GLY A  276  ALA A  281  1                                   6
HELIX   13  13 PHE A  326  HIS A  333  1                                   8
HELIX   14  14 THR A  390  THR A  394  5                                   5
HELIX   15  15 SER A  396  PHE A  400  5                                   5
SHEET    1   A 6 GLN A 365  PRO A 373  0
SHEET    2   A 6 LEU A 350  TRP A 356 -1  N  TYR A 351   O  LEU A 372
SHEET    3   A 6 SER A 313  VAL A 318  1  O  PHE A 314   N  SER A 354
SHEET    4   A 6 PHE A 284  THR A 296 -1  N  ILE A 291   O  TYR A 317
SHEET    5   A 6 PHE A  40  ASP A  44 -1  O  PHE A  40   N  ALA A 287
SHEET    6   A 6 THR A 404  VAL A 408  1  O  THR A 404   N  PHE A  41
SHEET    1  A1 6 GLN A 365  PRO A 373  0
SHEET    2  A1 6 LEU A 350  TRP A 356 -1  N  TYR A 351   O  LEU A 372
SHEET    3  A1 6 SER A 313  VAL A 318  1  O  PHE A 314   N  SER A 354
SHEET    4  A1 6 PHE A 284  THR A 296 -1  N  ILE A 291   O  TYR A 317
SHEET    5  A1 6 TRP A 380  PHE A 383 -1  N  LEU A 381   O  VAL A 288
SHEET    6  A1 6 PRO A 338  PRO A 340 -1  O  LEU A 339   N  LEU A 382

CRYST1   66.800  151.700   85.350  90.00 102.30  90.00 P 1 21 1
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.014970  0.000000  0.003264        0.00000
SCALE2      0.000000  0.006592  0.000000        0.00000
SCALE3      0.000000  0.000000  0.011992        0.00000
ATOM      1  N   ASP A  35      34.731  -5.686  15.000  1.00 98.44           N
ATOM      2  CA  ASP A  35      34.249  -5.884  13.629  1.00 98.39           C
ATOM      3  C   ASP A  35      33.320  -4.750  13.203  1.00 98.13           C
ATOM      4  O   ASP A  35      33.474  -3.594  13.603  1.00 98.29           O
ATOM      5  CB  ASP A  35      33.558  -7.247  13.545  1.00 98.38           C
ATOM      6  CG  ASP A  35      33.566  -7.887  12.170  1.00 98.36           C
ATOM      7  OD1 ASP A  35      33.717  -9.133  12.114  1.00 98.26           O
ATOM      8  OD2 ASP A  35      33.419  -7.182  11.148  1.00 98.39           O
ATOM      9  N   GLU A  36      32.332  -5.073  12.378  1.00 97.79           N
ATOM     10  CA  GLU A  36      31.446  -4.080  11.787  1.00 95.51           C
ATOM     11  C   GLU A  36      32.259  -2.944  11.199  1.00 90.65           C
ATOM     12  O   GLU A  36      32.220  -1.813  11.692  1.00 94.96           O
ATOM     13  CB  GLU A  36      30.419  -3.638  12.840  1.00 97.63           C
ATOM     14  CG  GLU A  36      29.111  -3.155  12.261  1.00 98.19           C
ATOM     15  CD  GLU A  36      27.791  -3.597  12.824  1.00 98.33           C
ATOM     16  OE1 GLU A  36      27.308  -4.727  12.601  1.00 98.28           O
ATOM     17  OE2 GLU A  36      27.115  -2.806  13.520  1.00 98.43           O
ATOM     18  N   GLY A  37      33.018  -3.192  10.131  1.00 52.86           N
ATOM     19  CA  GLY A  37      33.624  -2.167   9.299  1.00 39.88           C
ATOM     20  C   GLY A  37      32.598  -1.167   8.712  1.00 34.34           C
ATOM     21  O   GLY A  37      32.236  -1.162   7.531  1.00 31.44           O
ATOM     22  N   ASP A  38      32.135  -0.248   9.564  1.00 37.23           N
ATOM     23  CA  ASP A  38      31.136   0.700   9.138  1.00 36.44           C
ATOM     24  C   ASP A  38      31.794   1.722   8.228  1.00 33.49           C
ATOM     25  O   ASP A  38      33.029   1.896   8.156  1.00 34.06           O
ATOM     26  CB  ASP A  38      30.500   1.242  10.405  1.00 42.06           C
ATOM     27  CG  ASP A  38      29.583   0.207  11.047  1.00 44.59           C
ATOM     28  OD1 ASP A  38      29.408  -0.876  10.434  1.00 45.72           O
ATOM     38  CA  PHE A  40      32.728   6.727   7.615  1.00 20.51           C
...
CONECT1117911177
CONECT1118011177
MASTER      482    0    4   60   80    0    0    611176   64  132
END

Fields in each ATOM record above, left to right: atom serial number, name,
residue, chain, sequence number, X, Y, Z, occupancy, temperature factor, element

File Format Conversions

Wide variety of formats


Common tools
readseq (all flavors of Unix)
1. IG/Stanford
2. GenBank/GB
3. NBRF
4. EMBL
5. GCG
6. DNAStrider
7. Fitch
8. Pearson/Fasta
9. Zuker (in-only)
10. Olsen (in-only)
11. Phylip3.2 (Sequential)
12. Phylip (Interleaved)
13. Plain/Raw
14. PIR/CODATA
15. MSF
16. ASN.1
17. PAUP/NEXUS
18. Pretty (out-only)

seqret (EMBOSS)

gcg GCG 9.x and 10.x format


embl
swiss
fasta
genbank
nbrf
pir NBRF (PIR)
codata CODATA format.
strider DNA strider format
clustal
phylip PHYLIP non-interleaved multiple alignment format.
acedb ACeDB format
msf Wisconsin Package GCG's MSF multiple sequence format.
hennig86 Hennig86 format
jackknifer Jackknifer format
jackknifernon Jackknifernon format
nexus
paup Nexus/PAUP format
treecon Treecon format
mega Mega format
ig IntelliGenetics format.
staden
text

Many GUI packages such as GCG SeqLab (Unix), BioEdit (Windows), etc.
have built-in conversion utilities between different file formats
Forcon is handy for converting between
phylogenetic multiple alignment formats

Structure file formats

Major formats
PDB - Protein Data Bank
mol2 - Tripos Sybyl
mmCIF - Macromolecular Crystallographic Information File
XYZ

Some packages can automatically convert between these formats

Babel
Alchemy
Biosym .CAR
Cambridge CADPAC
Chem3D Cartesian 2
CSD GSTAT
Feature
Gaussian Output
Gaussian 94 Output
Hyperchem HIN
Mac Molecule
MM2 Input
MMADS
Mopac Cartesian
PC Model
PS-GVB Output
ShelX
Spartan Semi-Empirical
Sybyl Mol2
XYZ

AMBER PREP
Boogie
CHARMm
CSD CSSR
Dock Database
Free Form Fractional
Gaussian Z-Matrix
GAMESS Output (A)
MDL Isis (SDF)
Macromodel
MM2 Output
MDL MOLfile
Mopac Internal
PDB
Quanta MSF
SMILES
Spartan Mol. Mechanics
Conjure
XED

Ball and Stick


Cacao Cartesian
Chem3D Cartesian 1
CSD FDAT
Dock PDB
GAMESS Output
Gaussian 92 Output
GROMOS96 (nm)
M3D
Micro World
MM3
MOLIN
Mopac Output
PS-GVB Input
Schakal
Spartan
Sybyl Mol
UniChem XYZ

Also has the ability to add and delete hydrogens


Available for Unix (AIX, Ultrix, Sun-OS, Convex, SGI, Cray, Linux), MS-DOS, and
on Macs running at least System 7.0.

babel -imm2out mm2.grf -omopint mopac.dat

Some programming tools for conversions

bioperl
use Bio::SeqIO;
# Read FASTA records from one file and write them to another as EMBL.
$in = Bio::SeqIO->new(-file => "inputfilename", -format => 'Fasta');
$out = Bio::SeqIO->new(-file => ">outputfilename", -format => 'EMBL');
while ( my $seq = $in->next_seq() ) {
    $out->write_seq($seq);
}
or
use Bio::SeqIO;
# Same conversion using tied filehandles; output goes to STDOUT.
$in = Bio::SeqIO->newFh(-file => "inputfilename", -format => 'Fasta');
$out = Bio::SeqIO->newFh(-format => 'EMBL');
# World's shortest Fasta<->EMBL format converter:
print $out $_ while <$in>;

biopython

Scanner - The part of the parser that actually does
the work of going through the file and extracting
useful information. This useful information is
converted into events.
Consumer - The consumer does the job of
processing the useful information and spitting it out
in a format that the programmer can use. The
consumer does this by receiving the events created
by the scanner.
You may be required to write your own scanner and
consumer for certain formats
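
The scanner/consumer split can be illustrated without any library. The sketch below is plain Python and is not Biopython's own API: the class names `FastaScanner` and `ListConsumer` and the event method names are hypothetical, chosen only to show a scanner emitting events and a consumer turning them into usable data.

```python
import io


class FastaScanner:
    """Walks through the file and reports what it finds as events."""

    def feed(self, handle, consumer):
        in_record = False
        for raw in handle:
            line = raw.strip()
            if line.startswith(">"):
                if in_record:
                    consumer.end_record()
                consumer.start_record()
                consumer.title(line[1:].strip())
                in_record = True
            elif line and in_record:
                consumer.sequence(line)
        if in_record:
            consumer.end_record()


class ListConsumer:
    """Receives the events and builds (title, sequence) tuples."""

    def __init__(self):
        self.records = []

    def start_record(self):
        self._title, self._parts = "", []

    def title(self, text):
        self._title = text

    def sequence(self, text):
        self._parts.append(text)

    def end_record(self):
        self.records.append((self._title, "".join(self._parts)))


handle = io.StringIO(">mysequence\nACGT\nCAGT\n>mysequence2\nACCG\n")
consumer = ListConsumer()
FastaScanner().feed(handle, consumer)
print(consumer.records)
```

Supporting a new output only needs a different consumer; the scanner stays unchanged, which is the point of separating the two roles.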

Translating nucleotide formats

Factors to take into account


Translate in all 6 reading frames
3 forward, 3 reverse
The use of non-standard genetic codes for
different organisms
Stop codons
Output format

1 letter
3 letter
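
The factors above can be made concrete with a small six-frame translator. This is a minimal sketch using the standard genetic code only and one-letter output; the function names are hypothetical, frame "-1" is defined here simply as the first frame of the reverse complement (tools differ in how they number reverse frames), and unknown codons are written as X.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq, offset=0):
    """Translate from the given 0-based offset; stop codons become '*'."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(seq):
    """Return the 3 forward and 3 reverse translations keyed by frame."""
    frames = {}
    rc = reverse_complement(seq)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq, offset)
        frames[f"-{offset + 1}"] = translate(rc, offset)
    return frames


cds_start = "atgagccgaaagttctgcgtgggaggcaac"  # bases 181-210 of the GenBank example
print(translate(cds_start))
print(six_frame_translation("ATGAAATAG"))
```

Supporting a non-standard genetic code would mean swapping in a different 64-character amino acid string; three-letter output is a lookup on the one-letter result.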

EMBOSS
transeq
It can translate in any of the 3 forward or three reverse
sense frames, or in all three forward or reverse frames,
or in all six frames.
It can translate specified regions corresponding to the
coding regions of your sequences.
It can translate using the standard ('Universal') genetic
code and also with a selection of non-standard codes.
Termination (STOP) codons are translated as the
character '*'.
The output peptide sequence is always in the standard
one-letter IUPAC code.
prettyseq
This writes out a nicely formatted display of the
sequence with the translation (within specified ranges)
displayed beneath it.
Slightly unusually, this application uses the codon usage
tables to translate the codons

Web tools
Expasy translate tool

EBI translation machine

Viewers for sequencer data

abiview (EMBOSS)
Trev (Unix)
EditView (Mac)
Chromas (Windows)
AbiView (Windows)

Most viewers allow you to:


View the traces
Change the scale
Edit the basecalling
Preserve the original sequence
Export the data

Analysis of primary data from sequencers
Staden Package (MRC-LMB)

Preparing sequence trace data for analysis


for assembly
pregap4

Graphical user interface


Prepare trace data
Automation
Trace format conversion
Quality analysis
Vector clipping
Contaminant screening
Repeat searching.

Assembly program
gap4

Assembly
Contig joining
Assembly checking
Repeat searching
Experiment suggestion
Read pair analysis
Contig editing
Graphical views of contigs
Database

Consed

Phred: base caller


Phrap: assembler
Consed: Editor and finishing program
Quality values

Phred designed for gel-based sequencers


Being checked for capillary data

Finding open reading frames

GRAIL

Neural network
Combine evidence from 7 different statistical
measures

Frame bias
Periodicities
Fractal dimensions
Coding 6-tuples
In-frame 6-tuples
K-tuple commonality
Repetitive 6-tuple words

At each position of the sequence, info is
weighted, integrated and scored for ORF or
intergenic region

Organism/dataset specificity
Genscan

Statistics and probabilistic models of gene structure

GeneWise

Comparison of translations with known proteins

NetGene

Donor and acceptor sites

EMBOSS
getorf
plotorf

Determining protein and DNA characteristics
Web
BCM Search Launcher
Nucleic acid sequence searches
General protein sequence/pattern searches
Species-Specific protein sequence searches
Multiple sequence alignments
Pairwise sequence alignments
Gene feature searches
Sequence utilities
Protein secondary structure prediction

SMART

Protein domain and feature analysis

Pfam

HMM-based protein motif searches

Prosite
Detects signature motifs in proteins
Regular expression searches
Scan sequences against database

Prints

Protein fingerprints

EMBOSS DNA
cpgplot plots cpg rich areas
restrict restriction sites
tfscan transcription factors
einverted find inverted repeats
chips codon usage
geecee GC content

EMBOSS protein

garnier - predicts protein secondary structure


helixturnhelix - report nucleic acid binding motifs
hmoment - hydrophobic moment calculation
pepcoil - predicts coiled coil regions
pepnet - displays proteins as a helical net
pepwheel - shows protein sequences as helices
tmap - displays membrane spanning regions
topo - draws an image of a transmembrane protein
charge - protein charge plot
checktrans - reports STOP codons and ORF statistics of a protein
sequence
compseq - counts the composition of dimer/trimer/etc words in a
sequence
iep - calculates the isoelectric point of a protein
octanol - displays protein hydropathy
pepinfo - plots simple amino acid properties in parallel
pepstats - protein statistics
pepwindow - displays protein hydropathy
antigenic - finds antigenic sites in proteins
pscan - scans proteins using PRINTS
sigcleave - reports protein signal cleavage sites

Primer Design

Factors
Melting point

Length
Composition
Methods for calculating melting point
Internal stability

Specificity

False priming sites

Internal stability

Hairpin structures

Compatibility

Primer dimers
Compatible melting points
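
Two of these factors, composition and melting point, are easy to estimate by hand. The sketch below computes GC content and a rough melting temperature with the Wallace rule, Tm = 2(A+T) + 4(G+C), which is a quick approximation for short oligonucleotides and not the nearest neighbour method used by the packages below. The function names and the 5 degree compatibility threshold are assumptions for illustration.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C (Wallace rule)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def compatible(primer_a, primer_b, max_tm_difference=5):
    """True when the two estimated melting points are close together."""
    return abs(wallace_tm(primer_a) - wallace_tm(primer_b)) <= max_tm_difference


left = "AAGAGTCTGGGGGAGCTGAT"
right = "ATCATTGCTGGGCTGATCTC"
print(gc_percent(left), wallace_tm(left))
print(gc_percent(right), wallace_tm(right))
print(compatible(left, right))
```

The absolute Tm values from this rule differ from those of thermodynamic methods, so it is best used to compare primers with each other rather than to set an annealing temperature.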

OLIGO Package
Nearest neighbour method for Tm
calculation
Comprehensive analysis suite
$$$

CODEHOP
COnsensus-DEgenerate Hybrid Oligonucleotide Primer
PCR primers designed from protein multiple sequence
alignments

Primer3
You provide the target sequence
It picks primers for PCR reactions, considering
as criteria:

Oligonucleotide melting temperature


Size
GC content
primer-dimer possibilities
PCR product size
Positional constraints within the source sequence
Miscellaneous other constraints.

               start  len     tm    gc%   any    3'  seq
1 LEFT PRIMER     66   20  60.22  55.00  5.00  2.00  AAGAGTCTGGGGGAGCTGAT
  RIGHT PRIMER   259   20  60.19  50.00  4.00  2.00  ATCATTGCTGGGCTGATCTC
PRODUCT SIZE: 194, PAIR ANY COMPL: 4.00, PAIR 3' COMPL: 2.00
2 LEFT PRIMER    331   20  60.25  45.00  5.00  2.00  AGCTCATTGGGCAAAAAGTG
  RIGHT PRIMER   529   20  59.55  55.00  2.00  1.00  CCAGTTCCAATAGCCCAGAC
PRODUCT SIZE: 199, PAIR ANY COMPL: 6.00, PAIR 3' COMPL: 1.00
3 LEFT PRIMER    331   20  60.25  45.00  5.00  2.00  AGCTCATTGGGCAAAAAGTG
  RIGHT PRIMER   538   20  60.12  45.00  3.00  2.00  GCAGTTTTGCCAGTTCCAAT
PRODUCT SIZE: 208, PAIR ANY COMPL: 7.00, PAIR 3' COMPL: 2.00
4 LEFT PRIMER    379   20  59.67  50.00  4.00  2.00  TCATCGCCTGTATTGGTGAG
  RIGHT PRIMER   578   20  60.44  50.00  6.00  2.00  GCGGAGTTTCTTGTGCACTT
PRODUCT SIZE: 200, PAIR ANY COMPL: 3.00, PAIR 3' COMPL: 1.00
Statistics
Columns: considered, too many Ns, in target, in excl reg, bad GC%, no GC clamp,
tm too low, tm too high, high any compl, high 3' compl, high poly X,
high end stab, ok
Left   4198  810  2322  17  65  86  898
Right  4172  807  2281  83  994
Pair Stats:
considered 811, unacceptable product size 422, high any compl 1, high end compl 33, ok
355
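
The PRODUCT SIZE values in the output above follow from the two start positions. The sketch below reproduces them, under the assumption (inferred from these four pairs, not stated in the slides) that the RIGHT PRIMER start is the position of its 5' end, which is the rightmost base of the product on the template. The function name is hypothetical.

```python
def product_size(left_start, right_start):
    """Length of the amplified product, both end positions included."""
    return right_start - left_start + 1


# (left start, right start) for the four primer pairs reported above
pairs = [(66, 259), (331, 529), (331, 538), (379, 578)]
sizes = [product_size(left, right) for left, right in pairs]
print(sizes)
```

For pair 1 this gives 259 - 66 + 1 = 194, matching the reported product size, and the other three pairs agree in the same way.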
