
SEQUENCE FORMATS

The biological significances


What a sequence format IS

• Sequence formats are ASCII TEXT.


• A format is the required arrangement of
characters, symbols and keywords that specifies
where items such as the sequence, ID name and
comments appear in the file.
• A couple of dozen sequence formats exist.
Some are much more common than others.
• Each format is designed to hold the sequence
data and other information about the sequence.
Sequence formats
• Why different formats?
– Type of information
– Software requirements
– Database requirements
Main file formats used in Bioinformatics

•ASN.1
•EMBL, Swiss Prot
•FASTA
•GCG
•GenBank/GenPept
•PHYLIP
•Plain sequence format
ASN.1: Abstract Syntax Notation One
used by NCBI
Seq-entry ::= set {
class phy-set ,
descr {
pub {
pub {
article {
title {
name "Cross-species infection of blood parasites between resident
and migratory songbirds in Africa" } ,
authors {
names
std {
{
name
name {
last "Waldenstroem" ,
first "Jonas" ,
initials "J." } } ,
{
name
name {
last "Bensch" ,
first "Staffan" ,
initials "S." } } ,
{
name
name {
last "Kiboi" ,
first "Sam" ,
initials "S." } } ,
{
name
name {
last "Hasselquist" ,
first "Dennis" ,
initials "D." } } ,
{
name
name {
last "Ottosson" ,
EMBL/Swiss Prot
(http://www.ebi.ac.uk/help/formats_frame.html)
• The first line of each sequence entry is the ID line, which contains the entry name, data class,
molecule, division and sequence length.
• An XX line contains no data; it is just a separator.
• The AC line lists the accession number.
• The DE lines give a description of the sequence.
• The FT lines give precise annotation (features) for the sequence.
• The SQ line, with SQ in the first two columns, marks the start of the sequence information; the
sequence data begin on the line after it.
• The last line of each sequence entry is a terminator line, which has the two characters // in the
first two columns.
ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
RX MEDLINE; 94303342.
RX PUBMED; 8030378.
XX
FT rRNA <1..20
FT /product="18S ribosomal RNA"
FT misc_RNA 21..205
FT /standard_name="Internal transcribed spacer 1 (ITS1)"
FT rRNA 206..>237
FT /product="5.8S ribosomal RNA"
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120
ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
EMBL/Swiss Prot

• A sequence file in EMBL format can contain
several sequences.
• One sequence entry starts with an identifier
line ("ID"), followed by further annotation
lines. The start of the sequence is marked by a
line starting with "SQ" and the end of the
sequence is marked by two slashes ("//").
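The ID / SQ / // structure described above can be sketched in a few lines of Python. This is only an illustrative reader, not a full EMBL parser: the function name `parse_embl` is hypothetical, and it assumes that sequence lines contain only residue letters, spaces and a trailing position count.

```python
def parse_embl(text):
    """Return a list of (entry_name, sequence) pairs from EMBL-style text."""
    entries = []
    entry_name = None
    in_sequence = False
    residues = []
    for line in text.splitlines():
        if line.startswith("ID"):
            # The entry name is the first token after the ID line code.
            entry_name = line[2:].split()[0].rstrip(";")
            in_sequence = False
            residues = []
        elif line.startswith("SQ"):
            # Sequence data start on the line after the SQ line.
            in_sequence = True
        elif line.startswith("//"):
            # Terminator line: close the current entry.
            if entry_name is not None:
                entries.append((entry_name, "".join(residues)))
            entry_name = None
            in_sequence = False
        elif in_sequence:
            # Keep letters only; this drops spaces and the position count.
            residues.append("".join(ch for ch in line if ch.isalpha()))
    return entries
```

Because each entry is closed by its own // line, the same loop handles files that hold several sequences.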
FASTA

• A sequence in FASTA format begins with a single-line description,
followed by lines of sequence data.
• The description line is distinguished from the sequence data by a greater-than (">")
symbol in the first column.
• It is recommended that all lines of text be shorter than 80 characters in length.

>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
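
As a rough illustration of the rules above, the following Python sketch reads FASTA text by treating every line that starts with ">" as a new description and joining the lines after it into one sequence. The function name `parse_fasta` is hypothetical, and the sketch assumes the whole file fits in memory.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:]
            chunks = []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records
```

The first word of each description (here it would be U03518) is conventionally used as the sequence identifier.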
GCG
•Contains exactly one sequence
•Begins with annotation lines
•The start of the sequence is marked by a line ending with ".."
•This line also contains the sequence identifier, the sequence length
and a checksum

ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
AA03518 Length: 237 Check: 4514 ..
1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
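
A minimal Python sketch of reading such a file is shown below. It looks for the first line ending with "..", pulls the Length and Check values from that line if they are present, and collects the letters that follow. The function name `read_gcg` is hypothetical, and the sketch does not recompute or verify the checksum.

```python
import re


def read_gcg(text):
    """Return length, checksum and sequence from GCG-style text."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break  # found the divider line
    else:
        raise ValueError("no divider line ending with '..' found")
    length = re.search(r"Length:\s*(\d+)", line)
    check = re.search(r"Check:\s*(\d+)", line)
    # Everything after the divider is sequence; drop digits and spaces.
    body = "".join(lines[index + 1:])
    sequence = "".join(ch for ch in body if ch.isalpha())
    return {
        "length": int(length.group(1)) if length else None,
        "check": int(check.group(1)) if check else None,
        "sequence": sequence,
    }
```

Comparing the reported length with `len(record["sequence"])` is a cheap sanity check on a file of this kind.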
Phylip format
2 2000
G019uabh ATACATCATA ACACTACTTC CTACCCATAA GCTCCTTTTA ACTTGTTAAA
G028uaah CATAAGCTCC TTTTAACTTG TTAAAGTCTT GCTTGAATTA AAGACTTGTT
 
GTCTTGCTTG AATTAAAGAC TTGTTTAAAC ACAAAAATTT AGAGTTTTAC
TAAACACAAA ATTTAGACTT TTACTCAACA AAAGTGATTG ATTGATTGAT
 
TCAACAAAAG TGATTGATTG ATTGATTGAT TGATTGATGG TTTACAGTAG
TGATTGATTG ATGGTTTACA GTAGGACTTC ATTCTAGTCA TTATAGCTGC

• The first line of the input file contains the number of sequences and their
length (all sequences must have the same length), separated by blanks.
• Each sequence starts with its name, followed by the sequence itself
in blocks of 10 characters. The remaining sequences follow in the same way.
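
The layout of the example above (a header line, one named line per sequence, then unnamed continuation blocks in the same order) could be read with a sketch like the following. The function name `parse_phylip_interleaved` is hypothetical, and the sketch assumes names contain no spaces and are separated from the sequence by whitespace, which is looser than the fixed-width name field of strict PHYLIP.

```python
def parse_phylip_interleaved(text):
    """Return (declared_length, {name: sequence}) from interleaved PHYLIP text."""
    # Blank or whitespace-only lines only separate blocks, so drop them.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    count, length = (int(field) for field in lines[0].split()[:2])
    names = []
    sequences = []
    # First block: each line carries a name and the first part of a sequence.
    for ln in lines[1:1 + count]:
        name, rest = ln.split(None, 1)
        names.append(name)
        sequences.append("".join(rest.split()))
    # Later blocks: unnamed lines, in the same order as the names.
    for i, ln in enumerate(lines[1 + count:]):
        sequences[i % count] += "".join(ln.split())
    return length, dict(zip(names, sequences))
```

The declared length can then be compared with the length of each assembled sequence to detect truncated files.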
Plain sequence format

• A sequence in plain format may contain only IUPAC characters and spaces (no
numbers!).
• Note: A file in plain sequence format may contain only one sequence, while most
other formats accept several sequences in one file.

• An example sequence in plain format is:

ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCG
CTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGC
GGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCT
CCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGC
AGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACT
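
The "IUPAC characters and spaces only" rule can be checked with a short Python sketch. The function name `read_plain_sequence` is hypothetical, and the sketch assumes a nucleotide sequence, so it accepts only the IUPAC nucleotide codes; a protein sequence would need a different alphabet.

```python
# IUPAC nucleotide codes, including the ambiguity symbols (assumed alphabet).
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def read_plain_sequence(text):
    """Return the sequence with whitespace removed, or raise on bad characters."""
    sequence = "".join(text.split()).upper()
    invalid = sorted(set(sequence) - IUPAC_NUCLEOTIDES)
    if invalid:
        raise ValueError("characters not allowed in plain format: " + "".join(invalid))
    return sequence
```

A file that still carries position numbers, as EMBL or GenBank sequence lines do, fails this check instead of being silently accepted.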
GenBank format

• A sequence file in GenBank format can contain
several sequences.

• One sequence in GenBank format starts with a
line containing the word LOCUS and a number
of annotation lines. The start of the sequence
is marked by a line containing "ORIGIN" and
the end of the sequence is marked by two
slashes ("//").
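
The LOCUS / ORIGIN / // markers are enough to pull the sequences out of a GenBank file, as in the sketch below. The function name `parse_genbank` is hypothetical; the sketch assumes the locus name is the second word on the LOCUS line and ignores all annotation.

```python
def parse_genbank(text):
    """Return {locus_name: sequence} from GenBank-style text."""
    records = {}
    locus = None
    in_origin = False
    residues = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            locus = fields[1] if len(fields) > 1 else ""
            in_origin = False
            residues = []
        elif line.startswith("ORIGIN"):
            in_origin = True  # sequence lines follow
        elif line.startswith("//"):
            if locus is not None:
                records[locus] = "".join(residues)
            locus = None
            in_origin = False
        elif in_origin:
            # Drop the leading position number and the spaces between blocks.
            residues.append("".join(ch for ch in line if ch.isalpha()))
    return records
```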
