1 1 4 Databases and Resources (8) : Methods Mol Biol
assume that records in the division files use fixed column positions for items
of data. This is true of EMBL, SWISS-PROT, and GenBank, but not of
CODATA. Consequently if the column positions used in the CODATA
versions of PIR and NRL3D change from one release to the next, the index
creation routines will not work in the present form. The package does not
currently include routines to calculate taxon indexes or to create indexes
for PROSITE.
The routines described can be obtained as part of our sequence analysis
package.4 They run on DEC alpha machines under Digital UNIX (formerly
known as OSF), and on SUN machines using either SunOS 4.x or Solaris
2.x, and probably on other UNIX systems. For further information see
http://www.mrc-lmb.cam.ac.uk/pubseq/.
[8] SRS: Information Retrieval System for Molecular
Biology Data Banks
By THURE ETZOLD, ANATOLY ULYANOV, and PATRICK ARGOS
Introduction
The explosion of efforts in the field of molecular biology and biotechnol-
ogy has made both the acquisition of information from data banks and
deposit of new data increasingly critical. The types of data are diverse and
can range from the protocol of an experiment to the sequence of an entire
chromosome. The specificity and precision of data such as the exact coordi-
nates of a protein atom in three-dimensional space (Protein Data Bank,
PDB 1) or the free text description of a genetic disease (MIM 2) are also
highly variant. These contrasting data types have been compiled into data
banks by individuals or organizations and are often available to the scientific
community for retrieval as well as updating and addition of information.
The organization of the data banks also varies greatly, ranging from simple
text files to relational or object-oriented systems. Few of these information
list of keywords. The feature table, as described later, defines mostly subse-
quences with particular characteristics (exons or introns, repeats, do-
mains, etc.).
As the information requests gain specificity, more discrimination is
required in the choice of the index and the search word applied in a query.
A search for appropriate entries in biological data banks is often a quest
for the right search word(s) which can be obtained by testing different
queries with successive refinement. A good starting point for such a
procedure is a request to the AllText index of SRS, which is not related to a real
data field but rather represents all indices of all data fields containing textual
information such as those previously mentioned.
SRS allows the indexing of any word in an entry. An index search
usually requires only a fraction of a second but takes longer if the query
contains regular expressions or wildcards at the beginning of the search
word. Numbers such as the date, sequence length, resolution, or molecular
weight can also be indexed in special structures that allow the retrieval of
numeric ranges; for instance, all sequences with a length between 200 and
400 or all protein structures with a resolution greater than 2 Å.
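The two kinds of index described above can be illustrated with a minimal sketch. It assumes a toy in-memory collection of entries; the entry IDs, the field names, and the helper functions are hypothetical and do not reflect the actual SRS index structures.

```python
import bisect
from collections import defaultdict

# Hypothetical toy entries; field names are illustrative, not the SRS schema.
ENTRIES = {
    "E1": {"organism": "human", "seqlength": 450},
    "E2": {"organism": "mouse", "seqlength": 250},
    "E3": {"organism": "human", "seqlength": 900},
}

def build_word_index(entries, field):
    """Map each lowercased word found in `field` to the IDs of entries containing it."""
    index = defaultdict(set)
    for entry_id, data in entries.items():
        for word in str(data[field]).lower().split():
            index[word].add(entry_id)
    return index

def build_numeric_index(entries, field):
    """Return (value, entry_id) pairs sorted by value, ready for range lookups."""
    return sorted((data[field], entry_id) for entry_id, data in entries.items())

def numeric_range(index, low, high):
    """Return the IDs whose value lies in the closed interval [low, high]."""
    values = [value for value, _ in index]
    start = bisect.bisect_left(values, low)
    stop = bisect.bisect_right(values, high)
    return {entry_id for _, entry_id in index[start:stop]}

organism_index = build_word_index(ENTRIES, "organism")
length_index = build_numeric_index(ENTRIES, "seqlength")

print(sorted(organism_index["human"]))
print(sorted(numeric_range(length_index, 200, 400)))
```

Keeping the numeric values sorted lets a range be answered with two binary searches, whereas a word lookup is a single dictionary access.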
It is evident that SRS requires much information to parse and extract
data fields and indexable items from data banks that often have their own
unique format and syntax. Two languages have been designed to describe
to the system a data bank whose information is usually deposited in a single
file. The ODD (Object Design and Definition) language is used to detail
the physical structure (data fields, files), whereas the other, an extended
Backus-Naur notation, describes the syntax of the data fields. This results
in a highly flexible environment where data bank representations can be
easily added or changed.
In a system with many data banks, many indices (one per data bank
per data field) have to be created and updated. The link indices (see below)
add substantially to the complexity of the maintenance since links manifest
interdependences between individual data banks. A program has been
written that analyzes the state of the overall system and writes a command
script in case indices need to be rebuilt. The index building is completely
automated, and the program runs automatically at frequent intervals to
check if a new version of a data bank exists. The storage size for the indices
is relatively small, totaling about 10 to 20% of the size of the actual indexed
data banks.
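The maintenance check can be pictured with the following sketch, which assumes that each index records the time at which it was built and that an index is out of date when it is missing or older than its data bank. The timestamps, the data structures, and the command name build_index are hypothetical placeholders, not part of the SRS distribution.

```python
def stale_indices(databank_times, index_times):
    """Return sorted (databank, field) pairs whose index is missing or outdated.

    databank_times: data bank name -> time of the current release
    index_times:    (data bank, field) -> time the index was built, or None
    """
    stale = []
    for (bank, field), built in index_times.items():
        if built is None or built < databank_times[bank]:
            stale.append((bank, field))
    return sorted(stale)

def command_script(stale):
    """Write one rebuild command per outdated index.

    "build_index" is a placeholder command name used only for illustration.
    """
    return [f"build_index {bank} {field}" for bank, field in stale]

# Hypothetical state: one index per data bank per data field.
databank_times = {"embl": 100, "swissprot": 50}
index_times = {
    ("embl", "organism"): 90,          # older than the data bank: rebuild
    ("embl", "seqlength"): 120,        # up to date
    ("swissprot", "features"): None,   # never built: rebuild
}

for line in command_script(stale_indices(databank_times, index_times)):
    print(line)
```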
Most interfaces to SRS (see below) shield the user completely or at least
to some extent from the query language expressions, often through a
query form.
The query language is set oriented, where a set is a list of entries from
one or more data banks. The operators require sets as operands and produce
sets of entries as the result. A set can be an entire data bank as denoted
by the data bank name or a portion of it resulting from a query or a retrieval
command. A retrieval command must specify the data bank name, the
name of the index to be searched, and the query string or a numeric range.
For example, the query
[embl-organism:human]
retrieves all entries in EMBL with the word "human" in the organism data
field. The query
[embl-seqlength#400:500]
retrieves all sequences from the EMBL nucleotide databank with length
between 400 and 500 base pairs. The two queries can be combined with
the logical operator AND, denoted "&," resulting in a list of overlapping
entries obtained from the separate commands:
[embl-organism:human]&[embl-seqlength#400:500]
SRS provides the three logical operators AND, OR, and BUT NOT
(combination of binary AND and unary NOT) as well as two further link
operators described below. Many retrieval criteria can be specified and
combined by logical operators in a single expression. Only indexed informa-
tion can be used for specifying constraints, and only entire data banks can
be searched.
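Because every operand and every result is a set of entries, the three logical operators map directly onto ordinary set operations. The sketch below assumes two hypothetical result sets standing in for the output of the retrieval commands shown above; the entry IDs and function names are invented for illustration.

```python
# Hypothetical result sets standing in for the output of two retrieval commands.
human = {"HS001", "HS002", "HS003"}            # e.g. entries with "human" as organism
length_400_500 = {"HS002", "HS003", "MM010"}   # e.g. entries with length 400 to 500

def op_and(a, b):
    """AND: entries present in both sets."""
    return a & b

def op_or(a, b):
    """OR: entries present in either set."""
    return a | b

def op_but_not(a, b):
    """BUT NOT: entries of the first set that are absent from the second."""
    return a - b

print(sorted(op_and(human, length_400_500)))
print(sorted(op_or(human, length_400_500)))
print(sorted(op_but_not(human, length_400_500)))
```

Since each operator returns a set, results can be fed back in as operands, which is what allows many criteria to be combined in a single expression.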
Retrieval of Subentries
Retrieval systems usually consider the entire entry as the only retriev-
able unit, a useful simplification that, however, completely ignores its inter-
nal structure. This approach may be acceptable for data banks such as
Prosite, 6 where each entry describes one amino acid sequence pattern, but
not for nucleotide sequence data banks such as GenBank 7 or EMBL where
an entry may contain a single exon but also a complete genome or chromo-
some with several hundred genes. Genes, or rather their parts such as the
promoter, coding sequence, introns, exons, and the like, are characterized
in the feature table data field, which increasingly forms the principal portion
of the annotation in EMBL and GenBank. Feature information can also
be found in the protein sequence data banks SWISS-PROT 8 and PIR
(Protein Identification Resource), 9 which document subentries such as
transmembrane segments, calcium binding sites, and disulfide bridges. The
feature table introduces a new structural element into the entry, namely,
a list of subentries. A subentry can be viewed independent of its parent
entry. For instance, the coding sequences (CDS features) from a DNA
sequence entry can be extracted from the parent entry by using the sequence
location information and subsequently translated and aligned with others
as complete entries describing cDNA sequences.
SRS allows the indexing of the feature table such that the whole entry
or the subentry (sequence feature) can be retrieved after an appropriate
query. Each word in the feature description can be indexed and thereby
associated with the feature itself and not just the parent entry. Figure 1
shows four subentries resulting from the query in SWISS-PROT for the
transmembrane regions of the acetylcholine receptor sequence entry
ACHA_BOVIN. The resulting entries contain the ID line of the parent
entry, the relevant part of the feature table, and the subsequence extracted
from the parent sequence as defined by the terminal positions in the fea-
ture annotation.
It is also possible to redefine the begin and end positions so that a
fragment with a "safety" region around them will be provided. This is
useful when extracting, for instance, transmembrane segments which are
mostly putative with much uncertainty as to the termini. Another example
is the retrieval of intron/exon boundaries.
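Extraction of a subentry with an optional "safety" region can be sketched as follows. The sketch assumes 1-based, inclusive feature positions and clips the extended fragment to the bounds of the parent sequence; the function name and the margin parameter are hypothetical.

```python
def extract_subentry(sequence, begin, end, margin=0):
    """Return the subsequence of a feature, widened by `margin` residues per side.

    begin and end are 1-based inclusive positions (an assumption made here),
    and the widened fragment is clipped to the parent sequence.
    """
    start = max(begin - margin, 1)
    stop = min(end + margin, len(sequence))
    return sequence[start - 1:stop]

parent_sequence = "ABCDEFGHIJ"  # placeholder sequence, not real data

print(extract_subentry(parent_sequence, 4, 6))            # the feature as annotated
print(extract_subentry(parent_sequence, 4, 6, margin=2))  # with a safety region
```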
truly distributed system with different and yet intertwined services. Indeed,
this has been so successful that most people think of navigation between
data banks in terms of hypertext links. Though they are easy to comprehend
and very powerful in making accessible related information, they are not
without limitations.
First, hypertext links are unidirectional unless the linked entries provide
a cross-reference back to the referencing entry. For instance, EMBL and
SWISS-PROT are mutually referenced, but only SWISS-PROT points to
PDB (Protein Data Bank, containing atom coordinates and associated
information for known protein three-dimensional structures), which has no
cross-references of any kind. There are many cases where the referencing
is not mutual, or is undesirable if, for instance, private or temporary data
banks are involved.
Second, it is always possible, but may be very inconvenient, to collect
all entries of one type referenced by a single entry. A PROSITE entry,
120 DATABASES AND RESOURCES [8]
Third, linking entire sets of single entries or even entire data banks is
impractical, as each single entry in the set must be accessed, and then the
hypertext link must be located and the linked entries, if any, accessed and
saved. In effect, hypertext links are too cumbersome to ask straightforward
questions such as which member of a set of lysozymes has a known tertiary
structure or which entries of the current set belong to the family of amino-
transferases.
Finally, often a desired link between two data banks does not exist; for
instance, it is impossible in the absence of cross-reference information to
navigate directly from an entry in SWISS-PROT to an associated entry in
EPD (Eukaryotic Promoter Database 12) which describes the promoter of
the gene coding for the protein whose amino acid sequence is given in the
SWISS-PROT entry. However, the link can be found by first accessing a
related entry in EMBL which provides links to EPD. In general, there
are many cases where data banks are not directly but indirectly linked.
Exploitation of this requires of the user intimate and extensive knowledge
regarding the intertwining of all data banks, and in the process of repeated
linking, it is likely the user will get lost in the Web.
The SRS linking algorithm allows the user to overcome these barriers
easily. The cross-references are processed before query time such that
the available data banks are scanned for cross-references and the results
are inserted into special link indices. For each link between two data banks
an index is built. In case of mutual referencing, it is sufficient to read the
references from only one source, alleviating the necessity to scan all data
banks when building a network with many data banks. Once indexed, a
link can be used bidirectionally. All the links must be defined in the ODD
language (see above) and are therefore known to the system. This link
information is used to build a graph structure in memory, where the data
banks are nodes and the links the edges. In this system a link is defined as
the shortest or optimal path that can be found between any desired pair
of nodes or data banks assuming bidirectional edges. Figure 3 shows the
situation for the data banks installed at EMBL. To add a new data bank
to the network, it is sufficient to add a single link to any of its nodes.
Though the usefulness of the link operation depends on the quality of the
cross-references in the respective data banks, it is also valuable in locating
wrong or missing cross-reference information.
The SRS query language has, apart from the logical operators previously
described, two further and unique operators: link left and link right. Figure
4 illustrates their functionality where the two data banks A and B are
linked through explicit cross-references in the entries of A used to create
12 P. Bucher and E. N. Trifonov, Nucleic Acids Res. 14, 10009 (1986).
FIG. 3. Network of data banks installed at the EMBL in Heidelberg, Germany. This
diagram can also be called from the SRSWWW server (http://www.embl-heidelberg.de/srs/
srsc?-np), and each box can be clicked to obtain further information about the respective
data bank. Note that LIMB [Listing of Molecular Biology databases, G. Keen, G. Redgrave,
J. Lawton, M. Cinkosky, S. Mishra, J. Fickett, and C. Burks, Math. Comput. Modelling 16,
93 (1992)] is the only databank without a link since it provides information about all the
others.
FIG. 4. Function of the SRS link operators exemplified on the two data banks A and B
with respective numbered entries. See text for further explanations.
[8] SRS (SEQUENCE RETRIEVAL SYSTEM) 123
a link index. Operation (1) A > B returns those entries in B that are cross-
referenced by entries in A. Operation (2) A < B returns entries in A that
cross-reference entries in B.
Since all link indices can be used in both directions, it is of no conse-
quence whether entries in A reference B or entries in B reference those
in A. It is therefore valid to state that entries in A are linked to entries in
B, either by references in A or B. The results of the above expressions can
now be redefined as the evaluation of (1) returns entries in B linked to A
and (2) returns entries in A linked to B. The link operator can be viewed
as an arrow pointing to the set [link right (1) and link left (2)] from which
the result will be taken. Reversing the order of operands has the same
effect; hence, A > B and B < A yield the same result, which is entries in
B linked to A. The operands of the link operators can be a data bank name
or a set returned by a retrieval command or any other SRS query language
expression (see above).
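The behavior of the two link operators can be mimicked with a small sketch. It assumes that a link index is simply a collection of entry pairs that may be read in either direction; the entry IDs and the function names are hypothetical, and the quadratic lookup is chosen for brevity rather than speed.

```python
# Hypothetical link index between data banks A and B, stored as entry pairs.
LINK_INDEX = {("a1", "b1"), ("a1", "b2"), ("a3", "b2")}

def linked(x, y, pairs=LINK_INDEX):
    """True if x and y are linked, regardless of which entry holds the reference."""
    return (x, y) in pairs or (y, x) in pairs

def link_right(left, right):
    """left > right: entries of the right operand linked to the left operand."""
    return {r for r in right if any(linked(l, r) for l in left)}

def link_left(left, right):
    """left < right: entries of the left operand linked to the right operand."""
    return {l for l in left if any(linked(l, r) for r in right)}

A = {"a1", "a2", "a3"}
B = {"b1", "b2", "b3"}

print(sorted(link_right(A, B)))  # A > B: entries in B linked to A
print(sorted(link_left(A, B)))   # A < B: entries in A linked to B
print(link_right(A, B) == link_left(B, A))  # A > B and B < A agree
```

In both functions the result is drawn from the operand the arrow points to, which is why swapping the operands and reversing the operator leaves the result unchanged.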
Links can also be specified between data banks that are not direct
neighbors in the link diagram as shown in Fig. 3. Consider the example
pdb > embl which retrieves, for all proteins with solved tertiary structure,
the list of DNA sequences that encode their protein sequences. Algorithmic
examination of the data bank network reveals that the shortest path between
PDB and EMBL is to link PDB first to SWISS-PROT and then to
EMBL, which can then be automatically carried out. It may occur that two
alternative paths exist. The data bank administrator should give priority
to one path by assigning a lower cost value to the link object as specified
in the ODD language.
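Finding the path between two data banks that are not direct neighbors can be sketched as a cheapest-path search over the link graph. The text does not name the algorithm used by SRS; Dijkstra's algorithm is substituted here as one standard choice. The edge list covers only links mentioned in this chapter, and the uniform cost of 1 is an assumption made for illustration.

```python
import heapq

# Links mentioned in the text, treated as bidirectional edges.
# The cost values are assumed; an administrator could lower one to prefer a path.
LINKS = [
    ("pdb", "swissprot", 1),
    ("swissprot", "embl", 1),
    ("pdb", "hssp", 1),
    ("hssp", "swissprot", 1),
    ("embl", "epd", 1),
]

def build_graph(links):
    """Build an adjacency list in which every link can be followed both ways."""
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    return graph

def cheapest_path(graph, source, target):
    """Return the lowest-cost list of data banks from source to target, or None."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, step in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + step, neighbour, path + [neighbour]))
    return None

graph = build_graph(LINKS)
print(cheapest_path(graph, "pdb", "embl"))
print(cheapest_path(graph, "swissprot", "epd"))
```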
It is also possible to give a path explicitly by specifying a succession of
links. The expression pdb > hssp > swissprot forces the system not to take
the shortest route from PDB to SWISS-PROT but a deviation via HSSP.
The result is a set of all SWISS-PROT protein sequences with solved tertiary
structure and, in addition, all protein sequences in SWISS-PROT that are
homologous to them. The expression takes advantage of the fact that HSSP
entries list, for every entry in PDB, the SWISS-PROT entries that are
similar above a given percentage residue identity. The link via HSSP can
be understood as an amplification of a set of entries to all their related
counterparts as defined by the data bank chosen. In lieu of HSSP, other
data banks such as Prosite and Prodom, which collect and multiply align
related protein sequences or subsequences, could have been chosen.
It has been discussed previously that subentries can be retrieved by
searching in the feature table index. The result of such a query can be
converted to a set of parent entries by a link to a predefined entity "parent."
The result of the conversion can be used to link to other data banks. The
following finds all PDB entries with proteins that have calcium binding
sites as annotated in SWISS-PROT:
[swissprot-features:ca_bind] > parent > pdb
The link in the reverse direction gives all calcium binding sites with solved
tertiary structure:
pdb > swissprot > [swissprot-features:ca_bind]
Note that the "parent" keyword is not needed here since the link from
PDB to SWISS-PROT already results in a list of SWISS-PROT entries
needed to link to the appropriate SWISS-PROT subentries or features.
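The two expressions above can be traced step by step in a small sketch. It assumes that each subentry records its parent entry and feature key and that the SWISS-PROT to PDB link index is a set of pairs; all identifiers, feature keys other than ca_bind, and function names are invented for illustration.

```python
# Hypothetical subentries: ID -> (parent SWISS-PROT entry, feature key).
SUBENTRIES = {
    "P1#1": ("P1", "ca_bind"),
    "P1#2": ("P1", "transmem"),
    "P2#1": ("P2", "disulfid"),
    "P3#1": ("P3", "ca_bind"),
}
# Hypothetical link index: (SWISS-PROT entry, PDB entry).
SWISSPROT_PDB_LINKS = {("P1", "S1"), ("P2", "S2")}

def features(key):
    """Subentries whose feature key matches, as a feature index query would return."""
    return {sub for sub, (_, k) in SUBENTRIES.items() if k == key}

def parent(subentries):
    """Convert a set of subentries into the set of their parent entries."""
    return {SUBENTRIES[sub][0] for sub in subentries}

def to_pdb(swissprot_entries):
    """PDB entries linked to the given SWISS-PROT entries."""
    return {pdb for sp, pdb in SWISSPROT_PDB_LINKS if sp in swissprot_entries}

def to_swissprot(pdb_entries):
    """SWISS-PROT entries linked to the given PDB entries."""
    return {sp for sp, pdb in SWISSPROT_PDB_LINKS if pdb in pdb_entries}

def restrict_to_parents(subentries, swissprot_entries):
    """Keep only the subentries whose parent entry is in the given set."""
    return {sub for sub in subentries if SUBENTRIES[sub][0] in swissprot_entries}

# Forward: calcium binding features > parent > pdb
print(sorted(to_pdb(parent(features("ca_bind")))))

# Reverse: pdb > swissprot > calcium binding features
all_pdb = {pdb for _, pdb in SWISSPROT_PDB_LINKS}
print(sorted(restrict_to_parents(features("ca_bind"), to_swissprot(all_pdb))))
```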
Availability
The SRS program and its interfaces are provided together with the SRS
distribution file, which can be obtained via anonymous ftp from the server
felix.embl-heidelberg.de in the directory pub/software/unix/srs. The current
release of SRS is 4.08, with new releases made available every few months.
A news group, bionet.software.srs, provides a forum for discussion and
informs the community of bug fixes, new features, and releases. SRS is
also evolving into a collaborative effort, most of the interfaces have been
developed by other groups, and many of the data bank format descriptions
have been written by individuals maintaining SRS services at their sites.
A large part of the SRS functionality will be accessible through the
program "lookup" of the GCG program package for sequence analysis. 13
It will be distributed starting with release 8.1 of GCG.
13 J. Devereux, P. Haeberli, and O. Smithies, Nucleic Acids Res. 12, 387 (1984).
[Diagram for FIG. 5: http://www.embl-heidelberg.de/srs/srsc; page labels: Top Page, Databank Information, Select, Link, Query, Manager.]
FIG. 5. Structure of the WWW server. Only the main pages are illustrated; arrows indicate
entry paths. Top Page is the entry point of SRSWWW and can be obtained with the specified
URL (Uniform Resource Locator). See text for further details.
inform the user about the indexed data banks, often with links to the
distributor, and also allow the user to explore the data bank network. All
pages without exception are built during run time so that changes in the
system configuration, such as adding a new data bank, are immediately
reflected by the interface.
The SRSWWW server is currently installed at many sites throughout
the world. The number of data banks installed on a single node varies but
can reach as many as 50. A special program polls all these servers once a
day and retrieves the name, the number of entries, the release number, and
the indexing date for each data bank installed at the site. The information is
then compiled in a report which lists all data banks served by all nodes.
This list includes 13 sites with a total of 80 data banks. Figure 6 shows the
entry for the SWISS-PROT protein sequence data bank, which also includes links
to the data bank information page and relates status information for all
nodes serving that data bank. This information is useful both to the user
who can select the closest server to avoid slow internet connections or the
one with the most recent release of a given data bank and to the node
administrators who can verify the currency of their system.
Discussion
The Sequence Retrieval System (SRS) is an integrated system that provides a
homogeneous interface to all flat file data banks retained in their original
format. It is a retrieval system that allows access to, but not the depositing
of, data. Several elements are combined into a system that extends the
power of normal retrieval systems and which rivals that of real databases,
such as a relational system, without compromising speed. These elements
include languages for data bank and syntax definition, a programmable
parser, an indexing system, support for subentries, a novel system for ex-
ploiting links between data banks, and a query language. The database
linking is a unique feature that considerably extends the capability of hyper-
text links.
SWISSPROT
Server                  Release  Entries  Indexing date  Info
ABC, Hungary                     40292    21-Feb-1995    ?
BEN, Brussels                    43470    07-Jun-1995    ?
BiO, Oslo               31.0     43470    28-Jun-1995
Biozentrum, Basel                43470    22-Mar-1995
CAOS/CAMM, Nijmegen     31.0     43470    31-May-1995    ?
CSC, Finland            31.0     43470    08-May-1995    ?
EBI, Hinxton, UK                 43470    18-Jun-1995    ?
EMBL, Heidelberg                 43470    22-Mar-1995    ?
IUBio Archive Indiana   31.0     43470    15-Apr-1995    ?
INSERM, France                   43470    02-Jun-1995    ?
Sanger, Hinxton, UK     31.0     43470    05-Jun-1995
Skirball Inst., NY               43470    13-Jun-1995
Weizmann, Israel                 43470    01-Jun-1995
FIG. 6. Status of the data bank SWISS-PROT as installed on 13 SRSWWW servers. The
entry has been taken from the status page (http://www.embl-heidelberg.de/srs/status.html)
describing all data banks served by all 13 SRSWWW nodes. From left to right are listed the
following: a hypertext link to the server; if specified, the release number of the installed
version of SWISS-PROT; the number of entries; the indexing date; and the hypertext link that
leads to a page with further information about SWISS-PROT as installed at the respective site.
SRS nonetheless has its limitations. To benefit from the linking capabilities,
all data banks in the network must be installed at one site. This is in
contrast to World Wide Web hypertext links, which very naturally allow
navigation from one site to another. All information needs to be indexed
before it can be accessed such that addition of new entries requires rein-
dexing of the entire data bank. Work is in progress to overcome these
limitations. It is also intended to enhance SRS from a server of textual
information to an object server so that the retrieved data can be conve-
niently submitted to analysis programs such as sequence analysis tools.
Acknowledgment
The authors are grateful for financial support from the European Union (Grant Gene-
CT-93-0043) under the Biomed I program.