
114 DATABASES AND RESOURCES [8]

assume that records in the division files use fixed column positions for items
of data. This is true of EMBL, SWISS-PROT, and GenBank, but not of
CODATA. Consequently if the column positions used in the CODATA
versions of PIR and NRL3D change from one release to the next, the index
creation routines will not work in the present form. The package does not
currently include routines to calculate taxon indexes or to create indexes
for PROSITE.
The routines described can be obtained as part of our sequence analysis
package.4 They run on DEC alpha machines under Digital UNIX (formerly
known as OSF), and on SUN machines using either SunOS 4.x or Solaris
2.x, and probably on other UNIX systems. For further information see
http://www.mrc-lmb.cam.ac.uk/pubseq/.

4 R. Staden, Methods Mol. Biol. 25, 9 (1994).

[8] SRS: Information Retrieval System for Molecular
Biology Data Banks
By THURE ETZOLD, ANATOLY ULYANOV, and PATRICK ARGOS

Introduction
The explosion of efforts in the field of molecular biology and biotechnol-
ogy has made both the acquisition of information from data banks and
deposit of new data increasingly critical. The types of data are diverse and
can range from the protocol of an experiment to the sequence of an entire
chromosome. The specificity and precision of data such as the exact coordi-
nates of a protein atom in three-dimensional space (Protein Data Bank,
PDB 1) or the free text description of a genetic disease (MIM 2) are also
highly variant. These contrasting data types have been compiled into data
banks by individuals or organizations and are often available to the scientific
community for retrieval as well as updating and addition of information.
The organization of the data banks also varies greatly, ranging from simple
text files to relational or object-oriented systems. Few of these information

1 E. E. Abola, F. C. Bernstein, and T. F. Koetzle, in "Computational Molecular Biology:
Sources and Methods for Sequence Analysis" (A. M. Lesk, ed.), p. 69. Oxford Univ. Press,
Oxford, 1988.
2 P. L. Pearson, C. Francomano, P. Foster, C. Bocchini, P. Li, and V. A. McKusick, Nucleic
Acids Res. 22, 3470 (1994).

Copyright © 1996 by Academic Press, Inc.

METHODS IN ENZYMOLOGY, VOL. 266    All rights of reproduction in any form reserved.
[8] SRS (SEQUENCE RETRIEVAL SYSTEM) 115

sources are completely isolated, as each assimilates or references
information in other data banks.
It is not surprising that in this complex system of coexisting data banks
built and maintained with many different philosophies, the lingua franca
is still the text file in a flat file format where entries follow in sequential
order. The contents of the file are read directly by humans or by innumerable
parsing routines that make the data available for further
analysis or for incorporation into more advanced database management
systems. We present here a retrieval system called SRS (Sequence Retrieval
System 3,4) that acts on data banks in a flat file or text format. It provides
a homogeneous interface to about 80 biological databanks for accessing
and querying their contents and for navigating among them.

Indexing and Parsing


A fast query system usually relies on indices created before query time.
SRS has its own indexing system and treats entire data banks as sequences
of entries, each with different data fields. The contents of the data fields
are parsed, and selected words are extracted and inserted into an index.
There is usually a separate index created for each data field.
In data banks such as the EMBL (European Molecular Biology Laboratory)
collection of nucleotide sequences, 5 this separation is important even
if some information in different data fields appears redundant at first glance.
For instance, the keywords, the definition, the reference title, and feature
table data fields in a given sequence entry may contain the word "kinase."
The least consistent use of "kinase" is in the reference title field, where it
must be viewed within its context and might only indirectly describe the
sequence in the entry as, for example, in ". . . is phosphorylated by casein
kinase II . . . ." The intent of the definition field is to render a short
description of the entry in free text format. However, no discernible convention
need be applied; for instance, the same protein or enzyme can be described
by more than one name or even the same name spelled differently in
different entries. In the case of entries containing a complete genome, the
definition field may not even mention the protein names associated with
the genes therein. The keywords data field relies on a controlled dictionary
which classifies, rather than describes, the entry. Entries of related
sequences should be attributed with the same, or at least largely overlapping,

3 T. Etzold and P. Argos, Comput. Appl. Biosci. 9, 49 (1993).
4 T. Etzold and P. Argos, Comput. Appl. Biosci. 9, 59 (1993).
5 C. M. Rice, R. Fuchs, D. G. Higgins, P. J. Stoehr, and G. N. Cameron, Nucleic Acids Res.
21, 2967 (1993).

list of keywords. The feature table, as described later, defines mostly subse-
quences with particular characteristics (exons or introns, repeats, do-
mains, etc.).
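The value of keeping one index per data field can be pictured with a small sketch. It is an illustration only, not the SRS indexer: the toy entries, the field names, and the helper `build_field_indices` are assumptions introduced here.

```python
import re
from collections import defaultdict

# Toy entries; the IDs, field names, and texts are invented for illustration.
ENTRIES = [
    {"id": "E1", "keywords": "kinase; transferase",
     "definition": "casein kinase II alpha"},
    {"id": "E2", "keywords": "receptor",
     "definition": "substrate phosphorylated by casein kinase II"},
]


def build_field_indices(entries, fields):
    """Return {field: {word: set of entry IDs}}, one index per data field."""
    indices = {field: defaultdict(set) for field in fields}
    for entry in entries:
        for field in fields:
            text = entry.get(field, "").lower()
            for word in re.findall(r"[a-z0-9]+", text):
                indices[field][word].add(entry["id"])
    return indices


indices = build_field_indices(ENTRIES, ["keywords", "definition"])
# The same word gives different answers depending on the field searched.
print(sorted(indices["keywords"]["kinase"]))    # entries classified as kinases
print(sorted(indices["definition"]["kinase"]))  # entries that mention the word
```

In this toy data, searching the keywords index returns only the entry classified as a kinase, whereas the definition index also returns an entry that merely mentions one.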
As the information requests gain specificity, more discrimination is
required in the choice of the index and the search word applied in a query.
A search for appropriate entries in biological data banks is often a quest
for the right search word(s) which can be obtained by testing different
queries with successive refinement. A good starting point for such a
procedure is a request to the AllText index of SRS, which is not related to a real
data field but rather represents all indices of all data fields containing textual
information such as those previously mentioned.
SRS allows the indexing of any word in an entry. An index search
usually requires only a fraction of a second but takes longer if the query
contains regular expressions or wildcards at the beginning of the search
word. Numbers such as the date, sequence length, resolution, or molecular
weight can also be indexed in special structures that allow the retrieval of
numeric ranges; for instance, all sequences with a length between 200 and
400 or all protein structures with a resolution greater than 2 Å.
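A range query of this kind can be sketched with a sorted list and binary search. The text does not describe the "special structures" SRS uses, so the layout below is an assumption chosen for brevity; the entry IDs and lengths are invented.

```python
from bisect import bisect_left, bisect_right


def build_numeric_index(values_by_id):
    """Sort (value, entry ID) pairs once, before query time."""
    return sorted((value, entry_id) for entry_id, value in values_by_id.items())


def range_query(index, low, high):
    """Return the IDs of entries whose value lies in [low, high]."""
    keys = [value for value, _ in index]
    start = bisect_left(keys, low)    # first position with value >= low
    stop = bisect_right(keys, high)   # first position with value > high
    return [entry_id for _, entry_id in index[start:stop]]


# Hypothetical sequence lengths keyed by invented entry IDs.
lengths = {"A": 150, "B": 200, "C": 320, "D": 400, "E": 950}
length_index = build_numeric_index(lengths)
print(range_query(length_index, 200, 400))
```

Because the pairs are sorted in advance, each query needs only two binary searches rather than a scan of all entries.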
It is evident that SRS requires much information to parse and extract
data fields and indexable items from data banks that often have their own
unique format and syntax. Two languages have been designed to describe
to the system a data bank whose information is usually deposited in a single
file. The ODD (Object Design and Definition) language is used to detail
the physical structure (data fields, files), whereas the other, an extended
Backus-Naur notation, describes the syntax of the data fields. This results
in a highly flexible environment where data bank representations can be
easily added or changed.
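The two levels of description, physical structure and field syntax, can be illustrated in plain Python. This is not ODD and not the extended Backus-Naur notation, neither of which is shown in the text; the functions `split_fields` and `keyword_tokens` are hypothetical stand-ins, applied to a few lines in the style of the SWISS-PROT entry of Fig. 2.

```python
from collections import defaultdict

# A few lines in the style of the SWISS-PROT entry shown in Fig. 2.
ENTRY_TEXT = """\
ID   ACHA_BOVIN     STANDARD;      PRT;   457 AA.
AC   P02709;
DE   ACETYLCHOLINE RECEPTOR PROTEIN, ALPHA CHAIN PRECURSOR.
KW   RECEPTOR; POSTSYNAPTIC MEMBRANE; IONIC CHANNEL; GLYCOPROTEIN; SIGNAL;
KW   TRANSMEMBRANE.
"""


def split_fields(entry_text):
    """Physical structure: group the lines of one entry by line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        code, _, content = line.partition(" ")
        fields[code].append(content.strip())
    return dict(fields)


def keyword_tokens(fields):
    """Field syntax: keywords are ';'-separated phrases ending with '.'."""
    joined = " ".join(fields.get("KW", []))
    return [kw.strip().rstrip(".") for kw in joined.split(";") if kw.strip()]


fields = split_fields(ENTRY_TEXT)
print(fields["AC"])
print(keyword_tokens(fields))
```

Separating the two steps means a new data bank with the same line-code layout but a different keyword syntax would require changing only the second function.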
In a system with many data banks, many indices (one per data bank
per data field) have to be created and updated. The link indices (see below)
add substantially to the complexity of the maintenance since links manifest
interdependences between individual data banks. A program has been
written that analyzes the state of the overall system and writes a command
script in case indices need to be rebuilt. The index building is completely
automated, and the program runs automatically at frequent intervals to
check if a new version of a data bank exists. The storage size for the indices
is relatively small, totaling about 10 to 20% of the size of the actual indexed
data banks.
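The dependency between data bank releases, field indices, and link indices can be sketched as follows. The timestamps, the data structures, and the job strings are all assumptions made for illustration; they are not SRS commands and do not reproduce the actual maintenance program.

```python
def indices_to_rebuild(bank_times, field_indices, link_indices):
    """List the index jobs made stale by newer data bank releases.

    bank_times:    {bank: timestamp of the installed release}
    field_indices: {(bank, field): timestamp of the last index build}
    link_indices:  {(bank_a, bank_b): timestamp of the last index build}
    """
    jobs = []
    for (bank, field), built in sorted(field_indices.items()):
        if built < bank_times[bank]:
            jobs.append(f"index {bank} {field}")
    # A link index depends on both data banks it connects.
    for (bank_a, bank_b), built in sorted(link_indices.items()):
        if built < max(bank_times[bank_a], bank_times[bank_b]):
            jobs.append(f"link {bank_a} {bank_b}")
    return jobs


# Hypothetical state: SWISS-PROT was updated after its indices were built.
bank_times = {"swissprot": 20, "pdb": 10}
field_indices = {("swissprot", "keywords"): 15, ("pdb", "title"): 12}
link_indices = {("swissprot", "pdb"): 12}
print(indices_to_rebuild(bank_times, field_indices, link_indices))
```

Note that the link index is rebuilt even though PDB itself is unchanged, which is the interdependence the text refers to.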

Sequence Retrieval System Query Language


SRS has its own query language to represent index searches, the
combining of sets with Boolean operators, and links between data banks.

Most interfaces to SRS (see below) shield the user completely or at least
to some extent from the query language expressions, often through a
query form.
The query language is set oriented, where a set is a list of entries from
one or more data banks. The operators require sets as operands and produce
sets of entries as the result. A set can be an entire data bank as denoted
by the data bank name or a portion of it resulting from a query or a retrieval
command. A retrieval command must specify the data bank name, the
name of the index to be searched, and the query string or a numeric range.
For example, the query
    [embl-organism:human]
retrieves all entries in EMBL with the word "human" in the organism data
field. The query
    [embl-seqlength#400:500]
retrieves all sequences from the EMBL nucleotide data bank with length
between 400 and 500 base pairs. The two queries can be combined with
the logical operator AND, denoted "&", resulting in the list of entries
common to the two separate result sets:
    [embl-organism:human]&[embl-seqlength#400:500]
SRS provides the three logical operators AND, OR, and BUT NOT
(combination of binary AND and unary NOT) as well as two further link
operators described below. Many retrieval criteria can be specified and
combined by logical operators in a single expression. Only indexed informa-
tion can be used for specifying constraints, and only entire data banks can
be searched.
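The set-oriented semantics can be mimicked with Python sets. The entry IDs below are invented, and only the "&" symbol for AND comes from the text; the `|` and `-` used here are Python's own set operators, not a claim about SRS syntax for OR and BUT NOT.

```python
# Each retrieval command yields a set of entry IDs (invented here).
organism_human = {"HS001", "HS002", "HS003"}  # stands in for [embl-organism:human]
length_400_500 = {"HS002", "HS003", "MM010"}  # stands in for [embl-seqlength#400:500]

both = organism_human & length_400_500        # AND: entries in both sets
either = organism_human | length_400_500      # OR: entries in at least one set
human_only = organism_human - length_400_500  # BUT NOT: in the first, not the second

print(sorted(both))
print(sorted(either))
print(sorted(human_only))
```

Because every operator takes sets and returns a set, results can be nested freely in larger expressions.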

Retrieval of Subentries
Retrieval systems usually consider the entire entry as the only retriev-
able unit, a useful simplification that, however, completely ignores its inter-
nal structure. This approach may be acceptable for data banks such as
PROSITE, 6 where each entry describes one amino acid sequence pattern, but
not for nucleotide sequence data banks such as GenBank 7 or EMBL where
an entry may contain a single exon but also a complete genome or chromo-
some with several hundred genes. Genes, or rather their parts such as the
promoter, coding sequence, introns, exons, and the like, are characterized

6 A. Bairoch and P. Bucher, Nucleic Acids Res. 22, 3583 (1994).


7 D. A. Benson, M. S. Boguski, D. J. Lipman, and J. Ostell, Nucleic Acids Res. 22, 3441 (1994).

in the feature table data field, which increasingly forms the principal portion
of the annotation in EMBL and GenBank. Feature information can also
be found in the protein sequence data banks SWISS-PROT 8 and PIR
(Protein Identification Resource), 9 which document subentries such as
transmembrane segments, calcium binding sites, and disulfide bridges. The
feature table introduces a new structural element into the entry, namely,
a list of subentries. A subentry can be viewed independently of its parent
entry. For instance, the coding sequences (CDS features) from a DNA
sequence entry can be extracted from the parent entry by using the sequence
location information and subsequently translated and aligned with others
as complete entries describing cDNA sequences.
SRS allows the indexing of the feature table such that the whole entry
or the subentry (sequence feature) can be retrieved after an appropriate
query. Each word in the feature description can be indexed and thereby
associated with the feature itself and not just the parent entry. Figure 1
shows four subentries resulting from the query in SWISS-PROT for the
transmembrane regions of the acetylcholine receptor sequence entry
ACHA_BOVIN. The resulting entries contain the ID line of the parent
entry, the relevant part of the feature table, and the subsequence extracted
from the parent sequence as defined by the terminal positions in the fea-
ture annotation.
It is also possible to redefine the begin and end positions so that a
fragment with a "safety" region around them will be provided. This is
useful when extracting, for instance, transmembrane segments which are
mostly putative with much uncertainty as to the termini. Another example
is the retrieval of intron/exon boundaries.
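Extraction of a subsequence with a "safety" region can be sketched as below. The function name, the `margin` parameter, and the toy sequence are assumptions; feature positions are taken to be 1-based and inclusive, as in the feature table of Fig. 1.

```python
def extract_feature(sequence, begin, end, margin=0):
    """Return (start, stop, subsequence) for 1-based inclusive positions.

    The range is widened by `margin` residues on each side and clamped
    to the ends of the parent sequence.
    """
    start = max(1, begin - margin)
    stop = min(len(sequence), end + margin)
    return start, stop, sequence[start - 1:stop]


# Toy parent sequence; the alphabet is used so positions are easy to check.
parent = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(extract_feature(parent, 10, 15))             # the annotated feature only
print(extract_feature(parent, 10, 15, margin=3))   # with three flanking residues
print(extract_feature(parent, 10, 15, margin=20))  # margin clamped at both ends
```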

Linking Data Banks


Most data banks provide cross-references to others. One of the best
examples is SWISS-PROT, which maintains links to more than 20 other
data banks. A natural usage of these cross-references, as is popular with
many browsing systems in the World Wide Web (WWW), results in dis-
played entries with cross-references marked as hypertext links where the
user need only click on the highlighted text to view the associated informa-
tion or entry. Figure 2 shows a SWISS-PROT entry as displayed by the
SRSWWW server (see below). A hypertext link can access any WWW
server in the world where the related data are contained, resulting in a

8 A. Bairoch and B. Boeckmann, Nucleic Acids Res. 22, 3578 (1994).


9 D. G. George, W. C. Barker, H.-W. Mewes, F. Pfeiffer, and A. Tsugita, Nucleic Acids Res.
22, 3569 (1994).
ID   ACHA_BOVIN     STANDARD;      PRT;   457 AA.
DE   ACETYLCHOLINE RECEPTOR PROTEIN, ALPHA CHAIN PRECURSOR.
FT   TRANSMEM    231    255
>P1;ACHA_BOVIN
Length: 25
PLYFIVNVII PCLLFSFLTG LVFYL*

ID   ACHA_BOVIN     STANDARD;      PRT;   457 AA.
DE   ACETYLCHOLINE RECEPTOR PROTEIN, ALPHA CHAIN PRECURSOR.
FT   TRANSMEM    263    281
>P1;ACHA_BOVIN
Length: 19
MTLSISVLLS LTVFLLVIV*

ID   ACHA_BOVIN     STANDARD;      PRT;   457 AA.
DE   ACETYLCHOLINE RECEPTOR PROTEIN, ALPHA CHAIN PRECURSOR.
FT   TRANSMEM    297    316
>P1;ACHA_BOVIN
Length: 20
YMLFTMVFVI ASIIITVIVI*

ID   ACHA_BOVIN     STANDARD;      PRT;   457 AA.
DE   ACETYLCHOLINE RECEPTOR PROTEIN, ALPHA CHAIN PRECURSOR.
FT   TRANSMEM    429    447
>P1;ACHA_BOVIN
Length: 19
ILLAVFMLVC IIGTLAVFA*

FIG. 1. Transmembrane regions of the SWISS-PROT entry ACHA_BOVIN extracted by
SRS. Each of the four subentries includes the ID and definition (DE) lines of the parent
entry and the relevant part of the feature table (FT) together with the subsequence it describes
(see also the parent entry in Fig. 2).

truly distributed system with different and yet intertwined services. Indeed,
this has been so successful that most people think of navigation between
data banks in terms of hypertext links. Though they are easy to comprehend
and very powerful in making accessible related information, they are not
without limitations.
First, hypertext links are unidirectional unless the linked entries provide
a cross-reference back to the referencing entry. For instance, EMBL and
SWISS-PROT are mutually referenced, but only SWISS-PROT points to
PDB (Protein Data Bank, containing atom coordinates and associated
information for known protein three-dimensional structures), which has no
cross-references of any kind. There are many cases where the referencing
is not mutual, or is undesirable if, for instance, private or temporary data
banks are involved.
Second, it is always possible, but may be very inconvenient, to collect
all entries of one type referenced by a single entry. A PROSITE entry,

ID   ACHA_BOVIN     STANDARD;      PRT;   457 AA.
AC   P02709;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT   01-FEB-1994 (REL. 28, LAST ANNOTATION UPDATE)
DE   ACETYLCHOLINE RECEPTOR PROTEIN, ALPHA CHAIN PRECURSOR.
OS   BOS TAURUS (BOVINE).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; ARTIODACTYLA.
RN   [1]
RP   SEQUENCE FROM N.A.
RM   84039794
RA   NODA M., FURUTANI Y., TAKAHASHI H., TOYOSATO M., TANABE T.,
RA   SHIMIZU S., KIKYOTANI S., KAYANO T., HIROSE T., INAYAMA S., NUMA S.;
RL   NATURE 305:818-823(1983).
CC   -!- FUNCTION: AFTER BINDING ACETYLCHOLINE, THE ACHR RESPONDS BY AN
CC       EXTENSIVE CHANGE IN CONFORMATION THAT AFFECTS ALL SUBUNITS AND
CC       LEADS TO OPENING OF AN ION-CONDUCTING CHANNEL ACROSS THE PLASMA
CC       MEMBRANE.
CC   -!- SUBUNIT: PENTAMER OF TWO ALPHA CHAINS, AND ONE EACH OF THE BETA,
CC       DELTA, AND GAMMA CHAINS.
CC   -!- SUBCELLULAR LOCATION: INTEGRAL MEMBRANE PROTEIN.
CC   -!- SIMILARITY: BELONGS TO THE LIGAND-GATED IONIC CHANNELS FAMILY.
DR   EMBL; X02509; BTACBRAI.
DR   PIR; A03169; ACBOAI.
DR   PROSITE; PS00236; NEUROTR_ION_CHANNEL.
KW   RECEPTOR; POSTSYNAPTIC MEMBRANE; IONIC CHANNEL; GLYCOPROTEIN; SIGNAL;
KW   TRANSMEMBRANE.
FT   SIGNAL        1     20
FT   CHAIN        21    457       ACETYLCHOLINE RECEPTOR PROTEIN, ALPHA.
FT   DOMAIN       21    230       EXTRACELLULAR.
FT   TRANSMEM    231    255
FT   TRANSMEM    263    281
FT   TRANSMEM    297    316
FT   DOMAIN      317    428       CYTOPLASMIC.
FT   TRANSMEM    429    447
FT   DISULFID    148    162
FT   DISULFID    212    213       ASSOCIATED WITH RECEPTOR ACTIVATION.
FT   CARBOHYD    161    161       PROBABLE.
SQ   SEQUENCE   457 AA;  51947 MW;  1201373 CN;

FIG. 2. SWISS-PROT entry ACHA_BOVIN as displayed by SRSWWW. The hypertext
links, underlined and in boldface type, refer to related entries in other data banks. Note that
each item in the feature table contains a hypertext link which leads to the respective
subsequence.

which provides a subsequence pattern or motif shared by a protein family,


also lists all members of that family, of which there can be hundreds. To
obtain all related members of such a family, the user needs to click as many
times as there are members and save each accessed entry as a single file.
A growing number of data banks such as PROSITE, ProDom, 10 and HSSP 11
provide references to sequence groups related by family membership or
homology, and often the inconvenience of hypertext links prohibits their
full exploitation.

10 E. L. L. Sonnhammer and D. Kahn, Protein Sci. 3, 482 (1994).


11 C. Sander and R. Schneider, Nucleic Acids Res. 22, 3597 (1994).

Third, linking entire sets of single entries or even entire data banks is
impractical, as each single entry in the set must be accessed, and then the
hypertext link must be located and the linked entries, if any, accessed and
saved. In effect, hypertext links are too cumbersome to ask straightforward
questions such as which member of a set of lysozymes has a known tertiary
structure or which entries of the current set belong to the family of amino-
transferases.
Finally, often a desired link between two data banks does not exist; for
instance, it is impossible in the absence of cross-reference information to
navigate directly from an entry in SWISS-PROT to an associated entry in
EPD (Eukaryotic Promoter Database 12) which describes the promoter of
the gene coding for the protein whose amino acid sequence is given in the
SWISS-PROT entry. However, the link can be found by first accessing a
related entry in EMBL which provides links to EPD. In general, there
are many cases where data banks are not directly but indirectly linked.
Exploitation of this requires of the user intimate and extensive knowledge
regarding the intertwining of all data banks, and in the process of repeated
linking, it is likely the user will get lost in the Web.
The SRS linking algorithm allows the user to overcome these barriers
easily. The cross-references are processed before query time such that
the available data banks are scanned for cross-references and the results
are inserted into special link indices. For each link between two data banks
an index is built. In case of mutual referencing, it is sufficient to read the
references from only one source, alleviating the necessity to scan all data
banks when building a network with many data banks. Once indexed, a
link can be used bidirectionally. All the links must be defined in the ODD
language (see above) and are therefore known to the system. This link
information is used to build a graph structure in memory, where the data
banks are nodes and the links the edges. In this system a link is defined as
the shortest or optimal path that can be found between any desired pair
of nodes or data banks assuming bidirectional edges. Figure 3 shows the
situation for the data banks installed at EMBL. To add a new data bank
to the network, it is sufficient to add a single link to any of its nodes.
Though the usefulness of the link operation depends on the quality of the
cross-references in the respective data banks, it is also valuable in locating
wrong or missing cross-reference information.
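The graph view of the data bank network can be sketched with a standard shortest-path search. The edge list below contains only links mentioned in the text, not the full network of Fig. 3; the uniform cost of 1 per link and the use of Dijkstra's algorithm are assumptions, since the text does not say which algorithm SRS uses.

```python
import heapq

# Undirected links mentioned in the text; every cost of 1 is an assumption.
LINKS = [
    ("embl", "swissprot", 1),
    ("swissprot", "pdb", 1),
    ("embl", "epd", 1),
    ("swissprot", "prosite", 1),
    ("hssp", "pdb", 1),
    ("hssp", "swissprot", 1),
]


def build_graph(links):
    """Data banks are nodes; each link index is usable in both directions."""
    graph = {}
    for bank_a, bank_b, cost in links:
        graph.setdefault(bank_a, []).append((bank_b, cost))
        graph.setdefault(bank_b, []).append((bank_a, cost))
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the cheapest path as a list of nodes."""
    queue = [(0, source, [source])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, step in graph.get(node, []):
            if neighbor not in seen:
                heapq.heappush(queue, (cost + step, neighbor, path + [neighbor]))
    return None


graph = build_graph(LINKS)
print(shortest_path(graph, "swissprot", "epd"))
print(shortest_path(graph, "pdb", "embl"))
```

In this toy network the route from SWISS-PROT to EPD passes through EMBL, which is the indirect link described in the text.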
The SRS query language has, apart from the logical operators previously
described, two further and unique operators: link left and link right. Figure
4 illustrates their functionality where the two data banks A and B are
linked through explicit cross-references in the entries of A used to create

12 P. Bucher and E. N. Trifonov, Nucleic Acids Res. 14, 10009 (1986).

FIG. 3. Network of data banks installed at the EMBL in Heidelberg, Germany. This
diagram can also be called from the SRSWWW server (http://www.embl-heidelberg.de/srs/
srsc?-np), and each box can be clicked to obtain further information about the respective
data bank. Note that LIMB [Listing of Molecular Biology databases, G. Keen, G. Redgrave,
J. Lawton, M. Cinkosky, S. Mishra, J. Fickett, and C. Burks, Math. Comput. Modelling 16,
93 (1992)] is the only databank without a link since it provides information about all the
others.

[Figure 4 diagram: data banks A and B with the results of A>B and A<B.]

FIG. 4. Function of the SRS link operators exemplified on the two data banks A and B
with respective numbered entries. See text for further explanations.

a link index. Operation (1) A > B returns those entries in B that are cross-
referenced by entries in A. Operation (2) A < B returns entries in A that
cross-reference entries in B.
Since all link indices can be used in both directions, it is of no conse-
quence whether entries in A reference B or entries in B reference those
in A. It is therefore valid to state that entries in A are linked to entries in
B, either by references in A or B. The results of the above expressions can
now be redefined as the evaluation of (1) returns entries in B linked to A
and (2) returns entries in A linked to B. The link operator can be viewed
as an arrow pointing to the set [link right (1) and link left (2)] from which
the result will be taken. Reversing the order of operands has the same
effect; hence, A > B and B < A yield the same result, which is entries in
B linked to A. The operands of the link operators can be a data bank name
or a set returned by a retrieval command or any other SRS query language
expression (see above).
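The behavior of the two link operators can be sketched with a list of linked pairs standing in for a link index. The entry names and the functions `link_right` and `link_left` are hypothetical; only the semantics follow the description above.

```python
# Hypothetical link index: (entry in A, entry in B) pairs.
LINK_INDEX = [("a1", "b1"), ("a1", "b2"), ("a3", "b2"), ("a4", "b5")]


def link_right(left_set, right_set, pairs):
    """left > right: entries of right_set linked to an entry of left_set."""
    return {r for l, r in pairs if l in left_set and r in right_set}


def link_left(left_set, right_set, pairs):
    """left < right: entries of left_set linked to an entry of right_set."""
    return {l for l, r in pairs if l in left_set and r in right_set}


A = {"a1", "a2", "a3", "a4"}
B = {"b1", "b2", "b3"}

print(sorted(link_right(A, B, LINK_INDEX)))  # A > B
print(sorted(link_left(A, B, LINK_INDEX)))   # A < B

# The index is usable in both directions, so B < A equals A > B.
flipped = [(b, a) for a, b in LINK_INDEX]
print(link_left(B, A, flipped) == link_right(A, B, LINK_INDEX))
```

Either operand may be a whole data bank or a query result; here `B` deliberately omits `b5`, so the pair involving it contributes to neither result.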
Links can also be specified between data banks that are not direct
neighbors in the link diagram as shown in Fig. 3. Consider the example
pdb > embl which retrieves, for all proteins with solved tertiary structure,
the list of DNA sequences that encode their protein sequences. Algorithmic
examination of the data bank network reveals that the shortest path between
PDB and EMBL is to link PDB first to SWISS-PROT and then to
EMBL, which can then be automatically carried out. It may occur that two
alternative paths exist. The data bank administrator should give priority
to one path by assigning a lower cost value to the link object as specified
in the ODD language.
It is also possible to give a path explicitly by specifying a succession of
links. The expression pdb > hssp > swissprot forces the system not to take
the shortest route from PDB to SWISS-PROT but a deviation via HSSP.
The result is a set of all SWISS-PROT protein sequences with solved tertiary
structure and, in addition, all protein sequences in SWISS-PROT that are
homologous to them. The expression takes advantage of the fact that HSSP
entries list, for every entry in PDB, the SWISS-PROT entries that are
similar above a given percentage residue identity. The link via HSSP can
be understood as an amplification of a set of entries to all their related
counterparts as defined by the data bank chosen. In lieu of HSSP, other
data banks such as PROSITE and ProDom, which collect and multiply align
related protein sequences or subsequences, could have been chosen.
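The "amplification" obtained by routing a link through HSSP can be sketched with dictionaries standing in for link indices. All identifiers are invented, and the helper `follow` is a hypothetical stand-in for one link step.

```python
def follow(entries, link_index):
    """Return every entry reachable from `entries` through one link index."""
    result = set()
    for entry in entries:
        result |= link_index.get(entry, set())
    return result


# Invented identifiers standing in for three link indices.
PDB_TO_SWISSPROT = {"1abc": {"P1"}, "2xyz": {"P2"}}
PDB_TO_HSSP = {"1abc": {"hssp_1abc"}, "2xyz": {"hssp_2xyz"}}
HSSP_TO_SWISSPROT = {
    "hssp_1abc": {"P1", "P1_homolog_a", "P1_homolog_b"},
    "hssp_2xyz": {"P2"},
}

pdb = {"1abc", "2xyz"}
direct = follow(pdb, PDB_TO_SWISSPROT)                           # pdb > swissprot
amplified = follow(follow(pdb, PDB_TO_HSSP), HSSP_TO_SWISSPROT)  # pdb > hssp > swissprot

print(sorted(direct))
print(sorted(amplified))
```

The detour returns the directly linked sequences plus their listed homologs, so the direct result is a subset of the amplified one.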
It has been discussed previously that subentries can be retrieved by
searching in the feature table index. The result of such a query can be
converted to a set of parent entries by a link to a predefined entity "parent."
The result of the conversion can be used to link to other data banks. The

following finds all PDB entries with proteins that have calcium binding
sites as annotated in SWISS-PROT:
    [swissprot-features:ca_bind] > parent > pdb
The link in the reverse direction gives all calcium binding sites with solved
tertiary structure:
    pdb > swissprot > [swissprot-features:ca_bind]
Note that the "parent" keyword is not needed here since the link from
PDB to SWISS-PROT already results in a list of SWISS-PROT entries
needed to link to the appropriate SWISS-PROT subentries or features.
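The conversion from subentries to parent entries, and the reverse query, can be sketched as follows. The feature records, entry names, and PDB codes are invented for illustration, and the data structures are assumptions rather than the SRS representation.

```python
# Invented subentry records: (parent entry, feature key, begin, end).
FEATURES = [
    ("PROT_A", "CA_BIND", 20, 31),
    ("PROT_A", "TRANSMEM", 40, 60),
    ("PROT_B", "CA_BIND", 5, 16),
    ("PROT_C", "DISULFID", 3, 9),
]
# Invented link index from SWISS-PROT entries to PDB entries.
SWISSPROT_TO_PDB = {"PROT_A": {"1aaa"}, "PROT_C": {"3ccc"}}

# Subentry query, in the spirit of [swissprot-features:ca_bind].
ca_sites = [f for f in FEATURES if f[1] == "CA_BIND"]

# "> parent": convert subentries to the set of their parent entries.
parents = {f[0] for f in ca_sites}

# "> pdb": follow the link index from the parent entries.
structures = set().union(*(SWISSPROT_TO_PDB.get(p, set()) for p in parents))

# Reverse direction: restrict the subentries to parents that have a structure.
with_structure = set(SWISSPROT_TO_PDB)
sites_with_structure = [f for f in ca_sites if f[0] in with_structure]

print(sorted(structures))
print(sites_with_structure)
```

The forward query ends with whole PDB entries, whereas the reverse query ends with subentries, which is why no conversion to parents is needed in the second case.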

Interfaces to Sequence Retrieval System


Sequence Retrieval System is written in the programming language
ANSI C and runs on almost all UNIX platforms, as well as VMS, IBM-
compatible, and Macintosh computers. Several interfaces have been written
for SRS including graphical, character terminal-based, and World Wide
Web (WWW) interfaces. Discussed subsequently are the availability of
SRS, the application programmers interface (API), the command line inter-
face, and the WWW server.

Availability
The SRS program and its interfaces are provided together with the SRS
distribution file, which can be obtained via anonymous ftp from the server
felix.embl-heidelberg.de in the directory pub/software/unix/srs. The current
release of SRS is 4.08, with new releases made available every few months.
A news group, bionet.software.srs, provides a forum for discussion and
informs the community of bug fixes, new features, and releases. SRS is
also evolving into a collaborative effort: most of the interfaces have been
developed by other groups, and many of the data bank format descriptions
have been written by individuals maintaining SRS services at their sites.
A large part of the SRS functionality will be accessible through the
program "lookup" of the GCG program package for sequence analysis. 13
It will be distributed starting with release 8.1 of GCG.

SRS Programming Interface


The core of SRS consists of a library of functions to perform queries
and extract and parse entries from flat file data banks in their original

13 J. Devereux, P. Haeberli, and O. Smithies, Nucleic Acids Res. 12, 387 (1984).

format. Entries can be processed in sequential order or by random access.


Full entries or individual data fields can be extracted from about 80 different
flat file data banks. A number of functions inform the user of the status and
structure of the indexed data banks and the data bank network. Emphasis is
on the extraction and output of sequences and subsequences in various
formats. The function library is documented and is particularly suited for
extending interpreted languages such as Perl, Tcl, and Python. SRS relies
on a number of powerful tools such as the parser and the indexer, which
can be directly used through the API.

Command Line Interface


The command line interface, the program getz, is a UNIX-style command
which gives access to all the retrieval power of SRS. Although graphi-
cal interfaces are more intuitive to use, getz provides a much more direct
mode of operation and can be called from other programs or, through
piping, can be combined with other UNIX commands. Also of interest is
an enhanced version, hgetz, that accesses local as well as remote SRS
servers through the HASSLE protocol. 14

World Wide Web Server


The most popular access to SRS is through its World Wide Web server.
The server combines the power of SRS links with the ease of WWW
hypertext navigation. In contrast to most other WWW services, it maintains
state, that is, previous user actions and selections are remembered so that
all queries in the session can be reinspected or combined by SRS query
language expressions.
Figure 5 illustrates the structure of the server. The top page allows the
selection of the data banks to be queried and assigns a user ID on first
entry. The next page presents a query form and options for the display of
single entries or the entry list. A successful query leads to the entry list
page, which gives access to the single entries and options to link the entire
result set or single entries to other data banks, which must be selected on
a separate page. Cross-references in single entries are, whenever possible,
converted to hypertext links, which lead to entries served by the local or
a remote SRSWWW installation or other servers such as the one from
GDB (Genome Data Base). 15 The query manager page lists all successful
queries conducted during the session and lets the user combine them by
SRS query language expressions. Auxiliary pages, one for each data bank,

14 R. Doelz, Comput. Appl. Biosci. 10, 31 (1994).


15 K. H. Fasman, A. J. Cuticchia, and D. T. Kingsbury, Nucleic Acids Res. 22, 3462 (1994).

[Figure 5 diagram. Recoverable labels: http://www.embl-heidelberg.de/srs/srsc, Top Page, Databank Information, Select Link, Query Manager, hypertext link.]

FIG. 5. Structure of the WWW server. Only the main pages are illustrated; arrows indicate
entry paths. Top Page is the entry point of SRSWWW and can be obtained with the specified
URL (Uniform Resource Locator). See text for further details.

inform the user about the indexed data banks, often with links to the
distributor, and also allow the user to explore the data bank network. All
pages without exception are built during run time so that changes in the
system configuration, such as adding a new data bank, are immediately
reflected by the interface.
The SRSWWW server is currently installed at many sites throughout
the world. The number of data banks installed on a single node varies but
can reach as many as 50. A special program polls all these servers once a
day and retrieves the name, the number of entries, the release number, and

the indexing date for each data bank installed at the site. The information is
then compiled in a report which lists all data banks served by all nodes.
This list includes 13 sites with a total of 80 data banks. Figure 6 shows the
entry for the SWISS-PROT protein sequence data bank, which also includes links
to the data bank information page and relates status information for all
nodes serving that data bank. This information is useful both to the user
who can select the closest server to avoid slow internet connections or the
one with the most recent release of a given data bank and to the node
administrators who can verify the currency of their system.
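How a user or administrator might exploit such a report can be sketched as below. The rows are a subset of those in Fig. 6; the tuple layout and the two helper functions are assumptions made for illustration and are not part of the polling program.

```python
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def parse_date(text):
    """Parse dates written as in Fig. 6, for example 28-Jun-1995."""
    day, month, year = text.split("-")
    return date(int(year), MONTHS[month], int(day))


# (site, release if specified, number of entries, indexing date)
STATUS = [
    ("ABC, Hungary", None, 40292, "21-Feb-1995"),
    ("BiO, Oslo", "31.0", 43470, "28-Jun-1995"),
    ("EBI, Hinxton, UK", None, 43470, "18-Jun-1995"),
    ("EMBL, Heidelberg", None, 43470, "22-Mar-1995"),
]


def most_recently_indexed(status):
    """User view: the site whose copy was indexed last."""
    return max(status, key=lambda row: parse_date(row[3]))[0]


def sites_behind(status):
    """Administrator view: sites serving fewer entries than the largest copy."""
    largest = max(row[2] for row in status)
    return [row[0] for row in status if row[2] < largest]


print(most_recently_indexed(STATUS))
print(sites_behind(STATUS))
```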

Discussion
Sequence Retrieval System is an integrated system that provides a
homogeneous interface to all flat file data banks retained in their original
format. It is a retrieval system that allows access to, but not the depositing
of, data. Several elements are combined into a system that extends the
power of normal retrieval systems and which rivals that of real databases,
such as a relational system, without compromising speed. These elements
include languages for data bank and syntax definition, a programmable
parser, an indexing system, support for subentries, a novel system for ex-
ploiting links between data banks, and a query language. The database
linking is a unique feature that considerably extends the capability of hyper-
text links.

SWISSPROT
Server                    Release   Entries   Indexing date   Info
ABC, Hungary                        40292     21-Feb-1995     ?
BEN, Brussels                       43470     07-Jun-1995     ?
BiO, Oslo                 31.0      43470     28-Jun-1995
Biozentrum, Basel                   43470     22-Mar-1995
CAOS/CAMM, Nijmegen       31.0      43470     31-May-1995     ?
CSC, Finland              31.0      43470     08-May-1995     ?
EBI, Hinxton, UK                    43470     18-Jun-1995     ?
EMBL, Heidelberg                    43470     22-Mar-1995     ?
IUBio Archive Indiana     31.0      43470     15-Apr-1995     ?
INSERM, France                      43470     02-Jun-1995     ?
Sanger, Hinxton, UK       31.0      43470     05-Jun-1995
Skirball Inst., NY                  43470     13-Jun-1995
Weizmann, Israel                    43470     01-Jun-1995

FIG. 6. Status of the data bank SWISS-PROT as installed on 13 SRSWWW servers. The
entry has been taken from the status page (http://www.embl-heidelberg.de/srs/status.html)
describing all data banks served by all 13 SRSWWW nodes. From left to right are listed the
following: a hypertext link to the server; if specified, the release number of the installed
version of SWISS-PROT; the number of entries; the indexing date; and the hypertext link that
leads to a page with further information about SWISS-PROT as installed at the respective site.

SRS nonetheless has its limitations. To benefit from the linking capabili-
ties, all data banks in the network must be installed at one site. This is in
contrast to World Wide Web hypertext links, which very naturally allow
navigation from one site to another. All information needs to be indexed
before it can be accessed such that addition of new entries requires rein-
dexing of the entire data bank. Work is in progress to overcome these
limitations. It is also intended to enhance SRS from a server of textual
information to an object server so that the retrieved data can be conve-
niently submitted to analysis programs such as sequence analysis tools.

Acknowledgment
The authors are grateful for financial support from the European Union (Grant Gene-
CT-93-0043) under the Biomed I program.
