
FILE STRUCTURES

FOR ON-LINE SYSTEMS

DAVID LEFKOVITZ, Ph.D.


Associate Professor
Moore School of Engineering
University of Pennsylvania

Staff Consultant
Computer Command and Control Company
Philadelphia

Spartan Books
Macmillan Education
Dedicated to the memory of my father
Joseph Lefkovitz

ISBN 978-1-349-00695-3 ISBN 978-1-349-00693-9 (eBook)


DOI 10.1007/978-1-349-00693-9
Library of Congress Catalog Card Number 68-26073

Hayden Book Company, Inc.


50 Essex Street, Rochelle Park, New Jersey 07662
Copyright © 1969 by Computer Command and Control Company. All rights
reserved. No part of this book may be reprinted, or reproduced, or utilized
in any form or by any electronic, mechanical, or other means, now known
or hereafter invented, including photocopying and recording, or in any
information storage and retrieval system, without permission in writing from
the copyright holder and the publisher.
Softcover reprint of the hardcover 1st edition 1969 978-0-333-10113-1
Spartan Books are distributed throughout the world by Hayden Book
Company, Inc., and its agents.

3 4 5 6 7 8 9 PRINTING

75 76 77 78 YEAR
PREFACE

In June, 1967, the Professional Development Seminars of the Association for Computing Machinery offered a one-day course on
File Structures for On-Line Systems to its membership, and to the
data processing community at large. These seminars were given
twenty-four times throughout the United States during 1967 and 1968,
and it was the continued and widespread interest in this subject that
influenced the author to produce this book. The subject is approached
here in much the same way as presented in the lectures, although a
number of topics are treated more comprehensively than was possible
in the lectures, a more complete treatment of memory requirement
and response time formulations is presented, and the initial chapters
provide a perspective from which one can relate over-all system
functional requirements to file structure design criteria.
The structuring of files for on-line systems is pertinent to two broad
categories of use: on-line programming systems and on-line informa-
tion storage and retrieval systems. The demands of the latter are
usually more stringent because of the significantly greater number of
keyed entries to the file, the requirement for multiple key access (i.e.,
logical key combinations) to the file, and a considerably higher vol-
ume of data. The response time requirements of the programming
systems may be more critical than those of information systems, but
the simplicity of access modes tends to offset file structuring problems
caused by this requirement. An example of an on-line programming
system file structure problem is the construction of an output file.
Since the quantity of output generated by a given program in some
time interval cannot be known in advance (except possibly within a
system or programmer imposed upper limit), storage allocation on a
direct access storage device (DASD) for the output data can most
expeditiously be made via a chained list structure. A file with a name
(or key) that can be related to a program and a terminal device is
opened on the DASD with a relatively small quantum of allocated
space, such as a seventy-two character line of type or some
small number of lines. Should the program generated data ex-
ceed this buffer capacity, another such buffer could be linked by
a chained address, since the contiguous space on the disk may
have been assigned, in the interim, to another program. This list
structure is what was first described in the literature as a threaded
list.[8] It is easy to construct, conserving of memory space, and
fast to retrieve because every record on the list (i.e., every link
of the chain) is a desired record for subsequent processing; however,
as soon as multiple key entry to a file is allowed, the situation becomes
somewhat more complex, because every record on the list is not
necessarily required, and the problem of list retrieval efficiency must
be faced. To extend this same example, if it is assumed that an addi-
tional key denoting output device were attached to each output buffer
(in which case these records assume a logical significance since the
entire file of records is no longer a continuous data stream), and if
the situation were to arise wherein it is desired to access "the next out-
put record for Program X to be printed on the line printer," then a
number of alternatives are possible for the organization of this file,
including, (1) threading the records on a Program tag and qualifying
it after accession by a Terminal tag, (2) threading the records on a
Terminal tag and qualifying by a Program tag, or (3) threading the
records on both tags and accessing records from the shorter list. The
latter file structure is often referred to as a multiple threaded list or
Multilist.[10] This example illustrates the difference between the
single and multiple key record, but it is more typical of generalized
information storage and retrieval system file structuring requirements
than of on-line programming requirements. The generation and
maintenance of input, intermediate, output, and program files for
on-line programming systems generally involves at most the simple
threaded list and can frequently make use of a serial structure. The file
structure of information systems, on the other hand, is often multiple
key, which imposes design considerations related to speed of initial,
successive, and total system response, speed of update, quality of
presearch statistics, ease of programming, and list structure overhead
in time and space.
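The Multilist organization described in this example can be sketched in a few lines of present-day code. The sketch below is illustrative only: the record fields, the directory of list heads and lengths, and the retrieval strategy are modern paraphrases of the idea, not the book's implementation, and key values of the two key types are assumed distinct.

```python
# Illustrative sketch of a multiple threaded list (Multilist).
# Each record carries one chain link per key type; a directory maps
# each key value to the head of its chain and the chain's length.

class Record:
    def __init__(self, data, program, terminal):
        self.data = data
        self.program = program        # Program tag (first key)
        self.terminal = terminal      # Terminal tag (second key)
        self.next_by_program = None   # link of the Program chain
        self.next_by_terminal = None  # link of the Terminal chain

class Multilist:
    def __init__(self):
        self.heads = {}    # key value -> first Record on that chain
        self.lengths = {}  # key value -> list length

    def insert(self, rec):
        # Thread the record onto both of its chains (new records are
        # linked at the head, pointing back to the previous head).
        for key, link in ((rec.program, "next_by_program"),
                          (rec.terminal, "next_by_terminal")):
            setattr(rec, link, self.heads.get(key))
            self.heads[key] = rec
            self.lengths[key] = self.lengths.get(key, 0) + 1

    def retrieve(self, program, terminal):
        # Alternative (3) above: thread down the shorter of the two
        # lists, qualifying each record by the other key.
        if self.lengths.get(program, 0) <= self.lengths.get(terminal, 0):
            rec, link = self.heads.get(program), "next_by_program"
        else:
            rec, link = self.heads.get(terminal), "next_by_terminal"
        while rec is not None:
            if rec.program == program and rec.terminal == terminal:
                yield rec
            rec = getattr(rec, link)
```

The per-key length counts kept in the directory are an example of the presearch statistics mentioned above: they let the retrieval routine choose the shorter chain before any record on the file is accessed.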
This book, therefore, uses the information system as a frame of
reference for discussion of on-line file structures, largely because file
management techniques for programming systems can easily be drawn
from those developed for the former. The first chapter of the book
relates file structures functionally to the requirements of automated
information systems. In the second chapter direct access storages are
divided into three classes, and concepts relating pertinent hardware
features of these devices to software design are introduced. The third
chapter distinguishes the concepts of information structure and file
structure, where the former is defined to be an inherent property of
information as it may be generated and organized by the people who
utilize the system, while the latter is the means by which the automa-
tion designer organizes these data into various files for storage in the
random access memories and retrieval and update by the computer
programs. Two basic types of information structure, associative and
hierarchic, are analyzed in terms of alternative file structures. This
chapter views the file design process from inside out; that is, the con-
struction of record and subrecord controls for the file structure are
examined, which satisfy the information structure requirements. In
Chapter IV an outside-in view of the file structure is presented by
means of a prototype query language that might appear as the inter-
face between the users and the file system.
In Chapter V, all of the techniques to be described in the remainder
of the book are classified, and their general characteristics and prin-
cipal design applications are discussed. Also presented in this chapter
is the notion of the two-stage retrieval process: Directory decoding
and file search. In Chapter VI, various techniques for the design and
construction of Directories and their decoders are presented along
with storage requirement and timing formulations; in Chapter VII,
methods of structuring or organizing the files containing the search-
able records are presented along with storage and timing formulations.
The final chapter of the book deals with the on-line updating and
maintenance of the various file organizations discussed in Chapter VII.
The primary objective of the book is to present the concepts of file
design as well as sufficiently detailed descriptions of techniques that
have been used to implement these designs. In presenting concepts, the
areas of maneuver or design tradeoff are emphasized, particularly
with regard to system response, cost, and programming complexity.
The techniques are described at a level that a somewhat experienced
programmer should have little difficulty implementing, and, in
addition, the basic elements of design enable either the programmer
or the analyst to modify any of the specific techniques to suit a given
problem.
The author is indebted to a number of persons who have partici-
pated in various ways to the accumulation and evaluation of material
contained in this book. Firstly, the author acknowledges the contri-
butions of Dr. Noah S. Prywes who supervised much of the govern-
ment sponsored research and development work under which many of
these file structuring techniques were developed and tested. The
author's colecturer in the ACM seminars, Thomas Angell of Computer
Command and Control Company, has contributed through discussion,
evaluation, and improvement of presentation of the material. James
Adams, ACM Professional Development Seminar coordinator, was
primarily instrumental in the organization and promotion of the lec-
tures and, in addition, contributed substantially to their improvement
through evaluation of the attendee critique sheets. Others who have
made contributions through discussion, comment, and programming
design and application are James Russell of Computer Command
and Control Company, and Ruth Powers of the University of
Pennsylvania.
The author also wishes to thank Mrs. Esther Cramer and Mrs. Lois
Porten for their excellent typing of the several drafts, and Miss Mary
Jane Potter for drafting all of the book's illustrations.
CONTENTS

I. The Information System 1
1. The Information System Model

II. Direct Access Storage Devices 27

III. Information Structure and File Organization 37
1. Functional Requirements on File Organization 37
2. Information Structure and File Organization 43

IV. The Query Language 60

V. Classification of File Organization Techniques 82

VI. Techniques of Directory Decoding 92
1. The Tree with Fixed Length Key-word Truncation 93
2. The Tree with Variable Length Key-word Unique Truncation 98
3. The Complete Variable Length Key-word Tree 104
4. The Randomizing Method 106
5. Formulation of Decoder Memory Requirements and Access Time 111
5.1 Tree Decoder Formulations 111
5.2 Randomizer Formulations 116
5.3 Memory Requirement Comparisons 118
6. Decoding Speed 122

VII. Techniques of Search File Organization 126
1. The Multilist File Organization 127
2. The Inverted List File Organization 129
3. The Controlled List Length Multilist 132
4. Cellular Partitions 136
5. Automatic Classification 143
6. On-Line Generation of List Structured Files 143
7. File Access Timing 150

VIII. On-Line File Update and Maintenance 155
1. On-Line Update of a Multilist File 157
2. On-Line Update of an Inverted List Structure 165
3. On-Line Update of Cellular Partitions 169
4. Space Maintenance 169
5. Update Timing 172
6. Summary of File Structuring Techniques 177
7. Conclusions 180

Appendix A The Information System Implementation Process 181

Appendix B Automatic Classification and Its Application to the Retrieval Process 186
1. Retrieval by a Conjunction of Keys 189
2. Construction of the Classification Tree, Tc 191
2.1 Construction of an Intermediate Tree, TI 191
2.2 General Description of the TI Construction Process 193
2.3 Construction of Tc from TI 194
2.4 Formal Description of the TI Construction Algorithm 194
3. An Illustrative Example of TI and Tc Construction 196

Appendix C 202

Appendix D Discussion Topics and Problems for Solution 209

Bibliography 211

Index 213
CHAPTER I

THE INFORMATION SYSTEM

1. The Information System Model

AN INFORMATION SYSTEM is a very complex and sophisticated communication system, but, unlike a telephone system, its major complexity lies not in the switching of lines among various users, but rather
in the provision of file structures as a central medium of communica-
tion through which relatively complicated data processing takes place.
Another difference is that in a telephone system, the several users
approach and use the system in essentially the same mode of operation.
That is, they are usually transmitters as well as receptors of informa-
tion. Furthermore, processing of the information that they transmit
or receive is usually considered to be a detractive or undesirable prop-
erty of the communication system, if it in any way modifies the infor-
mation per se, its meaning or content. The information system, on
the other hand, generally has two different classes of users, each of
which approaches the system in a different way and for a different
purpose, although certain individuals may belong to both classes.
Figure 1 presents the model of an information system, and these two
classes of people are there labeled Generator and User. The medium
of communication, as indicated within the ovals in the center of the
diagram are the files. The Generators of information transmit to the
files via depositions of various types that may generically be called
Documents. These may commonly include monographs, serials, re-
ports, letters, memoranda, messages, ledgers, and transactions, or they
may include graphic images and voice, spectral or TV recordings.
The documents are processed in two ways. First there is a straightforward physical inventory of the documents into a file. The means of deposition into this file may vary from the untransformed original copy
to microfilm reduction, through to some form of encoding such as
might take place in the digital storage of a magnetic or optical medium.
The second line of document processing is through Data Reduction,
wherein some manner of intelligence is applied that will reduce or
extract meaningful information from the document in order to make
it more accessible through an auxiliary or Reference File to the User.
The Reference File is generally much smaller than the Document File
since it is basically an index to the latter, although it may contain
synopses or abstracts or even significant data extracts, in which case
it might serve as a surrogate for the Document File. The User of the
system will generally approach the Reference File first in order to
define a subclass of the Document File and then, if required, to access
complete documents from the latter.
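The Reference File's role as an index or surrogate for the Document File can be sketched as a two-step lookup; the accession numbers, field names, and term-set index below are invented for illustration and are not drawn from the book.

```python
# Illustrative model of the Reference File / Document File split.
# The Document File holds the full depositions; the Reference File
# holds abbreviated records: index terms plus a pointer (an assumed
# accession number) into the Document File.

document_file = {
    101: "Full text of document 101 ...",
    102: "Full text of document 102 ...",
}

reference_file = [
    {"terms": {"steam engine", "economy"}, "doc": 101},
    {"terms": {"automation"},              "doc": 102},
]

def search(*terms):
    """Use the Reference File to define a subclass of the Document File."""
    wanted = set(terms)
    return [r["doc"] for r in reference_file if wanted <= r["terms"]]

def fetch(*terms):
    """Then, if required, access the complete documents themselves."""
    return [document_file[d] for d in search(*terms)]
```

The Reference File record here is much smaller than the document it points to, which is why the User's first approach is to the Reference File rather than to the Document File directly.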

[Figure 1 (diagram): Generators deposit Documents; Data Reduction builds the Reference File alongside the Document File; Users communicate with the files through the User Interface.]

Fig. 1. Model of an Information System



The two most commonly applied methods of Data Reduction are indexing and abstracting, and it is the interposition of these processes
between the Generator and User that constitutes the essential differ-
ence between the telephone and information systems within the class
of communication systems.
Indexing is the process of assigning descriptors or terms or tags,
as they are variously called, to a document. These terms telegraph-
ically describe the essential information content of the document and
are incorporated into a Reference File record that relates to the docu-
ment and which, in an abbreviated format, is designed for browsing
or quick retrieval from the Document File. Indexing has always been
performed by human beings because of the intellectual effort involved.
Furthermore the relevance of retrievals as well as the comprehensive-
ness of recall is directly related to the quality of indexing. More re-
cently experimenters have given some attention to machine assignment
of terms based upon text or abstract analysis,[1,2] but these techniques
cannot be expected to substitute for human indexing for quite some
time.
The abstract is a very brief statement in natural language of the
essential content of the document; there are two types in general use
today, the informative abstract and the indicative abstract. The former
provides information and data that may be extracted directly from the
document itself, and, if well conceived, can sometimes be used in lieu
of the document in order to obtain information, since it may contain
particular values or results that are developed by the document. The
indicative abstract, which is more popular with libraries, indicates, as
the name suggests, what the document is about and what kind of
information is contained therein, rather than the explicit citation of
facts, results, and conclusions of the document. After Data Reduc-
tion, as indicated by the open broad arrows on the diagram, the
Generator of information has been completely served. His original
depositions have been inventoried and filed, and a means has been
created for people to scan, browse, and find relevant documents in
the file by means of the Data Reduction and the subsequent creation
of the Reference File.
In the other direction, starting at the bottom of Fig. 1, the User
approaches the file. He generally works through an Interface which
is designed to provide him with the most expedient access to the file
system. This Interface provides both an input as well as an output
means of communication and translation for the User. As an input
medium the Interface may involve verbal or written communication through a person such as a reference librarian, or it may involve an automated Interface such as mechanized typing through a computer
console. Information flow from the User Interface to the files is almost
exclusively via the Reference File, since its primary function is to
simplify and enhance access to the Document File. Except in the case
where a classification system is employed within the document re-
pository itself, such as in a library, the User, either himself or under
the aegis of the interface, does not usually approach the Document
File directly. Furthermore, in a library, where the documents are
shelved according to a classification system, browsing along shelves
is usually localized and is used after consulting the Reference File for
the particular shelving area of interest. Output to the User via the
Interface, however, may come from either the Reference or the Docu-
ment File, and its form may be verbal or written from a reference
librarian. It may consist of microfilm or hard copy of documents
from the file, or it may consist of console typing or CRT graphics
from an automated system.
This book is largely concerned with the design of Reference File
structures for automated systems in which the storage medium is a
large scale direct access memory, such as a magnetic disk file. There-
fore, the focus of attention throughout will be within the oval labeled
Reference File in Fig. 1, and its communicating components within
the system-Data Reduction, Document File, and User Interface-
will be considered only insofar as they may affect the design of the
Reference File.
Information systems can broadly be classified into two levels, the
first called storage and retrieval systems and the second, control
systems. The former performs a very singular function and is essen-
tially a mechanized extension of the present-day library concept. It
stores and updates a data collection, catalogs (or indexes) it, and
enables retrieval of the stored data. Utilization of the data, once obtained from the Document File, is a function that is outside the domain
of this system. Examples of storage and retrieval systems are:

1) Document referral systems (libraries).
2) Simple parts inventories in which the quantity, description, and auxiliary data about individual parts is maintained.
3) Fact retrieval systems (handbooks).
4) Systematic Inventories in which the individual parts are in
some systematic way related. Such systems are used not only
to log and account for the individual parts, but also to inquire into and to find meaningful relationships among the various parts. There must therefore be some systematization
or procedure whereby relationships among the parts can be
formally defined.
An example of such an inventory is that of chemical com-
pounds, wherein the structural formula of the compound is
explicitly stored in the document file. Each part has a sys-
tematic relationship to the other parts via a description of
the chemical compound, which is essentially a graph. Hence
classes of related chemicals may be retrieved based upon
graphical similarities. In a like category are electronic circuits.
Delivery routes are another example, where each route may
be considered to be a record or a part of the inventory, and
where there is an interrelationship among the routes that
forms an overall network, and hence various traversals or
selection of routes may be made based upon distance, cost of
usage, and availability. In a similar way, schedules, reserva-
tions, surveillance systems and, at times, personnel or man-
power assignment systems may be considered to be in the
class of Systematic Inventories.

The control information systems contain within them a storage and retrieval system and, in addition, provide further processing that
imparts to them a semblance of intelligent or heuristic behavior.
These systems aid in decision making by automatically retrieving and
correlating appropriate records and by preparing graphs, decision
tables, and summaries. The more popular names of systems that
fall into this class are Command and Control systems and Manage-
ment Information systems. For example, a military logistic system
will invariably contain both parts and systematic inventories in order
to determine quantities on hand, availabilities, locations, means of
routing, etc., and, having access to such information systems, will
combine the retrieval capability with a higher order of intelligence
that may be partially automated in order to provide the essentials for
decision making or to make the decision itself; hence, the name
Command and Control. Similarly, Command and Control systems
could be constructed for military strategics and tactics, the former
being somewhat more complex, because of its nonmilitary implica-
tions, which may be political, economic, social, and legal, and is
therefore more likely in the present state of the art to be less suscep-
tive of automation. Finally, a Management Information system is
also a Command and Control system because it provides basic decision making data to management and, at times, may actually be relied
upon to make or assist in the making of certain types of decisions.
One of the most crucial design considerations of an information
system is the separation of manual and automated functions, and
within the latter, the determination of the level of automation. In
general, the performance of a large scale IS & R system is most sensi-
tive to the quality of indexing since this governs both the comprehen-
siveness and the relevance of recall. Secondary to this quality factor
are the quantitative factors of file size and access and update speed.
It is also interesting to note that indexing, which relates to system
quality, is best performed, at present, by humans because of the high
intellectual requirement. The quantitative factors, on the other hand,
involve numerous repetitive and highly mechanical functions, and are
therefore the proper focus of automation, where the degree of auto-
mated sophistication may be in proportion to the quantitative demands
of system performance. That is, the functions implied by Data
Reduction in Fig. 1 are largely performed by humans, whereas the
procedures associated with the generation and search of files are a
combination of manual and mechanical labor. In time, useful tech-
niques will be developed to give computer assistance to indexers and
abstracters, so as to further divide that labor, using the large scale
memories and high speed retrieval of the computer to best advantage,
leaving the application of experience, reasoning, and decision making
to humans. In the more distant future, these preserves will ultimately
be breached by the computer, and another division of labor will occur,
and, thus, automation in the system may progress from its present
strong position (in the purely mechanical functions of sorting, listing,
storing, retrieving, printing, etc.) toward the functions of higher
intellect.
Again, to maintain perspective, the techniques described in this
book are directed toward these purely mechanical functions, not those
that would aid directly in the performance of intellectual system func-
tions such as indexing and abstracting. However, before proceeding
to the systematic development of these techniques, it is instructive to
look a little more closely at the information system and its model in
order to place the software requirements in a proper perspective, and
to provide the necessary basis for the development of the techniques
themselves.
The automated IS & R system has four major functional com-
ponents, three of which are directly related to hardware devices, and
the fourth, which is conceptual, has hardware analogs. These four components and examples of their implementation are presented in
Table 1. Immediately evident from this table is the range of com-
ponent implementation from manual to electronically sophisticated.
They are all based upon the three existing means of information
deposition-paper and ink, light sensitive film, and magnetic media.
Furthermore, information theorists recognize two basic abstract modes
of information deposition, analog and digital, both of which are
responsive to all of the above three media. The use of positioned
symbols each with a finite code set imprinted on paper would
be regarded as a digital or digitized deposition of information, and
the recording of graphical or continuous line data would be con-
sidered to be analog in the printed medium. Light sensitive film may
be (1) a micro image of paper and ink, (2) a motion picture film,
which provides the additional analog dimension of time sequencing,
resulting in the appearance of physical motion, and (3) a storage
medium for binary encoded digital information that can be sensed
and decoded by logical computing elements. Finally, the magnetic
media can store analog information for voice, video, and other fre-
quencies of spectral recording, or digital information in the form of
computer magnetic tape.
As indicated by Table 1, the major system hardware components
fall into three device categories-Input, File Storage, and Output-
and the method of file organization is largely determined by the
mechanism for retrieval. The techniques listed in the table are par-
ticularly relevant to the Reference File although they may also deter-
mine the arrangement of storage in the Document File.
Input Devices for information systems include the pen, typewriter,
copier, and printer, which are used in conjunction with paper and
ink deposition. The paper tape and punched card equipment are
used for digital recording, also in a paper medium. The optical
scanner is a means for transferring from paper and ink into an elec-
tronically digitized format for subsequent recording on either film or
magnetic media. The tape recorder is used exclusively as a magnetic
medium for video or voice storage, and the CRT/keyboard/light pen
and electro writer are used for the input of either digital or analog
data directly from hand manipulation to a digital format via an analog
to digital conversion process.
File Storage Media includes sheet paper, which is usually stored in
shelves or cabinets. Punched card files may also be stored in steel
cabinets of special design. Microfilm files are usually stored in reels or on microfiche cards, and the magnetic media are stored in a variety of forms, including reels, cards, strips, disks, and drums. The speed with which information can be accessed from the files is largely dependent on the physical characteristics of the file storage devices. Furthermore, as may be supposed, it is generally the case that increased cost of the storage devices is associated with higher density of storage, more ramified access to the storage, and higher access speed from the device.

Table 1
FUNCTIONAL COMPONENTS OF AN INFORMATION SYSTEM

Input Devices
pen
typewriter
copier
printer
paper tape and punched card
optical scanner
tape recorder
CRT/keyboard/light pen
electro writer

File Storage Media
sheet paper
punched cards
microfilm
magnetics (analog and digital)

Output Devices
typewriter
computer printer
microfilm reader
tape recorder
CRT
slide or film projector

Retrieval Mechanisms
catalog cards
punched cards
coordinate index cards
EDP serial systems
EDP random systems
In contrast to the Input Device, the primary function of the Output
or Display Device is to present information to the user in a con-
venient form. Individual responses from an automated system usually
appear on a typewriter, a high-speed line printer, or a cathode-ray tube.
Responses from the Document File depository, if not in a bound
hard copy, may be displayed on a microfilm reader or on a slide or
film projector, and where the information deposition is on voice
recordings they may be presented on a standard tape recorder.
Considering the long history of classification and storage and re-
trieval systems, there are surprisingly few retrieval mechanisms,
although there are numerous variants on the few themes; two broad
classes of file organization dictate the exact form of these mechanisms.
These are hierarchic (classified) systems and coordinate index sys-
tems, and the retrieval mechanisms for a particular system may em-
ploy both. For example, most libraries shelve their books according
to some classification schedule, such as the Dewey or Universal
Decimal System. This conceptually places the book at a node in a
classification tree, and, in principle, a search could be made for this
book by branching from a universal node into paths and subpaths
of greater specificity, until a node of the tree is reached that con-
tains the desired subject definition (e.g. Electrical Engineering is
[621.], Automation is [621.8], and Systems Engineering is [621.81]
in the Dewey Classification). Then, since the shelf assignments reflect
the classification codes associated with each of the tree nodes, the
search is concluded by a physical scan of the indicated shelf, which
is internally ordered by document author. A shortcoming of this
system is that cross referencing of subject areas is very difficult. For
example, a book relating to the impact of the steam engine on the
nineteenth-century-British economy could not readily be located by
the above procedure, since the subject area of steam engines would
probably be at a different node in the tree than that of the nineteenth-
century-British economy. A search under steam engine or nineteenth-
century-British economy would, of course, be very time consuming.
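The descent from a universal node into paths of greater specificity can be sketched as a prefix walk over classification codes, since each node's code extends its parent's. The schedule fragment below is illustrative only: the three Engineering codes come from the text above, and the remaining entries are assumed for the example, not quoted from the Dewey schedule.

```python
# Illustrative fragment of a classification tree keyed by Dewey-style
# codes. Because a child's code extends its parent's, a search for a
# subject is a descent through successively longer code prefixes.

schedule = {
    "6":      "Technology",
    "62":     "Engineering",
    "621":    "Electrical Engineering",
    "621.8":  "Automation",
    "621.81": "Systems Engineering",
}

def descend(code):
    """Return the chain of nodes of increasing specificity ending at code."""
    digits = code.replace(".", "")
    path = []
    for i in range(1, len(digits) + 1):
        prefix = digits[:i]
        # Dewey-style codes insert a point after the third digit.
        key = prefix if len(prefix) <= 3 else prefix[:3] + "." + prefix[3:]
        if key in schedule:
            path.append((key, schedule[key]))
    return path
```

The terminal node of the descent supplies the shelf location, after which the search concludes with a physical scan of the indicated shelf.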
Therefore, an alphabetic subject catalog is superimposed upon this system, as a Reference File. A few subject headings are assigned to
each document, and as many cards with the document title, author,
bibliographic imprint, and subject headings are generated. A catalog
is then created by ordering all the cards alphabetically by subject
heading, title, and author. Unfortunately most libraries do not assign
more than three or four subject headings per document so that the
depth of indexing is very shallow and, therefore, usually generic.
However, the above described search could be conducted as a coordi-
nate index search in the card catalog if the relevant documents had
the indicated descriptors, by accessing the cards titled "steam engine"
and then scanning them for documents with the subject heading "nine-
teenth-century-British economy," or "British economy." The shelf
location of each responding document would appear on the card as
a classification code. Thus, the original intent of the classification
system is partially obviated by the alphabetic subject catalog. How-
ever, the catalog performs the essential and vital function of cross
referencing, since some of the coordinated "hits" may be classified
under "steam engines" (or its nearest class code) and others under
"British economy." The search may then be resumed in the Docu-
ment File (the library stacks) where the exact title may be located,
or serendipitous scanning of the shelf may ensue. All of today's
mechanized retrieval systems are based upon some variant of coordi-
nate indexing or coordinate indexing with a classification schedule
overlay. The five most common implementations are listed in Table 1
under Retrieval Mechanisms. Most manual systems utilize some form
of card catalog, as described above. For higher speed handling of
the cards, data from the catalog card may be suitably coded and
punched onto an EAM card and scanned at a rate of about 200
cards per minute on a card sorter. A third mechanization is the
coordinate index card system, which is used in two modes, one
oriented toward the manipulation of cards that contain the records
of interest themselves and that are indexed by holes or notches along
the edge, and the other oriented toward manipulation of index cards
that serve to locate file records or documents by means of coordinates
placed on the index cards. These latter systems are usually called
optical coincidence or peekaboo systems,[3] since they are commonly
implemented by punching holes on a descriptor card at the coordi-
nate location representing a document containing the given descriptor.
The coincidence or logical conjunction of two or more descriptors
at a particular document coordinate will allow light to pass through
THE INFORMATION SYSTEM 11

the holes in the respective descriptor cards. A commercial system
is available that permits the coordination of up to 10,000 documents
per card.[4]
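The optical-coincidence principle is logical conjunction by superposition: a hole at the same document coordinate on every selected descriptor card lets light pass. Modeling each card as the set of punched coordinates (the descriptors and coordinate numbers below are invented) reduces the operation to set intersection:

```python
# Each descriptor card is the set of document coordinates punched on it.
# Descriptor names and coordinates are hypothetical.
cards = {
    "steam engine": {17, 42, 108, 245},
    "British economy": {42, 99, 245, 3071},
}

# Stacking the two cards and holding them to the light: only the
# coordinates punched on both cards pass light.
hits = cards["steam engine"] & cards["British economy"]
print(sorted(hits))
```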
All systems that employ a digital computer fall into one of two
classes, the EDP (electronic data processing) serial systems and the
EDP random systems.
Within the above scheme of system components and retrieval mech-
anisms, the attention of this book is focused narrowly and specifically
upon file storage devices of the digital, random access magnetic surface
or strip type, generically referred to as direct access storage devices
(DASD), and EDP random retrieval mechanisms, based upon co-
ordinate indexing. A few diversions into magnetic tape (EDP serial)
and classification systems are made where spin-off from techniques
originally developed for random systems has been achieved; however,
these topics are illustrative only, and no detailed treatment of either
is presented. In Fig. 1 the model of an information system was
presented. In order to focus now on that part of the system that is
susceptible to treatment by the techniques to be presented in this book,
a closer look at the data flow in this system is taken via Fig. 2, and
then the specific software components of interest will be explored in
Fig. 3. The Generator of information provides the primary data
source, i.e., Documents, which may first be subjected to a culling
process in which some are rejected because of their unessential na-
ture. Then, in terms of the model presented in Fig. 1, there will be
a data edit from the document and an intellectual indexing and/or an
abstracting operation that generates a series of forms for further
processing. In addition, the primary data sources selected are trans-
mitted to the repository for storage, usually in microfilm or hard-copy
files. The forms may be further edited for consistency and quality and
then transmitted to a keypunch operation and possibly also into a
microfilm or hard-copy file where they will be retained temporarily
as controls. Sometimes they are used as manual auxiliaries to the
automated Reference Files. The keypunched data are usually veri-
fied and, when necessary, corrected. These data then enter a file
generation process which is usually performed as a batched update
against the current file system. On-line techniques are now being
extended to the keying of data, thus eliminating the keypunch-verifi-
cation batched update cycle. Instead, the forms-edited data are
entered directly into the file from an on-line typewriter keyboard.
This requires an editor program in computer storage, which monitors
such input prior to its transmission to the file update programs in
order to verify data in real-time and indicate errors immediately to
the typewriter operator. This mode of operation has a number of
advantages. For one thing it can eliminate the hard-copy controls
required for recirculation after data verification. Secondly, it partially
replaces the manpower requirement for data verification and enables
the corrections to be made immediately rather than waiting for a
recycling, which is usually done in a batch and requires a number of
controls. Finally, the forms and the input format procedures are
considerably simplified because the programmed editor can essentially
instruct the input operator as to the required format, and certain
repeated data fields can automatically be inserted with much more
flexibility than is available with the drum card on a keypunch machine.
It should be reemphasized that this diagram follows the model of
Fig. 1 inasmuch as a dual data flow is created immediately upon
entry of the Primary Data, into the Document File and the Reference
File (via Indexing and Abstracting). Within the Reference File sys-
tem there is a two part division which is preserved throughout this
book and which characterizes random access EDP as contrasted with
serial access EDP. These two parts, or subfiles of the Reference File,
are called the Directory and the File. The serial EDP system requires
only the File, although some of them employ both. The File contains
the reference records while the Directory contains the descriptors or
entry keys (the subject headings of the library system) by which the
Users want to access reference records from the File. The terminology
to be used consistently throughout this book for the access key is key,
and a more specific definition of its structure will be provided in the
next chapter. Functionally, however, a key is a term, expressed either
in natural language or as a code, that partially characterizes one or
more reference records of the file, which records presumably describe
documents of the Document File, and hence the keys provide random
access to the Document File, where the reference record serves either
as an intermediate (providing at the very least an accession number in
the Document File) or as a surrogate, providing an abstract and
other information that might serve in lieu of the document. Further-
more, since for any given state of the file system there is a fixed set
of keys attributed to the reference records, the vocabulary of keys
from which the User may select is fixed, although there may exist
auxiliary directories of synonyms and near synonyms that translate
from a more open ended vocabulary to the system keys. The genera-
tion of the Reference File system as two separate subfiles is quite
explicit as are the retrieval and update procedures regarding these
same two subfiles.
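The two-subfile division can be made concrete with a toy Reference File in modern notation; all addresses, keys, and record contents below are invented. The Directory maps each key to File record addresses, and the File holds the reference records themselves:

```python
directory = {                  # Directory: key -> File record addresses
    "steam engine": [1001, 1007],
    "economics": [1007, 1012],
}
file_records = {               # File: address -> reference record
    1001: {"title": "The Age of Steam", "accession": "D-118"},
    1007: {"title": "Engines and Empire", "accession": "D-240"},
    1012: {"title": "Victorian Trade", "accession": "D-311"},
}

def retrieve(key):
    """Table look-up in the Directory, then random access to the File;
    each record carries at least an accession number that leads on
    into the Document File."""
    return [file_records[addr] for addr in directory.get(key, [])]

for record in retrieve("steam engine"):
    print(record["accession"], record["title"])
```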
The solid lines of Fig. 2 designate the data flow of file generation
and update, where the final procedure in the generation of files is the
creation of the Directory and File of the reference file system.
The broken lines in the figure indicate data retrieval processes, and
are initiated at the bottom of the diagram by the User. It is assumed
here that the interface to the system is automated via a typewriter or
cathode-ray tube console. A System Executive program monitors
multiple consoles in a large scale on-line system, and particular queries
to the file are transmitted from the Executive to the Query Processor
which decodes or translates from User oriented key terminology and
syntax into the appropriate File record addresses by interpreting the
syntax of key expressions and then using the Directory as a table
look-up. The outputs of the Query Processor are (1) Presearch Sta-
tistics that may be returned to the console via the Executive, and
(2) Appropriate Record Addresses, transmitted from the Query
Processor to the File Processor, which controls the DASD acces-
sions and may further qualify record retrievals according to query
specification; File responses are then transmitted back to the con-
sole via the System Executive. In some systems the console in-
cludes a high-speed line printer for more voluminous output, and
this printer may either be on-site at the computer center or may
be on long distance telephone lines remote from the computer. After
communication with the Reference File via the Executive-Query
Processor-File Processor-Executive cycle, the User may communi-
cate with the Microfilm and Hard-copy Files, as indicated by the
directed dashed line in Fig. 2, and by the corresponding directed line
in Fig. 1.
As in the telephone system, the User and Generator sometimes
interchange roles in the information system, and the latter may use
the system in essentially a User oriented mode. Such may be the case
for on-line, real-time Reference File update from a system console.
This procedure is indicated by the solid lines interconnecting the
console, Executive, Query and File Processors. The "queries" are
now file update commands with data modifications as operands. The
"Query" Processor must interrogate the Directory, as in true query
mode, in order to locate the record address, or addresses, to be up-
dated. These are transmitted to the File Processor, which executes
the indicated update. If the update involves the relocation of a record
in the File or the addition or deletion of keys to or from a record, then
the Directory address references to the File must also be updated.
[Fig. 2. Data Flow in the System (solid lines: data storage; broken lines: data retrieval)]

A system response to the User/Generator indicating completion of
the update may then be made via the normal User oriented (broken
line) channel of communication. This may be as simple as a state-
ment from the Executive-"update complete"-or it may be a display
of the updated record.
Since systems with high performance processors, large capacity
DASD, and telecommunications represent a considerable investment
in both hardware and software design, implementation and main-
tenance, the performance of these information systems should con-
stantly be monitored and periodically evaluated against the original
design standards as well as ad hoc criteria established from time to
time by users of the system. This may be achieved by programming
an Automonitor that can be interrogated by system management per-
sonnel to determine how the system is being utilized and how effec-
tively it is providing its intended service. For this purpose a system
operations file is maintained, and information of a statistical and
usage nature will flow from the File Processor to the System Manage-
ment Automonitor where it will be appropriately processed and stored
for later correlations and interrogation. The System Manager may
then make the appropriate interrogations and obtain statistical sum-
maries via his console.
In this way the system can be modified as deficiencies are detected
by significant deviations from the standards. For example, if the
relevance of recall is below standard, then the key vocabulary, the
quality of indexing, and the retrieval strategies of Users should be
examined. If recall is incomplete, then consistency of indexing and
the use of synonyms should be examined. If the system response time
is too low, then an improvement in retrieval technique may be indi-
cated, a hardware improvement at a system bottleneck may be re-
quired, or, in some of the file structuring techniques to be described,
a modification of an operational parameter may be attempted.
Another function of the System Manager and the Automonitor
could be to maintain file integrity, where users may restrict file access
selectively in multifile, multiuser systems. This means that certain
files or portions thereof have access and update authorization that is
limited, and the integrity of these files is maintained automatically by
programs in the Automonitor, while administration of the system is
under the cognizance of the System Manager.
It is intended in Fig. 2 to provide a perspective from which to view
the flow of data into, through, and out of the automated system, and
it is now appropriate to focus yet a little closer to the subject of this
book, which is enclosed within the dashed box of this figure. The
block diagram of Fig. 3 presents a somewhat more detailed view of
the relationships among these storage and retrieval programs. (Table
2 is a legend for this diagram.) This diagram defines the elemental
software components of the IS & R system, and each will be treated
at some length in the book. The elementary nature of these com-
ponents will be further illustrated by breaking the diagram apart and
representing a succession of simpler designs starting with the most
basic components and then adding a few blocks at a time.
The highest level of software is the operating system through which
all peripheral storage and console communication is made to the work-
ing programs. At any given time of operation there will be resident
in core one of two executives. The first, labeled 1.1, is the Off-line
Executive, which accepts batched file update information, generally
from magnetic tape, and controls the execution of the Off-line File
Generation programs 2.1 through 2.4. These are responsible for
formating the records, generating or updating the records in the Data
File (F-l), and, finally, constructing or updating the Key Directory
(F-2).
The programming counterpart of random access is file partitioning.
This is normally a logical partition in the sense that specific sets of
records may be rapidly retrieved from the DASD without the neces-
sity of storing them contiguously. The records in a given partition
are normally associated by address linkages. The number of such par-
titions and the speed of retrieving a specified set of records via these
partitions is a measure of the system's performance, with respect to
operational flexibility, utility, and time responsiveness. The file parti-
tions, which consist of records as basic information units, can be
likened to lists or arrays in a compiler, and, hence, the technique of
organizing a file for random access by logical partitions is often called
list structuring. This book discusses the two basic modes of list struc-
turing, inverted lists and multiple threaded lists (also referred to in
the literature as knotted lists), and after examining their advantages,
disadvantages, and areas of application, presents modifications of each
which broaden the design alternatives to the system analyst. The
formating, generation, and updating of records in both the Directory
and File are all related to the particular file partitioning process and,
therefore, represent an integrated block of programming.
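The two list-structuring modes named here can be contrasted on a toy file of four records, written in modern notation with invented addresses and keys. An inverted list keeps each logical partition as an explicit address list in the Directory; a threaded (knotted) list instead stores, inside each record, the address of the next record of the same partition:

```python
# Inverted lists: the partition for each key is an explicit address list.
inverted = {
    "chemistry": [200, 350],
    "physics":   [200, 275, 410],
}

# Threaded (knotted) lists: each record carries, per key, the address
# of the next record in that key's partition (None ends the thread).
threaded_file = {
    200: {"data": "R1", "threads": {"chemistry": 350, "physics": 275}},
    275: {"data": "R2", "threads": {"physics": 410}},
    350: {"data": "R3", "threads": {"chemistry": None}},
    410: {"data": "R4", "threads": {"physics": None}},
}
thread_heads = {"chemistry": 200, "physics": 200}

def walk_thread(key):
    """Follow the address linkage of one logical partition."""
    addr, members = thread_heads.get(key), []
    while addr is not None:
        members.append(addr)
        addr = threaded_file[addr]["threads"][key]
    return members

print(walk_thread("physics"))     # the same partition that the
print(inverted["physics"])        # inverted list stores explicitly
```

Either structure retrieves the same partition without storing its records contiguously; the trade-offs between the two are taken up later in the book.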
[Fig. 3. Software Mechanisms of a Real-time Retrieval System]

The On-line Executive (1.2) controls a somewhat larger number
of interrelated programs. Whether or not both Executives could re-
side in core simultaneously, and, under the On-line Executive, how
many of its subprograms could be resident, are functions of the core
size and total system requirements. For example, the functions of the
two Executives are normally exclusive, although, if core space were
available, the Off-line functions could be performed as background on
a low priority basis. The On-line Executive, however, must reside
in core whenever the system is responsive to on-line operation, but
none of the other programs, except a part of the Input/Output Pro-
gram, has to reside permanently in core. It is the sole function of the
Executive to sequence and call the appropriate subprograms for ex-
ecution in accordance with an established processing schedule. If the
system also has the ability to time-share queries from the consoles,
then the Executive has the additional function of scheduling the sub-
program calls according to priorities of the moment and a predeter-
mined time-sharing algorithm. The decision as to whether the
Executive or the initiating subprograms should respond to the Operat-
ing (OP) System I/O interrupts is for the designer to decide; this
usually depends upon the specific form of the time-sharing algorithm
and upon the operational environment of the information system. For
example, if the total system is dedicated to the information system, and
if subprograms are not shuttled from core to peripheral storage during
the execution of one of the subprogram's jobs, then it is simpler (and
the resident executive is smaller) to have the subprogram execute
its own I/O through the OP System. On the other hand, if a more
sophisticated time-shared system is envisioned, where subprograms
can be interrupted by the On-line Executive and removed from core,
then the Executive should handle I/O since an interrupt response may
be required when the subprogram is temporarily out of core. Simi-
larly, if the information system is itself a job within a larger time-
shared system environment, then I/O from the information system
may be less complex, with respect to the general system time-sharing
executive, if it were centralized in the On-line Executive.
The Input/Output Program (6) buffers incoming messages onto
an I/O file (F-8) and transmits responses from this same file through
the Executive back to the console via the Operating System.
The two routes of data communication from the Executive are to
the Input/Output Program (6) and the Query Interpreter (3). In
both cases, the Executive performs little or no processing, but is
merely a switching mechanism that provides a communication path
between the OP System and the I/O Program and between the I/O
Program and the Query Processor. The I/O Program collects the
input query a character or a line at a time, depending on the terminal
device buffering and upon the OP System buffering. In some systems
most of the input and output buffering functions may actually be
performed by the OP System. The data are stored by the I/O Pro-
gram in an input file on a high speed DASD such as a drum or disk.
When a complete input message has accumulated and an "execute
query" signal sensed, the I/O Program indicates to the Executive that
it is ready to transmit to the Query Interpreter (3). The function of
the Query Interpreter is to translate and prepare various parts of the
query for subsequent processing. Some of these parts provide control
information for routing the query, calling applications programs, and
designating output terminals. Other parts provide processing data
such as query arguments and functions. The various parts are as-
sembled into formats for processing by subsequent programs. The
Query Processor may also perform syntax verification on the state-
ments, and the error messages are transmitted back to the terminal
via the I/O Program. The Query Interpreter may have access to three
subfiles on a DASD which contain (1) file access codes, (2) user term
definitions, synonyms, and program names, and (3) universal
synonyms and program names. The file access codes (F-3) are used
to indicate which files in F-1 a given query may have access to, if the
system includes such protection. The subfile F-4 provides certain
term definitions, synonyms, and program names that have been desig-
nated for special use by a given user or group of users within the
system. In addition, there may exist a universal set of synonyms and
program names that are common and accessible to all users of the
system, and these are stored in subfile F-5. The purpose of synonyms
is to provide a more open ended or free vocabulary to the Users,
where translations are made from the terms that appear in the
synonym listing to authority terms that appear in the Key Directory
(F-2).
The Query Interpreter communicates with the storage and retrieval
time-sharing Executive, which is subdivided into two supervisors, the
Job Supervisor (4.1) and the File Supervisor (4.2). The File Super-
visor has the singular function of accessing a subfile or list of records
from F-1 in accordance with a logical expression and qualifying
specification presented in the query. For this purpose it makes use
of the Directory Decoding of Block 5, which decodes the expression
and retrieves record addresses and presearch statistics from the Key
Directory (F-2). The presearch statistics may be transmitted via 4.2
directly back to the User through the path 4.1 into a Query Execution
Area (QEA) allocated for the particular query, then to the Output
Program, where it is buffered on disk, and then to the console through
the System Executive and the OP System. The File Supervisor (4.2)
can also modify and restore records in F-1 and initiate an update of
the Directory, via Block 5, in response to an on-line file update. The
Job Supervisor (4.1) places the job (i.e. the query) into a QEA in
core or in disk, depending upon the availability of core space at the
moment, and controls switching from one query to another in accord-
ance with the retrieval of records from the file by the File Supervisor.
When the Job Supervisor transmits a retrieved record to a particular
QEA, it causes the selection of an Application Program (7), which
was specified by the Query Interpreter and transfers control to the
application program with a data linkage to the respective QEA. The

Table 2
LEGEND FOR FIGURE 3

1. System Executive
1.1 Off-line Executive
1.2 On-line Executive
2. Off-Line File Generation
2.1 Record formating
2.2 Partitioned file generation
2.3 Batched record addition/deletion/modification
2.4 Key directory construction
3. Query Interpreter
4. Storage and Retrieval Time-Sharing Executive
4.1 Job Supervisor
4.2 File Supervisor
5. Directory Decoding and Updating
6. Input/Output Program
7. Application Program
7.1 Intrarecord processing
7.2 Interrecord processing
7.3 On-line file updates
7.3.1 Record addition/deletion
7.3.2 Key addition/deletion/modification
7.3.3 Non-key data addition/deletion/modification
8. Background File Maintenance (space brooming)
F-1 Data File
F-2 Key Directory
F-3 File Access Codes
F-4 User Term Definitions, Synonyms, and Program Names
F-5 Universal Synonyms and Program Names
F-6 Intermediate File
F-7 Program File
F-8 I/O File

application programs are usually quite numerous since they may per-
form a variety of processes on the records, depending upon the speci-
fication in the query; hence, they may be stored in their own Program
File (F-7). Furthermore, the processing may involve a number of
retrieved records, and these may have to be buffered or stored in the
Intermediate File, which is accessible to the Application Program as
F-6. For example, a particular application program may sort a file
of retrieved records, in which case all of the retrieved records would
first have to be stored on the intermediate file. Records which finally
qualify for output by the Application Program are transmitted from
the QEA to the Output Program, buffered in an Output File on F-8
and then transmitted back to the console or the line Printer via the
Executive and the OP system.
Another type of control that may be used in some systems, espe-
cially where the volume of output data may vary greatly from one
query to the next, and where different types of output devices are
available with differing print or display speeds, is indicated by the
broken line from Block 6 to Block 4.1 in the diagram. Under this
type of control, if a given query requires the use of more than a pre-
determined amount of Output File storage (F-8) then a signal is sent
by the Input/Output Program to the Job Supervisor, which removes
the job from active execution until such time as its output buffer
requirements have diminished by virtue of the printing process, at
which time the job may be brought back into execution. Externally,
this temporary cessation of the job execution is never felt because
output continues to come from the output buffer. Only in the event
that, in the interim, the on-line usage had risen to such a level that
the job could not be brought back into execution, would a substantial
delay in output be noticed.
At times when neither the on-line nor the background load is so
high as to fully utilize the processor capacity, the On-line Executive
( 1.2) may call into execution a Background File Maintenance (8),
which creates new space in the files. This space is required because
on-line updates which expand the size of records may cause the
records to be removed from their present locations and placed else-
where. The islands of space left by these removals are difficult to
utilize; therefore, it is desirable to have a program that in some sys-
tematic way can collect this space and make it available for new record
use, but which program can operate in very small increments of time
such as a few seconds or a fraction of a second, so that should the
on-line load suddenly increase, the program can be terminated, and
the file left in a completely usable state. Block 8 in Fig. 3 represents
this background maintenance program. One such program, to be
described in the text, is given the descriptive name "space brooming."
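The text names but does not yet specify the algorithm, so the sketch below is only one invented way, in modern notation, to meet its two constraints: operate in very small increments of time, and leave the file in a completely usable state whenever it is terminated. The File is abstracted here to a list of slots, with None marking an island of dead space.

```python
def broom(file_slots, increment=2):
    """Generator: each resumption slides a few live records toward the
    front of the file, then yields control back to the Executive.
    The slot list still holds every live record after every yield."""
    write = 0
    for read in range(len(file_slots)):
        rec = file_slots[read]
        if rec is None:
            continue
        if write != read:            # slide the record over a gap
            file_slots[write] = rec
            file_slots[read] = None
        write += 1
        if write % increment == 0:
            yield                    # a safe point to stop or resume
    del file_slots[write:]           # collected space, now contiguous

slots = ["A", None, "B", None, None, "C", "D"]
sweeper = broom(slots)
next(sweeper)        # one small increment of brooming
# should the on-line load suddenly rise here, simply stop calling
# next(); slots remains completely usable
for _ in sweeper:    # otherwise, let the broom finish
    pass
print(slots)
```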
In order to illustrate the functional modularity of these software
components, the diagram of Fig. 3 is broken down to the most basic
system, in Fig. 4, containing only the File and Directory subprograms.
Figure 3 is then reconstructed in three successive steps, via Figs. 5, 6,
and 3 itself. The four systems are labeled:

1) Basic Retrieval System. (Fig. 4)
2) Query Interpreter Augmentation and Single Terminal, Dedicated Operation. (Fig. 5)
3) Multiple Console, Single Job Execution. (Fig. 6)
4) Software Mechanisms of a Real Time Retrieval System. (Fig. 3)

[Fig. 4. Basic Retrieval System]

Each of the systems assumes the existence of the File Generation
programs 2.1 through 2.4 and their Executive 1.1. The series illus-
trates the elemental nature of each block and the modularity inherent
in the programmed system; hence the individual functions such as File
Supervisor, Directory Decoder, Query Interpreter, etc., can be de-
scribed as separately programmable entities. This enables their in-
ternal structure to be examined in the remainder of this book in
considerable detail, after having gained an appreciation, through the
present chapter, of their organic contribution to the system as a whole.

[Fig. 5. Query Interpreter Augmentation and Single Terminal, Dedicated Operation]

In Fig. 4, the Basic Retrieval System consists only of blocks 4.2 and
5, the File Supervisor and Directory Decoder. Furthermore, it can
only retrieve records from the DASD, not update. All file updates
and maintenance are performed in batches through programs 2.1 to
2.4.

[Fig. 6. Multiple Console, Single Job Execution]

These subprograms are written simply as subroutines that can
be called from a User written compiler main program, such as in
FORTRAN or COBOL. The query language is part of the User's
main program. The retrieval keys are transmitted to the Directory
Decoder via an argument listing in the subroutine call statement. The
Decoder operates on F-2, using the compiler random access state-
ment and can return a presearch statistic to the main program, which
in turn calls the File Supervisor to retrieve records from F-l. The
addresses may be transmitted from Block 5 to 4.2 by Common storage
or again through argument listings. The logical expressions of keys
are decoded by the main program (probably into a sum of products)
and fed, one product at a time, to the Directory Decoder. Also, the
final record qualification and processing is performed by the User's
main program on each record as it is retrieved from the File (F-1).
This system would process queries either in batches from a card to
tape input and print on the line printer, or it could process on-line
queries from a console, depending upon the availability of tele-
processing software in the OP System and the desires of the User who
has written the main program.
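The decomposition into a sum of products, fed one product at a time to the Directory Decoder, can be sketched in modern notation. The key expression and Directory contents below are invented, and the Decoder itself is reduced to set algebra over address lists:

```python
# Hypothetical Directory: key -> set of File record addresses.
directory = {
    "steam": {10, 22, 31},
    "engine": {22, 31, 47},
    "economy": {31, 90},
}

# (steam AND engine) OR (economy), already in sum-of-products form.
sum_of_products = [["steam", "engine"], ["economy"]]

def decode_product(product):
    """One call to the Directory Decoder: intersect the address
    lists of the keys in a single product term."""
    sets = [directory.get(key, set()) for key in product]
    return set.intersection(*sets)

addresses = set()
for product in sum_of_products:    # fed one product at a time
    addresses |= decode_product(product)
print(sorted(addresses))
```

The count of accumulated addresses is exactly the sort of presearch statistic the Decoder can return to the main program before any File record is touched.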
The file structures discussed in this book are all applicable to this
Basic System.
Figure 5 contains two possible augmentations to Fig. 4. First, a
Query Interpreter (3) is added, which buffers the User's main pro-
gram from the file operations and substantially reduces the main
program to an input-output executive. Second, an on-line update
capability may be added to 4.2 and 5, as indicated by the broken lines,
and the Directory Decoder would be made subordinate to the File
Supervisor. Since an Input-Output Program has not been incor-
porated, this system is still dedicated to single terminal operation.
The third increment, shown in Fig. 6, includes the I/O Program
(6) with its DASD (F-8), so that multiple terminals can be serviced;
however, the jobs are executed one at a time through a single Query
Execution Area. This increment might also include a Library of
Application Programs (Block 7 and F-7), and the use of an Inter-
mediate File (F-6) for postretrieval processing. Many IS & R sys-
tems are of this general type, since time-shared operation, which
characterizes the final increment (Fig. 3), is an extreme requirement
and requires a larger processor and far more complex executive
programs.
By means of a series of telescoping diagrams and discussions, this
first chapter has successively narrowed the reader's conceptual field
of vision from the gross information system model in Figs. 1 and 2 to
its software mechanisms in Fig. 3. This field will be further narrowed
in Chapter III to focus upon the proper discussion of this book,
namely, the file structures, represented primarily as F-l and F-2 in
Fig. 3, (the File and Directory), and their functional programs, 4.2
and 5, even though the techniques for the generation, search, and
update of these files are also applicable to the other files of Fig. 3.
Chapter II presents a brief discussion of DASD types, their rele-
vant principles of operation and functional characteristics, in order
to define certain classes of random access devices, and to provide
certain parameter definitions necessary to a discussion of timing and
memory requirement formulations.
Appendix A presents a further discussion of information systems,
in brief, for management oriented persons who may be interested in
how the government or a large organization goes about developing
and implementing an information system.
CHAPTER II

DIRECT ACCESS STORAGE DEVICES

THE DIRECT ACCESS storage device, or DASD, as it will be referred
to, is a generic name for peripheral computer storages that have the
approximate data access characteristic shown in Fig. 7. This char-
acteristic is also compared in the figure with those corresponding to
magnetic tape and core. No time scale is implied by this diagram but,
as drawn, it would be roughly logarithmic. The access time of core
memories is independent of the addressing distance between acces-
sions, and is on the order of 1 to 10 microseconds. The access time
of magnetic tape is very nearly linear over a single reel of tape, which
spans an addressing distance corresponding to approximately 15 mil-
lion bytes, or 3 to 4 million words. The typical slope of this line
is around 10⁻⁵ sec per byte for a 90 KB/S tape drive. The DASD,
in contrast, has linear portions of its characteristic, which are
separated by vertical jumps, where the number and height of these
jumps are determined by the mechanical construction of the device.
In fact, the DASDs can be further classified according to the number
and height of such breaks in their characteristic, as shown in Table 3.
It is also necessary, in the case of the DASD, to redefine the notion of
distance between addresses, as will be shown after the discussion of
mechanical construction.
Broadly speaking, the basic difference between the direct access
storage device and the magnetic tape is the former's ability to access
a character or string of characters, and commence transmission to

27
Table 3
CLASSES OF DIRECT ACCESS STORAGE DEVICES

No. of Breaks       Height(s)   Slope of
in Characteristic   (Typical)   Linear Portion               Device Type

0                   -           3.2 to 10 × 10⁻⁶ sec/byte    (Fixed) Head per
                                                             Track Disk or Drum

3                   50 ms       3.2 to 10 × 10⁻⁶ sec/byte    Movable Head Disk
                    150 ms                                   (Disk Pack)
                    50 ms
                    120 ms
                    180 ms

9*                  100 ms      2 × 10⁻⁴ sec/byte            Magnetic Card or
                    175 ms                                   Strip (IBM Data Cell,
                    500 ms                                   RCA Mass Storage,
                    95 ms                                    NCR CRAM)
                    175 ms
                    250 ms
                    350 ms
                    375 ms
                    400 ms
                    450 ms
                    550 ms
                    600 ms

* The particular breaks and heights indicated are for the IBM Data Cell.

[Figure 7: Access Time plotted against Addressing Distance Between Successive Accessions]

Fig. 7. Comparison of DASD, Tape and Core Data Accession Characteristics



the central processor within a few tens of milliseconds for some equipment
types, and within a few hundreds of milliseconds for others. The
magnetic tape, in contrast, is not designed for the random accession
of data within a narrow time range, but rather starts the transmission
at the beginning of the tape reel, and transmits all information serially,
whether or not it is required for processing by the computer; since it
takes approximately five minutes to pass a reel of magnetic tape, the
average time to reach an arbitrarily located character string on the
tape starting from the beginning, would be about two and a half min-
utes, and to reach it from an arbitrarily located standing position
would be about one-third of five minutes or one and two-thirds
minutes. Therefore, the random accession time constant differential
between these two device types is around three orders of magnitude.
In summary, the primary function of the magnetic tape is to provide
large capacity peripheral storage to a digital computer, where the
processing of this data is to be modular in size units of around 15
million characters, and time units of five minutes, whereas the primary
function of the DASD is also to provide large scale peripheral storage
to a digital computer, but where the modularity in size varies from a
few hundred thousand characters up to a half billion characters (in
present-day equipment), and the modularity in time units is in the
range of ten to 500 milliseconds. In terms of cost the most commonly
used high-speed tape units purchase for around $30,000 per tape
drive, and a disk system with a comparable modular capacity (i.e.,
15 million characters) would purchase for around $40,000 although
as the on-line capacity of the device increases, the cost increase is
less than proportional. Furthermore, the lower performance direct
access storage devices (e.g. Data Cell) are available at a considerably
lower cost per on-line character of information.
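The averages quoted above are expectations over a uniformly distributed reel position; a small sketch of the arithmetic, with the five-minute reel-pass time taken from the text:

```python
# Sketch of the tape-access expectations quoted above; the 5-minute
# reel-pass time is from the text, the rest is arithmetic.

REEL_PASS_MIN = 5.0  # minutes to pass one full reel

def mean_access_from_start():
    # The target is uniformly distributed along the reel, so the
    # expected travel from the beginning is half the reel: 2.5 minutes.
    return REEL_PASS_MIN / 2

def mean_access_from_random_position():
    # Both the standing position and the target are uniform on the
    # reel; the expected distance between two uniform points on [0, 1]
    # is 1/3, hence one-third of five minutes.
    return REEL_PASS_MIN / 3

print(mean_access_from_start())            # 2.5 minutes
print(mean_access_from_random_position())  # about 1.67 minutes
```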
In Fig. 8 the construction of the fixed head or head-per-track disk
file is illustrated. These systems have a series of aluminum disks with
a magnetic iron oxide surface coating like tape onto which digital data
can be written and read by recording heads that float on a cushion of
air a fraction of a mil from, but never touching, the disk surface.
The disk surface contains concentric recording tracks packed approxi-
mately 50 per inch. The only movable part is the disk, which rotates
in a period, typically, of 25 milliseconds. * Each surface has a read/

* Appendix C contains configurational specifications of a representative sample of commercially available DASDs.

write recording head for each track, and all of the heads are anchored
to a block. Data can be transmitted to or from one track at a time,
and switching from one track to another is electronically controlled,
since there is a head permanently positioned to service each track.
This means that its characteristic, in terms of Fig. 7, has no breaks,
as indicated by Table 3. If the device can only commence reading or
writing from an initial track index, then the time, Tr, to access data
at position x (measured, say, in degrees of rotation from the index),
from a given head position y would be

    Tr = (1 - y/360 + x/360) R,                                  (1)

where R is the rotation time.


If the device can read from any position on the disk, the access time is

    Tr = ((x - y)/360) R          if x > y,
and                                                              (2)
    Tr = (1 + (x - y)/360) R      if x < y.

In general, the head per track devices operate in neither mode, but
read in relatively small data units such as groups of eight bytes, so
that the actual access time would be close but not exactly equal to
formulations (2). The access time, in either case, is a linear function
of the shortest distance between the two addresses. Furthermore, the
two addresses need not be on the same track since the electronic head
switching time is negligible compared with even small fractions of R.
Therefore, the distance between addresses is the projected differential
onto a single disk surface, which means that in the first case, the maximum
effective address distance throughout the entire medium is 720°
(x - y ranges over ±360°), and the average access time is R; in the second
case the maximum effective address distance is 360°, the maximum
access time is R, and the average access time is ½R.
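Formulations (1) and (2) can be sketched directly in code; x, y, and R are as defined above, and the 25 ms rotation period is the typical figure quoted earlier:

```python
# Sketch of the head-per-track access-time formulas (1) and (2).
# Positions x (target) and y (current head position) are in degrees
# from the track index; R is the rotation period in seconds.

def access_from_index(x, y, R):
    # Equation (1): reading may start only at the track index, so the
    # head rotates from y around to the index, then on to x.
    return (1 - y / 360 + x / 360) * R

def access_from_any_position(x, y, R):
    # Equation (2): reading may start anywhere, so only the forward
    # rotation from y to x is needed.
    if x > y:
        return (x - y) / 360 * R
    return (1 + (x - y) / 360) * R

R = 0.025  # 25 ms rotation, typical per the text
print(access_from_any_position(90, 0, R))  # quarter turn: 0.00625 s
```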
The important timing concepts for this class of device (which is a
subtype of the other two classes) are (1) latency, which is the time
elapsed from initiation of the I/O command by the processor until
the first character to be transmitted is under the read/write head, and
the electronic circuitry is prepared to transmit, and (2) record trans-
mission, which is the time to transmit the record. The latency time
is given by the above expressions, while transmission time is simply
the ratio of the number of characters to be transmitted to the serial
data transmission rate of the device, in characters (or bytes) per
second. If the system always transmits a complete track, the record
transmission time is R.
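Putting the two concepts together, total record access is latency plus transmission time; the transfer rate below is an assumed figure for illustration, not a quoted device parameter:

```python
# Sketch of total record-access time for a fixed head device:
# latency (from the expressions above) plus record transmission,
# the ratio of record length to the serial transfer rate.

def record_access_time(latency_s, record_bytes, rate_bytes_per_s):
    return latency_s + record_bytes / rate_bytes_per_s

# A 1,000-byte record, 12.5 ms average latency (R/2 for R = 25 ms),
# and an assumed 312,500 byte/s transfer rate:
t = record_access_time(0.0125, 1000, 312500)
print(round(t * 1000, 2))  # 15.7 ms
```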


Fig. 8. Mechanical Construction of the Fixed Head Disk

Figure 9 presents the mechanical construction of the movable head
disk. As indicated, there are two mechanical motions associated with
positioning the head; one is rotational and the other is radial across
the disk surface, because there are a limited number of heads to
service each surface. The access motions, therefore, occur in two
steps; first the head is radially positioned to the appropriate track, and
then access proceeds as in the head per track disk. Furthermore, all
of the heads are ganged on a single mechanism, appearing like the
teeth of a comb. This means that for any given position of the comb
arm, the system has "head per track" access to one track at a given
radius on each surface of the disk stack. Thus, zero head motion
access in the disks is sometimes referred to as cylinder access, as de-
picted in Fig. 9. The radial motion of the heads is usually by a tele-
scoping hydraulically powered arm action in which multiple segments
of the arm move simultaneously, providing coarse and fine adjust-
ments to the positioning. When only the fine adjustment is required
the arm positioning time is on the order of 50 milliseconds, and when
the coarse adjustments are required the positioning times are 120 and
180 milliseconds. * These times are cited in Table 3 and correspond
to the height of the breaks in Fig. 7. The radial distance of the
* These times are for the IBM 1301.

fine adjustment is around 10 tracks (i.e., 10 successively concentric
cylinders). The average head positioning time in this type of device
is therefore a function of whether the address differential is such as
to require a fine or one or more coarse adjustments, and is not readily
amenable to a straightforward calculation, but it is, for the practical
purposes of estimate, around 100 milliseconds.*
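As an estimating aid, the fine/coarse behavior just described can be sketched as a step function of cylinder differential; the thresholds beyond the 10-track fine span are illustrative assumptions, not device specifications:

```python
# Rough seek model for a movable head disk of the kind described:
# a fine adjustment (~50 ms) covers about 10 cylinders; larger
# differentials need a coarse motion (~120 or 180 ms). The 125-cylinder
# boundary between the two coarse times is an assumption.

def seek_time_ms(cylinder_differential):
    d = abs(cylinder_differential)
    if d == 0:
        return 0       # same cylinder: no head motion
    if d <= 10:
        return 50      # fine adjustment only
    if d <= 125:
        return 120     # one coarse adjustment (assumed span)
    return 180         # full-stroke coarse adjustment

print(seek_time_ms(3))    # 50
print(seek_time_ms(200))  # 180
```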

Fig. 9. Mechanical Construction of the Movable Head Disk

Figure 10 presents a schematic drawing (not a facsimile) of the
IBM Data Cell, illustrating the three motions associated with the
breaks of Fig. 7 and Table 3. This device has 10 canisters called
Data Cells that are driven circularly in either direction on a shaft,
like a lazy susan. This motion is labeled number 3 in the figure, and
corresponds to the breaks from 250 to 600 milliseconds. When a
particular subcell of the Data Cell is positioned under the drum,
which also rotates bidirectionally, an oxide coated magnetic strip,
13" × 2¼" × 0.005", is picked from the subcell and placed onto
the drum. This motion is labeled number 2 and is 175 milliseconds.
Finally, the read/write heads position laterally across the strip, as in
the movable head disk, and this motion, number 1, is around 95
* See Appendix C for manufacturers' estimates on various equipments in this category.
milliseconds. The Data Cell has 20 read/write heads positioned
simultaneously over the strip, which can be rotated past the heads,
once they have been positioned over the drum. The cylinder concept,
as the unit of storage preceding the first characteristic break (i.e.,
requiring only drum rotation) is therefore valid for this device as well
as the movable head disk. Since there are 20 heads, the cylinder of
the Data Cell contains 20 tracks, as compared with 40 for the IBM
1301 disk and 10 for the IBM 2311 Disk Pack. The heads of the
Data Cell can be moved to 5 different positions; therefore, it has 5
cylinders (per strip), as compared with 250 for the IBM 1301 and
200 for the IBM 2311. The rotation time, R, of the drum is 50
milliseconds.

Fig. 10. Mechanical Construction of the Magnetic Strip DASD
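The access time of the magnetic strip device is then roughly the sum of whichever of the three motions a given accession requires, plus average drum latency; a sketch using the typical figures quoted above:

```python
# Sketch of access time for the magnetic strip device described:
# cell rotation (motion 3, when a different subcell is needed), strip
# pick (motion 2), head positioning (motion 1), then drum latency.
# Which motions apply depends on what is already positioned.

def data_cell_access_ms(new_strip, new_head_position,
                        cell_rotation_ms=0, R_ms=50):
    t = cell_rotation_ms  # motion 3: 250 to 600 ms when needed
    if new_strip:
        t += 175          # motion 2: strip pick and load
    if new_head_position:
        t += 95           # motion 1: lateral head positioning
    t += R_ms / 2         # average drum latency, R/2
    return t

print(data_cell_access_ms(True, True, cell_rotation_ms=400))  # 695.0
```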

These three device types characterize all of today's magnetic direct
access storages, and system response times to be calculated in subse-
quent chapters will all be based upon the above described operational
characteristics of these machines. In summary, the most relevant
timing concepts and typical values for these three device classes are
presented in Table 4. It should be noted that there are a variety of
commercial equipments within each of these classes with widely vary-
ing parameters. For example, Burroughs intends to market a series
of fixed head disks in which R is as high as 60 and as low as 17 milli-
seconds.
Appendix C contains a representative sample of available DASDs
and their operational characteristics. By forming a composite in mean
access time (Head Position plus latency), and cost per character for
the three DASD types from data extracted from Appendix C, and by
compiling the same data for large scale core (8 microsecond access)
and magnetic tapes, the graph of Fig. 11 results. It presents two kinds
of relationships, the On-line Cost per Character and the Off-line Cost
per Character of various devices, plotted against the Mean Access
Time of the device as an independent variable.

Table 4
TYPICAL TIMING PARAMETERS OF DIRECT ACCESS
STORAGE DEVICES

Device Type                Rotation    Latency    Average Head Position Access

Fixed Head                    R          ½R                0
Movable Head                  R          ½R               85 ms
Magnetic Strip or Card        R          ½R              400 ms

R = 25-50 ms

The solid line indicates On-line Cost, which is to be interpreted as the
cost of a character as it is maintained for on-line usage. This basically involves the
cost of the access mechanism. The chart indicates that the Core
memory, with an access time of 8 microseconds is most expensive at
25 cents per character, and the cost drops through fixed head disk
and Disk Pack (movable head disk) to a minimum for the magnetic
strip memory of .035 cents per character with approximately half
second mean access time. The cost then rises for Hyper tape and
standard 800 BPI tape to approximately .25 cents per character. This
indicates that both Tape and Hyper tape are impractical for on-line
storage, because they are slower and more expensive than two other
available types of storage, namely the magnetic strip memories and
the Disk Packs. Furthermore, both of the latter storage devices are
modular in the sense that they can provide off-line storage as well,
as indicated by the broken line, although off-line storage costs for
these devices is higher than for tape.

[Figure 11: On-line Cost (solid line) and Off-line Cost (broken line) per character, plotted against Mean Access Time in seconds, for Core, Fixed Head Disk, Disk Pack, Magnetic Strip, Hyper Tape, and Tape.]

Fig. 11. Cost/Access Time Trade-Offs

The fixed head disks and core
storage would be recommended only on the basis of their perform-
ance, since they come at a considerably higher cost than the other
storages, and do not provide off-line modular storage. The broken
curve, which represents Off-line Costs, corresponds only to the cost of
the storage magazine such as the tape reel, the Data Cell or the Disk
Pack, and, as can be seen from the graph, the cost of storing infor-
mation on the shelf via tape is an order of magnitude less than the
shelf cost of the Disk Pack. This would indicate that for batched
processing, where large archival files of data are to be stored at
relatively low cost, the magnetic tapes are still to be preferred.
CHAPTER III

INFORMATION STRUCTURE AND FILE ORGANIZATION

1. Functional Requirements on File Organization

THE DISCUSSIONS OF the subsequent chapters in this book require that
certain definitions be made at this point, and that a framework be
established for the effective presentation of design and programming
concepts. This chapter is, therefore, devoted to certain definitions,
underlying concepts, and the statement of functional requirements that
are placed upon the design of files for on-line systems by virtue of their
incorporation into the kind of system environment discussed in
Chapter I.
Figure 12 presents the salient functional requirements and the
property or characteristic that the file structure and the system that
manipulates the file must have.
The system requires real-time record response if it is to operate in
an on-line mode. In hardware this implies a selection from a DASD
equipment type, as discussed in Chapter II, but in software, which is
the primary concern of this book, the file must be partitioned in such
a way as to enable relatively small subfile accessions from the DASD,
which, upon the application of further qualifying procedures will
produce the indicated records. The quantitative determinant of real-
time response is the size of the file partition and the strategies for
retrieving specified combinations of these partitions. On the other
hand, the flexibility of the system is largely determined by the number
of partitions. The programming technique that is used to implement
the file partition is called list structuring. Various list structuring tech-
niques will be described and evaluated in this book, and a technique

FUNCTIONAL REQUIREMENT                     FILE AND SYSTEM CHARACTERISTIC

• Real Time Record Response                File Partition (List Structuring)
• Multi-Key Logical Retrieval              Multiple List Partitions
• Non-Key Qualifications:
    Arithmetic Value Comparison            Intra Record Processing
    Functional Value Computations          Intra and Inter Record Processing
• Report Generation                        Inter Record Processing
• Real Time vs. Batched Update             List Control
• File Maintenance                         Space Brooming

Fig. 12. Functional Requirements of the File System



for file partitioning that is a hybrid of serial and list processing will
also be described.
The file should be susceptible of multikey logical retrieval. This
statement contains three concepts that must be carefully defined; they
are (1) key, (2) multikey, and (3) logical. Functionally, a key rep-
resents a single file partition or list; structurally it is a subfield of a
record, being some digitally representable character (or bit) string a.
Then, every record in the file that contains, as a subfield of some
commonly identified field (called the field of the keys), the string a
is said to be in the file partition characterized by the key a. Further-
more, this partition contains no records without the string a in the key
field. To the user of the system, a key is a descriptive element of
information pertaining to the record. For example, Author/Smith or
Citizenship/USA.
The second and third concepts (or rather the implied functional
requirements) cause a number of problems, which turn out to be the
primary concern of this book. The multiple key and logical require-
ment means that retrievals may be specified by more than a single key,
and that a logical combination of these keys must also be satisfied. The
most frequently used logic functions are the Boolean operators AND,
OR, and NOT, although others such as threshold (i.e., m out of n)
functions and weighted threshold functions have also been used. Since
a single key corresponds to an individually retrievable file partition,
the multikey query requires that combinative sets of partitions be retrieved.
For example, the key expression A AND B implies that the
intersection of partitions A and B be retrieved. The files should there-
fore be organized in such a way as to minimize the number of records
that have to be transferred from the DASD to the processor in order
to satisfy this multikey expression. By definition, the maximum number
of transmissions need be no greater than the smaller of the partitions
A and B; however, improvements are possible by more highly
organizing the key Directory, but there is a price to be paid for such
an increase in retrieval efficiency, which constitutes a design trade-off.
There are very few design problems in single key systems since com-
plete partitions are always retrieved, and, as will be shown, the gen-
eration and retrieval of these partitions is relatively easy. However,
even the single key systems divide into two types-the single record
partitions and multiple record partitions. The first is a very special
system type, although there are many of them in existence today. Each
key in these systems represents a unique record. An example is a
bank system in which each account has a record containing an ac-
count number, name, deposits, withdrawals, dates, and balance. The
unique and only key of the record is the account number since the
system requires that this file be queried only by tellers or account
administrators who have an account's pass book in hand. The multi-
ple record partitions, on the other hand, represent the more general
case. If, for example, keys corresponding to dates and names were
added to this account record then multipopulated record partitions
would be created. That is, a partition would be created for all records
containing a given date or a given name, so that, for example, every
account that had a deposit on May 19, 1968 could be randomly ac-
cessed. This system could still operate within the single key query
prototype even though there were a multiplicity of key categories, but
as soon as one allowed a query of the type "Retrieve the account
record(s) of John Smith who made a deposit on April 9, 1968," then
a system with multikey query logic is indicated. This book is primarily
concerned with the problem of multikey systems since the single key
systems have ready solutions that will appear as degenerate or special
cases of the techniques to be presented for the solution of the more
general problem. The system characteristic required to provide the
multikey logical retrieval is a multiple list partition, which is a con-
siderably more complex structure than a simple partition for the
following reason: The single key partition is a property only of the
file structure, whereas multiple list partitions are generated ad hoc
in most systems by the retrieval programs, in accordance with the
multikey query expression, and this introduces the crucial element of
time response into the design problem.
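The multikey mechanism described above can be sketched with sets standing in for list partitions; the keys and record addresses are illustrative:

```python
# Minimal sketch of multikey retrieval over list partitions: each key
# names the list of record addresses containing it, and the query
# A AND B is served by intersecting the two lists before any record
# is fetched from the DASD.

partitions = {
    "Author/Smith":    {101, 205, 330, 412},
    "Citizenship/USA": {205, 330, 500},
}

def retrieve(key_a, key_b):
    # At most the smaller partition's records need be transmitted;
    # intersecting at the directory level reduces even that.
    return partitions[key_a] & partitions[key_b]

print(sorted(retrieve("Author/Smith", "Citizenship/USA")))  # [205, 330]
```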
Non-key record qualification is another functional requirement of
most IS & R systems. In Chapter I a two-step retrieval process was
described in which lists or partitions of records were first determined
via Directory Decoding, and then the actual records were accessed
from the DASD file and transmitted to the processor. The multikey
expression in the query provides the input to the Directory Decoder,
and thereby determines the list accessions; however, further qualifica-
tion of the records on this list may be performed within the processor.
These qualifications are performed on designated subfield character
strings (or numerics) called qualifiers, although the allowable operations
are somewhat more general than those applied to keys. That is,
in order for a record to respond to a key there must be a (key) sub-
field match, whereas the qualifier may be subjected to a number of
arithmetic and logical tests, such as equal to, less than, greater than,
between limits, (set) included in, substring match, etc. These same
operations can also be applied to keys in the query expression, but
their qualification is performed in the Directory Decoding procedure,
not within the record itself. The user of the system may not func-
not within the record itself. The user of the system may not func-
tionally distinguish the keys and qualifiers, because, to him, they both
represent record descriptors. The system, on the other hand, distinguishes
them in order to achieve some desired measure of operational
efficiency, and a well-designed system will enable automatic
(i.e., by means of programs) conversion of qualifiers into keys and
vice versa, at file regeneration time.
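A minimal sketch of such qualifier tests applied to a fetched record; the field names and operator mnemonics are illustrative, not a particular system's syntax:

```python
# Intrarecord qualification sketch: after the directory determines a
# list of candidate records, each fetched record is tested against
# qualifier predicates (equal to, less than, greater than, between
# limits, set inclusion, substring match).

def qualifies(record, field, op, value):
    v = record.get(field)
    if op == "eq":       return v == value
    if op == "lt":       return v < value
    if op == "gt":       return v > value
    if op == "between":  return value[0] <= v <= value[1]
    if op == "in":       return v in value
    if op == "substr":   return value in v
    raise ValueError(op)

rec = {"tonnage": 4200, "name": "SS Example"}
print(qualifies(rec, "tonnage", "between", (1000, 5000)))  # True
print(qualifies(rec, "name", "substr", "Exam"))            # True
```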
In addition to arithmetic value comparisons, some systems may
require functional value comparisons, which add both power and
complexity that are warranted only in very special circumstances.
An example is to retrieve from a file all ships that are less than the
average ship tonnage. This assumes that a programmed function that
can compute an average is available, and that a subfield of the quali-
fiers (or even the keys) is "ship tonnage." Thus, the user can
specify an arithmetic value for comparison that is the result of a com-
putation instead of being a prestated constant. How this is accom-
plished in real-time retrieval is discussed in Chapter IV.
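The ship-tonnage example amounts to a two-pass computation: one interrecord pass to compute the average over the subfile, then an intrarecord comparison per record. A sketch with invented data:

```python
# Interrecord processing sketch: the comparison value (average
# tonnage) is a programmed function of the whole subfile, computed
# before the per-record test is applied.

ships = [
    {"name": "A", "tonnage": 2000},
    {"name": "B", "tonnage": 6000},
    {"name": "C", "tonnage": 4000},
]

# Pass 1 (interrecord): compute the functional value.
average = sum(s["tonnage"] for s in ships) / len(ships)

# Pass 2 (intrarecord): compare each record against it.
below_average = [s["name"] for s in ships if s["tonnage"] < average]
print(below_average)  # ['A']
```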
The decision to accept or reject a record based on arithmetic value
comparison with a qualifier is made wholly within the record itself and
is not dependent or contingent on values in or processes performed on
other records either in the file or in the list of retrieved records (from
the DASD). The system characteristic that satisfies this requirement
is, therefore, called intrarecord processing. The functional value com-
parisons, however, will usually depend upon the prior processing of
other records either in the file or the accessed partition, as well as
intrarecord processing; this system characteristic is therefore called
interrecord processing.
Closely associated with functional value computations is report
generation, since a report implies first the definition of a subfile
(sometimes the entire file) of interest and second, the processing of
this subfile into the report statement. This processing is usually inter-
record since it may involve sorts, sums, averages, counts, and other
statistics. The report is, of course, a significant management tool, and
its inclusion into on-line systems should not be overlooked, since the
ability to generate reports with flexible, high quality formats using all
of the retrieval mechanisms described above is readily implemented
as interrecord processing.
One of the first major file design considerations for on-line systems
is the mode of update. The programming complexity associated with
maintenance is greater if there is a real-time update requirement.
A number of controls must be imposed on the list struc-
tures in order to facilitate update, and certain file organizations have
clear advantages over others, thus constraining the design prerogatives
for retrieval optimization. There are many on-line system applica-
tions (and, in fact, they may be in the majority) that require real-time
retrieval but not update. The updates may be batched and applied
periodically to the files, and whether update transactions accumulate
via on-line or off-line facilities (or both) is immaterial to the file
organization as long as the update of the file is scheduled at a time
when the on-line system is not in operation. However, on-line update
has a number of advantages, and in some systems would be manda-
tory. Dynamic data bases that require the use of updated file in-
formation shortly after the update transaction is completed will
require real-time updates, and, as in the case of on-line file genera-
tion, program editors can be used to assist in the entry and formatting
of the update, and to immediately detect errors.
Thus, the designer's first consideration is whether there is a real-
time update requirement. If not, his freedom of selection in file
organization is great; if it is required, file organization compromises
must be effected that will enable the system to meet both retrieval
and update specifications.
When a file is originally generated, it is in a state that is optimal
for retrieval and update for the given file organization. However, if
real-time updates are permitted, the compromises that are usually
made cause the state of the organization to deteriorate somewhat.
For example, in some systems updated records that increase in size
and cannot be restored as a single record will attach a trailer to the
latter part of the record, where the trailer record is located at some
randomly distant address. Furthermore, trailers can generate more
trailers, so that several random accessions would be required to re-
trieve the entire record. Another approach is to relocate the entire
record, leaving an unused space. In either case, a maintenance pro-
cedure is needed to reunite the record and its trailers in the one
system, and to collect and return small islands of unused space to a
common reserve in the other. These are file space maintenance re-
quirements, and are characterized by a process that can generically
be called space brooming (or garbage collection), which has various
implementations depending on the method of expanded record update.
In addition, some systems can accommodate space brooming at sched-
uled intervals, when it can be performed massively on all or a large
part of the file, while other systems are so dynamic in their update
transaction rate that space brooming must be brought in as background
work, and is so designed as to be interruptible after a few
seconds' operation, leaving the file in a completely usable state.
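The trailer mechanism can be sketched as a chain of blocks, each costing one random accession to follow; the addresses and contents are invented:

```python
# Sketch of the trailer-record update scheme described above: a record
# that outgrows its space continues in a trailer at a randomly distant
# address, and trailers may chain further, so reuniting the full
# record takes several accessions. The dict stands in for the DASD.

storage = {
    700: {"data": "original part",        "trailer": 812},
    812: {"data": " + first expansion",   "trailer": 95},
    95:  {"data": " + second expansion",  "trailer": None},
}

def read_full_record(addr):
    parts, accessions = [], 0
    while addr is not None:
        block = storage[addr]  # one random accession per block
        parts.append(block["data"])
        addr = block["trailer"]
        accessions += 1
    return "".join(parts), accessions

record, n = read_full_record(700)
print(n)  # 3 accessions to reunite the record
```

Space brooming would rewrite this record contiguously and return the trailer blocks' space to a common reserve.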

2. Information Structure and File Organization

At this point it is important to define the terms information structure,
file organization (taken here to be synonymous with file structure),
and data structure. An information structure is defined here
as an inherent property of the information in a given data collection,
that may exist by design or because of the normal way in which such
data appear in collections. In any case, the automated system designer
normally has no control over this structure since it is an inherent and
fundamental property of the information per se. The most that he
can do is to take advantage of the structure, if this is possible, when
he designs his file organization, where file organization is defined as
the formatting of the data records with list processing controls, and the
assignment and distribution of file types in mass storage devices. In
other words, the information structure may be imposed by the appli-
cation oriented system analysts who are designing the procedures for
collecting and disseminating the information; the file organization is
the prerogative of the computer programming analyst who is to meet
a file retrieval and maintenance specification that includes such
parameters as response time, list length, query logic, DASD types,
etc. Finally, the actual record formats are sometimes referred to as
data structures, and are of relatively little interest in this book, since
they are only a bookkeeping detail. However, some data structures
will be presented in this chapter in order to illustrate in sufficient de-
tail how a particular information structure ends up in storage after a
file organization has been specified.
Information structures can be classified for convenience of dis-
cussion into two types, Hierarchic and Associative. Figure 13 pre-
sents a schematic illustration of them. The dots in this figure relate
to logical records of a file, and the solid and broken lines connecting
various dots indicate the Hierarchic and Associative relationships
among the records, respectively. As shown in the diagram, there is
a record at the top which bears a superior Hierarchic relationship to
five other records of the file, three of which (toward the left) may
[Figure 13: schematic of records (dots) linked by solid lines (Hierarchic relationships) and broken lines (Associative relationships), with overlapping rectangles marking shared subfile membership.]

Fig. 13. Hierarchic and Associative Information Structures
represent a particular subfile of records that have a common attribute
and possibly a common record format. This is indicated by the rec-
tangular enclosure around the three records on the left and the two
records on the right. Similarly, the two on the right are superior in
the hierarchy to other subfiles. Thus, the hierarchy can be represented
as a tree with any number of levels. Another property of a Hierarchic
information structure, as illustrated by the bottommost record in the
diagram, is that a given record may actually be shared in the hierarchy
by more than one superior record. Thus, the record labeled A is a
member of two different subfiles because it is related to two distinct
superior records, as indicated by the overlapping rectangles. This
property transforms the tree into a network or graph. This graph is
directed because of the superior-inferior relationships, and also has
no cycles, but if one were to conceive of a relationship that were bi-
directional, or in which a descendent had a directed relationship to
one of its ancestors, then the graph could also have cycles. Such
information structures may, for example, be characteristic of routes.
The General Electric Integrated Data Store (IDS) [6] is a file system
that was specifically designed to handle this type of information
structure.
In summary, there is a logical relationship between certain records
of this file, which is established when the file is originally constructed.
For example, the record labeled A has a specific inferior relationship
in the hierarchy to the record labeled C, and this in turn has an in-
ferior relationship to record G.
In contrast, an Associative relationship, as indicated in the diagram
by broken lines, is one in which records are related by virtue of the
fact that each contains a specific key that is identical. For example,
record C might be in a subfile that is characterized by "Aircraft
Type"; records E and A may then belong to a subfile that is char-
acterized as "Aircraft Components," where, as implied by the dia-
gram, records E and A are components of Aircraft C. Consider then
that a descriptive element within the records of either of these sub files
were "Overhaul Point," and it is desired by the designers of this sys-
tem to relate or associate all records, regardless of their position in
the file hierarchy, that would contain the same value within the data
field "Overhaul Point." This kind of relationship is called Associative,
and is represented in this diagram by the broken lines, so that the
records D, E, C, and B might be so associated, and the particular
Hierarchic relationship, if any, among these records is not explicitly
denoted or recognized by the Associative relationship. The concept is
analogous to an Associative memory, wherein all words that contain
a specified bit pattern are to be accessed. This kind of information
structure can be implemented by any of the list-structured file
organizations.
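This grouping of records by a shared key value can be sketched in a few lines of Python. The record names echo Fig. 13 and the "Overhaul Point" field of the example above; the subfile labels and the actual overhaul locations are invented for illustration.

```python
from collections import defaultdict

# Hypothetical records from different subfiles; each is a flat dict of fields.
records = {
    "B": {"subfile": "Aircraft Type",      "Overhaul Point": "Olathe"},
    "C": {"subfile": "Aircraft Type",      "Overhaul Point": "Olathe"},
    "D": {"subfile": "Aircraft Component", "Overhaul Point": "Olathe"},
    "E": {"subfile": "Aircraft Component", "Overhaul Point": "Olathe"},
    "A": {"subfile": "Aircraft Component", "Overhaul Point": "Tulsa"},
}

def associate(records, field):
    """Group record names by the value they carry in `field`, regardless
    of the records' positions in any hierarchy."""
    lists = defaultdict(list)
    for name, fields in records.items():
        if field in fields:
            lists[fields[field]].append(name)
    return dict(lists)

lists = associate(records, "Overhaul Point")
# records B, C, D, and E share the value "Olathe" and are thus associated
```

This is precisely the effect of a list-structured file organization: one list per distinct value of the field.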
A given file system may have exclusively Hierarchic or Associative
properties, or it may have a combination of the two; the data struc-
tures (i.e., record formats and controls) required to commonly
manipulate these information structures in the same file system are
presented at the conclusion of this chapter.
Figure 14 presents a more concrete example of a Hierarchic Data
Structure. The first column contains the names of five record types,
or in the terminology of the previous figure, subfile attributes. The
last column contains some typical data elements that might be found
within the fields of the respective records. One might assume, for
simplicity, that the records are in a fixed data format so that particular
character positions within each record are allocated for each field of
the record. The second column indicates a Record Type that is char-
acteristic of the Hierarchic nature of the subfiles. A terminology
frequently used is that a superior record in a hierarchy is called a
Master and an inferior record relationship is called a Detail. Thus
ACCT may be considered to be a Master record, containing as data
elements an Account No., Name, Balance, and Credit Limit. All
information of the Master is recorded once, and commonly relates
to all Detail records that may be inferior within the hierarchy to a
particular account. The ACCT record in this example is Master of
the VOUCH and the INVC, as indicated in the third column. There-
fore, the voucher subfile may contain a series of records which are
Details to a particular Account No., and each of these would contain
a Voucher No., Date of Payment, and Amount. Thus, VOUCH is a
Detail to ACCT, as indicated in the fourth column. Similarly, the
INVC record is also a Detail to the ACCT; however, the INVC record
itself might contain only the Invoice No., Date of Purchase, and
Amount, and another subfile of records might contain the itemization
of the INVC. A fourth record type called ITEM is therefore indicated
as being a Detail to the INVC, and this record would contain an
Item No., Quantity purchased, and a calculated Extension of the
sale. Thus, INVC is indicated in the Record Type column as being
both a Master and a Detail. A fifth record type is INVTY, which is a
Master only to the ITEM, so that the ITEM records are seen to be
common to two Masters, namely, the INVC and the INVTY. Figure
13 is actually a schematic of the Hierarchic relationships in the
example of Fig. 14, where the topmost Master record is an ACCT
record, and the bottommost Master record is an INVTY record.
Records A and B are the ITEM records, which are common to both
INVTY and INVC (where C and F are INVC records).

RECORD NAME   RECORD TYPE   MASTER OF:   DETAIL TO:   FIELD CONTENT

ACCT          M             VOUCH,                    Account No.
                            INVC                      Account Name
                                                      Balance
                                                      Credit Limit

VOUCH         D                          ACCT         Voucher No.
                                                      Date of Payment
                                                      Amount

INVC          M,D           ITEM         ACCT         Invoice No.
                                                      Date of Purchase
                                                      Amount

ITEM          D                          INVC,        Item No.
                                         INVTY        Quantity
                                                      Extension

INVTY         M             ITEM                      Item No.
                                                      Supplier
                                                      Price
                                                      Quantity on Hand
                                                      Reorder Level

Fig. 14. Example of a Hierarchic Data Structure

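As a rough sketch (not the book's data structure), the five record types of Fig. 14 could be declared as follows. The class and field names simply transliterate the FIELD CONTENT column, and the Master/Detail links are held here as lists of record identifiers rather than DASD link addresses.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Acct:                      # Master of VOUCH and INVC
    account_no: str
    name: str
    balance: float
    credit_limit: float
    vouchers: List[str] = field(default_factory=list)   # Detail identifiers
    invoices: List[str] = field(default_factory=list)

@dataclass
class Vouch:                     # Detail to ACCT
    voucher_no: str
    date_of_payment: str
    amount: float

@dataclass
class Invc:                      # Master of ITEM, Detail to ACCT
    invoice_no: str
    date_of_purchase: str
    amount: float
    items: List[str] = field(default_factory=list)

@dataclass
class Item:                      # Detail to both INVC and INVTY
    item_no: str
    quantity: int
    extension: float

@dataclass
class Invty:                     # Master of ITEM
    item_no: str
    supplier: str
    price: float
    quantity_on_hand: int
    reorder_level: int
    items: List[str] = field(default_factory=list)
```

Because ITEM appears on the link lists of both Invc and Invty, the sharing of Details between two Masters falls out naturally.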
In Fig. 15 these relationships are displayed somewhat more ex-
plicitly in the form of a network. Each rectangular block in the
figure refers to a record of the file, and Master records are identified
by solid dog-ears in the upper corners, while Detail records are identi-
fied by open dog-ears in the lower corners. A single ACCT # 1 is
shown, and the Detail subfiles, INVC and VOUCH, are shown with
two and three records, respectively; however, instead of a ramifying
tree, as shown in Fig. 13, the inferior relationships are indicated by
linked chains that can be labeled as the Account/Invoice Chain and
the Account/Voucher Chain. The INVC appears in this diagram as
both a Master and a Detail, and has its own subchain called the In-
voice/Item Chain. The interlocking relationship between the Master
INVTY record and the INVC records via the ITEM records is more
clearly shown in this diagram.
Although Figs. 13 and 15 represent the same information structure,
they imply different file organizations (and, of course, different data
structures) because the access logic would differ. In the organiza-
tion of Fig. 13 any Detail of a given Master can be retrieved with one
accession, if the Master is in core. In Fig. 15, if the logical records
(i.e., each Detail or Master record) are stored randomly, then n ac-
cessions are required to reach the nth Detail on a Master/Detail chain.
A file system should be capable of creating a fairly large number
of Hierarchic levels, and it is also useful to allow any amount of record
sharing or interlocking. The system must also be able to extend the
length of any chain indefinitely by adding more records to a given
subfile, or to contract the length of a chain by deleting records from
a subfile.
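The chain behavior just described, extending a Master/Detail chain by adding records and contracting it by deleting them, is the ordinary discipline of a singly linked list. A minimal in-core sketch, with Python objects standing in for DASD records:

```python
class Record:
    """A Master or Detail; `next` threads the Master/Detail chain."""
    def __init__(self, name):
        self.name = name
        self.next = None

def append_detail(master, detail):
    """Extend the chain indefinitely: link the new Detail after the last record."""
    node = master
    while node.next is not None:
        node = node.next
    node.next = detail

def delete_detail(master, name):
    """Contract the chain: unlink the Detail with the given name."""
    node = master
    while node.next is not None:
        if node.next.name == name:
            node.next = node.next.next
            return True
        node = node.next
    return False

def chain(master):
    """List the names on a chain; reaching the nth Detail costs n steps
    (n accessions when each record is a separate random DASD access)."""
    names, node = [], master.next
    while node is not None:
        names.append(node.name)
        node = node.next
    return names

acct = Record("ACCT#1")
for v in ("VOUCH#1", "VOUCH#2", "VOUCH#3"):
    append_detail(acct, Record(v))
delete_detail(acct, "VOUCH#2")
# chain(acct) → ["VOUCH#1", "VOUCH#3"]
```

The walk in `chain` makes concrete the cost remark above: on a randomly stored chain, each `next` step corresponds to one random accession.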
The actual disposition of these records throughout the random ac-
cess memory is largely a function of the predominance of chain-record
processing for retrieval versus the dynamics of record update (dele-
tions and additions), and particular programmer preferences with
regard to space allocation and the generation of trailers. These points
will be made more explicit further on in the discussion. However, it
should be pointed out here that where there are interlocking or net-
work relationships, as exist with the ITEM records of this example,
it is very difficult to find a particular collection of records that might
constitute a single logical record containing all connected records in
the Hierarchic Structure.

[Fig. 15. Network Schematic of the Hierarchic Structure. Diagram not
reproduced; it shows the Master and Detail records of ACCT #1 and
INVTY #18 joined by the Account/Invoice Chain, the Account/Voucher
Chain, and the Invoice/Item Chain, with Master records marked by
solid dog-ears and Detail records by open dog-ears.]

For example, one might say that all records
relating to ACCT # 1 should be contained within a single logical
record. This would then include the three records shown in the
Account/Voucher Chain, the two records contained in the Ac-
count/Invoice Chain, and the two and three records, respectively,
contained in the Invoice/Item Chains. There arises then the question
of the logical relationship between the INVTY record and the ACCT
record, which is interlinked in this case through two of the ITEM
records. The technique that is used in the General Electric IDS sys-
tem &voids the requirement to make such decisions, because each
record (rectangular block of Fig. 15) is its own logical record and
can be addressed independently of all other records.
There are, however, certain processing advantages that may accrue
from contiguous storage of hierarchically related records. Most
prominent among these advantages is the ability to process all records
in a chain without requiring multiple random accessions or intricate
paging techniques.
Figure 16 presents a Record Format (data structure) with the kind
of controls that would be needed to establish the Hierarchic relations
among subfile records, but which permits contiguous storage of all
hierarchically related records, where particular subfiles have been
previously selected for inclusion within a logical record. To make
these notions more specific, the following terminology has been
adopted. Each of the dots in Fig. 13, or each of the rectangular blocks
in Fig. 15 is called a Detail. (Generically, a Master can be considered
a Detail of itself.) All Details within a common subfile, as indicated
by a rectangular block in Fig. 13 or a chain of Fig. 15, are collected
into a subrecord; hence, a common format can be associated with all
Details within a given subrecord. A logical record consists of a series
of subrecords headed by a subrecord that contains a single Master.
The data storage in a logical record is contiguous within the DASD,
except possibly for individual subrecord trailers, where a trailer may
be randomly dislocated from its parent subrecord, or from a previous
trailer. The Master subrecord contains a record index that indicates
by means of a link address the beginning of each subrecord that is
chained to the Master. This address is a character or word position
indicator relative to the beginning of the logical record. Each sub-
record has a name, and this name along with the link address appears
within the record index.
Within each subrecord appear a series of Detail records, where a
Detail is identified by a Detail Name/Value pair. The Detail Name
is actually the same as the subrecord name, and the Detail Values
serve to differentiate one Detail from another. However, it is not
necessarily the case that all values of Detail Names must be distinct,
because the values may be assigned independently, as they may ap-
pear on different chains. For example, in terms of the example illus-
trated in Fig. 15, all blocks labeled ITEM will appear in the Item
subrecord, but there are, in the illustration of this example, three
ITEM chains, two being associated with the INVC and one with the
INVTY. These chains may, by nature of the system, be independent
with regard to the assignment of values to the Detail (in this case
ITEM); thus, the Detail Name/Value equal to ITEM # 1 is as-
signed twice, since there is an ITEM # 1 relative to INVC # 1 and
INVC #2. Both of these will appear as distinct Details within the
subrecord whose name is ITEM.
In Fig. 16 the first Detail subrecord, which begins at address α1,
contains the link address control information in a Detail index that
appears within every Detail record. This index contains a listing of
Subrecord Name/Link Address pairs, where the Link Addresses may
be of two types. The first is a Link Address within the same sub-
record, and the second is a Link Address out of the subrecord. In
both cases the Link Addresses are again indexes that are relative to
the beginning of the logical record. The Link Addresses (such as
LAγ2, as shown in the diagram) within the subrecord are chain ad-
dresses indicating another Detail within the subrecord to which the
given Detail is to be linked. For example, in terms of Fig. 15, ITEM
# 1 on the INVC # 1 chain would be linked to ITEM # 2 within
the subrecord. The Link Address out of the subrecord (such as LAγ1
in Fig. 16) indicates either a Master-to-initial Detail linkage or, if it
is desired to complete the circuit, a final Detail-to-Master linkage.
Every Detail contains this index, which effectively creates the
Hierarchic Structure among all Masters and Details.
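Under these conventions a chain can be followed entirely by relative link addresses. The sketch below is hypothetical: relative addresses 0, 10, and 20 stand in for character or word positions within the logical record, address 0 being the Master, and each Detail's index carries the Master's subrecord name (ACCT) as its chain key, as described above.

```python
# A logical record as a table of cells keyed by relative address.
# The addresses 10 and 20 are invented stand-ins for positions relative
# to the beginning of the logical record; 0 is the Master subrecord.
logical_record = {
    0:  {"name": "ACCT#1", "index": {"INVC": 10}},   # Master-to-initial-Detail link
    10: {"name": "INVC#1", "index": {"ACCT": 20}},   # chain link within the subrecord
    20: {"name": "INVC#2", "index": {"ACCT": 0}},    # final Detail back to the Master
}

def walk_chain(record, subrecord_name, master_name="ACCT"):
    """Follow a Master/Detail chain of relative link addresses until it
    completes the circuit back to the Master at relative address 0."""
    names = []
    addr = record[0]["index"][subrecord_name]   # start at the Master's record index
    while addr != 0:
        names.append(record[addr]["name"])
        addr = record[addr]["index"][master_name]
    return names

# walk_chain(logical_record, "INVC") → ["INVC#1", "INVC#2"]
```

Since every address is relative, the whole logical record can be moved on the DASD without rewriting any of these links; only trailer and inter-record links need absolute addresses.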
There are two possible instances where a link must be made to
an address that cannot be indexed relative to the beginning of the
logical record. The first of these is a linkage to another logical rec-
ord. For example, in Fig. 15, if INVTY were to be a complete logical
record, and if ITEM were to be within the logical record of ACCT,
then the record index of the Master subrecord of INVTY would have
to refer to a dummy subrecord within the INVTY logical record,
which in turn would link to a particular Detail in the ACCT logical
record. This linkage would have to be an absolute address reference
to the ACCT # 1 record plus a relative address to the particular De-
tail within a subrecord.
[Fig. 16. Hierarchic Controls of the Generalized Record Format.
Diagram not reproduced; it depicts a logical record whose Master
subrecord holds the Master Name/Value, a Record Index of Subrecord
Name/Link Address pairs (LAα1, LAβ1, ...), and a hash-marked data
region, and whose Detail subrecords (beginning at α1, α2, ...) each
hold a Detail Index of Subrecord Name/Link Address pairs (e.g., LAα2
within the subrecord, LAγ2 out of it), a Detail Name/Value, a
hash-marked data region, and, where needed, a Link Address to Trailer.]

The second case of absolute address linkage refers to trailers. Any
subrecord that must be logically extended to a trailer that is ran-
domly located elsewhere in the DASD must have an absolute Trailer
Link Address.
The hash marked regions of Fig. 16 within each Detail and within
the Master subrecord, which can be considered to be a subrecord with
a single Detail, are the regions in which the associative data linkages
and the data fields are contained. The format of these will be dis-
cussed subsequently.

Figure 17 illustrates this data structure for the example of Fig. 14.
It has been decided here that the ACCT, VOUCH, INVC, and ITEM
records of Fig. 14 are to be included in one logical record with Master
Name ACCT, and INVTY is to be a complete logical record by itself.
The Master record contains the Master Name ACCT and a Value
# 1. The index indicates the initial relative position of each subrecord
chained to this Master. It is assumed a priori that all Details have
some inferior relationship to the Master record, although a given
Detail may not necessarily be directly related to the Master within the
hierarchy. The first subrecord is INVC, and starts at location α1. The
first Detail is INVC #1, and has both types of linkage. The chain
linkage carries the subrecord name of the Master (ACCT), and links
to another Detail within the same subrecord, namely α2. At α2 is the
second Detail on the chain (INVC #2), which contains a linkage
with Master Name ACCT and a Link Address referring to the origin
of the logical record, which is the Master record. This completes the
Account/Invoice Chain.

Since INVC is also a Master, there is the "out of" type linkage to
its appropriate subrecord, which in the case of this example is ITEM.
INVC #1, therefore, is linked to ITEM #1 at γ1, where the Invoice/
Item Chain is constructed via the linkages INVC/γ2-INVC/α1,
which completes the circuit. The Account/Voucher Chain starts in
the ACCT #1 Master subrecord, and is implemented by the linkage
VOUCH/β1-ACCT/β2-ACCT/β3-TRAILER ADDRESS 0250-
ACCT/0. The illustration indicates that the amount of space orig-
inally allocated for the Voucher subrecord was insufficient, and
VOUCH #3 had to be stored at another location, namely, 0250.
Since VOUCH #2 at β2 is linked to VOUCH #3 via the Account/
Voucher Chain, a relative Link Address of β3 appears in the VOUCH
#2 Detail, and the absolute Trailer Address 0250 appears at location
β3, which, in effect, acts as an indirect address.

The Inventory record contains only the Master for INVTY #18,
and a single Item subrecord reference to α1, which is an indirect ad-
dress reference to the ITEM Detail at 0100/γ2. In the index of Detail
ITEM #2 (at γ2), the Inventory/Item Chain is appropriately refer-
enced with a Link Address continuation of γ3, and then back to the
Inventory record, itself, via the indirect address γ6.
[Fig. 17. Example of Hierarchic File Control. Diagram not reproduced;
it shows the ACCT #1 logical record stored at DASD address 0100, with
its INVC, VOUCH, and ITEM subrecords at relative locations α1, α2,
β1, β2, γ1, γ2, ...; the VOUCH #3 trailer at absolute address 0250;
and the INVTY #18 logical record at address 0600, whose Item
subrecord entry is the indirect Link Address 0100/γ2.]

This file structure (organization) is actually a cross between those
represented by Figs. 13 and 15. Note that all three are software im-
plementations of the information structure of Fig. 14, and that all
Hierarchic relations indicated in that figure are preserved in each of
the three file structures. The principal advantage of this structure is
that a single random accession will, depending on the logical record
size, bring a number of Details, and possibly a number of complete
subrecords into core for processing. Furthermore, subsequent sub-
records can be accessed by successive DASD serial transmissions.
Only in the case of trailers need there be another accession. There-
fore, the processing of complete Master/Detail Chains is very effi-
cient, or the random selection of an arbitrary Detail, as is possible with
the structure of Fig. 13, is also relatively efficient here. The burden
on those who set up the file is to decide on the composition of the
logical record and to allocate Detail space; in some applications the
Detail requirements are known and fixed. For example, monthly in-
voices to a given account means that if the on-line system is to main-
tain, say, one-year's invoices, then 12 invoice Details would be
allocated in the invoice subrecord. If a minimum allocation is de-
sired, so as not to preallocate excessive space, then the system will
dynamically generate new space as required via the trailers.
The hash marked regions in Figs. 16 and 17 contain the controls
for associative data relationships and all of the data fields pertinent
to the Details. Figure 18 illustrates a generalized format for this
region of the Detail record. Basically, the three kinds of information
that are contained in these fields are keys, qualifiers, and print data.
The keys and qualifiers may be specified as retrieval conditions in the
query, while the print data may be processed by application programs,
may be subject to interrecord processing, and may be subject to
formating programs for data output. From the user's point of view,
there may be little difference in his use of keys versus qualifiers, but,
as discussed in Chapter I, they are treated differently by the storage
and retrieval system. Both keys and qualifiers are, generally speaking,
descriptors in the file system. As such, each may have a Name (cor-
responding to its attribute) and a Value. Examples are NAME/
SMITH, AGE/21. Within the record, both keys and qualifiers are
stored in the general form Key Name/Value or Qualifier Name/
Value. * In the case of the keys, however, a list or file partition is
created for each Key Name/Value pair.
Generally, there are two classes of list structuring techniques; one
is the multiple threaded list approach,† and the other is the inverted
list approach. In the former, a Link Address is required within the
record in order to form the list within the file; hence, in addition to
the Key Name/Value, a Link Address, which indicates the next rec-
ord in the DASD that also contains the identical Key Name/Value
(or Value), is also stored in the key field. The records on these lists
are generally not physically contiguous, but are associated under the
programmed control of the Link Addresses. The qualifiers are not
* In some systems the NAME is omitted or is implied, and only VALUEs are stored.
† Sometimes called knotted lists.
listed and hence, appear only as individual entries (Qualifier Name/
Value) within each record. From a programming viewpoint, a list
search can be performed only on keys, not on qualifiers. In terms of
the threaded list system, such a search would be performed for the
shortest list in each query disjunct, where the logical expression in a
query may be in disjunctive normal form (sum of products) or trans-
formed to disjunctive normal form by a preprocessor. That part of
the query logic dealing with qualifiers is examined only at the time
when a record on a list is accessed from the DASD and brought into
the core memory of the processor. Obviously a random access or list
search can be performed only if at least one nonnegated key appears
in every query disjunct.
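A threaded-list search for one query disjunct might be sketched as follows: walk the key's threaded list by the per-record link addresses, and test the qualifier terms only after each record is brought into core. The records, key values, list heads, and list lengths here are all invented for illustration, and in-core list indices stand in for DASD addresses.

```python
# Hypothetical records at addresses 0..3; each key field holds a
# (value, link address of the next record on that key's list) pair.
recs = [
    {"keys": {"TYPE": ("JET", 1)},     "quals": {"AGE": 5}},
    {"keys": {"TYPE": ("JET", 3)},     "quals": {"AGE": 12}},
    {"keys": {"TYPE": ("PROP", None)}, "quals": {"AGE": 7}},
    {"keys": {"TYPE": ("JET", None)},  "quals": {"AGE": 2}},
]
# Directory of list heads: (head address, list length); the recorded
# length is what lets the searcher pick the shortest list in a disjunct.
heads = {("TYPE", "JET"): (0, 3), ("TYPE", "PROP"): (2, 1)}

def search_disjunct(key, qualifiers):
    """Walk one key's threaded list; qualify each record only after it
    is 'brought into core' (here, after the list indexing)."""
    addr, hits = heads[key][0], []
    while addr is not None:
        rec = recs[addr]                 # one random accession per list record
        if all(rec["quals"].get(n) == v for n, v in qualifiers.items()):
            hits.append(addr)
        _value, addr = rec["keys"][key[0]]
    return hits

# TYPE/JET AND AGE/12 qualifies record 1 only
```

Note that every record on the TYPE/JET list must be accessed even though only one satisfies the qualifier, which is exactly why the shortest available list is chosen for the search.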
Detail Accession Number

Detail Control Information
  • Access Key
  • Usage Statistics

Table of Contents
  • Start of Data
  • Number of Keys or List of Key Lengths
  • List of Data Element Lengths

Keys
  • Key Name/Value/Link Address
  • Key Name/Value/Link Address
  ...

Qualifiers
  • Qualifier Name/Value
  • Qualifier Name/Value
  ...

Data
  [Data Record]

Fig. 18. Associative Controls and Data Storage of the
Generalized Record Format

The inverted list technique also requires the Key Link Addresses,
but in this file structure the list control information is distributed
somewhat differently. The Link Addresses are removed from the
individual records and stored in compact, sequenced lists elsewhere
in the DASD; therefore, the Link Addresses do not appear explicitly
within the records as shown in the figure, but they do exist nonetheless
in the file system, and their function is unchanged.
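An inverted-list sketch of the same idea: the link addresses are held in compact, sequenced lists keyed by Key Name/Value pair, so a conjunction can be resolved by intersecting address lists before any record is accessed. The key names and address values below are hypothetical.

```python
# Inverted lists: for each Key Name/Value pair, a compact, sorted list
# of the addresses of the records that carry it.
inverted = {
    ("TYPE", "JET"):   [0, 1, 3],
    ("TYPE", "PROP"):  [2],
    ("BASE", "TULSA"): [1, 2, 3],
}

def conjunction(*keys):
    """Intersect the sorted address lists, shortest first; only the
    surviving addresses would then require a DASD accession."""
    lists = sorted((inverted[k] for k in keys), key=len)
    result = set(lists[0])
    for lst in lists[1:]:
        result &= set(lst)
    return sorted(result)

# conjunction(("TYPE", "JET"), ("BASE", "TULSA")) → [1, 3]
```

This is the essential contrast with the threaded organization above: list logic is resolved in the index, and the data records themselves are touched only for the final candidates.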
It should be noted that the system that lists Values rather than Key
Name/Value pairs will operate less efficiently by virtue of having
some longer list searches, where the query is specified in terms of Key
Name/Value pairs; however, the Value listing system has an added
degree of flexibility in that the user is not required to give the Key
Name in the query, and this could be an advantage either because he
may not know the Key Name or because he intentionally wants to
see all records with a given Key Value regardless of its Key Name.
For example, consider the Key Names AUTHOR and PRINCIPAL
INVESTIGATOR. The value SMITH could appropriately be as-
signed to either of these Key Names. In that system which lists only
Key Values, one could retrieve the list of all SMITHs regardless of
whether they were AUTHORs or PRINCIPAL INVESTIGATORs,
or it could retrieve, selectively, AUTHOR/SMITH or PRINCIPAL
INVESTIGATOR/SMITH, by searching the SMITH list and quali-
fying each record for retrieval when it is brought into core by ex-
amining the respective Key Names. On the other hand, the system
that lists Key Name/Value pairs could only search for all SMITHs
via the query AUTHOR/SMITH or PRINCIPAL INVESTI-
GATOR/SMITH, and this requires that the user know that the Key
Names AUTHOR and PRINCIPAL INVESTIGATOR are those that
are likely to produce the value SMITH; in fact, there may be other Key
Names that could also produce the value SMITH of which he may
be unaware.
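The Value-listing alternative just described can be sketched the same way: one list per Value, with the Key Name (stored inside each record) checked in core only when the user supplies it. The record addresses and key names below are invented.

```python
# Value-only lists: one list per Value, regardless of Key Name.
value_lists = {"SMITH": [4, 9, 17]}
# What each record actually stores: its own Key Name/Value pairs.
stored = {
    4:  {"AUTHOR": "SMITH"},
    9:  {"PRINCIPAL INVESTIGATOR": "SMITH"},
    17: {"AUTHOR": "SMITH"},
}

def retrieve(value, key_name=None):
    """All records on the value's list; qualify by Key Name only if one
    is given in the query."""
    hits = []
    for addr in value_lists.get(value, []):
        if key_name is None or stored[addr].get(key_name) == value:
            hits.append(addr)
    return hits

# retrieve("SMITH") → [4, 9, 17]; retrieve("SMITH", "AUTHOR") → [4, 17]
```

The longer list search is visible here: the AUTHOR/SMITH query walks all three SMITH records instead of a two-record AUTHOR/SMITH list.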
Each Detail record would normally be identified by a record acces-
sion number in addition to a Detail Name. It is generally desirable
to make the accession number a key of list length one so that retrieval
of Details could also be made by accession number. This is par-
ticularly useful in systems with real-time update. After the Detail
record accession number might appear certain control information
such as an access key that controls query access to this particular
Detail record via a code presented in the query. In some systems a
control such as this may not be necessary, but rather it may be suffi-
cient to control access to the entire logical record or to an entire list
before the file is searched. Having the access key within the Detail
record permits files to contain different kinds of records with different
levels of user access. This topic will be discussed again in Chapter IV.
In addition to the record access key, another type of control might
be record usage statistics. In its most simple form, this could be a
count of the number of times the Detail record is accessed from the
DASD, based upon a list search, and the number of times it satisfies
a retrieval request. More detailed statistics relating to the usage of
individual keys and qualifiers and even user feedback on the utility
or relevance of the record might also be stored. However, if such
detailed data were to be retained it would probably be more expedient
to store them in an auxiliary file rather than continually updating the
Detail record.
It should be assumed that these records are of variable length.
Firstly, because all possible Key and Qualifier Names will, in general,
not appear within each record, and those that do appear may vary in
length, and secondly, because the amount of print data may vary from
one record to another. Therefore, it is necessary to have a table of
contents that indicates the number of Keys and Qualifiers, and if it
is also the case that the values can be variable in length, then the
length of every such Key and Qualifier must also be given. It may
also be desirable to code every Key Name and Qualifier Name nu-
merically, creating a type code to be entered into the table of contents.
It would then be relatively easy for the program to scan the table of
contents in order to determine whether or not the particular Key and
Qualifier Names are present in the given record. Alternatively, this
could be done by an actual scan of the Key and Qualifier fields.
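A variable-length Detail of this kind can be deblocked by scanning such a table of contents. The byte layout below is purely illustrative (a one-byte count, then a one-byte length prefix before each Name/Value string, then the DATA block); the book does not prescribe this format.

```python
def pack_detail(names, data):
    """Illustrative layout: one-byte count of Name/Value strings, a
    one-byte length before each string, then the opaque DATA block."""
    body = b"".join(bytes([len(n)]) + n.encode() for n in names)
    return bytes([len(names)]) + body + data

def unpack_detail(buf):
    """Scan the table of contents to recover the Name/Value strings and
    locate the start of the DATA block."""
    nkeys, pos, names = buf[0], 1, []
    for _ in range(nkeys):
        klen = buf[pos]
        pos += 1
        names.append(buf[pos:pos + klen].decode())
        pos += klen
    return names, buf[pos:]       # the final offset is the "Start of Data"

rec = pack_detail(["NAME/SMITH", "AGE/21"], b"print data")
keys, data = unpack_detail(rec)
# keys → ["NAME/SMITH", "AGE/21"]; data → b"print data"
```

Storing a "Start of Data" offset directly in the table of contents, as Fig. 18 suggests, would let the program skip the scan when only the DATA block is wanted.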
In addition to defining the length of the Key and Qualifier fields,
the table of contents should point to the beginning of the DATA field.
The table of contents and the Key and Qualifier fields represent the
associative information structure controls. The Detail accession num-
ber and control information are part of the IS & R file management
system. All of the remaining data in the record, i.e., that which is
not relevant to either the hierarchic or associative information struc-
ture or to the file management, is labeled DATA in Fig. 18, and would
be formated, internally, within this block, in any desired manner. It
could, for example, have its own table of contents if the format were
variable. Furthermore, different files might have their own DATA for-
mat, but all would be common to the storage and retrieval system
with regard to the information structure controls of Master, Detail,
Keys, and Qualifiers. In this way the IS & R system looks to the
application programmer like an operating system; he uses a specified
calling sequence in order to access records according to the Keys,
Qualifiers and their functional specifications, and one record after an-
other is placed in core, a pointer to the beginning of the DATA block
is given, and he is free to process his field of data within the DATA
block. Modifications to the information structure controls, such as
additions or deletions of Keys or Qualifiers can be made only via
instructions in the IS & R query language, and various statements
(particularly those that enable update) of this language may be
privileged. Chapter IV will deal more specifically with the functions
and structure of the query language, as it relates to file structure. In
a more elaborate system, application programs can be written that
use the above calls to the IS & R retrieval system, and which then
perform the deblocking of the DATA section automatically in accord-
ance with a prestored format for the given file.
In summary, one can conceive of an on-line file system in which
any number of diverse file types are managed, with hierarchic and/or
associative information structures, and in which the applications pro-
grammers, i.e., those who are concerned with processing within the
DATA block can be completely buffered from the record retrieval, de-
blocking, storage, and update controls. Furthermore, there will exist
a user oriented query language by which the desired records can be
accessed, qualified, and otherwise manipulated via intra and inter-
record processing.
CHAPTER IV

THE QUERY LANGUAGE

THE QUERY LANGUAGE reflects all of the system functional require-


ments discussed in the preceding chapter, because it is the interface
between the user of the file system and the automated search and
storage components of the system. Hence, all of the design capabili-
ties that are built into the file system by virtue of its structure must
be made available to the user through the medium of the query lan-
guage. It is important to realize that most users of an information
system will probably be aware of the information structure of the
files (i.e. whether it's hierarchic, whether it's associative, etc.), but
they may be disinterested in the file organization (i.e., whether the
lists are threaded or inverted, or whatever other means of file parti-
tioning are employed). Furthermore, command of the greater capa-
bilities that are designed into the system, by virtue of sophisticated
DASDs, consoles, communications, processors, and advanced tech-
niques can only be exercised by the nonprogrammer user through a
properly designed task oriented query language. This represents a
considerable design challenge, because the nonprogrammer user has
certain traditional thought processes regarding file handling, which
are not, in general, compatible with the way in which the computer
has organized its files internally. It is therefore a role of the query
language to act as a "conceptual clutch" between the human intellect
and the automated file system. The human and the machine, there-
fore, have only two things in common: the information structure (or,
more properly, knowledge thereof) and the query language, although

60
THE QUERY LANGUAGE 61

both of these are interfaces that are susceptive of some distortion. The
former can be distorted if the programmers of the IS & R system do
not accurately portray the information structure in the file structure,
and the latter will be distorted if either the IS & R programmers fail
to deliver all of the file structure and the processing power through
the language (perhaps diluted rather than distorted) or the user fails
to understand and properly utilize the language. This leads to the im-
portant issue of linguistic structure, for which there are two extremes.
Flexibility and expressive power come to language by virtue of an
enriched vocabulary and a syntax that is context sensitive. These two
factors usually lead to ambiguity in the language, as is certainly the
case with all natural languages. At the other extreme are the purely
mechanical languages, like machine code and most compilers, in
which the vocabulary is very limited and the syntax is context free.
At the present state of the art, a natural-like language that would
be translatable to the context free, limited vocabulary compiler or
assembler format for machine execution is infeasible, and the limited
version of such a language that might
be feasible imposes the burden (on the user) of learning the ex-
ceptions (i.e., what not to do) as well as the correct constructions.
And this situation might sooner be rejected by nonprogrammer
users than the purely mechanical, programming-like language. It
should be noted that the issue here is not so much the vocabulary
of command terms (or verbs as they are sometimes called) and
other auxiliary control words, but the syntax of the language. That is,
the rules governing the combinative use of the terms, or the phrase
structure. The inclusion of vocabulary terms like do, find, get, if, etc.
into a compiler is in reality no more an attribute of natural language
than an equivalent use of numeric codes in their stead, because if
someone substituted make for do, the statement would be unintel-
ligible, unless the two words were made equivalent in the language
processor. Or, the phrase do later would not cause any delayed ac-
tion unless the word later were a legal word of the vocabulary, and
unless it had the appropriate connotation to the program and the
syntactic rule existed for allowing the combination do later to be
meaningfully translated into the appropriate machine program. In
fact, if the query processor were written, as some are, to ignore non-
authorized vocabulary words, then the deception is carried that much
further, and the serious user might actually become frustrated because
of the failure of the system to properly respond to his statements.
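The dependence of meaning on both the vocabulary and the syntactic rules can be sketched in modern terms as follows; the language (Python), the vocabulary, and the rule table are all illustrative assumptions, not part of any actual query processor of the period:

```python
# A sketch of why "do later" is meaningful only if the word is in the
# vocabulary AND a syntactic rule admits the combination.
# The vocabulary and rule table below are hypothetical.

VOCABULARY = {"DO", "FIND", "GET", "LATER"}
RULES = {("DO",), ("FIND",), ("GET",), ("DO", "LATER")}

def parse(statement):
    words = tuple(statement.upper().split())
    unknown = [w for w in words if w not in VOCABULARY]
    if unknown:
        # reject, rather than silently ignore, unauthorized words
        return "ERROR: unknown word(s) " + ", ".join(unknown)
    if words not in RULES:
        return "ERROR: no syntactic rule admits this combination"
    return "OK: execute " + " ".join(words)
```

Note that the rejection message, rather than silent acceptance, is what keeps the serious user from being deceived about what the processor understood.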
A more practical reason for avoiding natural-like language is the
62 FILE STRUCTURES FOR ON-LINE SYSTEMS

expense involved in writing it. Thus, not only may the user response
to these languages be negative (although there are always a few who
are enchanted by the anthropomorphism), but they will also be very
costly.
Therefore, it is here recommended that query languages for the
information systems to be designed over the next five years be
very close in structure to compiler languages, although a number of
syntactic and semantic conveniences can be introduced. Furthermore,
and most importantly, the user should be guided, as far as possible,
through the query process in such a way as to insure that the best
possible match is made between his natural thought processes re-
garding file handling and the machine's internal organization of these
same files. This returns to the notion of the query language as a
"clutch."
A stylized, highly controlled and guided query format should not
necessarily be viewed, restrictive as it may seem, as being deleterious
to overall system performance, since it is often the case that imprecise
questions result in expensive and useless searches. A useful approach
that is being widely pursued today is that of the man-machine dia-
logue, wherein precise and highly formatted information is provided
to the automated system by the user in relatively small amounts, and
the system, in turn, responds with statistics relevant to an impending
search, or with certain tables or tutorials that will assist the user in
the further formulation of his query.
The very sophisticated query language would be oriented toward
application programmers as well as nonprogrammer users, and
would therefore be procedure as well as task oriented so that the full
capabilities of the computer could be brought to bear on intra and
interrecord processing. It can be written as a sublanguage of a com-
piler so that the file management functions referred to in the previous
chapter would be available to the applications programmer who uses
the compiler. It can also function as an interactive language for the
nonprogrammer user, although for this purpose the statements of the
language would be executed interpretively, since line by line edit and
execution would be desirable for an effective man-machine inter-
action. This also makes the file management operations, from the
viewpoint of the programmer as well as the nonprogrammer user,
machine independent.
To be more specific, the query language and its associated processor
must enable as a minimum:
1) Retrieval based on the use of keys and qualifiers.
2) Update of records, to include whole record additions and
deletions, key and qualifier additions, deletions and
modifications, and data modifications.
3) Interrecord processing with open-ended use of subroutines.
4) Flexible output formats.
In addition, a computational compiler capability could be added,
but this would be automatically available if the query processor were
embedded within an existing compiler, so that statements of the query
language could be freely intermixed with computational statements.
The query language to be described in the remainder of this chapter
is a prototype or model, because its main purpose is to illustrate and
clarify requirements on the file structures. In a sense, two viewpoints
are being presented before proceeding to the extensive discussion of
file structures. One was the inside out view of the data structures
and records toward the file organization, presented in the last chapter;
the other is the outside in view-the view of the user into the file
organization-as enabled by the query language interface. Both of
these views are important for the programmer of file structures, par-
ticularly on-line file structures, to understand, because the former
represents the most basic mechanisms of the job, the nuts and bolts
and girders of the structure, and the latter represents the raison d'être
of the file structure.
As a prototype, it fulfills the four minimum requirements cited
above, but the reader could adapt or enlarge upon it for his own
problem application, because each system may have certain peculiar
requirements that a basic language cannot completely satisfy. The
language described here may, for some purposes, be operationally ade-
quate, and could be so used. It is a highly formated language with a
context free syntax, and, as indicated above, its primary purpose, in
the context of this book, is to elucidate certain fundamental require-
ments of on-line file structures.
The format, shown in Fig. 19, contains six principal parts, the
first three of which provide basic control information, and are of
little interest here, the remaining three being the most pertinent to
file structure and manipulation.
The first part is a HEADER, which contains the identification of
the user, and, if it were a remote multiterminal system, would also
contain the identity of the terminal. The HEADER may also contain
other information such as the priority of the request and the file access
key, which is used to authenticate the query with respect to the use
of certain files or records in the system.
HEADER (ID, Priority, File Access Key)

COMMAND (Retrieve, Update, Report, etc.)

OUTPUT DEVICE (Typewriter, Display, Printer)

DATA CONDITIONS
K1 (key name R value)
K2 (key name R value)
...
C1 (qualifier name R value)
C2 (qualifier name R value)
...
PROCESSING (A function of K1, K2, ..., C1, C2, ...)
• Intra-Record Logic Functions
• Inter-Record Processing

OUTPUT FORMAT
• Titles
• Formatting Print Statements

Fig. 19. A Prototype Query Format

The COMMAND part of the query is usually a relatively simple
mnemonic statement of the function to be performed in response to
the query. It should be noted that the term "query" is used here in
a very general sense since it is not always the case that a retrieval
is requested, but in certain systems it may be desirable to enable file-
update, edit procedures, status inquiries, and subroutine or proce-
dure calls.
The third part of the query specifies an OUTPUT DEVICE. In
some systems, there may be a selection of output devices, such as the
typewriter for low volume output, the CRT display, which is very
useful in a browsing mode and can also be coupled to photographic
or Xerographic equipment for the production of hard copy, offset
printing masters, and projection slides, or the high-speed line printer
for high-volume output. The combinations and types of devices are,
of course, a function of the purpose of the system, the number of
terminals involved, and the costs that can be incurred.
The vital file related functions appear at this point; however, a
remark about the arrangement of these parts should first be made.
Three separable system functions have been identified, and each re-
lates to a "natural" division of the labor insofar as the manipulation
of the files is to be presented in subsequent chapters. In other words,
the automated system has its own, well ordered way of processing
information, which may differ somewhat from the natural processes
of the people who are attempting to communicate with and use it.
Therefore, in accordance with the principle that the query language
and processor should serve as a "clutching" mechanism or syn-
chronizer between the human and the machine, and since the human
is far more flexible, it was decided to firmly guide the user, via this
format, through a sequence of faceted statements that would, in some
sense, cause him to simulate or reflect in his own mind the "natural"
working order of the automated file system. This separation also
reinforces the illustrative intent of this chapter, because it places in
relief the major functional components of the file structure.
The fourth part of the query is the DATA CONDITIONS sec-
tion and contains COMMAND arguments or parameters. For ex-
ample, in a retrieval command it would be necessary to define keys,
qualifiers, and retrieval conditions. The command parameters for an
update would normally be a unique record accession number and the
data of the update transaction. In some instances a group of records
might be updated by a list key rather than the record accession num-
ber. For example, an update may be required against all captains'
records in a division. In either case, the DATA CONDITIONS,
which specify the key and qualifier conditions for record access, would
be used as though it were a query. It is often useful to be able to
label the key or condition with user assigned symbols or mnemonics,
such as K1, K2, C1, C2, etc., as illustrated in the figure, and then to
use natural language mnemonics or terms in order to specify the key
or qualifier name, a value or series of values, and a relationship be-
tween the name and the value, which is to be satisfied. The format
shown in the prototype, Fig. 19, achieves this in a fairly concise way
by allowing a user assigned symbol as a data condition label, fol-
lowed by a left parenthesis, followed by the key or qualifier name,
in whatever form it is represented in the system, followed by a
relational operator such as equals, greater than, less than, between
limits, followed by a value that may be numeric or alphanumeric, or
a series of values, depending upon the definition of the relational
operator, and followed finally by a right parenthesis. A further
enlargement upon this concept, which would considerably increase the
programming complexity but would not in general be required by
most information systems, would be to enable the values to be com-
puted functions of the values of key or qualifier names. This concept
will be illustrated further on in the discussion. Sometimes a DATA
CONDITION statement degenerates to a field value; i.e., the name
and relation are omitted. This might occur in an update COMMAND,
where a data field, identified in the PROCESSING section, were to
be updated, and the new data to be substituted or added into the
field would be cited and labeled under DATA CONDITIONS.
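The label (name relation value) format just described might be parsed as sketched below; the relation spellings (=, .BTW., .LT., .GT.) follow the sample queries of this chapter, while the regular expression and the dictionary returned are assumptions of this illustration:

```python
import re

# A sketch of a parser for DATA CONDITION statements of the form
#   label (name relation value[,value])
# Only statements with an explicit name and relation are handled;
# the degenerate field-value form would need a separate case.

CONDITION = re.compile(
    r"^\s*(?P<label>\w+)\s*\(\s*"
    r"(?P<name>[A-Z][A-Z ]*?)\s*"
    r"(?P<rel>=|\.BTW\.|\.LT\.|\.GT\.)\s*"
    r"(?P<value>[^)]+)\)\s*$")

def parse_condition(stmt):
    m = CONDITION.match(stmt)
    if m is None:
        raise ValueError("malformed DATA CONDITION: " + stmt)
    return {"label": m.group("label"),
            "name": m.group("name").strip(),
            "relation": m.group("rel"),
            "values": [v.strip() for v in m.group("value").split(",")]}
```

A statement such as K2 (AGE .BTW. 21,30) then yields its label, name, relation, and the list of values demanded by the definition of the relational operator.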
The representation of field names and values in the query and in
the records themselves requires a further remark. It is often more
economical to store coded data for names and/or values within the
records, since the full natural language term may require consider-
ably more storage, and if it is repeated for each of its record occur-
rences in the file, the storage requirement can become unnecessarily
large. Therefore, a coded integer can be substituted in the record for
the name and/or value, and a translation table converts from the
name or value to the code and vice versa, if it is required to print the
keys or qualifiers from the record. Furthermore, in fixed format rec-
ords, the key or qualifier name is implied by character position within
the record, where the name to position translation is provided in a
table describing the format of the record. There is one file organiza-
tion technique, to be described, in which the keys (but not the quali-
fiers) can be omitted from the record entirely.
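The two-way translation between coded integers and natural language terms might be sketched as follows; the table contents are illustrative, not taken from any actual file:

```python
# A sketch of coded storage: records hold small integer codes, and a
# two-way translation table maps between the codes and the full
# natural language terms.

value_to_code = {"TERMINATION": 1, "ADDITIONAL TASK": 2,
                 "CHANGE OF SCOPE": 3, "RENEGOTIATION": 4,
                 "NEW CONTRACT": 7}
code_to_value = {c: v for v, c in value_to_code.items()}

def encode(term):   # applied when a record is stored or a query decoded
    return value_to_code[term]

def decode(code):   # applied when keys or qualifiers are printed
    return code_to_value[code]
```

The reverse table is needed only if coded record entries must be displayed in natural language, as noted above.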
The designer of the system may decide whether or not the user
should be required to explicitly differentiate between keys and quali-
fiers in the DATA CONDITIONS, perhaps by means of the label. In
the latter case, the system would have to distinguish them automati-
cally by a decoder; in the former case, the user would refer to a
printed catalog of key and qualifier names, and this would obviate
the system decoding requirement.
The fifth part of the query specifies retrieval PROCESSING, which
is a function of the DATA CONDITIONS. Two types of processing
are distinguished: Intrarecord logic functions and interrecord process-
ing. The former elicit records from the DASD via list searches, and
once the record is brought into the core memory it may be further
qualified by an examination of the keys and qualifiers within the
record. The final determination of whether a particular record in the
list satisfies the query per intrarecord processing logic is a function
only of data (key and/or qualifier conditions) within the record. The
form of the intrarecord processing function that specifies this list
search is usually a logical or Boolean expression whose arguments
are the labels defined in the DATA CONDITIONS section of the
query.
The interrecord processing involves operations upon intermediate
files that are created from all records qualified by the intrarecord
processing and may involve computations and sorts on all or a selec-
tion of the retrieved records. This processing capability is a require-
ment for report generation, although it is not always the case that
report generation, or interrecord processing, is required in all systems.
Some require only that each record, as it is retrieved and qualified by
intrarecord processing, be displayed to the system user as soon as it
is found.
The PROCESSING section also contains program-like statements
containing built-in query language functions like FOR (looping)
applied to a list, DELETE KEY name, ADD KEY name, SORT
name, TALLY data condition, SUM name, etc., and an allocation
statement of the form α = β will be defined, where β is a functional
expression such as SUM name or A1 AND A2, and α is a user as-
signed symbol for storage and labeling of β.
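The interrecord functions named above might be sketched, for an intermediate file held in core as a list of records, as follows; the field names, record values, and the symbol table standing for the α = β allocation are assumptions of this illustration:

```python
# A sketch of the interrecord functions: TALLY, SUM, and SORT operate
# on an intermediate file held as a list of records, and an allocation
# of the α = β form stores the result under a user-assigned symbol.

def tally(records, predicate=None):
    # TALLY without an argument counts every record under the FOR loop
    return sum(1 for r in records if predicate is None or predicate(r))

def sum_field(records, name):
    return sum(r[name] for r in records)

def sort_file(records, name):
    return sorted(records, key=lambda r: r[name])

file1 = [{"AGE": 25, "ACTION AMOUNT": 100},
         {"AGE": 22, "ACTION AMOUNT": 250}]

symbols = {}                                        # user-assigned labels
symbols["T1"] = tally(file1)                        # T1 = TALLY
symbols["S1"] = sum_field(file1, "ACTION AMOUNT")   # S1 = SUM ACTION AMOUNT
symbols["FILE 2"] = sort_file(file1, "AGE")         # FILE 2 = SORT AGE
```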
The sixth part of the query contains the OUTPUT FORMAT.
Good report generation requires that titles or labels be exhibited either
explicitly from the query itself or called from prestored labeling
formats. In addition, the results of retrieval processing must be in-
serted and arrayed in some flexible manner.
It is the experience of most new programmers that input-output
format statements of a programming language like FORTRAN or
ALGOL, and even of a business oriented language like COBOL,
require more training and present more operational difficulties than
the processing statements of the language. The nonprogrammer user
of a query language of the type described here, who has never been
exposed to even the simplest compiler format procedures would,
therefore, desire as simple a set of output format specification rules
as possible, consistent with obtaining good quality, flexible data dis-
play. These two requirements are somewhat opposing, because greater
display flexibility generally implies more complex format procedure;
however, this can partially be relieved through the use of open-ended
format functions, embedded in an executive program of relatively
primitive construction. These functions or subroutines, like process-
ing functions SUM, TALLY, ADD KEY, etc., discussed above, are
written by the system programmers to provide convenient format
packages for the user. For example, a function called WHOLE
RECORD name might format and print (or display) every record in
the file name, which had previously been defined (in a name = β
processing expression). This format would be standardized but could
be accommodated to any record format of the system. Thus, the file
name would have an associated format description that could be in-
terpreted by the WHOLE RECORD function. This could be further
elaborated by the writing of a series of such functions generating a
variety of special formats either for general use or tailored to the
specific requirements of certain users. The executive or calling pro-
gram for these functions should be highly simplified, because it would
be written by the user; the essential requirements of this executive
processor are the abilities to:

1) Specify line spacing and tab settings (i.e., columnate).
2) Specify alphanumeric labels.
3) Draft lines for CRT charting or graphing.
4) Call the system functions discussed above, and print or dis-
play labeled variables from the Processing section.
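A minimal executive of the kind just enumerated might be sketched as follows; the tuple encoding of statements and the treatment of TAB as a print position are assumptions of this illustration:

```python
# A sketch of the simplified executive: line selection (Ln), tabbing to
# a print position, quoted literals, and printing labeled variables
# computed in the PROCESSING section.

def render(statements, symbols):
    lines = {}
    for line_no, parts in statements:
        text = ""
        for part in parts:
            if isinstance(part, tuple) and part[0] == "TAB":
                text = text.ljust(part[1])      # advance to print position
            elif part.startswith('"'):
                text += part.strip('"')         # literal, printed as-is
            else:
                text += str(symbols[part])      # labeled variable
        lines[line_no] = text
    # unused line numbers are emitted as blank (spaced) lines
    return [lines.get(n, "") for n in range(1, max(lines) + 1)]

page = render([(1, [("TAB", 20), '"PROGRAMMER LISTING"']),
               (3, ['"TOTAL ACTIONS "', "T1"])],
              {"T1": 8})
```

The open-ended format functions (WHOLE RECORD, DECODE, etc.) would be further entries in the same dispatch, each written once by the system programmers.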

In those systems intended either for direct, hands-on use by man-
agers, executives, and high level research personnel, or where the
direct output of the system is intended for immediate use by such
persons in the form of reports or visual displays such as slide projec-
tions, the maintenance of very high quality format and layout is essen-
tial. The functions that these automated information systems will
perform for them have, in the past, been performed by people who
first assemble and digest information, and who then rely upon a
graphics department for the production of these high-quality reports,
and the development of this graphic art has accustomed the execu-
tive to a standard of report presentation that he would be reluctant
to compromise. Thus far, the use of computers in information systems
has been largely focused on the improvement of data handling with
respect to storage, retrieval, update, and processing, but additional
attention will have to be given to procedures for data display if these
systems are to be accepted for direct on-line use by executives and
otherwise oriented nonprogrammer users. This is not to say that the
problems of automated data handling are of decreasing concern be-
cause of its preponderant development; on the contrary, the real
problem of interfacing standard automated IS & R techniques, as are
described in the subsequent chapters of this book, to the Management
Information Systems, Command and Control Systems, Inventory Con-
trol Systems, and so forth, are only now about to be faced.
Figures 20 through 24 illustrate various levels of query complexity,
all falling within the scope of this prototype.

NL252, PR1, FA12743

RETRIEVE AND REPORT

OUTPUT DEVICE
TYP

DATA CONDITIONS
K1 (NAME = SMITH)
K2 (AGE .BTW. 21,30)
K3 (K1* = GORDON)
K4 (JOB = PROGRAMMER)
C1 (LOCN = PHILA)
C2 (DRAFT = 1A)

PROCESSING
INTRA
FILE 1 = K2 (K1 + (¬K4) K3) C1 (¬C2)
INTER
FOR FILE 1
    FILE 2 = SORT AGE

OUTPUT FORMAT
(L1) TAB 20 "PROGRAMMER LISTING"
(L3F) WHOLE RECORD FILE 2

Fig. 20. Sample Query 1

In Fig. 20 the user has identified himself in the HEADER as
NL252. He has further designated a priority of 1 for the query, and
has presented a file access key of 12743. This key can be used as
both a file identification code as well as an authorization code. In the
first instance it is necessary, in a multifile system, to identify the file
or files to which the query is addressed; in the second instance, if
privacy and file protection features have been incorporated, then a
level of control is maintained that can extend as far down as the
Detail record (see Fig. 18), in order to selectively grant or deny access
to specific records within a file. The OUTPUT DEVICE is a type-
writer. If a system only has one device type or if the device address
(and thereby type) is indicated within the ID field, then the OUT-
PUT DEVICE part of the query could be omitted completely. He is
calling for a retrieval and a report generation, and has defined six
DATA CONDITIONS. The DATA labels used in Fig. 20 are in-
dicative, for the purposes of explanation, of the fact that the first four
elements are keys, and the last two are qualifiers. However, as indi-
cated, these labels (K1, K2, etc.) are not operationally significant
with regard to distinguishing keys and qualifiers to the query processor
and search programs. In fact it is usually desirable not to require
that the user be cognizant of which data elements are keys and which
are qualifiers, although this information can readily be provided in
the form of printed catalogs. It is relatively easy to distinguish them
by means of computerized directories, and, if a list search is not pos-
sible, either because the user has not included any keys in his ques-
tion, or because there do not appear any nonnegated keys in a
particular disjunct of the query logical expression, then a message
could so inform the user, and he could appropriately modify the
question.
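The disjunct-by-disjunct test just described can be sketched as follows, where each disjunct of the normal form is modeled as a list of (label, negated, is_key) triples, a representation assumed purely for illustration:

```python
# A sketch of the list-search feasibility test: with the query logic in
# disjunctive normal form, every disjunct must contain at least one
# nonnegated key before a list search can be started.

def list_search_possible(disjuncts):
    return all(any(is_key and not negated
                   for _label, negated, is_key in disjunct)
               for disjunct in disjuncts)

# K2(K1 + (¬K4)K3)C1(¬C2) expands into two disjuncts:
#   K2·K1·C1·¬C2   and   K2·¬K4·K3·C1·¬C2
query = [
    [("K2", False, True), ("K1", False, True),
     ("C1", False, False), ("C2", True, False)],
    [("K2", False, True), ("K4", True, True), ("K3", False, True),
     ("C1", False, False), ("C2", True, False)],
]
```

If the test fails, a message informing the user, as suggested above, is the appropriate response rather than an unbounded file scan.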
In this question the user has defined the Keys and Key Name/Value
relationships as NAME equals SMITH, AGE between 21 and 30,
NAME equals GORDON, and JOB equals PROGRAMMER. As a
convenience, the NAME part of the NAME/Relation/VALUE triple
can be referred to alone by citing the DATA label with a special
character (such as asterisk) appended. Thus, in the definition of
DATA CONDITION K3, the Key Name NAME is referenced as
K1*. The two qualifiers defined are LOCation equals PHILA and
DRAFT equals 1A.
The intrarecord processing logic is of the α = β form. On the left
side is an intermediate file name that is freely designated by the user,
and on the right is a logical expression whose arguments are the
labels of the DATA CONDITIONS. In this particular example, a
file labeled FILE 1 is created, and calls for a list search and intra-
record processing that would produce all records relating to "persons
who are between the ages 21 and 30, whose names are Smith or
Gordon as long as Gordon is not a programmer, and who live in
Philadelphia and are not of draft status 1A." It should be noted
the directory look-up will indicate that a list search is possible, be-
cause there is at least one nonnegated key in every disjunct of this
expression (when the expression is put into disjunctive normal form);
however, the conditions C1 and NOT C2 can only be imposed once
the record has been brought into the core memory from the DASD.
The second part of the PROCESSING, which effectively produces
the report, is interpreted as follows. For every record in FILE 1,
produce a second file, labeled FILE 2, which is sorted according to
the values of the field named AGE. The underlines in the figure are
illustrative only, and indicate functions or command types of the query
processor. The use of indents in these examples, to indicate scope of
control of a FOR statement, is also illustrative, but it may have a
practical value as well, because nonprogrammers are not accustomed
to in-line nesting procedures, whereas programmers and scientists use
them continually in the form of indexes and parentheses. The indent
is a more natural loop or indexing control and nesting format for a
nonprogrammer to use, and he could be taught its use without ever
hearing the words index, loop, or nest. Some terminal devices, like
the Dura Mach 10, have a coded tab key that could be used to signal
indents.
Since the accessed subfile (FILE 1) in this example contains the
entirety of every retrieved record, the argument of the function SORT
may be any field name appearing within the records of the file to
which this query is submitted, and it need not necessarily appear as
a name of one of the labeled DATA CONDITIONS. For example,
this line could have read: FILE 2 = SORT SS NUMBER, in which
case FILE 2 would be produced in a sequence according to the social
security number of the individuals. Note also that FILE 2 in this
example would consist of the entire record, since no specific extrac-
tion of data elements has been requested.
In the OUTPUT FORMAT, a fairly straightforward and simple
line editing procedure is to require that the user indicate either the
line or the beginning of a succession of lines on which various titles
and data are to be displayed. Thus (L1) in Fig. 20 means line 1.
TAB is a query processor function that will cause the output device
to tab a specified number of spaces, in this case 20. Alphanumerics
enclosed by quotes are to be literally printed; therefore, the effect of
the first print statement is to display the words PROGRAMMER
LISTING on the first line of the display, starting in the 20th print
position. The second statement indicates that display is to begin on
line 3 and to continue on as many following lines (L3F) as are neces-
sary to complete the subsequent print command. A function WHOLE
RECORD indicates that no extraction is to occur, that an entire file
labeled FILE 2 is to be displayed, and that a prestored record display
format is to be used for each record. It is, of course, necessary to
assure that any argument of a function, such as FILE 2 of the function
WHOLE RECORD, will have been previously defined within the
query, which in this case was done under interrecord processing,
where FILE 2 was the AGE sort of FILE 1, which in turn was de-
fined by the intrarecord processing logic function.
Figure 21 presents a somewhat more complicated example. Three
DATA CONDITIONS are specified; first, that PROCUREMENT
ACTION be less than five (where it is presumed that the values of
PROCUREMENT ACTION in a given record type are encoded as
numerics); second, that BUSINESS TYPE equals J; and third, that
BUSINESS TYPE equals K. In reality one would expect the values
of a data field (key or qualifier) named PROCUREMENT ACTION
to be natural language expressions such as TERMINATION, AD-
DITIONAL TASK, CHANGE OF SCOPE, etc.; however, for stor-
age economy it is sometimes useful to encode these terms as integers.
Also, if these codes are classified or grouped, it can aid in retrieval,
as demonstrated in this example, where it is presumed that all pro-
curement actions with an assigned code number less than five have
some particular significance to the user. In this example they repre-
sent New Contract types as opposed to Modified Contract types. It is
also possible to encode Names as well as Values within the records.
In any case, it is desirable to enable the user to specify either the
natural language or the coded Name and/or Value. This requires a
natural language to code translation table, and, if in addition it is
desired to generate natural language display from the coded record
entries, then the reverse table translation is required.
The intrarecord PROCESSING function calls for the creation of
two files, one of which would consist of D1 and D2 and the other of
D2 or D3 and not D1. In other words, FILE 1 will consist of all
procurement actions less than five and of business type J. FILE 2
will consist of all procurement actions greater than or equal to five
and of either business type J or K.
AGT1, PRIORITY 1, FA1234

RETRIEVE AND REPORT

OUTPUT DEVICE
TYP

DATA CONDITIONS
D1 (PROCUREMENT ACTION .LT. 5)
D2 (BUSINESS TYPE = J)
D3 (D2* = K)

PROCESSING
INTRA
FILE 1 = D1 AND D2
FILE 2 = (¬D1) AND (D2 OR D3)
INTER
FOR FILE 1
    T1 = TALLY D1
    S1 = SUM ACTION AMOUNT
FOR FILE 2
    FILE 3 = SORT D1*
    FOR FILE 3, SAME D1*
        T2 = TALLY
        S2 = SUM ACTION AMOUNT
    PRINT (1)

Fig. 21A. Sample Query 2


OUTPUT FORMAT

(L1) "TOTAL ACTIONS AND DOLLARS
     OF NEW CONTRACTS"
(L3) TAB 10 "TOTAL ACTIONS " T1
(L4) TAB 10 "TOTAL DOLLARS $" S1
(L7) "CONTRACT MODIFICATIONS AND
     DOLLARS BY TYPE"
(L9) "CONTRACT MODIFICATION" TAB 20,
     "NUMBER" TAB 40, "DOLLARS"
PRINT (1)
(L10F) DECODE D1*, TAB 20, T2,
     TAB 40, S2, REPEAT

Fig. 21B. Sample Query 2

The interrecord PROCESSING would be as follows: For all rec-
ords in FILE 1, a TALLY is made of those records that contain a
field in which PROCUREMENT ACTION is less than five (i.e., D1).
This TALLY or count is to be held in an intermediate storage loca-
tion labeled T1. Since FILE 1 has been created by the conjunction
D1 and D2, it is obvious, in this case, that T1 will contain the total
number of records in FILE 1, because every record will satisfy D1,
whereas, for example, a TALLY on D2 for FILE 2 would not, in
general, produce the count of FILE 2. Also, for each record in FILE
1 the values within the field ACTION AMOUNT will be Summed
and this sum will be stored in S1. For all records of FILE 2, a third
file, labeled FILE 3, will consist of the SORT (of FILE 2) by PRO-
CUREMENT ACTION (D1*). Then, for all records of FILE 3
which have the SAME PROCUREMENT ACTION value, a TALLY
will be made and stored in T2, and the SUM of the respective AC-
TION AMOUNTS will be made and stored in S2. The TALLY func-
tion without an argument is interpreted as applying universally to
all records under the control of the FOR statement, while a TALLY
statement with an argument constrains the count to only those rec-
ords that satisfy the argument. Thus, the prior statement TALLY D1
could have been simply TALLY by virtue of the definition of FILE 1.
THE QUERY LANGUAGE 75

At this point a print specification labeled (1) will be executed. This
print specification is found within the OUTPUT FORMAT section
and will be discussed at that time.
In some query languages the output format statements are inter-
mixed with processing; however, in this language prototype, the two
functions are clearly separated, in order to illustrate the processing
block without extraneous statements; furthermore, this separation
might be a valid proposition from a human engineering viewpoint,
because when a man's thoughts are on information processing, he
may know what and when he wants to print or display data, but he
usually does not want to stop to specify formats; therefore, the minimal
indication that can be suitably labeled for subsequent processing may
be the preferred approach. In this language the statement PRINT
(n) is inserted, which means that data will be printed in accordance
with a format statement labeled n within the OUTPUT FORMAT
section. Since FILE 3 in this example would generally consist of a
number of sorted PROCUREMENT ACTION groups all with the
same value (such as PROCUREMENT ACTION 5, PROCURE-
MENT ACTION 6, etc.), the last three statements will have to be
repeated for every such group. In report generation parlance, these
different groups of equivalent values within a sort key are called
breaks. The processing under the statement FOR FILE 2 also dem-
onstrates an interesting factor in man-machine interaction. People
who are used to working with report preparation or who use reports
know that the most efficient way to tally breaks is to presort the file
so that all of the same key values are collected within the sort se-
quence. Of course, common sense would also indicate such a pro-
cedure. Therefore, such a person would "naturally" insert the
statement FILE 3 = SORT D1* before tallying the breaks, thus
increasing the processing efficiency, even though the tally could have
been made without the sort, using only the function T2 = TALLY
SAME D1*. This latter procedure, however, would then require
either multiple passes of FILE 2 or a built-in sort for the function
SAME.
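The presort-then-tally procedure just described can be sketched in modern notation. This is a minimal illustration, not the book's implementation: Python postdates the book, and the (code, amount) record layout standing in for PROCUREMENT ACTION and ACTION AMOUNT fields is an assumption.

```python
# A sketch of the presort-then-tally approach: FILE 3 = SORT D1*, then
# one TALLY (T2) and one SUM (S2) per break. Record layout is assumed.

def tally_breaks(records):
    """Sort the retrieved subfile by the procurement-action code, then
    compute, for each break (run of equal codes), the tally and sum."""
    file3 = sorted(records, key=lambda rec: rec[0])  # FILE 3 = SORT D1*
    report = []
    i = 0
    while i < len(file3):
        code = file3[i][0]
        t2 = 0          # TALLY for this break
        s2 = 0          # SUM of action amounts for this break
        while i < len(file3) and file3[i][0] == code:
            t2 += 1
            s2 += file3[i][1]
            i += 1
        report.append((code, t2, s2))   # one PRINT (1) line per break
    return report
```

Because the file is presorted, each break is consumed in a single pass; without the sort, a SAME-style function would need either multiple passes or an internal sort, as the text notes.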
Figure 22 illustrates how this report might appear, in accordance
with the OUTPUT FORMAT specified in Fig. 21B. An unlabeled print
statement begins on line 1 with the title TOTAL ACTIONS AND
DOLLARS OF NEW CONTRACTS. Since there is no TAB function,
printing begins at the left margin. On line 3 (line 2 is spaced) the
label TOTAL ACTIONS is printed after a TAB 10 indent; the label
is followed immediately by T1, which is interpreted as intermediate
storage, previously defined under retrieval processing. This location,
as indicated above, contains the count or TALLY for FILE 1. On
line 4 the indented label TOTAL DOLLARS is printed, followed by
S1, which contains the sum of the ACTION AMOUNTS of the new
contracts.

TOTAL ACTIONS AND DOLLARS OF NEW CONTRACTS

TOTAL ACTIONS 8
TOTAL DOLLARS $ 325,800

CONTRACT MODIFICATIONS AND DOLLARS BY TYPE

CONTRACT MODIFICATION        NUMBER      DOLLARS

Additional Task                18       3,252,000
Change of Scope                 3          70,000
Termination                                65,000
Renegotiation                   4         120,000

Fig. 22. Report Format for Query 2

On line 7 the title CONTRACT MODIFICATIONS AND DOL-
LARS BY TYPE is printed starting at the margin; on line 9 the title
CONTRACT MODIFICATION is printed starting at the margin, the
title NUMBER is printed after a TAB of 20, and the title DOLLARS
is printed after a TAB of 40. The next statement in the output format
is PRINT (1) which is specifically referenced in the PROCESSING
section associated with FILE 3. This print statement will be executed
at the appropriate time in the processing, and in the case of this
example will be repetitively executed for every break in FILE 3 (i.e.,
for each distinct value of the function SAME D1*). The PRINT (1)
operation starts at line 10 and continues on as many following lines
as is necessary. That is, for every repetition of PRINT (1) in the
PROCESSING section a subsequent line will be printed from the
OUTPUT FORMAT section. The line begins by executing a DE-
CODE function on Dl *. That is, the value in the field named PRO-
CUREMENT ACTION is decoded to its English equivalent, using
the code to natural language translation table. The printing begins at
the margin since there is no TAB function preceding. Therefore, the
terms Additional Task, Change of Scope, Termination and Renegotia-
tion are all of the English terms relating to PROCUREMENT AC-
TIONS that are greater than four and that were retrieved from the
file under the retrieval logic expression of FILE 2. On the same line
there is a TAB 20 after which the value contained in T2 is printed.
From the retrieval processing part of the query, this is seen to be the
TALLY or count of the procurement action value breaks, i.e., the
number of times a given procurement action value appears in the
retrieval of FILE 2. After a TAB of 40 the value in location S2,
which is the sum of the action amounts for the same procurement
action is printed. Thus, in Fig. 22 it is shown that the procurement
action "Additional Task" appeared 18 times in the retrieval, and the
total Dollars associated with this action were 3,252,000. "Change of
Scope" appeared 3 times and total dollars associated with the
three actions were 70,000, and so forth. The final statement in the
PRINT (1) sequence is a function without an argument, REPEAT,
which returns control to the retrieval processing section. That is, the
four lines of this report starting at line 10 (ADDITIONAL TASK 18
3,252,000) are printed by four successive executions of the PRINT
(1) statement.
Figure 23 presents an even more complex query processing ex-
ample, still within the same prototype format. This one involves the
use of functions in the DATA CONDITIONS section. In this type
of processor the CONDITION is generalized from the form NAME/
Relation/VALUE to NAME/Relation/function of NAME. This is
illustrated by element A2 which calls for AGE greater than AVER-
AGE of AGE. This means that the function AVERAGE is applied
to every AGE in the file in order to compute a value that can be
substituted into A2. Whether this type of query can be accommo-
dated as a real-time search depends upon several factors. First, if the
argument of the function (in this case AGE) is a key and if the
Directory is properly designed, then a real-time search is possible.
The Directory design that enables this type of search is called a Tree
and will be described in Chapter VI. Every value under AGE can
be readily located in the Tree Directory and the average value com-
puted. If the argument of the function were a qualifier, then it will
not appear in the Directory, and there are two alternatives. The
functional computation can be limited to values appearing in the
subfile retrieved by the keys, which means that only the ages con-
tained in records retrieved by the keys in the DATA CONDITIONS
would be averaged, which, of course, is not the average age of the
file. In this case a real-time search would result, but the users should
be aware of this restricted use and meaning of the function. If, as the
second alternative, the function were always to apply to the entire
file, regardless of whether the argument were a key qualifier, then a
real-time search would not be possible for qualifier arguments since
every file record would have to be scanned in order to compute the
functional value.

ACT 5, PRI, FA2586

RETRIEVE AND REPORT

OUTPUT DEVICE
LINE PRINTER

DATA CONDITIONS
A1 (SEX = M)
A2 (AGE .GT. AVERAGE (AGE))
A3 (SALARY .GT. MAX (RENT + LIVEXP))

PROCESSING
INTRA
FILE 1 = A1 AND A2 AND A3
INTER
FOR FILE 1
FILE 2 = NAME, ADDRESS, A2*, A3*
FOR FILE 2
FILE 3 = SORT A3*, NAME

OUTPUT FORMAT
(L1F) WHOLE RECORD FILE 3

Fig. 23. Sample Query 3

The DATA CONDITION A3 introduces a further complication.
Here the argument is an arithmetic expression; basically, the problem
involves the citation of multiple arguments. If the arguments RENT
and LIVEXP were keys, a real-time search on the A3 condition is
logically impossible since all combinations of both sets of values would
have to be computed from the Directory. Aside from being a very
time consuming process, the combinations themselves are meaning-
less. Therefore, the only practical alternative is that this type of
CONDITION will be tested only on the list retrieval subfile (i.e. as
qualifiers), but again, the user should be aware that the CONDI-
TION is not applied to the total file, but to the subfile accessed by
keys.
The intrarecord retrieval logic, in this example, is a conjunction of
the three elements, while the interrecord processing of the retrieved
file (labeled FILE 1) is first to create another file (labeled FILE 2)
which, for every record of FILE 1, extracts the fields (i.e., the key or
qualifier fields) whose names are NAME, ADDRESS, AGE, and
SALARY. It might also be the case that DATA are formatted and
subject to such extraction as well. Then, all records of FILE 2 are
sorted by NAME within SALARY, and this sorted file is labeled
FILE 3.
The OUTPUT FORMAT is simply to print FILE 3 on lines 1
and following. It should be noted that only the four extracted fields
(FILE 2) will appear in the report, and they will be sorted appro-
priately by name within salary. Since titling and tabs are not used, a
standard record format would be retrieved from the query processor
library to produce an appropriate page layout under the control of
the WHOLE RECORD system function.
Figures 24 and 25 present file updates in the prototype query
language. The first one, identified as MR76, has no output device
requirement. The DATA CONDITIONS contain the data field names
and values that are required to process the update. A1 is the unique
accession number of the record, and A2 is a key name and value that
represent the modification data. Since this update involves only a
single record, and it is not conditioned on the contents of any other
record, only intrarecord processing is required. FILE 1 is the record
to be updated (A1) and for this record two system functions are ex-
ecuted. First the key GRADE is deleted from the record. This re-
moves both the name GRADE and its value, which might not be
known to the user, from the record, and removes the record from the
list for the Name/Value pair. The second function adds the key A2
to the record, and the record to the appropriate list.
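The pair of system functions in this update can be sketched as follows. In-core dictionaries and sets stand in for the DASD record and the per-key lists, which is an illustrative simplification; in the book the lists are threaded or inverted DASD structures.

```python
# A sketch of DELETE KEY and ADD KEY: each must keep the record and the
# per-key lists consistent. Per-key lists are modeled as sets of
# accession numbers (an assumption, not the book's list structure).

def delete_key(record, key_lists, name):
    """Remove the Name/Value pair from the record (the value need not be
    known to the user) and the record from that pair's list."""
    value = record.pop(name)
    key_lists[(name, value)].discard(record["ACC NO."])

def add_key(record, key_lists, name, value):
    """Add the key to the record, and the record to the appropriate list."""
    record[name] = value
    key_lists.setdefault((name, value), set()).add(record["ACC NO."])
```

Running DELETE KEY GRADE followed by ADD KEY A2 against a record thus moves the record from the old GRADE list to the new one, exactly the double bookkeeping the text describes.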

MR76, PRIORITY 2, FA8721

UPDATE

DATA CONDITIONS
A1 (ACC NO. = 67215)
A2 (GRADE = 67)

PROCESSING
INTRA
FILE 1 = A1
FOR FILE 1
DELETE KEY GRADE
ADD KEY A2

Fig. 24. Sample Query (Update) 4

Figure 25 presents a final example. It is an update and has an
output device requirement. The file being updated contains the rec-
ords of automobiles, with the various engineering drawing numbers
being stored as DATA. The indicated update is against a set of
records defined as all cars of type XL78 made between the years 1966
and 1969. This subfile (FILE 1) is produced as a list search under
intrarecord processing. For every record in FILE 1, the DATA field
F7 is deleted where F7 would be identified within either the record
format description, if DATA is a fixed format subrecord, or the table
of contents, if it is a variable format subrecord. Then the degenerate
DATA CONDITION D is added to DATA field F7. That is, the
contents of D are regarded simply as alphanumeric data, and are
loaded directly into field F7. Finally, the updated FILE 1 is restored.
The file system does this by decoding the accession number of each
record to its DASD address.
The output statement prints only the DATA subrecord of every
record in FILE 1, and since this statement is executed after the
processing, the updated appearance of FILE 1 will be printed.
In the future, increasingly high-quality report generators, particu-
larly on-line, real-time, demand report generators are going to assume
greater importance. All levels of management will look to this tool as
a meaningful basis for decision making. The bases of its utility and
ultimate success are: (1) the ability to retrieve information from a
flexibly structured file on an arbitrary set of key and qualifier com-
binations, (2) the ability to perform a variety of highly effective
interrecord processes that go beyond simple sorts and tabulations,
and (3) perhaps most importantly, the ability to generate an easy to
read, high quality textual and graphic layout. This capability can, of
course, be extended beyond the production of hard-copy documents.
With graphical techniques and CRT displays it is possible to produce
very high quality printing masters and projection slides, even in color,
and, in the not too distant future, movies. The exploitation of demand
report generation by means of the above described techniques should
receive more attention in the future, both from management and
from systems designers.

MR77, PRIORITY 2, FA8721

UPDATE
OUTPUT DEVICE
TYP
DATA CONDITIONS
B (YEAR .BTW. 1966, 1969)
C (TYPE = XL78)
D (STR COL PLAN NO. 8765231E)
PROCESSING
INTRA
FILE 1 = B AND C
INTER
FOR FILE 1
DELETE DATA F7
ADD DATA F7 D
RESTORE FILE 1

OUTPUT FORMAT
(L1F) DATA FILE 1
Fig. 25. Sample Query (Update) 5
CHAPTER V

CLASSIFICATION OF FILE
ORGANIZATION TECHNIQUES

IN CHAPTER I an overview of the information system was presented,
and in Chapter III the functional requirements it imposes upon the
file structure were discussed. Then three definitions were introduced:
information structure, file structure (organization), and data structure.
The first was said to be outside the purview of the automated system
designer, since it was a relatively fixed property of the data files pre-
sented to him. The second and third are of direct interest to the
designer, but the third is relatively trivial as compared with the second
because it primarily concerns record formats and list structuring con-
trols. The second, file structure, is of primary interest to the designer
because it is here that he makes decisions regarding the file partition-
ing techniques, the type of directory construction, file and directory
maintenance techniques, and executive system and query processor in-
teractions. Chapters VI, VII, and VIII are devoted exclusively to these
subjects, and in order to assemble the various concepts and techniques
into a coherent body of usable design information, a simple classifi-
cation is to be made in this chapter of the techniques for file structur-
ing. This systematization is also helpful when comparing techniques
and designating trade-offs. The first step is to isolate the file struc-
turing problem from file manipulation programs, and both of these
from executive and query processing functions. Figure 3 offered a
means of doing this, and Fig. 26 is an expansion of part of that figure,
indicating more explicitly the data flow of the query through the
system. The files to be organized for on-line access and the programs

that directly communicate with these files are also identified in Fig. 26.
The remote terminal interfacing to the system through the Input/
Output Program is as described in Chapter I, Fig. 3. The Query In-
terpreter divides the information in the prototype format into two
data streams. One flows to the Directory Decoder and the other to the
File Search Executive. The solid blocks in Fig. 26 indicate the major
system component programs, the DASD symbols designate the files of
interest (in particular, the list structured file system) and the broken
line blocks indicate specific information transmissions or buffers. The
data stream to the Directory Decoder consists of the query ID, the
retrieval keys and retrieval conditions and functions, as expressed
in the DATA CONDITIONS and PROCESSING sections of the
query, respectively.
The key Directory Decoder translates from the natural language
Key Name/Value pairs (or Values, depending upon the list approach)
to an address or series of addresses that may point to the head of a
list in the file, to a series of heads of lists in the file, or, in the case
of the Inverted List system, to every record in the file that satisfies
the key conditions. The information required to perform this decoding
is contained in the DASD in a table called the Key Directory. In
addition to transmission of these addresses, the Directory Decoder can
also provide presearch retrieval statistics, which indicate an upper
bound on the ultimate retrieval. As will later be shown, various list
structuring techniques provide different qualities of retrieval statistics.
Both the statistics and the list addresses are transmitted to the Search
Executive.
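The decoding step just described can be sketched as follows. The directory layout, a mapping from each Key Name/Value pair to a list-head address and a list length, is an illustrative assumption; the shortest-list bound is one simple form of the presearch retrieval statistics the text mentions.

```python
# A sketch of the Directory Decoder for a conjunctive query: translate
# each Key Name/Value pair to its list-head address, and report a
# presearch statistic bounding the ultimate retrieval.

def decode_conjunction(directory, query_keys):
    """Return the list-head addresses for the query's keys, plus an upper
    bound on retrieval size: for a conjunction, no more records can
    satisfy the query than lie on the shortest key list."""
    heads = [directory[key][0] for key in query_keys]
    upper_bound = min(directory[key][1] for key in query_keys)
    return heads, upper_bound
```

Both outputs flow to the Search Executive; the bound can be echoed to the terminal so the user may decide whether to proceed, modify, or terminate the query.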
This Executive corresponds to block 4.2 in Fig. 3, and in systems
that are to provide multiterminal, real-time response this Executive
might also enable multitasking either in the sense of controlling a series
of processors (multiprocessing), each identified in Fig. 26 as a Query
Execution Area (QEA), or in a simulated multiprocessing mode,
where each QEA is a core partition, and the Executive performs a
switching function among the various QEAs, thus enabling N queries
to time-share both the processor and the files.
The second data stream from the Query Processor feeds directly
to the Search Executive. This consists also of the query ID, the
COMMAND, the DATA CONDITIONS, PROCESSING functions,
OUTPUT DEVICE and OUTPUT FORMAT. The Multiprocessing
Executive assigns a QEA to every query in real-time execution.
Within this area are placed the query parameters and the list
addresses.
[Figure 26 shows the system block diagram: the Input/Output Program and its buffer feed the Query Interpreter, which sends the query ID, keys, and retrieval conditions to the Key Directory Decoder, and the ID, command, conditions, output device, and output format to the Multitasking Search Executive; the Decoder consults the Key Directory and returns presearch retrieval statistics and list addresses, and the File Search program, operating through Query Execution Areas QEA 1 through QEA n, accesses the list structured file system.]

Fig. 26. The System Block Diagram
It is frequently desirable in on-line systems that there be at least
a minimal dialogue in which the Executive transmits the presearch
statistics from the Directory Decoder to the user terminal via the I/O
program. As mentioned above, some file organizations are better
suited than others for the production of quality presearch statistics
and, therefore, if such a dialogue is a system requisite, certain file
structures might have to be excluded from consideration. The user
at the terminal then decides, based upon the statistics, whether to
proceed with the search in the File, to modify the query or to termi-
nate it. When he responds, the Search Executive can take the ap-
propriate action. In the first case it enters the job into an execution
schedule, and in accordance with a file-scanning strategy, it executes
a random record read (or write) in the DASD, and transfers control
to the File Search System with a specific connection to an appropriate
QEA. This scanning strategy is a function both of the file list struc-
turing organization and of the system functional requirements. In
general, however, a complete search would not be performed on all
the lists for a single query at one time but would be intermixed with
the list searches of other queries, in accordance with the built-in scan-
ning strategy of the system. The use of priorities can, of course,
dynamically override the preprogrammed scanning strategy. Re-
sponses from the File Search System flow from the particular Query
Execution Area back through the Search Executive to the Output
program, where they are temporarily buffered in a high speed DASD
and subsequently transmitted to the remote device. If interrecord
processing is indicated in the query, then the Search Executive trans-
mits the responses to an intermediate file instead of the Output
program (File F-6 in Fig. 3).
Aside from the On-Line Executive (Block 1.2 of Fig. 3), the sys-
tem can be viewed as containing five major subexecutive components.
These are (1) the Query Interpreter, (2) the Search Executive,
(3) the Key Directory Decoder, (4) the File Search Subsystem, and
(5) the Input/Output Program. The remainder of this book is con-
cerned largely with the Key Directory Decoder and the File Search
Subsystem, because these relate most directly to file-structuring tech-
niques. The Multitasking Search Executive will also be briefly
discussed because different file structures have an effect upon the
design of this executive.
Basically, on-line file search can be viewed as a two step process:
Step 1 involves a decoding or translating from the natural language
or encoded keys in a logical expression to the list addresses by means
of the Key Directory. Step 2 consists of the random access search
in the File based upon the list addresses. Figure 27 classifies the
existing techniques that are used to perform these two functions. At
the left of the diagram is contained a tree denoting the various Key
Directory decoding techniques. These divide into two general classes,
one called randomizing or hash coding, the other called tree or table
look-up decoding. The randomizing approach has a number of
variants that will be described, but they are all classified functionally in
a single group. The tree decoding technique, however, has various
functional ramifications. The first distinguishes between Fixed and
Variable Length Key decoding. In natural language a complete key
is usually variable in length. If any transformation is made on this
key that converts it to a fixed length, then some decoding ambiguity or
redundancy may be introduced; hence, the designer must consider,
based upon the system functional requirements, whether or not such
ambiguity can be tolerated. Fixed Length Keys can usually be tol-
erated and this approach is generally preferred because the decoding
programs are somewhat easier to write and are faster in execution.
There are two general methods by which a full length key can be
fixed in length, as indicated in the figure. The first is to sample some
fixed set of characters or bits from the key; a special case of this,
which is used in many systems, is simply to truncate the key to a
given number of characters. The second method is to apply a ran-
domizing technique to the key in order to transform it into a range
that is represented by a fixed number of bits. There are also two
approaches to the use of variable length keys, one of which is to use
the entire key, and the other is to determine a unique sampling (or
transformation) for each key such that as little of the key word is
retained as is necessary to completely differentiate it from every other
key in the system. This latter approach may, in terms of storage,
actually be as economical as the fixed key approach, and at the same
time, provides completely unambiguous decoding; however, as will
be seen in the subsequent discussion, this technique is fairly compli-
cated both in terms of the original generation of the decoding tree
and in the updating of the tree and may also have some other un-
desirable features.
The purpose of creating this classification is so that at each level
the designer can make a selection based upon clearly differentiated
properties of the decoders. These properties will be thoroughly cov-
ered in Chapter VI, but they can be summarized here, in connection
with the classification tree shown in Fig. 27 as follows: the first level
[Figure 27 classifies the two-step retrieval process: on the left, Key Directory Decoding branches into randomizing (hash coding) and tree (table look-up) methods; the tree method divides into Fixed and Variable Length Keys, with character sampling, randomizing, and Unique Sampling as further subdivisions. On the right, File Search spans a continuum from pure Multilist through List Length Control and Partial Inversion to the fully Inverted List, with Cellular Partitions as an offshoot.]

Fig. 27. The Two-step Retrieval Process
distinguishes the randomizing approach from the tree approach. In
general, the randomizer requires less DASD capacity (sometimes
none) than the tree, although in some applications the difference is
slight. The tree, on the other hand, has a key range scanning prop-
erty, which is necessary for implementation of the arithmetic rela-
tional operators "greater than," "less than," etc., while the randomizer
does not have this property. The decoding speeds of the two methods
are comparable under most circumstances.
At the second level, the tree method branches into Fixed versus
Variable Length Keys. The trade-off here is based entirely upon
ambiguous decoding. The former may produce an ambiguous de-
coding that would have to be resolved by a subsequent examination
of the entire key, either in another subdirectory (which would be
costly in storage) or in the File record at search time (which is less
efficient). The latter is guaranteed not to produce an ambiguous de-
coding, but the price is greatly increased programming complexity
and DASD storage requirements. Most designers prefer the Fixed
Length Key because of simplicity and because the cost in additional
storage or loss of efficiency is not significant.
The third level decision for Variable Length Keys is based upon
storage economy versus programming complexity, since unique
sampling is more economical in storage but more complex to pro-
gram. Also, if range search is required, then either the complete key
must be retained or the sampling must be via truncation. The third
level decision for Fixed Length Keys is based mainly upon the re-
quirement for range search capability. The sampling technique via
truncation provides this capability, whereas any other sampling
technique, or the randomizing of the key, would not.
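The reason truncation alone preserves range search can be shown in a small sketch. The 4-character cut and the list representation of the directory are illustrative assumptions.

```python
# A sketch of why truncation supports range search while randomizing does
# not: truncating to a fixed length is order-preserving (up to ties among
# keys sharing a prefix), so a directory kept in truncated-key order can
# be scanned for "greater than" / "less than" conditions. A hash would
# scatter adjacent keys and destroy this property.

def truncate(key, n=4):
    """Fix the key length by truncation to n characters."""
    return key[:n].ljust(n)

def range_scan(sorted_truncated_keys, low, high):
    """Return the directory entries with low <= key < high, feasible only
    because truncation preserves the lexicographic order of the keys."""
    return [k for k in sorted_truncated_keys
            if truncate(low) <= k < truncate(high)]
```

Truncation can still merge distinct keys into one entry, which is the decoding ambiguity discussed above, but it keeps the directory scannable in key order; any non-order-preserving sampling or randomizing forfeits the range operators.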
The second step of the on-line retrieval process, as indicated on the
right side of Fig. 27, is File Search.
The primary requirement of file structuring for on-line retrieval
is the process referred to in Chapters I and III as logical partitioning.
That is, given that each record of the file is to be stored in only one
place in the DASD, and that it is desired to be able to access this
record on the basis of anyone of its keys along with other records,
all of which contain this key, what logical organization does this im-
pose (1) upon the programs that store, retrieve, and update these
records, (2) upon the placement of the records in the DASD, and
(3) upon the data structures within the records? In the late 1950s
and early 1960s, programming techniques called list structures were
devised for the internal (core) manipulation of pushdown lists.[7]
These methods were described in the literature as threaded lists,[8]
knotted lists,[9] and multiple threaded lists (Multilist[10]). With the
introduction of the DASD, certain forms of these techniques were
adopted to implement list structured storage and retrieval. The most
general of these list structures is the multiple threaded list or Multi-
list system of Prywes and Gray; these techniques will be gen-
erically referred to hereafter either as threaded lists or Multilists
without any specific or proprietary connotation to the latter. The
threaded list is characterized as having a single DASD address as an
output from the Key Directory Decoder for each key; this address is
that of the first record in the file that contains the particular key.
Each subsequent record in the file containing this key is accessed by
a link address that is stored within the record containing the key and
is associated, within the record, with the particular key, as illustrated
in Fig. 18. Thus, if a particular key appears in 100 records, there
will be a single address in the Directory pointing to the first record,
and there will be 99 link addresses, each within a respective record
containing the key, pointing in turn to the next of the 99 records
within the DASD. Basically, the multiple threaded list, or Multilist
system,* allows for any number of keys to be contained within a
record, and, hence, a threaded list is developed with as many threads
passing through a given record as there are keys in the record.
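Retrieval from such a threaded list can be sketched as follows. The in-core dictionaries standing in for the Directory and the DASD, and the use of None as the end-of-list mark, are illustrative assumptions.

```python
# A sketch of threaded-list (Multilist) retrieval: the Directory yields a
# single head address per key, and each record carries, for each of its
# keys, the link address of the next record on that key's list.

def multilist_search(directory, dasd, key):
    """Follow the thread for `key`, returning every record on its list.
    Each step costs one random DASD record access."""
    results = []
    addr = directory.get(key)        # single head address per key
    while addr is not None:
        record = dasd[addr]          # random access to the next record
        results.append(record["data"])
        addr = record["links"][key]  # link stored within the record itself
    return results
```

For a key on 100 records, the Directory holds one address and the records hold the other 99 links, so the whole list is recovered by 100 successive random accesses.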
Another approach to list structuring in a DASD comes from more
traditional methods of file manipulation. This is the Inverted List,
in which all of the link addresses are removed from the file records
and are placed at the output of the Directory in address sequence.
That is, all records containing a given key, X, will have their ad-
dresses listed in a monotonic sequence within a variable length
record, the address of which is the output of the Key Directory for
key X.
The two schemes represent different file structures and require dif-
ferent types of programming, but they can represent the same infor-
mation structures, whether Hierarchic or Associative. In one sense
they are identical because they use the same means of logically defining
the partition, namely, the link address, and both require the same
number of link addresses to partition a given file; the only difference is
the location of these link addresses. The Inverted List technique re-
moves the distributed link addresses of the Multilist system from the
File records and compacts them into another file of their own, which

* This same type of list structure has also appeared in the literature as knotted list. (See
Bibliography [9].)
is intermediate between the Directory and the File. This modifica-
tion, however, endows these two structures with significantly different
properties in terms of retrieval response time, update response time,
ease of programming, presearch retrieval statistics, space maintenance
and query processor and executive program design.
Therefore, although they are functionally identical, they are struc-
turally and operationally quite different, and in this latter sense are
polarized as shown on the right side of Fig. 27. It turns out that
there are certain operational deficiencies for large files in the Multilist
system that are corrected by the Inverted List approach, but there are
certain economies of file storage and programming simplicity in the
Multilist systems. Therefore, it is natural to consider combinations
of the two approaches, producing a spectrum between these two poles
in which partially inverted Multilist systems can be constructed.
Furthermore, as it will be presented in Chapter VII, this spectrum
can be made continuous because a program parameter can be assigned
a value at file generation time that establishes for the file structure a
predetermined degree of inversion that can be varied, as desired,
from one file generation schedule to the next.
This modification in the basic Multilist concept is called the con-
trolled list length Multilist, whereby the system can be designed to
produce lists of any desired maximum length ranging from one up to
the actual list length. If every list were limited to a length of one,
an Inverted List system would be generated; therefore, the range of
Multilist systems is represented in this figure as extending continuously
from pure Multilist where the controlled list length is unlimited, to
the Inverted List where the controlled list length is unity. Since the
degree of inversion can be varied at file regeneration by using the
list length as a program parameter, the system manager can periodi-
cally alter the parameters of the system in accordance with the actual
file and query statistics at a given time, which may enable him to
dynamically adapt and improve the system's responsiveness.
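The effect of the list length parameter can be sketched as follows. The in-core representation of the Directory and sublists is an illustrative assumption; only the partitioning rule matters here.

```python
# A sketch of the controlled-list-length idea: at file generation a
# parameter L caps every list. When a key's list would exceed L records,
# the Directory holds an additional head address instead of extending
# the thread. L = 1 degenerates to a fully Inverted List; unlimited L is
# the pure Multilist.

def build_controlled_lists(key_to_addresses, max_length):
    """For each key, split its record addresses into sublists of at most
    max_length entries; each sublist's first address becomes a head
    address held in the Directory."""
    directory = {}
    for key, addrs in key_to_addresses.items():
        sublists = [addrs[i:i + max_length]
                    for i in range(0, len(addrs), max_length)]
        heads = [s[0] for s in sublists]   # one Directory entry per sublist
        directory[key] = (heads, sublists)
    return directory
```

Since max_length is only a program parameter, the same generation program can produce any degree of inversion, which is what lets the system manager retune the structure at file regeneration.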
List structuring, however, is not the only means of file partitioning.
Another is to physically partition the DASD into cells and to direct
the key search to these cells instead of to individual records. There
may then be a list structure within the cell, which randomly accesses
records, or alternatively, the cell can be searched serially. Whether
or not a cellular partition is list structured will depend upon such
factors as the cell size (both in number of records and in character
capacity), the programming complexity desired, the type of DASD,
the requirement for on-line update, and the desired system response.
When this technique is described, it will also be shown that there is a
particular advantage to assigning the records in some coherent man-
ner to the cells. This has the effect of reducing the number of cell
accessions, where the record groupings within the cells reflect similar
subgroups that will be called for in queries.
Figure 27 presents the Cellular Partitioning approach as an off-
shoot of the partially Inverted List structures, because they are, effec-
tively, a list structure (of cells instead of records), and the degree of
record inversion is a function of the cell size. Small cells tend toward
complete record inversion (1 record per cell), and large cells tend
toward either a Multilist, if the cell is internally Multilisted, or a serial
file if it is not internally list structured. On the other hand, if the
cell is internally list structured by an Inverted List then, regardless of
the cell size, the entire file is inverted but in a multistage hierarchy.
In summary, there are two generally distinct steps in the on-line
retrieval process-the Directory Decoding and the Partitioned File
Search.
The programming techniques for implementing the file structure
respective to these two steps are described and evaluated in the next
two chapters. Chapter VI presents Techniques of Directory Decod-
ing, and Chapter VII presents Techniques of Search File Organization.
The final chapter of the book describes and evaluates methods of
generating the files, updating them on-line, and providing space main-
tenance.
CHAPTER VI

TECHNIQUES OF DIRECTORY
DECODING

FOUR OF THE decoding techniques shown in Fig. 27 are to be
described and analyzed. These are (1) the truncated fixed length key-
word tree, (2) the unique truncation variable length key-word tree,
(3) the complete variable length key-word tree, and (4) the ran-
domizer. For convenience these descriptions will be shortened to the
fixed tree, truncated variable tree, variable tree, and randomized meth-
ods, respectively. The procedure followed will be to describe and
illustrate the methods of each, then to formulate expressions for re-
trieval time and storage requirements, and finally to compare them
with respect to programming complexity, decoding speed, and
memory requirement.
There are two types of graphical constructions for decoding trees.
One is called the balanced tree [11], and the other the unbalanced
tree. The first has the property that if the lowest level of the tree is N,
then all keys can be decoded in either (N -1) or N levels. The un-
balanced tree is simply defined as a tree that does not have this
property, i.e., it is not balanced. The advantage of the balanced tree
is that the decoding time is nearly constant; the disadvantage is that
the programming is slightly more complicated. An advantage of the
unbalanced tree, for some purposes, is that decoding speed prefer-
ences can be intentionally assigned; however, as will be shown, the
tree depth required to accommodate key vocabularies in the tens of
thousands is only two or three levels, using available DASDs, and
hence decoding time cannot be improved much in any case of
unbalancing. The tree construction to be described here is the balanced
tree with a fixed length key-word fragment that is obtained by trunca-
tion of the full length key. This is the most commonly used type of
tree or table look-up decoder.

1. The Tree with Fixed Length Key-word Truncation

Figure 28 illustrates schematically a portion of such a tree. A
series of names in alphabetic order is given in Table 5, and the
tree of Fig. 28 decodes the names on this table, where each name has
been truncated to the first four letters. For simplicity it is assumed
in this diagram that the tree branches three ways at each disk level.
In actuality, the branching factor is usually several hundred, as will be
shown in a sample calculation at the end of this section. The tree
shown in Fig. 28 is a three-level tree; the first level may be contained
within the core memory, and the second two levels are within the
DASD. In order to illustrate a specific memory allocation strategy of
some importance in certain types of DASDs, it is assumed that the de-
vice is either type two or three, as defined in Chapter II, Table 3, and
therefore has cylinders. If the tree is to be implemented on a fixed
head disk or drum, then all of the following still applies, except that
it is immaterial to which tracks the tree nodes are assigned. A disk
addressing notation Tij is used in the diagram, which designates the
jth track in cylinder i. Across the bottom level of the tree, the key
fragments corresponding to the complete name given in Table 5 can
be read alphabetically from left to right, with symbolic link addresses
A13, A12, A11, ..., A1. The format of a single field within the track
is as follows: Truncated key fragment/Link address to the head of list
or to the next level track/List length.
The tree is constructed by assigning the key-term fragments to tree
node positions either in forward or reverse order. In this example
reverse order has been chosen. Thus, the last three fragments (EYER,
DYSO, DUNL) are placed in the rightmost bottom-level track, T13.
The associated symbolic head of list addresses and list lengths,
AI/L1, A2/L2, and A3/L3, are also inserted into the record. The
last key fragment on the track, which is called the range of the track,
is also entered into a second-level track as a reference; therefore, the
last record entry of track T10 is EYER/T13, which indicates that
EYER is the last key fragment on the third-level track T13, and the
[Fig. 28. Tree with Fixed Length Key Word Truncation]



Table 5
AN EXAMPLE OF UNIQUE NAME TRUNCATION

Full Name        Uniquely Coded Fragment    Uniquely Coded Fragments for Multilevel Tree

BABBET B B
BABSON BABS BABS
BAILEY BAI BAI

BAKER BAK B
BELL BE BE
BELLSON BELLS BELLS
BELMAN BELM BELM

BLACK BL B
BLACKWELL BLACKW BLACKW
CARDER C C
CARTON CART CART

CROZIER CR C
DUNLAP D D
DYSON DY DY
EYERS E E

list length field being null (∅) indicates that the reference is to an-
other track in the Directory and not to the list in the File. The next
three key fragments from the bottom of the list are CROZ, CART,
and CARD; these are entered, in turn, into the next available track,
namely, T 12 , along with their respective link addresses and list lengths.
CROZ, the last fragment on the track, is entered as a reference in the
next record position on the second-level track T 10, with the appro-
priate track reference T12 and the null (∅) list length. The next three
fragments are treated similarly, as shown in the diagram, and this
completes the second-level track, T10, at which time the last fragment
of T10 is entered into the first-level record, which is stored in core
as EYER/T10, indicating that the last entry on track T10 is EYER.
No list length is required because it is assumed that link addresses
from the core level do not lead to File records. The next three records
in reverse order are BAKE, BAIL, and BABS. These are placed in
the third-level track, T01, and the last fragment is referenced in track
T00, as BAKE/T01/∅. At this point there is one remaining fragment,
namely, BABB in the list. Since there are two second-level positions
open in track T00, it is not necessary to begin another third-level track;
therefore, BABB is placed in the second record of track T00 with its
appropriate link address A13 and list length L13. The last fragment
of the second-level track T00 is now entered into the first-level record
as the reference BAKE/T00. This completes the construction of the
three-level tree for the 15 keys shown in Table 5. It should be noted
that there are actually only 13 key references since two of them are
ambiguously decoded. These are the names BELL and BELLSON,
which both truncate to BELL, and BLACK and BLACKWELL,
which both truncate to BLAC.
The decoding in this tree proceeds as follows. Assume that the
name CARDER is to be decoded. The first four characters are
truncated to give the key fragment CARD. A search in the core level
indicates that CARD is greater (alphabetically) than BAKE, but less
than EYER; therefore, the second-level track T10 is accessed. This
track is scanned in core, and it is determined that CARD is greater
than BLAC but less than CROZ; therefore, the third-level track T12
is accessed. Within track T12 an exact match of CARD is found in
the first record, and the decoding is complete with the output of link
address A6 and list length L6.
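The decoding walk just traced can be sketched in a few lines. The track contents below are a hypothetical in-memory stand-in for the DASD records of Fig. 28; the layout is simplified, with each track kept in ascending order, and the addresses A1 through A13 follow the text's naming.

```python
# Hypothetical in-memory stand-in for the directory tracks of Fig. 28.
# Upper-level entries are (range fragment, lower-level track); bottom-level
# entries are (fragment, (head-of-list address, list length)).
CORE = [("BAKE", "T00"), ("EYER", "T10")]
TRACKS = {
    "T00": [("BAIL", "T01"), ("BAKE", "T02")],
    "T10": [("BLAC", "T11"), ("CROZ", "T12"), ("EYER", "T13")],
    "T01": [("BABB", ("A13", 1)), ("BABS", ("A12", 1)), ("BAIL", ("A11", 1))],
    "T02": [("BAKE", ("A10", 1))],
    "T11": [("BELL", ("A9", 2)), ("BELM", ("A8", 1)), ("BLAC", ("A7", 2))],
    "T12": [("CARD", ("A6", 1)), ("CART", ("A5", 1)), ("CROZ", ("A4", 1))],
    "T13": [("DUNL", ("A3", 1)), ("DYSO", ("A2", 1)), ("EYER", ("A1", 1))],
}

def decode(word, width=4):
    """Truncate the key, then follow range comparisons down the tree."""
    frag = word[:width]
    node = CORE
    while True:
        for f, link in node:            # serial scan; a real node would use
            if frag <= f:               # a binary partition search
                break
        else:
            return None                 # past the last range: key not present
        if isinstance(link, tuple):     # bottom level reached
            return link if f == frag else None
        node = TRACKS[link]             # each step here is one disk accession
```

Decoding CARDER truncates to CARD, passes through the core level to T10, then to T12, and yields the pair (A6, L6), exactly as in the walkthrough above.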
This tree will thus decode all keys except BABBET in two disk
accessions, and BABBET in one accession. If the indicated track
numbering scheme can be followed, then the first random access (to
the second tree level) costs a head movement plus track rotation
latency, while the second random access (to the third level) is within
the same cylinder, and costs only latency. In general, as many frag-
ments should be represented at a tree node as is physically possible.
That is, each node corresponds to a physical DASD record. The
limitation in size of this record will be fixed either by DASD hard-
ware or by the size of record that can be buffered into core. Typically
a physical record may correspond to a full track of, say, 3,000 char-
acters. If the key fragment/address/list length triplet required 15
characters, then 200 fragments could be accommodated in each tree
node, and a three-level tree could decode (200)3 = 8 million keys.
This means that a 1 in 8 million choice can be made in three random
accessions if the top level (requiring 3,000 characters) were not stored
in core, and in two random accessions if the top level were stored in
core. At this point one might challenge the proposition that such a
large node be maintained simply to generate a shallow, balanced tree;
however, the following reasoning indicates that this is the optimal
approach. Assume that the intranode comparisons, which are per-
formed in core, are done serially, and that it requires 100 micro-
seconds per comparison (i.e. about 25 machine executions). Then,
the maximum node search time is 20 milliseconds, and the average is
10 milliseconds. The comparison procedure could be optimized by
programming a binary partition search, which might require 200
microseconds for comparison and range adjusting, but the number
of comparisons would be log2 200 ≈ 8, and the node search time
would be 1.6 milliseconds. By comparison, the DASD random access
times for most of the device types that would be used for this purpose
are between 20 and 500 milliseconds, and typically 60 to 100 milli-
seconds. Therefore, the time to scan a large node in core is negligible
as compared with the individual node access time if optimized (binary
partition) coding is used, and is, on the average, 1/6 to 1/10 of
typical access times if the scan is not optimized.
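The comparison above (roughly eight probes against an average of one hundred serial comparisons) is easy to verify. A minimal sketch of the binary partition search over one in-core node, using Python's bisect for the partitioning; the fragment values are hypothetical:

```python
from bisect import bisect_left
from math import ceil, log2

# One tree node holding 200 sorted key fragments (hypothetical values).
node = [f"K{i:03d}" for i in range(200)]

def node_search(frag):
    """Binary partition search: at most ceil(log2(200)) = 8 probes,
    versus up to 200 for a serial scan of the same node."""
    i = bisect_left(node, frag)
    if i < len(node) and node[i] == frag:
        return i                     # position of the matching fragment
    return None                      # fragment is not in this node

assert ceil(log2(200)) == 8          # the figure quoted in the text
```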
In order to enable on-line update of the tree, the physical records
that represent the nodes should not be completely packed. That is, a
certain amount of reserve space should be left at the end of the record
for node expansion. Landauer [12] recommends that for most systems
(i.e. based on various simulations of update dynamics), 10% reserve
space is adequate. Assume, for example, that the key word BINDER is
to be added to this Directory. The word is first truncated to the frag-
ment BIND. It is greater than BAKE and less than EYER; therefore,
track T10 is accessed. It is less than BLAC; therefore, track T11 is
accessed. It is greater than BELM but less than BLAC; therefore, it
belongs between these two fragments. If it is assumed that there is an
input and an output core buffer, then the physical record is transferred
into the output buffer, with BIND inserted between BELM and BLAC
along with its appropriate reference address and list length. (Probably
1 if it is a new key occurrence). If there is enough reserve space at the
end of the record to accommodate the new key, the output buffer will
not overflow, and it can be restored to track T11. If the reserve space
has been consumed because of previous updates (10% reserve in a
200 key track would allow 20 updates before overflow), then the last
key is bumped into the next track, on the same level, if the tree is to
remain balanced. This means that the next track and the higher-level
track must both be updated. In the example, BLAC would be added
to track T12 and the range of T11 would then be BIND, which would
be substituted for BLAC in track T10. If track T12 did not have
sufficient reserve, then CROZ would be bumped into T13, etc. If a
statistic were maintained on the number of tracks that had exhausted
their reserve space, or on the average number of bumps required to
effect an update, then, at a given threshold, the tree could be com-
pletely regenerated and the reserve space reestablished. The exact
percentage of reserve should be a program parameter, and its particu-
lar value would in time be determined as a function of the system's
Directory update dynamics and the desirable period for regeneration.
It should also be noted that Directory update dynamics are usually
far less severe than file update dynamics, so that Landauer's 10%
reserve figure is probably adequate for most systems.
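The bumping procedure just described can be sketched as follows. The structures are hypothetical stand-ins for tracks on one tree level, with a deliberately small capacity standing in for, say, 200 entries less 10% reserve; the example reproduces the BIND insertion of the text.

```python
# A sketch of on-line insertion with reserve space. Each track holds at
# most CAPACITY entries; on overflow the last key is bumped into the next
# track on the same level, and the higher-level range entries are updated.
CAPACITY = 4

def insert(tracks, ranges, t, frag):
    """tracks: list of sorted fragment lists; ranges[i] mirrors tracks[i][-1]."""
    tracks[t].append(frag)
    tracks[t].sort()
    while len(tracks[t]) > CAPACITY:     # reserve exhausted: bump last key
        bumped = tracks[t].pop()
        t += 1
        tracks[t].insert(0, bumped)      # becomes smallest key of next track
    for i, trk in enumerate(tracks):     # re-derive the higher-level ranges
        ranges[i] = trk[-1]

# The text's example: the track holding BELL..BLAC is full, and BIND arrives.
tracks = [["BELL", "BELLS", "BELM", "BLAC"], ["CART", "CROZ"]]
ranges = ["BLAC", "CROZ"]
insert(tracks, ranges, 0, "BIND")
# BLAC is bumped into the next track, and the first track's range becomes BIND.
```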
The illustration of Table 5 and Fig. 28 presents a truncation de-
coder for names. This identical strategy would not work for the more
practical case of a key that was of the Name/Value form, since a
truncation might destroy all or a significant part of the Value. Two
approaches are, therefore, open. One is to truncate X letters from
the Name and Y letters from the Value, and to construct a key frag-
ment from the concatenation of the two fragments. The other ap-
proach is somewhat more flexible in that it more readily enables the
construction of trees that are tailored in the length of their Value
truncations. This is to generate a two-level hierarchy of trees, the
first of which is the Name Directory, the outputs of which point to
respective Value Directories. In this way, the Name fragment does
not have to be repeated with every Value fragment, and each Value
tree could, in principle, have a different length fragment, although all
fragments within the same tree would be the same length.
Aside from high-speed decoding of large scale Directories, and
relatively simple programming, another advantage of the tree is its
ability to perform range searches. For example, Query 1 of Fig. 20
called for a key search of the range AGE/21 to AGE/30. The system
does not know in advance what actual values of the key name AGE
exist in the file, but it would like to perform a list search on those
that do. A program could therefore be written that would appro-
priately decode the Name tree (in this case it would decode AGE),
and be directed to a Value tree. It would then decode the lower
value bound (21), and scan the bottom of the tree from this point up
to and including the upper bound (30). All values encountered in
this scan would be decoded to their address reference and a list
search made for each.
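The range search of Query 1 can be sketched directly: once the Name tree has led to the AGE Value tree, the bottom level is scanned from the lower bound up to and including the upper bound. The value list below is a hypothetical sorted bottom level.

```python
# Hypothetical bottom level of an AGE Value tree:
# (value, head-of-list address, list length) triplets in sorted order.
age_values = [("18", "A1", 4), ("21", "A2", 7), ("24", "A3", 2),
              ("30", "A4", 5), ("42", "A5", 1)]

def range_search(values, lo, hi):
    """Decode the lower bound and scan up to and including the upper bound;
    each hit yields an (address, list length) pair for a list search."""
    return [(addr, n) for v, addr, n in values if lo <= v <= hi]

hits = range_search(age_values, "21", "30")   # the AGE/21 to AGE/30 query
```

Only the values actually present in the file (here 21, 24, and 30) are decoded and searched, which is the point of the scan: the system need not know the stored values in advance.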
Formulations derived at the end of this chapter will indicate how
many tracks and levels are required for a given key vocabulary size.

2. The Tree with Variable Length Key-word Unique Truncation

This method of Directory Decoding introduces two complicating
factors. First, the key fragments are variable in length, so that the
program does not know beforehand how many fragment fields can be
packed onto a track. Second, in order to minimize the amount of
storage required, key-word truncation is used, as far as is necessary
in order to create a uniquely coded fragment, and this procedure in-
troduces another kind of ambiguity. One set of rules for such an
encoding is illustrated in the second and third columns in Table 5.
As in the fixed word Directory, the key words are first alphabetized
as shown in the lefthand column of the table. Then, the uniquely
coded fragment, as shown in the second column, is obtained by using
the least number of letters, starting from the left, that will be required
to completely distinguish the given full name from the immediately
preceding name. Hence, BABBET, being the first name in this par-
ticular series, can be uniquely encoded simply by the letter B; BAB-
SON is distinguished from BABBET in the fourth letter; therefore,
BABS is the code fragment required to distinguish BABSON from
BABBET. Likewise, BAILEY is distinguished from BABSON by
the letters BAI, and BAKER is distinguished from BAILEY by BAK.
This procedure is applied to the entire alphabetized index, and the
result is as shown in the second column of Table 5.
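The rule just described is mechanical: each word in the alphabetized list keeps the fewest leading letters that distinguish it from its predecessor. A sketch:

```python
def unique_fragments(sorted_words):
    """Minimal-prefix encoding of an alphabetized key list (the second
    column of Table 5): grow each fragment until it differs from the
    immediately preceding word."""
    frags, prev = [], ""
    for word in sorted_words:
        n = 1
        while word[:n] == prev[:n] and n < len(word):
            n += 1
        frags.append(word[:n])
        prev = word
    return frags

names = ["BABBET", "BABSON", "BAILEY", "BAKER", "BELL", "BELLSON"]
# yields ["B", "BABS", "BAI", "BAK", "BE", "BELLS"], matching Table 5
```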
In order to illustrate how these variable length coded fragments
are stored as a tree in a disk memory, and how the tree is decoded,
two cases will be presented. First, the case where all 15 names in the
table are to be placed on one track is considered. Then this concept
will be generalized to the multilevel tree assignment.
The single track arrangement is shown in Fig. 29, where the in-
formation is presented in a three-column table. The first column in-
dicates the Record Number on the track. The second column indicates
the Track Content within each field of the indicated record, where a
separate line is used for each information field. The third column
presents Comments relating to the data in the field. The procedure
is as follows: The fragments are rearranged in descending order of
fragment length. Thus, the longest fragment, BLACKW, which con-
tains six letters, appears at the head of the list; then comes a five-letter
fragment, BELLS, and so forth. The first field in a record is called
the header, and contains two pieces of information; one is the frag-
ment length which, in the case of record 1, is 6, and the other is the
number of fragments that are contained in this record which, in this
case, is also 1. Therefore, the decoding program knows that it will
read a fragment code of length 6 with its corresponding link address
and list length, and that there will be only one such triplet in the
record. The second logical record contains a single triplet where
the fragment is of length five; the third record contains three triplets
where the fragment lengths are four, namely, CART, BELM, and
BABS. Thus, the entire track is laid out according to this descending
size of the fragment length. In general, this track would be one among
many that would contain a partial listing of the coded fragments.

RECORD NO.   TRACK CONTENT    COMMENT

1            6,1              Key of Length 6, 1 Key
             BLACKW/A1/L1     Reference to key BLACKWELL
                              at file location A1 with
                              list length L1
2            5,1              Key of Length 5, 1 Key
             BELLS/A2/L2
3            4,3              Key of Length 4, 3 Keys
             CART/A3/L3
             BELM/A4/L4
             BABS/A5/L5
4            3,2
             BAK/A6/L6
             BAI/A7/L7
5            2,4
             DY/A8/L8
             CR/A9/L9
             BL/A10/L10
             BE/A11/L11
6            1,4
             E/A12/L12
             D/A13/L13
             C/A14/L14
             B/A15/L15

Fig. 29. Single Track Arrangement


The decoding of information on this track would proceed as fol-
lows: Consider the case where it is desired to decode the name BAB-
SON, and the higher levels of the tree had led to this particular track.
The entire physical record (i.e., the track) would be read into the
core memory. The first record would indicate that six letters of
BABSON had to be compared with the six-letter fragment, BLACKW;
the comparison would fail, and the program would proceed to remove
one letter from BABSON and compare the first five letters with the
five-letter fragment, BELLS, whereupon the comparison would again
fail. Then the program would compare the first four letters, BABS,
with CART, BELM, and BABS, whereupon a match would be made
and the word BABSON would have been uniquely decoded to the
appropriate link address A5. The ambiguous decodings BELLSON
and BELL of the previous tree would fall out uniquely in this decoder
at BELLS and BE respectively.
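The last-level matching rule of Fig. 29 (longest fragments tried first, with the search word truncated to each record's fragment length) can be sketched as:

```python
# The track of Fig. 29 as a hypothetical in-memory list: records grouped by
# fragment length in descending order, each holding (fragment, link address)
# pairs; the list lengths are omitted for brevity.
track = [
    (6, [("BLACKW", "A1")]),
    (5, [("BELLS", "A2")]),
    (4, [("CART", "A3"), ("BELM", "A4"), ("BABS", "A5")]),
    (3, [("BAK", "A6"), ("BAI", "A7")]),
    (2, [("DY", "A8"), ("CR", "A9"), ("BL", "A10"), ("BE", "A11")]),
    (1, [("E", "A12"), ("D", "A13"), ("C", "A14"), ("B", "A15")]),
]

def decode(word):
    """Compare the word, truncated to each record's fragment length,
    against every fragment in that record; the first match wins."""
    for length, entries in track:
        for frag, addr in entries:
            if word[:length] == frag:
                return addr
    return None
```

Decoding BABSON fails at lengths six and five and matches BABS at length four, returning A5; BINDER, a word not in the vocabulary, falls all the way through to the fragment B, illustrating the false decoding the text notes for words absent from the Directory.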
It should be noted at this point that although a key word that is
contained in the Directory will invariably be uniquely decoded, it is
not the case that a key word NOT contained in the Directory will be
so detected and thus rejected. Instead, it will usually be decoded at
some point and accessed to the first record in the file, which would
have to indicate, by means of a comparison with the complete key
word contained in that record, that the key word was not in fact in
the file. For example, the key BINDER would be erroneously de-
coded by the fragment B as BABBET.
In order to add a new key word to the Directory, the following
procedure is followed: The word is first decoded in order to locate
that fragment that would cause redundancy. (If no redundant coded
fragment exists then the initial letter of the word is the fragment.)
The redundant key word that is already in the file is then retrieved,
and two cases are to be distinguished.

Case 1: The new word precedes the old word alphabetically.

In this case, the new word assumes the existing coded fragment,
and the old word is assigned a longer fragment such that it is dis-
tinguished from the new word. For example, assume that the name
BLAB is to be assigned to the existing Directory. This would be
decoded to the fragment BL, which has been assigned to the word
BLACK. Since BLAB precedes BLACK alphabetically, the fragment
BL would be assigned to BLAB, and BLACK would be increased to
BLAC in order to distinguish it from BLAB. None of the rest of the
list is affected, the succeeding word BLACKWELL having previously
been distinguished from the smaller fragment of BLACK and hence
not having to be distinguished from any larger fragment.

Case 2: The new word is decoded and is found to follow the
old word alphabetically.

In this case, the old word retains the smaller coded fragment, and
the new word is assigned the minimal length fragment necessary to
distinguish it from the old word.
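The two cases reduce to one small rule: extend past the common prefix of the colliding words, then whichever word sorts first keeps the short fragment. A sketch, with a hypothetical function name:

```python
def assign_fragments(new_word, old_word, old_frag):
    """Return (fragment for new_word, fragment for old_word) after the
    new word collides with old_frag. Case 1: new_word sorts first and
    takes the short fragment; Case 2: old_word keeps it."""
    n = len(old_frag)
    while new_word[:n] == old_word[:n]:
        n += 1                        # first position where the words differ
    if new_word < old_word:           # Case 1
        return old_frag, old_word[:n]
    return new_word[:n], old_frag     # Case 2

# The text's example: BLAB collides with BLACK on the fragment BL.
# BLAB keeps BL, and BLACK is lengthened to BLAC.
```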
Consider now the construction of a three-level tree for these same
15 key words, as shown in Fig. 30. The construction of the tree
starts at the extreme right. Each logical record within the track is
distinguished by double bars. As in the case of the fixed length key-
word tree, the Directory reads alphabetically from left to right; there-
fore, the construction of the tree from right to left would start in
reverse order. As soon as a lower-level track is filled, a reference
entry is made in the appropriate record of the next higher-level track,
and then a new lower-level track is begun. The variable length key-
word tree requires that a count be made to determine how many key
fragments can actually be put onto a given track less reserve space,
since this number can vary both because the coded fragments them-
selves are of variable length, and because a number of the coded
fragments of the same length can be grouped under the same header.
It is assumed, for convenience, in this example, that each track
holds four fragments. This means that a three-level tree requires only
two-way branching at the first two levels in order to decode 15 frag-
ments at the third level. Each track in the tree must be constructed
according to the single-track principles illustrated in Fig. 29; there-
fore, the fragmentation of the 15 keys, taken four at a time starting at
the bottom of the list is as shown in the third column of Table 5,
where each set of four fragments is uniquely truncated independently
from the others. These fragments appear then in Fig. 30 as the
third-level tracks T12, T11, T02, and T01. The last fragment, alpha-
betically, of each third-level track is entered in alphabetical sequence
into a second-level track; thus, the fragment E from track T12 is
entered as the last fragment of the second-level track T10 with an
address reference T12. Similarly, the fragment CART of T11 is entered
into track T10, which completes T10 since the second-level branching
is binary. The last fragment of T10 is E, and is entered into the first-
level (core) record with a track reference T10.
[Fig. 30. Tree with Variable Length Key Word Unique Truncations]
The decoding of this tree is similar to the fixed length key-word
tree except that at the last level the technique associated with single
track, variable length, unique key-word truncation decoding (Fig. 29)
must be applied.
For example, if BAKER is to be decoded, the first-level comparison
shows that:
BAKE < BELM.
Track T00 is accessed and comparison on this track shows that:
BAKE > BAI
BAKE < BELM.
Track T02 is accessed and application of the last level rules yields:
BAKER ≠ BELLS
BAKE ≠ BELM
BA ≠ BE
B = B.
Therefore, the decoding correctly indicates A12 with list length L12.
The same principles of on-line update apply to this tree as for the
fixed tree; however, the procedure must also include the ability to
appropriately generate the new unique fragment, and to regenerate
the total track.

3. The Complete Variable Length Key-word Tree

The retention of the complete key word is the only assurance of
unambiguous decoding, in either of the senses that existing or non-
existing vocabulary words are unambiguously decoded. There are two
ways to encode such a Directory. An uneconomical (in storage), but
less complex (in programming) way is to apply the variable-length
tree construction principles of Fig. 30 to the entire key. A more
economical, but somewhat more complex programming approach is
to code a standard parsing procedure. The actual amount of storage
required for the parse tree depends on the amount of commonality
in the initial letters of the keys. For typical natural language descriptor
vocabularies this commonality is not very great, and hence the storage
requirement is probably greater than the unique truncation method,
but certainly less than retention of the entire key. Figure 31 illus-
trates a standard parse for the 15 names of Table 5.
[Fig. 31. A Parse of the Names in Table 5]
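A standard parse of this kind is, in later terminology, a letter-at-a-time tree (a trie). A minimal sketch, in which each full name is stored once as a path of letters so that words absent from the vocabulary are positively rejected:

```python
def build_parse_tree(words):
    """Store each full name as a path of single letters, sharing common
    initial letters; '$' marks the end of a complete word."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = True
    return root

def lookup(tree, word):
    """Unlike the truncation decoders, a word not in the vocabulary
    fails to trace a complete path and is rejected outright."""
    node = tree
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return "$" in node

tree = build_parse_tree(["BABBET", "BABSON", "BELL", "BELLSON", "BLACK"])
```

Storage grows with the amount of shared prefix: BELL and BELLSON share one four-letter path, which is the commonality the text refers to.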
4. The Randomizing Method

The randomizing technique [13], or hash coding as it is sometimes
called, decodes from natural language or coded input key words to
addresses by means of a mapping procedure, wherein the coded rep-
resentation of the natural language key term, whether it be fixed or
variable length, is transformed, usually by means of a bit mapping
function, into a fixed length key falling within a specified range of
integers. For example, if the specified range of numbers were 0 to
1,023 then the mapping might be to extract 10 bits of the BCD rep-
resentation of the key term, and to use these as the fixed length key
reference. This procedure can, of course, produce duplicate key ref-
erences, just as truncation in the fixed length key-word tree. In fact,
if in the above example there were more than 1,024 keys in the
system, there would have to be some duplication. That is, the map-
ping would be many to one for some of the keys. The amount of dupli-
cation can be minimized, for a given integer range, if the mapping is
as uniform as possible, in the sense that for a given set of keys the
deviation from the expected number of duplicated references is mini-
mum. A straightforward procedure for accomplishing this is to per-
form one or more arithmetic operations upon all or a part of the
BCD coded representation prior to extraction, which will tend to
remove regularities from the code that occur because of the regu-
larity of certain combinations of letters in natural language, and
because of repetition of certain character sequences due to prefixes,
roots, and suffixes. As an example, the computer words containing
the BCD representation of the key word could each be squared, and
then every fourth bit extracted from the squared words; if this did
not produce a sufficient number of bits, then every fourth bit starting
at the second bit could be extracted, and so forth until n bits had been
extracted, where the fixed length key reference were to be within the
range 0 to 2^n - 1.
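The squaring-and-extraction procedure can be sketched as follows. The character coding (plain byte values standing in for BCD) is an illustrative assumption, and a nonempty key is assumed.

```python
def randomize(key, n_bits=10):
    """Square the integer value of the key's character codes, then take
    every fourth bit of the square; if that yields too few bits, restart
    the comb one bit later, and so on, until n_bits have been gathered."""
    squared = int.from_bytes(key.encode("ascii"), "big") ** 2
    width = squared.bit_length()
    bits, start = [], 0
    while len(bits) < n_bits:
        for i in range(start, width, 4):
            bits.append((squared >> i) & 1)
            if len(bits) == n_bits:
                break
        start += 1                    # not enough bits: shift the comb by one
    return sum(b << i for i, b in enumerate(bits))  # 0 .. 2**n_bits - 1

h = randomize("BAKER")                # some integer in the range 0 to 1,023
```

The squaring step serves the purpose described above: it smears the regularities of natural language letter combinations before bits are extracted, so the mapping is closer to uniform over the key vocabulary.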
Figure 32 illustrates the process of Key Directory Decoding by ran-
domizing. In the upper part of the diagram is shown a series of N
memory locations designated Loc 1, Loc 2, Loc x through Loc y
through to Loc n. This sequence of locations represents the compact
range of the fixed key reference; hence, the next higher integer of
log2 N bits would have to be extracted from the randomized input key
terms. The diagram shows an input key represented as Key X, being,
for example, a natural language key term, which is randomized to
the numeric x, within the range of 1 to N. The memory location Loc x
is accessed, and within this location is contained the DASD address,
Ax, of the data corresponding to Key X. Loc 1 through Loc n may
either be core or DASD addresses, depending upon the size of n, the
availability of core, and decoding speed requirements. In other words,
Ax is an indirect memory reference to the variable length Decoder
output record of Key X. Within this record, at location Ax, is found
a header of the data record for Key X that contains a complete cita-
tion of Key X, followed by the list length for Key X, followed, in this
case, by a null field indicating that the data of Key X are contained
within the current record. The structure of data within this variable-
length record depends upon the file structure. If the file is a threaded
list structure then the "Data of Key X" is a link address to the first
record; moreover, if the Key X subfield of the header were fixed in
length, then the entire decoder output (i.e. all the data in the record
at Ax) could replace the indirect address level, and be moved to
Loc x, where, if the total output record contained W words, then Loc
x = Wx. If the file is an Inverted List structure then the "Data of
Key X" would be the variable length inverted list of addresses or
accession numbers.
As shown in the upper part of the diagram, Keys Y1 and Y2 both
randomize to the numeric y, and are indirectly addressed to location
Ay in the DASD via Loc y. At the random access memory location
Ay is contained the header field, which specifies the complete reference
for Key Y1, followed by the list length for Key Y1 (LY1), followed by
a link address, AY2, which indicates that there is another data record
in the file containing a key word that is randomized to the same num-
ber as Key Y1. It can be readily ascertained whether the current
record is the desired one by a comparison of the input key (which in
this case could be Key Y1 or Key Y2) with the citation at the begin-
ning of the header field. If there were no match, i.e., if it were the
case that Key Y2 were being decoded, then the record at location AY2
is accessed and a comparison is made with the first subfield of the
header field, which, in this case, would contain Key Y2. This is fol-
lowed by the list length for Key Y2 which, in this example, is followed
by a null field indicating that there is no further chaining on the y
randomized integer.
If at some time another key were added to this file, say Key Y3,
which also randomized to the integer y, then the chain would be ex-
tended by one more, and the third subfield in the header of AY2 would
contain the link address AY3, which would reference the data record of
Key Y3, and would appropriately contain a header field of the form
Key Y3/LY3/∅.
If the information structures are not so general as those implied by
the functional requirements of Fig. 12, then the generalized random-
izer of Fig. 32 degenerates to more simplified though specialized
forms. For example, if the records of the file contain only a single
key (i.e., it is a single-key query system, such as the bank example of
Chapter I), and if the records are fixed in length, then the File and
the Directory output merge, since the data records could be stored in
the indirect addresses, and the keys would randomize directly to the
records in the file. That is, if the record were of length W, then its
address would be Wx if its key randomized to x. Chaining would
still have to be provided, however, for redundant key decodings.
Figure 33 presents another modification that can be used with
threaded list structured files, which saves storage by making the in-
direct address level the decoder output, and by utilizing the key field
of the File records as a header. In this diagram the File records are
stored at locations Ax, Ay, and AY2, and a key header, which appears
only in the head of list record, is designated by an asterisk. In general
there will appear keys in the record with and without asterisks. Those
with asterisks represent head of list keys; those without asterisks are
not at the head of the list. The subfields of this combined Directory
header and record key field are: the full key citation/the list length/
the link address to the next record in the file containing the same key/
the chain address (Λ if no chain)/the randomized code (if the chain
address is not Λ). The purpose of the randomized code is to identify
the appropriate chain if the complete key citations do not match. That
is, assume that Key Y2 is to be decoded. It randomizes to y, and
thence to Ay via Loc y. An examination of all * key fields will not
reveal a match between Key Y2 and the first subfield; the only way to
then determine the correct chain is to match on y, since the chain is
actually a chain of redundancies on y. The largest storage saving in
this method comes from the common use of the full key citation by
the decoder and the file record.
The advantages of the randomizing technique are that it is rela-
tively easy to program, that it can be fast to decode, particularly if
the size of the key vocabulary equals N, the number of indirect ad-
dress cells, and it is relatively easy to update. Its principal disadvan-
tage is the uncertainty of chain length, since it is usually difficult to
control, particularly if the system is subject to update. In fact some
systems that use randomizing periodically examine the chain lengths,
[Figure: Keys X, Y1, and Y2 are randomized to the integers x and y,
which select the indirect address cells Loc x and Loc y. Location Ax
holds the entry for Key X and its data; location Ay holds the header
Key Y1/LY1/AY2 ahead of its data record; the chained record at AY2
holds Key Y2/LY2/Λ.]

Fig. 32. Key Directory Decoding by Randomizing
[Figure: the same keys randomized as in Fig. 32, but with the Directory
header carried in the key fields of the File records themselves. Ax:
*Key X/Lx/LAx/Λ; Ay: *Key Y1/LY1/LAY1/AY2/y; AY2: *Key Y2/LY2/
LAY2/Λ; each followed by its data record.]

Fig. 33. Key Directory Decoding by Randomizing
TECHNIQUES OF DIRECTORY DECODING 111

and if they are too long will attempt to find another randomizing
formula that will better distribute the codes. The result of longer
chain lengths is that the advantage of decoding speed is lost. The
other disadvantage is that randomized decoding is not capable of
automatically performing range searches, such as AGE .BTW. 21-30,
because every possible value that is in the file would have to be
known a priori, and individually decoded.

5. Formulation of Decoder Memory Requirements and Access Time

Large scale decoders are expected to handle key vocabularies in the
range of several thousand to several million, and it is expected, there-
fore, that all or the major part of the storage requirement for
their tables is in the DASD. Partial storage of the table in core results
only from an extreme response requirement, where it is necessary to
save one random accession. In percentages, this saving, of course,
could be considerable. For example, in the case of an n-level bal-
anced tree, storage of the first level in core saves approximately
(1/n × 100%) of the total access time. The following formulations are
intended to show the relationships among vocabulary size, decoder
parameters such as tree depth and average randomizer chain length,
decoding time and memory requirement. These quantitative factors
when added to the qualitative factors of decoding ambiguity, range
search capability, and programming complexity will enable the de-
signer to assemble the appropriate decoder for his application.
The relevant physical and logical parameters of the formulations are
given in Table 6.

5.1 Tree Decoder Formulations

In order to calculate the memory requirement for a balanced tree
it is necessary to determine first the number of ways each node of the
tree is to branch. This parameter is called the branching factor, m.
In accordance with the principle of packing as many keys per node
as is physically possible, the calculation of m becomes a function of

Table 6
PARAMETERS FOR DECODER MEMORY REQUIREMENTS

                                                     Decoder*
Symbol   Meaning                                  FT    VT    R

Ct       Total Chars./Track                        x     x
Cr       Reserve Chars./Track                      x     x
Cf       Chars./Key Fragment (Av.)                 x     x     x
Ca       Chars./DASD Address                       x     x     x
Cl       Chars./List Length                        x     x     x
Ch       Chars./Header                                   x
Ck       Chars./Key (Av.)                                      x
α        Chain Length (Av.)                                    x
m        Tree Branching Factor                     x     x
Nd       No. of Different Length Key                     x
         Fragments (Av.)
Nk       No. of Keys in Vocabulary                 x     x     x
Nti      No. of Tracks Required at Tree            x     x
         Level i
Nt       No. of Tracks Required                    x     x     x
Md       Disk Memory Requirement (Chars.)          x     x     x
Mc       Core Memory Requirement (Chars.)          x     x     x

* This column indicates the type of decoder to which the parameter relates.
  FT = Fixed Tree. VT = Variable Tree. R = Randomizer.
the total available space in the largest possible physical record of the
DASD (consistent with core buffer size) and the size of the Key
(fragment) / Address/List length decoding triplet. For convenience
it is assumed that a physical record is one track; for any other unit of
storage the reader can substitute this unit wherever the word track
appears in these analyses.
Since it is also necessary to deal in integral units of m, Md, and Mc,
and it is desirable to do so in units of Nt and Nti, a notation is used
to designate integer rounding as follows: The rational number x may
be represented as

x = INT + R,

where INT is an integer and R is a fractional remainder. Then we
define:

INT_[x] = INT
INT+[x] = INT + 1  if R ≠ 0
        = INT      if R = 0.

If the total available space in a track is considered to be Ct − Cr,
then for the fixed tree,

m = INT_[(Ct − Cr)/(Cf + Ca + Cl)],     (1)

and for the variable tree,

m = INT_[(Ct − Cr − NdCh)/(Cf + Ca + Cl)].     (2)
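Formulas (1) and (2) are direct to evaluate; the following minimal sketch (the helper names are hypothetical) applies them with floor rounding for INT_.

```python
import math

def int_minus(x):
    # INT_[x]: drop any fractional remainder.
    return math.floor(x)

def m_fixed(Ct, Cr, Cf, Ca, Cl):
    # Formula (1): branching factor of the fixed tree.
    return int_minus((Ct - Cr) / (Cf + Ca + Cl))

def m_variable(Ct, Cr, Cf, Ca, Cl, Nd, Ch):
    # Formula (2): branching factor of the variable tree.
    return int_minus((Ct - Cr - Nd * Ch) / (Cf + Ca + Cl))
```

With Ct = 3000, Cr = 300, Ca = 4, and Cl = 2, m_fixed gives 270 for Cf = 4 and 192 for Cf = 8, the fixed-tree branching factors that appear later in Table 7.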
Since the bottom level of any tree must contain all keys of the
vocabulary (Nk) except for a remainder that partially fills one track
at the next higher level (see BABB in track Too of Fig. 28), the
number of tracks required at the output of an n-level tree would be

Ntn = INT_[Nk/m].     (3)

For the next higher level of the tree, each track at level n − 1 sup-
ports m tracks at level n, or m² key fragments at level n. Therefore,
the number of tracks required at level n − 1 to support a vocabulary
of Nk keys is the integer ratio of Nk to m²; furthermore it is INT+ be-
cause all tracks at level n must be supported in addition to the residue
of keys resulting from the INT_ at level n. It follows from this that
all levels up to and including the first must also be INT+, and each
level requires one mth the number of tracks of its successor. There-
fore, the following formulas give the track requirements for each tree
level, for the total tree, and the DASD memory requirement in
characters.

Ntn = INT_[Nk/m]
Nti = INT+[Nk/m^(n−i+1)]   for i = 1 ... n − 1     (4)

Nt = Σ (i = 1 to n) Nti     (5)

Md = Ct Nt     (6)

If the top level of the tree is stored in core, then,

Mc = Nt2 (Cf + Ca)
Md = Ct Σ (i = 2 to n) Nti.     (7)

In the special case of a two-level tree, where the first level is in
core, it might be desirable to reduce the core requirement by storing
every key in the DASD (i.e., at the output level of the tree), in which
case formulas (7) would read:

Mc = Nt2 (Cf + Ca)
Nt2 = INT+[Nk/m].     (7a)
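The track counts of formulas (4) through (7) can be computed mechanically; a minimal sketch, with hypothetical function names:

```python
import math

def tracks_per_level(Nk, m, n):
    # Formula (4): INT_ at the bottom (output) level of the tree,
    # INT+ at every higher level.  Returns [Nt1, ..., Ntn].
    levels = [math.floor(Nk / m)]                 # level n
    for i in range(n - 1, 0, -1):                 # levels n-1 down to 1
        levels.insert(0, math.ceil(Nk / m ** (n - i + 1)))
    return levels

def disk_chars(Nk, m, n, Ct, top_level_in_core=False):
    # Formulas (5)-(7): Md = Ct times the sum of the track counts,
    # dropping level 1 when it is held in core.
    Nt = tracks_per_level(Nk, m, n)
    start = 1 if top_level_in_core else 0
    return Ct * sum(Nt[start:])
```

Applied to the example of Fig. 34 (m = 4, Nk = 22, n = 3), tracks_per_level returns [1, 2, 5]: one track at the first level, two at the second, and five at the third.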

If one considers the values of Nti in formula (4) for m > 10 and
Nk up to 100,000, it may be seen that

Nti << Ntn   for i = 1 ... n − 1,

and therefore a reasonable approximation for the tree requirement,
of any number of levels, is

Nt ≈ Ntn = INT_[Nk/m].     (8)

Figure 34 illustrates the application of formula (4) to a three-level
tree where m = 4 and Nk = 22. Each horizontal bar represents a
track; each single vertical stroke through a bar separates logical re-
cords (i.e., a key fragment triplet), and the hash marks represent track
reserve space. The formulations of Fig. 34 indicate that Nt3 is the
integer part of the ratio 22/4, which is 5; that Nt2 is the integer part
plus 1 of the ratio 22/4² = 22/16, which is 2; and that Nt1 is the
integer part plus 1 of the ratio 22/4³ = 22/64, which is 1. By con-
structing the tree in accordance with the procedure outlined in Fig.
28, it is necessary to completely fill five tracks, four keys per track,
with the keys 1 to 20. In order to support these five tracks at the
second level, one complete track is needed, as shown along the right-
hand path of the tree, and one record of a second track at the second
level is needed to reference the fifth third-level track (i.e., the track
containing keys 17 to 20). The remaining two keys, 21 and 22, are
placed in two of the remaining three records of the second track on
level 2. The reserve space on this track is therefore equal to the
original reserve plus one unused record. At the first level it requires
only one track to support the two tracks of level 2. Thus, as indicated
by the formulations, one track is required at the first level, two tracks
at the second level, and five at the third level in order to accommodate
22 keys, with 4 keys per track.
If the first level were stored in core, then only two logical records,
each with the Key fragment/Address double, would be required (i.e.,
Mc = 2 (Cf + Ca)).
Somewhat more typical examples are the following:

1) Three-level tree.

Nk = 20,000
m = 100
Nt3 = INT_[20,000/100] = 200
Nt2 = INT+[20,000/100²] = 2
Nt1 = INT+[20,000/100³] = 1
Nt = 203
[Figure: a three-level tree with m = 4 and Nk = 22; one track at
level 1, two tracks at level 2, and five tracks at level 3 holding the
keys 1 through 22, with Nt3 = INT_[22/4] = 5, Nt2 = INT+[22/4²] = 2,
and Nt1 = INT+[22/4³] = 1.]

Fig. 34. Example of a Three-Level Tree



Note that since m is well above 10, the approximation Nt ≈ Nt3 =
200 is quite good.

2) Two-level tree.

Nk = 20,000
m = 100
Nt2 = INT_[20,000/100] = 200
Nt1 = INT+[20,000/100²] = 2

5.2 Randomizer Formulations

The memory requirement for randomized decoders must be spe-
cifically tailored to the randomizer; however, for the generalized ran-
domizers of Figs. 32 and 33 the formulations are given as:

(Fig. 32)  Md = (Nk/α) Ca + Nk (Ck + Cl + 1 + [(α − 1)/α] Ca + Ca)
              = Nk (Ck + Cl + 2Ca + 1)     (9)

(Fig. 33)  Md = (Nk/α) Ca + Nk (Cl + 2 + [(α − 1)/α] [Ca + Cf])
              = Nk (Cf [1 − 1/α] + Cl + Ca + 2).     (10)

In formula (9) for Fig. 32, the indirect addressing level requires
Nk/α locations, each with a DASD address. That is, if the key vo-
cabulary, Nk, were 20,000, and 10 bits were extracted from the key
by the randomizer, this would produce 1,024 codes, and the average
number of redundant decodings (i.e., the average chain length, α)
would be 20,000/1,024 ≈ 20. Therefore, the total number of char-
acters required for the indirect addressing level is NkCa/α. The out-
put of the decoder requires, for each of the Nk keys, the full citation
Ck, the list length, Cl, a chain address, Ca, if there is a chain, and the
decoder output reference address Ca. One could economize by adding
a character to the output record that would indicate whether a chain
address were present or not. Thus, one character is added to the
count, and the chain address Ca is assumed to be required only α − 1
out of α of its occurrences. When the factors are multiplied and
collected, formula (9) results which, curiously, is independent of α.
It is left for the reader to determine why this happens.
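The collection of terms can be checked numerically with exact rational arithmetic; the sketch below evaluates the uncollected form of (9) for several values of α and confirms that the result never depends on α. The parameter values are the ones assumed later in Table 7.

```python
from fractions import Fraction

def md_fig32(Nk, Ck, Cl, Ca, alpha):
    # Formula (9) before the terms are collected: Nk/alpha indirect
    # addresses of Ca characters each, plus Nk output records carrying
    # the citation, list length, chain-signal character, a chain address
    # (alpha - 1) out of alpha of the time, and the output reference.
    a = Fraction(alpha)
    return (Nk / a) * Ca + Nk * (Ck + Cl + 1 + ((a - 1) / a) * Ca + Ca)

# The collected form, Nk(Ck + Cl + 2Ca + 1), contains no alpha at all.
Nk, Ck, Cl, Ca = 20000, 12, 2, 4
collected = Nk * (Ck + Cl + 2 * Ca + 1)
for alpha in ("1.1", "2", "20"):
    assert md_fig32(Nk, Ck, Cl, Ca, alpha) == collected
```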

In formula (10) (for the randomizer of Fig. 33), the indirect
address level again requires NkCa/α characters, but the total key cita-
tion is not required as part of the decoder output, since it uses the
key in the record. There is required a list length, Cl, an asterisk,
a chain signal character, α − 1 out of α occurrences of a chain ad-
dress, Ca, and the key fragment, Cf (the randomized key, x, y, etc.).
In this case the expansion and collection of terms does not eliminate
α, and formula (10) results, which, when compared with formula (9),
is less by Nk (Ck − Cf [1 − 1/α] + Ca − 1) characters.
If the indirect addressing level were to be placed in core, then the
formulas would appear as follows:

(Fig. 32)  Mc = (Nk/α) Ca
           Md = Nk (Ck + Cl + Ca [2 − 1/α] + 1)     (9a)

(Fig. 33)  Mc = (Nk/α) Ca
           Md = Nk (Cl + [Cf + Ca] [1 − 1/α] + 2).     (10a)

It should be noted, however, that both of these formulations in-
clude the resolution of ambiguous decoding, whereas the tree for-
mulations of (6) and (7) do not. Therefore, if the part of (9) and
(10) that is responsible for resolving ambiguity were removed from
the formulas, in order to enable a direct comparison, they would
appear as:

(Fig. 32)  Md = Nk (Cl + Cf + Ca [1 + 1/α])     (9b)

           Resolution Overhead = Nk (Ck − Cf + 1 + [1 − 1/α] Ca)     (9c)

(Fig. 33)  Md = Nk (Cl + Ca/α + 1)     (10b)

           Resolution Overhead = Nk ([1 − 1/α] [Ca + Cf] + 1).     (10c)

Generally, α is made close to unity by extracting log₂ Nk bits from
the key in order to generate as many locations in the indirect address
level as there are keys. Then, the better the randomizer, the closer
α is to 1. Typically, a good randomizer will produce an α of around

1.1. The reason for desiring low α is fast decoding, as will be seen
in the discussion of access times, and since Md is independent of α
nothing is lost in memory, unless the indirect level is stored in core,
in which case, if memory space is to be optimized, a larger α is more
desirable.
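The relation between the number of extracted bits and α can be seen in a small simulation; the hash function and the key set below are assumptions for the sketch, not a randomizer from the text.

```python
import hashlib

def extract_bits(key, bits):
    # Assumed randomizer for the sketch: hash the key and keep the
    # low-order `bits` bits as the randomized code.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)

def average_chain_length(keys, bits):
    # alpha: average number of keys per occupied indirect-address cell.
    cells = {}
    for k in keys:
        cells.setdefault(extract_bits(k, bits), []).append(k)
    return len(keys) / len(cells)

keys = ["KEY%05d" % i for i in range(20000)]
alpha_10 = average_chain_length(keys, 10)   # 1,024 cells: close to 20
alpha_15 = average_chain_length(keys, 15)   # ~log2(Nk) bits: close to 1
```

Extracting 10 bits for 20,000 keys reproduces the α ≈ 20 of the worked example above; widening the extraction toward log₂ Nk bits drives α toward 1.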

5.3 Memory Requirement Comparisons

Table 7 presents a comparison of the memory requirement for the
three basic decoding techniques: the balanced tree with fixed length
key words, the tree with variable length key words, and the two
randomizers. Two cases are taken for each of the tree technique com-
parisons, one in which the average (or fixed length truncation) num-
ber of characters per key fragment is 4, which is near to the minimum
number that this parameter could assume, and a second in which the
average number of characters is 8, which is a more typical fragment
length. These parameter values are indicated in the column labeled
Cf. For each value of Cf a tree is constructed with three different
values of Nk, the number of keys in the system; these values of Nk are
1,000, 10,000, and 30,000, as indicated in the third column.
The corresponding key length parameters for the Randomizer of
Fig. 32 are Cf and Ck, formulas (9b) and (9c), and for the Ran-
domizer of Fig. 33, Cf. The value selected for Cf is 2 because this
would provide a 16-bit extraction, or 2^16 indirect addressing cells,
which is adequate for the range of Nk's. The value selected for Ck, the
total average key length, is 12, and this parameter is only relevant to
the first Randomizer. In the fourth column, labeled m, is calculated
the number of key references of a tree that can be placed on one
track. These are computed from formulas (1) and (2). At the
bottom of the table, constant parameter values are given that char-
acterize a typical system. As shown there the track capacity (Ct) is
taken as 3,000; this corresponds roughly to the capacity of an IBM
1301 disk track of 2,800 characters. The reserve space (Cr) left on
each track is assumed to be 10%, or 300 characters. A list length
(Cl) can be designated in two characters,* or 16 bits. An address
reference (Ca) is assumed to be 4 characters or 32 bits. A header
(Ch) in the variable length key-word tree can be designated in 2

* It is not considered necessary to distinguish characters and bytes in this discussion.


Table 7
COMPARISON OF MEMORY REQUIREMENTS FOR DIRECTORY DECODING TECHNIQUES

                                           Mc (Chars.)       Md (000 Chars.)   Resolution  Total (000)
Directory Decoding          Nk            2 Level  3 Level   2 Level  3 Level  Overhead    (2 Level Tree
Technique             Cf   (000)    m      Tree     Tree      Tree     Tree    (000)       + Res. O'head)

Tree with Fixed        4     1     270      32        8         12       12      12 [3]         24
Length Key Words            10             304        8        114      114     120            234
                            30             896        8        336      336     360            696

                       8     1     192      72       12         18       18      12             30
                            10             636       12        159      159     120            279
                            30            1884       12        471      471     360            831

Tree with Variable     4     1     150      56        8         21       21       0             21
Length Key Words            10             536        8        201      201       0            201
                            30            1600       16        600      606       0            600

                       8     1     122     108       12         27       27       0             27
                            10             984       12        246      246       0            246
                            30            2952       36        738      744       0            738

Randomizer (Fig. 32)   2     1                                 4.4 [1]          11.4 [4]       15.8
                            10                                44.0             114.0          158.0
                            30                               132.0             342.0          474.0

Randomizer (Fig. 33)   2     1                                 3.4 [2]           1.6 [5]        5.0
                            10                                34.0              16.0           50.0
                            30                               102.0              48.0          150.0

Ct = 3000    Ca = 4    α = 1.1     [1] Formula (9b)    [4] Formula (9c)
Cr = 300     Ch = 2    Ck = 12     [2] Formula (10b)   [5] Formula (10c)
Cl = 2       Nd = 4                [3] Either excess retrievals or full key citation

characters, and the expected number of different length unique coded
fragments (Nd) in the variable length key-word tree is assumed to
be 4. A chain length (α) of 1.1 is used for the randomizing tech-
nique, which is consistent with the values of Cf and Nk. The char-
acter requirement in core for a two- and three-level tree is
given in columns 5 and 6, respectively; the character requirement
(in thousands) in the disk for a two- and three-level tree is given in
columns 7 and 8, respectively. It is also assumed that both the in-
direct addressing level and the output level of the randomizers are
stored in disk; therefore, there is no requirement in core, and the
only relevant statistic in the randomizer disk calculation is given under
the two-level tree. Since formula (7), upon which the tree calcula-
tions are based, does not include the resolution of ambiguity in the
fixed tree, whereas formulas (9) and (10) for the randomizers do,
the disk calculations have been divided into two parts. One is for the
decoder without the resolution of ambiguous decoding (columns 7
and 8); the other indicates the ambiguous decoding resolution over-
head (column 9). For the fixed tree, the redundancy may either be
resolved by examining the file records for the complete key, in which
case the price is excess retrievals from the File, or the decoder output
can contain the complete key of 12 characters, in addition to the
fragment. The variable tree has no resolution overhead because it is,
by design, an unambiguous decoder. In the case of the randomizers,
the part of formulas (9) and (10) responsible for the resolution can
be isolated, and this is exhibited as formulas (9b), (9c), (10b), and
(10c). The latter are therefore the formulas used for the respective
randomizer calculations, as indicated in the footnotes of Table 7.
It should be emphasized before proceeding to analyze this chart
that it should not be used to draw general conclusions about the rela-
tive merits of these techniques, because the computations are based
upon a particular set of parameters, and another set might produce
somewhat different comparisons. Hence, the designer should con-
struct such a chart as this, using parameters relevant to his own
application in order to make meaningful comparisons for a given
application.
A number of observations can be made from this chart. First, as
expected, since m is considerably greater than 10, the disk require-
ment for the two- versus the three-level trees in each category is
nearly identical, diverging only slightly for the largest vocabulary.
Second, there is a significant difference in core requirement be-
tween the two- and three-level trees, and this presents a time/space
trade-off. The two-level tree decodes in one random accession but
requires greater core storage, while the three-level tree decodes in two
random accessions and requires minimal core storage. Of course, the
absolute values are probably more important in this case than the
relative values; for the case Nk = 1,000, the choice would clearly be
a two-level tree, since the difference between 32 and 8 characters is
insignificant. On the other hand, the 896, 1,884, 1,600 or 2,952
characters of core required by-the 30,000 key vocabulary may be a
significant allocation, particularly if it must always reside in core.
Third, the resolution of ambiguity in the fixed tree is very expen-
sive. Consequently, these decoders normally resolve ambiguity by
sacrificing accessions; however, the total memory requirement includ-
ing resolution overhead, is not too much more than that for the vari-
able tree, and since the former is considerably easier to program, if
the designer elected to accept the additional allocation rather than
the excess retrievals, the fixed tree, in this analysis, would still be
superior to the variable tree, because of programming simplicity.
Fourth, the randomizer of Fig. 33 is clearly superior by a factor
of 3 to the Fig. 32 randomizer, and there the trade-off consideration
is flexibility. The randomizer of Fig. 33 is specifically tailored to a
threaded list or Multilist file organization, while the other is com-
pletely general.
Fifth, there are two kinds of comparisons to be made between the
trees and the randomizers. If one accepts the excess retrievals of the
fixed tree in order to resolve ambiguity, then the total disk storage
requirement for the tree is given under M" in the table, and for the
randomizers under Total (last column) since the randomizer has
the resolution built-in. In this case there is little difference between the
tree and randomizer of Fig. 32, but a significant difference between
the tree and randomizer of Fig. 33. On the other hand, if the resolu-
tion of the fixed tree output were by citation of the full key, then there
would be somewhat of a difference between the tree and the Fig. 32
randomizer, and a very significant difference between the tree and
Fig. 33 randomizer.
In summary, the various quantitative trade-offs in selection of a
Directory Decoder involve decoding speed, core and DASD memory
requirement, and programming complexity, while the qualitative
factors are range search capability and the contingency on threaded
list file organization.

Table 8 presents gross comparisons extracted from the analysis
of Table 7, with respect to the above mentioned quantitative and
qualitative trade-off factors. The table includes both the two- and
three-level trees (but only for Cf = 4, since this is most comparable to
the randomizers' Cf = 2). Each parameter or factor, except Range
Search Capability is ranked by a number, where the more optimal the
parameter, the lower the number. That is, in decoding speed, the
two-level trees are fastest, and the randomizer of Fig. 32 is the slow-
est. Again, the reader is cautioned against categoric generalizations
from this chart, because the ratings are based entirely upon the com-
putational results of Table 7, which in turn were based upon selected
operational values.

6. Decoding Speed

Formulation for the speed of a decoder, once designed, is relatively
easy. It basically consists of enumerating the sequence of head posi-
tionings (if on a movable head DASD), track rotation latencies, track
reads, core processing, and revolutions lost due to core processing.
Table 9 presents such a sequence chart for a three-level tree with the
first level in core. As an example, it is assumed that the decoder has
been implemented on an IBM 2311 Disk Pack, and that each set of
third-level nodes emanating from a given second-level node is stored
in the same cylinder as the second-level node, so that there is no head
positioning required between the second and third levels. This con-
dition would, of course, be satisfied if the tree were stored in a fixed
head disk or drum, and, moreover, the total access time would be
reduced by the head positioning time. The total average decoding
time is

T3 = P + 3.5 R,     (11)

where P is the average head positioning time, and R is the disk revolu-
tion time. This formulation can be generalized for an n-level tree
with first level in core to

Tn = P + 1.5 R + 2 (n − 2) R
   = P + (2n − 2.5) R,   for n > 1.     (12)
This assumes that no further head positioning is required after the
Table 8
COMPARISON AND TRADE-OFFS AMONG THE
DIRECTORY DECODERS ANALYZED IN TABLE 7

                             Core/DASD Memory Requirement
Directory                   Without Decoder       With Decoder          Decoding  Range Search  Programming
Decoding Techniques         Ambiguity Resolution  Ambiguity Resolution  Speed     Capability    Ease

Two-Level Fixed Tree          4/2*                  NA/4                   1         YES            1
(Cf = 4)
Three-Level Fixed Tree        2/2                   NA/4                   3         YES            1
(Cf = 4)
Two-Level Variable Tree       5/4                   NA/3                   1         NO             3
(Cf = 4)
Three-Level Variable Tree     3/4                   NA/3                   3         NO             1
(Cf = 4)
Randomizer (Fig. 32)          1/3                   NA/2                   4         NO             1
Randomizer (Fig. 33)          1/1                   NA/1                   2         NO             2

* Ranking: lower numbers indicate more optimal parameter size.
  NA means Not Applicable.

Table 9
DECODING TIME FOR THREE-LEVEL TREE

                             Tree                      Typical Value (Av.)
Decoder Process              Level   Time (Symbolic)   for IBM 2311 Disk Pack

Record Process in core         1          0                  0
Head Position                             P                 85 msec.
Latency                                  ½R                 12.5
Track Read                                R                 25
Record Process                 2          0                  0
Lost Revolution                           R                 25
Track Read                                R                 25
Record Process                 3          0                  0

Total                                 P + 3.5 R            172.5 msec.

cylinder for the second level is reached. Variants on this formulation
for other situations are as follows:

1) n-level tree, first level in core, fixed head DASD.
   Tn = (2n − 2.5) R,   n > 1     (12a)

2) n-level tree, first level in DASD, movable head DASD and
successive levels in same cylinder.
   Tn = P + (2n − 0.5) R,   n > 0     (12b)

3) n-level tree, first level in DASD, movable head DASD and
successive levels not in same cylinder.
   Tn = n (P + 1.5 R),   n > 0     (12c)
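Formulas (12) through (12c) can be packaged as follows, using the IBM 2311 averages of Table 9 (P = 85 msec., R = 25 msec.) as an assumed example.

```python
def t_core_movable(n, P, R):
    # Formula (12): first level in core, movable head DASD, levels 2
    # through n stored in the same cylinder.
    return P + (2 * n - 2.5) * R

def t_core_fixed(n, R):
    # Formula (12a): first level in core, fixed head DASD.
    return (2 * n - 2.5) * R

def t_dasd_same_cyl(n, P, R):
    # Formula (12b): first level in DASD, successive levels in the
    # same cylinder.
    return P + (2 * n - 0.5) * R

def t_dasd_diff_cyl(n, P, R):
    # Formula (12c): first level in DASD, each level requiring its own
    # head positioning.
    return n * (P + 1.5 * R)

P, R = 85.0, 25.0              # IBM 2311 averages assumed in Table 9 (msec.)
t3 = t_core_movable(3, P, R)   # 172.5 msec., the Table 9 total
```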
Table 10 presents the sequence chart for the randomizer of Fig.
32, which is identical to formula (12) for n = 3 or (12b) for n = 2,
and Table 11 presents the sequence chart for the randomizer of Fig.
33, which is comparable to the single-level tree, formula (12b) for
n = 1, or the two-level tree with first level in core, formula (12) for
n = 2.
These formulations will be used at the conclusion of Chapter VIII,
where total file response for retrievals and updates will be computed.
It should also be noted that a complete revolution, R, must be
added to all random access writes in which the DASD does not include
hardware verification with an immediate read after write. Appendix
C indicates for each of the equipments described, which has this hard-
ware facility. Those that do not, like the IBM 2311 Disk Pack, should
be verified by a programmed read (on the next rotation) of the writ-
ten data. Therefore, in the case of tree updates, R should be added

to the formulations (11) through (12c) for devices without hardware
verification.

Table 10
DECODING TIME FOR THE RANDOMIZER (Fig. 32)

Decoder Process    Level         Time (Symbolic)   Typical Value (Av.)

Randomize Key      Randomizer         0                  0
Head Position                         P                 85 msec.
Latency                              ½R                 12.5
Track Read                            R                 25
Record Process     Indirect           0                  0
                   Address
Lost Revolution                       R                 25
Track Read                            R                 25
Record Process     Output             0                  0

Total                             P + 3.5 R            172.5 msec.

Table 11
DECODING TIME FOR THE RANDOMIZER (Fig. 33)

Decoder Process    Level         Time              Typical Value

Randomize Key      Randomizer         0                  0
Head Position                         P                 85 msec.
Latency                              ½R                 12.5
Track Read                            R                 25
Record Process     Indirect           0                  0
                   Address

Total                             P + 1.5 R            122.5


CHAPTER VII

TECHNIQUES OF SEARCH FILE ORGANIZATION
IN CHAPTER V the two-step retrieval process of Decoding and File
search was described, and it was indicated that these two processes
were relatively independent. As a further demonstration of this in-
dependence Chapter VI presented detailed descriptions of these de-
coders with minimal reference to or contingencies upon the data file
organization. Only in one case, the randomizer associated with Fig.
33, was there a restriction placed on the file structure design. Thus, the
designer is relatively free to select a decoder, based almost entirely
upon the criteria outlined in the last chapter. Similarly, the selection
of a data file structure is independent of the decoder.
In this chapter, the multiple threaded list (Multilist) and the In-
verted List file organizations will be contrasted, and then a structural
continuum between these extremes, called the Multilist with controlled
list length, will be described. These three system types are represented
in Fig. 27. From these standard list structures, a variant class, labeled
Cellular file structures in Fig. 27, will be discussed. All of the struc-
tures are described first by a series of schematic diagrams in Figs. 35
through 39. The same example containing four key lists is used
throughout so that the progression from the threaded list to Inverted
List structure is made evident, and the parallel relationship of the
cellular systems is also manifest. Then procedures for generating these
structures are presented, and finally timing formulations developed.


1. The Multilist File Organization

In Fig. 35 the output of the Key Directory is a fixed length record
containing the triplet Key (fragment)/Head of List Address/List
Length, where either the entire key or the fragment is used, depending
upon whether decoding ambiguity is resolved at the Directory output
or in the data record. The head of list address is represented sym-
bolically, in these diagrams as An, where n is a numeric representing
a mass memory address. Records within the File area are repre-
sented as dots or beads along the thread emanating from the output
level of the Key Directory. Thus, the W list begins at address A6 and
contains seven records. As indicated in the diagram, the second record
on the W list also contains key X because of the list intersection. This
record has not been labeled with an address in the diagram because
it is not a head of list record, since the head of the X list is at A3;
however, there would be a link address pointing to this record con-
tained within the record at A6, associated with the citation of key W
within that record, as shown in the Record Format of Fig. 18.
A link address pointing to the second record on the W list (as well
as on the X list) would also be contained within the record in loca-
tion A3, associated with the citation of key X within that record.
If, then, a query conjunction WY were to be searched in this file, the
Key Directory Decoder would decode both Wand Y and examine
the respective list lengths at the output of the Directory. Since the Y
list is shorter, containing four records as opposed to seven for the
W list, the Y list would be searched, thus requiring four random ac-
cessions within the File area. Each accessed record would be ex-
amined in core for the joint occurrence of W, and, in the illustration
of Fig. 35, would be satisfied only once, by the record in location A9.
If two link addresses were stored with each key in a record, one
pointing to the subsequent record on the list and the other to the
preceding record, then the list structure is said to be bidirectional.
This permits a general network scan of the file, including cycles, while
the unidirectional threaded list cannot trace cycles except by linking
the last record on a list to the first, which is not very efficient. Bidirec-
tional listing also has utility in increasing the speed of updates and in
space maintenance, but in all of these regards (cycles, updates, space
maintenance) the Inverted List system is also adequate. Bidirectional
lists are normally not required in the typical IS & R application, and
since bidirectional linking doubles the list storage overhead, the
Inverted List, which has the same overhead as unidirectional lists, is
generally preferred where their particular capabilities are required.
[Figure: Key Directory output level, holding Key / Head of List Address / List Length triplets, with record lists threaded through Cells 0-3 of the list structured file in random access memory.]

Fig. 35. The Multilist File Organization


TECHNIQUES OF SEARCH FILE ORGANIZATION 129

The greatest disadvantage of the multiple threaded list is that in
order to respond to a conjunction of terms, like WY as shown above,
it must access from the DASD, and transfer to core all records on the
shortest list for examination in the processor, even though the inter-
section of these lists, which satisfies the query, may be much smaller.
In the case of the above example, the system retrieval to DASD acces-
sion ratio was 1 to 4. In practice, if list lengths are several hundred
(or thousand) long and conjunctions contain many keys, this ratio
may be on the order of a few hundredths (or thousandths), which is
highly inefficient. A reflection of this inefficiency is also found in the
low quality presearch retrieval statistics. The shortest list length in a
query conjunction, or the sum of such list lengths for a logical sum
of products, is the closest upper bound on the number of retrievals
that a Multilist organization can provide, prior to the File search.
The other structures to be described yield considerably better pre-
search statistics.
The principal advantages of the Multilist file organization are pro-
gramming simplicity and update flexibility.

2. The Inverted List File Organization

Figure 36 illustrates the Inverted List file organization. In this
system all linkages have been removed from the File area and appear
instead at the output of the Key Directory. This will usually result
in a considerably larger Directory than the Multilist system although
the total memory usage is no greater, because these same link ad-
dresses no longer appear within the file record. In fact, the Inverted
List system may require less storage than the Multilist system if the
key names or codes are not individually cited within the record itself.
This can be done if it is not necessary to print the keys as output,
because any logical expression of keys can be decoded directly by the
Key Directory itself.
These lists are variable length records that must be maintained in
a monotonic sequence for efficient logical manipulation. Both the
maintenance of the lists as variable length records, which can be
quite diverse in size, and their maintenance in sequence, contribute
to certain programming complexities that are absent from the Multi-
list organization, although they buy much in performance. Logical
conjunction of nonnegated terms is accomplished by list intersection
130 FILE STRUCTURES FOR ON-LINE SYSTEMS

within the Directory; disjunction of nonnegated terms is accomplished
by list merge within the Directory, and the conjunction of a non-
negated key with a negated key is accomplished by removing from
the nonnegated key's list of addresses, all those that appear on the
negated key's list. The retrieval of an isolated negated key term list
is usually tantamount to a serial search of the entire file, and hence
would be performed as such if ever required, since key list lengths are
normally less than a few percent of the entire file size. Most real-time
list structured systems would disallow such a query, although it could,
if required, be relegated to the batched mode of retrieval. Each of
the above described logical functions is greatly expedited if the list
addresses are stored in a monotonic sequence, since only a single pass
through the two lists is required.
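Because the lists are kept in ascending address order, each of these logical operations is a single co-sequential pass over the two lists. A minimal sketch, with integer addresses standing in for DASD addresses (the function names are mine):

```python
def intersect(a, b):
    # AND of two ascending address lists: one pass, no random accessions.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def merge(a, b):
    # OR: ascending merge with duplicate addresses suppressed.
    out, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j >= len(b) or (i < len(a) and a[i] < b[j]):
            v, i = a[i], i + 1
        elif i >= len(a) or b[j] < a[i]:
            v, j = b[j], j + 1
        else:
            v, i, j = a[i], i + 1, j + 1
        out.append(v)
    return out

def subtract(a, b):
    # A AND NOT B: drop from a's list every address on b's list.
    out, j = [], 0
    for x in a:
        while j < len(b) and b[j] < x:
            j += 1
        if j >= len(b) or b[j] != x:
            out.append(x)
    return out
```

Each function consumes both lists exactly once, which is the efficiency argument for maintaining the lists in monotonic sequence.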

[Figure: the inverted address lists for keys W, X, Y, and Z at the Directory output, pointing directly to records in Cells 0-3 of the file; no link addresses remain in the File area.]

Fig. 36. The Inverted List File Organization



Since the resulting addresses from the logical search of the inverted
lists are precisely the retrievals specified by the key logic part of intra-
record processing, the ratio of system retrievals to DASD accessions
is higher than for the Multilist organization, and if no qualifier logic
is involved then the ratio is unity. Therefore, the Inverted List system
has a considerably higher quality presearch statistic than the Multilist
or any of the other partially inverted systems to be described. In
Fig. 36, the WY query intersects at A9 in the inverted lists; therefore,
the presearch statistic, which can be returned to the user prior to
accessing the record from the file, is one, and the number of random
accessions from the file is one.
An efficient way to program the Inverted List logic processing, for
systems in which the list lengths may vary greatly, is illustrated in
Fig. 40. Assume that the variable length key lists A, B, and C are
to be logically processed. These are the inverted lists stored in the
DASD, which contain the sequenced addresses of all file records that
contain the keys A, B, and C, respectively. Each list may be regarded
as a logical record that extends, in general, over a series of physical
records.
Six buffers are allocated in the central processor, two input pairs
(I1, I2 and I3, I4) and one output pair (O1, O2). The solid line
rectangle in Fig. 40 designates the processor; all switches are imple-
mented by programming in the processor, and the broken lines rep-
resent disk to core or core to disk data transmission. The physical
records from two input files are double buffered into the input buffers,
and processed alternately by the Logic Processing, after which the
resultant addresses are double buffered out to one of two output files
(F1, F2) on disk. The sequencing for the processing of the query
expression f(A,B,C) is shown in triangles on the figure. First, input
buffers I1 and I2 are loaded with two records from list A and I3 and
I4 from B (∆1). As I2 and I4 are loading, I1 and I3 are processed,
and the results transmitted to the output buffer O1 and, if necessary,
O2. For example, if A and B are intersected, then only O1 is required
for I1 and I3, but if A and B are merged, then O2 might also be re-
quired. The program successively reads records from lists A and B,
and writes from output buffers O1 and O2 onto DASD file F1 (∆2).
When A and/or B are completely processed, file F1 is switched and
becomes an input to buffers I1, I2 (∆3), list C becomes the other
input to I3, I4 (∆4), and the output buffers O1, O2 are switched to
DASD file F2 (∆5). This procedure can now be repeated for as
many keys as desired and of any arbitrary length. The principal ad-
vantage of the Inverted List over the Multilist organization is that the
list intersection or merge process is considerably more efficient, since
it is performed on compact, sequenced lists rather than requiring a
random accession for each record on the list. Thus, the exact number
of addresses that respond to the entire Boolean expression of keys
is transmitted from the Directory to the File, although each record
must be randomly accessed in order to examine the qualifier condi-
tions. Furthermore, the number of addresses that is produced by the
Inverted List processing is a more accurate retrieval statistic that can
be returned to the remote terminal prior to the actual search in the
file, since the user may want to make a query modification based upon
the size of this number. Disadvantages of the Inverted List system are
that a working or staging area is required in order to perform the
logic processing (intersection, merge, deletion), and the list records
being variable length should include some reserve if real-time update
is to be allowed.
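The staged, two-scratch-file evaluation can be mimicked in outline. This is a simplified sketch: the double buffering and I/O overlap are elided, a "physical record" is modeled as a fixed-size chunk of an ascending address list, and all names are illustrative rather than taken from the text:

```python
# Staged evaluation of a multi-key expression over chunked inverted
# lists, in the spirit of the ping-pong buffer scheme of Fig. 40.
CHUNK = 4   # hypothetical physical record size, in addresses

def chunked(addrs):
    # Break one logical list into fixed-size physical records.
    return [addrs[i:i + CHUNK] for i in range(0, len(addrs), CHUNK)]

def unchunk(records):
    return [a for rec in records for a in rec]

def intersect(a, b):
    # Single co-sequential pass over two ascending address lists.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def evaluate(ops, keys, lists):
    # lists: key -> chunked inverted list on "disk".  The running result
    # alternates between two scratch files (F1, F2); here that ping-pong
    # is represented simply by rebinding `current`.
    current = lists[keys[0]]
    for op, key in zip(ops, keys[1:]):
        current = chunked(op(unchunk(current), unchunk(lists[key])))
    return unchunk(current)
```

Each `op` plays the role of the Logic Processing box, and rebinding `current` stands in for switching the roles of files F1 and F2, so the procedure extends to any number of keys and any list length.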

3. The Controlled List Length Multilist

One of the disadvantages of a threaded list system is slow response
in situations where list intersections are infrequent and the list search
is long. For example, for a 1% intersection on a list search of 2,000
records, the total retrieval time for the 20 records is 200 seconds if
the DASD mean-access time is 100 milliseconds, whereas the time
to access the 20 records by an Inverted List procedure would be two
seconds, exclusive of the list intersection time. Whether this 200 sec-
onds versus the two seconds (plus list intersection) is actually mean-
ingful depends upon the output device and the chance distribution of
the intersections on the list. A typewriter with a 10 to 15 character
per second rate would type continuously if each record were in excess
of 150 characters and were about equally distributed through the list.
If all records were bunched at the end of the list, then there would be
an initial delay of around three minutes. The sensitivity of CRT
operation might be somewhat more erratic, since a record display of
interest might be viewed for a half minute or so, during which time
the list search would continue; however, other records that are re-
trieved, but are discarded within a few seconds with the press of a
button, may not provide sufficient overlapped search time. This would
be felt as a lack of responsiveness to the touch of the button, where
delays would range from virtually nothing to tens of seconds. A line
printer that would print from 100 to 1,200 lines per minute would
generally be delayed by such a processor response unless the records
were extremely long. Thus, it is seen that peripheral devices and the
environmental use of the system may actually cancel performance
factors in either the file organization or the DASD itself, which would
then allow the designer to use economic rather than performance
criteria.
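The 200-second figure above is simply the product of list length and mean access time; stated as arithmetic (the values are the text's assumptions, and the variable names are mine):

```python
# Response-time contrast for the example above: a 2,000-record list
# searched by following links, versus accessing only the 20
# intersecting records directly, at 100 ms mean DASD access time.
mean_access_ms = 100                     # mean DASD access time
list_length, intersections = 2000, 20    # 1% intersection rate
multilist_time_s = list_length * mean_access_ms / 1000    # 200 seconds
inverted_time_s = intersections * mean_access_ms / 1000   # 2 seconds,
# exclusive of the inverted-list intersection time itself
```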
The Multilist can be modified, however, to take advantage of DASD
modularity, in a relatively easy manner. In Fig. 37 the Multilist
system is modified by limiting each list to a predetermined maximum
length, which, in the case of this example, is four. When a list ex-
ceeds four, a new list is started, and the head of the new list is inserted
into a listing of such addresses at the output of the Key Directory,
under its respective key. In order to be more precise, an addressing
notation is used in this figure that identifies in which of the four cells
a given list begins. Thus, list W, which contains seven records, is
broken into a list of length four and a list of length three. The first
list begins in Cell 0, record six, and is identified as A0.6; the second
list begins in Cell 1 at record nine (A1.9). Schematically, the same
thread that appeared in Fig. 35 is shown in Fig. 37; however, the
break in the single list is indicated by the broken line since, in fact,
the seven items on the W list are now contained on two separate lists.
If the cells in this diagram were to be defined as individual disk
modules, then it would be possible to overlap list accessions if two or
more list records that are the "next to be accessed" are contained on
different modules (i.e., in different cells). The overlapping is auto-
matically controlled by the resident I/O operating system, which is
usually provided by the manufacturer of the equipment. For example,
if a query conjunction WX were required, the X list would be selected
because it contains a total of six records whereas the W list contains
a total of seven. The X list appears in this file as two sublists each
of which begins in a different module. The search system would issue
an I/O command against both addresses in the X listing, namely, A0.3
and A1.9. It is the function of the operating system then to schedule
the accessions on the respective disk modules in accordance with the
availability of the modules and the priority of the I/O request. If at
the moment when these two I/O commands were issued by the search
system, neither of these modules were in use, then the head position-
ing would begin simultaneously to A0.3 and A1.9, and record reads
would subsequently be initiated. Hence, the two X sublists would
[Figure: Key Directory output level with controlled list lengths. W: A0.6/4, A1.9/3; X: A0.3/4, A1.9/2; Y: A0.9/4; Z: A1.7/4, A2.5/2; sublists threaded through Cells 0-3. Maximum List Length = 4.]

Fig. 37. The Multilist File Organization with Controlled List Lengths

effectively be searched in parallel, and the entire search would be
effected in approximately four random accession times rather than six.
Assume then that the X list is being searched from both A0.3 and
A1.9, and that both A0.3 and its list successor (in Cell 0) are ac-
cessed and processed before A1.9 is accessed, and that an I/O com-
mand is issued for the third record on the list, which would be in
Cell 1. This request would automatically be stacked by the operating
system until A1.9 had been accessed and transmitted; thus, the pro-
gramming of this type of file structure is virtually no different from that
of the Multilist system except for the multiple head of list entries at
the output of the Directory.
The generation of this type of list structure can also be programmed
as easily as the Multilist of Fig. 35, as will be shown later. In fact,
the controlled list length Multilist is actually a more generalized list
structure, because both the Multilist and the Inverted List are special
cases, wherein the former has a list control of infinity and the latter
has a list control of unity. The procedure for generating these list
structures, described later, includes the parameterization of this con-
trol in such a way that the file can be regenerated at any time with a
new list length control, and hence a different degree of inversion. An
alternative to the single parameter is a table that provides differential
control over the various keys. For example, frequently used keys
should have shorter lists (i.e., more inversion) than infrequently used
keys. Furthermore, this table could automatically be modified by a
program that maintains key usage statistics, and the file reorganized
periodically in accordance with the new state of the table. Such a
procedure would then appear to have a self-adaptive quality.
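A generation pass with a parameterized list-length control might look as follows; this is a hedged sketch (the record layout and names are mine), but it exhibits the two special cases noted above: a control of 1 yields a fully inverted file, and a very large control yields a plain Multilist.

```python
# Building a controlled-list-length Multilist from keyed records.
# Each key's directory entry is a listing of [head address, sublist
# length] pairs; links holds the thread, i.e. for each (address, key)
# the next address on that key's sublist (None at end of sublist).
def build(records, max_len):
    # records: ascending list of (address, [keys]).
    directory = {}      # key -> [[head_addr, length], ...]
    links = {}          # (address, key) -> next address on the sublist
    last = {}           # key -> address of the current sublist's tail
    for addr, keys in records:
        for k in keys:
            entries = directory.setdefault(k, [])
            if entries and entries[-1][1] < max_len:
                links[(last[k], k)] = addr    # extend the current sublist
                entries[-1][1] += 1
            else:
                entries.append([addr, 1])     # start a new sublist
            links[(addr, k)] = None
            last[k] = addr
    return directory, links
```

Regenerating the file with a new `max_len` changes the degree of inversion, and a per-key table of controls could replace the single parameter, as described above.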
Consider as a second example the query "X OR Y," which would
necessitate the retrieval of both the X and the Y lists. The search
system would issue I/O commands to access records A0.9, A0.3, and
A1.9. Assume that the commands were issued in the above sequence
and that the operating system, therefore, initiated an accession request
on module 0 for address A0.9 and on module 1 for A1.9, so that part
of the X list and the Y list were overlapped. When the first record
on the Y list at location A0.9 is examined in core, the system will first
issue an I/O command to access the next item on the Y list, as indi-
cated by the link address within that record. This record happens
to be, as indicated in the diagram, in Cell 1; therefore, an I/O com-
mand to access a record on module 1 is issued. If at this time the
system is still in the process of accessing or reading record A1.9 on
the X list then the I/O command for the second item on the Y list
136 FILE STRUCTURES FOR ON-LINE SYSTEMS

will be stacked by the operating system. However, the 0 module will
be released, and the search will begin immediately on the other part
of the X list that begins at address A0.3, since this had initially been
stacked by the operating system; therefore, there will then be overlap
on the searches of the two X sublists. As soon as the first item on the
X sublist, at A1.9, has been read into core, an I/O command is issued
in accordance with the link address in that record to access the next
record, which happens to be in Cell 2, whereupon the search of the Y
list, which had been stalled until module 1 had been cleared, may
proceed, and at this time all three lists would be overlapped, with
accessions in modules 0, 1, and 2. Thus, the maximum amount of
overlap possible will always be obtained without any programming
requirement on the search system subexecutive, since the stacking of
I/O commands is performed automatically by the manufacturer pro-
vided operating system.
Unfortunately, the presearch statistic in this file structure is still the
same as that of the Multilist organization because an entire list
(the shortest in the conjunction) must still be searched, and, unless the
file is completely inverted (controlled list length of 1), no intersections
are possible at the Directory stage of the search. The search time,
however, will always be less than the uncontrolled Multilist, being at
the outset a function of the DASD modularity, while dynamically, it
is a function of the queries themselves, over which the system designer
has no control. The best he can do is to use the adaptive procedure
described above so that frequently used keys have more starting points
but higher Directory (output) overhead, and infrequently used keys
have fewer starting points but lower Directory (output) overhead.
The off-line generation of these list structured files is performed,
surprisingly, without the use of any random access devices. On-line
update of the files, which in certain cases may also be regarded as a
file generation procedure, must, of course, utilize the DASD, and the
requirement for such a procedure imposes its own set of constraints on
the file design. This topic will be covered in the next chapter.

4. Cellular Partitions

The preceding three list structures have the common property that
no attempt is made to arrange or order the records in the File for
more optimal retrieval. The File may be ordered by record accession

number, but these number assignments are quite arbitrary with respect
to the file structure and retrieval mechanisms. Furthermore, those
controls that are applied, such as the controlled list length Multilist,
do not lead to predetermined file processing efficiencies, but rather
are more incidental, being a function of how the records happen to
group or distribute themselves over the various lists, and of how the
various keys happen to combine in queries.
An alternative is to define logical cellular boundaries throughout
the DASD medium into which the records may be placed according
to some predetermined storage strategy. Or, even if the records were
still loaded arbitrarily, the cells represent a partition that may be used
in lieu of or in conjunction with list structured controls. This type of
file partitioning, referred to as Cellular Partitions in Fig. 27, will now
be explored.
Figure 38 illustrates a Cellular partition that incorporates a Multi-
list file structure. Looking the other way around, it could be viewed
as a modification of the Multilist technique, wherein each list is local-
ized within a cell, instead of being limited by length. This technique
is, therefore, called Cellular Multilist. The same approach could be
applied to the Inverted List method, wherein, the inverted lists for a
given cell would appear at the beginning of the cell.
Since the cell is part of the record address, as shown in Fig. 38,
the inverted listing of heads of lists at the Directory output is also a
listing of the cells in which a search must take place, and this cell
Inverted List can be used to advantage in order to reduce the number
of record accessions from the DASD and, as a result, to improve the
presearch statistics.
Consider the query XZ. An examination of the output level of the
directory shows that list X contains sublists that are wholly contained
within Cells 0, 1, and 2 with list lengths, respectively, of 2, 3, and 1.
Similarly, list Z has sublists that are entirely contained within Cells
1, 2, and 3, and list lengths 2, 3, 1; therefore, one cannot expect to
find a conjunction of X and Z in any cells that are not contained within
the cell intersection between the head of list addresses of X and Z.
That is, list X contains a head of list address in Cell 0 but list Z does
not; therefore, no intersection of XZ exists in Cell 0 because list Z
does not appear in Cell O. Similar reasoning applies to Cell 3. There-
fore, the search on the conjunction XZ would be limited to those sub-
lists of X and/or Z contained only in Cells 1 and 2; furthermore, in
Cell 1, list Z is preferred because it has a list length of 2 versus 3 for
list X in Cell 1, and in Cell 2 list X is preferred with a list length of 1.

[Figure: Key Directory output level with cell-localized sublists. W: A0.6/3, A1.2/2, A2.3/1, A3.5/1; X: A0.3/2, A1.4/3, A2.0/1; Y: A0.9/1, A1.4/2, A2.1/1; Z: A1.7/2, A2.2/3, A3.7/1; each sublist is wholly contained within one of Cells 0-3.]

Fig. 38. The Cellular Multilist File Organization



The total list search for the XZ conjunction is then 3, instead of 6, as
it would be if the shorter of the two lists X versus Z were searched,
as in the Multilist file organization.
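The cell-intersection rule for XZ can be written out directly. The directory contents below follow Fig. 38 (the head addresses are taken from the figure as reconstructed and should be treated as illustrative):

```python
# Cell-intersection search plan for the Cellular Multilist.  Per key,
# the Directory output gives, for each cell holding a sublist, the head
# address and sublist length.  Only cells in which every key of the
# conjunction has a sublist need be searched, and in each such cell
# only the shortest sublist is walked.
def plan_conjunction(directory, keys):
    cells = set(directory[keys[0]])
    for k in keys[1:]:
        cells &= set(directory[k])        # intersect at the cell level
    plan = {}
    for cell in sorted(cells):
        # Pick, per cell, the key whose sublist there is shortest.
        best = min(keys, key=lambda k: directory[k][cell][1])
        plan[cell] = (best, directory[best][cell])
    return plan

# X and Z entries per Fig. 38: cell -> (head address, sublist length).
directory = {
    "X": {0: ("A0.3", 2), 1: ("A1.4", 3), 2: ("A2.0", 1)},
    "Z": {1: ("A1.7", 2), 2: ("A2.2", 3), 3: ("A3.7", 1)},
}
plan = plan_conjunction(directory, ["X", "Z"])
total = sum(length for _, (_, length) in plan.values())
```

Only Cells 1 and 2 survive the cell intersection; the plan walks Z's sublist of length 2 in Cell 1 and X's sublist of length 1 in Cell 2, for a total list search of 3.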
If, in addition, the cells were to be designated as separate disk
modules then a controlled overlap could be achieved since the search
executive would maintain the schedule of list searches within each of
the individual modules, where each sublist search is completely con-
fined to a particular module. In the case of a movable head disk, the
designer might want to consider an alternative cellular designation,
this being that a cell should correspond to a cylinder of the disk, so
that a particular sublist search is effected without any further head
motion, once the heads have been positioned to a given cylinder.
This particular cellular assignment also can be used to facilitate
the following kind of multiprogramming strategy. Assume that three
queries are currently in process in a multiterminal, real-time system,
and that the respective cell accession lists for each query, as deter-
mined by the above described intersection process, are as shown in
Table 12. Accordingly, query 1 (Q1) requires that sublists be
searched in Cells 0, 1, 4, and 9; query 2 (Q2) requires that sublists be
searched in Cells 1, 8, 9, and 15, and query 3 (Q3) requires a search
in Cells 0, 2, and 4. The executive first merges the individual
cell accession lists to form the merged cell accession list shown in
Table 13. In this way it is seen that queries 1 and 3 require service
in Cell 0, queries 1 and 2 require service in Cell 1, etc. Thus, the
executive can intermix the list searches, and can move the access heads
across the cylinders uniformly. This system also has the advantage
that the mean access time is reduced since list searching is within a
cylinder, which for a typical disk system, requires a mean access time
of around 50 milliseconds for the track read plus latency, and if more
than one record on a list is contained within a track, then the mean
accession time per record is 50 milliseconds divided by the number of
list records per track. Furthermore, the cylinder head motion will
be more localized, producing a somewhat smaller mean accession

Table 12
QUERY /CELL ACCESSION LIST

Query Number Cell Accession List

Q1 0, 1, 4, 9
Q2 1, 8, 9, 15
Q3 0, 2, 4

Table 13
MERGED CELL ACCESSION LIST

Merged Cell Accession List Query Number

0 Q1, Q3
1 Q1, Q2
2 Q3
4 Q1, Q3
8 Q2
9 Q1, Q2
15 Q2

time for head motion. The system could also be constructed so that
higher priority queries would override this schedule and force the
heads to jump to those cylinders wherein high priority requests re-
quire service. If a new query were to be submitted to the system in
the course of the head traversal, then the executive would immediately
merge the new cell accession list into the current list so that the new
query would in turn go into execution immediately. When the head
reaches the highest cylinder number, the process would be reversed,
and the head would travel down the list on its return trip in order to
service those calls of queries that were entered in the wake of its prior
traversal.
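The construction of Table 13 from Table 12, together with the ascending sweep order of the access heads, can be sketched as follows (the function and variable names are mine):

```python
# Build the merged cell accession list of Table 13 from the per-query
# cell accession lists of Table 12.
def merge_accession_lists(queries):
    merged = {}
    for q, cells in queries.items():
        for c in cells:
            merged.setdefault(c, []).append(q)
    return dict(sorted(merged.items()))    # ascending cylinder order

queries = {"Q1": [0, 1, 4, 9], "Q2": [1, 8, 9, 15], "Q3": [0, 2, 4]}
schedule = merge_accession_lists(queries)
sweep = list(schedule)   # heads move up-cylinder: 0, 1, 2, 4, 8, 9, 15
# On reaching the highest cylinder the order reverses, servicing queries
# entered in the wake of the upward traversal.
```

A newly submitted query is merged into `schedule` in the same way, so it goes into execution on the current traversal rather than waiting for the next one.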
This technique can also be applied to increase the effective batch
size in magnetic tape searches. Let cells be defined as tape segments
so that the merged cell accession list indicates which queries are to be
processed during the passage through the processor of the indicated
segment. Furthermore, the queries themselves can be sequenced on
another tape according to this same schedule. For example, Table 13
would generate a query tape of the form: (Cell 0) Q1, Q3; (Cell 1)
Q1, Q2; (Cell 2) Q3; (Cell 4) Q1, Q3, etc., where each query is
completely repeated on the tape. Then, as the file tape segments pass
through the processor, the query tape is moved, and only those queries
required for processing will actually be in core. If the file is also list
structured, then each record that passes through the processor will
be examined by the relevant query only if it is on the appropriate list.
A sequenced list of "next address" must be maintained by the search
executive in order to generate the record-to-query pointers, but the
benefits in search economy can be considerable. The limitations of
batch systems are (1) the number of queries that can be placed in
core, and (2) the time to process all queries in core against a single
record. As the number of queries increases, this time increases, and
the system passes from tape to processor limited operation. Using

the Cellular Multilist strategy, only those queries requiring service in
a given segment must reside in core; therefore, the largest such number
for some segment is the effective batch size, where the actual batch
size of the run is generally somewhat larger. Thus, the actual size of
the batch is larger than that which can fit in core, and the batch size
increases as the cell size is decreased. More importantly, each record,
as it passes through the processor, is processed by only one query
(occasionally by more than one) at the minimal expense of main-
taining the list of sequenced "next addresses." The result is to enable
the tape to pass without stopping for extremely large batches.
Also, the processing of a record can overlap the reading of records
between the current one and the "next" one in the executive schedule.
This further increases the likelihood of remaining in a tape limited
mode.
Figure 39 presents another alternative, which completely eliminates
list structuring. This method is called the Cellular Serial file organi-
zation, since the file is similarly organized into cells as in Fig. 38, but
the cells are searched serially, as though they were strips of magnetic
tape. Again, the cell intersection strategy may be employed in order
to effect a gross file partition based on the query, and then each record
in the accessed cells is processed. This approach has the advantage
of programming simplicity and ease of update if reserve space is left
at the end of the cell. It is applied with more particular advantage to
DASDs in which both the average random accession time and the
serial data transmission rate are large.
That is, in comparing the cellular approach with and without intra-
cellular list structuring, if a high price is paid, in terms of access time,
to position the mechanism for cell retrieval, then, percentagewise, the
time saved in searching a list within the cell may not be significant,
versus serial transmission of the entire cell. For example, assume that
(1) the cell accession time is 500 milliseconds, (2) the serial trans-
mission is 250 KB/S, (3) the average random accession time per
record, once the DASD mechanism is positioned to the cell, is 50
milliseconds, (4) the cell size is 125 K bytes, and (5) that 5 out of
100 records in the cells are on the list to be retrieved. The Cellular
Serial approach would require one second to access and transmit the
cell, while the Cellular Multilist would require 3/4 seconds, but if the
list were to increase to 10, the retrieval time would approach very
near to one second; it could not reach one second until the list size
is the full cell size of 100 records. The compensation is found in a
decreased mean access time (from 50 milliseconds) as more records
appear on the same track.

[Figure 39: The Cellular Serial file organization]

Unfortunately, the magnetic strip and magnetic card DASDs like
the IBM Data Cell have the larger access
times of several hundred milliseconds, but also have a relatively low
serial data transmission rate of around 50-70 KB/S; however, in time
it can be expected that serial transmission rates of devices in this
category will increase.
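The comparison above reduces to two small formulas (the parameter values are the text's assumptions; the function names are mine):

```python
# Timing model for Cellular Serial versus Cellular Multilist retrieval:
# 500 ms cell accession, 250 KB/s serial rate, 125 KB cell, and 50 ms
# mean per-record accession once positioned at the cell.
def cellular_serial_ms(cell_bytes=125_000, rate_bytes_per_s=250_000,
                       cell_access_ms=500):
    # One positioning plus serial transmission of the entire cell.
    return cell_access_ms + 1000 * cell_bytes / rate_bytes_per_s

def cellular_multilist_ms(records_on_list, cell_access_ms=500,
                          per_record_ms=50):
    # One positioning plus a random accession per list record; the fixed
    # 50 ms ignores the reduction when list records share a track.
    return cell_access_ms + records_on_list * per_record_ms
```

With 5 of the cell's 100 records on the list, the Multilist takes 750 ms against the serial cell's one second; at 10 records this simple model reaches one second, while the falling per-record access time noted in the text keeps the actual figure just under it until the list fills the cell.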

5. Automatic Classification

It is evident that the fewer cells in which a key is represented, the
more efficient is the retrieval process in a cellular file organization.
That is, the more densely populated the lists are within a cell, the
fewer will be the cells in which the list appears, and ideally each list
would be wholly contained within a single cell. Computer algorithms
that form aggregates or clumps [13] of keyed records have been de-
veloped for the purpose of information retrieval, and these same pro-
cedures could well be applied in order to assign the keyed records to
cells in an attempt to achieve the minimization of cell occurrences
per key. The design of the algorithm would have to be such as to
place within a cell records that have as many keys in common as
possible. This is an optimization problem of a dynamic programming
nature since a given record might be attracted to several different
cells under such a process, and some measure of closeness would
therefore have to be defined in order to provide the necessary deci-
sion rule. Since this process of aggregating or clumping is similar to
the traditional problem of library shelving and classification, it could
be thought of as a form of automatic classification, and hence may
also be useful in the development of a thesaurus and classification
schedule to be used by the querist via printed catalogs as an aid in
formulating his questions. An algorithm that is reasonably efficient
in operation, but not optimized, for generating these cell assignments
is presented in Appendix B along with a description of how such a
thesaurus and classified hierarchic schedule would be used in conjunc-
tion with the retrieval system.

6. Off-Line Generation of List Structured Files

Chapter VIII will discuss methods of on-line file update, but, as
stated in Chapter III, not every system requires on-line file update.

In such systems, the updates are made in batches, and even in on-line
update systems, there is usually a massive file generation and update
capability that is performed off-line as an economic expedient. It is
also of interest to note that such a procedure can most effec-
tively be carried out without requiring the random access mode
of the DASD, since it consists chiefly of sorts. The process starts
with the input of a linear magnetic tape file of records, each with
the keys separately identified and somehow distinguished from
the rest of the data in the record. In fact, for the purpose of the
following discussion, the record has three distinguished parts: an
accession number, the keys, and all remaining non-key data, in-
cluding the qualifiers. The list structured file can then be generated
by a series of sorts, as shown in Fig. 41. At the upper left of the
diagram appears the Input File, which contains records in a format
similar to that shown in Fig. 18, with the three distinguished parts.
The Link Address, which appears with each Key Name/Value pair
in Fig. 18, is the only record component that is not contained in this
input file, and it is one of the functions of this program to assign them.
Below each tape symbol in Fig. 41 are contained the relevant data
fields in each record. An underlined field designates the sort key of
the file. As shown, the input file is assumed to be in accession number
(AN) sequence, which could be regarded as the main file key in the
system. Alternatively, the system could be sequenced on any other
desired unique key. In Block 1 Disk Addresses are assigned to each
input record in turn, allowing space for the Link Address. The input
records are packed on the tracks, and a certain amount of reserve
space should be left at the end of each track if the file is to be subject
to real-time update. This allows for expansion of records in real-time
without having to relocate and readdress data records, since the
address of any given logical record is its physical track address plus
a sequence number indicating the position of the logical record on
the track. The length of each record may be indicated by a character
count contained within the table of contents, so that the required
record on a track is readily accessed by a series of indexing operations.
It is customary to leave approximately 10% reserve area at the end
of the track. Furthermore, another loading rule that is used in the
Multilist system is to avoid splitting a record between two tracks unless
it is the case that a single record requires more than one track of
storage. Therefore, if a record is shorter than one track but cannot
fit completely within a track, it is begun on a new track. The logical
record address, consisting of the module number, physical track number,
and record sequence number, is mnemonically indicated as AD in this
diagram.

TECHNIQUES OF SEARCH FILE ORGANIZATION 145

[Figure: Fig. 40. Multiple Buffering and Logic Processing of Inverted Lists — inverted lists A, B, and C are read into paired core buffers and combined by the query logic to produce the decoder output.]

[Figure: Fig. 41. Multilist File Construction — flow chart: (1) Assign Disk Addresses (AD); (2) Generate Key/AD/AN Triples; (3) Sort by Key/AD; (4) Construct Key Directory with Head of List AD (HOLA) and List Length (LL), and generate Link Addresses (LA); (5) Sort by AN; (6) Insert LA and create File tape. An underlined field indicates the sort key.]

Since the addresses are assigned in sequence, the
output of Block 1 is a tape that is in both AN and AD sequence. In
addition, only the keys are written onto this tape, not the non-key data.
From this tape, Key / Address/ Accession Number triples are generated
in Block 2, and in Block 3 they are sorted by Address within Key.
This tape then enters Block 4 where the Key Directory is constructed.
The first address appearing in a particular key subsequence becomes
the Head of List Address (HOLA) for that key in the Key Directory.
The number of times the key is repeated in the subsequence is the
List Length (LL). The Directory is generated as a tape containing
KEY/HOLA/LL triples in Key sequence. This tape is subsequently
processed by one of the methods described in Chapter VI to produce
a Directory Decoder. If the Multilist were to have controlled List
Lengths with a maximum List Length of N, then the first N records
within a given key subsequence would become the first list, the next N
records within the same key subsequence would become the second
list, and so forth, with the appropriate head of list addresses being
entered into the Directory, and the List Length being set for each
sublist at N, except for the last list, which might be less than N. If
the control parameter were a table instead of a single constant, then
a table look-up would produce the appropriate value for each key.
Similarly, if the file being constructed were Cellular Multilist, then,
since the triples are in address within key sequence, a series of records
is removed within a key sequence list, all of which contain an address
within an indicated cell. When the address jumps to a new cell, a new
list is created in the Key Directory. The List Length in this case may,
of course, vary for each cellular sublist.
Each address within a particular key subsequence, subsequent to
the head of list address, becomes a Link Address and is appended as
a second field to the KEY / AD/ AN triple in a second output tape
from Block 4. That is, the Link Address as shown in this tape record
is actually the address field of the preceding record in the output tape
of Block 3. These records are then sorted by accession number in
Block 5 so that they can be merged with the original input file to form
the final list structured file. In Block 6 this merge takes place, and the
Link Address is appropriately inserted within the Key/Value/Link
Address field of the record. With this relatively straightforward
process, both the Key Directory and the list structured data file can
be generated, and any of the three variants of the Multilist system
can be produced.
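The sort-based construction of Blocks 2 through 4 can be sketched in a few lines. This in-memory version is only illustrative (the point of Fig. 41 is precisely that tape sorts avoid the random access mode), and it follows the convention of Fig. 45, in which the head of a list is the most recently loaded record and links chain back toward older addresses; all names are invented.

```python
# In-memory sketch of Blocks 2-4. Input records are (AN, AD, keys) with
# disk addresses AD already assigned in accession number order.

def build_multilist(records):
    # Block 2: generate Key/AD/AN triples.
    triples = [(key, ad, an) for an, ad, keys in records for key in keys]
    # Block 3: sort by address within key.
    triples.sort(key=lambda t: (t[0], t[1]))
    directory = {}   # key -> [HOLA, LL]
    links = {}       # (AN, key) -> link address; None marks the end of a list
    prev = {}        # key -> address of the previous record in the subsequence
    for key, ad, an in triples:
        links[(an, key)] = prev.get(key)      # link back to the older record
        count = directory[key][1] + 1 if key in directory else 1
        directory[key] = [ad, count]          # last address seen becomes HOLA
        prev[key] = ad
    return directory, links

recs = [(1, 100, ["X"]), (2, 110, ["X", "Y"]), (3, 120, ["Y"])]
directory, links = build_multilist(recs)
# directory: {"X": [110, 2], "Y": [120, 2]}; list X runs 110 -> 100 -> end
```

Controlled list lengths would be added by starting a fresh directory entry (a new sublist head) whenever the count for a key reaches the maximum N.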
In principle, an Inverted List structure would be generated by the
procedure of Fig. 41 if the List Length were limited to one, but a
modified form of this procedure generates it more efficiently.
Figure 42 presents the file construction process chart for the In-
verted List system. The input is the same, containing an Accession
Number, Keys, and Non-key Data. In Block 1, the Disk Addresses are
assigned to each file record. At this point the final list structured file
can be produced since no Link Addresses are required in the file area
itself. Hence, a record is written onto tape, which contains the Disk
Address (AD), the Accession Number (AN), the Keys, and the
Non-key Data. The citation of keys within the record is optional, be-
cause, as discussed previously, the decoding of the entire Boolean
expression of keys is performed in the Key Directory. Therefore, keys
would only be required within the record itself if it were desired to
perform some other computation on the key values or if it were de-
sired to print keys in the output. It is necessary to place the assigned
Disk Address in this file in order that the program that loads the file
from this output tape onto the disk will be able to properly place the
records in the tracks and leave the appropriate reserve area. An
alternate means of addressing could also be devised, whereby an index
number is placed before a series of records indicating the number of
such records to be placed on a single track, thereby assuring that the
appropriate reserve space will be left at the end of the track. Block
1 also produces a tape with the assigned addresses and keys for the
remainder of the processing. As in the Multilist file construction, a
tape with Key/Address pairs is created in Block 2. In Block 3 this
tape is sorted in address within key sequence producing the indicated
output. From this tape the Inverted List Key Directory is formed in
Block 4. All addresses appearing in a given key subsequence are
tabulated beneath a single citation of this key at the output of the
Key Directory. The form of this tabulation is indicated in the output
tape of Block 4. The Directory Decoder could, again, be any of the
techniques of Chapter VI; however, the output of the decoder ad-
dresses a variable length list of addresses of the form shown in Fig. 43.
In this figure the Directory is depicted, for illustration, as a tree. The
decoding of Key X points to an address A2, which contains in location
A2 the full key term. If the decoder is of the nonunique type,
such as a fixed tree or randomizer, then, in addition to the full key
term, a chain address would be required in the event that Key X,
found in location A2, did not match the query Key X. The linkage
would address another list that would also contain as its first word the
[Figure: Fig. 42. Inverted List File Construction — flow chart: (1) Assign Disk Addresses, writing the File tape (AD, AN, optional Keys, Non-key Data); (2) Generate Key/AD Pairs; (3) Sort by Key/AD; (4) Construct Inverted List Key Directory (Key A: AD, AD, ...; Key B: AD, AD, ...).]

complete key. If this key in turn did not match, then it would chain
to another list, and so forth. If the Directory contained unique keys
(i.e., one of the variable trees) then the citation of the key and the
chain address would be unnecessary in the variable length inverted
list. Since the list is variable in length, it may also be desirable to
store the list length. This information may also be useful to the
logical expression decoder, since, if a conjunction of an extremely
long list and an extremely short list is required, the decoder may
not even perform the list intersection, but rather will search the short
list directly. At the end of the list of addresses a reserve space should
be left so that if records must be added to a given list, the appropriate
address can be inserted in sequence without overflow. If the reserved
area should become exhausted, as shown for Key Y, then the last
location of the record contains a link address to another variable
length record where the list continues. When the file is reconstructed,
the key lists are once again pulled together into single variable length
records.
The procedure for processing these Inverted Lists according to
query logic has already been described, where, in Fig. 40, the lists
A, B, and C correspond to the Inverted Lists of Fig. 43.
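A minimal sketch of that conjunctive processing: each key's inverted list is a sorted list of record addresses, and only addresses common to every list are accessed from the File. (A production version would stream the lists through core buffers as in Fig. 40 rather than hold them whole; this version is illustrative only.)

```python
# Intersect the sorted inverted lists of a query conjunction; the surviving
# addresses are the only File records that need be accessed.

def intersect(lists):
    result = set(lists[0])
    for lst in lists[1:]:
        result &= set(lst)
    return sorted(result)

A = [12, 40, 77, 105, 182]      # inverted list for the first key (invented)
B = [40, 105, 310]
C = [7, 40, 105, 182]
hits = intersect([A, B, C])     # [40, 105]
```

Because B is shortest, a decoder armed with the stored list lengths could equally skip the full intersection and test B's members against A and C, as the text suggests for very unequal lengths.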

7. File Access Timing

The time to process a query in the File is a function of the file
organization technique, DASD characteristics, and the number of
records accessed from the File. A number of file, query, and DASD
related parameters must, therefore, be defined in order to formulate
this time. Some of these parameters are readily obtainable, in the
case of the DASD from manufacturer's specifications, and in the case
of the file from file generation statistics. Others, however, can only
be estimated initially, and, if desired, could be obtained more ac-
curately only from system operations, perhaps by the Automonitor
suggested in Chapter I.
Table 14 contains a listing of these parameters and their definitions.
The file related parameters are as follows. The total Number of
Distinct Keys (i.e., Name/Value pairs) is V, and the Number of
Records in the System is Nr. The average Number of Keys per Record
is Nk, and the Average List Length can then be computed approximately
as NrNk/V. The average number of Characters per File
[Figure: Fig. 43. Tree Directory and Inverted Lists — the Directory, drawn as a tree, decodes Key X to address A2, whose record holds Key X/LA/LL followed by a variable length list of record addresses; the list for Key Y overflows its record at A5 and continues, via a link address, in a second variable length record at A18.]

Table 14
SYMBOL DEFINITIONS FOR FILE PROCESSING CALCULATIONS

Symbol   Definition                                             Parameter Type

V        Number of Distinct Keys in Vocabulary                  File Related
Nr       Number of Records in System
Nk       Number of Keys/Record (Av.)
L        Average List Length (= NrNk/V, approximately)
Cf       Characters/File (logical) Record (Av.)
Rc       Records/Cell (Av.)
Ck       Cells/Key

Nt       Number of Terms in a Single Query Product (Av.)        Query Related
Np       Number of Nonnegated Terms in a Single Query
           Product (Av.)
Ls       Shortest List Length in Query (Av.)
ρ        Ratio of Query Responses to Ls (Av.)
α        Ratio of Query Cell Responses to Ck (Av.)

A        Number of File Record Addresses per DASD               Device Related
           Physical Record
Tr       Random Access Time of DASD (Av.)
Rt       Transfer Rate of DASD (B/S)
R        Rotation Time of DASD (Sec)

Record is Cf. The average number of Records per Cell in cellular
partitioned systems is Rc, and the average number of Cells per Key
is Ck.
The query related parameters are as follows. The formulations are
all based upon a single query product. If the query has a sum of
products, then the time must be multiplied by the number of product
terms in the sum. The average number of terms (keys, negated and
nonnegated) in the query product is Nt, and the number of nonnegated
terms is Np. The shortest list, on the average, of the keys in the
product is Ls, the average ratio of query responses to Ls is ρ, and
the average ratio of the number of cells in which a query response is
found to the average number of cells per key (Ck) is α.
The DASD oriented parameters are as follows. The Number of
Addresses that can be stored in a physical record of the DASD is A
(this is also a function of core buffer allocation); the average Random
Access Time, excluding latency and read time, of the DASD is Tr; the
data Transfer Rate of the DASD, in bytes per second, is Rt; and the
Rotation Time of the DASD is R.
The File search times for the Multilist, Inverted List, and Cellular
Serial file structures are given in Table 15.

Table 15
LIST SEARCH RETRIEVAL TIME

Process                 Multilist        Inverted List         Cellular Serial

Directory Decoding      NpTn             NtTn                  NpTn

List (Cell)             -                Nt[L/A](Tr + 1.5R)    Np[Ck/A](Tr + 1.5R)
Intersection

List (Cell) Search
and Record Transfer     Ls(Tr + 1.5R)    ρLs(Tr + 1.5R)        αCk(Tr + RcCf/Rt)

[x] = INT+(x), i.e., x rounded up to the next integer; Tn is the time to
decode one key in the (n-level) Directory.

The actual record access time for a record is the time to position
the head of a movable head DASD (Tr) plus latency (0.5R) plus the
track read (R), assuming that a physical record is one track.
If the physical record were of any other length, R would be interpreted
as the record read time and modified appropriately. For a
fixed head DASD, Tr = 0. The Multilist decodes, say, in an n-level
tree, all nonnegated keys in the query product; the Inverted List
decodes all terms, and the Cellular Serial also decodes only the
nonnegated terms. There are no list intersections prior to the File search

in the Multilist system; the Inverted List system must access [L/A]
physical records, as shown in Fig. 40, for each of the Nt keys.
Similarly, the cellular serial system must access and intersect [Ck/A]
physical records for each of the Np nonnegated keys. The Multilist
system must access every record on the shortest list (Ls) in the file,
while the Inverted List system will access only those records (ρLs)
that satisfy the key logic, and the Cellular Serial system must access
all of the intersected cells (αCk), where a cell contains RcCf bytes.
This yields File search formulations (exclusive of Directory Decoding)
of:

    TML = Ls (Tr + 1.5R)                                        (1)

    TIL = (Nt [L/A] + ρLs) (Tr + 1.5R)                          (2)

    TCS = Np [Ck/A] (Tr + 1.5R) + αCk (Tr + RcCf/Rt)            (3)
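Formulations (1) through (3) can be expressed directly as functions; parameter names follow Table 14, and the sample device values below (a movable-head disk with Tr = 75 ms, R = 25 ms) are invented.

```python
# File search time formulations (1)-(3); [x] is implemented as ceil(x).
from math import ceil

def t_multilist(Ls, Tr, R):
    return Ls * (Tr + 1.5 * R)                                    # (1)

def t_inverted(Nt, L, A, rho, Ls, Tr, R):
    return (Nt * ceil(L / A) + rho * Ls) * (Tr + 1.5 * R)         # (2)

def t_cellular(Np, Ck, A, alpha, Rc, Cf, Tr, R, Rt):
    return (Np * ceil(Ck / A) * (Tr + 1.5 * R)
            + alpha * Ck * (Tr + Rc * Cf / Rt))                   # (3)

t_ml = t_multilist(Ls=20, Tr=0.075, R=0.025)                      # 2.25 s
t_il = t_inverted(Nt=3, L=200, A=500, rho=0.1, Ls=20, Tr=0.075, R=0.025)
```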

Substituting (1) into (2) produces a relationship between Multilist
and Inverted List timing:

    TIL = ((Nt/Ls) [L/A] + ρ) TML                               (4)



Therefore, Inverted List search is faster than Multilist search when

    ρ < 1 - (Nt/Ls) [L/A]                                       (5)

This inequation says that the more lists to be intersected (Nt), the
more physical records per average list ([L/A]), and the smaller the
shortest list (Ls), the more likely it is that the Multilist retrieval time
will be less than the Inverted List time. However, in general, inequation
(5) is satisfied, because ρ is usually very small for Nt in excess
of 2. For example, for Nt = 3, A = 500, L = 200, and Ls = 20, the
right side of (5) is .85. This means that the density of hits on the
shortest list for a 3-term conjunction would have to exceed 85% before
the benefit of performing the list intersections in the Inverted Lists
would be overcome by simply searching the shortest list. One might
expect ρ, for 3 terms, to be on the order of 0.1 to 0.25.
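The worked example can be checked directly from inequation (5):

```python
# Break-even hit density on the shortest list: rho < 1 - (Nt/Ls) * ceil(L/A).
from math import ceil

def rho_threshold(Nt, Ls, L, A):
    return 1 - (Nt / Ls) * ceil(L / A)

threshold = rho_threshold(Nt=3, Ls=20, L=200, A=500)   # 0.85, as in the text
```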
Again, the reader is cautioned in the interpretation and use of these
formulations. In some systems two (or more) DASD types may be
used. For example, a Disk Pack might be used for Directory storage
while a Data Cell is used for File and perhaps Inverted List storage,
in which case the DASD oriented parameters must be appropriately
assigned their respective values in the formulations.
CHAPTER VIII

ON-LINE FILE UPDATE AND MAINTENANCE

THE PRINCIPAL REASON for on-line, real-time update of a file is to
serve the operational needs of those systems in which the Update
transaction must be usable in the file within a very short time from
its declaration. Cost comparisons between on-line versus batched
updating are deceptive because of the manner in which one may
choose to account for such fixed costs as the DASD, terminals, and
software. Therefore, the decision should be qualitatively based upon
a requirement for timely updates, unless the hardware and software
accounting is so clear-cut as to enable a cost comparison. For ex-
ample, if the software were supplied with the equipment, and all equip-
ment were charged strictly on a time, type, and capacity basis (i.e.,
broken down by equipment type and quantity of direct access stor-
age), then one could readily compare costs. Other than in this kind
of clear-cut situation, the difficulty arises in apportioning the share of
hardware cost to be carried by the update functions, because, not-
withstanding on-line update, the same DASD and terminal costs
would have to be sustained in support of retrieval functions. Software
design and implementation costs, however, would be higher, and these
should be accounted. A single update can typically be executed in
a few tens to hundreds of milliseconds of processing
time, depending on equipment type. This represents a constant
unit processing charge, Cpr, regardless of the number of updates. The
batched system, however, has a minimum processing charge, Cpb,
which is based on the time to linearly pass the file on magnetic tape.


As long as the update processing is tape limited, this cost is independent
of batch size, and since update runs are normally tape limited,
one can assume this to be a fixed processing cost for the batch.
However, the tape update run is always preceded by a sort of the
transactions into the file sequence, and the cost, in processing time,
for this sort will be some function of the number of transactions. Let
this charge be denoted Cs(b). Then, a comparison could be made
between the unit cost (per transaction) for on-line update, Cr , versus
batched update, Cb, as:

    Cr = C'r + Cpr                                              (1)

    Cb = C'b + [Cpb + Cs(b)] / N                                (2)

where C'r and C'b are the fixed hardware and software assessments
apportioned to the time period over which the batch of N transmis-
sions has accumulated, and it is obviously these assessments that are
the most circumspect.
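A sketch of the comparison, assuming the per-transaction model Cr = C'r + Cpr versus Cb = C'b + (Cpb + Cs(b))/N read from equations (1) and (2); every cost figure below is invented, and the apportioned fixed assessments remain the uncertain inputs.

```python
# Per-transaction break-even sketch: batching amortizes the fixed tape pass
# Cpb and the sort charge Cs(N) over the N transactions in the batch.

def online_unit_cost(fixed_r, Cpr):
    return fixed_r + Cpr

def batched_unit_cost(fixed_b, Cpb, sort_cost, N):
    return fixed_b + (Cpb + sort_cost(N)) / N

sort_cost = lambda N: 0.001 * N * max(N, 2).bit_length()   # ~ N log N tape sort

Cr = online_unit_cost(0.05, 0.02)                  # constant per transaction
Cb = [batched_unit_cost(0.002, 30.0, sort_cost, N) for N in (100, 1000, 10000)]
# Cb falls as N grows, which is why the decision usually rests on timeliness
```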
The remainder of this chapter is concerned with the techniques of
file structuring for real-time update, given that the system designers
have decided that it is required. On-line updates can be put into five
categories:

1) Whole Record Addition.
2) Whole Record Deletion.
3) Deletion of keys.
4) Addition/Deletion/Modification of Non-key Data.
5) Addition of keys.

The designer of file structures with on-line, real-time update must
give attention to two problems. First is the update of the Directory
and/or repair of list pointers whenever a key is involved. This may
occur in all of the above categories except possibly (4). Second is
the relocation of a record when an update expands its size, and the
subsequent utilization of its former space. This may occur in cate-
gories (4) and (5). The procedures for effecting these updates and
handling these two major problems are somewhat different for
threaded and inverted lists. The techniques for threaded list update
will be described first, then Inverted Lists, and then cellular partitions,
each with respect to the five categories.
ON-LINE FILE UPDATE AND MAINTENANCE 157

1. On-Line Update of a Multilist File

It is assumed throughout this discussion that the category of update
is known to the query processor, and that a specification appears in
the query that identifies a record or class of records to be updated.
In order to increase the efficiency of updates without having to utilize
bidirectional links, certain controls, as shown in Fig. 44, are inserted
into a record. At the beginning of the record, in a fixed position, is
a single bit called the Record Delete Bit. If this bit is set to 0 it means
that the record is in the file; if it is set to 1, it means that the list search
process should skip over it, continuing to the next link address be-
cause it has been logically, though not physically, deleted from the
file. In addition, there is associated with each Key/Link Address field
a bit called the Key Delete Bit, which, when set to 0, means that the
key is in the record, and, when set to 1, means that the key has been
removed from the record. The Key/Link Address pair cannot actually
be removed from the record, because the list linkage would
then be broken. The alternative is to bridge the linkage from the
preceding record on the list to the succeeding record. That is, assume
that the record in which a key is to be deleted is the Nth on a list.
Then, record (N - 1) has a link address to record N, and record N
has the address of record (N + 1) on the list. A bridge across record
N could be created by transferring the address of record (N + 1)
to the link address field (for the particular key) to record (N - 1).
However, record N, which, by hypothesis, has been identified for the
query processor, cannot point to record (N - 1) unless the list struc-
ture is bidirectional. Therefore, the only way to access record (N - 1)
is to search the list from the beginning, which is inefficient when N
is large. It is much easier to set the key delete bit to 1 and retain the
record (N - 1) to (N + 1) linkage via record N.

[Figure: Fig. 44. Multilist Record Update Controls — record layout: Record Delete Bit | Key Delete Bit | Key/LA | Key Delete Bit | Key/LA | ... | Key Delete Bit | Key/LA]

Table 16 presents the process for real-time, Whole Record Addition.
First, the record is edited for file entry. This means that it is
put into the general record format as typified by Fig. 18, and, as with
file generation, the only parts of the record that cannot be inserted at

this time are the link addresses. After the record has been edited for
file entry, a DASD address AD, is assigned to it in accordance with
available space on the DASD. In any system using a movable head
DASD, new records should be assigned addresses in a monotonic
sequence, so that the mechanism moves in smaller increments, in one
direction as it accesses a list. The third step begins a loop in which
each key within the record must in turn be processed. The ith Key
in the record is decoded in the Directory. Then, in step four, the head
of list address of Key i currently in the Directory is transferred from
the Directory to the link address field of Key i in the new record. In
step five the address of the new record, AD, becomes the new head
of list address of Key i in the Directory. This effectively puts the new
record at the head of the list and maintains the remainder of the list
without having to access or affect any of the other list records. In
step six the list length of Key i in the Directory is incremented by 1,
and in step seven the loop is closed by repeating steps three through
six for all keys in the new record. Finally, in step eight the new
record is stored in the DASD at address AD.
Figure 45 schematically illustrates this process for a single key.
In the upper half of the figure is shown the Before Update picture,
and in the lower half of the figure is shown the After Update picture.

Table 16
WHOLE RECORD ADDITION

1. Edit record for file entry

2. Assign DASD address, AD, to new record

3. Decode Key i in Directory

4. Transfer head of list address of Key i from Directory to link address
   field of Key i in new record

5. AD becomes new head of list address of Key i in Directory

6. Increment list length of Key i in Directory by 1

7. Repeat 3 through 6 for all keys in new record

8. Store record in file at location AD


[Figure: Fig. 45. Key Update — Before Update, the output level of the Key Directory holds Key X/105/2; record 105 links to record 76, which ends the list. After Update, the Directory holds Key X/182/3, and the new record 182 links to 105. Addresses increase upward in the diagram.]

At the top of each section is shown the Output Level of the Key
Directory. For simplicity a single key update is illustrated, i.e., a
single iteration through the loop of steps three through six in Table 16.
As indicated in step seven, this procedure would have to be repeated
for every key in the new record. Consider, therefore, the Key X,
which has two items on its list, the first of which appears at address
105 in the file area. Addresses are assumed to increase in this dia-
gram in an upward direction. That is, when this file was created, the
first item entered was that at address 76, and the second entered was
at address 105. At this point in time, it is desired to add a third item
to the list. As indicated by step four, the head of list address in the
output level of the Directory, being 105, becomes the link address of
the new record, and, as indicated in step five, the address of the new
record, which is to be 182, becomes the new head of list address in the
Directory. Finally, as indicated in step six, the list length is incre-
mented by 1. The complete picture after update is shown in the lower
half of the diagram.
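The update of Fig. 45 can be replayed literally with a toy directory and link table (a sketch only; record storage is reduced to a pair of dictionaries, and the keys being added are assumed to exist in the Directory already).

```python
# directory maps key -> (HOLA, LL); links maps a record's disk address to
# its per-key link addresses. Before the update: Key X -> (105, 2).

directory = {"X": (105, 2)}
links = {105: {"X": 76}, 76: {"X": None}}

def add_record(new_ad, keys):
    """Table 16, steps 3-6, for each key of the new record."""
    links[new_ad] = {}
    for k in keys:
        hola, ll = directory[k]
        links[new_ad][k] = hola            # step 4: old HOLA becomes the link
        directory[k] = (new_ad, ll + 1)    # steps 5-6: new HOLA, LL + 1

add_record(182, ["X"])
# After: directory == {"X": (182, 3)}; record 182 now heads the list
```

Note that no record already on the list is touched, which is the whole point of inserting at the head.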
Figure 46 presents both the procedure and a schematic for Whole
Record Deletion. An entire record is deleted from the file in real-
time by a one- or two-step process. First, the record is accessed, and
the record delete bit is set to 1. This step alone can complete the
update, since the record has been logically removed from the file and
all lists that pass through this record remain intact. The only other
update that may be desired, depending upon point of view, is the
list length. If the designer regards this number as the physical list
length, i.e., the actual number of random accessions required to
traverse the list, then it should not be modified and the update is fin-
ished with step 1. If it is regarded as the logical length of the
list, i.e., the number of retrievable records that can respond to a
query, then the list length of every key in the record must be decreased
by 1. The former definition is more appropriate to use of the list
length for determining which list in a query logical product to search,
while the latter is more appropriate as a presearch retrieval statistic.
The system should also be so constructed as to maintain statistics on
the number of such delete bits that have been set, and periodically, in
accordance with these counts, the entire file should be regenerated
in order to repack the file and remove these records completely from
the system.
Figure 47 illustrates the deletion of individual keys within a record.
It also is a one- or two-step process. First, the record is located and
brought into the core. The key delete bit of all keys in the record

that are to be removed is set to 1. The figure shows a list that formerly
contained four records under Key X. The third one has been deleted
by the placement of a 1 in its key delete bit. The search system when
searching this list would access each record in turn, but having the
third record in core, would detect the key delete bit, ignore the record,
and immediately access and process the fourth record. The list length
decrement is again optional, depending on its use as an indicator of
physical or logical list length.

(1) Set Record Delete Bit.

(2) Decode every key in record and decrement
    List Length by one. [Optional]

[Figure: Fig. 46. Whole Record Deletion]

(1) Set Key Delete Bit of key(s) to be deleted.

(2) Decode key(s) in Directory and decrement
    respective List Lengths by one. [Optional]

[Figure: Fig. 47. Deletion of Keys]
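A list search that honors both delete bits might look as follows (an illustrative sketch; the linkage is always followed, whether or not the record qualifies, since the chain must stay intact).

```python
# records: address -> {"deleted": record delete bit, "keys": {key: (key
# delete bit, link address)}}. A hit requires both bits to be 0.

def search_list(records, hola, key):
    hits, ad = [], hola
    while ad is not None:
        rec = records[ad]
        key_bit, link = rec["keys"][key]
        if not rec["deleted"] and key_bit == 0:
            hits.append(ad)              # record is live and still holds the key
        ad = link                        # the linkage is retained either way
    return hits

records = {
    182: {"deleted": False, "keys": {"X": (0, 105)}},
    105: {"deleted": True,  "keys": {"X": (0, 90)}},   # whole record deleted
    90:  {"deleted": False, "keys": {"X": (1, 76)}},   # Key X deleted
    76:  {"deleted": False, "keys": {"X": (0, None)}},
}
hits = search_list(records, 182, "X")    # [182, 76]
```

The skipped records still cost a random access each, which is why the physical list length reading of LL matters for choosing which list to search.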



Table 17 describes Addition/Deletion/Modification of Non-key
Data. The first step is to read the record from the DASD into core,
and to modify the record. If the modified record is shortened, then
the entire track (or physical record) can be repacked and written
back onto the DASD. The result will simply be to increase the re-
serve area at the end of the track. If, however, the modified record
is increased in size, then two subcases must be distinguished.
(A) If the repacked track, including the larger record, does not
overflow, then it is written back onto the DASD.
(B) If the repacked track overflows one track length, then the
entire record is deleted according to the record delete procedure
described in Fig. 46, and the entire modified record is added to the file
according to the procedure described in Table 16. This means that
the record will appear twice in the file, once in its unmodified form,
and once in its modified form, but in the former instance, the record
delete bit will be set to 1 so that the search program will ignore it in
any list search. As stated above, these records, for which the record
delete bit has been set, are removed from the file permanently at the
time of the periodic file regeneration process.

Table 17
ADDITION/DELETION/MODIFICATION OF
NON-KEY DATA

1. Read record from DASD into core, and modify record

2. If record is shortened, repack track and write onto DASD

3. If record is increased, and

   (A) repacked track does not overflow, then write onto DASD

   (B) repacked track overflows, then delete whole record and add
       modified record according to whole record addition

Figure 48 illustrates Non-key Data Modification. In each of the
two cases, a track is shown that contains four records labeled R1
through R4. In Case A, record R3 is modified in such a way that it
increases in size but does not overflow the original track length; there-
fore, the only difference between the before and after picture is that
record R3 is larger and the reserve area is smaller. In Case B, it is
assumed that record R3 is modified in such a way that the repacked
track would exceed a track length; therefore, the record delete bit for
record R3 is set to 1 on the original track (X) and record R3 is re-
entered as a completely new record on another track (Y) at the end
of the file. Here it is assumed that it has become the second logical
record on Track Y where it was formerly the third logical record on
the track.
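The Case (A)/(B) rule of Table 17 and Fig. 48 can be sketched as follows. The track capacity and record sizes are invented, and a real implementation would repack physical track images rather than dictionaries.

```python
# tracks: track id -> {record id: (size in bytes, record delete bit)}.
TRACK_CAPACITY = 3625

def restore_record(tracks, track_id, rec_id, new_size):
    """Restore a modified record; returns the track it ends up on."""
    recs = tracks[track_id]
    live = sum(s for r, (s, d) in recs.items() if not d and r != rec_id)
    if live + new_size <= TRACK_CAPACITY:        # Case (A): repack in place
        recs[rec_id] = (new_size, False)
        return track_id
    recs[rec_id] = (recs[rec_id][0], True)       # Case (B): set delete bit
    for tid, trecs in tracks.items():            # re-enter where it fits
        if sum(s for s, d in trecs.values() if not d) + new_size <= TRACK_CAPACITY:
            trecs[rec_id] = (new_size, False)
            return tid
    raise RuntimeError("no track has room")

tracks = {"X": {"R1": (1000, False), "R2": (1000, False), "R3": (1000, False)},
          "Y": {"R4": (500, False)}}
restore_record(tracks, "X", "R3", 1500)              # Case (A): stays on X
moved_to = restore_record(tracks, "X", "R3", 2000)   # Case (B): moves to Y
```

The logically deleted copy of R3 on Track X persists, as the text notes, until the periodic off-line regeneration repacks the file.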
This technique is easily programmed and has the advantage of
leaving records physically intact, which increases retrieval efficiency,
but it also leaves islands of unused space distributed throughout
memory. Further on in the chapter, techniques for the maintenance
of this space will be described. There is, however, an alternative
approach to complete record relocation for this category of update.
This is to generate a trailer record that logically continues the ex-
panded record at a randomly located position, usually within the reserve
area.
Table 18 defines the real-time process for the addition of keys to
a record. The first step is to read the record from the DASD into
core, and to add the desired keys to the record. The second step is
to decode in the Directory the ith Key that has been added to the
record. The third step is to transfer the head of list address of Key
i from the Directory to the link address field of Key i in the modified
record. The fourth step is to place the address of the new record into
the head of list address of Key i in the Directory, which effectively
places this record at the head of the Key i list. Finally, the list length
of Key i is incremented in the Directory by 1. Steps two through
five are repeated for all keys that have been added to the record. It
should be recognized that this procedure is identical to that described
in Table 16 for the addition of an entire record, where every key is
effectively a new key added to the record; therefore, Fig. 45 is also
illustrative of this process. After the appropriate Directory updates
have been made, in accordance with steps one through six, the record
must be restored to the DASD file. At this point, however, since new
data have been added to the record, the repacked track may again
overflow; therefore, the same procedure as was applied to non-key
Fig. 48. Non-key Data Modification
[Figure: Case (A) shows Track X before and after, with records R1 through R4; the modified R3 is enlarged in place and the reserve area reduced. Case (B) shows Track X before and after, with the delete bit of R3 set, and Track Y after, with the modified R3 re-entered as its second logical record.]



data modifications in steps two and three of Table 17 must be applied


when restoring data. This means that the record may actually have
to be deleted and replaced in the file on another track, depending
upon the number of keys added and the amount of reserve space left
at the end of its current track.

Table 18
ADDITION OF KEYS

1. Read record from DASD address AD into core, and add key(s)

2. Decode Key i in Directory

3. Transfer head of list address of Key i from Directory to Link


address field of key in modified record

4. AD becomes new head of list address of Key i in Directory

5. Increment list length of Key i in Directory by 1

6. Repeat 2 through 5 for all keys to be added

7. Restore modified record according to Non-key Data Modification


procedure
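Steps two through five of Table 18 amount to a head-of-list insertion, which can be sketched in outline. The in-core structures below are hypothetical stand-ins for the Directory and the file record, and each added key is assumed to exist already in the Directory.

```python
def add_keys_multilist(directory, record, new_keys, record_addr):
    """Sketch of Table 18. `directory` maps key -> [head_of_list
    address, list_length]; `record['links']` maps key -> link address
    of the next record on that key's list."""
    for key in new_keys:                  # step 6: repeat for all keys
        head, length = directory[key]     # step 2: decode key
        record['links'][key] = head       # step 3: old head becomes link
        directory[key][0] = record_addr   # step 4: record is new head
        directory[key][1] = length + 1    # step 5: bump list length
    return record
```

The record then still has to be restored to the DASD under the non-key data modification procedure, as step seven of the table requires.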

2. On-Line Update of an Inverted List Structure

Table 19 illustrates the procedure for Whole Record Addition in


the Inverted List system. As in the case of Multilist, the first step
is to edit the record for file entry in the core memory. Then a DASD
address, AD, is assigned to the new record. The third step is to de-
code Key i in the Directory to the relevant variable length Inverted
List. The new address must then be inserted in sequence into the list.
If the insertion of this address causes the list to overflow the allocated
block in mass storage, as illustrated by the first variable length record
for Key Y in Fig. 43, another block is attached and a link address
inserted into the last word of the first block in order to connect them.
The first record of the list, which contains the Key Name/Chain Address/List
Length triplet is then updated by incrementing the list


length by 1. Steps three through five in Table 19 must be repeated
for all keys in the new record. Finally, the new record is stored in
DASD address AD.
Table 19
WHOLE RECORD ADDITION (INVERTED LIST)

1. Edit record for file entry

2. Assign DASD address, AD, to new record

3. Decode Key i in Directory to variable length Inverted List

4. Insert AD, in sequence, into list

If reserve area is exhausted, attach the next available record


block, and insert a link address as last word of previous record

5. Increment list length by 1

6. Repeat 3 through 5 for all keys in new record

7. Store new record on DASD at address AD
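Steps 4 and 5 of Table 19, inserting an address in sequence and attaching a fresh block when one overflows, might be sketched as below. The block size and names are illustrative, and the DASD link addresses are only implied by the list-of-blocks representation.

```python
import bisect

BLOCK_CAPACITY = 4  # hypothetical addresses per inverted-list block

def insert_address(blocks, addr):
    """Sketch of Table 19, step 4: insert `addr` in sequence into an
    inverted list held as chained fixed-size blocks (a list of lists
    standing in for DASD records joined by link addresses)."""
    flat = [a for block in blocks for a in block]
    bisect.insort(flat, addr)             # keep the list in sequence
    # If the last block overflows, attach the next available block;
    # in a real file a link address would be written as the last
    # word of the previous block to connect them.
    return [flat[i:i + BLOCK_CAPACITY]
            for i in range(0, len(flat), BLOCK_CAPACITY)]
```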

Table 20 presents two alternatives for Whole Record Deletion.


Method One is identical to that in the Multilist system, but since the
process of finding the preceding records on given key lists in the
Inverted List system can be effected more rapidly, the designer may
choose, as in Method Two, to physically as well as logically remove
the record from all of its key lists. This is done first by setting the
record delete bit, and then by removing the address of the record from
every key list on which it appears, repacking the key lists and restor-
ing them to their DASD records. The list lengths are also decreased
by 1 since the records will never be physically accessed.
Table 21 presents the update procedure for Deletion of Keys, which
is similar to that of Addition of Keys, except that the variable length
inverted list, from which the given key is to be deleted, is always
shortened and therefore can be repacked and restored on the same
DASD track, as indicated in step two of the procedure. Steps one
and two are repeated for all keys that are to be deleted from the
record. If the keys are also cited in the file record, then the record

Table 20
WHOLE RECORD DELETION (INVERTED LIST)

METHOD ONE

1. Set record delete bit

2. Decode every key in record and decrement list length by 1


(Optional)

METHOD TWO

1. Set record delete bit

2. Decode every key of the record in the Directory, remove record


address, AD, from respective Inverted Lists and decrement
list length by 1

should be accessed and the key citation within the record should be
deleted. It should be noted that the key delete bit is not required in
the Inverted List system because linkages do not exist in the file area
itself and hence deletion of an entire key from a record does not de-
stroy the continuity of the list. In fact the record address is physically
deleted from the inverted key listing in step two. In those systems in
which the keys are not cited in the file record itself, there is no
requirement to perform step four.
The updating of non-key data in the Inverted List system is identical
to that in the Multilist system since there are no key listings involved,
and the essential difference between the Multilist system and the
Inverted List system is related only to the organization of key lists.
Table 22 outlines the procedure for the Addition of Keys to an
Inverted List system. It is very similar to the process of adding a
new record since, in order to add a new record, key lists must be
updated. The only difference is that the new keys must also be added
to the record in the file area, if keys are cited in records, and the
record must then be restored to the file area, again in accordance with
the non-key data modification procedure, because the addition of the
new keys may enlarge the record to the extent that it could not be
reinserted on its appropriate track.

Table 21
DELETION OF KEYS (INVERTED LIST)

1. Decode Key i in the Directory to variable length Inverted List

2. Remove AD of record from which Key i is to be deleted from the


Inverted List, repack track, and restore to disk

3. Repeat 1 and 2 for all keys to be deleted from record

4. If keys are also cited in the file record, access the record, and
delete the key citations

Table 22
ADDITION OF KEYS (INVERTED LIST)

1. Decode Key i in Directory to variable length Inverted List

2. Insert AD of record to which Key i is being added, in sequence,


into list

3. Increment the list length by 1

4. Repeat 1 through 3 for all keys to be added

5. Read record from DASD address AD into core and add new keys
to record

6. Restore modified record according to Non-key Data Modification


procedure

If the volume of updates in a system is considerably greater than


that of retrievals, there is an alternative that can be exercised in the
Inverted List system which greatly increases the update efficiency at
the expense of retrieval efficiency. This is to store the record acces-
sion numbers in the Inverted Lists instead of the addresses, so that
when a file record is relocated, none of the key lists has to be modi-
fied; only a single modification to the accession number decoder is
necessary. The retrieval process, however, is considerably less effi-
cient. The decoded accession list after key logic will consist of a
sequence of accession numbers each of which must then be decoded
to an address. The resulting list of addresses will not, in general,
be in sequence, so that if it is required to use a multiquery merging
strategy, as presented in Table 18, or if it is desired to move the
access mechanism uniformly through the mass storage medium, then
the list of decoded addresses will have to be sorted.
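The accession-number variant just described can be sketched as a two-stage lookup followed by a sort. The decoder mapping and names are hypothetical.

```python
def addresses_for_hits(accession_hits, accession_decoder):
    """Sketch of the accession-number variant: the inverted lists
    yield accession numbers, each of which must be decoded to a DASD
    address; the resulting addresses are then sorted so the access
    mechanism can move uniformly through the storage medium.
    `accession_decoder` is the single structure that must be updated
    when a file record is relocated."""
    return sorted(accession_decoder[acc] for acc in accession_hits)
```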

3. On-Line Update of Cellular Partitions

Since the Directory to File pointers designate cells (or at most in-
ternally list structured cells) in the cellular partitioned systems, only
the transfer of a record out of or into a cell requires a Directory up-
date. Also, in the special case where a new key is added to a cell
or a key occurrence is retired from a cell, it is necessary to update the
Directory. The Directory update can therefore be minimized by leav-
ing sufficient reserve space at the end, or distributing it through the
cell, so that the great majority of record relocations or trailers due
to expansion are within the cell.

4. Space Maintenance

All of the update procedures described in this chapter that have


included the setting of the record delete bit and the relocation of the
record, leave a block of completely or mostly unused space in memory.
In the case of the Multilist system there is a residual utility to this
space if the lists are unidirectional, since the linkages through the
logically deleted record must be left intact. In a bidirectional Multi-
list or Inverted List system the entire space could be reused, and it
is necessary, of course, to consider the reuse of this space, because,
in time, all of the originally allocated reserve space may be consumed.

There are two general approaches to the regeneration of this space as


useable reserve. One is to do nothing about it within the operational
context of on-line maintenance, but to periodically regenerate the
entire partitioned file, off-line, according to the procedure of Fig. 41
or 42. This is by far the easiest way out if it can be accommodated.
In general, this procedure would be appropriate to a system in which
update activity were not too dynamic, and the cost of file regeneration
with the requisite periodicity were not prohibitive.
The second approach is to perform maintenance on the files in a
background mode, within the real-time system environment, in order
to collect the unused islands of space, either logically or physically.
The former is usually done by maintaining a list of available space,
which records, for each available block, its address and length; alter-
natively a threaded list can be maintained, so that each new block
is stuck onto the head of the list, and when a block of a desired size
is needed, the "space provider" scans the list (one random accession
per block) until it finds the appropriate size record. The disadvantage
of logical space maintenance is that the list of available space can
become large and requires maintenance itself, while the threaded list
requires additional processing that can become time consuming at
the time of record update, partially obviating the original reason for
using the record delete bit.
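A first-fit scan over a list of available space, of the kind just described, might look like the following sketch. The (address, length) representation and the policy details are assumptions.

```python
def first_fit(free_list, needed):
    """Sketch of logical space maintenance: `free_list` holds
    (address, length) pairs for reusable blocks.  Scan for the first
    block of sufficient size, return its address, and keep any
    unused remainder on the list."""
    for i, (addr, length) in enumerate(free_list):
        if length >= needed:
            del free_list[i]
            if length > needed:               # return the unused tail
                free_list.append((addr + needed, length - needed))
            return addr
    return None                               # no block large enough
```

With a threaded list the scan would instead follow link addresses from block to block, at one random accession per block, which is the processing cost the text notes.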
An alternative to logical space maintenance is physical space main-
tenance, which has the advantage of using low priority, background
time to collect and organize the unused blocks into large blocks of
reserve space that can most efficiently be used at high priority update
time. This process may descriptively be called space brooming, be-
cause it essentially pushes the reuseable blocks into heaps by a process
that sweeps through memory. It is called into operation whenever the
on-line terminal or higher priority background load is sufficiently low,
and it can respond to a sudden increase in foreground load within a
few seconds, leaving the files in a completely useable state. Figure 49
illustrates the process. Assume that the top illustration in the figure
is a part of the file with 16 records shown. The hash marked regions
are complete records or portions of records that have been logically
deleted by means of the record or key delete bits. The bottom half of
the diagram illustrates the state of the file after space brooming.
Record 1 is accessed, examined for delete bits and, not having any,
is ignored. Record 2 is accessed, compressed, and restored, leaving a
small amount of space between records 2 and 3. * Record 3 is moved
* If the logical records are packed into physical records such as tracks, the entire physical
record would be processed in this way, and excess space pushed ahead to the next record.

up, record 4 is completely deleted, and record 5 is therefore pushed


up behind record 3, with a consequent accumulation of space equal to
record 4 plus the deleted portion of record 2. Each record move or
deletion requires an update of all keys in the Directory pertaining to
the record. This may thus appear to be an expensive procedure, but
it is performed on a low priority basis, unless there is a sudden need
for space. Furthermore, the program can terminate and leave a com-
pletely usable file after completing its current record move. For a
record with 20 keys, and a Directory decoding and update time of
200 milliseconds, this would imply a file lock-out of 4 seconds. If this
procedure were to continue through all 16 records, this portion of the
file would then appear as shown in the lower half of the figure.
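The sweep can be sketched as a single pass that drops logically deleted records and computes each survivor's new address. This is an in-core model; real brooming would also rewrite the Directory entries and the bidirectional links for every moved record, as the text notes.

```python
def space_broom(file_area):
    """Sketch of space brooming over one region of the file.
    `file_area` is an ordered list of (rec_id, body, delete_bit)
    entries; deleted records are dropped and survivors are swept
    toward the front.  Returns the compacted region, the new address
    of each surviving record (needed for Directory and link updates),
    and the amount of space reclaimed."""
    compacted, new_addr, cursor = [], {}, 0
    for rec_id, body, delete_bit in file_area:
        if delete_bit:              # logically deleted: reclaim space
            continue
        new_addr[rec_id] = cursor   # every key of this record needs a
        compacted.append((rec_id, body, 0))   # Directory/link update
        cursor += len(body)
    reclaimed = sum(len(b) for _, b, _ in file_area) - cursor
    return compacted, new_addr, reclaimed
```

Because the pass completes one record move at a time, it can stop after any move and leave the file in a completely usable state, which is what lets it run at low priority.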

Fig. 49. Background Maintenance by Space Brooming
[Figure: the same 16-record region of the file, shown before and after maintenance.]



An effective compromise between physical and logical space collec-


tion would be to maintain a fixed list of distributed available space in
which each block had been collected by space brooming. Thus, the
list maintenance is very easy (there are always N blocks, relatively
few in number and of variable length), and the space brooming is
localized.
It should be emphasized that space brooming can only be imple-
mented on bidirectional Multilist files or Inverted List files, because
when a record is relocated the links to its successors remain intact,
but the link addresses in its predecessors must be changed; if the
links are unidirectional, its predecessors are unknown and hence the
record's new address cannot be inserted as a new link address in its
predecessor on the list. This difficulty is obviously overcome with
bidirectional links, and is overcome in the Inverted List because the
new address of the relocated record simply replaces the former ad-
dress in the sequenced Inverted List. In fact, the sequence itself is
unchanged.

5. Update Timing

Tables 23, 24, and 25 present timing formulations for the Multilist,
Inverted List, and Cellular Serial systems with respect to the five on-
line update categories: whole record addition, whole record deletion,
deletion of n keys, non-key modification (with and without record
relocation), and addition of n keys. It is assumed that the list lengths
are physical lists, and therefore do not require decrements in the Di-
rectory output when the key or record delete bit is set, and that the
number of keys in a record is N_k. It should also be noted, as shown
in the footnote of Table 23, that a complete revolution, R, must be
added to random access writes in which the DASD does not include
hardware verification with an immediate read after write. Appendix
C indicates for each of the equipments described which ones have this
hardware facility. Those that do not, like the IBM 2311 Disk Pack,
should be verified by a programmed read (on the next rotation) of
the written data.
In each of these tables, the update procedure is broken into four
successive steps:
1) Decode Directory.
2) Access Record (to be updated).
Table 23
UPDATE TIMING FOR MULTILIST FILE STRUCTURE

Process              | Whole Record Addition | Whole Record Deletion [2] | Deletion of n Keys | Non-key Modification (without Relocation) | Non-key Modification (with Relocation) | Addition of n Keys
Decode Directory     | --          | T_D [3] | T_D | T_D | T_D     | T_D
Access Record        | --          | T_A     | T_A | T_A | T_A     | T_A
Update Directory     | N_k T_D     | --      | --  | --  | N_k T_D | n T_D
Store Updated Record | T_A' [1]    | T_A     | T_A | T_A | T_A     | T_A

[1] T_A' = T_A + 1.5R (add R if read after write verification is required).
[2] Assumes that Directory list lengths are physical; therefore, Directory
update is not required when key or record delete bit is set.
[3] T_D = decoding time for three-level tree (cf. formula [12], Chapter VI).

Table 24
UPDATE TIMING FOR INVERTED LIST FILE STRUCTURE

Process              | Whole Record Addition | Whole Record Deletion | Deletion of n Keys | Non-key Modification (without Relocation) | Non-key Modification (with Relocation) | Addition of n Keys
Decode Directory     | --          | T_D     | T_D   | T_D | T_D     | T_D
Access Record        | --          | T_A     | T_A   | T_A | T_A     | T_A
Update Directory     | N_k T_D     | N_k T_D | n T_D | --  | N_k T_D | n T_D
Update Inverted List | N_k T_L [1] | N_k T_L | n T_L | --  | N_k T_L | n T_L
Store Updated Record | T_A         | T_A [2] | T_A   | T_A | T_A     | T_A

[1] T_L = T_A + 1/2 [L/C_t] R (add R if read after write verification is
required).
[2] Required only if delete bit is used. Delete bit required only if space
brooming is used.

Table 25
UPDATE TIMING FOR CELLULAR SERIAL FILE STRUCTURE

Process               | Whole Record Addition [1] | Whole Record Deletion | Deletion of n Keys [2] | Non-key Modification (without Relocation) | Non-key Modification (with Relocation) [3] | Addition of n Keys
Decode Directory      | --  | T_D | T_D | T_D | T_D | T_D
Access Record         | --  | T_A | T_A | T_A | T_A | T_A
Update Directory      | --  | --  | --  | --  | --  | --
Update Inverted Lists | --  | --  | --  | --  | --  | --
Store Updated Record  | T_A | T_A | T_A | T_A | T_A | T_A

[1] Assume that all keys in record are already represented in cell.
[2] Keys not deleted from cell.
[3] Assume that record is relocated within cell.

3) Update Directory.
4) Store Updated Record.
This assumes that all updates are against a single record which is pre-
sumably identified by its accession number. In the case of generic
updates, appropriate multipliers are required to account for multiple
key decodes in step one and multiple record accessions, directory up-
dates, and record restores, in steps two, three, and four. A ranking
of processing complexity is revealed by these tables and summarized
in Table 26. The Cellular Serial is least complex, assuming that rec-
ord relocations are made within a cell, and that the deletion or addi-
tion of keys to a record does not completely remove from or require
the addition of keys to the cell, which will be the case in the majority
of updates. Intermediate in complexity is the Multilist structure, and
most complex is the Inverted List structure, because of the require-
ment to update the inverted lists.
It may also be seen from Table 26 that the following relation-
ships exist for the five categories between the Inverted List and Multi-
list update times.

Table 26
UPDATE COMPARISONS AMONG THE THREE FILE STRUCTURES

Update Type            | Multilist            | Inverted List                  | Cellular Serial
Whole Record Addition  | N_k T_D + T_A        | N_k T_D + T_A + N_k T_L        | T_A
Whole Record Deletion  | T_D + 2T_A           | T_D + 2T_A + N_k (T_D + T_L)   | T_D + 2T_A
Deletion of n Keys     | T_D + 2T_A           | T_D + 2T_A + n (T_D + T_L)     | T_D + 2T_A
Non-key Modification (without Relocation) | T_D + 2T_A | T_D + 2T_A | T_D + 2T_A
Non-key Modification (with Relocation) | (N_k + 1) T_D + 2T_A | (N_k + 1) T_D + 2T_A + N_k T_L | T_D + 2T_A
Addition of n Keys     | (n + 1) T_D + 2T_A   | (n + 1) T_D + 2T_A + n T_L     | T_D + 2T_A

Whole Record Addition
    T_IL = T_ML + N_k T_L                (3)
Whole Record Deletion
    T_IL = T_ML + N_k (T_D + T_L)        (4)
Deletion of n Keys
    T_IL = T_ML + n (T_D + T_L)          (5)
Non-key Modification (without Relocation)
    T_IL = T_ML                          (6)
Non-key Modification (with Relocation)
    T_IL = T_ML + N_k T_L                (7)
Addition of n Keys
    T_IL = T_ML + n T_L                  (8)
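The Multilist and Inverted List columns of Table 26, and with them relations (3) through (8), can be expressed directly. This is a sketch with illustrative parameter values; T_D, T_A, and T_L stand for the decode, access, and inverted-list update times.

```python
def multilist_update_times(T_D, T_A, N_k, n):
    """Update times from the Multilist column of Table 26."""
    return {
        'whole_record_addition': N_k * T_D + T_A,
        'whole_record_deletion': T_D + 2 * T_A,
        'deletion_of_n_keys':    T_D + 2 * T_A,
        'nonkey_mod_no_reloc':   T_D + 2 * T_A,
        'nonkey_mod_reloc':      (N_k + 1) * T_D + 2 * T_A,
        'addition_of_n_keys':    (n + 1) * T_D + 2 * T_A,
    }

def inverted_list_update_times(T_D, T_A, T_L, N_k, n):
    """Update times from the Inverted List column of Table 26; the
    excess over Multilist reproduces relations (3) through (8)."""
    ml = multilist_update_times(T_D, T_A, N_k, n)
    return {
        'whole_record_addition': ml['whole_record_addition'] + N_k * T_L,          # (3)
        'whole_record_deletion': ml['whole_record_deletion'] + N_k * (T_D + T_L),  # (4)
        'deletion_of_n_keys':    ml['deletion_of_n_keys'] + n * (T_D + T_L),       # (5)
        'nonkey_mod_no_reloc':   ml['nonkey_mod_no_reloc'],                        # (6)
        'nonkey_mod_reloc':      ml['nonkey_mod_reloc'] + N_k * T_L,               # (7)
        'addition_of_n_keys':    ml['addition_of_n_keys'] + n * T_L,               # (8)
    }
```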

An example.

Table 27 contains a set of typical numerical value assignments for


the parameters of file timing appearing in Table 14. It is assumed
in this example that a combination of fast and slow DASDs are to be
used, where the fast DASD is something like a Disk Pack, and the
slow DASD, an IBM Data Cell system. Table 28 indicates the
assignment of Directory, Inverted Lists, and File to these device
types, and Table 29 summarizes the resulting calculations. This
table contains both retrieval and update calculations so as to en-
able cross comparisons, the former being made with formulations
( 1) to (3) from Chapter VII. As with the previous numeric ex-
ample, one cannot make categoric generalizations from these results
since they are very much tied to a specific set of parameter values.

Table 27
FILE, DASD, AND QUERY PARAMETER ASSIGNMENTS

Parameter    | Value
V            | 10,000
N_T          | 500,000
N_k          | 20
L            | 1,000
C_t (fast)   | 3,600
C_t (slow)   | 2,000
R_c          | 100
C_c          | 100
N_q          | 4
N_t          | 3
L_q          | 150
p            | 0.1
a            | 0.1
A (fast)     | 900
A (slow)     | 500
T_r (fast)   | 85 msec.
T_r (slow)   | 500 msec.
R_t (slow)   | 50 KB/S
R (fast)     | 25 msec.
R (slow)     | 50 msec.

For example, the values of p and a are very critical with regard to the
comparisons of Inverted List and Cellular Serial. It is possible, how-
ever, to confirm certain intuitive statements made previously. On
retrieval, the Inverted List approach is considerably faster than Multi-
list, and this is invariably the case. The Cellular Serial is faster than
Inverted List if a is approximately the same as p. Furthermore, the
highest quality presearch statistic is obtained from the Inverted List
system, and it is obtained after (N_t T_D + [L/C_t] N_t [T_r + 1.5R])
milliseconds, which in this example is 5.3 seconds. This can be de-
rived from the first two rows of Table 15, in the Inverted List column.
The Inverted List structure is almost invariably slower on update
than the Muitilist, and the Cellular Serial, under the assumption that
relocation is always within the cell, is either equivalent to or slightly
faster than the Multilist system. In fact, relocation is a relatively time-
consuming process in the Inverted List system.

Table 28
DASD ASSIGNMENTS TO FILE COMPONENTS

File Component | Multilist | Inverted List | Cellular Serial
Directory      | Fast      | Fast          | Fast
Inverted List  | --        | Slow          | Fast
Data File      | Slow      | Slow          | Slow

Table 29
TIMING CALCULATIONS FOR ON-LINE FILE UPDATE COMPARISONS

Transaction*                              | Multilist | Inverted List | Cellular Serial
Retrieval                                 | 86.8 Sec. | 13.9 Sec.     | 3.6 Sec.
Whole Record Addition                     | 4.0 Sec.  | 15.0 Sec.     | 0.6 Sec.
Whole Record Deletion                     | 1.3 Sec.  | 15.8 Sec.     | 1.3 Sec.
Deletion of 2 Keys                        | 1.3 Sec.  | 2.8 Sec.      | 1.3 Sec.
Non-key Modification (without Relocation) | 1.3 Sec.  | 1.3 Sec.      | 1.3 Sec.
Non-key Modification (with Relocation)    | 4.8 Sec.  | 25.8 Sec.     | 1.3 Sec.
Addition of 2 Keys                        | 1.7 Sec.  | 2.8 Sec.      | 1.3 Sec.

* All transactions include decoding with a three-level tree, the top level stored in core
(Formula [11], Chapter VI).

6. Summary of File Structuring Techniques

The formulations and Table for computing File storage and process
timing presented in Chapters VI through VIII are indicative of the
difficulty in making general assessments of superiority of one file
structure over another. One must consider all of the various qualita-
tive and quantitative factors including ambiguous vs. unambiguous de-
coding, ambiguity resolution overhead in DASD memory space, range
search capability in decoders, decoder speed, decoder memory require-
ment, file retrieval speed, terminal environment (typewriter, printer,
CRT), requirement for interrecord processing, type of DASD, assign-
ment of file components (Directory, Inverted Lists; Data File) to
DASD types, query characteristics, real-time vs. batched update re-
quirement, update vs. retrieval transaction volume, and programming
complexity. Table 30 contains a summary of the most pertinent file
related properties, and the relative merits of the various Multilist file
organizations, the Inverted List organization, and the Cellular Serial
organization. Again, lower numbers in the chart indicate the more
optimal property value or attribute.
In teletype terminal systems the speed of initial response may be
a more important factor than either successive responses or total
response time, since the successive retrieval time is usually less, in
any of the systems, than typing time. However, if there is interrecord
processing the total response time is more significant. In the Multilist
and Cellular systems this time is quite unpredictable, because the first
list intersection for a key conjunction may appear anywhere on the
shortest list, although the controlled list length and cellular variants
increase the possibility for a fast response. The Inverted List struc-
ture, on the other hand, has an initial response time that is directly
a function of the length and number of lists to be logically processed,
but if the presearch statistic is not required, the executive system could
be so constructed as to begin accessing File records as soon as the
logic processing produces its first retrieval address, and can subse-
quently overlap the File accessions with list processing. This will
assure a much faster initial response for the Inverted List system. In
a CRT system, the successive response time becomes more critical
when the viewer wants to quickly scan the responses, and hence the
Inverted List approach is clearly superior since it is invariably fastest
in this respect, being a function only of the DASD mean access time.
In a system with an on-line, high speed line printer, successive and
total retrieval time is of paramount importance.
Table 30
SUMMARY OF FILE ORGANIZATION PROPERTIES

Property | Multilist | Controlled List Multilist | Cellular Multilist | Inverted List | Cellular Serial
Speed of Initial Response is a function of: | First list intersection | First list intersection and distribution over cells | Number of cells and first list intersection | Query list lengths | Query (cell) list lengths, size of cell, and first intersection
Successive Retrieval Time | 4 | 2 | 2 | 1 | 3
Successive Retrieval Time is a function of: | Successive list intersections | Successive list intersections and distribution over cells | Successive list intersections and distribution over cells | Mean access time of DASD | Successive list intersections and distribution over cells
Total Retrieval Time | 4 | 3 | 2 | 1 | 1
No. of File Random Accessions per query | 4 | 4 | 3 | 2 | 1
Presearch Retrieval Statistics | 3 | 3 | 2 | 1 | 4
Programming Complexity | 1 | 2 | 3 | 3 | 1
Update Time | 1 | 1 | 2 | 3 | 1
DASD Memory Requirement (Excluding Directory) | 3 | 4 | 4 | 3/1* | 1/2*

* With Keys in the Inverted List File record/Without Keys in the Inverted List File record.

The Cellular Serial system requires the fewest DASD random ac-
cessions in the File, but it manages this at the expense of large scale
serial transmissions. In terms of pure list structures, the Inverted List
is again superior to the Multilist systems, although the latter can be
somewhat improved in this respect by Cellular partitions.
The Cellular Serial system cannot provide very meaningful pre-
search statistics except possibly by actually sampling the cells and
computing the expected response population by statistical inference.
This approach is quite feasible, easily programmed, and can be fairly
accurate. Furthermore, the responses in the sample can be displayed,
and if the user so indicates, a larger sample taken, providing increas-
ingly more reliable estimates and further retrievals. The Inverted List,
of course, provides the most accurate presearch statistics, followed by
the Cellular Multilist, which provides the sum of the shortest lists in
each of the responding cells, followed, finally, by the Multilist, which
provides the shortest list.
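The cell-sampling estimate described above can be sketched as follows. The cell representation and the query predicate are hypothetical.

```python
import random

def estimate_responses(cells, matches_query, sample_size, seed=0):
    """Sketch of presearch estimation for Cellular Serial: search a
    random sample of cells, then scale the hit count by the total
    number of cells to infer the expected response population.
    A larger `sample_size` gives an increasingly reliable estimate,
    and the hits found in the sample can be displayed immediately."""
    rng = random.Random(seed)
    sample = rng.sample(cells, sample_size)
    hits = sum(1 for cell in sample
                 for rec in cell if matches_query(rec))
    return hits * len(cells) / sample_size   # statistical inference
```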
Programming complexity is highly subjective, because it is some-
what a function of the experience of and past methodologies employed
by the programmers and analysts. However, the Cellular Serial and
Multilist systems have "fewer moving parts" than the Inverted and
partially Inverted List systems, and therefore have a potential of
lesser complexity.
A major advantage of the Multilist System, in addition to inherent
simplicity, is its update speed; however, it gains this speed at the
expense of file disorder, and the space maintenance problem must be
solved in the background mode either by the list of available space
technique or bidirectional link space brooming. Otherwise, space
maintenance must be scheduled and performed as file regeneration.
Finally, a comparison of File memory requirement depends upon
whether or not the keys are stored in the record. This option is open
only to the Inverted List system since, if the keys are used only for
intrarecord logic processing and not for interrecord processing or
printing, then they can be omitted from the Inverted List File record.
In this case the Inverted List structure has the least memory require-
ment, because it has no repeated key citations (i.e., each key is cited only
once, in the Inverted List at the Directory output). The next most
economical is the Cellular Serial structure because it has repeated
keys in the records, but no link addresses other than in the inverted list
of cells, which is a considerably smaller body of data than the set of
link addresses. Next comes the full set of key citations and link
addresses found in the Inverted List (with record keys) and Multilist

structures. However, if the inverted lists were to be stored in a more


expensive DASD than the File records (e.g. Disk vs. Data Cell), then
the memory expense of the Inverted List structure would be greater
than that of the Multilist because the link addresses in the latter are
stored within the File record. Of course, the reason for such a split
would be to improve prefile-search response time. The partially in-
verted Multilist systems have the largest memory requirement because
they must maintain a complete set of link addresses in addition to
some partial inversion data in the form of cell addresses.

7. Conclusions

In conclusion, this book has addressed itself to the needs of the


traditionally tape oriented batched system programmer and analyst
who is presently undergoing certain changes in his thought processes
toward the use of mass random access devices. It is also addressed
to the new programmer who must become aware of file processes in
general, in the organization and assembly of large scale information
systems. Toward this end, the purposes of the book have been to
regularize and classify major system concepts such as the real-time
Executive, the Query Interpreter, the Query Language, the Directory
and its Decoder, and the File structure. Within this framework spe-
cific techniques have been described, with major emphasis on the
Directory and File Structure, in sufficient detail to clearly indicate
programming methodology. Then, with some further qualitative
analysis and development of formulations, respective design trade-offs
have been indicated. A deliberate attempt has been made toward
brevity so as not to obscure with detail the basic elements of design,
in order that the analyst or programmer analyst may have sufficient
room to maneuver when applying these techniques to his own prob-
lem environment. It is felt that such an approach to the treatment of
this subject is most appropriate because, at the present developmental
stage of the technology, information system design is a combination
of craft and science, where theoretical knowledge is limited and
experience plays a most important role.
APPENDIX A

THE INFORMATION SYSTEM IMPLEMENTATION PROCESS

ANY SYSTEM HAS its genesis with a need, and, therefore, once a need is
indicated a determination of requirements may be made. This is usually,
at the outset, a rather informal affair, wherein a few people get together
and agree that in fact some need for a system exists. Figure A1 presents
a block diagram of the steps that are normally followed (with variations
that would depend on the size and structure of the organization), in pursu-
ing such an inclination. The informal expression is followed by a state-
ment that there does, in fact, exist a need for a system and this document,
sometimes called a Requirements Determination, may be used for estab-
lishing policy or making decisions with regard to the allocation of funds
toward the design and implementation process. The first serious and
committed step is the performance of a Requirements Evaluation and
Analysis, in which all the various requirements of indicated or potential
users of the system would be examined and analyzed in order to produce
a document, considerably more formal than the determination document,
that would clearly indicate in a convincing manner the need for the sys-
tem. Of course the extent and formality of these procedures would be
more or less rigid, depending on the type and size of organization involved.
The requirements evaluation and analysis document will generally con-
tain such specifics as the number and type of persons who will use the
system, the amount of information that they generate and in what form,
the expected increase in volumes based upon present modes of operation,
and projections based upon improved modes of operation. Having such
information in hand will enable the next stage of the process to be per-

182 FILE STRUCTURES FOR ON-LINE SYSTEMS

formed. This is a Gross System Specification (or description) with Alternative Performance Levels, which produces a document that specifies all
of the functions that the new system must perform in such a way that all
of the requirements will be satisfactorily met. Except for the possible in-
clusion of projected work loads in the requirements document, the Gross
System Specification is the first document that involves projective, creative
thought. In general, where the system is going to be relatively expensive,
particularly in relation to current methods of operation, it is desirable to
generate a series of alternative Gross System Specifications with different
performance levels, so that a set of corresponding Designs can be pro-
duced, and a Cost/Performance Evaluation for each performance level
can be made. In this way the organization management can decide, based
upon financial considerations, and upon the indicated quality factors,
which specification and design is most appropriate; hence the output of
the system specification block in Fig. A1 indicates a series of documents,
each of which contains a Gross System Design and Cost Analysis. The
output of the Cost/Performance Evaluation block will then be a choice
of one from among the many submitted designs. The System Design at
this stage is gross, because there is often a considerable expense involved
in the design of large scale systems, and multiplying this expense for each
performance level would not in general be justified. When the evaluation
has selected a particular design, the final system may be fully specified
and designed. This culminates in the production of a series of very de-
tailed documents that will be used for any Hardware Procurements re-
quired, for Software Implementations, and Staffing and Training, as shown
in the figure. Of course it is not always the case that electronic data
processing equipment is involved. A somewhat simplified and streamlined
version of this process may take place for even the more simply conceived
semiautomated systems that may use only tabulating equipment.
It is a moot point as to whether certain types of development that may
be required, particularly under Software Implementation actually take
place in the Final System Design phase or in the implementation phase. In
addition, the Hardware Procurement may also involve special designs and
developments that are completed after the final system design. As indi-
cated in the figure, manual operations such as forms editing, keypunching,
and verification, which are part of file construction, usually begin shortly
after Staffing and Training. These procedures are generally called File
Construction.
It is sometimes the case that software debugging can partly be per-
formed on another machine before the hardware has been installed and
tested. After files have been partially or fully constructed, the hardware
installed and tested, and most or all of the software implemented, a com-
plete System Test will be made. After a final period of pilot system opera-
tion it may be turned over to the user organization for full System
Operation and use. In many organizations, particularly in government,
S,st.m Grall
R.quirements Requlrem."ts Specification S,stem Costl
Eyaluation with Alternotive P.rformanc.
Determination Deti9n and
and Anol,sis performance Itvels Cost Anal,sis Evaluation

Steffint
:0- on ..
Tr.inil...
' .... .....
.....
..... File >
io- "CI
-. Construc t ion ;g
Z

Finol Finol Haraware


~
Hortl.or. S,st.m S,st.m
S,sttm ~ Systtm ..... Install ot ion
Pr.cur ..... n.. Test ... Operation
SDteificatian Desi9 n -.. and Test -r--
Softwe-re
• Impl.menta tion

Fig. A!. The System Implementation Process 00


~
-
184 FILE STRUCTURES FOR ON-LINE SYSTEMS

where large scale systems are required and where the special skills needed
to perform these procedures may not be available, the work in these
various blocks may be separately contracted. In fact, it is not unusual to
find outside contractors involved in part or in whole within the function-
ing of each of these blocks in the diagram.
Figure A2 presents the same information as Fig. A1, except from a
personnel and management viewpoint. After the system design is com-
pleted a Development Manager will be appointed who should have three
supervisors reporting directly to him. One is responsible for Hardware
Procurement, which may fall into three general categories: the procure-
ment of central Processors and Mass Storage devices, the procurement of
Communications and Terminals if these are required, and the procure-
ment of noncomputerized hardware devices such as Film equipment,
Printing, and Duplicating equipment.
The second supervisor is responsible for Software Development and
Implementation, and this also divides into three broad categories: the
Operating (OP) System Interfaces and Executive Programs, the program-
ming of the Storage and Retrieval Subsystem, and the generation of vari-
ous Application Programs. The manufacturer of the computer usually
provides an operating system that will always include an assembler for the
machine language code, one or more compilers, and a mode of operation
which may involve some degree of time-sharing or on-line job submission.
The use of the operating system and, in particular, programming in the
assembly language may require some specialized knowledge that would
be the responsibility of this group leader to provide. In addition, an execu-
tive central program and a query language interpreter for the information
system may be required. The Storage and Retrieval Subsystem is the
basic executive program that generates computer files, updates them, and
retrieves information from them in response to calls from the System
Executive. Procedures that are to be performed on individual records as
they are retrieved or on subfiles of records that may be retrieved are called
application programs. These may be quite varied and numerous at the
outset and may continue to accumulate even after the system goes into
operation.
The third supervisor is concerned with Organizational Interfaces.
Basically he is a person who makes up for imperfections that nearly always
exist in the Final System Design. That is, he is fully in contact with the
personnel in the organization who will be using the system, and who have
been involved with all of the preceding steps in the implementation
process. He can therefore make on-the-spot or ad hoc interpretations of
the specification and design documents for the programmers. In addition
he is responsible for Staffing and Training, the manual Input Operation
for file construction, and for insuring that all of the required services are
being met by the system as it is being implemented. This most particularly
pertains to the generation of the applications programs.
[Fig. A2. System Implementation Management (organization chart): a System Development Manager over three supervisors, for Hardware Procurement (Processor and Mass Storage; Communications and Terminals; Film, Printing, Duplicating), for Software Development and Implementation (Op System Interfaces; Storage and Retrieval System; Applications Programs), and for Organizational Interfaces (Services; Input Operations; Staffing and Training).]
APPENDIX B

AUTOMATIC CLASSIFICATION AND ITS APPLICATION TO THE RETRIEVAL PROCESS

THE APPLICATION OF the digital computer to document classification offers a new approach to this problem. Instead of looking to an a priori classification of knowledge, every document can be indexed by subject heading
descriptors (as deeply as is considered necessary), and the entire vocabulary
of descriptors can then be classified in accordance with actual descriptor
usage within the collection of document descriptions. The inherent advan-
tages are (1) the classification can automatically be restructured in response
to changing technology and philosophy as reflected in the new associations
of descriptors as used in document descriptions, (2) new descriptors and
modified meanings of old descriptors are readily accommodated, (3) the
cross-referencing within the document collection becomes more systematic
and thorough, and (4) using the classification hierarchical breakdown,
the user may make a query more specific or alternatively more general
as indicated by an excess or lack of available responses.
In the computer classification scheme to be presented here, the docu-
ment descriptions, rather than an a priori division of knowledge, will
serve as the basis for the classification. Each collection of documents thus
creates its own ad hoc classification based upon the entire collection of
document descriptions in the library. As new documents are added to the
collection, the classification is reconstructed.
Classification implies a hierarchy or tree-like structure in which sets of
descriptors appear as the nodes of the tree, and each descendant node (and its associated set of descriptors) is in some sense subordinate to, and more specific than, its parent or some higher ancestor node (which is more generic). The sense of this subordinacy will now be fully defined.

186
APPENDIX 187

Assume, as a basic model, the tree structure shown in Fig. B1. In actual practice, the tree may contain any number of levels and may branch
any number of ways at each node. The tree shown here contains four
levels, branches two ways at nodes 1, 1.1 and 1.2, and three ways at
node 1.2.2. The terminal (end) nodes of the tree are 1.1.1, 1.1.2, 1.2.1,
1.2.2.1, 1.2.2.2, and 1.2.2.3. Each node represents some set of descriptors
or key terms from the entire vocabulary, and hence each set of descriptors
is labeled Sk, where k is a node number. Each document is represented by
a set of descriptors Sd and will be associated with exactly one of the
terminal nodes.

Property I of the classification tree.

It is a property of this tree that every document contained in the library for which this tree serves as a classification is described by a set of descriptors (Sd) which is wholly contained within a set of nodes that form a path in this tree, from the apex to a terminal node. That is, a document description may be contained in S1, S1.1, S1.1.2, S1.2.2.3, etc. Note that Sd need not have descriptors from all of the sets in the path.
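Stated operationally, Property I says a document may enter the collection only if its descriptor set is covered by the union of the nodal sets Sk along at least one apex-to-terminal path. A minimal sketch of that test in Python (the tree and descriptor sets below are hypothetical, chosen only to illustrate the check):

```python
def path_nodes(terminal):
    """Node numbers on the path from the apex to a terminal node,
    e.g. '1.2.2.3' -> ['1', '1.2', '1.2.2', '1.2.2.3']."""
    parts = terminal.split('.')
    return ['.'.join(parts[:i]) for i in range(1, len(parts) + 1)]

def satisfies_property_one(doc_keys, node_sets, terminals):
    """Property I: the document's descriptors must be wholly contained
    in the union of the node sets along some apex-to-terminal path."""
    return any(
        set(doc_keys) <= set().union(*(node_sets[n] for n in path_nodes(t)))
        for t in terminals
    )

# Hypothetical descriptor sets for a small two-path tree (illustrative only).
node_sets = {'1': {1}, '1.1': {11},
             '1.1.1': {2, 4, 9}, '1.1.2': {5, 6}}
terminals = ['1.1.1', '1.1.2']
print(satisfies_property_one({1, 4, 11}, node_sets, terminals))  # True
print(satisfies_property_one({1, 4, 5}, node_sets, terminals))   # False
```

The second call fails because descriptors 4 and 5 lie on different paths, so no single apex-to-terminal path covers the description.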

Property II of the classification tree.

It is a second property of this tree that each descriptor will appear only
once within the set of nodes of the path. That is, for instance, in the
path of the nodes 1, 1.1, and 1.1.2, those descriptors that are in node 1
are not in nodes 1.1 or 1.1.2; those that are in 1.1 are not in 1 or 1.1.2,
and those that are in 1.1.2 are not in 1 or 1.1.
This classification differs from those in which the descriptors are as-
signed according to a semantic hierarchy like Physics-Optics-Diffraction
in the DDC Thesaurus. Instead, the descriptors are placed in sets (S1, S1.1, etc.) with no a priori assumptions made about the semantic relationship. However, descriptors in the set S1, say, are used in common with those either in S1.1 and S1.2 or with further descendants of S1.1 and S1.2. Hence, there is something more common or generic about
the usage of the descriptors of S1 than there is about the usage of de-
scriptors in lower level nodes.
Associated with each terminal node is a unit of machine storage called
the cell. Within a cell are stored the document records with descriptors
(keys) that are drawn from nodal sets that lie in paths that terminate
at the terminal node.
It is possible that a document descriptor set, Sd, may lie in a path that has
several terminal nodes to which it can be assigned. A decision would have
to be made as to which terminal node the document is to be assigned. For
example, Sd may be wholly contained in S1 ∪ S1.1; how this decision is
made is discussed later.

[Fig. B1. The Classification Tree: a four-level tree in which the apex node 1 branches to nodes 1.1 and 1.2; node 1.1 to 1.1.1 and 1.1.2; node 1.2 to 1.2.1 and 1.2.2; and node 1.2.2 to 1.2.2.1, 1.2.2.2, and 1.2.2.3. The vertical axis is labeled LEVEL.]

Thus, every document description may be associated with a terminal node, which in turn is associated with a storage
unit called the cell. In practice, the cell might be a cylinder of a disk file,
a series of tracks in a head per track disk, a strip or card of Data Cell
Storage, or even a segment of magnetic tape.

1. Retrieval by a Conjunction of Keys

The tree structure that is thus created is used for document retrieval
in the following way. First the nodes of the tree are numbered canonically,
where the number of fractions represents the level of the node, and each
fraction counts branches from the parent node (which in turn is represented by the next fraction to the left). The nodes of Fig. B1 are so
numbered.
Two tables are to be maintained on the DASD for decoding from a
query description to the cells that must be searched for document records
satisfying this description. The first is called the Key to Node Table and
will be denoted KNT. The second contains a list of all terminal nodes by
canonical number, each with its corresponding cell number or address.
This table will be denoted TNT. The KNT translates from a given key
to all nodes (by canonical number) in which the key appears. Retrieval
in response to a query consisting of a conjunction of keys is effected by
the following algorithm.

1) Using the KNT, form a table that places all nodes with the least
number of digits (fractions) in Column I, all nodes with the next
higher number of digits in Column II, and so forth. In addition,
beside each node number, the key number is placed in parentheses.
For example, assume that the query contains a conjunction of
the Keys 1, 5, 8, and that the KNT indicates that Key 1 is in
nodes 1.1 and 1.3.8.5.2.1, key 5 is in nodes 1.1.2 and 1.3.8.5,
and key 8 is in node 1.3.8. The table formed according to the
above is as follows:
    I         II          III           IV
    1.1 (1)   1.1.2 (5)   1.3.8.5 (5)   1.3.8.5.2.1 (1)
              1.3.8 (8)

2) Using the table a search is made for a path that includes all the
descriptors in the query conjunction.
In the above example a path including the descriptors 8, 5, 1
is found in column II, III, IV. The algorithm for doing this is
strictly combinatorial. All nodes of Column I are compared with
those of Columns II, III, and IV, etc. This process should be
continued until all possible paths are found.

One observation is necessary here before continuing to the next step. Since a node corresponds to a set of keys, the follow-
ing table is possible.
    I          II
    1.2 (7)    1.2.4.1 (4)
               1.2.4.1 (16)

This represents a valid path for the query 4, 7, and 16; hence a
path may contain a repeated node.
3) If no path (containing all of the query descriptors) can be found
in the table, then no such document exists. At this point the user
would have to modify his query either by removing or modifying
one or more of the keys in the conjunction.
If one or more paths do exist then for each path, a search in
the appropriate cells is required. These cells are found in the
following way: For each path select the last node (longest node
number). A search is made using this node number in the
Terminal Node Table (TNT) to find all numbers of the TNT
which contain the decimal corresponding to the last node of the
path. These entries of the TNT indicate the cells that must be
searched.
For example, the path indicated by steps one and two above
would be 1.3.8.5, 1.3.8.5.2.1. The last node (longest node num-
ber) is 1.3.8.5.2.1. Assume that the TNT appears as follows:

Node Number Cell

1.1.3.4.1 18
1.2.1.1 19
1.2.2 20
1.2.4.6.1.7.1 21
1.2.4.6.1.7.2 22
1.3.7.6 23
1.3.7.8 24
1.3.8.5.2.1.1 25
1.3.8.5.2.1.2.1 26
1.3.8.5.2.1.2.2 27
1.3.8.5.2.1.3 28
1.3.9.7 29

The look-up in this TNT indicates that node 1.3.8.5.2.1 is not terminal, but leads to four terminal nodes with associated cell num-
bers 25, 26, 27, and 28, and hence a search must be performed in
these cells (and only these cells) for the required documents.
The cell is then searched for the relevant documents either serially
or by a list structure, depending on its internal organization.*
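The three-step decoding algorithm above reduces to a short program. The sketch below is a Python rendering (not from the text) using the worked example's KNT and TNT; step one's column table becomes a combinatorial search over sets of mutually path-compatible nodes:

```python
from itertools import combinations

def is_prefix(a, b):
    """True if node a lies on the path from the apex down to node b."""
    pa, pb = a.split('.'), b.split('.')
    return pa == pb[:len(pa)]

def search_cells(query_keys, knt, tnt):
    """Steps 1-3: find every path covering all query keys, then use the
    TNT to collect the cells below the deepest node of each path."""
    cand = [(node, key) for key in query_keys for node in knt[key]]
    nodes = sorted({n for n, _ in cand})
    cells = set()
    for r in range(1, len(nodes) + 1):
        for combo in combinations(nodes, r):
            # a valid path: every pair of nodes is ancestor/descendant
            if not all(is_prefix(a, b) or is_prefix(b, a)
                       for a, b in combinations(combo, 2)):
                continue
            if {k for n, k in cand if n in combo} == set(query_keys):
                deepest = max(combo, key=lambda n: len(n.split('.')))
                cells |= {c for t, c in tnt.items() if is_prefix(deepest, t)}
    return cells

# The KNT and TNT of the worked example (keys 1, 5, 8).
knt = {1: ['1.1', '1.3.8.5.2.1'], 5: ['1.1.2', '1.3.8.5'], 8: ['1.3.8']}
tnt = {'1.1.3.4.1': 18, '1.2.1.1': 19, '1.2.2': 20, '1.2.4.6.1.7.1': 21,
       '1.2.4.6.1.7.2': 22, '1.3.7.6': 23, '1.3.7.8': 24,
       '1.3.8.5.2.1.1': 25, '1.3.8.5.2.1.2.1': 26, '1.3.8.5.2.1.2.2': 27,
       '1.3.8.5.2.1.3': 28, '1.3.9.7': 29}
print(sorted(search_cells([1, 5, 8], knt, tnt)))  # [25, 26, 27, 28]
```

Only the path 1.3.8, 1.3.8.5, 1.3.8.5.2.1 covers all three keys, and its deepest node leads, through the TNT, to exactly the four cells named in the text.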

2. Construction of the Classification Tree, Tc

2.1 Construction of an Intermediate Tree, TI

A straightforward approach would be to find groups of documents for which maximum overlap, by some measure, among their descriptors
occurs. These groups would constitute the cells, and the union of all
descriptors in a cell would be the bottom nodes of a tree. Next, pairs (or
triples or higher tuples) of these nodes would be intersected according to
greatest overlap to form the next higher-level nodes, and the descriptors
in the intersection would be included in the higher (parent) node and
deleted from the lower (descendent) node, thus forming the above men-
tioned Terminal Nodes. The parent nodes would then be intersected
according to maximum overlap to form a next higher level, and so forth
up the tree until no further intersections were possible.
This procedure would meet the above objective; however, practical
difficulties involving long processing times and large memory require-
ments would be encountered in such an algorithm. This problem has been
solved by means of a very efficient IBM 7094 computer program; [14]
however, the entire process must take place within the core memory for
practical computation, and the program is thus limited to descriptor
vocabularies of approximately 900 descriptors, which greatly limits its
application to real-life problems, where vocabularies of 10 to 20 thou-
sand descriptors may be encountered. An alternative approach has there-
fore been used and has been programmed on an IBM 7040.[15]
The above difficulties are overcome through a reversal of the process, by constructing a tree TI (as a preliminary to the classification tree Tc) from the top down instead of from the bottom up. In terms of Fig. B1 this means that the procedure starts at S1 and ends at S1.1.1, S1.1.2, etc. The result is a tree that has the same form as that of Fig. B1, but the descriptor sets represented by the nodes of this tree are not the same as those of the classification tree Tc. The tree TI that is produced from the top down is called the Tree of Inclusive Groups. It is shown in Fig. B2. An additional process will be necessary to convert the sets of descriptors at the nodes of TI to corresponding nodes in the tree Tc shown in Fig. B1.

* See Chapter VII, Section 4.


[Fig. B2. The Tree of Inclusive Groups, TI, with the same graph as Fig. B1: node N1 branches to N1.1 and N1.2; N1.1 to N1.1.1 and N1.1.2; N1.2 to N1.2.1 and N1.2.2; and N1.2.2 to N1.2.2.1, N1.2.2.2, and N1.2.2.3.]

2.2 General Description of the TI Construction Process

First the algorithm will be generally described and then defined more
formally. TI is constructed by starting at a single highest-level node of TI, which contains all of the descriptors in the vocabulary (i.e., the union of all descriptors used in document descriptions). In addition, all documents are associated with this topmost node. In Fig. B2 this would be node N1. The descriptors of N1 are then distributed among some number, N, of next (lower) level nodes. This distribution is based upon three basic rules. In Fig. B2, N = 2 at this level, and the descriptors of N1 are divided into nodes N1.1 and N1.2. The three rules for the partitioning are
as follows:

1) Every document description must be represented (i.e., its descriptors must appear) in at least one of the groups (nodes).
2) The number of descriptors in each group should be roughly equal.
3) The distribution of descriptors among the groups is such as to
cause a given document to appear in as few of the groups as
possible, consistent with rule two. (Obviously, if one and three
were observed without two then the solution would be to put
all documents in one group).

When the descriptor groups N1.1 and N1.2 are formed, the documents that are comprised of descriptors from these respective groups are likewise assigned to these nodes. If a document could be assigned to more than one node, then it is assigned to one of the nodes based upon equalization of the documents at a node. Hence, when the process is finished each document will have been assigned to exactly one node, and the union of all document descriptors at a node is the set of descriptors that are now contained in that node. Hence N1.1 would have associated with it a set of documents D1.1, a set of descriptors which is the union of all descriptors in the documents of D1.1 and which we have previously called N1.1 itself; similarly, node N1.2 has the set of documents D1.2, the union of whose descriptors is called N1.2. If D1 is the set of all documents associated with N1, then the following relations hold, according to the above description.

1A) D1.1 ∪ D1.2 = D1
2A) D1.1 ∩ D1.2 = ∅
3A) N1.1 ∪ N1.2 = N1
4A) N1.1 ∩ N1.2 ≠ ∅ (in general)
The tree TI may now be constructed by repartitioning each node, N1.1 and N1.2, into some number, N, of nodes, where this N may or may not be the same as the previous one. Then, both the descriptors of, say, N1.1 and the documents of D1.1 will be distributed among N1.1.1 and N1.1.2
194 FILE STRUCTURES FOR ON-LINE SYSTEMS

and among D1.1.1 and D1.1.2, respectively, according to the three rules.
The same four relations will again hold:

1B) D1.1.1 ∪ D1.1.2 = D1.1
2B) D1.1.1 ∩ D1.1.2 = ∅
3B) N1.1.1 ∪ N1.1.2 = N1.1
4B) N1.1.1 ∩ N1.1.2 ≠ ∅ (in general)
It is rule one, expressed formally as relation (1A) or (1B), that insures
that Tc will have Property I. Furthermore, when it is decided that the
process should stop, with no further partitioning of any of the nodes,
then it will still be the case that, for the terminal nodes of the tree (in
the case of Fig. B2, N1.1.1, N1.1.2, N1.2.1, N1.2.2.1, N1.2.2.2, N1.2.2.3)
each document will be assigned to one and only one node, and that all
documents will be so assigned. Hence the requirement of having each
document assigned to a terminal node of Tc is met because the terminal
nodes of Tc are the same as those of TI except with intersecting descriptors
removed to higher-level nodes.
The decision as to when to stop partitioning a particular node, NK, is
made by determining when the documents in DK and all of their associated
data will fit into the physical memory space designated as a cell.
This completes the general description of the TI construction and indicates how all of the requirements for the construction of Tc and the assignment of documents to cells are met.

2.3 Construction of Tc from TI

The graph of the Tc tree is identical to that of the TI tree; only the nodal descriptor constituencies are different. A given set of sibling nodes of Tc is generated by intersecting the corresponding set of TI nodes and deleting from the respective TI nodes all descriptors in the common intersection. The residues constitute nodes of Tc. The common descriptors
become the parent node. This process is then repeated for the parent
node siblings. The entire process starts at the terminal nodes and proceeds
in the above described manner to the apex.

2.4 Formal Description of the TI Construction Algorithm

The descriptors in a node NK are to be partitioned into N subgroups. The document descriptions of DK are contained on a magnetic tape,
called the input tape. It is also assumed that the descriptors are encoded
as integers. In addition to the input tape one intermediate tape and one
output tape are required. The partitioning is effected by the following
three pass process.

PASS 1
The descriptions are added one at a time, starting with the beginning
of the input tape. Consider the addition of a description, D. The inclu-
sive groups are numbered 1, 2, 3 . . . N.
1) Find the group which contains the most descriptors of D. De-
note this Group i.
2) If there are two or more such groups, select the smaller group
and denote it as Group i.
3) If two or more of these from step two are equal in size, select
arbitrarily the smallest group number. Let this be Group i.
4) Let the number of descriptors in Group i be denoted ni, and the
number of descriptors of D that are not in Group i be denoted ai.
Let e be a non-negative integer 0, 1, 2, ...
Then, if it is the case that

(ni + ai) ≤ (nj + aj) + e

for j = 1, 2, ..., N, j ≠ i, the descriptors of D not in Group i
are added to Group i. Otherwise the descriptors are added to
Group j, where (nj + aj) is a minimum over all j.

PASS 2
1) The Input Tape is rewound.
2) If a description D appears in exactly one group, the descriptors
of D in that group are all flagged (i.e., are essential), and the
description is written onto an Output Tape along with its assigned
group number.
3) If D appears in more than one group, * the description is written
onto the intermediate tape, and no descriptors are flagged. This
is to say that a redundancy exists that may later be eliminated
by eliminating some of the descriptors from one or more groups.

PASS 3
1) The Intermediate Tape is rewound.
2) If the descriptors of D are all flagged within at least one group
then write D onto the Output Tape along with anyone of these
group numbers, and move the intermediate tape to the next
description.
3) If step two is not the case, then select that group with the fewest
unflagged descriptors in D; flag them, and write D onto the Out-
put Tape along with the selected group number.
4) When the entire intermediate tape has been passed, eliminate all
unflagged descriptors.
* Note that in accordance with rule one, every description will be placed in at least one
group by PASS 1.

The Output Tape can now be sorted by the assigned group numbers.
That is, all descriptions appearing in Group 1 are collected into a block,
those in Group 2 into a block, etc.
If a document description appears in more than one block, it is re-
moved from all blocks except the shortest in which it resides.
Each set of documents DK, corresponding to a node NK, now sits in a single block, having been sorted, and is ready for repartitioning if the documents of DK and their associated data still exceed the cell capacity.
Once a TI has been so produced it should be possible to add new
descriptions to the file and update the classification without rerunning the
entire tape.
This incremental process can be accomplished by considering the new
descriptions to be an intermediate tape (like that produced by Pass 2),
and then running it through Pass 3 using the existing groups and the
corresponding e values that formerly produced them.
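The three-pass process lends itself to a compact program. The sketch below is a Python rendering (not from the text) with the tapes replaced by in-memory lists; tie-breaking follows steps two and three of Pass 1, and it reproduces the worked example of Section 3 (Tables B1 and B2):

```python
def partition(docs, n_groups, e=0):
    """Three-pass partitioning of document descriptions into inclusive
    groups (tapes replaced by in-memory lists)."""
    groups = [set() for _ in range(n_groups)]
    # PASS 1: add each description where it overlaps most, subject to
    # the (ni + ai) <= (nj + aj) + e balance test of step four.
    for desc in docs:
        d = set(desc)
        i = max(range(n_groups),
                key=lambda g: (len(d & groups[g]), -len(groups[g]), -g))
        cost = lambda g: len(groups[g]) + len(d - groups[g])
        if all(cost(i) <= cost(j) + e for j in range(n_groups) if j != i):
            groups[i] |= d
        else:
            groups[min(range(n_groups), key=cost)] |= d
    # PASS 2: flag essential descriptors; redundant descriptions go to
    # the (in-memory) intermediate tape.
    flagged = [set() for _ in range(n_groups)]
    output, intermediate = [], []
    for desc in docs:
        homes = [g for g in range(n_groups) if set(desc) <= groups[g]]
        if len(homes) == 1:
            flagged[homes[0]] |= set(desc)
            output.append((desc, homes[0]))
        else:
            intermediate.append(desc)
    # PASS 3: place each redundant description where the fewest new
    # flags are needed; unflagged descriptors are then eliminated.
    for desc in intermediate:
        d = set(desc)
        homes = [g for g in range(n_groups) if d <= groups[g]]
        g = min(homes, key=lambda h: len(d - flagged[h]))
        flagged[g] |= d
        output.append((desc, g))
    return flagged, output

# The 14-document file of Table B1, partitioned into two groups for e = 0.
docs = [[1, 2], [1, 3, 4], [1, 3, 6], [1, 5, 6], [2, 4, 11], [2, 3],
        [4, 13, 14], [4, 11], [6, 7, 8], [9, 10], [10, 11], [9, 11],
        [11, 12], [13, 14, 15]]
g1, g2 = partition(docs, 2)[0]
print(sorted(g1))  # [1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(sorted(g2))  # [1, 2, 3, 4, 6, 13, 14, 15]
```

Tracing it by hand, descriptions 1 2 and 9 10 land on the intermediate tape after Pass 2, and Pass 3 eliminates the unflagged descriptors 9 and 10 from group two, exactly as Table B2 (b) and (c) show.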

3. An Illustrative Example of TI and Tc Construction

The descriptors in the file of documents shown in Table B1 are to be classified according to the principles of inclusive partitioning in order to produce a TI. (Descriptors are represented as integers.)

Table B1
A FILE OF DOCUMENTS

Document         Document
 Number         Description

    1            1  2
    2            1  3  4
    3            1  3  6
    4            1  5  6
    5            2  4  11
    6            2  3
    7            4  13  14
    8            4  11
    9            6  7  8
   10            9  10
   11           10  11
   12            9  11
   13           11  12
   14           13  14  15

The first or top level of the TI contains one group consisting of all 15
descriptors. The first-level group will be broken into two inclusive groups
by the three-pass process. Each second-level group will then be divided
into two inclusive groups to form a third level.
In Table B2, (a) shows the partition of the first-level inclusive group
into 2 groups for e = 0 after Pass 1. Pass 2 in Table B2 (b) creates an Intermediate Tape of redundant descriptions, and Pass 3 in Table B2 (c) eliminates the unessential descriptors 9 and 10 from group two, thus producing the final partition.

Table B2
PARTITION OF FIRST LEVEL INTO TWO INCLUSIVE GROUPS, FOR e = 0

   (a) Pass 1        (b) Pass 2           (c) Pass 3
  GROUP 1   2      Intermediate Tape     GROUP 1   2
        1   1           1  2                   1   1
        2   3           9 10                   2   3
        5   4                                  5   4
        6   6                                  6   6
        4   2                                  4   2
       11  13                                 11  13
        7  14                                  7  14
        8   9                                  8  15
       10  10                                 10
        9  15                                  9
       12                                     12

Table B3 shows the tape blocks that are produced by Passes 2 and 3, in which the descriptions from Table B1 are written along with a designation of the group to which each has been assigned.

Table B3
TAPE CONTAINING DESCRIPTIONS BLOCKED ACCORDING TO THE
INCLUSIVE GROUPS OF TABLE B2 (c)

  Block 1         Block 2
  1  2            1  3  4
  1  5  6         1  3  6
  2  4  11        2  3
  4  11           4  13  14
  6  7  8         13  14  15
  9  10
 10  11
  9  11
 11  12

Each of these blocks is then partitioned into two groups by the three-
pass process to produce a third level. The resulting tree is shown in
Fig. B3.
Tc may now be generated from TI by intersecting the terminal nodes of TI, deleting the common descriptors from these nodes, and placing them at a common (parent) higher-level node. The intersections continue until an entire Tc is generated, as shown in Fig. B4.
Note that the Tc of Fig. B4 possesses both Properties I and II.
[Fig. B3. A Three-Level TI for the Descriptions of Table B1: the top node contains all of the descriptors 1 through 15; the second-level nodes are {1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12} and {1, 2, 3, 4, 6, 13, 14, 15}; the terminal nodes are {1, 2, 4, 9, 10, 11, 12}, {1, 5, 6, 7, 8}, {1, 2, 3, 4, 13, 14}, and {1, 3, 6, 13, 14, 15}, whose documents constitute Cells 1 through 4, respectively.]

[Fig. B4. Tc, the Classification Tree Generated from the TI of Fig. B3: S1 = {1}, S1.1 = {11}, S1.2 = {3, 13, 14}, S1.1.1 = {2, 4, 9, 10, 12}, S1.1.2 = {5, 6, 7, 8}, S1.2.1 = {2, 4}, S1.2.2 = {6, 15}; the document-to-cell assignments are those of Fig. B3.]

Table B4
THE KNT FOR FIG. B4

Key        Nodes

 1         1
 2         1.1.1, 1.2.1
 3         1.2
 4         1.1.1, 1.2.1
 5         1.1.2
 6         1.1.2, 1.2.2
 7         1.1.2
 8         1.1.2
 9         1.1.1
10         1.1.1
11         1.1
12         1.1.1
13         1.2
14         1.2
15         1.2.2

The Key to Node Table (KNT) for Fig. B4 appears in Table B4, and
the Terminal Node Table (TNT) appears in Table B5.

Table B5
THE TNT FOR FIG. B4

Terminal Node     Cell

1.1.1             1
1.1.2             2
1.2.1             3
1.2.2             4

4. Conclusion

Automatic classification can perform a number of functions in an information system. It may be used solely to generate cell assignments for records in order to increase the retrieval efficiency of Cellular file structures. It may be used in conjunction with the tree Tc, which is actually a classified thesaurus, and the KNT and TNT, as an
efficient mechanism for determining which cells to search in response
to Boolean key expressions. These same tables can be used as auto-
mated browsing and query formulation aids for the user. For example,
if the user addressed the query (1 AND 3) * to the file of Table
• 1 and 3 are descriptor numbers.

B1, the system would identify the Tc path (1)-(1.2) in Fig. B4, and
indicate that a search of 5 documents in 2 cells would be required. The
user could then request to see a list of more specific keys that also lie
in the same path, whereupon the system would display the keys (2, 4) from
node S1.2.1 and (6, 15) from node S1.2.2. A selection from either or both
of these sets would narrow the request in the only ways possible within
this collection. That is, selections from nodes S1.1.1 or S1.1.2 are ruled
out because of Property I of the tree, namely, that all descriptors in a
file document must lie within a path of the tree. Similarly, the user
could ask for the other descriptors at the lowest level of his query,
namely, node S1.2, whereupon he might see a more suitable key than 3 to
combine with 1, or he may want to form a disjunction with 3 in order to
make his query more inclusive. Finally, in the opposite sense, the user
may request to see a higher-level node in order to make his search more
generic by including more of the tree.
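The cell-determination mechanism described above can be sketched
directly from Tables B4 and B5 (an illustrative Python fragment, not
part of the original text; the function names are my own). For a
conjunctive query, a cell qualifies only if every key of the query has
a node somewhere on the path to that cell's terminal node, which is
Property I at work:

```python
# KNT (key -> nodes) and TNT (terminal node -> cell) from Tables B4, B5.
KNT = {
    1: ["1"], 2: ["1.1.1", "1.2.1"], 3: ["1.2"], 4: ["1.1.1", "1.2.1"],
    5: ["1.1.2"], 6: ["1.1.2", "1.2.2"], 7: ["1.1.2"], 8: ["1.1.2"],
    9: ["1.1.1"], 10: ["1.1.1"], 11: ["1.1"], 12: ["1.1.1"],
    13: ["1.2"], 14: ["1.2"], 15: ["1.2.2"],
}
TNT = {"1.1.1": 1, "1.1.2": 2, "1.2.1": 3, "1.2.2": 4}

def path_nodes(terminal):
    """All nodes on the path from the root to a terminal node,
    e.g. '1.2.1' -> {'1', '1.2', '1.2.1'}."""
    parts = terminal.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}

def cells_for_conjunction(keys):
    """Cells whose root-to-terminal path contains a node of every
    query key (Property I: a document's descriptors lie on one path)."""
    cells = set()
    for terminal, cell in TNT.items():
        path = path_nodes(terminal)
        if all(any(node in path for node in KNT[k]) for k in keys):
            cells.add(cell)
    return cells

print(sorted(cells_for_conjunction([1, 3])))   # -> [3, 4]
```

For the query (1 AND 3) this yields Cells 3 and 4, the 2 cells and 5
documents cited in the example above.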
APPENDIX C

As DESCRIBED in Chapter II, auxiliary or indirect access magnetic stor-
ages are of three types: (1) fixed head (head per track) disks or drums,
(2) movable head disks, drums, or disk packs, and (3) magazine-stored
magnetic strips or cards. All computer manufacturers have adopted the
use of either or both of the first two device types; three, IBM, RCA,
and NCR, manufacture and use the third type, and some of the others
will interface this third device type to their own equipment. A chart has
been prepared for each of five computer manufacturers showing a repre-
sentative sample of device types (i.e., from among these three cate-
gories). All other manufacturers, not represented here, have similar
equipment, and what these charts most apparently illustrate is the
similarity in equipment lines, although certain companies do have what
may be regarded as a more uniquely identifiable DASD philosophy. For
example, Burroughs is presently committed to fixed head disk storage
through all speed and capacity ranges. IBM, on the other hand, is
presently committed to fixed head drums for its highest speed devices,
and predominantly to disk packs for both intermediate and large capacity
storage, with somewhat lower random access speed. IBM also produces the
Data Cell, one of the third category devices, for mass storage (400
million bytes) at lower cost and access speed, and UNIVAC uses movable
head drums for mass storage.
In the first column of the chart the manufacturer's model number of
the equipment is given. The second column indicates to which of the
three device classes the equipment belongs. The third and fourth columns
contain the rotation time and the approximate average head positioning (or
card positioning, in the case of type 3 devices) time in milliseconds. The


random head access time is not applicable to head per track devices, as
indicated by a bar. The fifth column contains the serial data transmission
rate from the device to the channel, in kilobytes or kilocharacters per
second. The sixth column indicates the modularity for those devices that
have removable packs, cartridges, cells, or magazines, giving the storage
capacity in megabytes for the storage module. The seventh column contains
the track capacity in 8-bit bytes or 6-bit characters. The eighth column
presents the capacity of one entire storage unit, in megabytes (or
megacharacters); sometimes the unit is the same as the module, as in the
case of the IBM 2311 disk pack, and sometimes it is an aggregate of
modules, as in the case of the IBM 2314 or the RCA mass storage. The
ninth column contains the number of units that can be placed on a single
controller or channel. The tenth and last column indicates whether the
hardware automatically performs a read and verification after a write,
without necessitating another revolution. In those devices without this
feature, it is customary to read on a subsequent rotation and program-
verify whatever was written. This means that an additional rotation must
be added for every DASD write in file update formulations such as those
in Table 26 of Chapter VIII.
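The cost of that extra revolution is easy to model. The sketch below
(illustrative Python, not from the book; the function and parameter
names are my own) computes the average per-write overhead: head
positioning plus half-revolution latency, plus one full extra
revolution for a programmed read-back on devices without hardware
verification:

```python
def avg_write_and_verify_ms(rotation_ms, positioning_ms, hw_verify):
    """Average overhead per DASD write, excluding data transfer time:
    positioning + average rotational latency, plus one extra revolution
    for a programmed read-back when the hardware does not verify."""
    latency = rotation_ms / 2.0            # average rotational latency
    overhead = positioning_ms + latency
    if not hw_verify:
        overhead += rotation_ms            # read back on the next revolution
    return overhead

# IBM 2311 disk pack (25 ms rotation, 75 ms average positioning,
# no hardware read-after-write verification):
print(avg_write_and_verify_ms(25, 75, False))   # -> 112.5
```

For the 2311 the programmed verification adds 25 ms, about a 29 per
cent increase over the 87.5 ms that positioning and latency alone
would cost.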
SAMPLE OF BURROUGHS DIRECT ACCESS STORAGE DEVICES*

         DASD      Rotation  Ta    Data Trans.  Modularity  Track Capacity     Unit Cap.  Units/Controller  Hardware Read after
Model    Type      (ms)      (ms)  Rate (KB/S)  (MB)        (Bytes)            (MB)       (Channel)         Write Verification

9370-1   H/T Disk  34        --    292          --          100 B/sector;      1          2                 Yes
                                                            variable no. of
                                                            sectors per track
9372-15  "         40        --    218          --          "                  10-50      10                Yes
9375-0   "         46        --    377          --          "                  100        10                Yes
9375-2   "         80        --    216          --          "                  100        20                Yes
9375-3   "         120       --    395          --          "                  100        40                Yes

* All devices for use with B2500 and B3500 Processors. Comparable devices
  available for B5500, B6500, B7500, and B8500.
SAMPLE OF HONEYWELL DIRECT ACCESS STORAGE DEVICES

       DASD              Rotation  Ta    Data Trans. Rate  Modularity  Track Capacity  Unit Cap.  Units/      Hardware Read after
Model  Type              (ms)      (ms)  (K Char/Sec)      (m Char)    (Char)          (m Char)   Controller  Write Verification

155    Disk Pack         35        100   147               1.84        4602            3.68       2           No
       (Movable Head)
258    "                 25        65    208               4.6         4602            4.6        8           No
259    "                 25        80    208               9.2         4602            9.2        8           No
261    (Movable Head)    25.8      78    190               --          9216            150        8           No
262    "                 25.8      78    190               --          9216            300        4           No
273    Disk Pack         25        50    208               18.4        4602            18.4       8           No
       (Movable Head)
275    "                 25        50    208               18.4        4602            147.2      1           No
278    "                 25        50    416               35          8760            280        1           No
SAMPLE OF IBM DIRECT ACCESS STORAGE DEVICES

       DASD              Rotation  Ta    Data Trans.  Modularity  Track Capacity  Unit Cap.  Units/      Hardware Read after
Model  Type              (ms)      (ms)  Rate (KB/S)  (MB)        (Bytes)         (MB)       Controller  Write Verification

2301   H/T Drum          17.5      --    1,200        --          20,483          4.09       4           No
1301   Movable Head      34        120   70.1         --          2,160           56         5           No
       Disk
2311   Disk Pack         25        75    156          7.25        3,625           7.25       8           No
       (Movable Head)
2314   "                 25        75    312          29.175      7,294           233.4      8           No
2321   Magnetic Strip    50        400*  55           40          2,000           400        8           No
       (Data Cell)
       Cell Store
       (Movable Head)

* Strip access: 350 ms with the previous strip restored; 550 ms with the
  previous strip not restored.
SAMPLE OF RCA DIRECT ACCESS STORAGE DEVICES

        DASD              Rotation  Ta    Data Trans.  Modularity  Track Capacity  Unit Cap.  Units/Controller  Hardware Read after
Model   Type              (ms)      (ms)  Rate (KB/S)  (MB)        (Bytes)         (MB)       (Channel)         Write Verification

567-16  H/T Drum          16.6      --    333          --          5161            8.26       1                 No
567-8   "                 16.6      --    333          --          5161            4.13       1                 No
564     Disk Pack         25        85    156          7.25        3625            7.25       8                 No
        (IBM 2311)
568     Magnetic Card,    60        508   70           67          2048            536.8      8                 Yes
        Magazine Store
        (Movable Head)
SAMPLE OF UNIVAC DIRECT ACCESS STORAGE DEVICES

          DASD            Rotation  Ta    Data Trans.  Modularity  Track Capacity  Unit Cap.  Units/Controller  Hardware Read after
Model     Type            (ms)      (ms)  Rate (KB/S)  (MB)        (Bytes)         (MB)       (Channel)         Write Verification

FH880     H/T Drum        34        --    360*         --          6,144**         4.7**      8                 Yes
FH432     "               8.5       --    1,440*       --          4,096**         1.6**      8                 Yes
FASTRAND  Movable Head    70        57    153.5*       --          10,752**        132**      8                 Yes
II        Drum
8411      Disk Pack       25        75    156          7.25        3,625           7.25       8                 No
          (Movable Head)
8410      Disk Cartridge  50        110   85           3.2-12.8    16,000          1.6***     8                 No
          (Movable Head)

* KC/S
** Characters
*** Per cartridge surface; may be turned over for another 1.6 MB
APPENDIX D

DISCUSSION TOPICS AND PROBLEMS FOR SOLUTION

1. Show how the Model of an Information System in Fig. 1 applies to
a library.
2. Configure a series of Information Systems using the components of
Table 1. Rank them by cost and describe their relative capabilities.
3. Explore both the differences and similarities of an Information and
a Communication system, beyond that which is done in Chapter I.
4. Discuss the hardware-software trade-offs in achieving the EDP
random system, as defined in Chapter I.
5. By obtaining a configurator from a computer manufacturer's sales-
man, construct and price a hardware configuration for Figs. 3, 5,
and 6.
6. List the advantages and disadvantages of present-day head per track
versus moveable head Disk Pack. Can certain manufacturers be
identified with one approach versus the other?
7. Given the book's definition of information structure and file organi-
zation, what do you think is the role of each in the total system
design?
8. At what level of discussion are the hierarchic and associative infor-
mation structures different? Must they be differentiated at all levels?
9. Give other examples of hierarchic information structures. Can you
think of one in which the network has cycles?
10. What class or classes of users, if any, do you think should be aware
of the system's file organization?


11. Describe the processing sequence required to respond to the
following query in real time.
DATA CONDITIONS
Al (SAL. LT. 7500)
A2 (AGE. BTW. 20, AVERAGE [AGE] )
A3 (POSN = SALES)
PROCESSING
FILE = A 1 AND A2 AND A3
12. What verbs or functions, in addition to those given in Chapter IV,
would be useful in a task oriented query language?
13. Which directory decoder do you think is the most efficient with
respect to:
i) Programming ease
ii) Decoding speed
iii) Versatility
iv) Update speed.
Give reasons.
14. How could the randomizer be applied to single key, variable length
records without using the indirect addressing level of Fig. 32?
15. Derive a decoding tree formula based upon the parameters of Table 6
and the time to find a key in sequence that would indicate the trade-
off between physical record size (Ct - Cr) and the depth of the tree.
16. Why does a disappear from Formula 9 in Chapter V?
17. How can a hierarchic information structure be implemented by a
purely associative file organization?
18. Describe an alternative approach to relocating expanded records after
an update.
19. An alternative to space brooming is space maintenance by the forma-
tion of a list of available space. Discuss one or two implementations
of this approach and compare it with space brooming.
20. Design a program to control the "Logic Processing" block of Fig. 40
in order to process a query function in:
i) Disjunctive Normal Form
ii) Conjunctive Normal Form
iii) Any arbitrary factored form.
21. Design an economical, yet highly effective storage and retrieval sys-
tem for an individual who files 5 to 10 papers (letters, reports,
memoranda) per day. At what volume would your system lose cost
effectiveness?
22. Design and flow chart the batch processing system described in con-
junction with Table 8, Chapter VII.
BIBLIOGRAPHY

1. Stone, Philip J. An Introduction to the General Inquirer: A Com-
puter System for the Study of Spoken or Written Material. New
York: Simulmatics Corp., 1962.
2. Miller, George A. Penink, a Computer Program and Documentation
written for the IBM 7040. Philadelphia: University of Pennsylvania,
1964.
3. Ledley, Robert S. Digital Computer and Control Engineering. New
York: McGraw-Hill Book Co., 1960.
4. Jonker Information Systems (Manual). Gaithersburg, Maryland:
Jonker Corporation, 1967.
5. Hsiao, D., and Prywes, N. S. "A System to Manage an Information
System," F.I.D./I.F.I.P. Conference 1967 on Mechanized Informa-
tion Storage, Retrieval and Dissemination. Rome, Italy: June, 1967.
6. IDS/COBOL General Electric Information System, CPB-144.
August, 1966.
7. Newell, Allen. Information Processing Language V Manual. Engle-
wood Cliffs, New Jersey: Prentice-Hall, Inc., 1961.
8. Perlis, A. J., and Thornton, Charles. "Symbol Manipulation by
Threaded Lists," Communications of the ACM, Vol. III, No. 4
(April, 1960), 195-204.
9. Weizenbaum, J. "Knotted List Structures," Communications of the
ACM, Vol. V, No. 3 (March, 1962), 161-165.
10. Prywes, N. S., and Gray, H. J., et al. "The Multi-List Type Associa-
tive Memory," Proc. of Symposium on Gigacycle Computing Sys-
tems. AIEE Publication No. S-136 (January, 1962), 87-107.
11. Landauer, W. I. "The Balanced Tree and Its Utilization in Infor-
mation Retrieval," Transactions on Electronic Computers of the
IEEE, Vol. EC-XII, No. 5 (December, 1963).
12. Johnson, L. R. "An Indirect Chaining Method for Addressing on
Secondary Keys," Communications of the ACM, Vol. IV, No. 5
(May, 1961).
13. Needham, R. M., and Sparck Jones, K. "Keywords and Clumps:
Recent Work on Information Retrieval at the Cambridge Language
Research Unit," Journal of Documentation, Vol. XX, No. 1 (March,
1964).
14. Wolfberg, Michael S. "Determination of Maximally Complete Sub-
graphs," Moore School Report No. 65-27, prepared under Contract
Nonr 551(40) (May, 1965).
15. Angell, Thomas. "Automatic Classification as a Storage Strategy for
an Information Storage and Retrieval System," Unpublished Master's
Thesis, University of Pennsylvania, 1966.

INDEX

Abstracting, 3, 12 Coordinate index, 10


Access, 30, 31, 34 Cost/performance evaluation, 182
Access key, 57, 58 CRT, 4, 7, 8, 64, 68, 81, 132, 177
Access time, 111, 141, 152 Cylinder, 31, 33, 93, 124, 139, 140
Accession number, 57, 58, 107, 147 Data Cell, 28, 29, 32, 33, 143, 154,
ALGOL, 67 175, 180, 189
Ambiguity, 117, 121 Data reduction, 2, 3, 4, 6
Analog, 7 DDC Thesaurus, 187
Associative, 9, 43, 44, 45, 46, 55, 58, Decision making, 80
60,89 Decoder, 22, 23, 24, 25, 40
Associative memory, 46 Decoding, 19, 80, 88, 92, 96, 106,
Automatic classification, 143, 186, 200 108, 153
Automonitor, 15, 150 Decoding speed, 122
Background, 21 Decoding time, 122, 124, 125
Balanced tree, 92, 118 Descriptors, 3
BCD, 106 Detail, 46, 48, 50, 51, 52, 53, 54, 55,
57,58
Bidirectional, 127, 157, 169
Dewey, 9
Boolean, 39, 67, 132, 148
Dialogue, 85
Branching factor, 111
Digital, 7
Browsing, 4 Directory, 12, 13, 19, 20, 22, 23, 24,
Cell, 90, 91, 133, 136, 137, 139, 140, 39, 40, 77, 79, 83, 85, 86, 89, 90,
143, 152, 179, 187, 189, 191 91, 95, 97, 98, 99, 102, 106, 108,
Cellular multilist, 137, 141, 147
Cellular partition, 90, 91, 137 136, 137, 147, 148, 150, 153, 154,
156, 158, 163, 165, 169, 171, 172,
Cellular serial, 141, 152, 153, 172, 174, 176, 177, 180
173, 174, 176, 177, 178, 179
Disk Pack, 33, 34, 122, 124, 154, 172
Chain, 48, 49,50, 111, 116, 117, 150,
165 Document, 1, 11, 186, 187, 189, 191,
193, 196
Chaining, 107, 108
Document file, 2, 3, 4, 7, 8, 12
Classification, 9, 11, 186, 191, 198
Classification system, 4 Efficiency, 39
Clumps, 143 Executive, 15, 16, 18, 19, 20, 21, 25,
Clutch, 60, 62, 65 82, 83, 85, 180, 184
COBOL, 24, 67 Executive program, 13
Command and Control, 5 File generation, 16, 22, 150
Command and control system, 69 File maintenance, 21
Console, 19, 22, 24 File organization, 37, 42, 43, 48, 54,
Context free, 61 82, 91, 121, 126, 127, 133, 141


File processor, 13, 15 List structures, 88, 126, 136


File partitioning, 16, 39, 90, 91 List structuring, 16, 37, 55, 82
File protection, 70 Logic expression, 77
File regeneration, 162, 170 Logical expression, 24, 85, 129
Fixed head, 124 Logical list, 160, 161
Fixed head disk, 28, 31 Logical record, 48, 50, 51
FORTRAN, 24 Logical space collection, 172
Loop, 70
Garbage collection, 42
Generator, 1, 3, 11, 13 Magnetic card, 28, 143
Graph, 5,45 Magnetic strip, 28, 143
Hash coding, 86, 106 Magnetic tape, 28, 29, 35, 155, 189
Head per track, 30 Maintenance, 42, 155, 170
Head positioning, 122, 124, 125 Man-machine interaction, 75
Hierarchy, 89 Management, 80, 81
Hierarchic, 9, 43, 44, 45, 46, 49, 54 Management Information, 5
Hierarchical, 186 Management information system, 69
Managers, 68
IDS, 45, 49 Mapping, 106
Inclusive group, 191, 196 Master, 46, 48, 50-53, 55, 58
Index, 70 Microfilm, 7
Indexing, 3, 6, 12 Model, 1
Indicative abstract, 3 Moveable head, 124
Indirect address, 107, 108, 118 Moveable head disk, 28, 31, 32
Information system, 1, 6, 26, 60, 68, Multifile, 69
82, 180 Multikey, 39, 40
Information structure, 37, 43, 46, 48, Multilist, 89-91, 126, 127, 129, 131-
59, 60, 82, 108 133, 135-137, 147, 148, 152, 153,
Informative abstract, 3 157, 165, 166, 169, 172-174, 176-
Interactive, 62 180
Interrecord processing, 41, 59, 63, 66, Multiquery, 169
72,74 Multiterminal, 63, 83
Interface, 3
Natural language, 61, 66, 77, 85, 104,
Intrarecord processing, 41, 59, 66, 70, Natural language, 61, 66, 77, 85, 104,
72,79
Nest, 70
Inventory control system, 69
Network, 45, 48
Inverted list, 57, 89-91, 127, 129, 131,
132, 137, 148, 152-154, 156, 165- Node, 187, 189, 190, 191, 193, 194,
169, 172-180 197, 200
On-line executive, 16
Key, 12, 19, 24, 39-41, 55, 56, 59, 62,
63,65,66,67,70,77-79,81,83, 85, Operating system, 18-20, 25, 59, 184
86, 89, 92, 93, 97, 99, 101, 102, 104, Overhead, 117, 127
106-108, 112, 114, 120, 126, 127, Paging, 50
129, 130, 132, 133, 135, 143, 147, Partition, 40
150, 153, 156, 158, 160, 163, 165-
168, 171, 172, 189, 190, 200 Partitioning, 194
Keyboard, 7 Part inventory, 4
Knotted list, 16, 89 Physical space collection, 172
Presearch statistics, 85, 90, 129, 131,
Latency, 30, 34, 96, 124, 125, 152 178
Library, 186, 187 Privacy, 70
Light pen, 7 Qualifier, 40, 41, 55, 56, 58, 59, 63,
Link address, 50-52, 55, 57, 89, 93, 65,70,77,81, 131
95,99, 101, 144, 150, 157, 158, 160, Query interpreter, 20, 22, 23, 25, 85,
165, 172 180
List intersection, 129, 132 Query language, 59, 60, 62, 63, 75,
List structured, 130, 169 180

Query processor, 13, 18, 19 System programmer, 180


Systematic inventory, 4
Random access, 56, 96, 136, 143 Syntax, 61
Random accession, 131, 132, 135
Randomized, 92, 108, 111, 116 Table look-up, 86
Randomizer, 117, 118, 120, 121, 125, Tags, 3
126, 148 Terms, 3
Randomizing, 86, 87, 106 Thesaurus, 200
Real time, 12, 37, 42, 57, 58, 78-80, Threaded list, 16, 55, 89, 126, 129,
83, 130, 139 132, 156
Redundancy, 120 Time sharing, 16, 19, 25, 184
Redundant, 101, 108 Timing, 150, 172, 173, 176, 177
Reference file, 2-4, 7, 10, 12, 13 Trade off, 122
Relational operator, 65 Trailer, 42, 52, 53, 55, 163
Relational operators, 88 Tree, 86, 87, 92, 93, 95-99, 102, 113,
Relevance, 6 114, 116, 118, 121, 122, 124, 148,
Report generation, 41, 75, 81 186, 187, 191, 194, 198, 201
Report generator, 80 Truncation, 92, 93, 95, 98, 104
Requirements determination, 181
Requirements evaluation, 182 Unidirectional, 127, 169
Reserve, 144 Unique, 101, 104
Reserve space, 97, 98 Universal Decimal, 9
Response, 43 Update, 6, 11, 15, 20, 23, 42, 43, 57,
63, 68, 79, 80, 90, 91, 98, 108, 127,
Space brooming, 21, 170 132, 141, 143, 144, 155-157, 159,
Space maintenance, 127 163, 165, 169, 172-174, 176, 178
Subject heading, 10, 12 Usage statistics, 135
Statistic, 24 User, 1-3, 12, 13, 19, 24, 25
System design, 182, 184
System management, 15 Verification, 12
System manager, 90 Vocabulary, 61, 113, 120, 152, 193
