Managing Gigabytes: Compressing and Indexing Documents and Images

By Ian H. Witten, Alistair Moffat, and Timothy C. Bell
Second Edition
MORGAN KAUFMANN, Copyright 1999, 519 pp.
ISBN 1558605703, List Price $62.95

Review by:
S. V. Nagaraj
svn1999@eth.net

This book, published in 1999, is a revised second edition of the version
that first appeared in 1994. The authors have incorporated many changes in
the second edition by updating various chapters to include the latest
developments. The book is concerned with the task of managing large volumes
of text and image data amounting to several gigabytes. The fundamental
problems addressed in the book include compressing such data and indexing
it in order to enable easy search. Many books look at compressing and
indexing as if they were unrelated techniques; this book, however, brings
out the advantages of combining them.

The book is meant for a wide variety of readers including software
professionals, librarians, and distributors of items such as CD-ROMs. The
book has been authored by academics and is therefore well suited for
academic use. It may be used for teaching courses in the area of data
compression and information retrieval at various levels. An instructor's
supplement is also available and includes test questions and review
material for use during teaching.

As a supplement to the book, a system known as mg (short for Managing
Gigabytes) developed by the authors is freely available for download from
the authors' Web site, along with the complete source code in C. The mg
system was developed for use on the Unix platform.

The book has two appendices, one of which is a guide to the mg system,
while the other is a guide to the New Zealand Digital Library (NZDL, a
repository of information freely available on the Web). NZDL makes use of
mg as its kernel. The guide to the mg system describes the installation
process, a sample storage and retrieval session, utilities for database
creation, the process of querying an indexed document collection, managing
images, and a suite of three image compression programs.

The book comprises ten chapters. Chapter 1 presents an overview of the book
and introduces topics such as document databases, compression, indexes,
images of documents and the mg system.

Chapter 2 is about text compression. It discusses topics such as models of
the data that needs to be compressed, including adaptive models. Huffman
codes are discussed along with algorithms and data structures for dealing
with them. Arithmetic coding is discussed along with techniques for
implementing it. Symbol-wise models are introduced and four data
compression techniques based on them are discussed: Prediction by Partial
Matching (PPM), block-sorting compression, Dynamic Markov Compression (DMC)
and word-based compression. Dictionary-based compression models such as
LZ77, LZ78, and the LZW variant of LZ78 are discussed. Synchronization
methods for achieving random access in compressed files are also described.
The performance of various compression techniques is evaluated by using
different types of input data.

Chapter 3 is on indexing, the process of creating indexes so that text can
be searched efficiently by means of keywords. Sample document collections
are used to illustrate the indexing complexities associated with different
types of data. The application of inverted files for enabling indexing is
described. The processes of compressing and indexing inverted files are
also discussed. The performance of various index compression methods is
evaluated. Signature files and bitmaps are described as two other
approaches to indexing. The three indexing methods are then contrasted.

Chapter 4 deals with the subject of querying indexes. Various data
structures to store lexicons are discussed. Disk-based lexicon storage is
described, as is the application of minimal perfect hashing for accessing
the lexicon. Methods for querying when query terms are only partially
specified are discussed. Techniques for processing Boolean queries are
described. Alternatives to Boolean queries are also explored. Interactive
and distributed retrieval techniques are discussed and the retrieval
efficiency of various methods is assessed.

Chapter 5 is concerned with the arduous task of index construction.
Memory-based and sort-based inversion approaches for index construction are
described. The application of compression for index construction is
discussed. This leads to a technique known as compressed in-memory
inversion. The inversion methods are then compared. Techniques for
constructing signature files and bitmaps are discussed in addition to
methods for dealing with dynamic collections.

Chapter 6 is concerned with image compression. The CCITT fax standard and
the JBIG standard for bilevel images are discussed. The GIF and PNG image
formats are described. The FELICS and CALIC lossless image compression
techniques are discussed along with the JPEG-LS (lossless JPEG) technique.
JPEG is a lossy compression standard for continuous-tone images.

Chapter 7 is concerned with textual images (images of texts). Both lossy
and lossless compression techniques are discussed and their performance is
evaluated. The JBIG2 standard for textual image compression is also
described.

Chapter 8 deals with techniques for handling documents in which both text
and images occur together. In such cases, it is worthwhile to differentiate
between text and images.

Chapter 9 is concerned with the implementation-related aspects of various
techniques described in the book. For text compression, the choice of
compression model has a considerable impact. Text compression performance
is explored focussing on effectiveness, decompression speed and the impact
of memory. The implementation of index construction, index compression and
query processing is also discussed.

Chapter 10 speculates on the future of techniques for managing gigabytes.
The huge impact of the Internet and utilities such as digital libraries on
everyday life and the need for better data compression techniques are
emphasized.

The book includes adequate references to the literature that were current
at the time of publication. The handy index aids easy referencing.

Overall, the book is a treasure trove of techniques for compressing and
indexing. It will be handy for those interested in indexing and retrieving
information from large text databases running to several gigabytes. The
style of presentation is laudable and the coverage of topics is impressive.
The authors could have included audio and video compression techniques.
This good book will be very useful for the intended audience.
