Retrieving Imaged Documents in Digital Libraries Based on Word Image Coding

Yue Lu, Li Zhang, Chew Lim Tan


Department of Computer Science, School of Computing
National University of Singapore, Kent Ridge, Singapore 117543
{luy, zhangli,tancl}@comp.nus.edu.sg

Abstract

A great number of documents are scanned and archived in the form of digital images in digital libraries, to make them available and accessible on the Internet. Information retrieval in these imaged documents has become a growing and challenging problem. For this purpose, a word image coding technique is proposed in this paper, and a web-based system for efficiently retrieving imaged documents from digital libraries is described. Some image preprocessing is first carried out off-line to extract word objects from imaged documents stored in the digital library. Then each word object is represented by a string of feature codes. As a result, each document image is represented by a series of feature code strings of its words, which are stored in a feature code file. Upon receiving a user's request, the server converts the query word into a feature code string using the same conversion mechanism as is used in producing feature codes for the underlying imaged documents. Searching is then performed among those feature code files generated offline. An inexact string matching technique, with the ability of matching a word portion, is applied to match the query word with the words in the documents, and then the occurrence frequency of the query word in each corresponding document is calculated for relevance ranking. Preliminary experimental results with some imaged documents of students' theses in the digital library of our university show that the proposed approach is efficient and promising for retrieving imaged documents, with potential applications to digital libraries.

1. Introduction

Modern technology has made it possible to produce, process, store, and transmit document images efficiently. As we look through the documents stored in digital libraries, in particular the digital library of our university, large quantities of them are simply scanned and archived in image form that cannot directly employ the current powerful search engines over text documents. Meanwhile, with the need of making the digital library an online store of books, magazines, student theses, etc., so that users would be able to do a search on a set of keywords and get a list of relevant articles for viewing or printing, it is necessary to develop an efficient information retrieval method to facilitate access to these imaged documents.

In the past years, various ways have been studied to query on imaged documents using physical (layout) structure and logical (semantic) structure information as well as extracted contents such as image features. For example, Worring and Smeulders[1] proposed a document retrieval method employing the information of implicit hypertext structure extracted from original documents. Jaisimha et al.[2] described a system with the ability of retrieving both text and graphics. Appiani et al.[3] presented a document classification and indexing system using the information of document layouts.

However, for those imaged documents where text content is the dominant information, the content-based retrieval approach, in particular querying by words, is still commonly used. It is obvious that the technique of conventional document image processing can be utilized for this purpose. For example, a document image processing system is used to automatically convert a document image to its machine-readable text codes first, and text information retrieval strategies[4] are applied then. Based on this idea, several commercial systems have been developed using techniques of document page segmentation and layout analysis, followed by optical character recognition (OCR). These include the Heinz Electronic Library Interactive On-line System (HELIOS) developed by Carnegie Mellon University[5], Excalibur EFS[6], and PageKeeper[7] from Caere. All these systems require a full conversion of the document images into their electronic representations, followed by a text retrieval. However, OCR performance highly relies on the quality of the scanned images. It deteriorates severely if the image is of poor quality and with high distortion.
It is generally acknowledged that the recognition accuracy requirements for document image retrieval are considerably lower than those for many document processing applications[8]. Motivated by this observation, some methods with the ability of tolerating recognition errors of OCR have been researched recently[9, 10]. Additionally, some methods are reported to improve the retrieval performance by using the OCR candidates[11, 12].

In recent years, there has been much interest in the research area of document image retrieval[13, 14], which avoids the use of OCR for retrieving information from imaged documents. Document image retrieval (DIR) is related to document image processing (DIP), but there are some essential differences between them. A DIP system needs to analyze different text areas in a page document, understand the relationship among these text areas, and then convert them to a machine-readable version using OCR, in which each character object is assigned to a certain class. The main question that a DIR system seeks to answer is whether a document image contains particular words which are of interest to the user, while paying no attention to other unrelated words. In other words, a DIR system provides an answer of 'yes' or 'no' with respect to the user's query, rather than the exact recognition of a character/word like that in DIP. Moreover, words, rather than characters, are the basic meaningful units for information retrieval. In particular, to overcome the problem caused by character segmentation, segmentation-free approaches have been developed to avoid the segmentation step, i.e. the word image matching approach. This approach treats each word as a single, individual entity and identifies it using features of the entire word rather than each individual character. Some document image retrieval methods based on word image matching have been proposed recently by researchers. Therefore, directly matching word images in a document image is an alternative way to retrieve information from the imaged documents.

It is a fact that the information retrieval methods based on the technology of document image processing are still the best so far, because a great deal of research has been devoted to this area, and numerous algorithms have been proposed in the past decades, especially for the research of OCR. However, DIR and DIP address different needs and have different merits of their own. The former, which is tailored for directly retrieving information from document images, could achieve a relatively higher performance in terms of recall, precision, and processing speed. Therefore, DIR which bypasses OCR still has its practical value today.

A lot of efforts regarding document image retrieval research have been reported so far, with applications to word spotting[15, 16, 17, 18], document similarity measurement[19, 20, 21, 22], document indexing[23], summarization[24], etc. Among them, one approach is using particular codes to represent the characters in a document, instead of full OCR. It is virtually a trade-off between computational complexity and recognition accuracy. For example, Spitz described character shape codes for duplicate document detection[21], information retrieval[22], word recognition[25] and document reconstruction[26], without resorting to full character recognition. The character shape codes encode whether or not the character in question fits between the baseline and the x-line or, if not, whether it has an ascender or descender, and the number and spatial distribution of the connected components. Its processing for obtaining character shape codes from document images is very simple, but ambiguity is a problem. Additionally, to get the character shape codes, character cells must be segmented at the first step. It is therefore unsuitable for dealing with words with connected characters, especially heavily touched characters. In the domain of Chinese document image retrieval, He et al.[23] proposed an index and retrieval method based on character codes generated from stroke density.

In this paper, we propose a word image coding scheme for designing a system which is able to retrieve imaged documents in digital libraries. The differences between our work and Spitz's can be summarized as follows: (1) we extract the codes at the word level, rather than at the character level, to dispense with the need to separate connected characters; (2) compared to Spitz's work, the procedure of computing word image codes in our work is more complicated, but results in an advantage of eliminating ambiguity among words.

The present system, based on the proposed word image coding scheme and matching approach, has been tested in an experimental platform developed for the digital library of the National University of Singapore (NUS) with imaged documents generated from past undergraduate theses. The overall results are encouraging.

The remainder of the paper is organized as follows: in the next section, we briefly illustrate the overall framework of our system. The details of our word image coding are presented in Section 3. In Section 4, we describe the process of converting a PDF file containing pages of imaged documents to a feature code file containing the information of its individual words. In Section 5, we elaborate the word matching algorithm exploited in matching the feature codes. In Section 6, we detail the implementation of our experimental system and the test results. Finally, we draw some conclusions and point out future work in Section 7.

2. System Structure Overview

For the sake of explanation, we first briefly introduce the system structure. Figure 1 illustrates the overall system framework and its workflow for retrieving imaged documents in a digital library.
Figure 1. System diagram. [Flowchart omitted: on the client side, the user inputs query words and a logical operation (AND, OR, NOT) through a web interface and views the returned results; on the server side, the imaged document archive (PDF files) is processed off-line by image preprocessing and word object bounding to produce feature code files; an incoming query word is first searched in an index table (Oracle database) and, if unmatched, converted to its code representation and searched in the feature code files, with newly found matches added to the index table; the per-word results are merged according to the AND/OR/NOT operation the user has specified.]

First of all, some image processing is carried out on the document images embedded in the PDF files, including connected component detection, skew estimation and rectification (if applicable), word object bounding, etc. Next, each word object is represented using the word image coding described below. With respect to each PDF file, a feature code file is then constructed, including the URL of the corresponding PDF file and each word's information, such as its location and word image codes. This feature code file is stored on a server, and will be used instead of the original document images for document retrieval. The above processing is performed off-line prior to the user's online query.

On the client side, the user is prompted to input a set of query words through a web interface, in particular an Active Server Page (ASP). Meanwhile, they can choose to perform a logical operation ("AND", "OR", "NOT") among these query words. Once the request is submitted to the server, the server will start processing each query word and merge the results at the end of each step based on the logical operation the user has chosen. Finally, a temporary table that stores all the matching documents with their URLs and the normalized occurrence frequencies of the query words will be returned to the user for display. The user will be able to link to the actual documents for online reading or download them for future reference.

As for the processing of each query word, it is done as follows. First, the server tries to search for the query word in an index table stored in an Oracle database. This index table is used to store information of the words that have previously been searched. Hence, if there are matches, information of the corresponding documents that contain this query word will be retrieved and stored in a temporary table for subsequent merging. This includes the documents' URLs as well as the normalized occurrence frequency of the query word appearing in each of these documents. Otherwise, if no matches are found in the index table, we generate the feature code string of the query word, and then exploit the word matching algorithm to perform a search in the underlying feature code files.

With the purpose of constructing an incremental intelligence system and speeding up the search process, the results of earlier query searches are stored in the index table. If there are newly found matches, the index table will be updated accordingly, i.e. the corresponding query word will be added to the index table together with the current document's name, its URL and the normalized occurrence frequency with which the query word appears in this document.
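As a rough illustration of this per-query flow, the following Python sketch caches previously queried words in a dictionary standing in for the Oracle index table. The helper callables (encode_word, match_word) and the assumed structure of the feature code files are illustrative assumptions, not interfaces of the actual system.

def process_query_word(word, index_table, feature_code_files, encode_word, match_word):
    # index_table caches earlier queries: word -> list of (document_url, frequency)
    if word in index_table:
        return index_table[word]
    query_codes = encode_word(word)              # word image coding of the query word
    matches = []
    for doc in feature_code_files:               # one feature code file per archived PDF
        hits = [w for w in doc.words if match_word(query_codes, w.codes)]
        if hits:
            frequency = len(hits) / doc.word_count   # normalized occurrence frequency
            matches.append((doc.url, frequency))
    index_table[word] = matches                  # newly found matches are cached
    return matches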
3. Word Image Coding

A word object extracted from the document images is represented by word image codes according to its features. The feature employed in our approach is the Left-to-Right Primitive String (LRPS), which is a code string sequenced from the leftmost of a word to its rightmost. The line feature and the traversal feature will be used to extract the primitives of the word image.
A word printed in documents can be in various sizes, fonts, and spacings. When we extract features from word bitmaps, we have to take this fact into consideration. Generally speaking, it is easy to find a way to cope with different sizes. But dealing with multiple fonts and touching characters caused by condensed spacing is still a challenging issue.

For a word image in a printed text, two characters could be spaced apart by a few white columns caused by intercharacter spacing in general, as shown in Figure 2(a). But it is also common that one character overlaps with another by a few columns because of kerning, as shown in Figure 2(b). Worse still, as shown in Figure 2(c), two or more adjacent characters may touch each other due to condensed spacing. This poses a challenge to us to separate such touching characters. We will utilize inexact feature string matching to handle this problem.

Figure 2. Different spacing: (a) separated adjacent characters, (b) overlapped adjacent characters, (c) touched adjacent characters. [Word images omitted.]

3.1. LRPS Feature Representation

A word is explicitly segmented, from the leftmost to the rightmost, into discrete entities. Each entity, called a primitive here, is represented using definite attributes. A primitive p will be described using a two-tuple (σ, ω), where σ is the Line-or-Traversal Attribute (LTA) of the primitive and ω is the Ascender-and-Descender Attribute (ADA). As a result, the word image is then expressed as a sequence P of pi's:

P = <p1 p2 ... pn> = <(σ1, ω1)(σ2, ω2) ... (σn, ωn)>    (1)

where the ADA of a primitive is ω ∈ Ω = {'x', 'a', 'A', 'D', 'Q'}, defined as:

'x': the primitive is between the x-line and the baseline.
'a': the primitive is between the top-boundary and the x-line.
'A': the primitive is between the top-boundary and the baseline.
'D': the primitive is between the x-line and the bottom-boundary.
'Q': the primitive is between the top-boundary and the bottom-boundary.

The definitions of x-line, baseline, top-boundary and bottom-boundary may be found in Figure 3. A word bitmap extracted from a document image already contains the information of baseline and x-line, which is a by-product of the text line extraction in the previous stage.

3.2. Generating Line-or-Traversal Attribute

The generation of the LTA is performed in two steps. We extract the straight stroke line feature from the word bitmap first, as shown in Figure 3(a). Note that only the vertical stroke lines and diagonal stroke lines are extracted in this stage. Then the traversal features of the remainder part are computed. Finally, the features from the above two steps are aggregated to generate the LTAs of the corresponding primitives. In other words, the LTA of a primitive is represented by either a straight line feature (if applicable) or a traversal feature.

3.2.1. Straight Stroke Line Feature A run-length based method is utilized to extract the straight stroke lines from word images. We use R(a, θ) to represent a directional run, which is defined by a set of concatenated black pixels that contains pixel a, along the specified direction θ. |R(a, θ)| is the run length of R(a, θ), which is the number of black points in the run.

The straight stroke line detection algorithm is summarized as follows:
(1) Along the middle line (between the x-line and the baseline), detect the boundary pair [Al, Ar] of each stroke, where Al and Ar are the left and right boundary points respectively.
(2) Detect the midpoint Am of the line segment Al Ar.
(3) Calculate R(Am, θ) for different θ, from which we select θmax as the run direction of the stroke.
(4) If |R(Am, θmax)| is near to or larger than the x-height, the pixels containing Am, between the boundary points Al and Ar, along the direction θmax, are extracted as a stroke line.
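A minimal Python sketch of this run-length based detection is given below. It assumes a binary bitmap indexed as bitmap[y][x] with y growing downwards, tests only the three stroke directions named above (vertical, left-down diagonal, right-down diagonal), and uses 0.9 times the x-height as the "near to or larger than" threshold; these concrete choices are illustrative rather than values taken from the paper.

import math

def detect_stroke_lines(bitmap, x_line, baseline):
    # Scan the middle line between x-line and baseline for stroke cross-sections.
    x_height = baseline - x_line
    mid_y = (x_line + baseline) // 2
    width = len(bitmap[0])
    strokes, x = [], 0
    while x < width:
        if bitmap[mid_y][x]:                        # entered a stroke: boundary pair [Al, Ar]
            left = x
            while x < width and bitmap[mid_y][x]:
                x += 1
            right = x - 1
            mid_x = (left + right) // 2             # midpoint Am
            theta_max, best_len = max(
                ((theta, run_length(bitmap, mid_x, mid_y, theta))
                 for theta in (90.0, 45.0, 135.0)), # vertical and the two diagonals
                key=lambda item: item[1])
            if best_len >= 0.9 * x_height:          # long enough to count as a stroke line
                strokes.append((left, right, theta_max))
        x += 1
    return strokes

def run_length(bitmap, x, y, theta):
    # |R(a, theta)|: length of the black run through (x, y) along direction theta (degrees).
    dx, dy = math.cos(math.radians(theta)), -math.sin(math.radians(theta))
    def march(px, py, sx, sy):
        n = 0
        while (0 <= int(round(py)) < len(bitmap) and
               0 <= int(round(px)) < len(bitmap[0]) and
               bitmap[int(round(py))][int(round(px))]):
            n += 1
            px, py = px + sx, py + sy
        return n
    return march(x, y, dx, dy) + march(x - dx, y - dy, -dx, -dy)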
Figure 3. Primitive string extraction: (a) straight stroke line features, (b) remainder part of (a), (c) traversal TN = 2, (d) traversal TN = 4, (e) traversal TN = 6. [Word images of "unhealthy" omitted.]

As shown in Figure 3, the stroke lines are extracted as in Figure 3(a), while the remainder is in Figure 3(b). According to the direction of a line, it is assigned to one of three basic stroke lines: vertical stroke line, left-down diagonal stroke and right-down diagonal stroke. With respect to these types of strokes, three basic primitives are generated from the extracted stroke lines. Meanwhile, their ADAs can be assigned based on their top-end and bottom-end positions. Their LTAs are respectively expressed as:

'l': vertical straight stroke line, such as that in the characters 'l', 'd', 'p', 'q', 'D', 'P', etc. For a primitive whose ADA is 'x' or 'D', we will further check whether there is a dot over the vertical stroke line. If yes, the LTA of the primitive is re-assigned as 'i'.
'v': right-down diagonal straight stroke line, such as that in the characters 'v', 'w', 'V', 'W', etc.

'w': left-down diagonal straight stroke line, such as that in the characters 'v', 'w', 'z', etc. For a primitive whose ADA is 'x' or 'A', we will further check whether there are two horizontal stroke lines connected with it at the top and bottom. If so, the LTA of the primitive is re-assigned as 'z'.

Additionally, it is easy to detect primitives containing two or more straight stroke lines. They are:

'x': one left-down diagonal straight stroke line crosses one right-down diagonal straight stroke line.

'y': one left-down diagonal straight stroke line meets one right-down diagonal straight stroke line at its middle point.

'Y': one left-down diagonal stroke line, one right-down diagonal stroke line and one vertical stroke line cross at one point, like the character 'Y'.

'k': one left-down diagonal stroke line, one right-down diagonal stroke line and one vertical stroke line meet at one point, like the characters 'k' and 'K'.

3.2.2. Traversal Feature After the primitives based on the stroke line features are extracted as described above, the primitives of the remainder part of the word image are computed based on the traversal features.

To extract the traversal feature, we scan the word image column by column, and the traversal number TN is recorded by counting the number of transitions from black pixel to white pixel, or vice versa, along each column. This processing is not carried out on the part represented by stroke line features as described above. According to the value of TN, different feature codes are assigned as follows.

'&': there is no image pixel in the column (TN = 0). It corresponds to the blank intercharacter space. We treat the intercharacter space as a special primitive. In addition, the overlap of adjacent characters caused by kerning is easily detected by analyzing the relative positions of the adjacent connected components. We can insert a space primitive between them in this case.

If TN = 2, two parameters are utilized to assign a feature code. One is the ratio of the column's number of black pixels to the x-height, κ. The other is its relative position with respect to the x-line and the baseline, ξ = Dm/Db, where Dm is the distance from the topmost stroke pixel in the column to the x-line and Db is the distance from the bottommost stroke pixel to the baseline.

'n': κ < 0.2 and ξ < 0.3
'u': κ < 0.2 and ξ > 3
'c': κ > 0.5 and 0.5 < ξ < 1.5

If TN ≥ 4, the feature code is assigned as:

'o': TN = 4
'e': TN = 6
'g': TN = 8

Then the same feature codes in consecutive columns are merged and represented by one primitive. Note that a few columns possibly have no resultant feature code because they cannot meet the requirements of any eligible feature code described above, which is usually caused by noise. Such columns are eliminated directly.
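The column-wise coding can be sketched in Python as follows, assuming each column is a list of 0/1 pixels with stroke-line pixels already removed and with y growing downwards; the coordinate conventions and the silent dropping of columns that satisfy no rule are assumptions made for illustration.

def traversal_codes(columns, x_line, baseline):
    x_height = baseline - x_line
    codes = []
    for col in columns:
        tn = sum(1 for a, b in zip(col, col[1:]) if a != b)    # traversal number TN
        code = None
        if tn == 0:
            code = '&'                                         # blank intercharacter space
        elif tn == 2:
            ys = [y for y, v in enumerate(col) if v]
            kappa = sum(col) / x_height                        # black pixels vs. x-height
            d_m = ys[0] - x_line                               # topmost stroke pixel to x-line
            d_b = baseline - ys[-1]                            # bottommost stroke pixel to baseline
            xi = d_m / d_b if d_b else float('inf')
            if kappa < 0.2 and xi < 0.3:
                code = 'n'
            elif kappa < 0.2 and xi > 3:
                code = 'u'
            elif kappa > 0.5 and 0.5 < xi < 1.5:
                code = 'c'
        else:
            code = {4: 'o', 6: 'e', 8: 'g'}.get(tn)
        if code is None:
            continue                                           # noisy column: eliminated
        if not codes or codes[-1] != code:                     # merge identical neighbours
            codes.append(code)
    return codes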
As illustrated in Figure 3, the word image is decomposed into one part with stroke lines (as in Figure 3(a)) and other parts with different traversal numbers (refer to Figure 3(c) for TN = 2, (d) for TN = 4 and (e) for TN = 6).

The number of legal combinations of a primitive's two properties, i.e. σ and ω, is limited. For the sake of conciseness, each legal 2-tuple is replaced by one exact letter, as listed in Table 1. Then, the primitive code string is composed by concatenating the above generated primitives from the leftmost to the rightmost. The resultant primitive string of the word image in Figure 3 will be <nmuomuomonomu&Odomn&ceo&oemuOd&ndoOdonomu&y>.
Primitive properties    Coding representation
(o,x)    o
(e,x)    e
(l,x)    m
(c,x)    c
(n,x)    n
(u,x)    u
(v,x)    v
(w,x)    w
(g,D)    g
(i,A)    i
(i,Q)    j
(k,x)    k
(x,x)    x
(y,D)    y
(z,x)    z
(l,A)    d
(l,D)    q
(u,a)    T
(c,a)    P
(o,A)    O
(e,A)    E
(c,A)    C
(v,A)    V
(w,A)    W
(k,A)    K
(x,A)    X
(Y,A)    Y
(z,A)    Z
(e,Q)    Q

Table 1. Primitive properties vs. coding representation

3.3. Postprocessing

To achieve the ability of dealing with different fonts, the primitives should be independent of typefaces. Among various fonts, a significant difference that has an impact on the extraction of the LRPS is the serif, especially for the part expressed by traversal features. Figure 4 gives some examples, in which some fonts have serifs whereas others have not. It is a basic necessity to avoid the effect of serifs in the LRPS representation.

Figure 4. Different fonts. [Word images omitted: the word "health" rendered in Times Roman, Arial, Bookman, Helvetica, Courier, Tahoma, Century and Verdana.]
Our observation shows that a primitive produced by a serif can be eliminated by analyzing its preceding and succeeding primitives. For instance, a primitive 'u' in a primitive subsequence <du&> is normally generated by a right-side serif of characters such as 'a', 'h', 'm', 'n', 'u', etc. Therefore, we can remove the primitive <u> from a primitive subsequence <du&>. Similarly, a primitive <o> in a primitive subsequence <nom> is normally generated by a serif of characters such as 'h', 'm', 'n', etc. Therefore, we can directly eliminate the primitive <o> from a primitive subsequence <nom>.

More postprocessing rules can be used to eliminate the primitives caused by serifs. With this postprocessing, the primitive string of the word in Figure 3 becomes <mumuomnm&dom&ceo&oemd&ndodnm&y>.
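These rewrite rules are easy to express as string substitutions. The Python sketch below implements only the two rules quoted above and applies them until no further change occurs; the rule set of the actual system is larger, so this illustrates the mechanism rather than the full postprocessing.

def remove_serif_primitives(primitive_string):
    # Each rule deletes a primitive that a serif typically introduces.
    rules = [
        ("du&", "d&"),   # right-side serif after a tall vertical stroke ('a','h','m','n','u',...)
        ("nom", "nm"),   # serif-generated 'o' between two x-height strokes ('h','m','n',...)
    ]
    changed = True
    while changed:
        changed = False
        for pattern, replacement in rules:
            if pattern in primitive_string:
                primitive_string = primitive_string.replace(pattern, replacement)
                changed = True
    return primitive_string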
Based on the feature extraction described above, we can give each character a standard primitive string token (PST). For example, the primitive string token of character 'b' is <doc>, and that of character 'p' is <qoc>. Table 2 lists the primitive string tokens of all characters. The primitive string of a word can be generated by synthesizing the primitive string token of each character in the word and inserting a special primitive <&> between them to identify a spacing gap.

Ch  PST        Ch  PST
a   oem        A   WV
b   doc        B   dEd
c   co         C   CO
d   cod        D   dOC
e   ceo        E   dE
f   ndT        F   dOT
g   g          G   COEO
h   dnm        H   dnd
i   i          I   d
j   j          J   ud
k   k          K   K
l   d          L   du
m   mnmnm      M   dVWd
n   mnm        N   dVd
o   coc        O   COC
p   qoc        P   dOP
q   coq        Q   COQC
r   mn         R   dOEO
s   oeo        S   OEO
t   ndo        T   TdT
u   mum        U   dud
v   vw         V   VW
w   vwvw       W   VWVW
x   x          X   X
y   y          Y   Y
z   z          Z   Z

Table 2. Primitive string tokens of characters

Generally speaking, the resulting primitive string of a real image is not as perfect as that synthesized from the standard PSTs of the corresponding characters, due to a multitude of factors such as connection between adjacent characters, noise effects, etc. As shown in Figure 3, the primitive substring with respect to character 'h' is changed from <dnm> to <dom> because of the effect of noise. The touched characters 'al' and 'th' have also affected the corresponding primitive substrings. Inexact matching will be employed in Section 5 to develop robust matching.

3.4. Validity Verification

We use an English dictionary including 25133 commonly used words to evaluate the validity of the proposed word image coding scheme. Each word is represented by its corresponding word primitive token (WPT), generated by aggregating the characters' primitive string tokens described above according to the character sequence of the word, with the special primitive <&> inserted between two adjacent PSTs. For example, the WPT of the word "health" will be <dnm&ceo&oem&d&ndo&dnm>.

The investigation found that each word in the dictionary has a unique coding representation which can be distinguished from the others, although there is ambiguity at the character level, e.g. the PSTs of characters 'l' and 'I' are the same. This shows that there is no ambiguity in our coding scheme, but at the cost of more computational burden compared to Spitz's coding scheme.
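A word primitive token is just the concatenation of the characters' PSTs from Table 2 with the spacing primitive '&' between them. The Python sketch below shows the construction for lower-case words only (the upper-case tokens can be added in the same way) and reproduces the WPT of "health" quoted above.

# Lower-case PSTs from Table 2.
PST = {
    'a': 'oem',   'b': 'doc', 'c': 'co',  'd': 'cod', 'e': 'ceo', 'f': 'ndT',
    'g': 'g',     'h': 'dnm', 'i': 'i',   'j': 'j',   'k': 'k',   'l': 'd',
    'm': 'mnmnm', 'n': 'mnm', 'o': 'coc', 'p': 'qoc', 'q': 'coq', 'r': 'mn',
    's': 'oeo',   't': 'ndo', 'u': 'mum', 'v': 'vw',  'w': 'vwvw', 'x': 'x',
    'y': 'y',     'z': 'z',
}

def word_primitive_token(word):
    # Concatenate the characters' PSTs, separated by the spacing primitive '&'.
    return '&'.join(PST[ch] for ch in word)

assert word_primitive_token("health") == "dnm&ceo&oem&d&ndo&dnm"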
[FileName]
b20169802.pdf
[url]
http://137.132.82.156/ASP/DataStore/p1206dpdf/b20169802.pdf
[NumberOfPage]
114
[NumberOfWord]
34591
<WordInfo>
......
[FeatureCode]
-mnmnm*coc*cod*mum*d*oem*mn-
[x]
1
[PageNo]
12
[Location]
231,1058,256,1129
......
[FeatureCode]
-ceo*d*ceo*mnmnm*ceo*mnm*ndo-
[x]
1
[PageNo]
85
[Location]
198,1422,236,1496
......
</WordInfo>

Figure 5. Code file for a PDF with imaged documents

4. Coding File Generation

With respect to each imaged document file, a corresponding feature code file is generated off-line prior to the online searching procedure. The documents used in our system are scanned from students' theses and are packed in PDF files. Each PDF file has over 100 images in page format. We need to extract each of these page images first for further processing.

Given a page image in a document, a connected component analysis algorithm is first applied to detect all the connected components. Those with too small areas are considered as noise and thereby removed. Furthermore, those with large areas are probably graphics or tables and are therefore eliminated as well. It is presumed that processes such as skew estimation[27] and correction, if applicable, and other image-quality related processing, are performed.

Next, word objects are bounded based on a merging operation on the connected components. The left, top, right, and bottom coordinates of each word bitmap are obtained. Meanwhile, the baseline and x-line locations in each word are also available for the subsequent processing. Extracted word bitmaps with baseline and x-line information are the basic units for the downstream process of word matching, and are represented using primitive strings, as described in the earlier section.

It is worth mentioning that the ascender/descender information of words is very helpful for the word matching. In particular, it can be utilized in coarse matching to speed up the processing. Here we call it the x-feature, which is defined as:

0: there is neither an ascender nor a descender in the word;
1: there is an ascender but no descender in the word;
2: there is no ascender but a descender in the word;
3: there are both an ascender and a descender in the word.

As a result, with respect to the PDF document file, a corresponding feature code file containing the information of all words' primitive codes and locations, x-features, as well as the URL of the PDF file, is generated and stored on the server. Figure 5 gives an example of a feature code file recording the information of a PDF file with 114 pages of images.
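The x-feature can be derived directly from the ADAs of a word's primitives. The Python sketch below assumes that the ADAs 'a', 'A' and 'Q' indicate an ascender and 'D' and 'Q' a descender; the paper defines the four x-feature values but not this exact mapping, so treat it as an illustrative assumption.

def x_feature(primitives):
    # primitives: list of (lta, ada) 2-tuples for one word, ada in {'x','a','A','D','Q'}.
    has_ascender = any(ada in ('a', 'A', 'Q') for _, ada in primitives)
    has_descender = any(ada in ('D', 'Q') for _, ada in primitives)
    # 0: neither, 1: ascender only, 2: descender only, 3: both
    return (1 if has_ascender else 0) + (2 if has_descender else 0)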
5. Word Matching

5.1. Coarse Matching

As the name indicates, coarse matching is essentially a confinement step, which serves the purpose of restricting the number of words to be compared in the feature code files within a narrow limit. This will effectively speed up the word matching algorithm. The main criteria used in coarse matching are the word's x-feature and the number of primitive codes.

At the first step, the x-features are applied. Considering that a query word is possibly a portion of a word in the imaged documents, e.g. the word "unhealthy" should be matched if the query word is "health", the following criteria are used: (1) if the x-feature of the query word is 0, all words in the documents are matched; (2) if the x-feature of the query word is 1, only the words whose x-features are either 1 or 3 are matched; (3) if the x-feature of the query word is 2, only the words whose x-features are either 2 or 3 are matched; and (4) if the x-feature of the query word is 3, only the words whose x-features are 3 are matched. For example, if the query word is "health", any word objects without ascenders (x-feature equal to 0 or 2) will be ruled out.

Next, the number of primitive codes is applied. Suppose the number of the query word's primitive codes is NQ, and that of a word object's codes is NW. The ratio of NW to NQ is considered. If NW/NQ is less than a threshold, say 0.8 in our experiments, the word object will be eliminated from further matching.

Last but not least, the final step is to apply the inexact code string matching algorithm described below to compare the feature codes of the query word and the feature codes of word objects extracted from the original document images.
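The two coarse-matching filters translate directly into a predicate over the query word and a candidate word object, sketched below in Python; the attribute names on the two objects are illustrative, not the data layout of the actual feature code files.

def coarse_match(query, candidate, ratio_threshold=0.8):
    # Filter 1: x-feature compatibility. The query may be only a portion of the
    # candidate word, so the candidate may carry extra ascenders/descenders.
    q, c = query.x_feature, candidate.x_feature
    if q == 1 and c not in (1, 3):
        return False
    if q == 2 and c not in (2, 3):
        return False
    if q == 3 and c != 3:
        return False          # q == 0 matches every candidate
    # Filter 2: the candidate must carry enough primitive codes.
    if len(candidate.codes) / len(query.codes) < ratio_threshold:
        return False
    return True               # survivors go on to inexact code string matching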
5.2. Inexact Code String Matching

An inexact string matching method is applied to measure the similarity between two primitive code strings. The word matching problem can be stated as finding a particular sequence/sub-sequence in the primitive string of a word. The procedure of matching word images becomes a measurement of the similarity between the string A = <a1, a2, ..., an> representing the features of the query word and the string B = <b1, b2, ..., bm> representing the features of a word image extracted from the document. Matching partial words becomes evaluating the similarity between the feature string A and a sub-sequence of the feature string B. For example, the problem of matching the word "health" with the word "unhealthy" in Figure 3 is to find whether there exists the subsequence A = <dnm&ceo&oem&d&ndo&dnm> in the primitive sequence of the word "unhealthy".

In a word image, it is common for two or more adjacent characters to be connected with each other, which is possibly caused by low scanning resolution or poor printing quality. This will result in the deletion of the feature <&> in the corresponding feature string, compared to the primitive string generated from the standard PSTs. Moreover, noise effects will undoubtedly produce substitutions or insertions of features in the primitive string of the word image. The deletion, insertion and substitution are very similar to the course of evolutionary mutations of DNA sequences in molecular biology[28]. Lopresti and Zhou applied the inexact string matching strategy to information retrieval[29] and duplicate document detection[30] for dealing with imprecise text data of OCR results. Drawing inspiration from the alignment problem of two DNA sequences and the progress achieved by Lopresti and Zhou, we apply the technique of inexact string matching to evaluate the similarity between two primitive strings, one from the template image and the other from the word image extracted from the document. From the standpoint of string alignment[31], two differing features that mismatch correspond to a substitution; a space contained in the first string corresponds to an insertion of the extra feature into the second string; and a space in the second string corresponds to a deletion of the extra feature from the first string. The insertion and deletion are the reverse of each other. A dash ('-') is therefore used to represent a space primitive inserted into the corresponding positions in the strings for the situation of deletion. Notice that we use "spacing" to represent the gap in the word image, whereas we use "space" to represent the inserted feature in the feature string. Their concepts are completely different.

For a primitive string A of length n and a primitive string B of length m, V(i, j) is defined to be the similarity value of the prefixes <a1, a2, ..., ai> and <b1, b2, ..., bj>. The similarity of A and B is precisely the value V(n, m). The similarity of two strings A and B can be computed by dynamic programming with recurrences[32]. The base conditions are:

∀i, j:  V(i, 0) = 0,  V(0, j) = 0    (2)

The general recurrence relation, for 1 ≤ i ≤ n and 1 ≤ j ≤ m, is:

V(i, j) = max{ 0,  V(i-1, j-1) + ε(ai, bj),  V(i-1, j) + μ(ai, -),  V(i, j-1) + ν(-, bj) }    (3)

The zero in the above recurrence implements the operation of restarting the recurrence, which makes sure that the unmatched prefixes are discarded from the computation.
Finally, the maximum score is normalized to the interval [0, 1], with 1 corresponding to a perfect match:

S1 = max_{∀i,j} V(i, j) / VA*(n, n)    (4)

where VA*(n, n) is the matching score between the string A and itself. The maximum operation in Equation 4 and the restarting recurrence operation in Equation 2 ensure the ability of partial word matching.

If the normalized maximum score in Equation 4 is greater than a predefined threshold δ, then we recognize that one word image is matched with the other one (or its portion).

On the other hand, the similarity of matching two whole word images in their entirety (i.e. no partial matching is allowed) can be represented by:

S2 = V(n, m) / min(VA*(n, n), VB*(m, m))    (5)

The problem can be evaluated systematically using a tabular computation. In this approach, a bottom-up strategy is used to compute V(i, j). We first compute V(i, j) for the smallest possible values of i and j, and then compute the value of V(i, j) for increasing values of i and j. This computation is organized with a dynamic programming table of size (n + 1) × (m + 1). The table holds the values of V(i, j) for the choices of i and j (see Table 3). The values in row zero and column zero are filled in directly from the base conditions for V(i, j). After that, the remaining n × m sub-table is filled in one row at a time, in the order of increasing i. Within each row, the cells are filled in the order of increasing j.

The entire dynamic programming table for computing the similarity between a string of length n and a string of length m can be obtained in O(nm) time, since only three comparisons and arithmetic operations are needed per cell.
primitive string of the word image “unhealthy” extracted words. In addition, the user can link to the actual document
from a real document image and the word “health” whose
primitive string is generated by aggregating the characters’
primitive string tokens according to the character sequence ASP Interface
of the word, with the special primitive ‘&’ inserted between
two adjacent PSTs. It can be seen that the maximum
Query words
scoring achieved in the table corresponds to the matching
of character sequence “health” in the word “unhealthy”. Original
Results Document Images
The match may be imprecise in the sense that certain
primitives are missing. Some adjacent characters in an
image word may touch each other due to various reasons.
This results in a lesser number of character spacing to be Oracle Database Win2k server
detected, and hence a lesser number of spacing primitives (Index table) (Feature Code Files)
<&> in the primitive string. Let’s check again Table 3.
Figure 6. System components
We can find a best matching track if we backtrack from the
maximum score, as shown in Table 3. Our investigation has
Table 3. Scoring table and missing gap recovery. [The full (n + 1) × (m + 1) dynamic programming table for matching the query word "health" (primitive string dnm&ceo&oem&d&ndo&dnm) against the primitive string extracted from the word image "unhealthy" is omitted; the best matching track, found by backtracking from the maximum score, aligns the query with the "health" portion of "unhealthy".]

6. Experimental Results

6.1. System Implementation

An experimental platform has been implemented for retrieving imaged documents in digital libraries using the proposed word image coding scheme. In general, this system has three main components: an ASP web interface, an Oracle database and a Win2k server, as shown in Figure 6.

Its web interface is developed in Active Server Pages (ASP) hosted on Microsoft Internet Information Server (IIS) 5.0. Operations including the matching of word codes are implemented as COM components using Visual C++. The index table containing information about previously queried words is stored on an Oracle database server, while the document images and their underlying feature code files are stored on a server with Windows 2000. 478 document image files provided by the digital library of our university are included in the test. These document images were scanned from past students' theses and packed in PDF files. Each of them contains about 100 to 200 pages.

Figure 6. System components. [Diagram omitted: the ASP interface passes query words and results between the user, the Oracle database holding the index table, and the Win2k server holding the feature code files and the original document images.]

The ASP web interface is where the user inputs his/her queries and gets a list of retrieved documents as the output, ranked according to the occurrence frequency of the query words in each of these documents. The system supports AND/OR/NOT operations over the set of input query words. In addition, the user can link to the actual document image through the available hyperlinks over the document's name for online reading or download it for future reference. The time used to process each input query is also recorded and shown at the top of the final query result for the user's information. Intermediate processes such as searching the index table and updating the index table will be shown to notify the user if these operations are carried out for a certain set of query input. Figure 7 demonstrates an example of a user's query result.

The Oracle database is used to store an index table that contains corresponding information for all the words that have been queried before. The information available there for each word is the URLs of the matching document images and the occurrence frequency of this word in each of the corresponding documents. Moreover, the Oracle database is also used to store a temporary result table that records all the retrieved documents during the processing of each particular query word. This result table is merged with each subsequent result set and finally ends up as a list of documents that match the whole input query expression. The final result is then retrieved directly from this result table and returned to the user for display.

Finally, the Win2k server is used to store a list of feature code files generated from the original document images. The feature code file generation step is done through some off-line operations with noise removal and skew rectification preprocessing procedures. These files are stored as a database and are used during the word primitive code matching step, which happens when the query word is not found in the index table, that is, the query word has not been queried before. In this case, we need to convert the query word to a feature code string and perform the word code matching operation among the underlying feature code files stored on the Win2k server.

6.2. AND/OR/NOT Operation

The system supports AND/OR/NOT operations over a set of query words. Users are prompted through a web interface to input a set of query words separated by an empty space and then choose to perform an AND/OR operation on them, followed by a set of query words that are not to be included in the resulting documents. The NOT operation is performed after the AND/OR operations, and removes those documents that contain the words specified in the NOT query input box.

6.2.1. AND Operation Generally speaking, if the AND operation is chosen, the system will do as follows. It starts from the first word, processes it to obtain a result table and stores the table temporarily in the Oracle database. It then joins this table with the resulting record set of each subsequent round to obtain a new result table. In this manner, at the end of each round, the result table will store information of the documents that contain all the query words up to now and the multiplication of their corresponding normalized frequencies of appearing in this document. In the end, only those documents that contain all the specified query words will be left in the result table for merging with the subsequent result of the NOT operation.
Figure 7. Search Result. [Screenshot omitted.]

However, in the case where no single document contains all the query words, the search will stop right at the round when either the result table or the current record set is empty. Hence, no time-wasting search will be performed for the subsequent words. The user will be notified that no document is found that matches the query input expression.

To be more specific, let us consider the following query example, say, an AND operation on "approach analysis assembly technique". Suppose there are five underlying documents in total that we will perform our search on, namely 1.pdf, 2.pdf, 3.pdf, 4.pdf and 5.pdf. If "approach" is contained in 1, 2 and 3; "analysis" is contained in 2, 3 and 4; "assembly" is contained in 5; and "technique" is contained in 3; then Figure 8 shows the result table at the end of each round and also the merging process.

As we can see from the figure, both "approach" and "analysis" are contained in 2.pdf and 3.pdf. Therefore, after round 2, the result table will only have two entries. Moreover, the corresponding normalized frequency is the multiplication of the normalized frequencies with which each of these two words appears in this document. Subsequently, we obtain the set of documents that contain "assembly" and join them with the result table after round 2. Since the join operation is performed on the "Documents" field, there is no common file that contains all the first three words. So after round 3, the result table is empty and the search will stop here without further search on the last word "technique". Finally, the user will be informed that no documents are found for the current input query.

Figure 8. AND Operation. [Diagram omitted: the result tables after each round for the query "approach analysis assembly technique"; round 1 keeps 1.pdf, 2.pdf and 3.pdf, the join with the records found for "analysis" leaves 2.pdf and 3.pdf with multiplied frequencies, and the join with the records found for "assembly" (5.pdf only) produces an empty table, so the search stops.]
6.2.2. OR Operation Similarly, for the OR operation, the procedures are the same except for the merging step. In this case, we will do a union instead of a join. That is, the new result table will contain all the documents that appear either in the previous result table or in the current record set. Moreover, the normalized frequency will be the summation of the respective normalized frequencies if two words both appear in the same document. Figure 9 shows the search process for the same example above under the "OR" operation.

Figure 9. OR Operation. [Diagram omitted: for the same example, each round takes the union on the "Documents" field and sums the frequencies of words appearing in the same document, so that after round 4 the result table contains all five documents.]

6.2.3. NOT Operation The NOT operation is carried out after the AND/OR operations. In particular, after the AND/OR operations, we will get a list of matching documents stored in a temporary result table. If the result table is not empty and there are some more query words in the NOT input box, we will go on processing each of these words and remove those common documents that contain the words not to be included. Finally, the result table will contain all those documents that satisfy the user's input query. We can then directly retrieve the corresponding information and return it to the user for display. Figure 10 shows an example of the NOT operation.

Figure 10. NOT Operation. [Diagram omitted: documents containing the NOT words are removed from the result table round by round, leaving only 1.pdf in the example.]
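The round-by-round merging described in Sections 6.2.1-6.2.3 can be summarized by the Python sketch below, where search(word) stands for either an index-table lookup or a feature code file search and returns a mapping from documents to normalized frequencies; the function and attribute names are illustrative, not those of the actual COM components.

def combine(query_words, not_words, mode, search):
    # mode is "AND" or "OR"; NOT words are handled after the AND/OR rounds.
    result = None
    for word in query_words:
        records = search(word)                     # {document: normalized frequency}
        if result is None:
            result = dict(records)
        elif mode == "AND":
            # Join on the document field: keep documents containing every word so far,
            # multiplying their normalized frequencies.
            result = {doc: result[doc] * freq
                      for doc, freq in records.items() if doc in result}
        else:
            # OR: union of documents, summing frequencies when a document already appears.
            for doc, freq in records.items():
                result[doc] = result.get(doc, 0.0) + freq
        if mode == "AND" and not result:
            break                                  # empty join: no document can match, stop early
    result = result or {}
    for word in not_words:                         # NOT: drop documents containing these words
        for doc in search(word):
            result.pop(doc, None)
    return result                                  # documents and frequencies, ranked for display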


6.3. Speed

For the efficiency measure, the elapsed time from when the user specifies the query words until he/she gets the displayed results has been tested and is shown at the top of each query result. There are two scenarios that we need to consider. One is when all the input query words have been queried before, so the word information is already stored in the index table. In this case, the time to process these words is merely that of retrieving several entries from the Oracle database and is therefore trivial (usually less than 0.1 second for each word). An example of this scenario is shown in Figure 7. Here, an OR operation is performed between "simulation algorithm", followed by the NOT operation on "computational approach". The final result gives all the documents that contain either "simulation" or "algorithm", but do not contain "computational" or "approach". Since all the words specified here have been queried before, the time needed is only that of searching the index table and combining the results. In this case, it is 0.34375 second.

The other scenario is when some of the query words do not exist in the index table. In this case, we need to perform feature code matching in the underlying feature code files for each of them. If there are newly found matches, the system will update the index table and keep the user informed as well. This is usually time consuming, since for each word we have to search every underlying file in order to find the matches. The result shows that the time needed for this scenario is relatively longer.
An example of this scenario is shown in Figure 11, where an AND operation is first performed between "computational approach", followed by the NOT operation on "Poincare". Since "Poincare" was not queried before, we need to perform partial word matching among the feature code files. As a result, one document is found containing "Poincare" and is inserted into the index table, as shown in the interface. In this case, the time needed is longer than retrieving directly from the index table: it is 22.375 seconds.

Therefore, the advantage of our system is that it is an incremental intelligence system, i.e., as more users come to use the system, searching the index table alone will probably fulfill the user's request, hence achieving an impressive efficiency.

Figure 11. Another example. [Screenshot omitted.]
7. Conclusions and Future Work

Information retrieval in imaged documents is in urgent need for digital libraries. In this paper, a word image coding technique is proposed for designing an information retrieval system with the ability of dealing with imaged documents stored in digital libraries. An experimental platform has been implemented, and the preliminary test results with the imaged documents of students' theses in our digital library show that the proposed system provides an efficient and promising tool for document image retrieval.

Future work will focus on improving the experimental system to achieve practical usage. The setting of matching scores between different primitive codes is left to future research. How to integrate linguistic knowledge into the present system to improve the retrieval performance will need further investigation in the future. In particular, all of the word objects are included in the feature code file in the present system, which obviously slows down the processing speed. If some meaningful keywords were extracted from them and used to generate the feature code file, not only would the processing speed up greatly, but the retrieval performance would also benefit.

Acknowledgements

This research is jointly supported by the Agency for Science, Technology and Research, and the Ministry of Education of Singapore under research grant R-252-000-071-112/303.
References

[1] M. Worring, A. W. M. Smeulders, Content based internet access to paper documents, Int'l Journal on Document Analysis and Recognition, vol. 1, pp. 209-220, 1999.
[2] M. Y. Jaisimha, A. Bruce, T. Nguyen, DocBrowse: a system for information retrieval from document image data, Proceedings of the SPIE, vol. 2670, pp. 350-361, 1996.
[3] E. Appiani, F. Cesarini, A. M. Colla, et al., Automatic document classification and indexing in high-volume applications, International Journal on Document Analysis and Recognition, vol. 4, pp. 69-83, 2001.
[4] G. Salton, J. Allan, C. Buckley, and A. Singhal, Automatic analysis, theme generation, and summarization of machine-readable text, Science, vol. 264, pp. 1421-1426, 1994.
[5] E. A. Galloway, and V. M. Gabrielle, The Heinz electronic library interactive on-line system: An update, The Public-Access Computer Systems Review, vol. 9, no. 1, 1998.
[6] http://wint.decsy.ru/du/EXCALIB/INDEX.HTM
[7] http://www.caere.com/
[8] K. Taghva, J. Borsack, A. Condit, and S. Erva, The effects of noisy data on text retrieval, Journal of the American Society for Information Science, vol. 45, no. 1, pp. 50-58, 1994.
[9] Y. Ishitani, Model-based information extraction method tolerant of OCR errors for document images, Proc. of the Sixth Int'l Conf. Document Analysis and Recognition, pp. 908-915, 2001.
[10] M. Ohta, A. Takasu, and J. Adachi, Retrieval methods for English text with misrecognized OCR characters, Proc. of the 4th International Conference on Document Analysis and Recognition, pp. 950-956, 1997.
[11] T. Kameshiro, T. Hirano, Y. Okada, F. Yoda, A document image retrieval method tolerating recognition and segmentation errors of OCR using shape-feature and multiple candidates, Proc. of the 5th International Conference on Document Analysis and Recognition, pp. 681-684, 1999.
[12] K. Katsuyama, H. Takebe, K. Kurokawa, et al., Highly accurate retrieval of Japanese document images through a combination of morphological analysis and OCR, Proc. SPIE, Document Recognition and Retrieval, vol. 4670, pp. 57-67, 2002.
[13] D. Doermann, The indexing and retrieval of document images: A survey, Computer Vision and Image Understanding, vol. 70, no. 3, pp. 287-298, 1998.
[14] M. Mitra, and B. B. Chaudhuri, Information retrieval from documents: A survey, Information Retrieval, vol. 2, nos. 2/3, pp. 141-163, 2000.
[15] Y. Lu, C. L. Tan, Word searching in document images using word portion matching, in D. Lopresti, J. Hu, and R. Kashi (Eds.), Document Analysis Systems V, Lecture Notes in Computer Science, vol. 2423, pp. 319-328, 2002.
[16] J. DeCurtins, and E. Chen, Keyword spotting via word shape recognition, Proceedings of SPIE, Document Recognition II (Editors: Luc M. Vincent and Henry S. Baird), vol. 2422, San Jose, California, pp. 270-277, 1995.
[17] S. Kuo and O. F. Agazzi, Keyword spotting in poorly printed documents using pseudo 2-D hidden Markov models, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 8, pp. 842-848, 1994.
[18] A. L. Spitz, Using character shape codes for word spotting in document images, in Shape and Structure in Pattern Recognition, D. Dori and A. Bruckstein (Eds.), World Scientific, pp. 382-399, 1995.
[19] C. L. Tan, W. Huang, Z. Yu, Y. Xu, Imaged document text retrieval without OCR, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 838-844, 2002.
[20] Y. Lu, C. L. Tan, Document retrieval from compressed images, Pattern Recognition, vol. 36, no. 4, pp. 987-996, 2003.
[21] A. L. Spitz, Duplicate document detection, Proceedings of SPIE, Document Recognition IV (L. M. Vincent and J. J. Hull, Eds.), vol. 3027, San Jose, CA, USA, pp. 88-94, 1997.
[22] A. F. Smeaton, A. L. Spitz, Using character shape coding for information retrieval, Proc. of the Fourth Int'l Conf. Document Analysis and Recognition, pp. 974-978, 1997.
[23] Y. He, Z. Jiang, B. Liu, H. Zhao, Content-based indexing and retrieval method of Chinese document images, Proc. of the Fifth Int'l Conf. Document Analysis and Recognition, Bangalore, India, pp. 685-688, 1999.
[24] F. R. Chen, D. S. Bloomberg, Summarization of imaged documents without OCR, Computer Vision and Image Understanding, vol. 70, no. 3, pp. 307-319, 1998.
[25] A. L. Spitz, Shape-based word recognition, Int'l Journal on Document Analysis and Recognition, vol. 1, no. 4, pp. 178-190, 1999.
[26] A. L. Spitz, Progress in document reconstruction, Proc. of the 16th Int'l Conference on Pattern Recognition, vol. 1, pp. 464-467, 2002.
[27] Y. Lu, C. L. Tan, A nearest-neighbor-chain based approach to skew estimation in document images, Pattern Recognition Letters, vol. 24, pp. 2315-2323, 2003.
[28] A. Apostolico, and R. Giancarlo, Sequence alignment in molecular biology, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 47, pp. 85-115, 1999.
[29] D. Lopresti, J. Zhou, Retrieval strategies for noisy text, Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, pp. 255-269, 1996.
[30] D. P. Lopresti, A comparison of text-based methods for detecting duplication in scanned document databases, Information Retrieval, vol. 4, no. 2, pp. 153-173, 2001.
[31] D. Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press, 1997.
[32] R. A. Wagner, and M. J. Fischer, The string-to-string correction problem, Journal of the Association for Computing Machinery, vol. 21, pp. 168-173, 1974.
