Professional Documents
Culture Documents
IJDAR Bagdanov Rectangular
IJDAR Bagdanov Rectangular
– typographical genre, which determines the charac- lyzed in order to characterize visual similarity between
teristics of the various font styles, sizes, and forms of document genres.
emphasis applied to different information components As mentioned above, the intrinsic multiscale nature
– textual genre, which consists of the rules of expres- of documents lends itself well to multiscale analysis tech-
sion, terminology, and rhetorical devices used to ex- niques. Morphological scale-spaces possess conceptual
press linguistic content and practical advantages that make them particularly
suitable in the document image domain. In this section,
Written communication is highly structured, and doc- we first introduce some necessary concepts and terminol-
ument typesetting systems exploit this in organizing log- ogy from mathematical morphology and then describe
ical content into a physical realization of that content the specific extensions we use to measure visual struc-
in geometric layout structures. Document understanding ture in document images.
systems likewise exploit this structure in decomposing We are primarily concerned with scanned, binary doc-
document images into typographically homogeneous re- ument images, and all of our morphological operations
gions before attempting high-level understanding. Just will be defined on subsets of the Euclidean plane, or con-
as genre plays a key role in mediating author/reader stant Euclidean images in morphological parlance. All of
communication, it can play a similar role in document our notation and conventions follow those of Serra [10].
understanding systems. Figure 3 illustrates the concep- The basic operations in mathematical morphology are
tual pipeline of processing steps in a document under- the erosion and dilation. An erosion of image S by struc-
standing system and indicates the genre characterization turing element B is defined in terms of Minkowski sub-
stages inserted along the processing flow. Immediately traction:
after scanning, the visual components of a document’s
genre can be analyzed. After layout analysis, the typo- eB (S) = S−y = S B̌
graphical constructs can then be characterized. Lastly, y∈B
the textual components of genre can be extracted from
the text extracted by an OCR system. All of these compo- where S−y denotes the translate of S by −y, and B̌ de-
nents of genre are finally indexed in a document retrieval notes the reflection of B about the origin. All of the ero-
system. sions we will discuss use symmetric structuring elements,
Note the bidirectional communication between the and we will therefore denote erosion simply as S B.
logical analysis stage and the retrieval system. Logical Dilation is similarly defined in terms of Minkowski addi-
analysis algorithms may utilize genre characterization by tion:
exploiting genre-level similarity with known documents
already indexed in the system. As shown in Fig. 3, this dB (S) = Sy = S ⊕ B
work focuses on the characterization of visual compo- y∈B
nents of document genre. In the following sections, we
detail our approach to characterizing the visual appear- Note that erosion and dilation are dual with respect to
ance of documents. complementation, i.e., S c ⊕ B = (S B)c .
Combinations of erosions and dilations can be con-
structed that perform more elaborate transformations of
3 Granulometries images. The most basic of these are the opening and clos-
ing operations. The opening of an image by structuring
element B is:
The visual appearance of a document is wholly deter-
mined by the foreground and background pixels in the S ◦ B = (S B) ⊕ B
document image. While complete, this representation is
cumbersome due to the enormous semantic gap between and the closing
the sensor space, i.e., the field of pixels acquired by the
document scanner, and the conceptual space in which S • B = (S ⊕ B) B
documents are interpreted. Documents are intrinsically
multiscale, and their multiscale nature is evident in the One of the most useful tools in mathematical morphol-
scales distinguishing characters from words, words from ogy is the granulometry, which is constructed through
textlines, textlines from paragraphs, etc. A multiscale ap- sequential opening of an image.
proach is consequently an obvious choice for a genre char- Formally, a granulometry on P(R × R), where P(X)
acterization technique. Out approach is based on mul- is the power set of X and R is the set of real numbers, is
tiscale decompositions of the background of document a family of operators:
images. We can imagine a document image being de-
composed into a collection of maximal rectangles that Ψt : P(R × R) −→ P(R × R)
“fit” into the background of the image, as shown in
satisfying for any S ∈ P(R × R)
Fig. 1. Such decompositions supply information about
the information-bearing portions of a document or class A1: Ψt (S) ⊂ S for all t > 0 (Ψt is antiextensive)
of documents. In this and the following section, we show A2: For S ⊂ S , Ψt (S) ⊂ Ψt (S ) (Ψt is increasing)
how such decompositions may be constructed and ana- A3: Ψt ◦ Ψt = Ψt ◦ Ψt = Ψmax(t,t ) for all t, t > 0
A.D. Bagdanov, M. Worring: Multiscale document description using rectangular granulometries
Ψt (S) = S ◦ tB
4 Document representation
1
Φ(S, r)
Φ‘(S, r)
0.9
0.5
visual appearance of document images, but rather the
measurements taken on the filtered images Ψx,y (S).
0.4
0.3
0.2
4.1 Rectangular size distributions
0.1
The size distributions and pattern spectra introduced by
0 Maragos [8] have been subsequently extended to multi-
5 10 15 20
r
25 30 35 40 45
variate granulometries [2]. The rectangular size distribu-
a b tion induced by the granulometry G = {Ψx,y } on image
S is:
Fig. 5a,b. Ambiguity in size distributions. All of the images
in a generate the size distribution shown in b. In many cases, A(S) − A(Ψx,y (S)))
univariate size distributions are incapable of capturing all of ΦG (x, y, S) =
A(S)
the degrees of freedom associated with grain sizing and orien-
tation A(X) denoting the area of set X. ΦG (x, y, S) is also a cu-
mulative probability distribution, i.e., ΦG (x, y, S) is the
probability that an arbitrary pixel in S is opened by a
rectangle of size x × y or smaller.
As mentioned in the introduction, documents with re-
gions containing reverse video text, i.e., white text on a
black background, are not thoroughly captured by the
openings Ψx,y . To account for this, we extend the rectan-
(100, 100) gular size distributions downward to include openings of
the foreground. The definition becomes:
A(S)−A(Ψ (S)))
Height
(3, 40)
(70, 40)
A(S)
x,y
if x, y ≥ 0
ΦG (x, y, S) = A(S c )−A(Ψ c
x,y (S )))
(2)
A(S c ) if x, y < 0
1 1
0.8 0.8
0.6 0.6
Φ(x,y)
Φ(x,y)
0.4 0.4
0.2 0.2
0 0
80 80
60 50 60 50
40 40
40 40
30 30
20 20 20 20
10 10
0 0 0 0
Height Height
Width Width
Fig. 7. Example rectangular size distributions for two documents from different genres in our test database. Note the prominent
flat plateau regions indicating regions of stability in the granulometry. These most likely correspond to typographical parameters
such as margin width, interline distance, etc. The size distribution on the left is constructed from the document shown in Fig. 1
and the one on the right from the document used to construct the example openings in Fig. 6
S Dv Dv (S) S yV
fv [x, y] Dv (S)>y
⊗ = −→
gv [x, y]
❄
S yV Dh Dh (S yV ) (S yV ) xH
Fig. 8. Efficient computation of an arbitrary rectangular opening. Distance transforms are used to effectively encode all possible
vertical and horizontal erosions. By thresholding these distance images we can obtain each desired erosion. The ⊗ operator is used
above to indicate the application of the recursive filters described in Eqs. 4 and 5. The first part of the opening, (S yV ) xH,
is illustrated above. The opening is completed by performing the same steps on ((S yV ) xH)c
rectangular granulometries and size distributions in order Next, we can eliminate the need to erode and dilate
to make their computation more tractable. the image by structuring elements increasing linearly in
First, each rectangular opening may be decomposed size. Using linear distance transforms for vertical and
into linear erosions and dilations as follows: horizontal directions we can generate all needed erosions
and dilations for each rectangular opening. The horizon-
Ψx,y (S) = S ◦ (yV ⊕ xH) tal distance transform of an image S is defined as:
= (S (yV ⊕ xH)) ⊕ (yV ⊕ xH)
Dh (S, x, y) = min{∆x | (x ± ∆x, y) ∈ S}
= (((S yV ) xH) ⊕ yV ) ⊕ xH (3)
and the vertical distance transform as:
This eliminates the need to directly open a document
image by rectangles of all sizes. Instead, the opening is Dv (S, x, y) = min{∆y | (x, y ± ∆y) ∈ S}
incrementally constructed by the orthogonal components
of each rectangle, which are increasing linearly in size These transforms can be efficiently performed using the
rather than quadratically. following recursive forward/backward filter pairs defined
A.D. Bagdanov, M. Worring: Multiscale document description using rectangular granulometries
0.03
0.02
0.01
−0.01
−0.02
−0.03
JACM
−0.04
−0.05
−0.06
50
40 70
30 60
50
20 40
30
Topology
10 20
10
0 0
a b c
Fig. 10a–c. Coefficients of the first principal component for a document collection. On the left are shown the coefficients in the
principal eigenvector mapped back into the original feature space of the size distribution (i.e., the same feature space as shown
in the examples given in Fig. 7). On the right, individual openings are interpreted: a shows the original images, b an opening
emphasizing the presence of the Topology logotype, and c an opening emphasizing the differences in margins
16
Table 1. Journals sampled for our dataset along with the ab-
breviations used throughout the experimental results section 14
and the number of articles in each class
12
Journal Abbrev # in class
Int J Netw Management IJNM 132
10 Errata
1
5 PCs
1
5 PCs
1
10 PCs
20 PCs
10 PCs
20 PCs
5 PCs
0.9 0.9
10 PCs
20 PCs
0.8 0.8
0.9
0.7 0.7
0.6 0.6
0.8
0.5 0.5
0.4 0.4
0.7
0.3 0.3
0.2 0.2
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
IJNM JACM
0.6
1 1
5 PCs 5 PCs
10 PCs 10 PCs
20 PCs 20 PCs
0.9 0.9 0.5
0.8 0.8
0.7 0.7
0.4
0.6 0.6
0.5 0.5
0.3
0.4 0.4
0.3 0.3
0.2
0.2
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.2
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.2 0.4 0.6 0.8 1
Table 2. Genre classification results for 30 training samples Figure 12 gives the average precision/recall graphs
per class and various numbers of principal components. Clas- for each document genre in our database. The graphs
sification accuracy is estimated by averaging over 50 experi- were constructed by using each document in a genre as a
mental trials. The PCA is performed independently for each query, ranking all documents in the database against it,
trial and computing the precision at each recall level. These
# PCs individual precision/recall statistics are then averaged to
form the final graph.
Classifier 5 7 10 The graphs in Fig. 12 give a good indication of how
1-Nearest neighbor 94% 95% 98% well each individual genre is characterized by the rectan-
Quadratic discriminant 93% 94% 98% gular size distribution representation and also indicates
Linear discriminant 76% 80% 93% the overall precision and recall for the entire dataset. The
overall precision/recall graph is constructed by averaging
the precision and recall rates over all classes. This graph
For evaluation, a document is considered relevant if it indicates that, on average, 50% of all relevant documents
belongs to the same genre as the query document (i.e., it can be retrieved with a precision of about 80%.
is from the same publication). Note that this definition All of the precision/recall graphs have a characteris-
of relevance does not take into account the existence of tic plunging tail, indicating that there are some queries
visually distinct subclasses within a single publication. where relevant documents appear near the end of the
Precision and recall statistics can be used to measure ranked list. It is illustrative to examine some specific ex-
the performance of retrieval systems. They are defined amples of this phenomenon. Figure 13 gives some exam-
as: ple query images along with the highest-ranked relevant
document returned, excluding the query document itself,
# relevant documents retrieved and the lowest-ranked relevant document returned. In
Precision =
# documents retrieved most cases, these low-ranking relevant documents repre-
# relevant documents retrieved sent pathologically different visual subclasses of the doc-
Recall = ument genre.
# relevant documents
Another interesting phenomenon in graphs of Fig. 12
Rather than computing the overall precision, it is more is the presence of the nonmonotonic regions in the TOPO
useful to sample the precision and recall at several cutoff class. This occurs only when there is a dramatic change
points. For a given recall rate, we can determine what in the recall rate. In this case, the TOPO class contains
the resulting precision is, that is, the number nonrelevant articles with styles similar to those in the other classes.
documents that must be inspected before finding that It is common in the ranked retrieval lists to encounter
fraction of relevant documents. ranges of nonrelevant documents from another class fol-
A.D. Bagdanov, M. Worring: Multiscale document description using rectangular granulometries
Query
High Rank
lowed by a range of TOPO documents. Essentially, the search is currently focused on feature selection strategies
TOPO class does not have as much internal layout con- that also (re-)introduce spatial information into the size
sistency as the other classes. distribution representation.
The effects of noise on rectangular granulometries re-
mains an important open question. All of the document
6 Conclusions images considered in our experiments are clean images,
generated directly from a PDF source. While the feature
We have reported on an extension to multivariate gran- reduction techniques discussed in Sect. 4.3 will compen-
ulometries that uses rectangles of varying scale and as- sate for noise to a certain degree, it is unknown what
pect ratio to characterize the visual content of document precise effect noise will have on the resulting size distri-
images. Rectangular size distributions are an effective butions. A systematic theoretical and experimental in-
way to describe the visual structure of document im- vestigation is still needed.
ages, and with morphological decomposition techniques It should be noted that the techniques presented
they can be efficiently computed. Experiments have also in this paper are not limited solely to visual similar-
shown that rectangular size distributions can be used to ity matching but rather constitute a general approach
discriminate between specific document genres. Further- to multiscale analysis. As such, the granulometric ap-
more, principal component analysis can be used to re- proach may prove useful for applications such as table
duce the dimensionality of multivariate size distributions decomposition, text identification, and layout segmenta-
while preserving their discriminating power. One of the tion. A systematic study of the effects of noise on the
attractive aspects of rectangular size distributions is the representation is essential to establishing the widespread
ability, even under dimensionality reduction, to interpret applicability of the granulometric technique to document
significant features back in the original image space. understanding.
Document retrieval experiments also indicate the ef-
fectiveness of rectangular size distributions for captur-
ing visual similarity of documents. For our document References
database, 50% of relevant documents can be retrieved
with a precision of approximately 80%. 1. Antonacopoulos A (1998) Page segmentation using the
Principal component analysis has proved useful for ac- description of the background. Comput Vision Image Un-
derstand 70(3):350–369
centuating the important features in size distributions. A
2. Batman S, Dougherty ER, Sand F (2000) Heterogeneous
nonlinear PCA approach that maximizes interclass vari-
morphological granulometries. Patt Recog 33:1047–1057
ance while minimizing intraclass variance will certainly 3. Breuel T (2000) Layout analysis by exploring the space
improve both the classification and retrieval results. of segmentation parameters. In: Proceedings of the 4th
We plan to elaborate further on feature selection ap- international workshop on document analysis systems
proaches in the near future. The entire parameter space (DAS’2000), Rio de Janeiro, 10–13 December 2000
for rectangular size distributions is too expensive to sam- 4. Chandler D (2001) Semiotics: the basics. Routledge, Lon-
ple for document images. Feature selection, as opposed don
to feature reduction such as PCA, is more desirable be- 5. Doermann DS (1998) The indexing and retrieval of docu-
cause of this. Feature subsets are also more natural to ment images: a survey. Comput Vision Image Understand
interpret in terms of the original document images. Re- 70(3):287–298
A.D. Bagdanov, M. Worring: Multiscale document description using rectangular granulometries
6. Dougherty ER, Pelz J, Sand F, Lent A (1992) Morpholog- Andrew Bagdanov received his B.S.
ical image segmentation by local granulometric size dis- and M.S. in mathematics and com-
tributions. J Electron Imag 1:46–60 puter science from the University of
7. Haralick RM, Katz PL, Dougherty ER (1995) Model- Nevada, Las Vegas, where he was a
based morphology: the opening spectrum. Graph Models member of the Information Science
Image Process 57(1):1–12 Research Institute. He is currently a
8. Maragos P (1989) Pattern spectrum and multiscale shape Ph.D. student in computer science at
representation. IEEE Trans Patt Analysis Mach Intell the University of Amsterdam, work-
11:701–716 ing in the field of multimedia infor-
9. Matheron G (1975) Random sets and integral geometry. mation analysis. His research inter-
Wiley, New York ests include document understanding,
10. Serra J (1982) Image analysis and mathematical mor- pattern recognition, image processing,
phology. Academic, New York and functional programming languages.
11. Shin CK, Doermann DS (2000) Classification of docu-
ment page images based on visual similarity of layout Marcel Worring received his mas-
structures. In: Proceedings of SPIE Document Recog- ter’s degree (honors) and doctoral de-
nition and Retrieval VII, San Jose, 26–27 January 2000, gree, both in computer science, from,
pp 182–190 respectively, the Free University Am-
12. Vincent L (2000) Granulometries and opening trees. Fun- sterdam (’88) and the University of
damenta Informatica 41(1–2):57–90 Amsterdam (’93), The Netherlands.
He is currently an associate professor
at the University of Amsterdam. His
interests are in multimedia informa-
tion analysis and systems. He leads
several multidisciplinary projects cov-
ering knowledge engineering, pattern
recognition, image and video analysis,
and information space interaction, conducted in close cooper-
ation with industry. In 1998, he was a visiting research fellow
at the University of California, San Diego. He has published
over 50 scientific papers and serves on the program committee
of several international conferences.