Image Modelling by Linear Composition of Shared Shiftable Bases


Anonymous CVPR submission

Paper ID WZ
Abstract

This article proposes a statistical model for image patches of object shapes, in the form of additive composition of a set of linear bases selected from a large dictionary, such as Gabor wavelets at different locations, orientations and scales. The model has the following three features in terms of the selected bases. Sparsity: only a small number of salient bases should be selected to represent any image patch with small error. Commonality: the selected bases for image patches of the same object category should be as common as possible, so that these image patches can be modelled by shared bases. Shiftability: the shared bases are allowed to locally shift their locations, orientations and scales within a limited range to represent each individual image patch of an object category. This model can be used to select the bases and learn their compositions for a training sample of image patches. The computation can be accomplished by what we call the shared sketch by shiftable bases. We show several experiments to illustrate the model and the algorithm.

1. Introduction

1.1. Motivation and foundation

Pattern theory as advocated by Grenander [2] and Mumford [5] postulates a generative model in the form of p(W | Θ) and p(I | W, Θ), where I is the image data, W is the interpretation of I in terms of what is where, and Θ denotes parameters (including structural parameters) in the model. Θ can be learned from training images (with or without the corresponding W). With the learned Θ, computing W for a given image I can be guided by p(W | I, Θ), which tells us what value of W gives the most plausible explanation of I.

A number of generative models for various vision tasks have been proposed in the literature. However, unlike the HMM in speech recognition, there has not been a generic modelling scheme that can serve as the common foundation for different vision tasks. Instead, it is often preferred to target p(W | I, Θ) directly in supervised training, without modelling p(W | Θ) and p(I | W, Θ) explicitly, especially when W is simple, such as W ∈ {face, non-face} in face detection.

While methods targeting p(W | I, Θ) directly have had considerable success, it is still desirable to have an image model with p(I | W, Θ). Even though such a generative model may not synthesize realistic I or provide efficient compression of I, it can still provide a context for explaining the image data, so that explaining-away competition can be carried out to select the most plausible interpretation of a single image, or to discover the most plausible interpretation of a set of images in unsupervised learning.

In our opinion, the image model that comes closest to being the foundation for further developments is the sparse coding model of Olshausen and Field [6], which is a simple linear additive model built directly on the raw image intensities. The model is of the form I = Σ_i c_i Γ_i + ε, where {Γ_i, i = 1, ..., N} is a dictionary of linear bases, c_i are their coefficients, and ε is the error. The size N of the dictionary can be many times larger than the dimensionality of I. The key principle is sparsity. That is, for each typical natural image I, only a small number of the c_i should be significantly different from 0, or in other words, only a small number of bases should be selected from {Γ_i} in order to represent I with small error ε. Formally, one can express the sparsity principle by a regularity function of {c_i}, such as the l1 norm that encourages sparsity, or by assuming that {c_i} follow some probability distribution that concentrates most of its probability mass around 0 but has heavy tails to account for occasionally large values [3]. Olshausen and Field [6] were able to learn a dictionary of localized, elongated and orientated linear bases from natural image patches using this model. These bases resemble Gabor wavelets. Given a dictionary of linear bases, the matching pursuit algorithm of Mallat and Zhang [4] can be used to sequentially select a small number of bases for representing an input image.

The sparsity principle is important for modelling purposes, because it is easier to model low-dimensional structures than high-dimensional ones. However, sparsity alone is clearly inadequate for modelling image patches of different object categories. In this article, we add two more features

to the linear sparse coding model, and develop a statistical model for image patches of objects.

1.2. Commonality and shiftability

The two additional features are commonality and shiftability.

1) Commonality. For image patches of each category, we want the bases selected for these image patches to be as common as possible, so that these image patches can be modelled by shared bases. In terms of probability modelling, we want each selected base to have a very high probability of being turned on. Of course, for different object categories, it is preferred that they do not share many common bases. This feature was inspired by the work of Viola and Jones [8] on the adaboost method for classification. The weak classifiers selected by their method are in the form of projecting the image patches onto Haar bases and thresholding the projection coefficients. The selected Haar bases are used to characterize all the images, and they are selected to maximally tell the face images apart from non-face images. Our method can be considered a generative version of this scheme, where the shared bases are selected to maximally explain all the input image patches. Our method can be easily extended to modelling multiple object categories in either supervised or unsupervised learning, where each category has its own shared set of common bases.

2) Shiftability. Deformation is common in object shapes, even within the same object category and the same pose. The selected bases at fixed locations, orientations and scales may not give optimal representations to all the image patches, even if they are reasonably well aligned. To account for deformation, we must allow the shared bases to shift their locations, orientations and scales within a limited range when representing each individual image patch. This feature was inspired by the work of Riesenhuber and Poggio [7] on the HMAX model, which is a hierarchical bottom-up model for object recognition. In their model, the complex cells in V1 are assumed to compute the local maxima of Gabor filter responses relative to shifts of location and scale. Such a maximum pooling mechanism makes the responses of the complex cells invariant to limited deformation as well as to changes in scale, pose, etc. In our method, this maximum pooling is incorporated in our shared sketch algorithm for fitting multiple training images simultaneously, where the shared bases can shift to the maximally tuned locations, orientations and scales in representing each individual image patch.

Figure 1 illustrates commonality and shiftability. There are three 82 × 164 training image patches of cars. The dictionary of linear bases consists of Gabor wavelets at 12 different orientations with a fixed scale. These Gabor wavelets can be centered at any pixel within the domain of the image patches. In this figure, each Gabor wavelet is represented symbolically by a bar of the same position, orientation and length. In the first row, the left figure displays all the 61 selected bases. The right figure displays the bases that are shared by all the three images. The selected Gabor wavelets have little overlap between them and they are well connected, so the symbolic representation is essentially a "sketch" or line-drawing of the input images. The right figure in the first row can be considered the common sketch "averaged" over all the three training images. Clearly, most of the selected bases displayed in the left figure of the first row are common to all the three images.

For the remaining three rows, the left figure displays the input training image, and the right figure displays the bases that are actually used to represent the input image on the left. Most of these bases are shifted versions of the shared bases displayed in the right figure of the first row. Clearly, these shared bases can shift their locations and orientations to represent each individual image.

Figure 1. Top row: the left figure displays all the 61 selected bases; the right figure displays the bases shared by all the three images. 2nd to 4th rows: the left figure displays the 82 × 164 training image, and the right displays the bases used for representing the image on the left. A Gabor wavelet is represented symbolically by a bar at the same location, with the same orientation and length.

It is worth noting that the first training image has strong edges in the background. However, these edges are not shared by the other two images, so no Gabor wavelets are selected to represent them. Also, the car in the second image has a moderately different pose than the other two cars, but it can still be represented by the shared bases after shifting. Figure 2 shows another example for three training images of horses.

The rest of the article is organized as follows. Section 2 gives the technical details of the model and the algorithm. Section 3 describes a number of experiments. Section 4 concludes with a brief discussion.

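To make the sparse coding picture of Section 1.1 concrete, the following is a minimal NumPy sketch of matching-pursuit-style greedy selection for a single image. It is only an illustration of the generic scheme referenced above (Mallat and Zhang [4]), not the procedure proposed in this paper; the function name, the stopping rule and the parameter values are arbitrary example choices.

```python
import numpy as np

def matching_pursuit(image, dictionary, max_bases=50, min_gain=1e-3):
    """Greedily select bases so that I ~ sum_i c_i Gamma_i + residual.

    image:      (d,) vectorized image patch I
    dictionary: (N, d) array whose rows are unit-norm bases Gamma_i
    Returns the list of (index, coefficient) pairs and the residual.
    """
    residual = image.astype(float).copy()
    selected = []
    for _ in range(max_bases):
        responses = dictionary @ residual            # <residual, Gamma_i>
        i = int(np.argmax(np.abs(responses)))        # most salient base
        c = responses[i]
        if c ** 2 <= min_gain * (image @ image):     # negligible gain: stop
            break
        residual = residual - c * dictionary[i]      # explain away this base
        selected.append((i, c))
    return selected, residual
```

Each step picks the base with the largest absolute response to the current residual and subtracts its contribution, which is the "explain-away" competition mentioned above in its simplest single-image form.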
Figure 2. The size of the training images is 118 × 157. 67 bases are selected in total. See the caption of Figure 1 for explanation.

2. Model and Algorithm

2.1. Gabor bases: edges and spectrum

A Gabor function [1] is of the following form:

G(x) = (1/(2π σ_1 σ_2)) exp{−(1/2)(x_1²/σ_1² + x_2²/σ_2²)} e^{i x_1},   (1)

where x = (x_1, x_2) ∈ R², and i = √−1. We can translate, rotate, and dilate the function G(x) of (1) to obtain a general form of Gabor wavelets: G_{y,s,θ}(x) = G(x̃/s)/s², where x̃ = (x̃_1, x̃_2), x̃_1 = (x_1 − y_1) cos θ − (x_2 − y_2) sin θ, and x̃_2 = (x_1 − y_1) sin θ + (x_2 − y_2) cos θ. The central frequency of G_{y,s,θ} is (cos θ/s, sin θ/s).

For an image I(x), the projection coefficient of I onto G_{y,s,θ}, or the filter response, is r_{y,s,θ} = ⟨I, G_{y,s,θ}⟩ = ∫ I(x) G_{y,s,θ}(x) dx. The local energy |r_{y,s,θ}|² is large if there is an edge or bar structure at y with scale s and orientation θ. The marginal average of local energy E[|r_{y,s,θ}|²] within an image region estimates the average power spectrum around the central frequency of G_{y,s,θ}.

2.2. Model: composing shared and shiftable bases

To model a sample of training images {I_m, m = 1, ..., M}, we can select a small set of bases {B_k, k = 1, ..., K} from a large dictionary {Γ_i, i = 1, ..., N}, such as Gabor bases at different locations, orientations and scales, with K ≪ N. In order to represent each image I_m using the selected bases {B_k}, we need to define two sets of variables to account for commonality and shiftability.

Representing commonality: α_m = (α_{m,k}, k = 1, ..., K). α_{m,k} = 1 if B_k is used to represent I_m, i.e., B_k is turned on for I_m; α_{m,k} = 0 if B_k is not used to represent I_m, i.e., B_k is turned off for I_m.

Representing shiftability: δ_m = (δ_{m,k}, k = 1, ..., K). δ_{m,k} is defined only if α_{m,k} = 1, and it denotes the shifting of B_k in location, orientation and scale in representing I_m. As to the range of δ_{m,k}, the base B_k can shift its location along its normal direction within a range of a small number of pixels, and for each shifted location, the base can also shift its orientation within a small range of angles. We may also allow the base to change its scale. For notational convenience, we may simply denote B_{k+δ_{m,k}} as the base that is actually used to represent I_m.

Representing linear composition: given {B_k} and {α_m, δ_m},

I_m = Σ_{k: α_{m,k}=1} c_{m,k} B_{k+δ_{m,k}} + ε.   (2)

Let c_m = (c_{m,k}, k = 1, ..., K). In order to model {I_m}, we need to select {B_k}, and model {α_m, δ_m, c_m}. Let us start from the simplest possible model, and then generalize it later on.

Modelling commonality: α_{m,k} ∼ Bernoulli(a) independently across k = 1, ..., K, where a is the probability that B_k is turned on for I_m. a can be close to 1.

Modelling shiftability: [δ_{m,k} | α_{m,k} = 1] follows a uniform distribution over Δ, where Δ denotes the allowed range of shift.

Modelling orthogonal composition: We assume that the bases {B_{k+δ_{m,k}}, α_{m,k} = 1} that are used to represent I_m have little overlap in the spatial domain, or if they do overlap in the spatial domain, they have little overlap in the frequency domain, i.e., in orientation and scale. Thus they have little correlation between themselves, and we may assume that these bases are orthogonal to each other.

For clarity, we use matrix notation in what follows. Suppose I_m is defined on a lattice D with |D| pixels. We can vectorize I_m as a |D| × 1 vector. For every Gabor basis Γ_i in the dictionary {Γ_i, i = 1, ..., N}, we can vectorize it according to the same order of vectorization. We normalize every Γ_i to have unit l2 norm, so ‖Γ_i‖² = 1. We normalize each I_m to have unit marginal variance, so ‖I_m‖² = |D|. This is a sensible thing to do to filter out the effect of overall lighting variation.

Let B_m = (B_{k+δ_{m,k}}, α_{m,k} = 1), i.e., B_m is a |D| × K_m matrix whose columns are the B_{k+δ_{m,k}} with α_{m,k} = 1, where K_m = Σ_k α_{m,k}. Note that B_m is completely determined by α_m and δ_m, and vice versa. Let R_m = B′_m I_m be the K_m × 1 vector of projection coefficients or filter responses of I_m on the bases B_{k+δ_{m,k}} that are used to represent I_m.

Let B̄_m be a |D| × (|D| − K_m) matrix whose columns are orthonormal and also orthogonal to all the columns of B_m. The columns of B̄_m are the residual dimensions, and R̄_m = B̄′_m I_m is the residual error, with ‖R̄_m‖² = |D| − ‖R_m‖².

We model the components of R_m as having independent uniform distributions over [r_0, r_0 + b], where r_0 is a threshold for the bases to be turned on.
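The Gabor family of Section 2.1 and the responses used in the model above can be instantiated numerically as follows. This is only an illustrative sketch: the lattice size and the bandwidth parameters sigma1, sigma2 are arbitrary example values, and the discrete wavelet is simply renormalized to unit l2 norm as in Section 2.2.

```python
import numpy as np

def gabor(size, y, s, theta, sigma1=0.5, sigma2=1.0):
    """Discrete Gabor wavelet G_{y,s,theta} on a size x size lattice.

    Implements G(x) = exp(-(x1^2/sigma1^2 + x2^2/sigma2^2)/2) * exp(i*x1)
    translated to center y, rotated by theta, dilated by scale s,
    then renormalized to unit l2 norm.
    """
    x1, x2 = np.meshgrid(np.arange(size) - y[0],
                         np.arange(size) - y[1], indexing="ij")
    # rotate, then dilate: x_tilde / s
    xt1 = (x1 * np.cos(theta) - x2 * np.sin(theta)) / s
    xt2 = (x1 * np.sin(theta) + x2 * np.cos(theta)) / s
    g = np.exp(-0.5 * (xt1**2 / sigma1**2 + xt2**2 / sigma2**2)) * np.exp(1j * xt1)
    g /= s**2
    return g / np.linalg.norm(g)      # unit l2 norm, as in Section 2.2

def response(image, base):
    """Filter response r = <I, G>; |r|^2 is the local energy.

    Uses the Hermitian inner product (conjugating the base), a common
    convention for complex filters.
    """
    return np.vdot(base, image)
```

With a vectorized dictionary of such wavelets, the responses R_m of Section 2.2 are just the stacked inner products of an image with the turned-on bases.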
Given R_m, R̄_m has a uniform distribution over the sphere Ω = {R̄_m : ‖R̄_m‖² = |D| − ‖R_m‖²}, so p(R̄_m | R_m) = 1/|Ω|. According to the equipartition principle in information theory, if |D| − K_m is large, this uniform distribution is equivalent to assuming that the components of R̄_m follow N(0, σ_m²) independently, where σ_m² = (|D| − ‖R_m‖²)/(|D| − K_m).

The distribution p(I_m | B_m) dI_m = p(R_m) p(R̄_m | R_m) dR_m dR̄_m. The dimensions are matched, and the Jacobian is 1 because of the orthogonality. Moreover,

log p(R̄_m | R_m) = −(1/2)(|D| − K_m) log(2πe σ_m²).

If |D| is much larger than both K_m and ‖R_m‖², which is the case in object recognition where I_m can be a large image and B_m only models a small patch of it, against a large background in I_m of unit marginal variance, then

−(|D| − K_m) log(σ_m²) ≈ ‖R_m‖².

Likelihood function: Let B = (B_k, k = 1, ..., K) denote all the selected bases. Assuming the above approximation to be exact, the joint distribution is, up to an additive constant,

log p(I_m, B_m | B) = (1/2)‖R_m‖² + λ K_m,

where

λ = log(a/(1 − a)) − [log(|Δ|b) − log(2πe)/2],

in which the first term is the reward for commonality, and the second term is the cost for coding the shift and coefficient of a selected base versus leaving it to the residual dimensions. p(I_m | B) can be obtained by integrating out α_m and δ_m. Using Σ_m log p(I_m | B) as the likelihood, we can learn B from a sample of training images {I_m, m = 1, ..., M} of a certain object category. After learning B, we can use p(I | B) to get the likelihood that a testing image I belongs to the same category.

Usually, p(I_m, B_m | B) is highly peaked, so that we can estimate α_m and δ_m accurately at the posterior modes that maximize p(I_m, B_m | B). So p(I_m | B) can be approximated by p(I_m, B_m | B) with the estimated B_m plugged in.

2.3. Algorithm: shared sketch by shiftable bases

We develop a shared sketch algorithm for selecting the bases B and their shifted versions {B_m}. Similar to the matching pursuit algorithm of Mallat and Zhang [4], the shared sketch algorithm is a greedy one. At each step, it selects a base from the dictionary, and attempts its shifted versions on all the input images simultaneously. If a shifted version is used to represent an image, then this shifted version inhibits all the other overlapping bases from representing the same image. Two bases Γ_i and Γ_j are considered non-overlapping if their correlation ⟨Γ_i, Γ_j⟩ is below a predefined threshold (e.g., .01).

The following is a detailed description of the algorithm. We use the notation i to label the bases in the dictionary {Γ_i, i = 1, ..., N}, and we use k to label the selected bases {B_k, k = 1, ..., K}; i and k run through two distinct sets.

Step 0: Initialization and thresholding. For i = 1 to N, for m = 1 to M, compute r_{m,i} = ⟨I_m, Γ_i⟩. If |r_{m,i}| < r_0, set r_{m,i} = 0. Let k = 1.

Step 1: Attempt shared shiftable fitting for all candidate bases. For i = 1 to N, for m = 1 to M, do the following. If r_{m,i+δ_{m,i}} = 0 for all δ_{m,i} ∈ Δ, set α_{m,i} = 0. Otherwise, let δ̂_{m,i} be the shift with the maximum r_{m,i+δ_{m,i}}, and set α_{m,i} = 1.

Step 2: Select the best fitting base for the shared sketch. For i = 1 to N, compute

L_i = (1/M) Σ_{m: α_{m,i}=1} ((1/2) r²_{m,i+δ̂_{m,i}} + λ).

Let j be the base such that L_j ≥ L_i for i = 1, ..., N. Then let B_k = Γ_j. For m = 1 to M, if α_{m,j} = 1, let α_{m,k} = 1, let δ_{m,k} = δ̂_{m,j}, and let r_{m,k} = r_{m,j+δ̂_{m,j}} (note that k and j belong to two distinct sets).

Step 3: Inhibit overlapping bases. For m = 1 to M, if α_{m,k} = 1, do the following. For i = 1 to N, if Γ_i overlaps with B_{k+δ_{m,k}} = Γ_{j+δ̂_{m,j}}, set r_{m,i} = 0.

Step 4: Stopping criterion. If L_j is below a pre-defined threshold, let K = k and stop. Otherwise, let k = k + 1, and go back to Step 1.

The threshold on L_j corresponds to a prior distribution on K, which prefers small values.

2.4. Nonorthogonality and scale-specific sketch

Before going to the experimental results, we would like to explain two important issues.

Nonorthogonality: For nonorthogonal B_m, let R_m = B′_m I_m be the K_m × 1 vector of projection coefficients. The projection of I_m on the subspace spanned by the bases in B_m is J_m = B_m (B′_m B_m)⁻¹ R_m = B_m C_m. J_m is the reconstructed image, with C_m = (B′_m B_m)⁻¹ R_m being the K_m × 1 vector of least squares reconstruction coefficients, and ‖J_m‖² = R′_m (B′_m B_m)⁻¹ R_m.

Let R̄_m = B̄′_m I_m. Recall that B̄_m consists of orthonormal columns that are orthogonal to the bases in B_m. Given R_m, R̄_m has a uniform distribution over the sphere Ω = {R̄_m : ‖R̄_m‖² = |D| − ‖J_m‖²}, which is equivalent to independent N(0, σ_m²), with σ_m² = (|D| − ‖J_m‖²)/(|D| − K_m). Then

p(I_m | B_m) = f(R_m) f(R̄_m | R_m) |det(B′_m B_m)|^{1/2},

where f(R_m) is the distribution of the responses R_m, and

|det(B′_m B_m)|^{1/2} is the Jacobian of the linear change of variable (R′_m, R̄′_m)′ = (B_m, B̄_m)′ I_m. Thus, up to an additive constant,

log p(I_m, B_m | B) = λ K_m + (1/2)(R′_m (B′_m B_m)⁻¹ R_m + log |det(B′_m B_m)|).

This can be used to modify the shared sketch algorithm. When attempting to add a new base Γ_i to B_m in the algorithm, and computing the changes in R′_m (B′_m B_m)⁻¹ R_m and det(B′_m B_m), we can perform a Gram-Schmidt orthogonalization of Γ_i against B_m. This step can be naturally incorporated into the shared sketch algorithm as a form of soft inhibition that replaces the hard inhibition in Step 3 of the original algorithm.

Scale-specific sketch: In a given image, different patterns may appear at different scales, and these patterns may be organized into hierarchical whole-part relationships. It is therefore desirable to perform separate sketches at different scale ranges or frequency bands, and then organize them into hierarchical relationships, instead of performing a single sketch using Gabor bases across all the scales. Let F be the range of frequencies covered by the Gabor bases within a relatively narrow range of scales. Then these Gabor bases are trying to fit the band-pass image within the frequency band F. Specifically, let I_m = Σ_ω Î_m(ω) exp(iωx), where Î_m is the discrete Fourier transform of I_m. Then these Gabor bases are trying to fit I_m(F) = Σ_{|ω|∈F} Î_m(ω) exp(iωx), while neglecting all the rest of the frequency components outside F.

Recall that we normalize the input image I_m to unit marginal variance. That is, for an input I_m, we compute S² = ‖I_m‖²/|D|, and then change I_m to I_m/S. This amounts to normalizing r_{m,i} = ⟨I_m, Γ_i⟩/S. Because the band-pass I_m(F) is the image that is being fitted, it is more reasonable to normalize I_m(F). Specifically, S²(F) = ‖I_m(F)‖²/|D| = Σ_{|ω|∈F} |Î_m(ω)|²/|F|, where |F| is the number of frequency components in F, so that S²(F) is the average spectrum of I_m within F. So we should normalize r_{m,i} = ⟨I_m, Γ_i⟩/S(F). The average spectrum S²(F) can be estimated by the average of {|⟨I_m, Γ_i⟩|²} over all those candidates Γ_i within this frequency band F.

In this band-pass setting, the dimensionality of I_m(F) is |F|, with its |F| Fourier components Î_m(ω). After projecting I_m onto the selected orthogonal bases B_m to get R_m = B′_m I_m, there are |F| − K_m dimensions left for B̄_m, and R̄_m = B̄′_m I_m belongs to the sphere Ω = {R̄_m : ‖R̄_m‖² = |F| − ‖R_m‖²}. The argument in the previous subsection still goes through.

3. Experiments

4. Discussion

References

[1] J. Daugman, "Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters," Journal of the Optical Society of America, 2, 1160-1169, 1985.

[2] U. Grenander, General Pattern Theory, Oxford Univ Press, 1993.

[3] M. S. Lewicki and B. A. Olshausen, "Probabilistic framework for the adaptation and comparison of image codes," Journal of the Optical Society of America, 16(7), 1587-1601, 1999.

[4] S. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Transactions on Signal Processing, 41, 3397-3415, 1993.

[5] D. B. Mumford, "Pattern theory: a unifying perspective," Proceedings of the 1st European Congress of Mathematics, Birkhäuser-Boston, 1994.

[6] B. A. Olshausen and D. J. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, 381, 607-609, 1996.

[7] M. Riesenhuber and T. Poggio, "Hierarchical models of object recognition in cortex," Nature Neuroscience, 2, 1019-1025, 1999.

[8] P. A. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, 57(2), 137-154, 2004.
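The greedy structure of the shared sketch algorithm of Section 2.3 (Steps 0-4) can be rendered schematically in NumPy. The sketch below deliberately simplifies the paper's setting: bases are indexed by a single integer, a "shift" is an index offset of at most shift_range (with wraparound), inhibition zeroes only the base actually used rather than all bases correlated with it, and the function name and parameter values (r0, lam, L_min) are arbitrary example choices. It illustrates the shared, shiftable, greedy selection loop, not the full procedure.

```python
import numpy as np

def shared_sketch(images, dictionary, r0=0.1, lam=0.5, L_min=0.05,
                  shift_range=1, max_K=20):
    """Schematic shared sketch by shiftable bases (cf. Steps 0-4).

    images:     (M, d) vectorized training images I_m
    dictionary: (N, d) unit-norm bases Gamma_i; neighbouring indices
                play the role of shifted versions (a simplification).
    Returns the list of selected base indices B_k.
    """
    M = images.shape[0]
    N = dictionary.shape[0]
    # Step 0: responses r_{m,i}, thresholded at r0.
    r = images @ dictionary.T
    r[np.abs(r) < r0] = 0.0
    selected = []
    for _ in range(max_K):
        # Step 1: for each (m, i), take the best shifted response in range.
        best = np.zeros((M, N))
        shift = np.zeros((M, N), dtype=int)
        for d in range(-shift_range, shift_range + 1):
            cand = np.roll(r, -d, axis=1)        # response at index i + d
            better = np.abs(cand) > np.abs(best)
            best[better] = cand[better]
            shift[better] = d
        active = best != 0                        # alpha_{m,i} = 1
        # Step 2: L_i = (1/M) sum_{m: active} (r^2/2 + lam); pick the max.
        L = (active * (best**2 / 2 + lam)).sum(axis=0) / M
        j = int(np.argmax(L))
        if L[j] < L_min:                          # Step 4: stopping criterion
            break
        selected.append(j)
        # Step 3 (simplified inhibition): zero out each used shifted base so
        # it cannot be selected again for the same image.
        for m in range(M):
            if active[m, j]:
                r[m, (j + shift[m, j]) % N] = 0.0
    return selected
```

Because the maximum over shifts is taken per image in Step 1, two images that activate slightly displaced versions of the same base still vote for a single shared index, which is the commonality-plus-shiftability behaviour described in Section 1.2.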
