Professional Documents
Culture Documents
A Model of Saliency-Based Visual Attention For Rap
A Model of Saliency-Based Visual Attention For Rap
net/publication/3192913
CITATIONS READS
9,459 10,647
3 authors, including:
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Ernst Niebur on 25 January 2013.
ior and the neuronal architecture of the early primate visual Feature maps
system, is presented. Multiscale image features are combined (12 maps) (6 maps) (24 maps)
into a single topographical saliency map. A dynamical neu- Across-scale combinations and normalization
ral network then selects attended locations in order of decreas-
ing saliency. The system breaks down the complex problem of Conspicuity maps
Attended location
With r, g and b being the red, green and blue channels of the
Arbitrary units
input image, an intensity image I is obtained as I = (r +g +b)=3. Stimulus
I is used to create a Gaussian pyramid I (), where 2 [0::8] N (.)
is the scale. The r, g and b channels are normalized by I in x y x y
order to decouple hue from intensity. However, because hue
variations are not perceivable at very low luminance (and hence y
Arbitrary units
Orientation map
are not salient), normalization is only applied at the locations x
where I is larger than 1=10 of its maximum over the entire N (.)
image (other locations yield zero r; g and b). Four broadly-
tuned color channels are created: R = r ? (g + b)=2 for red, x y x y
G = g ? (r + b)=2 for green, B = b ? (r + g)=2 for blue, and
Y = (r + g)=2 ? jr ? gj=2 ? b for yellow (negative values are set Fig. 2. The normalization operator N (:).
to zero). Four Gaussian pyramids R(); G(); B () and Y ()
are created from these color channels.
Center-surround dierences ( dened previously) between a feature maps are combined, salient objects appearing strongly
\center" ne scale c and a \surround" coarser scale s yield the in only a few maps may be masked by noise or less salient ob-
feature maps. The rst set of feature maps is concerned with jects present in a larger number of maps.
intensity contrast, which in mammals is detected by neurons In the absence of top-down supervision, we propose a map
sensitive either to dark centers on bright surrounds, or to bright normalization operator, N (:), which globally promotes maps
centers on dark surrounds [12]. Here, both types of sensitivities in which a small number of strong peaks of activity (conspicu-
are simultaneously computed (using a rectication) in a set of ous locations) is present, while globally suppressing maps which
six maps I (c; s), with c 2 f2; 3; 4g and s = c + , 2 f3; 4g: contain numerous comparable peak responses. N (:) consists of
(Fig. 2): 1) Normalizing the values in the map to a xed range
I (c; s) = jI (c) I (s)j (1) [0::M ], in order to eliminate modality-dependent amplitude dif-
ferences; 2) nding the location of the map's global maximum
A second set of maps is similarly constructed for the color M and computing the average m of all its other local maxima;
channels, which in cortex are represented using a so-called \color 3) globally multiplying the map by (M ? m)2 .
double-opponent" system: In the center of their receptive eld, Only local maxima of activity are considered such that N (:)
neurons are excited by one color (e.g., red) and inhibited by compares responses associated with meaningful \activitation
another (e.g., green), while the converse is true in the surround. spots" in the map and ignores homogenous areas. Compar-
Such spatial and chromatic opponency exists for the red/green, ing the maximum activity in the entire map to the average over
green/red, blue/yellow and yellow/blue color pairs in human all activation spots measures how dierent the most active loca-
primary visual cortex [13]. Accordingly, maps RG (c; s) are tion is from the average. When this dierence is large, the most
created in the model to simultaneously account for red/green active location stands out, and we strongly promote the map.
and green/red double opponency (Eq. 2), and BY (c; s) for When the dierence is small, the map contains nothing unique
blue/yellow and yellow/blue double opponency (Eq. 3): and is suppressed. The biological motivation behind the design
RG (c; s) = j(R(c) ? G(c)) (G(s) ? R(s))j (2) of N (:) is that it coarsely replicates cortical lateral inhibition
mechanisms, in which neighboring similar features inhibit each
BY (c; s) = j(B (c) ? Y (c)) (Y (s) ? B (s))j (3) other via specic, anatomically-dened connections [15].
Local orientation information is obtained from I using ori- Feature maps are combined into three \conspicuity maps", I
ented Gabor pyramids O(; ), where 2 [0::8] represents for intensity (Eq. 5), C for color (Eq. 6), and O orientation
the scale and 2 f0 ; 45 ; 90 ; 135 g is the preferred orien- (Eq. 7), at the scale ( = 4) of the saliency map. They are
tation [11]. (Gabor lters, which are the product of a cosine obtained through across-scale addition, \", which consists of
grating and a 2D Gaussian envelope, approximate the recep- reduction of each map to scale 4 and point-by-point addition:
tive eld sensitivity prole (impulse response) of orientation- 4 M
M c+4
selective neurons in primary visual cortex [12].) Orientation I= N (I (c; s)) (5)
feature maps, O(c; s; ), encode, as a group, local orientation c=2 s=c+3
contrast between the center and surround scales:
4 M
c+4
O(c; s; ) = jO(c; ) O(s; )j (4) C=
M
[N (RG (c; s)) + N (BY (c; s))] (6)
In total, 42 feature maps are computed: Six for intensity, 12 for c=2 s=c+3
color and 24 for orientation. For orientation, four intermediary maps are rst created by
B. The Saliency Map combination of the six feature maps for a given , and are then
combined into a single orientation conspicuity map:
The purpose of the saliency map is to represent the conspicu- !
ity | or \saliency" | at every location in the visual eld by a X 4 M
M c+4
scalar quantity, and to guide the selection of attended locations, O= N N (O(c; s; )) (7)
based on the spatial distribution of saliency. A combination of 2f0 ;45 ;90 ;135 g c=2 s=c+3
the feature maps provides bottom-up input to the saliency map,
modeled as a dynamical neural network. The motivation for the creation of three separate channels, I ,
One diculty in combining dierent feature maps is that they C and O, and their individual normalization is the hypothesis
represent a priori not comparable modalities, with dierent dy- that similar features compete strongly for saliency, while dif-
namic ranges and extraction mechanisms. Also, because all 42 ferent modalities contribute independently to the saliency map.
3
5 5
1x1 noise patches 1x1 noise patches
# false detections
# false detections
4 5x5 noise patches 4 5x5 noise patches
3 3
2
(d) 2
1
1
0
0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
Noise density (d) Noise density (d)
We constructed a simple measure of spatial frequency con- if the properties of the noise (e.g., its color) were not directly
tent (SFC): At a given image location, a 16 16 image patch is con
icting with the main feature of the target.
extracted from each I (2); R(2); G(2); B (2) and Y (2) map, and The model was able to reproduce human performance for a
2D Fast Fourier Transforms (FFTs) are applied to the patches. number of pop-out tasks [7], using images of the type shown
For each patch, a threshold is applied to compute the number in Fig. 2. When a target diered from an array of surround-
of non-negligible FFT coecients; the threshold corresponds to ing distractors by its unique orientation (like in Fig. 2), color,
the FFT amplitude of a just perceivable grating (1% contrast). intensity or size, it was always the rst attended location, ir-
The SFC measure is the average of the numbers of non-negligible respectively of the number of distractors. Contrarily, when the
coecients in the ve corresponding patches. The size and scale target diered from the distractors only by a conjunction of fea-
of the patches were chosen such that the SFC measure is sensi- tures (e.g., it was the only red-horizontal bar in a mixed array
tive to approximately the same frequency and resolution ranges of red-vertical and green-horizontal bars), the search time nec-
as our model; also, our SFC measure is computed in the RGB essary to nd the target increased linearly with the number of
channels as well as in intensity, like the model. Using this mea- distractors. Both results have been widely observed in humans
sure, an SFC map is created at scale 4 for comparison with the [7], and are discussed in Section III-B.
saliency map (Fig. 4). We also tested the model with real images, ranging from nat-
III. Results and discussion
ural outdoor scenes to artistic paintings, and using N (:) to nor-
malize the feature maps (Fig. 3 and ref. [17]). With many such
Although the concept of a saliency map has been widely used images, it is dicult to objectively evaluate the model, because
in focus-of-attention models [1], [3], [7], little detail is usually no objective reference is available for comparison, and observers
provided about its construction and dynamics. Here we examine may disagree on which locations are the most salient. However,
how the feedforward feature extraction stages, the map combi- in all images studied, most attended locations were objects of
nation strategy, and the temporal properties of the saliency map interest, such as faces,
ags, persons, buildings or vehicles.
all contribute to the overall system performance. Model predictions were compared to the measure of local
SFC, in an experiment similar to that of Reinagel and Zador
A. General performance [18], using natural scenes with salient trac signs (90 images),
The model was extensively tested with articial images to red soda can (104 images), or vehicle's emergency triangle (64
ensure proper functioning. For example, several objects of same images). Similar to Reinagel and Zador's ndings, the SFC
shape but varying contrast with the background were attended at attended locations was signicantly higher than the average
to in order of decreasing contrast. The model proved very robust SFC, by a factor decreasing from 2:5 0:05 at the rst attended
to the addition of noise to such images (Fig. 5), particularly location to 1:6 0:05 at the 8th attended location. Although
5
this result does not necessarily indicate similarity between hu- model, although often related to local SFC, was closer to human
man eye xations and the model's attentional trajectories, it saliency because it implemented spatial competition between
indicates that the model, like humans, is attracted to \infor- salient locations. Our feed-forward implementation of N (:) is
mative" image locations, according to the common assumption faster and simpler than previously proposed iterative schemes
that regions with richer spectral content are more informative. [5]. Neuronally, spatial competition eects similar to N (:) have
The SFC map was similar to the saliency map for most images been observed in the non-classical receptive eld of cells in stri-
(e.g., Fig. 4.1). However, both maps diered substantially for ate and extrastriate cortex [15].
images with strong, extended variations of illumination or color In conclusion, we have presented a conceptually simple com-
(e.g., due to speckle noise): While such areas exhibited uni- putational model for saliency-driven focal visual attention. The
formly high SFC, they had low saliency because of their uni- biological insight guiding its architecture proved ecient in re-
formity (Figs. 4.2, 4.3). In such images, the saliency map producing some of the performances of primate visual systems.
was usually in better agreement with our subjective perception The eciency of this approach for target detection critically
of saliency. Quantitatively, for the 258 images studied here, the depends on the features types implemented. The framework
SFC at attended locations was signicantly lower than the max- presented here can consequently be easily tailored to arbitrary
imum SFC, by a factor decreasing from 0:90 0:02 at the rst tasks through the implementation of dedicated feature maps.
attended location to 0:55 0:05 at the 8th attended location:
While the model was attending to locations with high SFC, Acknowledgments
these were not necessarily the locations with highest SFC. It We thank Werner Ritter and Daimler-Benz for the trac sign
consequently seems that saliency is more than just a measure of images, Pietro Perona and both reviewers for excellent sugges-
local SFC. The model, which implements within-feature spatial tions. Supported by the National Science Foundation and the
competition captured subjective saliency better than the purely Oce of Naval Research.
local SFC measure.
References
B. Strengths and limitations [1] J. K. Tsotsos, S. M. Culhane, W. Y. K. Wai, Y. H. Lai, N. Davis, F. Nu
o,
\Modelling visual attention via selective tuning," Articial Intelligence, vol.
We have proposed a model whose architecture and compo- 78, no. 1-2, pp. 507{545, Oct. 1995.
nents mimic the properties of primate early vision. Despite its [2] E. Niebur and C. Koch, \Computational architectures for attention," R.
Parasuraman, (Ed.), The attentive brain, Cambridge, MA:MIT Press, pp.
simple architecture and feedforward feature extraction mecha- 163{186, 1998.
nisms, the model is capable of strong performance with com- [3] B.A. Olshausen, C.H. Anderson CH and D.C. Van Essen, \A neurobiolog-
plex natural scenes. For example, it quickly detected salient ical model of visual attention and invariant pattern recognition based on
dynamic routing of information," J Neuroscience, vol. 13, no. 11, pp. 4700{
trac signs of varied shapes (round, triangular, square, rectan- 4719, Nov. 1993.
gular), colors (red, blue, white, orange, black) and textures (let- [4] C. Koch and S. Ullman, \Shifts in selective visual attention: towards the
ter markings, arrows, stripes, circles), although it had not been underlying neural circuitry," Human Neurobiology, vol. 4, pp. 219{227, 1985.
[5] R. Milanese, S. Gil and T. Pun, \Attentive Mechanisms for Dynamic and
designed for this purpose. Such strong performance reinforces Static Scene Analysis," Optical Engineering, vol. 34, no. 8, pp.2428{2434,
the idea that a unique saliency map, receiving input from early Aug. 1995.
visual processes, could eectively guide bottom-up attention in [6] S. Baluja and D.A. Pomerleau, \Expectation-based Selective Attention
for Visual Monitoring and Control of a Robot Vehicle," Robotics and Au-
primates [4], [10], [5], [8]. From a computational viewpoint, tonomous Systems, vol. 22, no. 3-4, pp. 329{344, Dec. 1997.
the major strength of this approach lies in the massively paral- [7] A.M. Treisman and G. Gelade, \A feature-integration theory of attention,"
lel implementation, not only of the computationally expensive Cognitive Psychology, vol. 12, no. 1, pp. 97{136, Jan. 1980.
[8] J.P. Gottlieb, M. Kusunoki and M.E. Goldberg, \The representation of vi-
early feature extraction stages, but also of the attention focus- sual salience in monkey parietal cortex," Nature, vol. 391, no. 6666, pp.
ing system. More than previous models based extensively on 481-484, Jan. 1998.
relaxation techniques [5], our architecture could easily allow for [9] D.L. Robinson, S.E. Peterson, \The pulvinar and visual salience," Trends
real-time operation on dedicated hardware. in Neurosciences, vol. 15, no. 4, pp. 127{132, Apr. 1992.
[10] J.M. Wolfe, \Guided search 2.0: a revised model of visual search," Psycho-
The type of performance which can be expected from this nomic Bulletin Review, vol. 1, pp. 202{238, 1994.
model critically depends on one factor: Only object features [11] H. Greenspan, S. Belongie, R. Goodman, P. Perona, S. Rakshit, C. H.
Anderson, \Overcomplete steerable pyramid lters and rotation invari-
explicitely represented in at least one of the feature maps can ance," Proc. IEEE Computer Vision and Pattern Recognition (CVPR), Seat-
lead to pop-out, that is, rapid detection independently of the tle, Washington, pp. 222-228, Jun. 1994.
number of distracting objects [7]. Without modifying the preat- [12] A.G. Leventhal, The Neural Basis of Visual Function (Vision and Visual
Dysfunction Vol. 4), Boca Raton, FL:CRC Press, 1991.
tentive feature extraction stages, our model cannot detect con- [13] S. Engel, X. Zhang and B. Wandell, \Colour tuning in human visual cortex
junctions of features. While our system immediately detects a measured with functional magnetic resonance imaging," Nature, vol. 388, no.
target which diers from surrounding distractors by its unique 6637, pp. 68{71, Jul. 1997.
size, intensity, color or orientation (properties which we have [14] C. Koch, \Biophysics of Computation: Information Processing in Single
Neurons," Oxford University Press (in press).
implemented because they have been very well characterized in [15] M. W. Cannon and S. C. Fullenkamp, \A Model for Inhibitory Lateral
primary visual cortex), it will fail at detecting targets salient for Interaction Eects in Perceived Contrast," Vision Research, vol. 36, no. 8,
unimplemented feature types (e.g., T junctions or line termina- pp. 1115{1125, Apr. 1996.
[16] M. I. Posner and Y. Cohen, \Components of visual orienting", Attention
tors, for which the existence of specic neural detectors remains and Performance X, (H. Bouma, D.G. Bouwhuis eds), Hilldale, NJ:L. Erl-
controversial). For simplicity, we also have not implemented any baum, pp. 531{556, 1984.
recurrent mechanism within the feature maps, and hence can- [17] The C++ implementation of the model and numerous examples of atten-
tional predictions on natural and synthetic images can be retrieved from
not reproduce phenomena like contour completion and closure, http://www.klab.caltech.edu/itti/attention/
important for certain types of human pop-out [19]. In addition, [18] P. Reinagel and A.M. Zador, \The Eect of Gaze on Natural Scene Statis-
at present our model does not include any magnocellular motion tics," Neural information and coding workshop, Snowbird, Utah, 16-20 Mar.
1997.
channel, known to play a strong role in human saliency [5]. [19] I. Kovacs and B. Julesz, \A closed curve is much more than an incomplete
A critical model component is the normalization N (:), which one: eect of closure in gure-ground segmentation," Proc. Nat'l Academy
provided a general mechanism for computing saliency in any of Sciences, U.S.A., vol. 90, no. 16, pp. 7495{7497, Aug. 1993.
situation. The resulting saliency measure implemented by the