Attention-based smart-camera for spatial cognition

Nicolas Cuperlier, Hakim Guedjou, Frederic de Melo
ETIS Lab
Cergy-Pontoise, France
firstname.name@ensea.fr

Benoit Miramond
University Cote d'Azur, CNRS, LEAT
Nice, France
bmiramond@unice.fr

ABSTRACT

Bio-inspired attentional vision reduces post-processing to a few regions of the visual field. However, the computational complexity of most visual chains remains an issue for an embedded processing system such as a mobile and autonomous robot. We propose in this paper an attention-based smart-camera and neural networks for place recognition in the context of navigation missions in robotics. The smart-camera extracts points of interest based on retina receptive fields at multiple scales and in real-time, thanks to a dedicated hardware architecture prototyped onto reconfigurable devices. The place recognition is computed by neural networks, inspired by hippocampal place cells, that code for both the descriptors ('what' information) and the locations ('where' information) of the points of interest provided by the smart-camera. We experimented with the addition of a coarse-to-fine approach in the recognition process and obtained improved results during robot localisation experiments.

Keywords

bio-inspired systems; artificial neural networks; FPGA; robot; smart-camera.

1. INTRODUCTION

Designing biologically-inspired models deals with robust and versatile vision systems that can adapt to various users and tasks and to unpredictable environmental conditions. Saliency and attentional vision have recently been studied as a transposition into several digital applications outside of the original biological domain. Today, bio-inspired visual models open new perspectives in the treatment, the analysis, the quality and the transmission of image sequences over a network. Current research in neurosciences describes attentional processes according to three components: i) an ascending selection (bottom-up) of visual characteristics salient over the sensory data, called visual saliency [1], ii) a descending bias (top-down) depending on the context of the perception, such as the current task, the analysis of the environment or the social interactions [8], iii) the competition between different modalities, in the visual field between different characteristics, to access limited resources in the brain (short-term memory, sensory data selection). The work presented in this paper, the RobotSoC project1, is directly inspired by these observations. We introduced in previous works a multi-scale attentional architecture where Points of Interest (PoI) are detected in a sampled scale space based on an image pyramid, and we evaluated a single-scale hardware implementation on a robot homing experiment [4]. This paper extends these works by providing hardware implementation results of the full model, and studies several neural architectures allowing place recognition (robot localisation) from the PoI extracted from the visual scene by the smart-camera.

Several digital models of attention have been proposed in recent works [1]. For example, the model used in [12] mimics human visual perception from retina to cortex using both static and dynamic information and hence is compute-intensive. The authors proposed a parallel adaptation of this visual saliency model onto GPU, resulting in a real-time solution at 22 fps on multi-GPU for 320x240 images. We advocate in this paper the use of more energy-efficient processing by proposing dedicated hardware prototyped onto reconfigurable devices. This approach has been widely studied in the literature [3] but, to our knowledge, there is no existing hardware solution allowing both ascending selection and descending contextual modulation.

The rest of the paper is organized in 3 sections. Section 2 describes the hardware architecture of the proposed smart-camera that extracts PoI at multiple scales. The results show that the entire visual chain can be embedded into a FPGA-SoC device delivering up to 60 frames per second. In section 3, we present and evaluate several neural models for place recognition, learning the PoI provided by the camera. The influence of these models on the generalization of the recognition and on the recall-precision trade-off is studied. Among the proposed models, we experimented with the addition in the recognition process of a contextual modulation following a coarse-to-fine approach and obtained improved results during robot localisation experiments. We conclude and propose perspectives in section 4.

1 www-etis.ensea.fr/robotsoc/

ICDSC '16, September 12-15, 2016, Paris, France. ACM ISBN 978-1-4503-4786-0/16/09. DOI: http://dx.doi.org/10.1145/2967413.2967440
2. HARDWARE ARCHITECTURE OF THE SMART-CAMERA

The smart-camera described in this section provides three main contributions: i) computing in real-time high-level information, in the form of a saliency map, compared to raw images, ii) reducing the amount of data to send over the network, and iii) providing a contextual input allowing to constrain the identification of PoI.

The organization of the smart-camera architecture is depicted in fig. 1. It is composed of a chain of custom Intellectual Properties (IPs), designed at RTL2 level, communicating through a streaming interface (instantiated as an AXI streaming interface) into the prototype.

Figure 1: Global view of the multiscale architecture. The flow of pixels comes from the camera, passes through the convolutional IPs and goes to the CPU's memory thanks to DMA channels. An intermediate output can be selected and the points of interest are read through the memory-mapped interface.

2 Register Transfer Level

2.1 Single-scale architecture

One scale of the vision chain is composed of the following processing units:

1. the edge detector, computed as a classical Sobel filter,
2. the first Gaussian filter, computed as a convolution with a Gaussian function,
3. the difference of Gaussians (DoG), computed thanks to a simple subtraction module and a shift register that synchronizes the input pixel streams,
4. the PoI search algorithm, which consists in finding the local maxima in the DoG images,
5. the sorting IP, to which the detected PoI are then given,
6. finally, two IPs responsible for the log-polar mapping, Address Generator and Transform.
The main contributions concern IP2, IP4, IP5 and IP6. Firstly (IP2), even if the two-dimensional convolution is separated into two one-dimensional convolutions, the implementation of exponential-based functions in hardware remains an issue for designing scalable architectures. To reach this goal, the exponentiation operation is computed with an iterative architecture adapted from the CORDIC algorithm. A sequence of shift-and-add iterations approximates, step by step, the result of the exponential function. For example, a 16-bit precision is reached after 12 iterations. The exponentiation module is then built around a barrel shifter, an adder and registers. The efficiency of this method has been studied in [6] in the context of neuromorphic architectures.
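As an illustration of this shift-and-add principle (a software sketch, not the RTL of IP2; the constant table and iteration count below are our own choices), the exponential can be approximated with a small ROM of ln(1 + 2^-k) values and one barrel shift plus one addition per iteration:

```python
import math

# Constants ln(1 + 2^-k), k = 1..N: in hardware these sit in a small ROM.
LN1P = [math.log(1.0 + 2.0 ** -k) for k in range(1, 25)]

def shift_add_exp(x, iterations=12):
    """Approximate exp(x) for x >= 0 using only shifts and additions.

    exp(x) = 2^m * exp(r) with r = x - m*ln(2) in [0, ln 2); the residual r is
    then consumed greedily by factors (1 + 2^-k), each applied as
    y = y + (y >> k), i.e. a barrel shift followed by an addition.
    """
    m = int(x / math.log(2.0))          # integer multiple of ln(2), handled as a final shift
    r = x - m * math.log(2.0)
    y = 1.0
    for k in range(1, iterations + 1):
        if r >= LN1P[k - 1]:
            r -= LN1P[k - 1]
            y += y * 2.0 ** -k          # shift-and-add step
    return y * 2.0 ** m                 # final power-of-two shift

if __name__ == "__main__":
    for x in (0.3, 1.0, 2.5):
        print(f"exp({x}) ~ {shift_add_exp(x, 16):.6f} (exact {math.exp(x):.6f})")
```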
The remaining IPs do not correspond to the classical hardware components of image processing and are mostly implemented as embedded software. For each pixel, IP4 searches in a disk of radius r whether the pixel is greater than the others, to determine if it can be considered as a PoI in this area. This search step needs a quadratic number of comparisons, O(r²), for each pixel in the stream. A first possible design optimizes the latency by computing the comparison in a single clock cycle, but needs a huge number of hardware resources when considering large disk radii. In the trade-off between latency and resource consumption, a second solution consists in separating the comparison kernel into two 1D comparison trees, as for separable filters. We then proposed to transform the circular search area into a square window, hence avoiding the scalability limit of existing implementations [5].
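This square-window search can be illustrated in software. The sketch below (our own illustration, not the streaming single-clock-cycle hardware) marks local maxima of a DoG image using the two separable 1D maximum passes mentioned above:

```python
import numpy as np

def local_maxima_square(dog, r=20, threshold=0.0):
    """Mark PoI as local maxima of a DoG image over a (2r+1)x(2r+1) square window.

    The square window makes the maximum filter separable: one horizontal pass and
    one vertical pass of 1D running maxima, mirroring the two 1D comparison trees.
    """
    h, w = dog.shape
    padded = np.pad(dog, r, mode="constant", constant_values=-np.inf)

    # Horizontal 1D maximum over 2r+1 neighbours.
    horiz = np.full_like(dog, -np.inf)
    for dx in range(-r, r + 1):
        horiz = np.maximum(horiz, padded[r:r + h, r + dx:r + dx + w])

    # Vertical 1D maximum of the horizontal result = 2D maximum over the square.
    padded_h = np.pad(horiz, ((r, r), (0, 0)), mode="constant", constant_values=-np.inf)
    window_max = np.full_like(dog, -np.inf)
    for dy in range(-r, r + 1):
        window_max = np.maximum(window_max, padded_h[r + dy:r + dy + h, :])

    return (dog >= window_max) & (dog > threshold)

if __name__ == "__main__":
    dog = np.random.default_rng(0).normal(size=(270, 480))
    ys, xs = np.nonzero(local_maxima_square(dog, r=20))
    print(f"{len(xs)} candidate PoI found")
```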
The hardware nature of the computation requires again to adapt the initial software sort algorithm (sort by insertion) to avoid pointer management and external memory, since the sort has to be computed in a single clock cycle (for each incoming pixel) in IP5. Let us consider that, at each time, the hardware list of PoI is sorted. This list consists in chained register banks. When a new point comes in, its value is compared to the ones already sorted in the list. When its value is large enough to be inserted, the structure is stored in the right register bank. Thanks to a routing switch, every PoI below in the list can be shifted to update the list.
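A software analogue of this insertion mechanism, with the register banks modelled as a fixed-size Python list, is sketched below (names and sizes are illustrative):

```python
from dataclasses import dataclass

@dataclass
class PoI:
    x: int
    y: int
    value: float

class SortedPoIBank:
    """Keeps at most `size` PoI sorted by decreasing value.

    In hardware all comparisons happen in parallel within one clock cycle and a
    routing switch shifts the register banks below the insertion slot; here the
    same behaviour is emulated sequentially.
    """
    def __init__(self, size=16):
        self.size = size
        self.bank = []                     # sorted, strongest first

    def push(self, poi: PoI):
        # Index of the first stored PoI weaker than the new one (parallel comparators).
        slot = next((i for i, p in enumerate(self.bank) if poi.value > p.value),
                    len(self.bank))
        if slot >= self.size:
            return                         # weaker than every stored PoI: discarded
        self.bank.insert(slot, poi)        # shift the weaker entries down
        del self.bank[self.size:]          # the last register bank falls off the chain

if __name__ == "__main__":
    bank = SortedPoIBank(size=4)
    for i, v in enumerate([0.2, 0.9, 0.1, 0.5, 0.7, 0.3]):
        bank.push(PoI(x=i, y=0, value=v))
    print([p.value for p in bank.bank])    # -> [0.9, 0.7, 0.5, 0.3]
```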

Finally, in IP6, the Address Generator converts the Cartesian address into the log-polar one. This component stores the result of the conversion in a look-up table (LUT) which, one more time, enables to satisfy the pixel rate of the camera. The Transform IP is responsible for the feature transformation itself. The number of Cartesian pixels needed to generate a log/polar pixel varies according to the considered ring. Each output pixel is then computed as an averaged accumulation of the input Cartesian pixels. The exact parameters of this transformation are computed off-line to generate the hardware IP adapted to the chosen value of the log/polar radius. Contrary to the previous components, the log/polar IPs, as the tail of the vision chain, have to conserve their data until an external read-back from the embedded processor. So, these IPs are duplicated NPoI times, with NPoI the maximal number of PoI to detect.
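The off-line LUT generation and the averaged accumulation can be sketched as follows; the ring/sector counts and the radius are placeholders, not the exact parameters of the camera:

```python
import numpy as np

def build_logpolar_lut(radius=13, n_rings=8, n_sectors=8):
    """For every Cartesian pixel of the local view, return the index of the
    log-polar bin it feeds (-1 outside the annulus): the off-line step that
    produces the Address Generator LUT."""
    size = 2 * radius + 1
    lut = -np.ones((size, size), dtype=np.int32)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    rho = np.hypot(xs, ys)
    valid = (rho >= 1.0) & (rho <= radius)
    ring = np.clip((np.log(rho, out=np.ones_like(rho), where=valid)
                    / np.log(radius) * n_rings).astype(int), 0, n_rings - 1)
    sector = ((np.arctan2(ys, xs) + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
    lut[valid] = (ring * n_sectors + sector)[valid]
    return lut

def logpolar_descriptor(patch, lut, n_bins):
    """Each log-polar pixel is the average of the Cartesian pixels mapped onto it."""
    sums, counts = np.zeros(n_bins), np.zeros(n_bins)
    np.add.at(sums, lut[lut >= 0], patch[lut >= 0])
    np.add.at(counts, lut[lut >= 0], 1)
    return sums / np.maximum(counts, 1)

if __name__ == "__main__":
    lut = build_logpolar_lut()
    patch = np.random.default_rng(0).random(lut.shape)
    print(logpolar_descriptor(patch, lut, 8 * 8).shape)   # (64,) rho-theta descriptor
```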
2.2 Multi-scale architecture

We described above the hardware components that compose one scale of the Gaussian pyramid. Each scale can now be duplicated and chained with the others to build the complete pyramid. The reference design consists in the classical bottom-up pyramid, sequentially detecting PoI in a data-driven way. To reduce the amount of PoI extracted, we choose to set a fixed number of PoI extracted by each visual scale (NPoI = 6). We then restrict the search area of higher scales to positions where PoI of the lowest scale were found. Moreover, even if the experiments of section 3 do not rely on it, this architecture also integrates a top-down constraint in the detection, coming from the learning layer and acting as a bias in the PoI recalling process (see dashed line in figure 1). The Ethernet interface enables to configure this constraint at any time by a specific request.
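This constrained multi-scale detection can be summarised by the following sketch, where the window size plays the role of the per-scale X CSTR WIDTH / Y CSTR HEIGHT registers (a simplified software model, not the streaming hardware):

```python
import numpy as np

def detect_poi(dog, n_poi, mask=None):
    """Keep the n_poi strongest responses of a DoG map, optionally restricted to a mask."""
    scores = dog.copy()
    if mask is not None:
        scores[~mask] = -np.inf
    flat = np.argsort(scores, axis=None)[::-1][:n_poi]
    return [np.unravel_index(i, dog.shape) for i in flat if np.isfinite(scores.flat[i])]

def pyramid_detection(dog_pyramid, n_poi=16, win=(32, 32)):
    """Bottom-up detection where higher (coarser) scales only search around the
    PoI found at the lowest scale."""
    poi_per_scale = {0: detect_poi(dog_pyramid[0], n_poi)}
    for s, dog in enumerate(dog_pyramid[1:], start=1):
        ratio = dog.shape[0] / dog_pyramid[0].shape[0]     # coordinate change between scales
        mask = np.zeros(dog.shape, dtype=bool)
        for (y, x) in poi_per_scale[0]:
            cy, cx = int(y * ratio), int(x * ratio)
            mask[max(0, cy - win[1] // 2):cy + win[1] // 2,
                 max(0, cx - win[0] // 2):cx + win[0] // 2] = True
        poi_per_scale[s] = detect_poi(dog, n_poi, mask)
    return poi_per_scale

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pyramid = [rng.random((270 // 2 ** s, 480 // 2 ** s)) for s in range(3)]
    print({s: len(p) for s, p in pyramid_detection(pyramid, n_poi=8).items()})
```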
Whatever the architecture tested in the experiments, the detected PoI are identified by their coordinates, scale and octave. As presented in table 1, the place recognition models of section 3 use this information (the Where, relative to the camera orientation) in addition to the rho-theta descriptor itself (the What) to build place cells from the visual scene captured during navigation. The FEATURE PIX memory space can contain between 255 and 2048 pixels. The software embedded into the associated processor runs an entire TCP/IP stack under embedded Linux. The smart-camera acts as an Ethernet server that delivers data to clients through a Gigabit Ethernet link. The client can also ask for additional information for the calibration of the camera with the VIDEO OUT SRC register, which configures the internal DMA to send either the raw video stream, or the stream at the output of the gradient, Gaussian or DoG IPs. In the context of this paper, the client corresponds to the neural layer (section 3).

Table 1: The interface between the hardware vision chain and the embedded processor shows the structure of the data delivered by the smart-camera.

Register          Description                                          Access   Offset
FEATURE PIX       The pixels of the visual descriptor                  RO       0x0000 ... 0x2000
FRAME COUNTER     The number of the current frame                      RO       0x2000
KP COUNTER        Number of the PoI                                    RO       0x2004
KP X              X coordinate of the KP                               RO       0x2008
KP Y              Y coordinate of the KP                               RO       0x200C
KP VALUE          Intensity of the KP                                  RO       0x2010
KP CSTR TAG       The TAG contains the chaining information            RO       0x2014
                  between KP across the scales
KP INDEX          Index of the KP in the sorted list                   RW       0x2018
X CSTR WIDTH 1    Width of the constraint window in scale 1            RW       0x201C
Y CSTR HEIGHT 1   Height of the constraint window in scale 1           RW       0x2020
...               ...                                                  ...      ...
X CSTR WIDTH 6    Width of the constraint window in scale 6            RW       0x2044
Y CSTR HEIGHT 6   Height of the constraint window in scale 6           RW       0x2048
X CSTR            X coordinate of the constraint of index CSTR INDEX   RW       0x204C
Y CSTR            Y coordinate of the constraint of index CSTR INDEX   RW       0x2050
CSTR INDEX        Index of the constraint to configure                 RW       0x2054
CSTR EN           Enable the constraint chain or freely identify PoI   RW       0x2058
VIDEO OUT SRC     Source of the additional output video                RW       0x205C

2.3 Results

We already explored the impact of the parameters of the IPs described earlier on the hardware resource consumption [4]. Thanks to our contributions, the resource consumption of all the hardware components became linear with the number of PoI and the search radius. In [4] we targeted a smaller FPGA device (Zynq 7020) and were able to embed only one scale of the pyramid. However, this exploration allowed us to estimate the feasibility of a hardware implementation of the full multi-scale attentional architecture onto upcoming devices such as the Zynq 7045 considered in this paper. This target device is a System-on-Chip from Xilinx, composed both of a reconfigurable logic matrix (FPGA) and a dual-core Cortex-A9 embedded processor. The logic matrix is composed of 437k registers, 218k LUTs, 2180 KB of on-chip memory (BRAM) and 900 DSP blocks. The parameters we used were: the number of PoI per scale NPoI = 16, the search radius r = 20, the detection threshold γ = 0, and the frequency of the camera was set to 148.25 MHz. The threshold parameter γ can be fixed at compile time or modified dynamically depending on the lightness and visual environment.
The resource utilization of the FPGA device for the single-scale and the multi-scale architectures is described in table 2. The scale factors show that the computation resources (LUT and DSP) grow by a factor smaller than the number of additional scales (from 1 to 6), but that the system needs around ten times more memory resources (registers and BRAM) to store the intermediate data and the final local features. Due to the important use of BRAM by our architecture, we can currently compute raw images with a resolution of 480x270 pixels (scales 1 and 2). It entails a resolution of 240x135 pixels for scales 3 and 4, and 120x67 for scales 5 and 6.

Table 2: Usage of the FPGA resources. Comparison between the single-scale and multi-scale architectures.

Architecture   Resource type     Utilization   Percentage   Scale factor
Single-scale   LUT               15520         7.1%         -
               Registers         5684          1.3%         -
               BRAM              41            7.6%         -
               Embedded memory   1.4 Mb        7.6%         -
               DSP               5             4.8%         -
Multi-scale    LUT               54781         25.06%       3.5
               Registers         59765         13.67%       10.5
               BRAM              401           73.66%       9.69
               Embedded memory   14.4 Mb       73.66%       9.69
               DSP               246           27.33%       5.7

A 480x270 image with pixels coded on 10 bits generates 1266 Kbits at each frame. A single scale of our architecture only generates 16 rho-theta descriptors of resolution 16x16 coded on 16 bits. The camera then sends 64 Kbits of data per frame, which corresponds to 5% of the raw data of a standard camera. Of course, the reduction ratio decreases with the number of scales. With the full-scale architecture, the number of PoI increases. If we consider the worst case where the maximum number of PoI can be obtained, even at low scales, the throughput of our camera remains 30% lower. Finally, the data-flow architecture of the hardware system enables to work at the camera frequency, configured in our experiments to deliver 60 frames per second.
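The reduction ratio quoted above can be checked with a few lines (all values are taken from the text):

```python
# Back-of-the-envelope check of the data reduction of a single-scale camera.
raw_bits = 480 * 270 * 10                        # one 480x270 frame, 10-bit pixels
descriptor_bits = 16 * (16 * 16) * 16            # 16 rho-theta descriptors, 16x16 pixels, 16 bits

print(raw_bits / 1024, "Kbits per raw frame")            # ~1266 Kbits
print(descriptor_bits / 1024, "Kbits sent per frame")    # 64 Kbits
print(f"{descriptor_bits / raw_bits:.1%} of the raw data")   # ~5%
```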

3. VISION-BASED SPATIAL COGNITION

We choose to evaluate how a robot equipped with our smart-camera performs in a spatial recognition task. In this section we describe several neural models of place recognition based on the multi-scale PoI descriptors provided by the smart-camera. In order to compare place recognition performances, we choose to use a neural architecture relying on a single visual scale as a reference. It is based on a biologically plausible model of place cells (PC) in the hippocampal system [7] (see fig. 2).

3.1 Place Recognition Models

The following place recognition models are based on a PC model in which a position in the environment is encoded as a pattern of neural activities merging "what" and "where" information coming from an attentional vision mechanism. That is to say, extracted visual descriptors, around points of interest (PoI) of each image of a panorama, are encoded as signatures to provide the "what" information, and the "where" information is provided by the azimuths of these PoI related to an external reference. In the following, all neural models are based on "mean frequency" activity (between [0,1]) and learning is controlled by a common binary signal named l(t) (not shown on fig. 2 for sake of visibility). In the localisation task addressed in this paper, this signal is triggered manually, but in a sensory-motor task (PC/action) it could also be triggered by a self-assessment system [10]. In any case, learning is performed in a one-shot way, meaning that the signal is only set to one (learning allowed) once for each visual input the system has to learn.

3.1.1 Single-scale Place Recognition Model

We first describe the single-scale vision process of this model, which corresponds to a single scale of the vision chain as described in section 2.1. Next, we present the neural architecture on which the neural models taking into account the six visual scales are also based (see A) in fig. 2).

First, we apply a Deriche gradient filter h(x) on the grey-scale image: h(x) = c · x · e^(-α|x|), with c = (1 - e^(-α))². α sets the detection and localization parameters of the gradient detector. Then, the output is convolved with a difference of Gaussians (DoG) filter consisting in two Gaussians of standard deviations σ_DoG1 and σ_DoG2. Then N_PoI points of interest (PoI) are extracted as local maxima on this DoG image by a local competition mechanism. Around each of these PoI, local views (images) are extracted between two disks of radius r_small and r_big. To avoid redundancies, two PoI cannot be closer than r_big/2. Then, these local views are encoded in a population of N_d neurons using a log-polar transformation providing a descriptor d (log/polar descriptor on fig. 2). This transformation has relatively little computational cost, is invariant to small rotations and scale variations, and gives good place recognition results [9].

A neuronal population s codes visual signatures (the "what" information provided by d). The activity of each neuron s_i at time t is given by the following equation:

s_i(t) = 1 - (1/N_d) Σ_{j=1..N_d} |w_ij - d_j(t)|    (1)

where d_j is the jth element of the descriptor vector d and w_ij is the weight of the synaptic link between s_i and d_j. The learning rule of these neurons is the following:

w_ij(t) = d_j(t) · l(t)    (2)

The input pattern is saved in the link weights if the binary learning signal l(t) is equal to one (one-shot learning). Thus, the closer the input is to the learned pattern, the stronger the neuron activation.
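Equations (1) and (2) amount to a one-shot, prototype-based coding of the descriptors. A minimal sketch is given below (recruiting one neuron per learned descriptor is our reading of the one-shot rule; sizes are illustrative):

```python
import numpy as np

class SignaturePopulation:
    """Each neuron stores one learned rho-theta descriptor in its weights and
    answers 1 minus the mean absolute difference with the current descriptor."""
    def __init__(self, n_neurons, descriptor_size):
        self.w = np.zeros((n_neurons, descriptor_size))
        self.recruited = 0

    def activity(self, d):
        # s_i(t) = 1 - (1/N_d) * sum_j |w_ij - d_j(t)|   (descriptors assumed in [0, 1])
        return 1.0 - np.mean(np.abs(self.w[:self.recruited] - d), axis=1)

    def learn(self, d, l=1):
        # w_ij(t) = d_j(t) * l(t): one-shot copy of the input when l(t) = 1
        if l == 1 and self.recruited < len(self.w):
            self.w[self.recruited] = d
            self.recruited += 1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pop = SignaturePopulation(n_neurons=3000, descriptor_size=64)
    d_learned, d_new = rng.random(64), rng.random(64)
    pop.learn(d_learned)
    print(pop.activity(d_learned))   # ~[1.0] for the learned descriptor
    print(pop.activity(d_new))       # lower for a different one
```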
In addition, the model relies on the "where" information: the azimuth from which the PoI is observed according to a fixed reference³. For sake of simplicity, the head (camera) orientation is provided in this paper by a magnetic compass and proprioception (coming from the pan servo-motor). The azimuth of a PoI is then computed from its x coordinate in the image and the current head orientation. The activities of the azimuth neurons a_i are obtained after a lateral diffusion around the neuron coding for the direction of the current PoI, to enhance recognition generalization [7]. In our case, it corresponds to a Gaussian bell with standard deviation σ_azim. In order to reduce the computational cost, the 360° surrounding field is discretised in N_a orientations.

Fusion of "what" and "where" information is performed in a Spatial Working Memory of size N_SWM = N_s × N_a, that is, a second-order tensor SWM in which each neuron m_i codes for a signature-azimuth couple⁴. It stores previous activities while a visual scene exploration (panorama) is still in progress. Then, the activity of SWM neurons is:

SWM(t) = max[(s ⊗ a), SWM(t - dt) · (1 - r(t))]    (3)

where N_s is the size of the signature vector, s and a are the signature and azimuth vectorial representations, r a binary reset signal triggered at the end of a panorama and ⊗ is the tensorial product operator.

The pattern of activities in the SWM map codes for the current place. The Gaussian shape of a allows to slowly decrease the activity on small variations of the azimuth and thus a better generalisation. Such a pattern can be categorized in a place vector using the same equation as in (1), in which each neuron p_i has the following activity at time t:

p_i(t) = 1 - (1/N_SWM) Σ_{j=1..N_SWM} |w^SWM_ij(t) - m_j(t)|    (4)

where m_j is the jth element of the tensor SWM and w^SWM_ij is the weight of the synaptic link between p_i and m_j. The learning rule is the same as in equation (2), replacing d_j by m_j.

In the following, visual information (visual signatures and their retinotopic positions) is provided by a software implementation of the multi-scale vision architecture (presented in fig. 1 in section 2) running on a classical CPU (see 3.2.1).
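A compact sketch of the azimuth diffusion and of equations (3) and (4) is given below; the number of orientations and the wrap-around handling are illustrative assumptions:

```python
import numpy as np

def azimuth_population(theta_deg, n_orient=36, sigma=30.0):
    """Gaussian bump of activity around the PoI azimuth (the 'where' pathway)."""
    centers = np.arange(n_orient) * (360.0 / n_orient)
    diff = np.abs(centers - theta_deg)
    diff = np.minimum(diff, 360.0 - diff)            # wrap-around distance
    return np.exp(-0.5 * (diff / sigma) ** 2)

class SpatialWorkingMemory:
    """Equation (3): SWM(t) = max[(s ⊗ a), SWM(t - dt) * (1 - r(t))]."""
    def __init__(self, n_sig, n_orient):
        self.m = np.zeros((n_sig, n_orient))

    def update(self, s, a, reset=0):
        self.m = np.maximum(np.outer(s, a), self.m * (1 - reset))
        return self.m

def place_cell_activity(w_swm, m):
    """Equation (4): p_i = 1 - (1/N_SWM) * sum_j |w_ij - m_j| over the flattened SWM."""
    return 1.0 - np.mean(np.abs(w_swm - m.ravel()), axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    swm = SpatialWorkingMemory(n_sig=20, n_orient=36)
    for _ in range(5):                                        # 5 PoI of one panorama
        swm.update(rng.random(20), azimuth_population(rng.uniform(0, 360)))
    w = swm.m.ravel()[None, :]                                # one place cell learned here
    swm.update(rng.random(20), azimuth_population(90.0), reset=1)   # new panorama
    print(place_cell_activity(w, swm.m))
```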
3 The orientation can be provided by either a magnetic compass or a multimodal compass modelling the Head Direction Cells found in mammals [2].

4 We use in this work a two-dimensional SWM, but this model could be extended to a third-order tensor by taking into account another input like the signatures elevation [9].
Figure 2: Neural architectures for place recognition. A) The model of Gaussier et al. where place cells code the pattern of activity of a spatial working memory (SWM) [7]. For each image of a panorama, the product of the visual signature (what information) and the azimuth (where information) of each Point of Interest (PoI) is sequentially extracted and added to the SWM by an attentional mechanism. The azimuth is computed from the head orientation and the X coordinate of the PoI in the image. The vision processes of this model correspond to a single scale of the vision chain as described in section 2.1. The neural models B)-D) differ by the way they process the sorted list of PoI descriptors provided by the multi-scale architecture (see section 2.2). B) Multi-scale PoI stacking model (MPIS): a single PC population learns a SWM activity pattern formed by taking into account the PoI found on the six scales. C) Multi-scale PC stacking model (MPCS): each of the 6 visual scales is processed as in A), resulting in six unlinked PC populations. A multi-scale PC then learns the stacking of these PC responses. D) Multi-scale coarse-to-fine model (MCFPC): as in C), PoI of each visual scale (n) are coded by a population of PC, but in addition, PCs of successive scales are also linked such that the winning PC of a coarse scale (n+1) can bias PC recognition at a finer scale (n).

3.1.2 Multi-scale PoI stacking (MPIS) model

In this model (see B) in fig. 2), PoI descriptors extracted at every scale of the vision system are fed to the same neural networks as in the single-scale model. It corresponds to a direct adaptation of the single-scale system to a multi-scale one by simply stacking the PoI descriptors.

3.1.3 Multi-scale PC stacking (MPCS) model

Patterns of activities in each SWM code for the current location of the robot perceived at different scales. At each scale of the model described in C) of fig. 2, place cells are computed as in the single-scale model. Since all scales will independently propose a winning PC and we need only one answer, we choose to merge them in a multi-scale PC population that relies on the same equation (4).

3.1.4 Multi-scale Coarse-to-fine PC (MCFPC) model

We adopt a coarse-to-fine approach in this last model (see D) in fig. 2). As in the MPCS model, the pattern of activities of each SWM can be categorized by PC neurons. But we added a hierarchical constraint on PCs of successive scales. That is to say, activities of PCs of a "coarse" scale n + 1 can, via a neuromodulation link, bias the recognition competition between PCs of the "finer" scale n. For a given scale n, a PC neuron p^n_i has the following activity at time t:

p^n_i(t) = H(w^n_iq(t) · p^(n+1)_q(t)) × [1 - (1/N_SWM) Σ_{j=1..N_SWM} |w^n_ij(t) - m^n_j(t)|]    (5)

where w^n_iq is the weight between the PC p^n_i at scale n and p^(n+1)_q, the winning PC at scale n + 1; m^n_j is the jth element of the tensor SWM at scale n; w^n_ij is the weight between p^n_i and m^n_j; H(x) is the Heaviside function⁵. As in equation (2), the input pattern is saved in the link weights when l(t) = 1.

5 Please note that the binarization of the contextual term through the Heaviside function is not mandatory. The purpose is to ensure that the dynamics of place cell activities only depends on the input pathway (instead of being reduced by p^(n+1)_q(t), which is ≤ 1).
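Equation (5) can be sketched as follows for one fine scale; the one-hot coding of the coarse winner and the weight layout are our own simplifications:

```python
import numpy as np

def heaviside(x):
    return (x > 0).astype(float)

def mcfpc_activity(w_ctx, winner_coarse, w_swm, m_fine):
    """p^n_i(t) = H(w^n_iq * p^(n+1)_q) * [1 - (1/N_SWM) * sum_j |w^n_ij - m^n_j|].

    w_ctx[i, q]   : neuromodulation weight from coarse-scale PC q to fine-scale PC i
    winner_coarse : one-hot activity of the winning PC at scale n+1
    w_swm, m_fine : SWM weights and current SWM pattern at scale n (as in eq. (4))
    """
    gate = heaviside(w_ctx @ winner_coarse)                   # contextual gating
    match = 1.0 - np.mean(np.abs(w_swm - m_fine.ravel()), axis=1)
    return gate * match

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_ctx = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)  # PCs 0,1 linked to coarse PC 0
    winner = np.array([1.0, 0.0])                                    # coarse PC 0 wins
    w_swm, m = rng.random((4, 100)), rng.random(100)
    print(mcfpc_activity(w_ctx, winner, w_swm, m))   # fine PCs 2 and 3 are gated to zero
```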
3.2 Material and Methods

Table 3: Parameter values.

Parameter   Value   Description
σazim       30      Std. dev. of the azimuth diffusion (in degrees)
Ns          3000    Nb. of signature neurons
Na          5       Nb. of orientations in the SWM
α           1       Gradient resolution (edge detection) in the reference model
σDoG1       5       Std. dev. of the 1st DoG Gaussian (in pixels) in the reference model
σDoG2       2       Std. dev. of the 2nd DoG Gaussian (in pixels) in the reference model
NPoI        5       Nb. of PoI extracted per image
rsmall      5       Small disk radius in local views (in pixels)
rbig        13      Big disk radius in local views (in pixels)
ND          8 × 8   Size of the rho-theta descriptor
3.2.1 Experimental Setup

Our objective is to compare the capacity of the tested models to discriminate places in terms of granularity. The study we present in this paper is performed on a dataset recorded in our University using the RobotSoC robot. The dataset fits the experimental protocol involving on-line learning on real robots used in [10] and simulates navigation in an unconstrained real (no artificial cues) and dynamic environment (moving people). It consists in trajectories (80 meters long) recorded in a corridor and a rotunda of the first floor of our University building. The robot camera captures 15 images (with a size of 640 × 480) over a 360 degrees panorama. During this process, the robot stays still in order to avoid distortions in the representation of the place. A magnetic compass was used to acquire the robot orientation for sake of simplicity. The dataset includes 40 learning panoramas used to learn 40 PC (l(t) = 1) and 120 exploration panoramas used to test PC recognition (l(t) = 0). The closest distance between two learned places is dlearn = 2 ± 0.03 meters on average. An exploration panorama is completed after traveling dexplo = 0.5 ± 0.03 meter. We calculate the ground truth by estimating the robot position at every step based on these values. All experiments are run on a 6-core 12-thread 3.33 GHz CPU with 16 GB of RAM. We use the Promethe neural network simulator [11].
3.2.2 Measures

Two criteria are important in the evaluation of robot localisation models based on a vision system, namely generalization and the recall-precision trade-off.

The generalization criterion consists in the system's ability to return relevant elements in new situations by recognizing common characteristics shared with learned patterns. In sensory-motor navigation, such a property is crucial so that the robot can correctly perform in the real world. Indeed, topologically close locations should have close recognition levels. It also allows for using control mechanisms that generate smooth trajectories by averaging PC responses when several situations are recognized well enough. Recall/precision is a classical evaluation criterion from the information retrieval perspective, testing the system's capacity to return the best relevant answers according to the number of elements queried. In this experiment, we consider three measures of performance:

3WD (3 winners distances): It measures the average distance (in meters) between the current robot position and the positions where the 3 best recognized places were originally learned. It characterizes the system's ability to generalize and not only recognize places at the precise location where the corresponding PoI were learned. The sum is normalized by 2.5 × dlearn, which is the average result for a position X in the segment [PA, PC], where the closest learned places PA, PB and PC are the winners⁶. Thus, the results are greater than 1; the lower the better.

WNR (winners-to-noise ratio): Similarly to a classic signal-to-noise ratio, it compares the level of the desired responses to the level of background noise. In this case, we consider that the most relevant information consists in the average of the three winning PCs' activity, while the noise is the average of the remaining place recognition responses⁷. This measure assesses whether a place cell activity decreases when the robot is far from the initial learning location. Also, higher WNR scores indicate less sensitivity to noise and more robustness to small variations (the greater the better). In other words, there is little risk that the fifth closest place cell would be more active and win the competition mistakenly.

MAP (mean average precision): It is a traditional compact representation of the recall-precision curves. It is the mean of the average precision at every position. Indeed, place cell activities are updated at each panorama. This is analogous to a new query. So we can calculate the system precision (i.e. well-ranked PC responses) depending on how many PC activities are considered in the output (from one to all). MAP scores are in [0, 1] and 1 corresponds to a perfect precision.

We measure the response of the PC at the finest scale for the MCFPC and MPCS models.

6 In the case of a perfect recognition, this value is bounded by 2 × dlearn and 3 × dlearn, which respectively correspond to a retrieval on a learned spot and one on the starting or ending points of a trajectory.

7 The places where the 3 winning PCs have been learned should be the closest to the current location.
3.2.3 Results

The performance scores of the 4 tested models are presented in fig. 3. MAP scores of all the models are much greater than "chance" (random selection of the winning PC for each request in a given sequence). The single-scale (reference) model exhibits correct accuracy and generalization properties on the three tests. However, a known limitation of this model is its sensitivity to the number of learned PC, which tends to increase the "noise level" of the recognition. This "noise level" corresponds to the activities of the PCs that have not been learned close to the current position. The MPIS (Multi-scale PoI stacking) and MPCS (Multi-scale PC stacking) models are also affected by this effect.

Two models combining multi-scale information outperform the single-scale model in all tests. The best results are obtained by the MCFPC model (Multi-scale Coarse-to-fine PC), then the MPIS model (Multi-scale PoI stacking).

Figure 3: Performance scores of all tested models: the coarse-to-fine model MCFPC gets the best results in almost all the tests or is second to one of its variants, MPIS (Multi-scale PoI stacking model). The single-scale (reference) model performs better than the MPCS model. 3WD: the lower the better; WNR and MAP: the greater the better. MAP: Random corresponds to a random selection of the winning PC.

The MPCS model gives the worst results in all tests. In this model, PCs are computed independently on each scale and then simply merged in a flat representation. Processing the scales from coarse to fine based on a hierarchical PC learning allows to greatly enhance the robustness of the PC response. Indeed, in the WNR test, we observe a real impact of the modulation link between PCs of successive scales, the MCFPC model obtaining the highest score.

4. CONCLUSION

We introduced in this paper a vision-based robotic architecture composed of a smart-camera, implemented on dedicated hardware (FPGA), providing PoI descriptors extracted at multiple visual scales to a neural architecture simulated via a real-time neural network software for the localisation of a mobile robot. We first presented the implementation details and the resource usage of the vision model taking into account 6 spatial scales in the hardware system. Then we showed that the global performance of a robot localisation task highly depends on the way extracted PoI are combined and learned across the visual scales. Results highlighted the advantage of exploring the visual scene following a coarse-to-fine approach where coarse recognition levels modulate finer recognition levels.

5. REFERENCES

[1] A. Borji, D. N. Sihite, and L. Itti. What/where to look next? Modeling top-down visual attention in complex interactive environments. IEEE Transactions on Systems, Man, and Cybernetics, pages 1-16, 2012.
[2] P. Delarboulas, P. Gaussier, R. Caussy, and M. Quoy. Robustness study of a multimodal compass inspired from HD-cells and dynamic neural fields. In A. P. del Pobil, E. Chinellato, E. Martinez-Martin, J. Hallam, E. Cervera, and A. Morales, editors, 13th Int. Conference on Simulation of Adaptive Behavior, pages 132-143. Springer, 2014.
[3] F. Dias, F. Berry, J. Serot, and F. Marmoiton. Hardware, design and implementation issues on a FPGA-based smart camera. In First Int. Conference on Distributed Smart Cameras, pages 20-26, 2007.
[4] L. Fiack, N. Cuperlier, and B. Miramond. Embedded and real-time architecture for bio-inspired vision-based robot navigation. Journal of Real-Time Image Processing, 10(4):699-722, 2015.
[5] L. Fiack, B. Miramond, and N. Cuperlier. FPGA-based vision perception architecture for robotic missions. In Smart Cameras for Robotic Applications, page 4, Portugal, Oct. 2012.
[6] L. Fiack, L. Rodriguez, and B. Miramond. Hardware design of a neural processing unit for bio-inspired computing. In 13th Int. Conference on New Circuits and Systems Conference, pages 1-4, 2015.
[7] P. Gaussier, A. Revel, J.-P. Banquet, and V. Babeau. From view cells and place cells to cognitive map learning: processing stages of the hippocampal system. Biological Cybernetics, 86:15-28, 2002.
[8] C. D. Gilbert and W. Li. Top-down influences on visual processing. Nature Reviews Neuroscience, 14(5):350-363, 2013.
[9] C. Giovannangeli, P. Gaussier, and J. P. Banquet. Robustness of visual place cells in dynamic indoor and outdoor environment. Int. Journal of Advanced Robotic Systems, 3(2):115-124, 2006.
[10] A. Jauffret, N. Cuperlier, P. Gaussier, and P. Tarroux. From self-assessment to frustration, a small step towards autonomy in robotic navigation. Frontiers in Neurorobotics, 7(16), 2013.
[11] M. Lagarde, P. Andry, and P. Gaussier. Distributed real time neural networks in interactive complex systems. In Int. Conference on Soft Computing as Transdisciplinary Science and Technology, 2008.
[12] A. Rahman, D. Houzet, D. Pellerin, S. Marat, and N. Guyader. Parallel implementation of a spatio-temporal visual saliency model. Journal of Real-Time Image Processing, 6(1):3-14, Mar. 2011.

5.1 Acknowledgments

The authors would like to thank the French CNRS, the ENSEA and the University of Cergy-Pontoise for funding the RobotSoC project.
