
AN SIMD-VLIW SMART CAMERA ARCHITECTURE FOR REAL-TIME FACE RECOGNITION

R. Kleihorst, H. Broers, A. Abbo, H. Ebrahimmalek, H. Fatemi, H. Corporaal and P. Jonker

Philips Research Laboratories, Eindhoven, NL
Philips CFT, Eindhoven, NL
Eindhoven University of Technology, NL
Delft University of Technology, NL

Email: richard.kleihorst@philips.com

ABSTRACT

There is a rapidly growing demand for using smart cameras in various surveillance and identification applications. Although smart cameras have a small form factor, most of these applications demand huge processing performance for real-time operation. Face recognition is one of those applications. In this paper we show that we can run face recognition in real time by implementing the algorithm on an architecture which combines a parallel processor with a high-performance digital signal processor. Everything fits within a digital camera the size of a normal surveillance camera.

I. INTRODUCTION

Recently, face detection and recognition have become important applications for smart cameras. Face detection and recognition require a lot of processing performance if real-time constraints are taken into account [1].

Face detection is the detection of faces in the scene from video data; it is usually done using color and/or feature segmentation. Face recognition is the actual recognition of the person, based on the pixels that span the face region found in the detection process. Face recognition is usually performed either by neural-network matching or by feature measurement and matching against a database. For robust recognition, the face needs to be at a proper angle and completely in front of the camera.

What we want to show in this publication is that it is possible for smart camera architectures to achieve good, real-time face recognition results. A "smart camera" is hereby defined as a stand-alone device which is preferably programmable, with a size not bigger than a typical video surveillance camera. In our situation it is programmed in such a way that video goes in and the names of recognized people come out. A speech-synthesizer output takes care of this; it will also advise the persons in the scene to look straight into the camera and/or to come closer if a person is detected but the recognition reliability is not high enough for positive identification.

The platform we suggest for face recognition is the Intelligent Camera (INCA+) produced by Philips CFT [2], shown in Figure 1. This camera houses a CMOS sensor, a parallel processor for pixel crunching, and a DSP for the high-level programs. We will show in this paper that this platform is well suited for face recognition.

The contents of the paper are as follows: in Section II we explain the architecture of the camera; in Sections III and IV, respectively, we explain the algorithms that we use for face detection and recognition. The results are given in Section V and conclusions are drawn in Section VI.

Fig. 1. INCA camera

II. MOTIVATION OF THE ARCHITECTURE

Face recognition consists of a face detection and a face recognition part. In the detection part, face blobs (groups of pixels spanning a face) are detected in the scene and forwarded to the face recognition process, where the found face blobs are matched against a database with a set of stored faces in order to recognize and identify them.

Fig. 2. Architecture of the INCA camera: the CMOS sensor delivers color (RGB) video to the Xetal parallel processor (face detection part), which forwards face regions (ROI) to the TriMedia VLIW processor (face recognition part), which matches them against the database and outputs the recognized ID.

These two parts of the algorithm work on different data structures. The detection part works on all pixels of the captured video and is pixel oriented (low-level image processing), whereas the recognition part works on face objects and is face oriented (high-level image processing). The detection part has to perform similar operations on all pixels in the scene to determine whether or not a pixel belongs to a face blob. While a live video stream contains a high number of pixels, the operations are simple and similar for each pixel, allowing data-level parallelism.

The data rate in the recognition part is not that high: it only works on a few hundred faces per second. However, it has a high amount of operations executed in an iterative way while a database is "scanned". Because of the higher complexity of the instructions and the combination with an operating system, this part of the algorithm is best mapped on a task-parallel architecture.

The different aspects of the two algorithmic tasks have led us to choose a dual-processor approach, where the low-level image processing of the face detection part is mapped on a massively parallel processor, "Xetal" [3], working in SIMD (Single Instruction Multiple Data) mode. The high-level image processing part of recognition is mapped on a high-performance, fully programmable DSP core, "TriMedia" [4]. This DSP has a VLIW (Very Long Instruction Word) architecture where instruction fetch, data fetch and processing are performed in a pipelined fashion.

For the defined task the two processors can simply be connected in series, as shown in Figure 2. The Xetal does face detection, the TriMedia does face recognition, and the operating system also runs on the TriMedia.

Fig. 3. Xetal Architecture

The first part of the architecture is the CMOS sensor; it can take up to ... frames per second at a resolution of ... pixels. The SIMD Xetal processor exploits massive parallelism: it contains 320 pixel-level processors, and each pixel processor is responsible for ... columns of the image. It can handle up to ... instructions for each pixel and has ... line memories to save information [3]. Figure 3 shows the architecture of the Xetal processor in more detail. This processor directly reads the pixels from the CMOS image sensor and performs the face detection part. Coordinates and subregions of the image where prospective faces are found are forwarded to the TriMedia. The TriMedia exploits limited instruction-level parallelism; it can handle ... operations in parallel. This processor scales and normalizes the subregions and matches them against the faces in its database. In the fashion of a real "smart" camera, only IDs are reported. These IDs are sent to a speech synthesizer that greets the person recognized, or asks the person to identify himself when not recognized.
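To make this split concrete, the following C sketch contrasts the two workload shapes. It is an illustration only, not the camera firmware: the frame size, the UV thresholds and all function names are our assumptions.

#include <stdint.h>

enum { W = 640, H = 480 };        /* assumed frame size */

/* Detection-style workload: one small, identical test per pixel. On
 * Xetal this body runs on the 320 pixel processors in lockstep; it is
 * written sequentially here for clarity. */
static void detection_pass(const int8_t *u, const int8_t *v, uint8_t *skin)
{
    for (long i = 0; i < (long)W * H; i++)
        skin[i] = (u[i] > -48 && u[i] < 0 &&   /* hypothetical UV box */
                   v[i] >   8 && v[i] < 64);
}

/* Recognition-style workload: few items (face blobs), many iterative,
 * data-dependent operations per item -- task-level code that suits the
 * VLIW DSP. */
static void recognition_pass(int n_faces, const float faces[][64 * 72],
                             int (*match_id)(const float *))
{
    for (int f = 0; f < n_faces; f++) {
        int id = match_id(faces[f]);   /* e.g. scan the face database */
        (void)id;                      /* hand the ID to the OS layer */
    }
}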
III. FACE DETECTION

In the face detection part we take an image from the sensor and detect and localize an unknown number (if any) of faces. Faces are found by colour-specific selection. By removing regions that are too small and by enforcing a certain aspect ratio on the selected region of interest (ROI), the detection becomes more reliable.

We detect skin parts in the image by searching for the presence of skin-tone coloured pixels or groups of pixels. The pixels as delivered by the colour interpolation routines of the CMOS sensor are in RGB form. This is not very suitable for characterizing skin colour: the components in RGB space represent not only colour but also luminance, which varies from situation to situation. By going to a normalized colour domain such as YUV, this effect is minimized [5], [6]. The YUV colour domain is more suitable for the detection because it separates the luminance (Y) from the colours (UV). The Y value can vary from 0 to 255, whereas U and V can have values from -128 to 128. A continuous auto-white-balance and exposure system ensures that the colour spectra are well defined, even under coloured lighting conditions.

Fig. 4. Skin region in the UV spectrum

By using the YUV colour domain, not only has the detection become more reliable, but the skin-tone indication has also become easier, because skin tone can now be indicated in a two-dimensional space. We defined the skin-tone region as a square in the UV spectrum (Figure 4). Everything in this region passes as a skin pixel; a result is shown in Figure 5. Some checks on Y are also performed to filter out the very high and very low brightness regions, where U and V are ill-defined.
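A possible implementation of this per-pixel test is sketched below in C. The conversion uses common BT.601-style integer coefficients; the Y limits and the bounds of the UV square are hypothetical placeholders (in practice they are tuned on skin samples).

#include <stdint.h>

/* Approximate RGB -> YUV conversion followed by the square skin test in
 * the (U,V) plane and a brightness check on Y. */
static int is_skin(uint8_t r, uint8_t g, uint8_t b)
{
    int y = ( 77 * r + 150 * g +  29 * b) >> 8;   /* Y in 0..255    */
    int u = (-43 * r -  85 * g + 128 * b) >> 8;   /* U in -128..127 */
    int v = (128 * r - 107 * g -  21 * b) >> 8;   /* V in -128..127 */

    if (y < 40 || y > 230)       /* U and V are ill-defined at extreme  */
        return 0;                /* low and high brightness             */

    return u > -48 && u < 0 &&   /* hypothetical square skin region in  */
           v >   8 && v < 64;    /* the UV spectrum                     */
}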
Fig. 5. The input face and classification of skin pixels; note the number of small detected regions that have to be removed.

To increase the reliability, the field of detected skin-tone and non-skin-tone pixels is filtered using an erosion and dilation filter. This filter removes pixel regions that are too small to be faces in the scene. The result for 3 faces in the scene is shown in Figure 6; eventually all three faces make up three detected regions.
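The paper does not give the filter dimensions; the sketch below assumes a single 3x3 erosion followed by a 3x3 dilation (a morphological opening) over the binary skin mask.

#include <stdint.h>
#include <string.h>

enum { W = 640, H = 480 };  /* assumed frame size */

/* Opening: isolated skin pixels and thin structures disappear, larger
 * blobs keep roughly their shape. in, out and tmp are W*H binary masks. */
static void open3x3(const uint8_t *in, uint8_t *out, uint8_t *tmp)
{
    memset(tmp, 0, (size_t)W * H);
    memset(out, 0, (size_t)W * H);
    /* erosion: a pixel survives only if its whole 3x3 neighbourhood is skin */
    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++) {
            int all = 1;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    all &= in[(y + dy) * W + (x + dx)];
            tmp[y * W + x] = (uint8_t)all;
        }
    /* dilation: grow the survivors back to roughly their original extent */
    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++) {
            int any = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    any |= tmp[(y + dy) * W + (x + dx)];
            out[y * W + x] = (uint8_t)any;
        }
}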
By imposing a (width:height) ratio of around (1:1.6) on the detected blobs, the face regions are separated from other skin-coloured blobs such as hands. The final result is a region of interest spanning only the face, as shown in Figure 7.

Horizontally and vertically through the face, a gray-level projection is performed whose minima enable the detection of the position of the eyes, in order to normalize the face blob around the eye positions before feeding it to the recognition phase [7]. See Figure 9 for the results and the principle. The Xetal does the horizontal projection; the TriMedia does the vertical projection and finds the minima.
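The projection step can be sketched as follows. This is our illustration: the ROI size follows the 64 x 72 normalization of Section IV, and the search band for the eye line is an assumption.

#include <stdint.h>

enum { ROI_W = 64, ROI_H = 72 };   /* assumed ROI size, cf. Section IV */

/* Row-wise gray-level projection over the face ROI. Eyes and eyebrows
 * are darker than the surrounding skin, so the minimum of the row sums
 * inside an upper search band is a plausible eye line. */
static int find_eye_row(const uint8_t roi[ROI_H][ROI_W])
{
    int best_y = ROI_H / 8;
    long best_sum = -1;

    for (int y = ROI_H / 8; y < ROI_H / 2; y++) {  /* search upper half */
        long sum = 0;
        for (int x = 0; x < ROI_W; x++)
            sum += roi[y][x];                      /* projection value  */
        if (best_sum < 0 || sum < best_sum) {
            best_sum = sum;
            best_y = y;
        }
    }
    return best_y;   /* a column-wise search locates the eyes likewise */
}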
Fig. 6. Skin Tone Detection Result

Fig. 7. Region of Interest result

Next to this projection data, the face detection part only sends the luminance and the coordinates of the face to the recognition part, as defined in Figure 8. The TriMedia will only address the relevant regions. This reduces the data content significantly.

Fig. 8. Coordinates of faces found in the input image

IV. FACE RECOGNITION

This section introduces the neural-net face recognition process. As input for the recognition process, the face blob detected in the previous section is normalized around the eye positions and then identified with respect to a face database.

Fig. 9. This image shows the projection data and the face blob region that is sent to the TriMedia; the latter processor detects the eyes and nose for normalization, as indicated by the crossing lines.

For this purpose a Radial Basis Function (RBF) neural network is used [8]. The reason behind using an RBF neural network is its ability to cluster similar images before classifying them.
RBF-based clustering has received wide attention in the neural-networks community. Apart from good clustering capabilities, RBF networks have a fast learning speed and a very compact topology.

IV-A. Architecture of RBF Neural Network

An RBF neural network structure is demonstrated in Figure 10. Its architecture is similar to that of a traditional three-layer feed-forward neural network.

Fig. 10. Architecture of an RBF Neural Network: n input nodes, fully connected to u hidden nodes, which are in turn fully connected to m output nodes.

The input layer of this network is a set of n units, which accepts the elements of an n-dimensional input feature vector. (Here, the RBF neural network input is the face obtained from the face detection part. Since it is normalized to a 64 x 72 pixel face, it follows that n = 64 * 72 = 4608.) The input units are completely connected to the hidden layer of u hidden nodes. Connections between the input and the hidden layers have fixed unit weights and, consequently, it is not necessary to train them. The purpose of the hidden layer is to cluster the data and decrease its dimensionality. The RBF hidden nodes are also completely connected to the output layer.

The number of outputs depends on the number of people to be recognized (for example, for 5 persons, m = 5). The output layer provides the response to the activation pattern applied to the input layer. The change from the input space to the RBF unit space is nonlinear, whereas the change from the RBF hidden unit space to the output space is linear.

The RBF neural network is a class of neural networks in which the activation function (basis function) of the hidden units is determined by the distance between the input vector and a prototype vector. The activation function of the i'th RBF hidden node is stated as follows [9]:

    phi_i(x) = phi_i(||x - c_i||),   i = 1, 2, ..., u        (1)

where x is an n-dimensional input feature vector (the normalized face), c_i is an n-dimensional vector called the center of the i'th RBF hidden node, sigma_i is also an n-dimensional vector called the width (also called radius) of the i'th RBF hidden node, and u is the number of hidden nodes. Normally, the activation function of the hidden nodes is selected as a Gaussian function with mean vector c_i and variance vector sigma_i, as follows:

    phi_i(x) = exp(-||x - c_i||^2 / sigma_i^2),   i = 1, 2, ..., u        (2)

Because the output units are linear, the response of the j'th output unit (among the m outputs) for input x is given as:

    y_j(x) = sum_{i=1}^{u} phi_i(x) * w(i, j) + b(j),   j = 1, 2, ..., m        (3)

where w(i, j) is the connection weight of the i'th RBF hidden node to the j'th output node, and b(j) is the bias of the j'th output.

IV-B. Using RBF neural network

The first step in face recognition is normalizing the region of interest (as shown in Figure 7) to the size of the faces stored in the identification database (64 x 72 pixels) and then feeding it to the neural network input. Subsequently, we calculate the output for each person, take the maximum value among the outputs, and report that as the recognized person. Figure 11 shows the main kernel for using the RBF neural network.

V. MEASUREMENTS AND PERFORMANCE

In this section we evaluate the performance of our algorithm. Since the face recognition, and not the detection part, turned out to be the major bottleneck, we concentrate on that part first. At the end of this section we evaluate the overall performance and recognition rate.
V-A. Face Recognition

The algorithms described for using the RBF neural network have certain demands on the processing power, bandwidth and flexibility of the architectural template. To measure the performance, we first need to extract the kernel loops and loop-nests of the RBF neural network. If we take the example of face recognition, the input is a normalized face (64 x 72 pixels, hence n = 4608 inputs), there are u = 20 hidden nodes (resulting in 4608 * 20 = 92160 weights between the input and hidden layers), and there are m = 5 output nodes, depending on the number of people to be recognized (hence 20 * 5 = 100 weights between the hidden and output layers).

/* L1: compute the output of each hidden node */
for (i = 0; i < NUM_HIDDEN; i++) {            /* NUM_HIDDEN = 20          */
    sum = 0.0f;
    for (j = 0; j < NUM_INPUT; j++) {         /* NUM_INPUT = 64*72 = 4608 */
        temp = data[j] - center_value[i][j];
        temp = temp * temp;
        temp = temp / sigma_value[i][j];
        sum = sum + temp;
    }
    out_hiddennode[i] = expf(-sum);           /* expf from <math.h>       */
}

/* L2: compute the network outputs */
for (i = 0; i < NUM_OUTPUT; i++) {            /* NUM_OUTPUT = 5 persons   */
    sum = 0.0f;
    for (j = 0; j < NUM_HIDDEN; j++)
        sum = sum + out_hiddennode[j] * weight[i][j];
    sum = sum + bias_value[i];
    output[i] = sum;
}

Fig. 11. Kernel for RBF Neural Network
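The classification step that follows the kernel is a simple maximum search over the outputs. A minimal sketch, reusing the names of Figure 11:

/* After loopnests L1 and L2 have run, the person whose output node
 * responds most strongly is taken as the recognized identity. */
static int classify(void)
{
    int best = 0;
    for (int i = 1; i < NUM_OUTPUT; i++)
        if (output[i] > output[best])
            best = i;
    return best;   /* index into the stored-face database */
}

The value output[best] itself can serve as the reliability measure that the operating system monitors (see Section V-D).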



V-B. Adapting for Real-time performance

We observed that most of the running time of the algorithm was spent on calculating the output of the hidden nodes and calculating the network output (see Figure 11). For example, if the number of hidden nodes is equal to 20 and we want to recognize the faces of five persons, the number of executed instructions on a TriMedia (... MHz) is about ..., and the number of cycles (taking memory delays into account) is about ..., which corresponds to an execution time of ... . This is far from real-time; therefore we employed several optimizations:

- Replace all division operations in the program.
- Use single-precision floating point instead of double precision.
- Use local variables instead of global variables.
- Perform loop-unrolling.

Then the number of executed instructions is reduced to ... and the number of cycles is reduced to ... (thus resulting in a ... shorter execution time).
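What the L1 inner loop might look like after these optimizations is sketched below. This is our illustration, not the tuned source: the division is replaced by a multiplication with a reciprocal table inv_sigma (inv_sigma[i][j] = 1.0f / sigma_value[i][j], precomputed once at database-load time), the accumulator is a local single-precision variable, and the loop is unrolled four times (NUM_INPUT = 4608 is divisible by 4).

for (int i = 0; i < NUM_HIDDEN; i++) {
    float sum = 0.0f;                       /* local accumulator      */
    const float *c  = center_value[i];
    const float *is = inv_sigma[i];
    for (int j = 0; j < NUM_INPUT; j += 4) {
        float t0 = data[j]     - c[j];      /* unrolled by four       */
        float t1 = data[j + 1] - c[j + 1];
        float t2 = data[j + 2] - c[j + 2];
        float t3 = data[j + 3] - c[j + 3];
        sum += t0 * t0 * is[j]     + t1 * t1 * is[j + 1]
             + t2 * t2 * is[j + 2] + t3 * t3 * is[j + 3];
    }
    out_hiddennode[i] = expf(-sum);
}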
V-C. Complexity

The execution time in the RBF loopnests is related to n, u, and m (see Figure 11). The time complexities for the first (L1) and second (L2) loopnests are:

    T_L1 = O(u * n),   T_L2 = O(m * u)        (4)

Therefore, the total time complexity is given by:

    T = T_L1 + T_L2 = O(u * n + m * u)        (5)

It is easily seen that the memory size required for allocating all variables has the same complexity:

    M = O(u * n + m * u)        (6)

Because n is independent of u and m (n is more or less constant, and equal to the number of characteristics in a face), execution time and memory size are linear in u and m [10].
video rate and with more possible faces per frame (a
V-D. Overall practical performance

Our algorithms have been mapped onto a handheld camera device, as shown in Figure 1. After being programmed via a host computer over a FireWire or Ethernet link, the camera starts running the face recognition application stand-alone. Because faces are recognized at video rate, with possibly more than one face per frame (a maximum recognition rate of ... faces per second is possible), an operating system running in the camera has to control the reporting process. The operating system obtains the IDs of the recognized persons and monitors the reliability of recognition as reported by the face recognition part.
If this reliability is high enough, a person is positively identified and will not be reported again in subsequent frames until he/she leaves the scene or another person shows up. A connected speech synthesiser reports the name of the identified person, asks an unknown person to identify himself, or instructs the persons in the scene to look at the camera or to approach it more closely.
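The reporting behaviour described above can be sketched as follows. This is our reconstruction, not the actual operating-system code; the function names and the reliability bound are hypothetical.

#include <stdbool.h>

enum { NUM_PERSONS = 5 };
void speak_name(int id);                 /* speech-synthesizer hook (assumed) */

static bool announced[NUM_PERSONS];      /* already reported this visit?      */

/* Called for every face the recognizer accepts in the current frame. */
void on_recognized(int id, float reliability)
{
    const float THRESHOLD = 0.8f;        /* hypothetical reliability bound    */
    if (reliability >= THRESHOLD && !announced[id]) {
        speak_name(id);                  /* greet the person exactly once     */
        announced[id] = true;
    }
}

/* Called when a previously seen person is no longer in the scene. */
void on_person_left(int id)
{
    announced[id] = false;               /* allow re-reporting on a new visit */
}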

The overall performance we reach ranges from a recognition rate of ...% with a false detection rate of ... out of ..., to a recognition rate of ...% with a false detection rate of ... out of ..., depending on the settings. These numbers are for a real-time (up to ... faces per second) stand-alone system with ... stored "identifiable" faces.

VI. CONCLUSIONS AND FUTURE WORK

Face recognition is becoming an important application for smart cameras. However, up till now, the processing required for real-time detection prohibits integration of the whole application into a small-sized, consumer type of camera. This paper showed that by:
1. Proper selection of algorithms, both for face detection and recognition,
2. Adequate choice of processing architecture, supporting both SIMD and ILP types of parallelism,
3. Tuning the mapping of the algorithms onto the selected architecture,
this integration can be achieved. We implemented the algorithms on a small smart camera. As a result, we can recognize one face per ... ms when searching for 5 persons, with a ...% recognition rate and only a ...% failure rate.

Future research will focus on further tuning the mapping of the algorithms, e.g. by replacing floating-point operations with fixed-point ones, trying other (cheaper) activation functions (see eq. 2), and further parallelizing the RBF neural network. This should allow for the further speedups needed when searching much larger databases that can contain large numbers of identifiable faces.

A major part of future work will also be to use the audio feedback in a better way, and to increase the reliability of recognition, which is too low now for professional systems [11]. Although the processing time will probably increase, we believe that the performance will remain highly sufficient.

VII. REFERENCES

[1] E. Hjelmas and B. K. Low, "Face detection: a survey," Computer Vision and Image Understanding, vol. 83, pp. 236-274, 2001.
[2] Centre For Industrial Technology. http://www.cft.philips.com/, 2003.
[3] A. Abbo and R. Kleihorst, "Smart cameras: Architectural challenges," in Proceedings of ACIVS 2002 (Advanced Concepts for Intelligent Vision Systems), Gent, Belgium, 2002.
[4] TriMedia Technologies. http://www.trimedia.com, 2003.
[5] T. Majoor, "Face detection using color based region of interest selection," tech. rep., University of Amsterdam, Amsterdam, NL, 2000.
[6] R. L. Hsu, M. Abdel-Mottaleb, and A. K. Jain, "Face detection in color images." http://www.cse.msu.edu/~hsureinl/facloc/index_facloc.html, 2003.
[7] F. Zuo and P. H. de With, "Fast human face detection using successive face detectors with incremental detection capability," Proc. SPIE, no. 5022, 2003.
[8] J. Haddadnia, K. Faez, and P. Moallem, "Human face recognition with moment invariants based on shape information," in Proceedings of the International Conference on Information Systems, Analysis and Synthesis (ISAS 2001), vol. 20, Orlando, Florida, USA, International Institute of Informatics and Systemics, 2001.
[9] Y.-H. Hu and J.-N. Hwang, eds., Handbook of Neural Network Signal Processing. CRC Press, 2002.
[10] H. Fatemi, R. P. Kleihorst, and P. Jonker, "Real time face recognition on a smart camera," in Proceedings of ACIVS 2003 (Advanced Concepts for Intelligent Vision Systems), Gent, Belgium, 2003.
[11] Electronic Privacy Information Center. http://www.epic.org/privacy/facerecognition, 2003.
