
Detection of Humans Using Color Information

Vít Novák

May 23, 2001

Declaration

I declare that I worked on this diploma thesis independently, with the guidance of my supervisor, and that I have used no literature other than that listed in the bibliography. I also declare that I have no objections to the use of the results of this work by the Faculty of Electrical Engineering of the Czech Technical University (CVUT).

Prague, May 23, 2001

Vít Novák

Abstract

In this report, a method for the detection of human faces in color images is described, based on human skin color detection and a subsequent segmentation and verification scheme. Using the large Compaq database, skin and non-skin color models are built on the basis of a simple histogram approximation of the probability distributions, and the Neyman-Pearson classifier used in [8] for pixel-level skin detection is designed. A connected component analysis for reducing the detection errors is proposed, resulting in a slight improvement of the ROC curve in comparison with [8]. Using this method and the output mask from the skin classifier, a segmentation algorithm based on a region growing technique using the color distribution is applied. The set of face candidates obtained from the region growing algorithm is analyzed in a very simple rejection scheme. We show respectable results from the face detection algorithm, although the face shape model is very primitive.

Annotation

This thesis describes a method for the detection of faces in color images. The main elements of the method are skin detection, segmentation of homogeneous regions and subsequent verification of the individual regions. Two models are used for skin detection: a histogram of skin color and a histogram of the remaining colors (non-skin). These models serve as approximations of the probability distributions of skin and non-skin color, and with their help a Neyman-Pearson skin color classifier, first implemented in [8], is designed. The classifier output is used for the subsequent segmentation, which employs a region growing technique based on color and spatial distribution. The resulting set of candidates is then verified using a primitive spatial model of the face. Although this face model is very simple, the resulting algorithm shows promising results.

I would like to express my thanks to Mr. Jones, who provided us with the Compaq image database, and especially to my supervisor Mr. Matas for his time and for the great deal of help he gave me.

Contents

1 Introduction
2 Related published work
  2.1 Segmenting Hands of Arbitrary Color [19]
  2.2 Robust Face Tracking using Color [13]
  2.3 Self-Organized Integration of Visual Cues for Tracking [17]
  2.4 Statistical Color Models with Application to Skin Detection [8]
  2.5 Three Approaches to Pixel-level Human Skin Detection [2]
  2.6 Tracking Interacting People [9]
  2.7 Segmentation and Tracking of Faces in Color Images [14]
  2.8 Detecting Human Faces in Color Images [18]
3 Training Dataset
  3.1 XM2VTS Database [10]
  3.2 Compaq Database [7]
  3.3 WWW Face Database
4 Statistical approaches to skin detection
  4.1 Bayesian Theory
  4.2 The Neyman-Pearson Strategy
  4.3 Non-random intervention
  4.4 Experiments
5 A Multi Color Model for Face
  5.1 Face color model
  5.2 Experiments
6 Connected Components
  6.1 Single face color distribution
  6.2 Region Growing
  6.3 Experiments
    6.3.1 Color variability of the region
    6.3.2 Difference at the boundary
    6.3.3 Compactness of the region
    6.3.4 Homogeneous regions detection
    6.3.5 Starting points
7 Face detection
  7.1 Experiments
8 Discussion and conclusions
A Applications description
  A.1 Face detector
    A.1.1 Application control
    A.1.2 Functions description
  A.2 Selector
    A.2.1 Application control
    A.2.2 Functions description

Chapter 1 Introduction
The great increase in computational resources over the last decade has provided computers with the means to solve many computer vision tasks, such as object tracking and recognition. Among these tasks, tracking and recognition of humans play one of the most important roles. Human motion tracking is essential for perceptual user interfaces, indoor and outdoor surveillance, sign-language recognition, efficient video coding, etc. Recognition of human faces is useful in any user identification system or face database management. Detection of humans can be used as an initializing step for human motion, lip, or gesture tracking, as a localization step for face recognition, and for image database queries.

The task of detecting people in a static image can be defined as a process with an image at the input and a set of positions of humans at the output. A simplified version of this task would only return the number of people in the image, or just signal whether any people are present. We seek a method which would solve the first task under a large range of image conditions.

Color is an often used feature in human motion tracking, because it is well suited as an orientation- and scale-invariant feature for segmentation and localization. The color of human skin fills only a small fraction of the whole color space, and thus its frequent appearance in an image can be a pointer to a human presence. To localize people in images, the face is used very often, since the shape of the head is approximately invariant to different postures, and facial features such as the eyes and mouth are well suited to distinguish the face from other parts of the human body.

In our work, we have studied the skin detection problem based on the color of individual pixels. The skin color model is built using a large database of images containing human skin. Using this model, together with the color model of non-skin colors occurring in the images, we have built a skin pixel classifier based on the statistical theory of pattern recognition, which was first implemented by Jones and Rehg [8]. To detect faces in the scene, we attempted to build a multi-color model of the human face using the colors of skin, lips and hair and a similar statistical approach, but we found this model to be too erroneous for any robust application. Instead, we have proposed a connected component analysis, interconnected with a color-based segmentation method. Finally, we apply a very simple validation scheme for face detection based on the shape features of the human face.

The report is arranged in the following way. Chapter 2 introduces several projects solving human tracking and detection tasks, generally based on color. The training image datasets used are briefly described in Chapter 3, and the statistical approach to the skin detection problem is analyzed in Chapter 4. The proposed multi-color face model is outlined in Chapter 5 and results are shown. The proposed segmentation technique is reported and examined in Chapter 6, and the final algorithm for face localization, together with the simple verification scheme, is presented in Chapter 7. Finally, the results are discussed in Chapter 8.

Chapter 2 Related published work


The problem of detecting people arises in many applications of computer vision, e.g. in the areas of human-computer interaction, video coding, lip tracking and surveillance. For this purpose, color information is widely used. In this description of related published work, we focus on detection methods which use color as the primary feature. Our attention is directed mainly at the human skin detection part; however, a brief overview of each application area is given as well. The key method for our skin segmentation technique is based on [8, 2], which are described in Sections 2.4 and 2.5.

2.1 Segmenting Hands of Arbitrary Color [19]


In human-computer interaction, visual perception of hands can be very useful, since the number of possible expressions with hand gestures is large and distinguishing between individual gestures is relatively easy. Zhu and Waibel use color to build a model for a hand segmentation method, based on a simple statistical approach using Bayes decision theory. The hand segmentation is used in a wearable computer application: when the user moves his open hand into the field of view of the camera, menu items are shown at each finger, and the user can select one item by bending the relevant finger.

Color is used to model both the hand and the background, but the method does not require a predefined skin color model. Instead, it generates a hand color model and a background color model for any given image; the generated models are then used to segment the hand. The key to building the model is prior knowledge about the hand position in the image, expressed as a probability distribution over the image pixels which is learned from the training data.

Models are generated using a restricted EM algorithm, meaning that one of the mixture components has a fixed mean and a limited prior probability. This component is used to model the hand color, while the remaining components model the background. The mean of the hand color (the first component) is estimated from the given image using the position distribution, and its weight is fixed at a value obtained from the training data. As soon as the hand color model, the background color model and the prior probability are known, the Bayesian rule can be applied to each pixel in the given image.

Since the color models are built for each image separately, prior knowledge of the position distribution and prior probability is necessary. The hand then needs to appear at a particular place in the image and should have approximately the same size as in the training data. Another limitation stems from the use of a single Gaussian for modeling the hand color: the method can fail if the hand color distribution is not consistent.
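As an illustration of the per-pixel Bayes decision described above, the following sketch classifies a pixel as hand or background. It is a minimal, hypothetical reconstruction: the single-Gaussian hand model with independent per-channel densities and the prior value are assumptions made for illustration, not the actual parameters used by Zhu and Waibel.

```python
import math

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density, used here per color channel for simplicity."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def p_hand_given_color(c, hand_model, bg_model, prior_hand=0.3):
    """Posterior P(hand | color) via the Bayes rule.

    hand_model and bg_model are lists of (mean, variance) pairs, one pair per
    color channel; independence across channels is assumed to keep the sketch
    short. prior_hand plays the role of the fixed component weight.
    """
    p_c_hand = 1.0
    p_c_bg = 1.0
    for x, (m_h, v_h), (m_b, v_b) in zip(c, hand_model, bg_model):
        p_c_hand *= gaussian_pdf(x, m_h, v_h)
        p_c_bg *= gaussian_pdf(x, m_b, v_b)
    num = p_c_hand * prior_hand
    den = num + p_c_bg * (1.0 - prior_hand)
    return num / den if den > 0 else 0.0

# A pixel would be labeled "hand" when the posterior exceeds 0.5:
# label = p_hand_given_color(pixel, hand_model, bg_model) > 0.5
```

In the actual method the background is a mixture of several Gaussians rather than a single one; the single background component above only keeps the example compact.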

2.2 Robust Face Tracking using Color [13]


Schwerdt and Crowley describe a robust face tracking technique used for video coding. The technique has two components: 1) a face tracking system which keeps the face centered in the image at a particular size, and 2) an orthogonal basis coding technique, in which the normalized face image is projected onto a space of basis images. We will focus on the tracking technique. Because the face is a highly curved surface, the observed intensity of a face exhibits strong variations. These variations are eliminated by transforming the color space into a two-dimensional intensity-normalized chromaticity space.

To detect the face in an image, the Bayes rule is applied to every pixel, using two histograms: the color distribution of the skin P(r,g | skin) and the color distribution of the whole image P(r,g). The histogram of skin color is made from a region of an image known to contain skin; it is initialized by eye-blink detection and updated within the tracking procedure. The result of the Bayes rule, applied to the image, is the probability map

P(skin | r,g) = P(r,g | skin) P(skin) / P(r,g)

for each pixel. Using this map, the skin pixels are grouped together by computing the position and spatial extent of the color region. To reduce the jitter caused by outlying pixels with skin color (e.g. hands), the probability map is weighted by a Gaussian function placed at the location where the face is expected. The initial estimate of the covariance of this Gaussian is the size of the expected face. Once initialized, the covariance is estimated recursively from the previous image using the mean and covariance of the obtained skin region.

The use of a Gaussian weighting function for new input data can lead to a problem if the object moves above a certain speed: if the speed is too high, the product of the new skin position and the old Gaussian function would vanish. The authors use a Kalman filter to eliminate this problem.

2.3 Self-Organized Integration of Adaptive Visual Cues for Face Tracking [17]
Another tracking method, proposed by Triesch and von der Malsburg, uses several cues to localize a human and focuses on integrating the individual results obtained from the cues rather than on providing a more sophisticated model of skin color. The integration method can nevertheless be viewed as a basic scheme for robust tracking using more specific models. To cope with extensive environmental changes such as illumination and background changes, or occlusions by other people, the system uses several cues which agree on a result, and each cue adapts towards the result agreed upon. Experiments have shown that the system is robust as long as the changes in the environment disrupt only a minority of the cues at the same time, although all cues may be affected in the long run.

The system uses five visual cues for detecting the head of the tracked person. Each cue i produces a saliency map A_i(x), with values between zero and one; positions with values close to one indicate high confidence that the head is present. To extract the saliency map, the cues are compared at each position in the current image with a cue prototype.

For the integration, each cue disposes of a reliability coefficient r_i, with the reliabilities summing to one. The saliency maps are combined into a total result by computing a weighted sum, with the reliabilities acting as weights. The estimated target position is the point yielding the highest total result. After finding the result, a quality q_i is defined for each cue, which measures how successful the cue was in predicting the total result. The qualities are normalized so that they sum to one, and the reliabilities are updated towards them, with a time constant acting as a flexibility parameter. When the position is found, the prototypes are updated in the same way as the reliabilities, using the feature vectors extracted at the estimated position.

Since the system aims at robustness, it is more concerned with the integration of the cues than with the cues themselves, so very simple techniques are employed to extract them:

Intensity change tries to detect motion based on the thresholded difference of subsequent images; the threshold is a fixed value, so this cue is not updated.

The color cue is computed by comparing the color of each pixel to a region of skin colors in HSI (hue, saturation, intensity) color space. If the pixel falls within the interval of allowed values, the result is one, otherwise zero. The prototype color region is adapted towards the color average taken from the neighborhood of the estimated position, provided the standard deviation does not exceed a certain threshold.

Motion continuity tries to forecast the actual position of the person from the last two estimated positions. This cue is not adaptive.

The shape cue computes the correlation of a grey-level template of the target with the image. High correlations indicate a high likelihood of the target being at that particular position.

The tracking technique was tested on 84 image sequences of people crossing a room. The testing sequences fall into six classes, ranging from normal, where the person just crosses the room, via lighting changes and turning, to occlusions by another moving person. The system is initialized with the reliabilities of the color and intensity change cues set to 0.5 and all other reliabilities set to zero; this choice reflects limited a priori knowledge about the target. A detection threshold was defined for deciding whether the person was in the scene or not, by comparing it with the total result. The most important parameters are the time constants for the adaptation and the detection threshold. If the system adapts too fast, it has no memory and is easily disturbed by high temporal changes. If adaptation is too slow, the system has problems with harmless changes occurring in quick succession: it does not adapt fast enough and the result is a missed detection. If the detection threshold is too high, relatively small changes in the scene result in missed detections; vice versa, if it is too low, the system tends to track the background after the person leaves the room. The rate of correct tracking over the whole set of sequences was 58 %. The worst disturbance proved to be occlusion by another person, which is understandable, since the color, contrast and intensity change cues can give the same vote for both people.
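The integration scheme above (a reliability-weighted sum of saliency maps, a winner-take-all position estimate, and a reliability update governed by a time constant) can be sketched as follows. The discrete update rule r_i <- r_i + (q_i - r_i)/tau is an assumed discretization of the adaptation dynamics, and representing a saliency map as a dict from position to value is purely illustrative.

```python
def integrate_cues(saliency_maps, reliabilities):
    """Weighted sum of per-cue saliency maps; the reliabilities act as
    weights (they are assumed to sum to one)."""
    return {x: sum(r * m[x] for r, m in zip(reliabilities, saliency_maps))
            for x in saliency_maps[0]}

def estimate_target(total_map):
    """The estimated target position is the point with the highest
    combined saliency."""
    return max(total_map, key=total_map.get)

def update_reliabilities(reliabilities, qualities, tau=10.0):
    """Move each reliability towards the normalized quality of its cue;
    tau is the time constant controlling the adaptation speed."""
    return [r + (q - r) / tau for r, q in zip(reliabilities, qualities)]
```

Because the qualities are normalized to sum to one, the update preserves the normalization of the reliabilities, which keeps the weighted sum a convex combination of the cue outputs.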

2.4 Statistical Color Models with Application to Skin Detection [8]


The existence of large image datasets, such as photos on the World Wide Web, makes it possible to build powerful generic models for low-level image attributes like color using simple histogram learning techniques. Jones and Rehg use color models constructed from nearly 1 billion labeled pixels. They compare the performance of mixture models and histograms and find the histogram model to be superior in accuracy and computational cost. Using aggregate features computed from the skin detector, an effective detector for naked people is built.

The histogram models were constructed using a subset of 13,640 photos sampled from the total set of 18,696 photographs. In the 4675 photos containing skin, the skin pixels were segmented by hand. Pixels not labeled as skin were discarded, to reduce the chance that segmentation errors would contaminate the models. The labeled pixels were placed into the skin histogram model.


(The contrast at a position x is defined as the standard deviation of the grey-level values of the pixels within a small image region centered at x.)

Figure 2.1: ROC curves from [8] for a family of skin detectors based on different histogram and mixture models. The best ROC curve (number 5) is the result of a histogram model built from the full dataset.

The remaining 8965 photos, which did not contain any skin pixels, were placed into the non-skin color histogram model. Given these two histograms, the probability that a given pixel value belongs to the skin and non-skin classes can be computed:

P(c | skin) = s[c] / T_s,    P(c | non-skin) = n[c] / T_n,

where c denotes an RGB triple, s[c] is the count in bin c of the skin histogram, n[c] is the count in bin c of the non-skin histogram, and T_s and T_n are the total pixel counts in the two histograms. As soon as the skin and non-skin distributions are available, a skin classifier can be obtained: a particular RGB value is labeled as skin if

P(c | skin) / P(c | non-skin) >= Theta,

where Theta is a threshold. The prior probabilities of skin and non-skin are unknown and depend upon application-specific costs, expressed by the required false positive or false negative error. This problem had been solved by Neyman and Pearson (see Section 4.2). An important property of a pixel classifier is its receiver operating characteristic (ROC) curve, which shows the relationship between the false negative and false positive errors. The ROC can be used to compare the performance of classifiers. Rehg and Jones show a comparison between classifiers based on histograms and mixture models (see Figure 2.1). Mixture models using the EM algorithm [3]

have been popular in earlier skin color modeling work. Separate mixture models with 16 Gaussians were fitted to the full skin and non-skin pixel data. The histogram model was found to be superior to the mixture models. In addition, Rehg and Jones regard the histogram model trained on the full dataset as the best one in classification performance.
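A minimal sketch of the histogram-based skin classifier of [8]: two quantized RGB histograms are trained, and a pixel is labeled as skin when the class-conditional likelihood ratio exceeds a threshold Theta. The bin count (32 per channel) and the toy training data here are illustrative assumptions, not the actual Compaq-trained models.

```python
def rgb_bin(pixel, bins=32):
    """Map an 8-bit RGB triple to a coarse histogram bin. The number of bins
    per channel is a free parameter; 32 is used only for illustration."""
    step = 256 // bins
    r, g, b = pixel
    return (r // step, g // step, b // step)

def train_histograms(skin_pixels, nonskin_pixels, bins=32):
    """Count labeled training pixels into the skin and non-skin histograms."""
    skin, nonskin = {}, {}
    for p in skin_pixels:
        c = rgb_bin(p, bins)
        skin[c] = skin.get(c, 0) + 1
    for p in nonskin_pixels:
        c = rgb_bin(p, bins)
        nonskin[c] = nonskin.get(c, 0) + 1
    return skin, nonskin

def is_skin(pixel, skin, nonskin, theta=1.0, bins=32):
    """Label a pixel as skin if P(c|skin) / P(c|non-skin) >= theta."""
    c = rgb_bin(pixel, bins)
    p_s = skin.get(c, 0) / sum(skin.values())
    p_n = nonskin.get(c, 0) / sum(nonskin.values())
    if p_n == 0:
        return p_s > 0
    return p_s / p_n >= theta

Sweeping theta traces out the ROC curve of Figure 2.1: a larger theta lowers the false positive rate at the cost of more missed skin pixels.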

2.5 A Comparative Assessment of Three Approaches to Pixel-level Human Skin Detection [2]
Brand and Mason refer to [8] and compare three approaches to pixel-level skin detection. The first two approaches use simple ratios and color space transforms respectively, whereas the third is the approach implemented by Jones and Rehg [8]. The Compaq skin and non-skin database (see Section 3.2) is used to quantitatively assess the three approaches.

The first approach is based on the very simple observation that human skin tends to have a predominance of red, and thus the ratio of the red and green channels is likely to be a good detector. The second approach employs a 3-D color space transformation designed explicitly for skin detection, leading to classification of skin pixels using just one dimension. The third technique uses the discrete 3-D probability map of skin color obtained from the Compaq database and first implemented by Jones and Rehg [8].

For the assessment, the threshold between skin and non-skin colors was set so that a fixed proportion of skin pixels was accepted. Each method was then applied to the testing dataset and the false positive error was determined. This test corresponds to calculating a single point of the ROC curve shown in Figure 2.1. The ratio method was further extended by evaluating additional channel ratios which seemed to be significantly influenced by skin color. The lowest false acceptance error was achieved using the probability maps, in comparison to the tailored transform with a single-dimension threshold and to the ratio discrimination; the extended ratio approach provided only a poor improvement.
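The ratio approach can be sketched as follows, assuming the detector thresholds the R/G ratio and that the threshold is calibrated so that a chosen fraction of training skin pixels is accepted. The fraction and the sample pixels below are illustrative, since the exact figures from [2] are not reproduced in this copy.

```python
def rg_ratio(pixel):
    """R/G ratio of an RGB pixel; skin tends to have a predominance of red."""
    r, g, _ = pixel
    return r / g if g else float("inf")

def calibrate_threshold(skin_pixels, accept_fraction=0.95):
    """Choose the R/G threshold so that accept_fraction of the training skin
    pixels pass. The fraction is an assumption, not the value used in [2]."""
    ratios = sorted(rg_ratio(p) for p in skin_pixels)
    idx = int((1.0 - accept_fraction) * len(ratios))
    return ratios[idx]

def is_skin_ratio(pixel, threshold):
    """A pixel is accepted as skin when its R/G ratio reaches the threshold."""
    return rg_ratio(pixel) >= threshold
```

Fixing the acceptance rate on skin pixels and then measuring the false positive rate on non-skin pixels is exactly the single-ROC-point comparison described above.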

2.6 Tracking Interacting People [9]


McKenna describes a system for tracking multiple people in a relatively unconstrained environment. Tracking is performed at three levels of abstraction: regions, people and groups. Color and gradient information is used to extract the moving objects. Color information is then used to disambiguate occlusions and to estimate the depth ordering and positions during occlusion. Since the system is robust rather than specific (unlike [17]), it can be viewed as complementary to tracking methods that use more specific human models.

The basic assumption is that the camera is stationary and that the changes in the environment are slow relative to the motion of the people in the scene. The camera color channels are also assumed to have Gaussian noise, whose variance parameters are estimated for the given camera. To model micro-motions in the scene, such as moving leaves, a color model based on a Gaussian distribution is estimated for each pixel: the stored background model for a pixel consists of a mean and a variance for each color channel. The model is adapted on-line using simple recursive updates in order to cope with slow environmental changes; adaptation is only performed in regions which the higher-level grouping process labels as background. Given a new color value x of a background pixel, the updates are performed with an update parameter alpha, e.g.

mu <- (1 - alpha) mu + alpha x

for all color channels. Before a color pixel is considered as background, it is compared to the model: if the pixel deviates from the mean by more than a fixed multiple of the standard deviation in any color channel, the pixel is set to the foreground.

The assumption that illumination changes slowly is violated when the change is due to a shadow cast by people moving in the scene. To deal with shadows, the RGB color space is replaced with chromaticities to reduce the impact of intensity changes. But often there may be no difference in chromaticity between foreground and background (e.g. a dark green coat in front of grass). To cope with such cases, gradients at every position in the image are estimated using the Sobel masks in each direction. Each pixel's gradient is modeled using gradient means and magnitude variances, and in addition the average variances are computed. If the gradient change magnitude exceeds a certain value for any color channel, the pixel is set to the foreground.

Once the foreground is extracted, the tracking of moving people is performed at three levels of abstraction. Regions are connected components that are tracked consistently over time. A person consists of one or more regions grouped together; each person has a support map (mask), a timestamp and an appearance model based on color. People join into a group if they share a region. A person is initialized when one or more regions satisfy a set of rules such as close proximity, overlap in the x-axis projection, etc.

In order to track people consistently as they enter and leave groups, each person's appearance must be modeled. A color model is built and adapted for each person being tracked; the model is adapted only while the person is alone, since the segmentation is not reliable within a group. Color distributions were modeled using both histograms and Gaussian mixtures. The color distribution of a person is updated adaptively by blending the stored model with the distribution estimated from the current frame. For each person in a given group and for each pixel from the group's mask, the probability of the pixel belonging to that person is computed. When a group of several people splits, the individual people's color models are used to determine who belongs to each new group: histogram color models are matched against the histogram computed from each newly created group, and people are allocated to groups by maximizing the normalized match values.
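One common formulation of the per-pixel background model and the foreground test described above can be sketched as follows. The exact update equations and the deviation multiple k used by McKenna are not legible in this copy, so the exponential update and the k-sigma test below are assumed standard choices rather than the paper's own.

```python
def update_background(mean, var, pixel, alpha=0.05):
    """Recursive update of the per-channel background mean and variance;
    alpha is the adaptation rate, and adaptation is assumed to run only on
    pixels labeled as background by the higher-level grouping."""
    new_mean = [(1 - alpha) * m + alpha * x for m, x in zip(mean, pixel)]
    new_var = [(1 - alpha) * v + alpha * (x - m) ** 2
               for v, m, x in zip(var, new_mean, pixel)]
    return new_mean, new_var

def is_foreground(mean, var, pixel, k=3.0):
    """A pixel is foreground if it deviates from the background model by
    more than k standard deviations in any color channel."""
    return any(abs(x - m) > k * v ** 0.5
               for x, m, v in zip(pixel, mean, var))
```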

The authors claim that the described scheme of background subtraction is quite robust even in a relatively unconstrained environment, and that the use of adaptation even allows the system to cope with brief camera motion without complete failure. The weakness lies in the model of a person, which is based only on color (mostly the color of the clothes): tracking would fail in grouping people dressed in a similar manner.
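A plausible concrete form of the normalized match values used when groups split is the normalized histogram intersection sketched below. This particular score is an assumption consistent with the description above, not necessarily the exact measure of [9].

```python
def match_score(group_hist, person_hist):
    """Normalized histogram intersection between a new group's color
    histogram and a stored person model (both dicts: color bin -> count)."""
    inter = sum(min(count, person_hist.get(b, 0))
                for b, count in group_hist.items())
    total = sum(group_hist.values())
    return inter / total if total else 0.0

def allocate(group_hist, person_models):
    """Assign the group to the person model with the highest match value."""
    return max(person_models,
               key=lambda p: match_score(group_hist, person_models[p]))
```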

2.7 Segmentation and Tracking of Faces in Color Images [14]


Another segmentation method that uses color information for segmenting and tracking faces is proposed by Sobottka and Pitas. In addition, they use a simple model of the shape features of the face. First, skin-like regions are determined based on the color attributes hue and saturation. Then regions of elliptical shape are selected as face hypotheses, and these are verified by searching for facial features in their interior. After the face is detected, it is tracked over time using an active contour model. We focus mostly on the segmentation method and the face candidate verification.


The face segmentation is done simply by thresholding the color input image using predefined domains of hue and saturation that describe the human skin color. The use of the HSV color space is motivated by its similarity to the human perception of colors; the luminance (Value) component is discarded to obtain robustness against changes in illumination and shadows. Given the segmented skin image, connected component analysis is performed and shape information is evaluated for each component. For the shape analysis, a best-fit ellipse is computed based on moments, and the distance between the ellipse E and the component C is then determined by counting the holes inside the ellipse (ellipse pixels not covered by the component) together with the pixels of the component which lie outside the ellipse.

Connected components which are well approximated by their best-fit ellipse are then verified by searching for facial features. The facial feature extraction is based on the observation that the facial features differ in brightness from the other parts of the face (the eyes and mouth are usually darker than the surrounding skin). To analyze the facial feature positions, first the y-projection is determined by computing the mean grey-level value of every row of the connected component; then minima and maxima are searched for in the smoothed y-relief. For each significant minimum of the y-relief, an x-relief is computed by averaging the grey-level values of 3 neighboring rows; the x-relief is smoothed and its minima and maxima are found as well. For every minimum in the y-relief, the spatial positions of the minima and maxima in the x-relief are analyzed to find facial features. As a result, a set of facial feature candidates is obtained. The candidates are clustered according to their left and right x-coordinates to reduce their number.

Once the facial feature candidates are available, all possible constellations are built, and each one is assessed based on the vertical symmetry of the constellation, the distances between facial features, and the assessment of each facial feature; incomplete constellations are considered as well. The best constellation is chosen to represent the facial features, and the component's contour is considered to be the face contour.

After the segmentation step, a method for tracking the face contours known as active contours (snakes) is applied. An active contour is a deformable curve which is influenced by its interior and exterior forces: the interior forces impose smoothness constraints on the contour, while the exterior forces attract the contour to significant image features.
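The ellipse/component distance described above can be sketched on binary pixel sets as follows. Normalizing by the ellipse area is an assumption, since the exact normalization used in [14] is not legible in this copy.

```python
def ellipse_distance(component, ellipse):
    """component and ellipse are sets of (x, y) pixel coordinates.
    Counts ellipse pixels not covered by the component ("holes") plus
    component pixels outside the ellipse, normalized by the ellipse area.
    Small values mean the component is well approximated by the ellipse."""
    holes = len(ellipse - component)
    outside = len(component - ellipse)
    return (holes + outside) / len(ellipse)
```

A component would then be kept as a face hypothesis when this distance falls below a threshold.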

The segmentation was tested on the M2VTS database. About 90 % of the facial features were correctly detected, including eyebrows, nostrils and chin. However, the images in the database contain only one face in a quite limited environment (single-color background, limited illumination conditions, etc.), and thus the segmentation method could not be examined very thoroughly.

2.8 Detecting Human Faces in Color Images [18]


Yang and Ahuja propose a method for face detection in unconstrained color images. The detection method is based on color and shape information. Multi-scale segmentation is utilized to cope with occlusions (e.g. by hands and arms) and with the scale problem (detecting faces at different scales). The skin color model is built to capture the chromatic characteristics of skin using the CIE LUV color space, discarding the luminance value, and it is approximated by a Gaussian distribution. The Gaussian is based on a histogram built from about 500 images containing human skin of different races and was accepted using a statistical goodness-of-fit test. A pixel is identified as having skin color if the corresponding probability is greater than a threshold.

Multi-scale segmentation is used to extract a hierarchy of regions for matching. A segmented region is classified as a skin region if most of the pixels inside it are classified as skin pixels. From the coarsest to the finest scale, the regions of skin color are merged until the shape is approximately elliptic. The orientation of the elliptic region is computed using the moments of inertia, and the region becomes a face candidate if the ratio of the major axis to the minor axis is less than a threshold (1.7). If a candidate region has some darker regions or holes inside the merged region, it is classified as a human face. A few examples of correct detection for different scales, rotations and background conditions are shown.
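The face-candidate test of [18] (ratio of major to minor axis below 1.7, computed from moments of inertia) can be sketched as follows. Representing a region as a list of pixel coordinates and deriving the axes from the eigenvalues of the second central moments is an implementation choice made for illustration.

```python
def axis_ratio(points):
    """Ratio of the major to the minor axis of a region's best-fit ellipse,
    computed from the second central moments (the eigenvalues of the 2x2
    covariance matrix of the pixel coordinates)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Eigenvalues of [[cxx, cxy], [cxy, cyy]]
    t = cxx + cyy
    d = ((cxx - cyy) ** 2 + 4 * cxy ** 2) ** 0.5
    lam_max, lam_min = (t + d) / 2, (t - d) / 2
    # Axis lengths are proportional to the square roots of the eigenvalues.
    return (lam_max / lam_min) ** 0.5 if lam_min > 0 else float("inf")

def is_face_candidate(points, max_ratio=1.7):
    """Accept a region as a face candidate if it is not too elongated."""
    return axis_ratio(points) < max_ratio
```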


Chapter 3 Training Dataset


Since it is hard to define the skin color in the color space a priori, we use a training dataset to build the skin color model. The skin and non-skin histograms were built using the Compaq image database. To examine the detection method's performance, a set of testing pictures that can well represent the range of possible occurrences in the practical application is necessary. The extent of the required set can be defined by the illumination conditions and camera restrictions, background complexity restrictions and occlusions. The detection performance can vary a good deal across different datasets. Although results obtained using a set with strong restrictions (e.g. the XM2VTS database) can be very promising, the method can easily fail when analyzing a set of pictures from a less restricted database. We have used three sets of images to test our detection method's performance. Usually the XM2VTS database was used to examine first results and test the method's correctness. Then the other ones were used to test the performance on unconstrained pictures.

3.1 XM2VTS Database [10]


This database of images is a product of video sequences which contain a single frontal face view occupying a substantial part of the whole image. The background consists of a single color tone, which is quite distinct from the face color. The illumination source is in front of the subject and thus the distribution of the intensity of the face color in the image is very narrow. The database contains 371 faces in 2448 images with different postures, illumination and clothing.


Figure 3.1: XM2VTS database examples


Figure 3.2: Our WWW database examples

3.2 Compaq Database [7]


A completely unrestricted set of images containing human faces comes from the WWW. In [8] a large image database was collected from the World Wide Web to build a comprehensive model for human skin color. From the whole set of 18,696 photographs, 13,640 were used to build the skin and non-skin color models. In the 4675 images containing skin, the skin pixels were segmented by hand to produce a mask for every picture. The authors made the dataset available to the academic research community. This dataset was used to build the skin and non-skin color models and to test the skin detection method's performance.

3.3 WWW Face Database


A small set of images was collected from Web sites to analyze the skin detection and region growing techniques and to present the results of these methods. The images for region growing testing are small in size to speed up the testing, but the color distribution of the pictures represents a variety of lighting and camera conditions.

Unfortunately, we are not allowed to publish the pictures from the database


Chapter 4 Statistical approaches to skin detection


Statistical approaches to pattern recognition have been widely used in many applications. For pixel-level skin detection, we have noticed several approaches in previous work. First we will briefly review the more frequent of them and then the correct statistical approach will be described. The simplest one is based on the assumption that the skin color distribution has a small continuous extent in the used color space and this extent is defined a priori (e.g. [11, 14, 17]). The skin blob in the color space is described by a set of thresholds in individual channels (n-dimensional rectangles), or e.g. an n-dimensional ellipse. Then the pixels whose color falls inside the defined region in the color space are considered skin pixels. To ensure the continuity of the skin color, all kinds of color space transformations are performed. Another approach uses a skin color model P(c|skin), which denotes the probability of appearance of the color c within the set of possible skin colors. The skin color model is then used for skin detection regardless of the background colors. The skin region is defined by thresholding the skin color model ([18]), or the probabilities are directly used for subsequent processing ([5]). The third approach builds specific color models P(c|skin) and P(c|background) for the skin and background color respectively (e.g. [9, 13, 19]). These two models are built using a priori knowledge about the appearance of skin in the image. In addition, the prior probability of skin appearance is approximated using the same a priori knowledge. Then the Bayes rule is applied. The first approach is a coarse approximation of the skin color model and it


is dependent on the used color space. In the second approach, the background color is not considered, which can lead to mistakes in classification. To reduce the number of such mistakes, some additional restrictions on the skin color are made ([5]). The third approach is used in more specific applications, where a priori assumptions can be made. Since we have no a priori knowledge about the skin appearance in an image, we need another approach, which arises from statistical pattern recognition theory.

4.1 Bayesian Theory


Bayesian decision theory provides the minimal possible error rate as long as we have some knowledge about the statistical models of the possible classes. First we introduce a few basic concepts we are going to use and then define the simplified Bayesian decision rule. Let K be the set of possible states of the nature of an object and let X be the set of possible observations (features) of the object. Let us consider both as discrete random variables that are correlated with each other by a set of joint probabilities P(x, k), which denote the fraction of objects falling into the class k and having the feature x. Since we know the joint probabilities, we also know the prior probabilities P(k), which denote the a priori knowledge about the occurrence of the class k, and the conditional probabilities P(x|k), which give the fraction of observations x within the class k, because P(x, k) = P(x|k) P(k).

We can denote our decision rule as q, which is from some set of decision rules Q. The decision rule is a function that maps the feature vector x into one of the classes k ∈ K. Now we can define the average probability of a wrong decision using the decision rule q:
ε(q) = Σ_{x ∈ X} Σ_{k ≠ q(x)} P(x, k)    (4.1)

Now the aim is to find a strategy q* that minimizes this average error.

From the last term we can see that the decision q*(x) will be the class k for which the posterior probability P(k|x) is maximal. Formally written,

q*(x) = argmax_{k ∈ K} P(k|x)    (4.2)

will be the decision strategy. Since

P(k|x) = P(x|k) P(k) / P(x)    (4.3)

the result corresponds to the intuitive solution, which is to maximize P(x|k) P(k), because P(x) is constant for an individual x.


For the skin detection problem K = {skin, non-skin}, where skin and non-skin denote the skin respectively non-skin class, and the color vector c represents the observation x. If we knew the prior probabilities P(skin) and P(non-skin), we would use 4.2 to classify a pixel as skin if P(skin|c) > P(non-skin|c) and as non-skin otherwise. But in our case of skin detection, we only have the two conditional probabilities P(c|skin) and P(c|non-skin), because the appearance of skin within an image is not a random event. Some applications assume equal priors and still use decision 4.2, but a better solution of this problem had been found by Neyman and Pearson and will be shortly stated in the following section.


4.2 The Neyman-Pearson Strategy


For the explanation of the Neyman-Pearson problem, we will use the skin detection notation stated in the previous section. Let c be the color of a pixel from a color space C. Let the probability distribution of c depend on the state of the pixel k ∈ {skin, non-skin}. The probability is known and is defined through the conditional probabilities P(c|skin) and P(c|non-skin).

The aim is to divide the color space C into two subsets C_skin and C_non-skin, so that the pixel is considered to be skin if c ∈ C_skin and non-skin if c ∈ C_non-skin. Since some of the colors have both probabilities P(c|skin) and P(c|non-skin) non-zero, there exists no perfect solution. If we use histograms to model the conditional probabilities, the overlap between the skin and non-skin classes will not change by applying any color space transformation. Thus the use of e.g. the YUV or HSV space instead of the RGB space will not improve the classifier's performance. The quality of the decision strategy is measured by two numbers: the probability that a skin pixel will be considered as non-skin (missed skin, alias false negative error) and the probability that a non-skin pixel will be considered as skin (false skin, alias false positive error). The decision strategy can be found such that the probability of the false skin detection error
ε_FS = Σ_{c ∈ C_skin} P(c|non-skin)    (4.4)

is minimized, while the missed skin error

ε_MS = Σ_{c ∈ C_non-skin} P(c|skin)    (4.5)

is at most a given ε₀. The subsets C_skin and C_non-skin that are solutions of the optimization problem are defined by the likelihood-ratio decision rule

q(c) = skin  if  P(c|non-skin) / P(c|skin) < θ,  non-skin otherwise.    (4.6)
In other words, a threshold θ can be found such that the pixels with likelihood ratio less than θ are considered to be skin colors, and the rest of the colors, for which the likelihood ratio is higher than θ, belong to non-skin colors. The algorithm that finds the threshold for a given ε₀ is described in section 4.4.
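The likelihood-ratio rule can be sketched on a toy one-dimensional "color" histogram. The bin counts below are invented purely for illustration; the classification function is ours, not the thesis implementation.

```python
import numpy as np

# Toy 1-D histograms standing in for the skin and non-skin color models.
h_skin = np.array([1.0, 6.0, 3.0, 0.0])
h_nonskin = np.array([4.0, 2.0, 3.0, 1.0])
p_skin = h_skin / h_skin.sum()          # approximates P(c|skin)
p_non = h_nonskin / h_nonskin.sum()     # approximates P(c|non-skin)

def classify(bin_index, theta):
    """Neyman-Pearson rule: label a color bin 'skin' when the likelihood
    ratio P(c|non-skin)/P(c|skin) falls below the threshold theta."""
    ps, pn = p_skin[bin_index], p_non[bin_index]
    if ps == 0.0:                        # color never seen on skin
        return "non-skin"
    return "skin" if pn / ps < theta else "non-skin"

print([classify(i, theta=1.0) for i in range(4)])
```

Raising θ admits more bins as skin, lowering the false negative error at the price of a higher false positive error, which is exactly the trade-off the ROC curves in section 4.4 express.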

4.3 Non-random intervention


In unrestricted images such as the WWW database, all possible races, illuminations, camera conditions etc. can appear, and such interventions can affect the actual color models. In practice, the color of a pixel is then dependent on both the class and the intervention. The dependence can be expressed by a probability distribution P(c|k, z), where z denotes the intervention. We could model the color distribution within a class by summing over the possible interventions.



But for a given image, there is no reliable method available to date for identifying such an intervention. If the occurrence of the intervention could be modeled with a probability distribution P(z) and this prior probability was known, we could use the marginalization P(c|k) = Σ_z P(c|k, z) P(z) (4.7). But since the intervention is not random, the distribution P(z) cannot be found. Practically it means that e.g. a pixel's color is influenced by the illumination and the class together, but the illumination color is independent of the observed class and it is not even a random variable. To reduce the error arising from the non-random intervention, the color model should be carefully built using subsets of the training data, with every subset containing pictures under known conditions and covering the whole range of the conditions. Since this is a very laborious task, we only use the approximation of the skin and non-skin color models obtained from the Compaq database.

But since the intervention is not random, the distribution can not be found. Practically, it means, that e.g. pixels color is inuenced by illumination and class together, but the illumination color is independent from the observed class and it is not even random variable. To reduce the error arising from the non-random intervention, the color model should be carefully build using subsets of training data, with every subset containing pictures under known conditions and covering the whole range of the conditions. Since this is a very liberal task, we only use the approximation of skin and non-skin color models obtained from Compaq database.

4.4 Experiments
We have used the Compaq database collected by Rehg and Jones (see section 3.2) to build the two color models P(c|skin) and P(c|non-skin) using histograms. As they recommend in [8], we use RGB color histograms with 32 bins in each color channel.

In our case, the histogram value h(v) means the number of pixels with color falling into the bin v, which is defined through an interval in every dimension. Let h_s and h_n be the skin histogram and the non-skin histogram respectively. Then the probability distributions are approximated by

P(c|skin) = h_s(c) / T_s,    P(c|non-skin) = h_n(c) / T_n    (4.10)

where T_s and T_n denote the total number of pixels in each histogram. To get a coarse idea about the appearance of the skin and non-skin color histograms, see Figure 4.1. Given the maximal missed skin error ε₀, the false detection error is minimized using Algorithm 1.
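Building such a histogram amounts to quantizing each 8-bit channel into 32 bins and counting. This is a minimal sketch under that assumption; the pixel values are invented and `bin_of` is our name.

```python
import numpy as np

BINS = 32                # 32 bins per channel, as recommended in [8]
SHIFT = 256 // BINS      # 8 intensity levels fall into each bin

def bin_of(rgb):
    """Map an 8-bit RGB triple to its (r, g, b) histogram bin index."""
    r, g, b = rgb
    return (r // SHIFT, g // SHIFT, b // SHIFT)

hist = np.zeros((BINS, BINS, BINS), dtype=np.int64)
pixels = [(200, 120, 90), (201, 121, 91), (10, 10, 10)]
for px in pixels:
    hist[bin_of(px)] += 1

# approximate P(c|class) for the bin containing (200, 120, 90),
# i.e. the bin count divided by the total pixel count T
p = hist[bin_of((200, 120, 90))] / hist.sum()
print(p)
```

The two similar pixels land in the same bin, so the looked-up probability is 2/3 of the three-pixel toy sample; with a real training set the same lookup approximates P(c|skin) or P(c|non-skin).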

A histogram h is a discrete function which maps an input vector v into an integer h(v).    (4.9)

Figure 4.1: RGB histograms of skin (left) and non-skin (right) color obtained from the Compaq database. The histograms are plotted as a 2D cut for the Blue channel = 16.

The key value of the skin detection problem is the missed detection error ε₀, which is correlated with the false detection error. As the missed detection is lowered, the false detection will increase and vice versa. We have tested the classifier's performance for all our databases. An example of the results obtained from XM2VTS images can be seen in Figure 4.2.

Algorithm 1 Bisection of the color space into skin and non-skin classes
1. Sort the color bins ascendingly according to the likelihood ratio.
2. The first n bins, those with the lowest ratio, vote for non-skin; n is such that the sum over the number of skin pixels in these bins is maximal but less than or equal to ε₀ · T_s.
3. The remaining bins with a non-zero number of pixels in h_s vote for skin.
4. Bins with both h_s and h_n zero, i.e. those not encountered in the training images, vote for non-skin.

Since the conditions in the XM2VTS images are strongly restricted, the results are quite fair. Using our WWW database, the trade-off between the false positive and false negative error is stronger, since the color overlap in the images is higher (see Figure 4.3).
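The bisection steps above can be sketched directly on flattened histogram arrays. This is our reading of Algorithm 1, not the thesis code, and the four-bin histograms are invented for the example.

```python
import numpy as np

def bisect_colors(h_skin, h_non, eps0):
    """Sort bins by likelihood ratio and let the lowest-ratio bins vote
    non-skin until the missed-skin mass reaches eps0 * T_s; remaining
    bins seen on skin vote skin, unseen bins vote non-skin."""
    total_skin = h_skin.sum()
    skin_vote = np.zeros(h_skin.shape, dtype=bool)
    seen = (h_skin + h_non) > 0
    # ratio h_s/h_n per bin; +inf where the color never occurs off-skin
    ratio = np.where(h_non > 0, h_skin / np.maximum(h_non, 1), np.inf)
    order = np.argsort(ratio[seen], kind="stable")
    idx = np.flatnonzero(seen)[order]
    missed = 0
    for i in idx:
        if missed + h_skin[i] <= eps0 * total_skin:
            missed += h_skin[i]           # this bin votes non-skin
        else:
            skin_vote[i] = h_skin[i] > 0  # remaining seen bins vote skin
    return skin_vote

h_s = np.array([0, 5, 20, 75])   # skin pixel counts per bin (invented)
h_n = np.array([0, 50, 20, 5])   # non-skin pixel counts per bin
print(bisect_colors(h_s, h_n, eps0=0.05).tolist())
```

With ε₀ = 0.05 the missed-skin budget is 5 pixels, exactly covering the worst bin, so only the two high-ratio bins vote skin.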

Figure 4.2: An example of skin detection applied on images from the XM2VTS database. The left column shows the original images; the other two columns show the output masks produced by two different settings of ε₀.

Figure 4.3: An example of skin detection applied on images from our WWW database. The left column shows the original images; the other two columns show the output masks produced by two different settings of ε₀.



[Figure 4.4 plot: ROC curves for Compaq (100 skin, 100 non-skin images) and M2VTS (20 images) data, with annotated minimal errors Fn + Fp = 26.7% and Fn + Fp = 4.5%.]

Figure 4.4: The ROC curves obtained from the Compaq and XM2VTS databases. The markers show the minimal errors for both ROCs, produced by summing the missed skin and false skin errors.

The correlation of the two errors can be expressed by the receiver operating characteristic (ROC) curve. The curve plots the false positive against the false negative error as the threshold θ varies. The performance of the classifier can be quantified by integrating the area under the ROC curve: the better the classifier's performance, the smaller the area under the curve, and vice versa. We have computed two ROC curves using the Compaq and XM2VTS databases. The computation of the errors for a given ε₀ was different for the two databases. The Compaq database contains non-skin pictures and these were used to get the false skin detection error. The images containing skin were used to count the missed skin error. Non-skin pixels in these images were discarded, since the masks marking the skin areas do not include all the skin pixels. A fraction of 100 pictures containing skin and 100 pictures without skin was used. Because there are no images without skin in the XM2VTS database, we had to use the same images for both the false negative and the false positive error computation. To eliminate the error, we very carefully painted all the skin areas including the lips, to mark all the skin pixels. Then all the pixels marked as skin and classified as non-skin increase the false negative error, while non-marked pixels classified as skin are involved in the false positive error. The comparison of the ROC curves is in Figure 4.4.

Chapter 5 A Multi Color Model for Face


To move forward from skin to face detection, we considered the face as a multi-colored object with a specific color model. The aim was to build a statistical color model for the face, similarly as we did for skin, including the lips and hair color. Then we would build a classifier which divides the pixels in an image into three classes: skin, lips and hair. First we briefly describe the statistical model and the proposed classifier, and then we discuss the results.

5.1 Face color model


We considered two statistical approaches to face color detection. The first one is to model the skin, lips and hair color individually and build a classifier using these three models together with the model of other colors. Since the non-skin color model was built from images containing no humans, it can be considered a non-face color model. Because the prior probabilities P(skin), P(lips) and P(hair) are unknown, we cannot resolve this task using the Bayes rule. In addition, we cannot use the Neyman-Pearson solution, since it works only for two classes, and extending this solution to more classes is not a trivial task. Hence we came up with a simple approach which couples the Neyman-Pearson task and the Bayesian rule. We divide the task of face color detection into two subtasks. First we detect all face colors regardless of the classification within the face. Let {face, non-face} be the set of two classes and {skin, lips, hair} the set of face components. Provided we have both face and non-face color models, we can use the Neyman-Pearson strategy described in section 4.2. To build the face color model, we need the skin, lips and hair color models defined

by the conditional probabilities P(c|skin), P(c|lips) and P(c|hair), and the prior¹ probabilities P(skin), P(lips) and P(hair), which denote the fractions of skin, lips and hair within the face. If we assume that the appearance of skin, lips and hair within a face is a random event, we can learn both the prior and conditional probabilities and assemble the face color model

P(c|face) = P(c|skin) P(skin) + P(c|lips) P(lips) + P(c|hair) P(hair)    (5.1)

The second subtask is to divide the obtained face pixels into the three face components. Since we know the conditionals and the priors, this can be easily done using the Bayes rule from 4.2,

k*(c) = argmax_{k ∈ {skin, lips, hair}} P(c|k) P(k)    (5.2)
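The second subtask can be sketched as a maximum a posteriori choice among the three components. All numbers below are invented for the example; only the rule of maximizing P(c|k)P(k) follows the text.

```python
# Hypothetical priors for the fraction of skin, lips and hair in a face.
priors = {"skin": 0.80, "lips": 0.05, "hair": 0.15}

# Stand-in class-conditional probabilities P(c|k) for two example colors.
conditionals = {
    "reddish": {"skin": 0.30, "lips": 0.60, "hair": 0.05},
    "dark":    {"skin": 0.05, "lips": 0.05, "hair": 0.70},
}

def face_component(color):
    """Assign a face pixel to the component maximizing P(c|k) * P(k)."""
    return max(priors, key=lambda k: conditionals[color][k] * priors[k])

print(face_component("reddish"), face_component("dark"))
```

Note that the "reddish" color is assigned to skin even though its lips likelihood is twice as high: the small lips prior dominates, which is exactly the lips-detection weakness discussed in the experiments below.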

5.2 Experiments
We have built three histograms for the skin, lips and hair color using the XM2VTS database. Within the same training dataset, we have found the prior probabilities P(skin), P(lips) and P(hair). Using 5.1 we build the face color model and, together with the non-skin color model, the Neyman-Pearson solution is applied. Then we divide the face pixels into three classes employing 5.2. Our first experiment was focused on the XM2VTS database. Despite relatively sufficient results (see Figure 5.1), there are several problems which we will discuss now.

Figure 5.1: An example of face feature detection based on the face color model applied on the XM2VTS database. The reddish areas show the skin class, the bluish areas the lips class and the green areas the hair class.

The first problem arises from the face color model P(c|face). We have extended the skin-pixel detection to face-pixel detection using this model. The skin-pixel detection is built upon the condition that the skin color occupies only a small area in the color
¹ They are actually conditional too, but we can call them prior within the face color distribution.


space. If we add the lips and especially the hair color, the extent is increased and so is the overlap with other colors. Since the extent is increased, the probability distribution is flattened and this leads to an increase in the false negative error (see Figure 5.2).

Figure 5.2: The increased extent of the color model leads to a higher missed detection error.

To force the classifier to extract the face pixels we can set a low threshold θ, which means reducing the false negative error. On the other hand, this results in an increase in the false positive error (many pixels are wrongly marked as hair) because of the extent of the model (see Figure 5.3). In addition, the hair color itself, in contrast to the skin color, has a big overlap with other colors.

Figure 5.3: The face color model produces a big amount of false hair detections.

Another problem is the overlap between the colors within the face. Since the lips have a small prior probability P(lips) and a big overlap with the skin color, their detection is poor, especially at low resolution, where the difference between the skin and lips color is negligible. The overlap is apparent especially in WWW images, due to the variety of lighting and camera conditions (see Figure 5.4). We could accept the simple face color model performance on restricted images like the XM2VTS database pictures. However, applying the method on WWW


images, the results are not satisfactory and would probably be a bad starting point for a higher-level face analysis. Thus we have left the pixel-level image analysis at the skin versus non-skin view.

Figure 5.4: The big overlap between the skin and lip color disadvantages the lips class.


Chapter 6 Connected Components


Considering the performance of our skin pixel classifier, we realize that the skin detection performance established on a single pixel color (per-pixel detection) is strictly limited by the overlap of the skin and non-skin color models. Considering the face as a connected component can help both the skin pixel classification and the face detection. In this chapter, first the reasons and basic presumptions leading to the connected component analysis will be explained, then the proposed region growing method will be presented and finally some results will be shown.

6.1 Single face color distribution


Our model of the skin color distribution covers different faces and different environmental conditions like illumination color, camera etc. But in a single image the distribution of the color of a single face is much narrower than the whole skin color model. In other words, the color of skin can vary a lot between different people and pictures (see Figure 6.1), but it is rather stable in a single face and image. The change of the illumination color within a face can be a problem; however, in most images there is only a single light source and its color is usually close to white. The second assumption is that the face pixels are clustered in a particular place in the image. So we could consider a region consisting of connected skin pixels as a face candidate. To ensure that the majority of the face pixels are detected, we need to set up a low false negative error in the Neyman-Pearson rule. The trouble is that by doing so, we include surrounding pixels from the background, clothing etc. which have skin color into the region, and disallow any simple verification of the region being a face or not (see Figure 6.2).

Figure 6.1: Examples of skin taken from different people and environmental conditions.

We would like to divide the connected component according to its color and spatial distribution to get individual scene objects which we could analyze further. This task is known as segmentation and although many approaches have been proposed, there is no robust solution available up to now. However, the advantage of our task is the restriction to the skin regions only, and thus the number of segmentation options is much lower than the number of options in a full image.
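Extracting face candidates as connected regions of skin pixels can be sketched with a standard flood-fill labeling of the skin mask. This is a generic 4-connected labeling, not the thesis code; the toy mask is invented.

```python
from collections import deque

def connected_components(mask):
    """Label 4-connected components of true pixels in a 2-D grid; each
    component of connected skin pixels becomes one face candidate."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and labels[sy][sx] == 0:
                count += 1                      # start a new component
                labels[sy][sx] = count
                queue = deque([(sy, sx)])
                while queue:                    # breadth-first flood fill
                    y, x = queue.popleft()
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w \
                           and mask[ny][nx] and labels[ny][nx] == 0:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return count, labels

mask = [[1, 1, 0, 0],
        [0, 1, 0, 1],
        [0, 0, 0, 1]]
n, labels = connected_components(mask)
print(n)
```

The two separated skin blobs in the toy mask give two candidates; on a real skin mask each labeled component would be passed on for segmentation and verification.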

Figure 6.2: Examples of connected components. The left image shows a region which includes hair; the right image is a corrupted face region with no occlusions.

6.2 Region Growing


To isolate homogeneous regions, we have proposed a method based on region growing. The objective of the method is to build a color model of the region and, following this model, to allocate the region's spatial distribution. At the beginning, the region is initialized in a homogeneous place in the image to ensure the start inside the sought object. The initial color model is built from the initial region, which includes only a few pixels. In the growing phase, the region is updated by appending a part of the boundary (e.g. one pixel). For every added part of the boundary, a loss can be estimated. The loss can be used both to select the

right piece from the boundary to add, and to estimate the region's limits. First we introduce the general concepts of region growing and then the particular application to our task will be described. R(t) denotes the growing region at time t, defined through its set of pixels. The time t is equivalent to the size of the region. B(t) is the region's boundary. We can consider the boundary in two ways: as the maximal set of pixels of the region such that each has a neighboring pixel outside the region (the pixel boundary), or as the set of the lines between the pixels of the region and its neighboring pixels (the line boundary). ΔR denotes the increase of the region in one optimizing step, i.e. R(t+1) = R(t) ∪ ΔR.

L(R) is the loss function defined for the region. To denote the loss function at time t, we simply write L(t).

There are several features we can use to estimate the loss function; it depends on the region and implicitly on the region's boundary, since the boundary is a function of the region: Internal feature homogeneity can be any function of the feature's variability within the region, where the feature can be color, texture etc.

Contrast at the boundary, using the average or the variance of the size of the contrast, either in comparison to the pixels neighboring the boundary or to the average color of the region. Spatial homogeneity, which might be considered as the normalized boundary size, or e.g. a fit to an a priori known contour (e.g. an ellipse). Other knowledge about the region, e.g. the probability that a pixel is from some particular class (skin).

The part of the boundary that is being added could be a pixel, the median of a pixel neighborhood, or a small block of pixels. We have tested the pixel and the median only, and thus in the text below we only speak about pixels. The segmentation is proposed as an optimization process minimizing the change of the loss function. The loss function should be designed such that it can be evaluated in constant time. That means that if the loss function consists of several parts, this condition must be satisfied by each of


them. However, the speed of the method is still strongly dependent on the image size, since we select the added pixel from all the pixels in the boundary. To reduce the amount of computational time, we choose a random subset of the boundary pixels and only this subset is used for the pixel selection.

6.3 Experiments
Our region growing solution is only pixel-oriented and the color is used as the observed feature. The pixels are selected from the boundary using the loss function, which has the form

L = Σ_i w_i L_i    (6.1)

where the sum is performed over the different features of the region, w_i is the weight of the i-th feature and L_i is the loss of the region with regard to the i-th feature. First we have used four features: The color variability of the region. We discuss two ways of modeling this variability in section 6.3.1.

The average color contrast at the boundary. We have tried to use either the average difference of the pixels neighboring the boundary, or the average difference to the average color of the region. Results from these two approaches are shown in section 6.3.2.
s P A

The compactness of the region is simply dened as the normalized boundary size and the corresponding loss is . As the boundary, both and were used. See section 6.3.3 for the different. is a loss when the region contains a pixel classied as non-skin using a low false negative error. We have restricted the region growing at the skin-pixels only and this results in
c ` H

These four features are combined using the weights to obtain the loss function which would direct the region growing into reasonable bound. The basic steps of the region growing algorithm are outlined in the algorithm 2, that takes an image, 34

` f

s 1 3 Vw0 x T S 3 Vw0 xH T s F

d b

f f

P 
q

(6.1)

f P ! &sv

! &9v i f

f i

i i

Algorithm 2 Region growing algorithm
1. Initialize the region at the starting location. Compute the loss functions. Set t = 0.
2. Evaluate the region's boundary. If the boundary is empty, return.
3. Take pixels from the boundary and for each one compute the enlarged region and subsequently the loss function using equation 6.1.
4. Add the pixel with the smallest loss function to the region.
5. Save the relevant loss functions, set t = t + 1. Go to step 2.

the mask marking the pixels classified as skin, and a starting location, and produces the growing region and the loss functions. First we show the practical effects brought by each of the individual features used for the evaluation of the loss function, and then the detection of the actual region is discussed. In all the experiments below, the results are shown on full images without previous skin detection, to test the performance in a complex environment.
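The growing loop can be sketched as follows. This is a toy version of the greedy step, simplified to grey-scale values and only a color-distance term of the loss (6.1); the function name, priority-queue bookkeeping and the 3x3 image are ours.

```python
import heapq

def grow_region(image, seed, steps):
    """Greedy region growing sketch: repeatedly move the boundary pixel
    with the smallest per-pixel loss into the region. The loss here is
    the distance to the region mean at the time the pixel is discovered."""
    h, w = len(image), len(image[0])
    region = {seed}
    total = float(image[seed[0]][seed[1]])
    frontier = []  # heap of (loss, y, x) boundary candidates

    def push_neighbors(y, x):
        for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
            if 0 <= ny < h and 0 <= nx < w and (ny, nx) not in region:
                mean = total / len(region)
                heapq.heappush(frontier, (abs(image[ny][nx] - mean), ny, nx))

    push_neighbors(*seed)
    for _ in range(steps):
        while frontier:
            loss, y, x = heapq.heappop(frontier)
            if (y, x) not in region:   # skip stale heap entries
                break
        else:
            break                      # boundary is empty: stop
        region.add((y, x))
        total += image[y][x]
        push_neighbors(y, x)
    return region

img = [[10, 11, 90],
       [12, 10, 95],
       [90, 92, 91]]
region = grow_region(img, (0, 0), steps=3)
print(sorted(region))
```

Starting from the dark corner, the region absorbs the three similar pixels and stops short of the bright area, which is the qualitative behavior the variability term is meant to produce.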

6.3.1 Color variability of the region


The color variability term drives the region to be homogeneous in color, and thus using only this feature, the region spreads itself along similar colors (see Figure 6.3).

Figure 6.3: An example of the region growing result using the variability feature only.

We consider the color variance of the region as the most important sign when enlarging the region. To simplify the computation of the variance, we presume the covariance matrix of the region to be diagonal. To ensure that, the C_r C_g I color space is used, where

C_r = R / (R + G + B),  C_g = G / (R + G + B),  I = (R + G + B) / 3    (6.2)

and R, G, B are the color intensities in the RGB color space. We denote C_r as red chroma, C_g as green chroma and I as intensity. From now on, the covariance matrix consists of three numbers only, which we denote var(C_r), var(C_g) and var(I). Our aim is to keep the covariance matrix preferably stable as we add pixels to the region. Since we do not know the actual covariance of the region, we seek to keep the variance at a low rate. To achieve that, we proposed two approaches to minimizing the color variance. The first one consists in minimizing the covariance matrix determinant

det Σ = var(C_r) · var(C_g) · var(I)    (6.3)

To simplify this operation, we can minimize the logarithms and sum instead of multiplying. The second approach is to maximize the probability of the added pixel's color c under the region's color model; instead of that probability, we can use the Mahalanobis distance, which is

d(c) = (c − μ)ᵀ Σ⁻¹ (c − μ)    (6.4)

where μ and Σ represent the color distribution of the region (its mean and covariance).

Figure 6.4: An example of face skin color distribution. The left plot shows the histograms of skin color in the RGB color space for each individual channel; the right one shows the histograms after the transformation into the C_r C_g I space. The image selection shows the pixels encountered in the histograms.

At first sight we considered the two approaches as leading to the same result, minimizing the region variance. However, since the second approach uses
the Mahalanobis distance, it assumes the distribution of the face skin color to have a Gaussian shape. In some images this is an adequate approximation (see Figure 6.4), but if the illumination changes within the face, the distribution is correlated with that change. This is shown by the plots in Figure 6.5, which give the color distributions of skin in both color spaces. In the RGB space, the intensity influences all the color dimensions, while in the C_r C_g I space only the intensity and the red chroma are influenced.
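The decoupling of intensity from chroma can be demonstrated directly on the transform. The exact formula in the thesis is garbled by extraction; the sketch below assumes the standard normalized chromaticities with the mean intensity, as reconstructed in equation 6.2.

```python
def to_crcgi(r, g, b):
    """Transform an RGB triple to red chroma, green chroma and intensity
    (standard chromaticity normalization, assumed from equation 6.2)."""
    s = float(r + g + b)
    return r / s, g / s, s / 3.0

# Scaling all channels (a pure intensity change) leaves both chromas fixed.
cr1, cg1, i1 = to_crcgi(120, 80, 60)
cr2, cg2, i2 = to_crcgi(60, 40, 30)   # same color at half the intensity
print(cr1, cr2, i1, i2)
```

Halving the illumination moves only the intensity coordinate, so within a face lit by a single near-white source the chroma variances stay small, which is what makes the diagonal-covariance assumption workable.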
Figure 6.5: An example of face skin color distribution influenced by an illumination change within the face.

Since the Mahalanobis distance solution seems to enforce the Gaussian shape of the color distribution too strongly, we have used the first approach, which minimizes the variance obtained from 6.3, because it only keeps a minimal change of the variance in every dimension. See Figure 6.6 to compare the results obtained by the two approaches.

Figure 6.6: The first row shows the region growing results obtained by minimizing the variance (see equation 6.3), while the second row shows the results obtained by minimizing the Mahalanobis distance (see equation 6.4).

6.3.2 Difference at the boundary

The difference at the boundary is used to signal the edges of the actual region, and we have proposed two basic forms to model this feature. The first one counts the average difference in color between the pixels at the boundary. Formally we can write

    f_diff1(R) = (1/N) sum_{p in B(R), q not in R} n(p,q) ||c(p) - c(q)||        (6.5)

where p and q denote the spatial coordinates of a pixel, the function n(p,q) equals one if the pixels are neighbors and zero otherwise, B(R) is the inner boundary of the region and N is the number of neighboring pairs. This function is independent of the color distribution of the region, and so even regions with multiple colors can be detected. The second function uses the mean color of the region, and all the boundary pixels are related to that mean:

    f_diff2(R) = (1/|B(R)|) sum_{p in B(R)} ||c(p) - mu(R)||                     (6.6)

where mu(R) denotes the mean color vector of the region. Since this function depends on the mean color of the region, it can be biased if the region has a non-homogeneous color distribution. However, we seek to find only regions with homogeneous color, and unlike with the first function, even fuzzy edges can be detected if the color distributions of the two neighboring regions are distant enough.

Now the question arises whether the difference at the boundary shall be minimized or maximized. Our observation was that by minimizing the difference at the boundary, the region keeps inside the edges as long as possible. If we tried to maximize the difference, the region sought to glue to the boundary from either the inside or the outside (see Figure 6.7).

Figure 6.7: An example of the region growing results using the difference at the boundary feature only. The first row shows results obtained by maximizing the difference, the second by minimizing the difference at the boundary.

However, when we added other features (such as variation), the difference between maximizing and minimizing was not noticeable.


We did not examine this feature further and we did not use it for region growing, which is outlined in section 6.3.4.
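The two boundary measures from equations 6.5 and 6.6 can be sketched on a binary region mask. This is a minimal illustration; the helper names, the Euclidean color norm and the 4-connectivity choice are ours.

```python
import numpy as np

NBRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # 4-connectivity

def boundary_pairs(mask):
    """Yield (inside, outside) coordinate pairs across the region boundary."""
    h, w = mask.shape
    for y, x in zip(*np.nonzero(mask)):
        for dy, dx in NBRS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                yield (y, x), (ny, nx)

def f_diff1(img, mask):
    """Average color difference across neighboring boundary pairs (cf. eq. 6.5)."""
    diffs = [np.linalg.norm(img[p] - img[q]) for p, q in boundary_pairs(mask)]
    return float(np.mean(diffs)) if diffs else 0.0

def f_diff2(img, mask):
    """Average distance of boundary pixels from the region mean color (cf. eq. 6.6)."""
    mu = img[mask].mean(axis=0)
    inner = {p for p, _ in boundary_pairs(mask)}
    return float(np.mean([np.linalg.norm(img[p] - mu) for p in inner])) if inner else 0.0
```

On a sharp edge the two measures agree; on a fuzzy edge f_diff1 stays small while f_diff2 still grows once the neighboring region's colors are far from the region mean.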

6.3.3 Compactness of the region

Since the aim of our loss function is to segment all kinds of image regions, we have only little a priori knowledge about the compactness of the region. Thus we just use the normalized size of the boundary, which prevents the region from being scattered over similar colors as in the second image in Figure 6.3. Both normalizations that we considered bring similar results; each tends to keep a square region (see Figure 6.8). We have chosen the one for which the penalty for pixels not included in the region, although surrounded by region pixels, is much higher. This results in better modeling of the actual color distribution of the region.

Figure 6.8: Two examples of region growing results, each obtained by minimizing one of the two compactness measures.
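One plausible form of such a compactness loss normalizes the boundary length by the perimeter of an equal-area square, so a square region scores about 1 and a scattered region scores higher. The exact normalization used in the thesis is not recoverable, so this form is an assumption.

```python
import numpy as np

def f_comp(mask):
    """Boundary length of a binary region normalized by the perimeter of
    a square of equal area. A 4x4 square scores exactly 1; elongated or
    scattered regions score higher. One plausible reading of the thesis'
    compactness loss, not its exact definition."""
    h, w = mask.shape
    padded = np.zeros((h + 2, w + 2), dtype=bool)
    padded[1:-1, 1:-1] = mask
    # count exposed 4-neighbor edges of region pixels
    edges = 0
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        shifted = np.roll(padded, (dy, dx), axis=(0, 1))
        edges += np.count_nonzero(padded[1:-1, 1:-1] & ~shifted[1:-1, 1:-1])
    area = np.count_nonzero(mask)
    return edges / (4.0 * np.sqrt(area)) if area else 0.0
```

A hole inside the region exposes extra edges without adding area, which is exactly the high penalty for surrounded non-region pixels mentioned above.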


6.3.4 Homogeneous regions detection

In the first paragraph of section 6.2, the introduction to region growing, we have characterized the loss function as an indicator which can be used both in the region growing process itself and to determine the region's actual boundary. But during the experiments, we did not find any significant points in the loss function which could signal the region to be the one we were looking for. For the individual components of the loss function, these signs ought to have the following properties:

- Color variability of the region is supposed to be small, and thus we assume that when the region exceeds the actual boundaries, there is a high increase of the variability. We could detect a high first or second derivative of the variability to select the region, or just prefer a low variability.

- The difference at the boundary should be high for both functions defined in equations 6.5 and 6.6 respectively. If the region exceeds the boundaries, the difference ought to lower again. This assumption is the same for both functions, however, for different reasons. For the first function, we suppose that the color difference in the neighboring region is lower than the difference at the boundary; in other words, the regions in the image are relatively homogeneous in color in comparison with the edges. For the second function, the decrease arises because by adding the pixels from the neighboring region, the color mean starts to shift away from the region's color. Using these assumptions, we can accept the local maxima as the signs signaling the actual regions.

- For the compactness, we can only assume a high value, since the regions ought to be relatively homogeneous in their spatial extent. Our observation is that the region's compactness increases while the actual region is being filled and is lowered again after flowing out; thus we could localize the maxima as well.

Figure 6.9: Loss functions obtained using only the region variance as the pointer for growing. The actual regions are depicted in the images on the left, which are the region's stages at 7000 and 11000 pixels.

We can see that these requirements are consistent with the demands on the loss function used for region growing, except for the difference at the boundary, where it is not that obvious. Hence it seems reasonable to apply the same loss function for region detection. But during the experiments we have found out that the more we restrict the growing process with a given loss component, the less noticeable is the sign for that component described above. The restriction is made by setting the weight in equation 6.1; a strong restriction means a high weight. For example, using the color variance as the only feature for growing produces a variance loss function which rises slowly with only insensible changes (see Figure 6.9). On the other hand, both the boundary difference and the compactness loss significantly point out the actual regions (face, and face with hair and clothing).



Figure 6.10: The plot shows the loss functions obtained by minimizing the boundary difference. The image shows the region stage at 6000 pixels, just after it flowed out of the face.

When using the boundary difference as the only feature, there is a big increase of both the color variance and the compactness loss (see Figure 6.10). When a strong restriction on the compactness was made, two noticeable breaks in the variance loss function signaled the regions but, surprisingly, the difference at the boundary was a poor region detector (see Figure 6.11).

Figure 6.11: The plot shows the loss functions obtained by maximizing the compactness with a little restriction on the color variance. The images on the left show growing examples at 2000 and 7000 pixels.

From the observed behavior of the loss functions, we have come to the conclusion that there is only a weak dependence between the individual features. The consequence is that by choosing the pixels according to one feature's loss, the other loss functions take the average change within all the boundary pixels and can be used as the actual boundary detectors. To obtain reasonable region growing results, we have mixed the individual loss functions with empirically set weights using equation 6.1. See Figure 6.12 for an example of the region growing.

Growing with w_diff1 = 0.1, w_comp = 0.2, w_var = 0.7

Figure 6.12: Region growing results obtained using the loss function from 6.1 with the weights w_diff1 = 0.1, w_comp = 0.2, w_var = 0.7. The region was expanded over the face; its stages are shown in steps of 500 pixels.

From the resulting loss function we would hardly determine the face region, which is approximately at 1200 pixels. Thus we have employed another function to find homogeneous regions, which we call the profit function, since the sought regions shall occur at its maxima:

    P(R) = f_diff(R) * f_comp(R) * f_hom(R)        (6.7)

where f_diff(R) is the difference at the boundary, f_comp(R) is the compactness and f_hom(R) is the color homogeneity of the region. By choosing the multiplication, we stress the relative differences rather than the absolute differences in the individual loss functions. The profit function can then be evaluated as the logarithm

    log P(R) = log f_diff(R) + log f_comp(R) + log f_hom(R)        (6.8)
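In the log domain the profit evaluation reduces to a sum of logarithms; a minimal sketch follows. The argument names are ours, and treating all three terms as quantities to be maximized is our reading of the text.

```python
import math

def profit(f_diff, f_comp, f_hom, eps=1e-9):
    """Log-domain profit (cf. eq. 6.8): the logarithm of the product of
    the boundary difference, compactness and color homogeneity terms
    (cf. eq. 6.7). eps guards against log(0)."""
    return math.log(f_diff + eps) + math.log(f_comp + eps) + math.log(f_hom + eps)
```

Because the logarithm turns ratios into differences, doubling any single term raises the profit by the same amount regardless of that term's absolute scale, which is the stated reason for multiplying rather than summing.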

For the same region growing example as in Figure 6.12, the profit function is shown in Figure 6.13.
Figure 6.13: Profit function for the region growing depicted in Figure 6.12.


Unfortunately, the sought region is usually not to be found at the global maximum of the profit function. The profit function only reaches a local maximum when the region spreads itself over a homogeneous area, and we cannot a priori define the parameters of the homogeneous region. Thus instead of one maximum, we find a set of significant local maxima. This set assigns a set of face candidates to a single starting location. To extract the local maxima, we use a simple algorithm that takes the profit function as the input and returns a set of local maxima positions.

Algorithm 3 Searching for the set of local maxima in the profit function

1. Take every k-th value from the profit function and use these values to approximate the profit function with a spline function.

2. From the spline function, generate two shifted functions.

3. For every interval on which the spline exceeds both shifted functions, find the global maximum position and put it into the output set.

There are three parameters in the algorithm: k defines the accuracy of the spline function and, together with the shifts, determines the sharpness and height of the peaks. All the parameters are empirically assessed from the length of the profit function. An example of evaluating the face candidates is shown in Figure 6.14.
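A rough stand-in for this peak extraction can be sketched as follows. We substitute a moving-average smoother for the spline and a plain strict-local-maximum test for the shifted-function intervals; k, win and min_height play the role of the empirically set parameters.

```python
import numpy as np

def local_maxima(profit, k=50, win=5, min_height=0.0):
    """Subsample every k-th value of the profit function, smooth the
    subsampled curve (moving average instead of a spline) and return the
    positions of strict local maxima above min_height, mapped back to
    pixel counts. A simplified stand-in for Algorithm 3."""
    sub = np.asarray(profit, dtype=float)[::k]
    kernel = np.ones(win) / win
    smooth = np.convolve(sub, kernel, mode="same")
    peaks = []
    for i in range(1, len(smooth) - 1):
        if smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1] and smooth[i] > min_height:
            peaks.append(i * k)
    return peaks
```

Each returned position corresponds to one face candidate, i.e. the region stage at that number of pixels.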

Figure 6.14: Face candidates obtained from the profit function. Images show the original picture and all the regions at the local maxima. The plot shows the profit function together with the spline function and the local maxima intervals.


6.3.5 Starting points

To locate the starting points for region growing, we have used two assumptions:

- the region is homogeneous in color;

- the region is a skin region only if it contains some pixel with a high probability of being skin.

The first assumption leads to a simple algorithm 4, which stemmed from [6, pages 96-97]. It takes a color image and produces a texture map, where high values mean high color contrast (texture) and vice versa.

Algorithm 4 Computation of the texture mask T from a color image I

1. Create a new image M of the same size as the image I.

2. For each color plane c in I and for each pixel position (x,y) in I, compute the mean value of the pixels in the neighborhood of (x,y) and insert the mean value into the image M at the location given by (x,y) and c.

3. Compute the image D of absolute values of the differences between I and M for each color plane separately.

4. Create the intensity image DI from D.

5. Create the texture image T of the same size as DI.

6. For each pixel position (x,y) in DI, compute the mean value of the pixels in the neighborhood of (x,y) and insert the mean value into the texture T at the location given by (x,y).

The texture mask stresses the edges in the image, and thus we can avoid starting the region growing at any of the boundaries between the actual regions. The second assumption shall reduce the number of regions containing only pixels with a small probability of being skin. For example, a very bright color can occur on a shiny brow, but not over the whole face. This assumption can help to accept the rarely occurring skin colors in the face and reject the same pixels in other areas (e.g. a bright yellow wall). To exploit this assumption, we have proposed a simple connected component analysis algorithm. We classify all the pixels with two thresholds into skin and non-skin classes. We call the pixels denoted as skin with the lower threshold the provisory skin pixels, while the pixels classified with the higher threshold are the high probability skin pixels. Then we mark as skin all the connected components of the provisory skin pixels that include any of the high probability pixels. We have evaluated a set of ROC curves for the same dataset as in section 4.4. Every ROC curve was computed for a fixed high threshold and a set of lower thresholds. The best result was obtained for a high threshold of 0.6, but the benefit was only negligible (see Figure 6.15).
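The texture map computation (Algorithm 4) can be sketched with separable mean filters. This is a minimal sketch using scipy's uniform_filter for the neighborhood means; the window size is an assumed parameter.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def texture_mask(img, size=5):
    """Sketch of Algorithm 4: per-channel local mean (step 2), absolute
    difference from it (step 3), channel-averaged intensity of the
    differences (step 4), and a final local mean of that intensity
    image (step 6)."""
    img = img.astype(float)
    m = np.stack([uniform_filter(img[..., c], size) for c in range(img.shape[-1])],
                 axis=-1)
    d = np.abs(img - m)              # per-channel difference image D
    di = d.mean(axis=-1)             # intensity image DI
    return uniform_filter(di, size)  # smoothed texture map T
```

Flat areas yield values near zero, while edges and textured areas yield high values, so the starting point is taken at a low value of this map.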
Figure 6.15: ROC curves obtained using 200 images from the Compaq database, comparing simple skin detection with the connected component analysis for the high threshold of 0.6. Simple skin detection is the same plot as in the Neyman-Pearson classifier testing.

Now the starting point is located using only the provisory skin pixels, choosing the one with the lowest value in the texture map (see Figure 6.16), and the region growing is limited to the provisory skin pixels only.
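The two-threshold connected component analysis can be sketched as follows. This is a minimal sketch; skin_prob stands for the per-pixel skin probability map, and the function name is ours.

```python
import numpy as np
from scipy.ndimage import label

def skin_mask(skin_prob, low, high):
    """Keep every connected component of provisory skin pixels
    (prob >= low) that contains at least one high probability skin
    pixel (prob >= high). scipy's default structure gives
    4-connectivity; 8-connectivity would need a custom structure."""
    provisory = skin_prob >= low
    strong = skin_prob >= high
    labels, _ = label(provisory)
    keep = np.unique(labels[strong])
    keep = keep[keep > 0]            # drop the background label
    return np.isin(labels, keep)
```

This behaves like hysteresis thresholding: tolerant low-probability pixels are accepted only when anchored by a confident pixel in the same component.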

Figure 6.16: Starting point evaluation: a) original image, b) texture mask, c) all the provisory pixels, d) connected components containing high probability skin pixels and the starting location.


Chapter 7 Face detection

Provided we have extracted skin regions only and we have segmented these regions into homogeneous parts, each segment could be validated and accepted or rejected on the basis of the validation result. But in our case, we have more face candidates for each starting point, which means the candidates can overlap each other. If a candidate is marked as a non-face region, it can still cover the actual face region, and thus the rejection of the whole region would mean the loss of the face. This problem arises if the region growing starts outside the face and during the process spreads itself over the face as well. To solve this problem and reduce the number of false positive detections, we have proposed a simple algorithm which is based on several assumptions. First we introduce a few more concepts which formally describe the process of region growing applied to face detection:

- A face is a projection of a human face into the picture coordinates. It is represented by a set of pixels in the spatial coordinates. Other parts of the image can be denoted as non-face.

- A region is defined as in the previous chapter, but we consider only a region which is significant for the face verification. It means that the number of pixels is generally higher.

- A region flowing out of the face is a region that covers the face together with a part of its neighborhood, such that the neighboring part disallows the face validation or other future processing.

- A region running into the face is a region that covers only a part of the face, such that the rejected part of the face disallows the face validation or other future processing.
To process all the connected components from the image using the homogeneous regions from the region growing process, we have made these assumptions:

- If the starting position of the region growing process is located in the face, then the region first spreads itself over the face before it flows out.

- The region which is spread over the face can be detected by a local maximum obtained from the profit function.

- If the growing process starts outside the face, then the moment right before running into the face can be detected by a local maximum in the profit function.

To validate the face candidates, we have used a rejection rather than an acceptance scheme, since the number of variations of the face appearance caused by different rotations and scales is still high. We reject the face candidate if the given region fails to satisfy any of these assumptions:

- the face is approximately elliptic, with a limited ratio of the sizes of the axes;

- the face involves small darker areas, such as eyes and lips, that can be considered as local minima in the intensity image;

- the number of these dark spots does not exceed a certain threshold.

To apply the first assumption, we approximate the region R with an ellipse E using the moments of inertia. Then we assess how well the region is approximated by the ellipse. For that purpose, the following measure is evaluated:

    fit(R, E) = |R ∩ E| / |R ∪ E|        (7.1)

where the membership of a pixel in the region R is given by its indicator function. The fit of the ellipse is then limited with a threshold, as well as the ratio of the lengths of the axes. For the second and third assumptions, the number of significant local minima in the intensity image is evaluated. The face is rejected if this number is zero or higher than a certain threshold.
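The ellipse approximation and fit test can be sketched via the moments of inertia. This is a minimal sketch; the 2-sigma scaling of the axes and the overlap ratio are our reading of the fit measure, not the thesis' exact formula.

```python
import numpy as np

def ellipse_fit(mask):
    """Approximate a binary region by the ellipse given by its moments of
    inertia and measure the fit as the approximate area overlap
    |R ∩ E| / |R ∪ E|. Also returns the major/minor axes ratio used for
    the elliptic-shape test."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([ys, xs], axis=1).astype(float)
    mu = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)       # ascending eigenvalues
    d = (pts - mu) @ evecs                   # coordinates in the ellipse frame
    # region pixels inside the 2-sigma ellipse
    inside = (d[:, 0] ** 2 / (4 * evals[0]) + d[:, 1] ** 2 / (4 * evals[1])) <= 1.0
    ellipse_area = np.pi * 2 * np.sqrt(evals[0]) * 2 * np.sqrt(evals[1])
    inter = np.count_nonzero(inside)
    union = len(pts) + max(ellipse_area - inter, 0.0)
    axes_ratio = np.sqrt(evals[1] / evals[0])
    return inter / union, axes_ratio
```

A filled disk scores close to 1 with an axes ratio near 1, while a scattered or elongated region is penalized either in the overlap or in the ratio.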

Using the assumptions about the region growing result and about the face candidates, we can build an algorithm which gradually analyzes the face candidates. In the case of rejection of all the candidates for one starting location, only the first candidate is cut out from the connected component, to reduce the input space. The input of algorithm 5 is the image and two masks: the mask marking the pixels with a high probability of being skin and the mask marking the provisory skin pixels. The output is a set of disjunct regions which represents the face candidates in the image.

Algorithm 5 Face finding

1. If the mask of high probability skin pixels is empty, end.

2. Using the mask and the texture mask, find the starting location for the region growing.

3. Apply the region growing algorithm, which produces the growing function and the loss functions.

4. Compute the profit function using equation 6.7. Find the ascending ordered set of local maxima positions in it.

5. For all the face candidates given by the local maxima, compute the fit into the ellipse, the axes ratio and the number of dark holes. Using the thresholds, pick all the regions that satisfy the criteria for a face and put them into a new set.

6. If the set is empty, take the first candidate, remove the corresponding region from both masks and go to step 1.

7. Otherwise find the candidate whose fit of the corresponding region into the ellipse is the greatest, put the region into the final set, remove the region from both masks and go to step 1.
7.1 Experiments
To evaluate the face detector performance, we have used 40 pictures from the Compaq database containing together 49 human faces in different positions, rotations and scales. In most images, if the connected component analysis described in section 6.3.5 is applied, the face in the image falls into a component with other objects having skin color, like hair, clothing, furniture, walls etc. We have applied two algorithms. The first one is described in the previous section and finds all the faces in the image. This algorithm correctly found 32 faces and incorrectly classified 37 regions as a face. The same test was performed on 20 images from the XM2VTS database. All 20 faces were correctly detected, while 3 false detections were made. We have observed that the starting location of the region growing process is very often placed inside the face region. This is caused by the high probability of pixels inside the face being skin and by the high color homogeneity in the face, especially the brow. We have used this observation to design a modified algorithm based on the algorithm from the previous section. In this algorithm only the first face is found, that means it ends after step 7. Using the same images from the Compaq database, 26 faces were correctly detected. In the same test applied to 20 images from the XM2VTS database, all the faces were correctly detected. Since the face model is very simple, we consider the results of the face detection as promising. Usually the wrong classification stems from the acceptance of a non-face region. There are three cases of false acceptance which lead to a false detection or a false rejection.

Figure 7.1: False detection example. Images from left to right: face detection result, rejected candidate with its ellipse fit, accepted candidate with its ellipse fit.


- Any small part of the face (e.g. a brow) is accepted; the result is then a division of the face into several parts which are detected as individual faces. These small areas usually have a very high fit into the ellipse.

- A region that flows out of the face is accepted and has an even better fit into the ellipse than the actual face (see Figure 7.1).

- A region that runs into the face is accepted as a face. If the preponderance of the face is involved (and subsequently cut out), the face is lost (see Figure 7.2).

Figure 7.2: False detection caused by accepting a non-face region. Images from left to right: face detection result, rejected candidate with zero dark holes, accepted candidate with a non-zero number of dark holes.

Since the set of face candidates usually involves the actual face, a more sophisticated method for the face validation may be examined to reduce the number of wrong detections. Despite the poor model, we can still present a number of relatively sufficient detection results in pictures with a high level of occlusion by other objects (see Figure 7.3).

Figure 7.3: Results from the face detection algorithm


Chapter 8 Discussion and conclusions

We have presented a method for human face detection in a complex scene environment based on skin and non-skin color modeled by histograms, a segmentation method using a region growing technique, and a simple verification scheme for the obtained face candidates. Since the skin detection is low-level and relies entirely on local information, its results will never be completely reliable, but they should be considered as providing useful information for a subsequent higher-level process based on the spatial distribution. Since we use histograms to model the skin and non-skin color, the per-pixel performance of the skin classifier cannot be increased by any color space transformation. This is possible only for other models (like a Gaussian mixture), which need to cluster the skin and non-skin colors into continuous blobs. Provided we have a large training dataset (38.7 million skin pixels in 33 thousand bins), the histogram approximation of the distribution is quite accurate. A problem arises with different races, since the appearance of races within the WWW is not a random sample. We have observed poor skin classification for Africans with a very dark skin, which is caused by their significantly lower presence in web images. This drawback, described in section 4.3, could be fixed by carefully selecting the training dataset with regard to the different human races, or by a color space transformation which unifies the human skin color across different races. We have tried to increase the skin classifier performance by the assumption that most skin regions include pixels with a high probability of being skin, although the rest of the pixels may have only a small chance to be labeled as skin. We classify the skin pixels with a high tolerance and then we reject the connected

components which do not include any of the high probability skin pixels. We have computed a set of ROC curves and found the best false positive error for a suitable threshold on the high probability skin pixels. However, the improvement compared to the per-pixel detection was only negligible. The multi color model of the face described in chapter 5 cannot be applied in the low-level classification process, as we have found out, but it might be used as a tool in the verification scheme. However, the color of the lips depends at least on the skin color; moreover, in low resolution images the lips are often barely visible. Since the trade-off between the false detections and false rejections in the per-pixel classification has a fixed limitation given by the ROC curve, we have employed the spatial attributes of the pixels to propose a segmentation method which extracts the homogeneous regions within the pixels classified as skin, which can then be passed to a high-level verification process. The segmentation method is based on the region growing technique, proposed as a randomized optimization process using a loss function computed from the given region's color and spatial attributes. Because we cannot a priori define the characteristics of a homogeneous region, the output from the region growing process with a single starting point is usually a set of regions, signaled by the local maxima of the profit function, which is computed in a similar manner as the loss function used for growing. The proposed region growing method is designed to be quite robust, since there are no a priori defined limitations of the homogeneous region. Hence if the growing process starts inside the face region, the actual face is usually involved in the face candidate set. However, we have observed a strong dependency on the intensity of the color. Especially if one half of the face is in the shade (most light comes from the side), flowing out of the face supervenes after only one half of the face is covered.
Another problem is the detection of the local maxima in the profit function when the edges around the face are only negligible (e.g. hair color similar to the skin color, or no shading under the chin). To improve the performance, different color space transformations and additional loss functions could be examined. For a proper examination, an appropriate measure of the performance should be proposed. To validate the face candidates, we have proposed a very simple scheme based on several assumptions about the fit of the face into an ellipse and the presence of some dark patches. Although the simple face model works in a number of cases, wrong detections mostly appear due to several drawbacks in the model. First, the appearance of the dark holes depends on the image resolution and on the intensity of the skin color. Since the appearance of a face strongly depends on its size, a normalization to a low resolution should be made. The second problem is the choice of the region which has the best fit into an ellipse,


which is not always the actual face. Since the actual face is usually quite well extracted and involved in the candidate set, a more sophisticated method could be used for the validation. Since the region growing is a randomized process, the results often differ between subsequent calls, even though 10 % of the boundary pixels are evaluated, which might be a well-representing fraction. Here the next drawback should be remarked, and that is the speed of the detection process. Even though the loss functions are evaluated in constant time, the speed depends on the size of the boundary of the region. We choose only a fraction of the boundary and keep the region's compactness at a high rate to reduce the number of pixels evaluated. However, the whole process can be very time consuming. For example, the evaluation for an image with many pixels marked as skin takes about 3 minutes on a two-processor Pentium 3 750MHz computer.


Bibliography
[1] International Conference on Automatic Face & Gesture Recognition, Grenoble, France, 2000.

[2] R. Auckenthaler, J. Brand, J. Mason, C. Chibelushi, and F. Deravi. Lip signatures for automatic person recognition. In IEEE Workshop, MMSP, pages 457-462, 1999.

[3] Christopher M. Bishop, editor. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, Great Britain, 3rd edition, 1997.

[4] Jaroslav Blažek. Detekce osob z barevné informace. Master's thesis, České vysoké učení technické, Fakulta elektrotechnická, Katedra řídicí techniky, Praha, Česká republika, 1999.

[5] Gary R. Bradski. Computer vision face tracking for use in a perceptual user interface. 1998.

[6] David Forsyth and Jean Ponce. Computer Vision: A modern approach. http://woska/Forsyth CV Book/.

[7] M. Jones and J. Rehg. Compaq skin database. http://www.crl.research.digital.com/publications/techreports/abstracts/98 11.html.

[8] Michael J. Jones and James M. Rehg. Statistical color models with application to skin detection. Compaq Cambridge Research Lab, Technical Report CRL 98/11, 1998.

[9] Stephen J. McKenna, Sumer Jabri, Zoran Duric, and Harry Wechsler. Tracking interacting people. In International Conference on Automatic Face & Gesture Recognition [1], pages 348-353.

[10] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre. XM2VTSDB: The extended M2VTS database. In International Conference on Audio- and Video-Based Person Authentication, pages 72-77, 1999.

[11] Rein-Lien Hsu, Mohamed Abdel-Mottaleb, and Anil K. Jain. Face detection in color images. http://citeseer.nj.nec.com/392009.html.

[12] Michail I. Schlesinger and Václav Hlaváč. Deset přednášek z teorie statistického a strukturního rozpoznávání. ČVUT, Prague, Czech Republic, 1999.

[13] Karl Schwerdt and James L. Crowley. Robust face tracking using color. In International Conference on Automatic Face & Gesture Recognition [1], pages 90-95.

[14] K. Sobottka and I. Pitas. Segmentation and tracking of faces in color images. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, pages 236-241, 1996.

[15] Moritz Störring, Hans J. Andersen, and Erik Granum. Estimation of the illuminant colour from human skin colour. In International Conference on Automatic Face & Gesture Recognition [1], pages 64-69.

[16] Jean-Christophe Terrillon, Mahdad N. Shirazi, Hideo Fukamachi, and Shigeru Akamatsu. Comparative performance of different skin chrominance models and chrominance spaces for automatic detection of human faces in color images. In International Conference on Automatic Face & Gesture Recognition [1], pages 54-61.

[17] Jochen Triesch and Christopher von der Malsburg. Self-organized integration of adaptive visual cues for face tracking. In International Conference on Automatic Face & Gesture Recognition [1], pages 102-107.

[18] M. Yang and N. Ahuja. Detecting human faces in color images. In International Conference on Automatic Face & Gesture Recognition [1], pages 446-453.

[19] Xiaojin Zhu, Jie Yang, and Alex Waibel. Segmenting hands of arbitrary color. In International Conference on Automatic Face & Gesture Recognition [1], pages 446-453.

55

Appendix A Applications description

Two applications were built: one for the evaluation of the individual face detection subtasks and the second one for handling the masks and relevant experiments. The applications were implemented in the Matlab environment, using m-files and mex-files written in C++. The first tool is called Face detector, while the second one is called Selector. In this appendix, first the basic information needed to run and control the applications is outlined, and then a coarse description of the individual functions is given.

A.1 Face detector

Face detector is a tool that applies a detection method to a single image and visualizes the result. All the source files can be found in the home directory xnovakv1/matlab/facedetect. To run the application, run Matlab and set your working directory, or add the path to the search paths with addpath(path). Then simply type facedetect to open the application window. To set the initial image database directory to a given path, use facedetect(<database>). The default image database is set to our WWW testing image database.

A.1.1 Application control

There are two control panels that can be used to handle the tool. The panel on the left controls the image database, which is the list of image files in a single directory, and the histogram files of skin and non-skin colors used for detection. The text field enables to insert the name of a directory and load its content, which is a list of image names that can be browsed with the buttons Previous and Next. New histogram file names can be inserted into the bottom text fields and then loaded.

The right panel controls the detection subtasks and the threshold parameter. The following detection subtasks are available:

Skin detection: Low-level skin detection using the false detection error.
Face features: Face features detection, using the multi-color model of the face and the false negative error in the face/non-face pixel classification.
Connected components: Simple connected component analysis using the high probability skin pixels to accept the provisory skin regions.
Starting point: Shows the starting location for the region growing algorithm.
First face: Finds only one face in the image, if any exists.
All faces: Finds all the faces in the image.

The button Apply can be used to call the given method, or Auto apply can be set on to call the method after any change (e.g. a change of the input image). The check-box Fast mode ensures that big images are scaled down before being sent to the face finding method. The default value is on and we recommend not to change it. The threshold can be changed with the slider or the text field. If any face detection or connected component method is set up, the threshold is used and it is automatically set to 60 %. For more detailed instructions see the file facedetect.html in the same directory.

A.1.2 Functions description

In this section, a short description of each function is given. The functions are sorted in alphabetical order and the extension .m or .cpp denotes m-files or C++ files respectively.

allReg.m takes the image, the mask of provisory skin pixels and the mask of high probability skin pixels and returns the mask marking the face regions, together with the set of ellipses around the faces.

classf.cpp takes the image and a classification table and produces a mask marking the pixels according to the table (e.g. skin).

conComp2.m applies simple connected component analysis using two masks: the mask marking the provisory skin pixels and the mask marking the high probability skin pixels.

createTable.m creates the classification tables. If the histogram names are passed, the histograms are loaded from the disk.

darkHoles.m computes the number of dark holes in the image part given by the given mask.

ellipse.m computes the best fit ellipse for the pixels given by a mask. Returns the polygon marking the ellipse and the lengths of the axes.

facedetect.m initializes the tool environment and creates the tool window. Handles the callbacks received from the graphics components.

firstReg.m searches for a face in the image. If a face is found, the function returns the mask marking the corresponding region and the best fit ellipse.

floodSeg*.cpp region growing algorithm. Takes the image, two masks and the starting location and produces a mask marking the last region and its boundary, together with the values of the loss functions and the growing function. There are more versions of this algorithm, see facedetect.html for details.

change image.m updates the image in the image axes to the actual image.

load db.m loads the content of the given directory and shows the filenames of all the images in the file list.

localMax.m finds the set of local maxima in a function.

mixtab.cpp creates a classification table using the Bayes rule. The input is a set of couples (histogram, prior probability); their number is at most 20.

neyman.cpp takes the two histograms of the positive and negative classes and returns the 3D table of rates for the individual colors and the table of thresholds for a given false positive error.

panel.m sets up all the graphics components and their callbacks.

readhist.cpp reads the histogram file *.hst and returns the histogram table and the total sum of pixels in the histogram.

selvis.cpp visualizes the face features classification result using the image and the mask with the values 1 for skin, 2 for lips and 3 for hair.

startGrow.m finds the start location for the region growing algorithm.

view class.m applies the actual detection method and shows the result.

view image.m shows the actual image in the image axes. If the image is not present in the memory, it is loaded from the disk.
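As an illustration of the idea behind neyman.cpp, the following sketch builds a skin/non-skin classification table from two color histograms: bins are accepted in order of decreasing likelihood ratio until the allowed false positive rate is exhausted. The function name and the flat histogram representation are assumptions made for the example; the real mex file works with 3D histogram tables.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch of a Neyman-Pearson table construction: mark as
// "skin" the color bins with the largest likelihood ratio
// P(c|skin)/P(c|non-skin) while the accumulated false positive rate
// stays below eps.
std::vector<bool> neymanTable(const std::vector<double>& skin,
                              const std::vector<double>& nonskin,
                              double eps) {
    double nonskinTotal = 0;
    for (double n : nonskin) nonskinTotal += n;

    // Sort bin indices by decreasing likelihood ratio; the comparison
    // uses cross-multiplication to avoid dividing by empty bins.
    std::vector<int> order(skin.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return skin[a] * nonskin[b] > skin[b] * nonskin[a];
    });

    // Accept bins while the non-skin mass accepted so far is below eps.
    std::vector<bool> isSkin(skin.size(), false);
    double falsePositives = 0;
    for (int i : order) {
        if ((falsePositives + nonskin[i]) / nonskinTotal > eps) break;
        falsePositives += nonskin[i];
        isSkin[i] = true;
    }
    return isSkin;
}
```

With a tighter eps fewer bins are accepted, which is exactly how the threshold slider of the tool trades false positives for false negatives.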
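The acceptance rule of the connected component analysis (conComp2.m) can be sketched as follows. This is an illustrative reimplementation under simplifying assumptions (binary masks as nested vectors, 4-connectivity), not the thesis code:

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Keep only those 4-connected components of the provisory skin mask
// that contain at least one high probability skin pixel.
std::vector<std::vector<int>> keepConfirmedComponents(
    const std::vector<std::vector<int>>& provisory,
    const std::vector<std::vector<int>>& highProb) {
    int h = provisory.size(), w = provisory[0].size();
    std::vector<std::vector<int>> out(h, std::vector<int>(w, 0));
    std::vector<std::vector<char>> seen(h, std::vector<char>(w, 0));
    const int dr[] = {1, -1, 0, 0}, dc[] = {0, 0, 1, -1};

    for (int r = 0; r < h; ++r)
        for (int c = 0; c < w; ++c) {
            if (!provisory[r][c] || seen[r][c]) continue;
            // Collect one component with breadth-first search.
            std::vector<std::pair<int, int>> comp;
            bool confirmed = false;
            std::queue<std::pair<int, int>> q;
            q.push({r, c});
            seen[r][c] = 1;
            while (!q.empty()) {
                auto [cr, cc] = q.front(); q.pop();
                comp.push_back({cr, cc});
                if (highProb[cr][cc]) confirmed = true;
                for (int k = 0; k < 4; ++k) {
                    int nr = cr + dr[k], nc = cc + dc[k];
                    if (nr < 0 || nr >= h || nc < 0 || nc >= w) continue;
                    if (!provisory[nr][nc] || seen[nr][nc]) continue;
                    seen[nr][nc] = 1;
                    q.push({nr, nc});
                }
            }
            // Copy the component to the output only if it was confirmed.
            if (confirmed)
                for (auto [pr, pc] : comp) out[pr][pc] = 1;
        }
    return out;
}
```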

A.2 Selector
Selector is a tool that can be used to select regions of pixels in the image and use them in subsequent functions, like histogram building, color distribution visualization etc. All the m-files and mex-files can be found in the home directory xnovakv1/matlab/selector. To run the application, start Matlab and set your working directory, or add the path to the search paths using addpath(path). Then simply type selector to open the application window. To set the initial image database directory to a given path, use selector(<database>). The default image database is set to our WWW testing image database.

A.2.1 Application control

There are two control panels that can be used to handle the tool. The panel on the left controls the image database, which is the list of image files in a single directory, and the saving of the histogram files of a given class. The text field enables to insert the name of a directory and load its content, which is a list of image names that can be browsed with the buttons Previous and Next. The right panel enables to set the selection method, which can be either a polygon selection or a flood fill selection. The flood fill selection is controlled by two parameters (sliders on the right) that set the maximal difference between neighboring pixels and the maximal difference to the starting pixel. The selection target can be one of three classes: skin (default), lips or hair. A change of the selection target changes the histogram filename to the relevant class, and the actual selection target determines the class of the subsequently selected pixels. For detailed instructions see the file selector.html in the given directory.
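The two-threshold flood fill selection described above can be sketched as follows. This hypothetical grayscale version (the tool itself works with color images) accepts a 4-neighbor when it is close both to the pixel it is reached from and to the starting pixel:

```cpp
#include <cassert>
#include <cstdlib>
#include <queue>
#include <utility>
#include <vector>

// Flood fill selection sketch: grow from a seed pixel, bounded by
// tNeigh (maximal difference between neighboring pixels) and
// tSeed (maximal difference to the starting pixel).
std::vector<std::vector<int>> floodSelect(
    const std::vector<std::vector<int>>& img,
    int seedR, int seedC, int tNeigh, int tSeed) {
    int h = img.size(), w = img[0].size(), seedVal = img[seedR][seedC];
    std::vector<std::vector<int>> mask(h, std::vector<int>(w, 0));
    std::queue<std::pair<int, int>> q;
    mask[seedR][seedC] = 1;
    q.push({seedR, seedC});
    const int dr[] = {1, -1, 0, 0}, dc[] = {0, 0, 1, -1};
    while (!q.empty()) {
        auto [r, c] = q.front(); q.pop();
        for (int k = 0; k < 4; ++k) {
            int nr = r + dr[k], nc = c + dc[k];
            if (nr < 0 || nr >= h || nc < 0 || nc >= w || mask[nr][nc]) continue;
            if (std::abs(img[nr][nc] - img[r][c]) > tNeigh) continue;  // local step
            if (std::abs(img[nr][nc] - seedVal) > tSeed) continue;     // drift from seed
            mask[nr][nc] = 1;
            q.push({nr, nc});
        }
    }
    return mask;
}
```

The neighbor threshold keeps the region locally smooth, while the seed threshold prevents the selection from slowly drifting into a different color.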

A.2.2 Functions description

In this section, a short description of each function is given. The functions are sorted in alphabetical order and the extension .m or .cpp denotes m-files or C++ files respectively.

change image.m changes the image in the image axes to the actual image.

countDisp.m evaluates the color variance within the skin masks. Prints the mean, variance, maximal and minimal value of the color variance in the individual images.

floodSelect.m calls floodsel.cpp with the actual image and a given starting location, and refreshes the image.

floodsel.cpp selects the pixels using a recursive algorithm that takes the image, the starting location and two thresholds.

key handle handles the shortcut keys.

loadMasks.m loads all the available masks from the directory imageData/masks. The masks are stored in hdf format and have the same names as the images.

load db.m loads the image database from the actual directory and shows the first image.

panel.m sets up all the graphics components and their callbacks.

saveHst.m generates the histograms from all the images and masks and saves the appropriate one in a file in the current directory.

saveMasks.m saves all the non-empty masks into the mask directory imageData/masks. The masks are stored in hdf format and have the same names as the images.

selector.m initializes the tool environment and creates the tool window. Handles the callbacks received from the graphics components, including the callbacks used for the polygon selection.

showDist.m shows the color distribution of the skin pixels selected in the actual image.

updthist2.cpp adds pixels to a histogram using the image, the mask and a given class of pixels.

view image.m shows the actual image.

writehist.cpp saves the given histogram into the file with a given name.

