Face Detection Evaluation
called false negative (FN). Precision gives an idea of how well the detected area matches the ground-truth area. Recall, on the other hand, gives an idea of how well the ground truth overlaps with the detected area. All the measures described below have recall and precision counterparts so that both FP and FN errors are accounted for.
When calculating overlaps, the spatial union of boxes is considered, which makes sure that overlapped areas are not counted twice.

2.1.1 Object Count Accuracy

This measure compares the number of ground-truth objects in the frame with the number of algorithm output boxes. It penalizes the algorithm both for extra and for missing boxes relative to the ground truth. Let G be the set of ground-truth objects in the image and let D be the set of output boxes produced by the algorithm. Accuracy is defined as:
Accuracy = { undefined                        if NG + ND = 0
           { min(NG, ND) / ((NG + ND) / 2)    otherwise

where NG and ND are the number of ground-truth objects and output boxes, respectively, in the image. The measure does not consider the spatial information of these boxes; only the count of boxes in each frame is used. This measure can be useful for evaluating how well an algorithm identifies the number of objects in a given image, irrespective of how close they are to the ground-truth objects. Consider a scenario with 10 ground-truth objects in which algorithm A finds 8 boxes and algorithm B finds 2 boxes; A is clearly better than B at identifying the number of objects in the image. To measure accuracy in terms of area overlap, there are other measures.
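As a minimal sketch, the count-based accuracy above can be computed as follows (the function name and the `None` return for the undefined case are my own choices):

```python
def object_count_accuracy(n_gt, n_det):
    """Object Count Accuracy (Sec 2.1.1): compares only the box counts,
    ignoring all spatial information. Returns None for the undefined
    case NG + ND = 0."""
    if n_gt + n_det == 0:
        return None
    # min(NG, ND) normalized by the average of the two counts
    return min(n_gt, n_det) / ((n_gt + n_det) / 2)
```

For the scenario above, `object_count_accuracy(10, 8)` gives about 0.89 for algorithm A, while `object_count_accuracy(10, 2)` gives about 0.33 for algorithm B.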
Figure 1: Recall and Precision Concept

The measures are organized in increasing order of complexity and accuracy. The first measure, Object Count Accuracy (Sec 2.1.1), is a trivial measure: it simply compares the number of detected objects with the number of ground-truth objects without checking how accurately they overlap. Next, the pixel-based measures, which check the raw pixel overlap between the output boxes and the ground-truth boxes, are defined in Secs 2.1.2 and 2.1.3. Here the entire frame is treated as a bit map, with no distinction made between different objects; if one detected box overlapped another, these measures would make no distinction, since they consider the union of the areas. Here, bigger boxes have an advantage over smaller boxes. The measures discussed in Secs 2.1.4 and 2.1.5 are area-thresholded measures: if the overlap between a ground-truth box and a detected box is greater than a threshold, full credit is given for that box pair. Next, the area-based measures are discussed in Secs 2.1.6 and 2.1.7; these treat the individual boxes equally regardless of size, in contrast to the pixel-based measures, which favor bigger boxes, and they take the individual objects into account rather than ignoring such distinctions. In Sec 2.1.8, the fragmentation measure is discussed; it penalizes algorithms that break an individual ground-truth box into multiple detected boxes. We also propose a set of measures based on the requirement of a one-to-one mapping between each ground-truth box and a detected box. The positional accuracy of the detection output with respect to the ground truth is measured in Sec 2.2.1, a size-based measure is discussed in Sec 2.2.2, and Sec 2.2.3 discusses an orientation-based measure.
Finally, we propose a composite measure in Sec 2.2.4, which is area-based and takes into account the recall, precision and fragmentation.
2.1.2 Pixel-based Recall

This measure captures how well the algorithm minimizes false negatives. It is a pixel-count-based measure. Let UnionG and UnionD be the spatial unions of the boxes in G and D:

UnionG = ∪_{i=1..NG} Gi

where Gi represents the ith ground-truth object in the image, and

UnionD = ∪_{i=1..ND} Di

where Di represents the ith detected object in the image. We define Recall as the ratio of the detected area within the ground truth to the total ground-truth area:

Recall = { undefined                       if |UnionG| = 0
         { |UnionG ∩ UnionD| / |UnionG|    otherwise

where the | | operator denotes the number of pixels in the given area. This measure treats the frame not as a collection of objects but as a binary pixel map (object/non-object; output-covered/not-output-covered). The score increases as the overlap increases and equals 1 for complete overlap.
2.1.3 Pixel-based Precision

This measure captures how well the algorithm minimizes false positives. It is a pixel-count-based measure. With UnionG and UnionD defined as above,

UnionG = ∪_{i=1..NG} Gi
UnionD = ∪_{i=1..ND} Di

we define Precision as the ratio of the detected area within the ground truth to the total detected area:

Precision = { undefined                       if |UnionD| = 0
            { |UnionG ∩ UnionD| / |UnionD|    otherwise

where the | | operator denotes the number of pixels in the given area. This measure treats the frame not as a collection of objects but as a binary pixel map (object/non-object; output-covered/not-output-covered). The score increases as false positives decrease and equals 1 when there are no false positives.
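The two pixel-based measures can be sketched together. Representing each spatial union as a set of pixel coordinates makes the "overlapping areas counted once" property automatic; the box format and function names are my own, with boxes given as (x0, y0, x1, y1) and exclusive upper bounds:

```python
def paint_union(boxes, width, height):
    """Rasterize axis-aligned boxes (x0, y0, x1, y1), exclusive upper
    bounds, into a set of pixel coordinates. Using a set makes the
    spatial union explicit: pixels covered by several boxes count once."""
    pixels = set()
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                pixels.add((x, y))
    return pixels

def pixel_recall(gt_boxes, det_boxes, width, height):
    """Sec 2.1.2: |UnionG ∩ UnionD| / |UnionG|; None when undefined."""
    union_g = paint_union(gt_boxes, width, height)
    union_d = paint_union(det_boxes, width, height)
    if not union_g:
        return None
    return len(union_g & union_d) / len(union_g)

def pixel_precision(gt_boxes, det_boxes, width, height):
    """Sec 2.1.3: |UnionG ∩ UnionD| / |UnionD|; None when undefined."""
    union_g = paint_union(gt_boxes, width, height)
    union_d = paint_union(det_boxes, width, height)
    if not union_d:
        return None
    return len(union_g & union_d) / len(union_d)
```

For example, a detection covering the left half of a single ground-truth box gives a pixel recall of 0.5 but a pixel precision of 1.0.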
2.1.4 Area-thresholded Recall

In this measure, a ground-truth object is considered detected if the output boxes cover a minimum proportion of its area. Recall is computed as the ratio of the number of detected objects to the total number of ground-truth objects:

Recall = Σ_{i=1..NG} ObjectDetect(Gi) / NG

where

ObjectDetect(Gi) = { 1    if |Gi ∩ UnionD| / |Gi| > OVERLAP_MIN
                   { 0    otherwise

Here OVERLAP_MIN is the minimum proportion of the ground-truth object's area that must be overlapped by the output boxes for the object to count as correctly detected. The ground-truth objects are treated equally regardless of size.

2.1.5 Area-thresholded Precision

This is the counterpart of the measure in Sec 2.1.4. It counts the number of output boxes that significantly cover the ground truth; an output box Di significantly covers the ground truth if a minimum proportion of its area overlaps with UnionG:

Precision = Σ_{i=1..ND} BoxPrecision(Di) / ND

where

BoxPrecision(Di) = { 1    if |Di ∩ UnionG| / |Di| > OVERLAP_MIN
                   { 0    otherwise

Here OVERLAP_MIN is the minimum proportion of the output box's area that must be overlapped by the ground truth for the box to count as precise. Again, the output boxes are treated equally regardless of size.

2.1.6 Area-based Recall

This measure is the average area recall over all ground-truth objects in the image. The recall of an object is the proportion of its area covered by the algorithm's output boxes, and the objects are treated equally regardless of size. We define Recall as the average recall over all objects in the ground truth G:

Recall = Σ_{i=1..NG} ObjectRecall(Gi) / NG

where

ObjectRecall(Gi) = { undefined               if |Gi| = 0
                   { |Gi ∩ UnionD| / |Gi|    otherwise

and the | | operator denotes the number of pixels in the given area. All ground-truth objects contribute equally to the measure, regardless of their size. At one extreme, if an image contains two objects, a large object that was completely detected and a very small object that was missed, Recall will be 50%.

2.1.7 Area-based Precision

This is the counterpart of the measure in Sec 2.1.6, with the output boxes examined instead of the ground-truth objects. Precision is computed for each output box and averaged over the whole image; the precision of a box is the proportion of its area that covers ground-truth objects. We define Precision as the average precision over the algorithm's output boxes D:

Precision = Σ_{i=1..ND} BoxPrecision(Di) / ND

where

BoxPrecision(Di) = { undefined               if |Di| = 0
                   { |Di ∩ UnionG| / |Di|    otherwise

and the | | operator denotes the number of pixels in the given area.
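The recall side of the thresholded and averaged variants can be sketched as follows (the precision counterparts are symmetric, with the roles of G and D swapped; function names and the set-based pixel representation are my own):

```python
def box_pixels(box):
    """Pixel set of one axis-aligned box (x0, y0, x1, y1), exclusive upper bounds."""
    x0, y0, x1, y1 = box
    return {(x, y) for y in range(y0, y1) for x in range(x0, x1)}

def union_pixels(boxes):
    """Spatial union of a list of boxes as a pixel set."""
    pixels = set()
    for box in boxes:
        pixels |= box_pixels(box)
    return pixels

def area_thresholded_recall(gt_boxes, det_boxes, overlap_min=0.4):
    """Sec 2.1.4: a ground-truth box counts as detected only when more than
    OVERLAP_MIN of its area is covered; each box then scores 0 or 1."""
    if not gt_boxes:
        return None
    union_d = union_pixels(det_boxes)
    detected = 0
    for g in gt_boxes:
        pix = box_pixels(g)
        if pix and len(pix & union_d) / len(pix) > overlap_min:
            detected += 1
    return detected / len(gt_boxes)

def area_based_recall(gt_boxes, det_boxes):
    """Sec 2.1.6: average fractional coverage per object; every object
    weighs the same regardless of its size."""
    union_d = union_pixels(det_boxes)
    recalls = [len(box_pixels(g) & union_d) / len(box_pixels(g))
               for g in gt_boxes if box_pixels(g)]
    return sum(recalls) / len(recalls) if recalls else None
```

With one large object fully covered and one small object missed, both functions return 0.5, matching the 50% example above.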
In this measure the output boxes are treated equally regardless of size.

2.1.8 Average Fragmentation

Detection of objects is usually not the final step in a vision system. For example, text extracted from video will go through enhancement, binarization, and finally recognition by an OCR system. Ideally, the extracted text should be in one piece, but a detection algorithm could produce several boxes (e.g., one per word or character) or multiple overlapping boxes, which can increase the difficulty of the next processing step. This measure is intended to penalize an algorithm for producing multiple output boxes that cover a single ground-truth object; multiple detections include both overlapping and non-overlapping boxes. For a ground-truth object Gi, the fragmentation of the output boxes overlapping Gi is measured by:
Frag(Gi) = { undefined                     if N(D ∩ Gi) = 0
           { 1 / (1 + log10(N(D ∩ Gi)))    otherwise

where N(D ∩ Gi) is the number of output boxes in D that overlap with the ground-truth object Gi. For an image, Frag is simply the average fragmentation over all ground-truth objects for which Frag(Gi) is defined. This is a particularly useful metric for face detection.
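A sketch of the fragmentation measure, again using axis-aligned boxes (x0, y0, x1, y1) with exclusive upper bounds (the box format and function names are my own):

```python
import math

def avg_fragmentation(gt_boxes, det_boxes):
    """Sec 2.1.8: penalize splitting one ground-truth object across several
    output boxes. Frag(Gi) = 1 / (1 + log10(k)), where k is the number of
    detections overlapping Gi; objects with k = 0 are undefined and skipped."""
    def overlaps(a, b):
        ax0, ay0, ax1, ay1 = a
        bx0, by0, bx1, by1 = b
        # strict inequalities: touching edges do not count as overlap
        return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

    frags = []
    for g in gt_boxes:
        k = sum(1 for d in det_boxes if overlaps(g, d))
        if k > 0:
            frags.append(1.0 / (1.0 + math.log10(k)))
    return sum(frags) / len(frags) if frags else None
```

A single detection per object gives the ideal score of 1; two boxes covering one object drop the score to 1 / (1 + log10 2) ≈ 0.77.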
However, each setting is associated with a cost. While the default settings detect faces in most cases, they might also declare a face when there is none in the image. On the other hand, although the alternative setting might not declare faces when there is no face in the image, it might also miss faces that are present. Figs. 2 and 3 illustrate this tradeoff.
(2c) Results with setting 1 (default)

Figure 2: Results showing the better performance of the default settings.
1. Convert the RGB image into the YCbCr color space. The reason is that segmentation of skin-colored regions becomes robust only if the chrominance component is used in the analysis. Therefore, the luminance component is eliminated as much as possible by choosing the CbCr (chrominance) plane of the YCbCr color space to build the model.

2. Regions of interest are carefully extracted from the image as training pixels. Regions containing human skin pixels as well as non-skin pixels are collected. The mean and covariance of this database characterize the model, which is a unimodal Gaussian; the mean and covariance are estimated using the EM algorithm. (The EM algorithm was implemented as part of another project by me in Fall 2003; the same code was reused.) It can be seen in Figure 4 that the color of human skin pixels is confined to a very small region of the chrominance space, distinct from the non-skin region.
Figure 4: CbCr plane of skin and non-skin regions

Let c = [Cb Cr]^T be the chrominance vector of an input pixel. The probability that the given pixel lies in the skin distribution is given by
p(c | skin) = (1 / (2π |Σs|^(1/2))) exp( −(1/2) (c − μs)^T Σs^(−1) (c − μs) )

where μs and Σs are the mean vector and covariance
(3c) Results with setting 1 (default)

Figure 3: Results showing the false positives produced by the default settings.

Though the default settings produced false positives, the fact that they detect the face accurately when one is present motivated their use. The Haar face detection results shown from now on were obtained using the default settings.
matrix, respectively, of the training pixels. This gives the probability of a pixel's chrominance occurring given that it is a skin pixel. Similarly, we compute p(c | non-skin). The posterior probability that a pixel represents skin given its chrominance vector c is evaluated using Bayes' theorem:

p(skin | c) = p(c | skin) / ( p(c | skin) + p(c | non-skin) )

An input image is analyzed pixel by pixel, evaluating the skin probability at each pixel. This results in a gray-level image in which the gray value gives the probability of the pixel representing skin. This image is thresholded to obtain a binary image. A correct choice of threshold is critical: increasing the threshold increases the chance of losing skin regions exposed to adverse lighting conditions, while the extra regions that get
retained in the image because of the lower threshold can be removed using connected component operators.

3. The image resulting from stage 2 contains a lot of noise, so the image is opened using a disk-shaped structuring element. The effect of the area opening is the removal of small, bright regions in the thresholded image. The size of the structuring element should not exceed that of the smallest face the system is designed to detect. A set of shape-based connected operators is then applied to the remaining components to decide whether they represent a face. These operators rely on basic assumptions about the shape of a face.

4. Compactness. This is defined as the ratio of a component's area to the square of its perimeter:

Compactness = A / P^2

This criterion is maximized for circular objects, and face components exhibit a high value. If a component's compactness exceeds a threshold, it is retained for further analysis; otherwise it is discarded.

5. Solidity. For a connected component, solidity is defined as the ratio of its area to the area of its rectangular bounding box:

Solidity = A / (Dx · Dy)

It measures the area occupancy of a connected component within its min-max box dimensions. Solidity assumes a high value for face components. If a component's solidity exceeds a threshold, it is retained; otherwise it is discarded.

6. Aspect ratio. It is assumed that face components normally have an aspect ratio well within a certain range; if a component's aspect ratio falls outside this range, the component is eliminated:

Aspect Ratio = Dy / Dx

7. Normalization. The remaining unwanted components are removed using the normalized area, the ratio of the area of a connected component to that of the largest component in the image. Connected components below this threshold are eliminated. The connected components that remain at this stage contain the faces.
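The per-pixel skin probability of step 2 (a Gaussian likelihood per class combined with Bayes' rule over the CbCr vector) might be sketched as follows. Equal class priors are assumed, as the posterior formula in the text implies, and the closed-form 2×2 inverse avoids any linear-algebra dependency; the model values used in examples are illustrative, not the trained ones:

```python
import math

def gaussian_pdf_2d(c, mean, cov):
    """Bivariate Gaussian density; cov is a 2x2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = cov[0], cov[1]
    det = a * d - b * b
    dx, dy = c[0] - mean[0], c[1] - mean[1]
    # Mahalanobis distance via the closed-form inverse of a 2x2 matrix
    maha = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * maha) / (2 * math.pi * math.sqrt(det))

def skin_posterior(c, skin_model, nonskin_model):
    """Bayes' rule on the CbCr vector c, with each model a (mean, cov)
    pair; equal priors cancel out of the ratio."""
    p_skin = gaussian_pdf_2d(c, *skin_model)
    p_non = gaussian_pdf_2d(c, *nonskin_model)
    return p_skin / (p_skin + p_non)
```

Evaluating `skin_posterior` at every pixel yields the gray-level probability image described above, which is then thresholded.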
Figure 5 walks through all the steps for the image shown in Figure 5a.
The parameter settings used for the above image were:

Probability threshold: 0.1
Size of SE (for opening): 17 × 17 pixels, disk-shaped (image size 816 × 616 pixels)
Compactness threshold: 0.025
Solidity threshold: 0.5
Aspect ratio: 0.9 – 2.1
Normalization threshold: 0.35

It is worth mentioning how the parameter settings were finalized: they were derived from scatter plots of the Recall and Precision measures. Again, this is one of the primary uses of empirical evaluation. It is explained in detail in a later section.
5. Ground truth
The following guidelines were used while ground-truthing the images for evaluation. A face is bounded by a rectangle whose area includes the eyebrows, eyes, nose, mouth, and chin. There should be a small but clear space between these facial features and the bounding box. The ears and top of the hair are not included in the face. For clear visualization, the ground-truth images are shown in section 7 along with the results of each of the methods on the test images. One of the major issues with evaluation is the quality of the ground truth: how reliable is it? To account for this ambiguity, care was taken to make the evaluation insensitive to ground-truthing errors. Measures that use area overlaps were made somewhat lenient, in the sense that their contribution to the final score was weighted less than measures such as fragmentation and object count accuracy. This approach of weighting the different metric values will also help in extending the evaluation protocol to different domains.
(5i) Final result

Figure 5: Various stages in face detection using skin color and connected component operators.
(7b) Results with image-specific settings (though one of the faces is fragmented, it is localized properly)

Figure 7: Results explaining the tradeoff between a global setting and a per-image setting of parameters.

The result in Fig. 7a was obtained with the global settings, while the result in Fig. 7b was obtained with the following settings:

Probability threshold: 0.1
Size of SE (for opening): 5 × 5 pixels, disk-shaped (image size 400 × 276 pixels)
Compactness threshold: 0.025
Solidity threshold: 0.5
Aspect ratio: 0.3 – 2.1
Normalization threshold: 0.05

The global result is nevertheless acceptable, because one cannot set the parameters on a per-image basis: face detection should be fully automated, with no user intervention in the parameter settings.
Figure 6: Scatter plot of ATR/ATP against the Normalized Area threshold

A value of 0.35 was chosen for the Normalized Area threshold because it is the value at which ATR does not drop very low for most of the images, while ATP is maintained at the best possible value. A similar plot was made for the Aspect Ratio range against ATR/ATP, and along the same lines the Aspect Ratio range was set to 0.9 – 2.1. Since the range has two values, a lower and a higher threshold, two separate plots would be needed to show the effect of each; because the decision is hard to appreciate from two plots, they are not shown here. It is important to note that this setting might not give the best performance on every image; with a different setting, the performance on a particular image might improve. This is shown in Fig. 7.
7. Evaluation Results
Based on the performance metric values (see the evaluation results in Fig. 9 on the last page), one can conclude the following:

1. The Haar face detector is more robust at detecting a face in the image. This is apparent from the fact that its Area Thresholded Recall is always higher than or equal to that of the skin color-based face detector.

2. Both algorithms produce false positives. However, based on the Precision values, we can infer that the Haar face detector often produces fewer false positives than the skin color-based face detector.

3. On this dataset, when either algorithm detects a face, it detects it whole. This can be seen from the average fragmentation measure for the test images. Again, this is specific to the test data set used; the skin color-based face detector is expected to be prone to fragmentation errors. In fact, when tried on an image from a database from which no images were used in training, the face detection results showed fragmentation.
where

Overlap Ratio = Σ_{i=1..min(NG, ND)} |Gi ∩ Di| / |Gi ∪ Di|

Here, min(NG, ND) indicates the maximum number of one-to-one mappings between ground-truth objects and detected boxes. However, work remains to be done in checking the failure cases of the measure on boundary conditions; initial results have been promising in that it successfully captures the aspects stated above. Finally, the essence of evaluation is to improve the performance of the algorithm. Here we have noticed that the skin color-based face detector does not perform as well as the Haar face detector; in fact, even tweaking the parameters does not yield the best results. This shows that the method based on skin color is not robust.
Figure 8: Results of the skin color-based face detector. The detected face is fragmented, and there are also false positives.

This image was not included in the test set because color is sensitive to the camera used. Since the image comes from a data set on which the classifier was not trained, it would not be fair to test the algorithm on it. Again, this is one of the major drawbacks of the skin color-based face detector: it has to be trained with skin and non-skin pixels from images taken with a camera whose images will be present in the test set. The Haar face detector is not limited by any such constraint. From these results, we can conclude that the Haar face detector performs better than the skin color-based face detector in all aspects.
9. Conclusion
Two face detection algorithms, one based on Haar-like features and the other based on skin color, have been implemented. Both methods have been empirically evaluated and their performance quantified. Based on the results, we can conclude that the Haar face detector outperforms the skin color-based face detector in almost all aspects of the evaluation. Efforts to improve the performance of the skin color-based method have proven futile, probably due to the method's inability to handle challenging situations. Even a cursory investigation reveals that color is not a good feature to rely on: it can vary with lighting, camera, shadows, and other factors. Since the evaluation is not subjective and the performance has been quantified, there is no ambiguity in this conclusion.
8. Future Work
Effort has to be directed toward making the performance evaluation insensitive to ground-truthing errors. This is an extremely difficult task; measures such as Area Thresholded Recall and Precision are steps in this direction, but there is still scope for improvement, and this aspect has to be explored. Another point is that there are probably too many measures. Considering that they cover different aspects of performance, this is acceptable; however, there should also be measures that comprehensively cover all aspects of an algorithm. To this end, we have developed a comprehensive measure that accounts for fragmentation (splits), merges, area overlap, and false positives. It requires a one-to-one mapping of ground-truth and detected objects, and it is an area-based measure that penalizes false detections, missed detections, and spatial fragmentation. For a single image with NG ground-truth objects and ND detected objects, we define the detection composite measure CAM as:
CAM = Overlap Ratio / ((NG + ND) / 2)
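A sketch of the composite measure, again over axis-aligned boxes (x0, y0, x1, y1) with exclusive upper bounds. The report leaves the one-to-one mapping procedure open, so pairing by list index here is my own simplifying assumption:

```python
def composite_measure(gt_boxes, det_boxes):
    """CAM sketch: sum of per-pair Jaccard overlaps |Gi ∩ Di| / |Gi ∪ Di|
    over min(NG, ND) one-to-one pairs (paired by index here), normalized
    by the average of the ground-truth and detection counts. Extra or
    missing boxes inflate the denominator, penalizing false and missed
    detections; splitting one object inflates ND the same way."""
    def pixels(box):
        x0, y0, x1, y1 = box
        return {(x, y) for y in range(y0, y1) for x in range(x0, x1)}

    n_g, n_d = len(gt_boxes), len(det_boxes)
    if n_g + n_d == 0:
        return None
    overlap_ratio = sum(
        len(pixels(g) & pixels(d)) / len(pixels(g) | pixels(d))
        for g, d in zip(gt_boxes, det_boxes)  # min(NG, ND) pairs
    )
    return overlap_ratio / ((n_g + n_d) / 2)
```

A perfect single detection scores 1; a completely missed object scores 0; one object split into two half-boxes scores 1/3, so fragmentation is penalized even though the union covers the object.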
References
[1] Kasturi, R., Goldgof, D., Soundararajan, P., and Manohar, V., "Performance Evaluation Protocol for Text and Face Detection & Tracking in Video Analysis and Content Extraction (VACE-II)," report submitted to the Advanced Research and Development Activity, March 2004.

[2] Viola, P. and Jones, M. J., "Robust real-time object detection," in Proc. of the IEEE Workshop on Statistical and Computational Theories of Vision, 2001.

[3] Kuchi, P., Gabbur, P., Bhat, S., and David, S., "Human Face Detection and Tracking using Skin Color Modeling and Connected Component Operators," IETE Journal of Research, Special Issue on Visual Media Processing, May 2002.
Evaluation Results
9.1 (a) 9.1 (b) 9.1 (c)

      OCA   PBR   PBP   ATR   ATP   ABR   ABP   AF
A-1   1     .97   .81   1     1     .9    .73   1
A-2   .67   .61   .92   .5    1     .49   .92   1

      OCA   PBR   PBP   ATR   ATP   ABR   ABP   AF
A-1   .8    .8    .87   .67   1     .67   .82   1
A-2   .8    .71   .94   .67   1     .58   .96   1

      OCA   PBR   PBP   ATR   ATP   ABR   ABP   AF
A-1   1     1     .98   1     1     .98   .96   1
A-2   1     0     0     0     0     0     0     ND
9.2 (a)–(c); 9.3 (a)–(c); 9.4 (a)–(c)
Figure 9: Results of Evaluation. (a) Ground Truth Image; (b) Results of Haar face detection; (c) Results of skin color-based face detection.

A-1: Haar face detector; A-2: Skin color-based face detector
OCA: Object Count Accuracy; PBR: Pixel-Based Recall; PBP: Pixel-Based Precision; ATR: Area Thresholded Recall; ATP: Area Thresholded Precision; ABR: Area-Based Recall; ABP: Area-Based Precision; AF: Average Fragmentation

OVERLAP_MIN was kept at 40% for all the ATR/ATP measurements.