
Reliable Object Recognition using SIFT Features

Florin Alexandru Pavel #1, Zhiyong Wang #*2, David Dagan Feng #*3
# School of Information Technologies, University of Sydney, NSW 2006, Australia
* Department of Electronic and Information Engineering, Hong Kong Polytechnic University, Hong Kong, China
alexandruyo@yahoo.com, zhiyong@it.usyd.edu.au, feng@it.usyd.edu.au

MMSP’09, October 5-7, 2009, Rio de Janeiro, Brazil.
978-1-4244-4464-9/09/$25.00 ©2009 IEEE.

Abstract— SIFT (Scale Invariant Feature Transform) features have been one of the most efficient descriptors for object recognition. However, the excessive number of key points and their high dimensionality have limited its capacity in object recognition. In this paper we present a novel method based on SIFT features for reliable object recognition. First, a matching tree is constructed to eliminate non-essential key points. Then, in order to achieve viewpoint independence, a 3D model is constructed for each object in the filtered SIFT feature space. Experimental results on both the Caltech 101 and COIL-100 datasets indicate the effectiveness of our proposed algorithm.

I. INTRODUCTION

SIFT [1] is a local feature detection and description technique. Local image features extracted using this technique are invariant to rotation and scale changes, and are robust to other transformations that can appear in an image as well as to minor viewpoint changes. SIFT has proven to be one of the most reliable techniques for extracting invariant features from images [5], and a number of object recognition systems have adopted it with great success [7, 8]. A number of improvements to the technique have also been proposed. Most of the research on enhancing the effectiveness of SIFT [2, 3, 4, 6] has revolved around better acquisition and description of low-level image features on a per-image basis. Other research has combined low-level image features with high-level information in order to achieve better performance [9].

Local features obtained using SIFT are not always useful for characterizing an object for recognition purposes. In real-world images, features may be generated by background responses or by out-of-focus objects; if included in the object model, such features increase computational time and degrade recognition performance. Therefore, there is a stringent need to identify and eliminate features that can negatively influence the performance of the recognition task.

In this paper we propose a new technique to reduce the number of key points so as to improve matching efficiency and recognition performance. In addition, in order to further achieve viewpoint independence for reliable recognition, 3D object models are constructed to accurately capture spatial relationships between features. The filtering step identifies object features that present a higher level of reliability and consistency throughout the training dataset, and also significantly reduces the overall object feature space, which decreases the size of the problem and makes the system more computationally efficient. Capturing strong spatial relationships between features extracted from different images and constructing a 3D object model improves recognition precision by creating the possibility to differentiate between a false positive match and a partial occlusion problem. The measurement of the density of a matching region of features, together with viewpoint approximation, gives, with a certain degree of confidence, a solution to the problem of partial occlusion vs. false positives. Experimental results show that, using the technique described in this paper, an increase in object recognition precision as well as a decrease in processing time can be achieved.

II. SIFT FEATURE FILTERING

A. SIFT (Scale Invariant Feature Transform)

SIFT (Scale Invariant Feature Transform) [1] is one of several computer vision algorithms aimed at extracting distinctive and invariant features from images. Features extracted using the SIFT algorithm are invariant to image scale and rotation, and partially robust to changing viewpoints and changes in illumination. SIFT applies a four-stage approach to extracting local features from an image. The first two stages focus on extracting feature location information and, more importantly, on ensuring repeatable and reliable feature locations. The last two stages focus on creating a feature description: first an orientation is assigned, making the feature invariant to rotation, and then the descriptor is created. A set of 16 histograms with 8 orientation bins each is used in calculating the feature descriptor, resulting in a feature vector 128 elements long.

The invariance and robustness of the features extracted using this algorithm make it an extremely good candidate for object recognition, and it is often used in detection/description schemes, achieving one of the best performance figures of all current feature extraction techniques [5]. However, due to the number of features extracted and the variations present in images, building a reliable object recognition system based on SIFT features is still a challenging task. The SIFT approach transforms an image or an image dataset into a large collection of local feature vectors, each of which possesses invariance to scale, translation, orientation and noise. Usually between 1000 and 2000 features are extracted from an average-size image, giving the possibility to recognize objects with substantial levels of occlusion [5, 1], but this also creates a set of problems over large datasets
of images, due to the number of features and false positive matches that can be generated.

B. Feature filtering by longest matching trees

Real-world images, in their vast majority, include content that is not related to, or has only a weak relationship with, the objects of interest present in the image. Feature extraction techniques are indiscriminative and have no knowledge of image content; therefore background responses are identified as features and consequently used in the creation of the model. For small datasets of images this is not a major problem, but for larger datasets, background responses can cause performance degradation as the dimension of the dataset grows.

Using SIFT, the number of features generated from an image is in the order of thousands. Furthermore, a large part of the features generated are due to background clutter, and not all of the features can be considered highly descriptive and consistent for the object contained in the image over a large image dataset. In order to address the problems generated by the amount and quality of extracted features, we propose a feature filtering technique called Longest Matching Trees, aimed at identifying features that have a higher degree of consistency and distinctiveness throughout the object space.

Figure 1. Visual representation of matching trees

This technique identifies features that are only part of the object space and can describe that object consistently and distinctly throughout the training set. The approach measures the propagation of a set of features throughout the object space in order to select features that have a certain degree of repeatability and consistency. In Figure 1 a visual representation of matching trees in an object training dataset is shown. We can observe that different sets of features are part of longer or shorter matching trees, the aim of this technique being to construct and select the matching trees with the longest length. The length of a tree is given by the number of object representations that the features contained in the tree are part of, which provides a very good measurement of the repeatability and descriptive power of a set of object features. This step eliminates less descriptive features as well as features generated by background clutter, which heavily impact recognition performance.

Experimental results show that only a small percentage of features can in fact be used to describe an object consistently and reliably. Considering that SIFT can generate a few thousand features per image, depending on image size and content, over a large image dataset the elimination of background clutter and unrepresentative features is of great concern, as it significantly affects both recognition precision and computational efficiency. The creation of the object feature space from a large image dataset with cluttered backgrounds poses its own challenges when aiming to construct an object feature space with minimal interference from object surroundings.

The proposed method requires analysing the entire image dataset and has the following steps:
 Feature extraction. For each image, SIFT features are extracted and each set of features is annotated with the image id, consequently building a raw feature space.
 Matching tree construction for every feature. The matching tree for a feature is constructed from the raw feature space, extracted in the previous step, by performing a top-down and then a bottom-up similarity matching. Each new feature set from a new object representation is matched against the features currently contained in the tree, and features that match are introduced into the tree. A bi-directional approach is used so that all features that share the same level of similarity are included in the tree with the smallest number of iterations.
 Filtering. The longest matching trees are selected, where the length of a tree is given by the number of object representations that the features in the tree are part of, and the resulting collection of features is then used in the creation of the object model.

III. OBJECT MODEL CONSTRUCTION

Constructing a 3D model of an object presents its very own set of advantages, given by capturing strong spatial relations between features and, more importantly, spatial relations between features extracted from different object views. A 3D object model enables the creation of feature clusters, or feature regions, that were captured from different object views at different scales, noise levels, orientations and possible deformations. We created such a model in order to capture strong spatial relations between features extracted from different representations of the object. Such a construct requires a relative object-to-object scale, orientation and position estimate in order for extracted features to be located as close to their actual position on the real object as possible. Features are accurately mapped to the object model, which further improves the differentiation of partially occluded vs. false positive matches by calculating the density of a matching region of features when computing the similarity between the model and a new image. This is of great importance because only measuring the number of matching features between the object model and a new image does not allow differentiating between a partial occlusion problem and a false positive
match. Part of the novelty of the technique is given by not only being discriminative in selecting features but actually using the viewpoint to achieve viewpoint independence.

A. Pre-processing

The creation of the object model requires the following pre-processing steps:
 Single scale approximation. The creation of a three-dimensional object model requires a scale approximation between object representations in the training set in order to accurately map object features onto the object model. In our approach, scale is computed relative to one object representation in the training set, which is selected and given a scale value of 1. The object representation that gives the overall scale of the object model is chosen using the information provided by the filtering step. In order to approximate the scale between two object representations, we compute the Euclidean distance between all the matching pairs of features, generating a high enough number of values on which outlier detection can be performed successfully. The technique consists in calculating the ratio between the distances of every two features from the same image that match another two features from another image. Outlier detection is performed on the values obtained, and the relative scale between two object representations is extracted as a percentage.
 Orientation approximation. In order to accurately create a 3D model, all object representations have to possess the same orientation. This is particularly important to ensure that features describing the same region of the real object also describe a similarly positioned region on the object model. The relative orientation between two object representations is calculated by approximating the angle offset between two segments that are described by matching pairs of features. We divide the space for each object representation into 4 regions (0, π/2, π and 3π/2) and, using the centroids of the two feature sets and the matching feature locations, we calculate a third point. The 3 points calculated describe a right triangle from which an angle can be extracted. The angle offset between every matching pair of features from the two feature sets is calculated and outlier detection is performed. The average of the remaining values describes the relative rotation between the two feature sets, and following this approach we can compute the relative angle between two object representations.
 Position approximation. Position approximation on the X and Y axes is calculated between two object representations by calculating the centroids for each one and then calculating the offset between the centroids. Following the iterative approach used in scale and orientation approximation, we ensure that the most similar representations and their feature sets are used in approximating the position. Position on the Z axis is calculated by first computing the bounds, height and width of the object. This is possible because of the filtering step, which ensures, with a certain degree of confidence, that features are part of the object. First, features on the boundary are identified, which are used in the calculation of the polygonal shape of the object. Having an object shape allows the estimation of the object's dimensions. Using the angle information given by the dataset and the object's dimensions, a relative rotation on the Z axis is computed, considering that the object space has spherical properties.

In order to propagate the scale, orientation and position to all image representations, and to have a certain confidence in the values extracted, an iterative approach is used where the decision to approximate scale, orientation and position between two images is given by computing a similarity distance between every two images of the training set. The similarity distance between two images is extracted from the data provided by the feature filtering technique, as the information contained in the longest matching trees of features is the best indicator of how many features two images share.

B. Object model

By computing scale, orientation and position information for every object representation, the feature locations on the object model can be calculated to match their locations on the real object. The model construction allows positioning features based on their positions on the real object, therefore creating strong spatial relationships between features. In real-world images an object's viewpoint has an important role, due to the fact that the object description can change completely if the viewpoint change is large enough.

Figure 2. Object model representation – dataset images

In Figure 2, visual representations of two object models from the dataset used to validate the system are presented, as well as part of the training dataset with a viewpoint change of 90 degrees; red dots represent the locations of features. The resulting model is a spheroid used to preserve feature locations, consequently preserving the viewpoint.

IV. OBJECT RECOGNITION

For matching between the object model and a new object representation, the Best-Bin-First (BBF) [1] algorithm is used, which is a more computationally efficient approach to the nearest neighbor problem. Spatial relationships between features, captured in the object model, are used to further improve the reliability of this task by first eliminating false positive matches based on viewpoint and then approximating
the density of a matching region of features on the object model. Accurately positioning the features of an object in a 3D space can improve recognition performance, considering that most deformations of objects are plastic in nature and that properties such as proximity are kept.

A. Viewpoint false positives filtering

The 3D model allows for an 'a posteriori' feature filtering based on viewpoint when matching features between the model and a new object representation. This is based on the fact that no more than one hemisphere of the object model can be visible from a single viewpoint; therefore all matching features have to be included in one hemisphere. This viewpoint feature filtering technique approximates the viewpoint of a new object by approximating the centroid of the matching set of features in the model space and verifying that the matching features are part of the described hemisphere.

The elimination of false positive matching features is of great concern in order to reliably perform the task of object recognition. The false positive feature elimination proposed in this paper is part of the recognition process and is based on the assumption that object features are viewpoint dependent. As the description of the object can change totally with the viewpoint, this has proven to be a valid assumption, and experimental results have validated it.

B. Deformable patches of features

Positioning features on a 3D model with a high degree of accuracy creates the possibility of identifying patches on the object model generated by matching a new object representation to the object model. The density of the patch is used in the decision of whether the matching features are part of a partially occluded object or are false positives. This is done by computing the density of a patch with regard to the object model. The density is given by approximating the area of the polygon described by the locations of matching features on the object model and dividing it by the number of features. Lower densities of features in a region of the object model can be mainly attributed to false positives, whereas high density patches are most likely to represent a partially occluded object.

Figure 3. Partial occlusion problem. High density patch of matching features on the object model. Red dots represent the positions of matching features

In Figure 3 an example of a high density patch on the object model is shown, generated by a partially occluded object. The density of a region of matching features has proven, throughout many experiments, to be a very good solution for differentiating between a false positive set of features and a partial occlusion problem without eliminating any true positive results. Although deformations of an object in an image are common, especially in the case of more plastic objects like animals, they tend to be topological in nature, preserving proximity or locality.

V. EXPERIMENTS AND RESULTS

A. Experimental results on Caltech

Experimental results using Caltech 101 [11] have shown a reduction in the global feature space by approximately 40% to 90%, depending on image content, the amount of background clutter and the similarity distance between object views. Feature space filtering is of great concern, as matching between the features contained in the object model and the features extracted from a new object representation is computationally heavy. By identifying consistent features that are part of the actual object space, the precision of the recognition task is increased and, more importantly, it remains constant over large datasets.

Figure 4. Unfiltered and filtered features.
a) Unfiltered features extracted from the image (in blue).
b) Filtered features based on the longest matching tree method (in yellow)

In Figure 4, two representations of the same object are shown, where the left image depicts all the features extracted using SIFT and the right image depicts all the features that remain after the filtering step. We can observe that after the filtering method is applied the number of features is reduced considerably, and most of the features are concentrated around the object in focus (in this case a stop sign and a face).

In order for an object recognition system to be used in real-world applications, it has to provide a way to compensate for a wide variety of transformations that can occur in an image, but more importantly it has to be able to learn an object model from low quality data. From the point of view of the object recognition task, low quality data is represented by sets of images that present a wide array of transformations and where the object in focus does not represent the dominant region or part of the image. Learning an object model over this type of dataset is very challenging and significantly affects recognition precision, as features generated by background content account for a large number of false positive matches. The filtering technique proposed in this paper is aimed at addressing this problem by identifying features that are only part of the object space and have a certain degree of consistency. Processing time, one of the measures most influenced by a variety of factors, also decreases linearly with the decrease in the number of features used in the object recognition process.
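The patch-density decision described in Section IV.B can be sketched as follows. The paper does not specify how the polygon is constructed or what density threshold is used, so the convex hull of the matched locations and the `min_density` value below are our own illustrative choices, and the projection to 2D model-surface coordinates is assumed to have been done already:

```python
# Sketch of the Section IV.B patch-density test: the area of the polygon
# described by the matched feature locations is compared against the number
# of features; a sparse patch suggests false positives, a dense patch a
# partially occluded object. Hull construction and threshold are assumptions.

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    area = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

def is_partial_occlusion(matched_points, min_density=0.05):
    """Classify a matching patch: high feature density (features per unit
    area) -> probable partial occlusion; low density -> probable false
    positives. min_density is a hypothetical threshold."""
    area = polygon_area(convex_hull(matched_points))
    if area == 0:
        return True  # degenerate patch: all matched features coincide
    return len(matched_points) / area >= min_density
```

A tight 3x3 cluster of matches is classified as a plausible occluded object, while the same number of matches scattered across the model surface is flagged as false positives.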

Figure 5. Unfiltered vs. filtered features

In order to better examine the performance of the feature filtering technique (longest matching trees), we chose not to average the results, as this would eliminate spikes in the numbers of unfiltered and filtered features, which are of great importance when analyzing the results. In Figure 5 the performance results of the filtering technique over 51 and 21 images from the Caltech 101 image dataset are shown. Because most of the images have an average size and present a high amount of background clutter, the number of raw, or unfiltered, features has a very high variance from image to image. This is not an indicator of how many features are extracted from an image for an object, but a clear indicator of the influence of background clutter. The number of filtered features presents less variance, as the number of features of an object per image tends to remain constant, this being dependent on object scale and the viewpoint from which it was captured.

On the two object categories in Figure 5 we achieved a reduction in the object feature space of approximately 80%, removing most of the features generated by background clutter. Identifying only the reliable features of an object can significantly increase the computational efficiency of the system and make the object recognition task more reliable.

B. Experimental results on COIL-100

The COIL-100 dataset [10] consists of 7200 images of 100 objects which have been placed perpendicularly in front of the camera, resulting in a total of 72 representations of each object with a 5 degree variation between them. In Figure 6 a representative image for each object in the dataset is shown. The dataset presents a diverse range of objects that also have a wide range of textures, shapes and colors.

Figure 6. COIL dataset – a representative image from each of the 100 object categories

For every set of images representing an object, 25% are chosen as training data and 75% as testing data. An object having 72 representations thus results in 18 object representations being used for training and 54 used for testing.

In Table 1, recognition results on the COIL-100 dataset are shown. We achieve a precision of 99% with a recall of 90%, resulting in an F-measure of 94%. These values show that the object recognition system proposed in this paper can be reliably and successfully used for the task of object recognition over large datasets.

Table 1. Recognition accuracy

Method              Precision
Snow [13]           92%
G-SVM [13]          97%
KL-SVM [13]         98%
Proposed approach   99%

In Figure 7 the performance results of the filtering technique (longest matching trees) are shown, which resulted in a reduction of approximately 40% of the object feature space, significantly increasing the computational efficiency of the system. The filtering technique, apart from allowing an accurate construction of the object model by providing the similarity distance between object representations, also improved the performance of the proposed system, as only features that have a higher degree of repeatability and consistency are included in the object model.

Figure 7. Unfiltered vs. filtered number of features – COIL-100 dataset
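The reported F-measure on COIL-100 follows from the standard harmonic mean of precision and recall; this small sketch just reproduces the arithmetic behind the reported figures (precision 99%, recall 90% giving roughly 94%):

```python
# F-measure as the harmonic mean of precision and recall (balanced F1).
def f_measure(precision, recall):
    """Return 2PR / (P + R); 0.0 when both inputs are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The paper's COIL-100 figures: P = 0.99, R = 0.90.
print(round(f_measure(0.99, 0.90), 3))  # → 0.943
```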


VI. DISCUSSIONS

The analysis of the pairs of objects from different categories that contributed most to the drop in the recall value resulted in the identification of interesting characteristics of the model. In Figure 8 a comparison between two object categories is shown, these categories registering between them almost 60% false positive identifications. Because the problem of partial occlusion in the model is addressed as an identification of matching patches of features, this results in cataloguing images from these categories as a partial occlusion problem.

Figure 8. Comparison between two object categories that generated the most false positives

In Figure 9 a comparison between two objects from different categories is presented, and we can observe that partially occluding the objects completely eliminates the possibility of distinguishing between them, even from the perspective of human recognition capabilities. The generation of false positives on these object categories does not necessarily affect future performance measurements on other datasets, as it indicates good generalization capabilities. This perspective has been included in future work, as it indicates the possibility of learning an object model from two or more categories and creating an inheritance tree of sub-models to distinguish between objects that share a set of features but are from different categories.

Figure 9. Comparison between two object categories that generated the most false positives (only matching regions are shown).

VII. CONCLUSIONS

We presented a novel approach for reliable object recognition, filtering local features to remove irrelevant features and reduce computational complexity, and utilizing 3D models created from the filtered SIFT feature space to achieve viewpoint independence. Feature filtering allows for the detection of features that can reliably describe an object throughout the overall feature space, and can eliminate much of the background clutter. Features generated by background clutter, or low quality features, can generate false positives over large datasets and reduce the certainty of a match. Experimental results show that the proposed filtering strategy is very effective in identifying the reliable features of an object, further improving both the performance of the object recognition system and its computational efficiency. Constructing a 3D space for the object model, onto which features are accurately mapped, introduces the possibility of creating a region-based approximation of a matching set of features between the model and a new image by approximating the density of those features on the model's surface. This also allows the filtering of matches as an outlier detection problem, done by approximating the viewpoint and building on the fact that matching features must be included in the hemisphere described by that particular viewpoint. The experimental results on two widely used datasets have shown that the proposed object recognition method can improve both recognition precision and computational efficiency by focusing the recognition process on relevant object features and using feature location, in a 3D space, as a decisional factor.

ACKNOWLEDGEMENT

The work presented in this paper is partially supported by ARC and Hong Kong Polytechnic University grants.

REFERENCES

[1] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, 2004.
[2] A. Suga, K. Fukuda, T. Takiguchi, and Y. Ariki, “Object recognition and segmentation using SIFT and graph cuts,” in 19th International Conference on Pattern Recognition, Dec. 2008, pp. 1–4.
[3] F. Liu and M. Gleicher, “Region enhanced scale-invariant saliency detection,” in IEEE International Conference on Multimedia and Expo, 2006, pp. 1477–1480.
[4] Y. Ke and R. Sukthankar, “PCA-SIFT: a more distinctive representation for local image descriptors,” in 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2004, pp. 506–513.
[5] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1615–1630, October 2005.
[6] Y. Cui, N. Hasler, T. Thormählen, and H.-P. Seidel, “Scale invariant feature transform with irregular orientation histogram binning,” in ICIAR ’09: Proceedings of the International Conference on Image Analysis and Recognition. Springer, 2009.
[7] Q. Fan, K. Barnard, A. Amir, A. Efrat, and M. Lin, “Matching slides to presentation videos using SIFT and scene background matching,” in MIR ’06: Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval. New York, NY, USA: ACM, 2006, pp. 239–248.
[8] J. Chen and Y.-S. Moon, “Using SIFT features in palmprint authentication,” in 19th International Conference on Pattern Recognition, Dec. 2008, pp. 1–4.
[9] H.-H. Jeon, A. Basso, and P. Driessen, “A global correspondence for scale invariant matching using mutual information and the graph search,” in IEEE International Conference on Multimedia and Expo, July 2006, pp. 1745–1748.
[10] S. Nene, S. K. Nayar, and H. Murase, “Columbia Object Image Library (COIL-100),” Tech. Rep., 1996.
[11] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories,” in CVPR ’04: Proceedings of the Computer Vision and Pattern Recognition Workshop, 2004, p. 178.
[12] libsift – scale-invariant feature transform implementation. [Online]. Available: http://user.cs.tu-berlin.de/~nowozin/libsift/
[13] T. Pham and A. Smeulders, “Sparse representation for coarse and fine object recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, pp. 555–567, April 2006.
