WSEAS Transactions on Systems
LIVIU VLADUTU
School of Computing
Dublin City University
Glasnevin, Dublin 9, DCU
IRELAND
lvladutu@computing.dcu.ie http://www.computing.dcu.ie
Abstract: The recognition of human activities from video sequences is currently one of the most active areas of research because of its many applications in video surveillance, multimedia communications, medical diagnosis, forensic research and sign language recognition. This paper describes a new method designed to precisely identify human gestures for Sign Language recognition. The system is to be developed and implemented on a standard personal computer (PC) connected to a colour video camera. The present paper tackles the problem of shape recognition for deformable objects, like human hands, using modern classification techniques derived from artificial intelligence.
The ART coefficients F_nm of an image f(ρ, θ) are defined as:

F_nm = ∫∫ V*_nm(ρ, θ) f(ρ, θ) ρ dρ dθ (over the unit disk)

where (ρ, θ) are polar coordinates and V_nm(ρ, θ) is the ART basis function of order n and m. The basis functions are separable along the angular and radial directions, and are defined as follows:

V_nm(ρ, θ) = (1 / 2π) exp(jmθ) R_n(ρ)    (1)

R_n(ρ) = 1, if n = 0
R_n(ρ) = 2 cos(πnρ), if n ≠ 0    (2)

Figure 1: Example of an image from a real signer (above) and the equivalent one from a Poser-generated video (below)
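As a sketch of how equations (1)-(2) can be evaluated numerically, the snippet below samples the unit disk on the pixel grid and accumulates the coefficients with a crude Riemann sum. This is an illustration only, not the MPEG-7 reference implementation; the function names, the grid sampling, and the default orders are my own choices (the MPEG-7 standard typically uses n < 3 and m < 12).

```python
import numpy as np

def art_basis(n, m, rho, theta):
    """ART basis V_nm(rho, theta) = (1/2pi) * exp(j*m*theta) * R_n(rho),
    with R_n(rho) = 1 for n == 0 and 2*cos(pi*n*rho) otherwise (eqs. 1-2)."""
    radial = np.ones_like(rho) if n == 0 else 2.0 * np.cos(np.pi * n * rho)
    return (1.0 / (2.0 * np.pi)) * np.exp(1j * m * theta) * radial

def art_coefficients(image, n_max=3, m_max=12):
    """Approximate the ART coefficients F_nm of a 2D intensity image by
    sampling the unit disk on the pixel grid (Riemann-sum approximation)."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # map pixel coordinates onto the unit disk centred on the image
    x = (xs - (w - 1) / 2.0) / ((w - 1) / 2.0)
    y = (ys - (h - 1) / 2.0) / ((h - 1) / 2.0)
    rho = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)
    inside = rho <= 1.0
    coeffs = {}
    for n in range(n_max):
        for m in range(m_max):
            basis = art_basis(n, m, rho, theta)
            # F_nm = sum of conj(V_nm) * f over the unit disk samples
            coeffs[(n, m)] = np.sum(np.conj(basis[inside]) * image[inside])
    return coeffs
```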
Two MPEG-7 visual descriptors are used in the experiments. The Colour Layout descriptor (CLD) provides information about the spatial colour distribution within images. After an image is divided into 64 blocks, this descriptor is extracted from each of the blocks based on the Discrete Cosine Transform. We can evaluate the distance between two CLD vectors, using the luminance and the 2 chrominance channels, with the formula:

S_CLD(Q, I) = sqrt( Σ_i w_Y,i (Y_Q,i − Y_I,i)² )
            + sqrt( Σ_i w_Cb,i (Cb_Q,i − Cb_I,i)² )
            + sqrt( Σ_i w_Cr,i (Cr_Q,i − Cr_I,i)² )

where w_i represents the weight associated with coefficient i. There are 12 coefficients extracted for the colour layout descriptor (6 for Y, and 3 each for Cb and Cr). A more detailed description can be found in the MPEG-7 ISO schema files, see [7] and: http://standards.iso.org/ittf/PubliclyAvailableStandards/MPEG-7 schema files/.

The region-based shape descriptor belongs to the broad class of shape analysis techniques based on moments. It uses a complex 2D Angular Radial Transformation (ART), defined on a unit disk in polar coordinates. The ART coefficients are computed from the basis functions given in equations (1) and (2).

The default region-based shape descriptor has 140 bits. It uses 35 coefficients (n=10, m=10) quantized to 4 bits per coefficient. I have used all 35 resulting coefficients in the object description. The region-based shape descriptor expresses the pixel distribution within a 2D object region; it can describe complex objects consisting of multiple disconnected regions as well as simple objects with or without holes (Figure 2). Some important features of this descriptor are:

● it gives a compact and efficient way of describing the properties of multiple disjoint regions simultaneously;

● it can cope with errors in segmentation where an object is split into disconnected sub-regions, provided that the information about which sub-regions contribute to the object is available and used during the descriptor extraction;

● the descriptor is robust to segmentation noise.

Also, the classification with this descriptor outperforms the results obtained with other descriptors, like SIFT, described in [30] and [32]. The information from the CLD and RS (Region Shape) descriptors is stored in XML (Extensible Markup Language), a metadata interchange format, which is further processed by the classifier.
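The CLD distance formula above can be sketched directly in code. This is a minimal illustration; the dictionary layout and the explicit weight arguments are assumptions of mine, not part of any MPEG-7 API.

```python
import numpy as np

def cld_distance(q, i, wy, wcb, wcr):
    """Distance between two Colour Layout descriptors.
    q and i are dicts with 'Y' (6 coefficients) and 'Cb'/'Cr' (3 each);
    wy, wcb, wcr are the per-coefficient weights from the formula above."""
    dy  = np.sqrt(np.sum(wy  * (np.asarray(q['Y'])  - np.asarray(i['Y']))  ** 2))
    dcb = np.sqrt(np.sum(wcb * (np.asarray(q['Cb']) - np.asarray(i['Cb'])) ** 2))
    dcr = np.sqrt(np.sum(wcr * (np.asarray(q['Cr']) - np.asarray(i['Cr'])) ** 2))
    # the three channel distances are summed, as in S_CLD(Q, I)
    return dy + dcb + dcr
```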
The feature space for shape retrieval and classification consisted of up to 47 coefficients (12 from the CLD and 35 from the ART descriptors), and these were passed on to the classifier after a selection scheme based on Fuzzy C-Means.

In this selection process, the images from the gesture's whole video stream are represented in the Principal Components (PC) space [3]. The representation in PC-space has previously revealed to us very interesting aspects of the motion's dynamics [10][16].

Figure 3: A representation of a video-gesture 'a' in the 3-dimensional PCA-space

In Figure 3, there are 3 clusters and their centers (represented as red circles), and we selected the frames (RFrames) from the middle cluster (the one on the left-hand side of the picture). The manifold is a closed one, since the signer's hands start and end in the same position, as emphasized in Figure 4.

A simple algorithm was chosen (see also [8]): for example, if a logical video segment is v10–100 and the RFrame set from the whole video is {v1, v40, v75, v120, ...}, then {v40, v75} can be used to visually represent the segment.

In order to give a clear understanding of how a gesture's RFrames are selected, Figure 4 shows a relatively smooth transition from the neutral phase (where the signer keeps the hands down) to the active phase (the region in green in the middle) and back to the neutral position.

Figure 4: The figure shows how the images in a video stream can be clustered into 3 simple regions; therefore Fuzzy C-Means is applicable

Since we are not dealing with crisp transitions (i.e. an image may belong to 2 subsets), I considered it mandatory to use Fuzzy Clustering. The basics of the algorithm, called Fuzzy C-Means (FCM), were introduced by Dunn [4] and improved by Bezdek [5], and it is a classic fuzzy clustering algorithm. The cluster centers are updated as:

c_i = Σ_{j=1..N} (u_ij)^m x_j / Σ_{j=1..N} (u_ij)^m    (5)

In the equation above, m ∈ [1, ∞) is the fuzzifier (m = 2 in this case). Therefore, in the end the algorithm looks like the pseudo-code description depicted in Table 1.

Table 1: Pseudo-code of the FCM algorithm

    Fix the number of clusters C;
    Fix the fuzzifier m;
    Do {
        Update memberships using equation (4)
        Update centers using equation (5)
    } Until (centers stabilize)
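The loop in Table 1 can be sketched in Python as below. This is an illustrative implementation assuming Euclidean distances; since equation (4) is not reproduced in this section, the membership update in the comment is the standard FCM formula, u_ij = 1 / Σ_k (d_ij / d_kj)^(2/(m−1)).

```python
import numpy as np

def fuzzy_c_means(X, C=3, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """FCM: alternate the membership update (equation 4) and the centre
    update (equation 5) until the centres stabilize.
    X is an (N, d) array of frame feature vectors (e.g. PCA coordinates)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # random initial memberships, each column normalised over the C clusters
    U = rng.random((C, N))
    U /= U.sum(axis=0, keepdims=True)
    centers = None
    for _ in range(max_iter):
        Um = U ** m
        # equation (5): c_i = sum_j u_ij^m x_j / sum_j u_ij^m
        new_centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        if centers is not None and np.max(np.abs(new_centers - centers)) < tol:
            centers = new_centers
            break
        centers = new_centers
        # standard FCM membership update: u_ij = 1 / sum_k (d_ij/d_kj)^(2/(m-1))
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2)
        d = np.fmax(d, 1e-12)  # avoid division by zero at a centre
        U = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0)),
                         axis=1)
    return centers, U
```

With C = 3, the cluster with the highest memberships in the active phase supplies the RFrames.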
2.4 The proposed classifier

The number of training examples is denoted by l. In our case l was always 280 (10 frames for each of the 28 one-handed gestures of ISL). ξ is a vector of l variables, where each component ξ_i corresponds to a training example (x_i, y_i). x_i represents the feature vector, which is formed by either 35 coefficients (only the Region-shape coefficients) or 47 coefficients, corresponding to both the Region-shape (RS) and the CLD descriptors.

The idea can be expressed in a formal way: the goal is to

minimize (1/2) wᵀw + C Σ_{i=1..l} ξ_i    (9)

Besides the regular kernels given below, the Laplacian kernel was also considered:

● the Laplacian kernel: K(x, x_k) = exp{ −‖x − x_k‖ / σ² }

as expressed in an excellent new reference: see [35].
A fast Windows implementation (.dll's) of the extraction of the video descriptors was chosen ([15]) that can be included in our final real-time Sign-Language understanding system. Although there are many available implementations in several programming languages (like Matlab, C++, Java, Lisp and so on), I have used a Java version ([13]) of an implementation of the Support Vector Machine based on the optimization algorithm of SVMlight, as described in [14]. mySVM can be used for pattern recognition, regression and distribution estimation. In order to cope with the relatively small number of examples, a cross-validation (see [17]) with a factor of 25 was chosen.

Figure 5: Classification performance of SVM with some of the most common kernels (polynomial, dot, radial, Anova) on the RS1, RS1 & CLD1 and RS2 experiments
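The 25-fold cross-validation setup can be sketched as follows. mySVM itself is a Java tool, so a simple nearest-centroid rule stands in for the SVM here purely to keep the sketch self-contained and runnable; the fold logic, not the classifier, is the point of the example, and all names are my own.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split n example indices into k roughly equal, shuffled folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(X, y, k=25, seed=0):
    """k-fold cross-validation loop (k=25 as in the text); a nearest-centroid
    rule stands in for the SVM. Returns the mean classification error."""
    folds = kfold_indices(len(X), k, seed)
    errors = []
    for f in range(k):
        test = folds[f]
        train = np.concatenate([folds[g] for g in range(k) if g != f])
        # "fit": one centroid per class on the training portion
        classes = np.unique(y[train])
        centroids = np.array([X[train][y[train] == c].mean(axis=0)
                              for c in classes])
        # "predict": the closest centroid wins
        d = np.linalg.norm(X[test][:, None, :] - centroids[None, :, :], axis=2)
        pred = classes[np.argmin(d, axis=1)]
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))
```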
Several types of kernels were tested (neural, polynomial, Anova, Epanechnikov, Gaussian-combination, multiquadric, and radial-basis-function based) but our experience [9][24] was once more confirmed: the SVM classifiers based on polynomial kernels are (most probably) the best for classification problems. Therefore, all the results expressed in Table 2 correspond to polynomial-kernel supervised learning (the kernel with the smallest overall classification error for the 3 experiments described in Table 2). The graph in Figure 5 shows the classification errors (on the ordinate) for 4 of the most commonly used kernels:
- Anova;
- polynomial;
- radial;
- dot.
The SVMs with the regular kernels above are expressed by the following equations:

● linear SVM: K(x, x_k) = x_kᵀ x

● the polynomial SVM of degree d: K(x, x_k) = (a x_kᵀ x + 1)^d

● the RBF (Radial Basis-Function) SVM: K(x, x_k) = exp{ −‖x − x_k‖² / σ² }

3 Experimental results

The experience acquired in the group has shown that there are many factors that can influence the quality of image understanding, like: the differences between the signers' clothing, between the lighting sources, between the skin of the humans, or motion blur. Therefore, the first step was to compare the classification performance of our algorithm for two classes of input data: only 35 coefficients (of the RS descriptor), or 47 coefficients (of the RS and CLD descriptors). The performance vector (overall) resulting from the confusion matrix is presented in Table 2, and it shows that by adding the 12 extra coefficients corresponding to the CLD, the classification error is only slightly increased (by 0.35%). That will allow us to gather more information in our training and testing database (more real and virtual signers) and to quantify the differences enumerated above in only a few coefficients at virtually no expense. The second step of our experiment used the same number of images but, for each letter/sign expressed, we used ten static images and ten images extracted from the Poser-generated video stream corresponding to the same sign, as in Figure 1. The performance is given in the third line of Table 2.
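The kernel equations listed in the previous section translate directly into code; the numpy sketch below mirrors them one for one (the parameter defaults a, d and σ are illustrative choices, not values from the experiments).

```python
import numpy as np

def linear_kernel(x, xk):
    # linear SVM: K(x, x_k) = x_k^T x
    return xk @ x

def polynomial_kernel(x, xk, a=1.0, d=3):
    # polynomial SVM of degree d: K(x, x_k) = (a * x_k^T x + 1)^d
    return (a * (xk @ x) + 1.0) ** d

def rbf_kernel(x, xk, sigma=1.0):
    # RBF SVM: K(x, x_k) = exp(-||x - x_k||^2 / sigma^2)
    return np.exp(-np.sum((x - xk) ** 2) / sigma ** 2)

def laplacian_kernel(x, xk, sigma=1.0):
    # Laplacian kernel, following the form in section 2.4:
    # K(x, x_k) = exp(-||x - x_k|| / sigma^2)
    return np.exp(-np.linalg.norm(x - xk) / sigma ** 2)
```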
Implementations were built around Matlab (MathWorks ©), a powerful matrix-based software package with many toolboxes, which was used as a 'wrapper'.

Table 2: Generalization performance of the mySVM classifier with polynomial kernels

Current experiment | Performance vector | Description of the experiment
RS1                | 6.03%              | Region-shape coefficients only
RS1 & CLD1         | 6.39%              | Region-shape and CLD coefficients, real images
RS2                | 7.64%              | Region-shape coefficients for real and virtual signers

Acknowledgements:
The research was supported by the Science Foundation of Ireland (SFI), to whom I am deeply thankful, but the author is also thankful to:
- Lecturer Alistair Sutherland, Dr. George Awad and Dr. Junwei (Jeff) Han for the SST (skin segmentation and tracking) contribution;
- Dr. Sara Morrisey and Tommy Coogan (all from Dublin City University) for the database of synthetic video streams generated in Poser and, respectively, for the camera-collected video streams.
4 Conclusion

The results explained in the previous sections show that a limited number of gestures (executed by a human or a robot) can be learned and understood by combining the shape recognition detailed in the current work (hand shapes playing the role of letters in an alphabet) with an understanding of the gesture dynamics represented in the feature space (like PCA). In this latter approach, the images are represented in the PCA-space and the gestures are represented on a nonlinear manifold. The fast procedure exposed here, which takes approximately 10 milliseconds for classification (on average) and approximately 1 second for the VDE feature extraction on a PC (with a dual 2.4 GHz processor), is considered to be a good choice for other researchers in the related fields enumerated in the Introduction section.

Future envisaged work involves:
- background modelling at the beginning of the video-frame analysis [29];
- eventually tackling occlusion problems with the help of a-priori defined 3D models of the involved body parts (head and hands), using some of the available software environments, either:
● ITK/VTK (http://www.vtk.org/, [31] and http://www.itk.org/);
● the Computational Geometry Algorithms Library (http://www.cgal.org/) or related, see also [33], [36].

References:
[1] J. Han, G. Awad, A. Sutherland and H. Wu, Automatic Skin Segmentation for Gesture Recognition Combining Region and Support Vector Machine Active Learning, Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition, 2006, pp. 237-242.
[2] G.M. Awad, A Framework for Sign-Language Recognition using Support Vector Machines and Active Learning for Skin Segmentation and Boosted Temporal Sub-units, Dublin City University, Ireland, 2007 (PhD Thesis).
[3] I. T. Jolliffe, Principal Component Analysis, Springer-Verlag, 2002.
[4] J.C. Dunn, A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-separated Clusters, Cybernetics and Systems: An International Journal, Vol. 3, Issue 3, 1973, pp. 32-57.
[5] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[6] J. Kovac, P. Peer and F. Solina, Human Skin Colour Clustering for Face Detection, Proceedings of EUROCON 2003, Turku, Finland, pp. 144-148.
[7] Shih-Fu Chang, T. Sikora and A. Puri, Overview of the MPEG-7 Standard, IEEE Transactions on Circuits and Systems for Video Technology, Volume 11, No. 6, June 2001, pp. 688-695.
[8] A. Joshi, S. Auephanwiriyakul and R. Krishnapuram, On Fuzzy Clustering and Content Based Access to Video Databases, Proceedings of the Workshop on Research Issues in Data Engineering, 1998, pp. 42-47.
[9] S. Papadimitriou, S. Mavroudi, L. Vladutu and A. Bezerianos, Ischemia Detection with a Self-Organizing Map Supplemented by Supervised Learning, IEEE Trans. on Neural Networks, Volume 12, Issue 3, pp. 503-515.
[10] W. Hai and A. Sutherland, Irish Sign Language Recognition using Hierarchical PCA, Irish Machine Vision and Image Processing Conference (IMVIP 2001), National University of Ireland, Maynooth, 5-7 September 2001.
[11] V. Vapnik, The Nature of Statistical Learning Theory, 2nd Edition, Springer Verlag, 2000.
[12] The National Association for Deaf People, Ireland, The Standard Dictionary of Irish Sign Language, (CD/DVD), by microBooks Ltd., 2006.
[13] Open Source Data Mining with the Java software, RapidMiner, http://rapid-i.com/content/blogcategory/38/69
[14] T. Joachims, Making Large-Scale SVM Learning Practical, Advances in Kernel Methods, chapter 11, MIT Press, 1999.
[15] G. Tolias, Visual Descriptors Applications, Semantic Multimedia Analysis Group, NTUA, Athens, Greece, http://image.ntua.gr/smag/tools/vde/
[16] L. Vladutu, A. Sutherland, Gesture analysis of deaf people language using nonlinear manifolds analysis, SFI Conference, Dublin, Ireland, July 2007, presented by A. Sutherland.
[17] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, Proceedings of the 14th International Joint Conference on Artificial Intelligence, 2(12), Morgan Kaufmann, San Mateo, 1995, pp. 1137-1143.
[18] V. N. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998.
[19] C. Cortes and V. Vapnik, Support Vector Networks, Machine Learning, Volume 20, Number 3, September 1995, pp. 273-297.
[20] X. L. Xie, G. Beni, A Validity Measure for Fuzzy Clustering, IEEE Trans. Pattern Analysis and Machine Intelligence, Volume 13, Number 8, 1991, pp. 841-847.
[21] W. T. Freeman and C. Weisman, Television Control by Hand Gestures, International Workshop on Automatic Face and Gesture Recognition, IEEE Computer Society, Zurich, Switzerland, June 1995, pp. 179-183.
[22] C. Graetzel, S. Grange, T. Fong and C. Baur, A non-contact mouse for Surgeon-Computer Interaction, Technology and Health Care, IOS Press, Volume 12, Number 3, 2004, pp. 245-257.
[23] J. Carreira and P. Peixoto, A Vision Based Interface for Local Collaborative Music Synthesis, Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition, FGR 2006, pp. 591-596.
[24] L. Vladutu, Computational Intelligence Methods on Biomedical Signal Analysis, VDM-Verlag Publishing House, 2009.
[25] J. Krumm, S. Shafer and A. Wilson, How a Smart Environment Can Use Perception, Workshop on Sensing and Perception (part of ACM UbiComp 2001), September 2001.
[26] Y. Wu and T. Huang, Vision-Based Gesture Recognition: A Review, Lecture Notes in Computer Science, Springer Verlag, Volume 1739, 1999, pp. 103-115.
[27] A. Corradini and H.-M. Gross, Camera-Based Gesture Recognition for Robot Control, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), Como, Italy, 2000, pp. IV 133-138.
[28] T. Starner, J. Auxier, D. Ashbrook and M. Gandy, The Gesture Pendant: A Self-Illuminating, Wearable, Infrared Computer Vision System for Home Automation Control and Medical Monitoring, Proceedings of the 4th IEEE International Symposium on Wearable Computers, ISWC 2000, pp. 87-94.
[29] D. Gutchess, M. Trajkovics, E. Cohen-Solal, D. Lyons and A.K. Jain, A background model initialization algorithm for video surveillance, Proceedings of the 8th IEEE International Conference on Computer Vision, ICCV 2001, Volume 1, July 2001, pp. 733-740.
[30] D. G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision, Volume 60, Number 2, November 2004, pp. 91-110.
[31] B. Preim, D. Bartz, Visualization in Medicine: Theory, Algorithms and Applications, The Morgan Kaufmann Series in Computer Graphics, July 2007.
[32] D. G. Lowe, Object Recognition from Scale-Invariant Features, IEEE 7th International Conference on Computer Vision (ICCV '99), Volume 2, 1999, pp. 1150-1156.
[33] A. Fabri, G.-J. Giezeman, L. Kettner, S. Schirra, On the Design of CGAL, a computational geometry algorithms library, Software - Practice and Experience, Volume 30, Issue 11, pp. 1167-1202.
[34] M. Bober, MPEG-7 Visual Shape Descriptors, IEEE Transactions on Circuits and Systems for Video Technology, Volume 11, Number 6, June 2001, pp. 716-719.
[35] S. Amiri, D. von Rosen, S. Zwanzig, The SVM Approach for Box-Jenkins Models, REVSTAT Statistical Journal, Volume 7, Number 1, April 2009, pp. 23-26.
[36] J. E. Goodman and J. O'Rourke, Handbook of Discrete and Computational Geometry, 2nd Edition, Chapman & Hall/CRC, 2004.