

Lipreading from Color Video

Greg I. Chiou and Jenq-Neng Hwang


Abstract—We have designed and implemented a lipreading system that recognizes isolated words using only color video of human lips (without acoustic data). The system performs video recognition using "snakes" to extract visual features in the geometric space, the Karhunen–Loève transform (KLT) to extract principal components in the color eigenspace, and hidden Markov models (HMM's) to recognize the combined visual feature sequences. With the visual information alone, we were able to achieve 94% accuracy for ten isolated words.

Index Terms—Active contour model, hidden Markov model, Karhunen–Loève transform, lipreading, snake, visual phoneme.

Manuscript received September 23, 1995; revised January 7, 1997. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. A. Murat Tekalp. The authors are with the Information Processing Laboratory, Department of Electrical Engineering, University of Washington, Seattle, WA 98195 USA (e-mail: hwang@ee.washington.edu). Publisher Item Identifier S 1057-7149(97)04721-0.

I. INTRODUCTION

Petajan was one of the first researchers to build a lipreading system to improve the performance of acoustic speech recognition [12]. Goldschen et al. [8] and Petajan et al. [12] both used oral-cavity features for visual speech recognition. Mase and Pentland [10] used optical-flow analysis to examine mouth motion rather than shape for lipreading. Chiou and Hwang [2], Stork et al. [16], and Yuhas et al. [19] used neural networks in their lipreading systems for image sequence recognition.

There are two major types of features useful for lipreading: contour-based and area-based. The active contour models [9], also known as "snakes," are a good example of contour-based features and have been applied to object contour finding in many image analysis problems [3], [5]. The Karhunen–Loève transform (KLT), a typical area-based method, has been successfully used for principal feature extraction in pattern recognition problems [10], [18]. Bregler's lipreading system [1], an early attempt to use both kinds of features, used the KLT to guide the snake search (the so-called active shape models [6]) on grayscale video for lipreading. Nevertheless, it has been reported that the variation of the gray levels around human lips is small [11]; therefore, contour-based processing on monochrome images around the lips cannot produce satisfactory contours. Chiou and Hwang [4] developed one of the first systems that took advantage of color information in independently deriving and combining both contour-based and area-based features for lipreading.

We used the KLT to extract features of mouth images in the eigenspace. The extracted features can then be combined with the snake radii from the snake algorithm for temporal sequence recognition by hidden Markov models (HMM's), which are the most effective tools for speech recognition [13], [15]. Without the eigenfeatures, snake radii alone are not enough to represent a viseme [7] (visual phoneme). We used the KLT to complement the contour-based features in representing visemes. A recent paper by Silsbee and Bovik [14] provides a wonderful survey of state-of-the-art lipreading research.

Our lipreading system performs color video recognition by integrating snakes, the KLT, and HMM's. As shown in Fig. 1, the snake algorithm and the KLT are used to extract contour-based and area-based visual features from every frame (image) in a video. The snake algorithm looks for contour features in the geometric space, while the KLT seeks principal components in the eigenspace. An HMM recognizer is used to train and recognize a sequence of the combined visual features. Our lipreading system is different from others in the following ways.

[Fig. 1. Color video recognition using snakes, KLT, and HMM's.]

• The snake algorithm looks for contour features by using the color information to systematically derive the external energy function of the snake.
• The snake search does not depend on the success of the KLT.
• Constraints are imposed on the snake model to keep the radial vectors uniformly spread over 360°, making them more effective.
• The KLT seeks principal components in the color eigenspace for the whole jaw region (not just the lips).

This correspondence is organized as follows. Section II discusses the modified snake algorithm, which extracts more consistent visual features from video frames. Section III describes our complete lipreading environment, which integrates snakes, the KLT, and HMM's for video recognition. Section IV describes simulation results from applying the lipreading system to the vocabulary of a voice-activated car audio system. Section V gives concluding remarks.

II. SNAKES—ACTIVE CONTOUR MODELS

[Fig. 2. Snake contour described with eight radial vectors.]

[Fig. 3. (a) Sample video frame. (b) Corresponding external energy profile. (c) Initial snake contour for the video frame. (d) Intermediate snake contour for the video frame. (e) Converged snake contour for the video frame.]
A snake is an open or closed elastic curve represented by a set of control points [9]. Finding the contour of distinct features (specified by the user a priori in an energy formulation) is done by deforming and moving the elastic curve gradually from an initial shape residing on the image toward the distinct features. This deformation process is guided by iteratively searching for a nearby local minimum of an energy function, which consists of an internal energy that imposes smoothness constraints on the snake curve and an external energy that indicates the degree of matching for the target features.
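To make the deformation concrete, the following is a minimal sketch (ours, not the authors' published code) of one greedy iteration in Python: each control point moves to whichever neighboring pixel minimizes a weighted sum of a smoothness (internal) term and a user-supplied external energy. The function external_energy and the weight alpha are illustrative assumptions.

    import numpy as np

    def internal_energy(pts, i, cand):
        # Smoothness term: penalize deviation of the candidate position
        # from the midpoint of its two neighbors on the closed curve.
        prev_pt, next_pt = pts[i - 1], pts[(i + 1) % len(pts)]
        return float(np.sum((cand - 0.5 * (prev_pt + next_pt)) ** 2))

    def snake_step(pts, external_energy, alpha=0.1):
        # One greedy pass: each control point moves to the 8-neighbor
        # (or stays put) that minimizes internal + external energy.
        moves = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
        for i in range(len(pts)):
            cands = [pts[i] + np.array(m) for m in moves]
            costs = [alpha * internal_energy(pts, i, c) + external_energy(*c)
                     for c in cands]
            pts[i] = cands[int(np.argmin(costs))]
        return pts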

A. Our Snake Model


We have modified the original snake algorithm by adding new forces that push and bend the closed snake curve "outward" like an inflating balloon [3], [5]. The modified snake algorithm is used to extract the lip contour on every video frame. Because the contour feature should be consistent and comparable among all video frames, we select the feature to be a snake contour described by m radial vectors (in our implementation, m = 8). The radial vectors are uniformly spread over 360°, and each of them originates from the centroid of the snake contour and points to a snake point (see Fig. 2). The shape of the contour is deformed by varying the lengths of the radial vectors, and the centroid of the contour moves during the deformation.

Now, not only do the initial snake points no longer have to be near the solution, but the new snake model is also less likely to be trapped by spurious local minima, because the combined effect of all the forces can help the snake curve escape a weak local minimum. The balloon forces push each snake point outward until it rests on the potential target contour (i.e., until the external energy stops the pushing forces) and try to keep the snake points (i.e., the heads of the radial vectors) uniformly spread over 360°. Although the forces help keep the snake points uniformly spread, further constraints are needed to guarantee that adjacent radial vectors stay exactly 45° apart.
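The radial model above can be sketched as follows (our illustration only; the growth step, stopping threshold, and spread constraint are invented parameters, and the centroid is held fixed here although the paper lets it move during deformation):

    import numpy as np

    def radial_snake(centroid, external_energy, m=8, r0=2.0,
                     step=1.0, stop=-0.5, max_iter=200, spread=5.0):
        # m radial vectors at fixed 360/m-degree angles from the centroid,
        # so the 45-degree spacing constraint holds by construction.
        angles = 2.0 * np.pi * np.arange(m) / m
        radii = np.full(m, r0)
        cx, cy = centroid
        for _ in range(max_iter):
            grew = False
            for i, a in enumerate(angles):
                x = int(round(cx + radii[i] * np.cos(a)))
                y = int(round(cy + radii[i] * np.sin(a)))
                # Balloon force: inflate until the tip sits on lip pixels,
                # i.e., until the external energy there is negative enough.
                if external_energy(x, y) > stop:
                    radii[i] += step
                    grew = True
            # Crude stand-in for the internal energy: keep every radius
            # close to the mean so no snake point "abandons" the others.
            radii = np.clip(radii, radii.mean() - spread, radii.mean() + spread)
            if not grew:
                break
        return radii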
B. External Energy—R/G Threshold

The formulation of the external energy that attracts a snake contour is crucial for a successful deployment of the snake algorithm. The idea of using color images instead of monochrome ones was motivated by Okada's method [11] for lip extraction. (The method has been tested on many oriental faces [11].) For every pixel in a color image, we simply divide the red pixel value by the green pixel value (if the latter is not zero). The goal is to generate a binary image by thresholding: a pixel is labeled a lip pixel if its red component is large enough (> θ1) and its red-to-green ratio lies in a proper range (between θ2 and θ3). This thresholding process avoids mislabeling dark pixels that are not on the lips as lip pixels. After the thresholded binary image, {I(x, y)}, is generated, each point (x, y) uses a block of neighboring binary pixels to define the external energy at that point, as follows:

    E_{ext}(x, y) = -\frac{1}{(2n+1)^2} \sum_{k_1=-n}^{n} \sum_{k_2=-n}^{n} I(x + k_1, y + k_2)    (1)

where n = 3 in our simulations.

Therefore, all snake points work together to "absorb" as much negative external energy [defined in (1)] as possible. At the same time, each snake point is constrained by the internal energy so that it does not "abandon" the others. After the snake algorithm finishes, the converged snake (i.e., the heads of the eight radial vectors) should lie right on top of the lips in every color image, and the eight snake radii define the contour feature of the lips. Fig. 3 shows the process of deforming and moving the elastic snake contour in an external energy profile associated with human lips.
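In code, the R/G thresholding and the box average of (1) might look like the following sketch (our paraphrase; the concrete threshold values stand in for θ1, θ2, and θ3, which the text does not specify):

    import numpy as np
    from scipy.ndimage import uniform_filter

    def external_energy_map(rgb, t_red=60.0, t_lo=1.2, t_hi=3.0, n=3):
        # Binary lip mask: red component large enough and R/G ratio
        # inside the allowed band (dark non-lip pixels fail the tests).
        r = rgb[..., 0].astype(float)
        g = np.maximum(rgb[..., 1].astype(float), 1.0)  # avoid divide-by-zero
        lip = (r > t_red) & (r / g > t_lo) & (r / g < t_hi)
        # E_ext(x, y) = -(1 / (2n+1)^2) * sum of I over a (2n+1) x (2n+1)
        # block, i.e., the negative local mean of the binary mask.
        return -uniform_filter(lip.astype(float), size=2 * n + 1)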
III. OUR LIPREADING SYSTEM

As a preliminary study, our lipreading system is speaker dependent and recognizes isolated words. Each isolated word is represented by a color video (a sequence of mouth images sampled at 20 frames/s).
Our snake model was applied to every frame to extract snake radii as contour features, and the KLT was applied to every frame to extract the projection weights of the principal components in the eigenspace. A sequence of combined feature vectors is then classified by the HMM's to perform isolated word recognition. To emphasize the temporal change of the moving lips, we actually used the difference between the original feature vectors of consecutive frames.
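This per-frame feature assembly can be sketched as follows (our illustration; the basis size k = 32 follows the paper, everything else is assumed). KLT on centered data amounts to projecting onto the leading eigenvectors of the sample covariance, obtainable from an SVD:

    import numpy as np

    def klt_basis(train_frames, k=32):
        # Estimate the top-k KLT (PCA) basis from flattened training frames.
        X = np.stack([f.ravel() for f in train_frames]).astype(float)
        mean = X.mean(axis=0)
        # Right singular vectors of the centered data are the eigenvectors
        # of the sample covariance, ordered by decreasing variance.
        _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
        return mean, vt[:k]

    def feature_sequence(frames, radii_per_frame, mean, basis):
        # Per frame: concatenate the eight snake radii with the k
        # eigenspace projection weights, then difference consecutive
        # frames to emphasize the temporal change of the moving lips.
        feats = np.stack([np.concatenate([r, basis @ (f.ravel() - mean)])
                          for f, r in zip(frames, radii_per_frame)])
        return np.diff(feats, axis=0)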
A. Motion Tracking of Lips

As the lips move, the snake contour can shrink or expand with them by tracking the local minimum of the energy function. Because of this tracking capability, the initial contour for each new frame is determined by shrinking the converged snake obtained from the previous frame; shrinking is appropriate because, for every frame, the algorithm assumes the initial contour to lie inside the target contour, and only outward pushing forces are used. For the first frame, however, the initial contour is determined by placing a small circle at the centroid of the binary thresholded image, which is generated by applying the R/G threshold (see Section II-B) to the color image.
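A sketch of this initialization policy (the shrink factor is our own choice, not taken from the paper):

    import numpy as np

    def init_from_previous(prev_radii, shrink=0.6):
        # Shrink the previous frame's converged snake so the initial
        # contour starts inside the target contour.
        return prev_radii * shrink

    def init_first_frame(lip_mask, m=8, r0=2.0):
        # First frame: a small circle (m points, radius r0) placed at
        # the centroid of the R/G-thresholded binary image.
        ys, xs = np.nonzero(lip_mask)
        return (xs.mean(), ys.mean()), np.full(m, r0)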
B. Recognition

After both the KLT and the HMM's are trained, the lipreading system is ready for visual speech recognition. When the speaker "talks" in front of the camera, a silent color movie is captured. Then the snake algorithm and the KLT extract two sets of independent feature vectors for every frame of the video (see Fig. 1). These feature vectors are combined and concatenated to form a sequence. Finally, the HMM recognizer identifies the word by searching for the HMM with the highest likelihood for the sequence.
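The decision rule itself is a maximum-likelihood search over the per-word models. In the sketch below, word_hmms maps each vocabulary word to a trained model exposing a log_likelihood method (an interface we assume here for illustration):

    def recognize(word_hmms, feature_seq):
        # Pick the word whose HMM assigns the observed sequence
        # the highest log-likelihood.
        return max(word_hmms,
                   key=lambda w: word_hmms[w].log_likelihood(feature_seq))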
C. Capabilities and Limitations

The current system is speaker-dependent and recognizes isolated words. When recording a video for a word, the speaker "talks" in front of the camera at a normal rate and without any exaggeration of the lip motions. The speaker hits keys to start and stop recording for a word (i.e., word segmentation by hand). Each utterance contains only one word. Due to the use of HMM classifiers, there is no need to manually remove the silent video portions of an utterance, since the silence states are automatically absorbed by the HMM's. During recording, the speaker roughly positions his mouth in front of the camera [so the frame looks like Fig. 3(a)] and keeps head motion to a minimum. For each video recording session, small variations of the speaker's position and orientation are allowed. The current system requires fixed lighting and does not perform any preprocessing to center and scale the mouth image. (The KLT is performed on the whole frame.)

The system was not optimized for real-time processing. It took about 4–6 s to recognize a word on an UltraSparc 2 running Solaris 2.5. Nor was the system trained for speaker-independent use, which would require a much more complete database and is beyond the scope of this study.

IV. SIMULATION RESULTS

One application that can benefit from lipreading technology is command and control in highly noisy environments, for example, a voice-activated car audio system [17]. The following noise sources make the task of acoustic speech recognition difficult for a voice-activated car audio system: road noise, engine noise, jittering noise from outside the car, audio sound (music or human speech) from the car audio system itself, and conversation between the driver and the passengers. We chose ten words that could be used for such systems and collected the following data from a single speaker:

• ten isolated words—on, off, yes, no, up, down, forward, rewind, radio, tape;
• 17 examples (video sequences) of each word;
• about 30 to 70 color video frames per video sequence;
• R, G, and B components of size 80 × 60 in each image frame (i.e., an 80 × 60 × 3 array).

The following steps show how we performed the leave-one-set-out cross validation to train and test the 170 video sequences on the lipreading system (a schematic sketch of this loop follows the list).

1) Train the KLT with 1350 color images (three randomly chosen training sets comprising 30 video sequences).
2) Use the snake algorithm and the KLT to extract feature vectors for all 170 sequences.
3) Construct 170 sequences of feature vectors.
4) Pick one test set that has ten video sequences representing all ten words in the vocabulary.
5) Train the HMM recognizer, using the temporal difference of the feature vectors, with the remaining 16 sets (160 sequences).
6) Perform the recognition task on the test set.
7) Fill in the confusion matrix (see Table II).
8) Repeat Steps 4)–7) for all 17 sets.
9) Calculate the overall recognition rate using the confusion matrix.
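Schematically, the loop above might be written as follows (our sketch; train_hmms and recognize are assumed helpers, and each set maps a word to one feature sequence):

    import numpy as np

    def leave_one_set_out(sets, train_hmms, recognize):
        # sets: 17 sets, each a dict mapping word -> feature sequence.
        words = sorted(sets[0])
        confusion = np.zeros((len(words), len(words)), dtype=int)
        for held_out in range(len(sets)):
            train = [s for i, s in enumerate(sets) if i != held_out]
            hmms = train_hmms(train)                # one HMM per word
            for i, w in enumerate(words):
                guess = recognize(hmms, sets[held_out][w])
                confusion[i, words.index(guess)] += 1
        # Overall recognition rate: diagonal mass of the confusion matrix.
        return confusion, confusion.trace() / confusion.sum()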
We were able to achieve 94% recognition accuracy based on the temporal difference vectors with eight snake radii and 32 KLT weights. Table I shows our simulation results with different sizes of feature vectors, and Table II shows the confusion matrix corresponding to the best performance. We also noted that if the word "radio," which has the lowest recognition accuracy, is discarded from the vocabulary, the overall recognition rate increases to 97.4% for nine isolated words. Due to a hardware limitation, the video was sampled at 20 frames/s; we believe that if the sampling rate were increased to 30 frames/s, our lipreading system could achieve an even higher recognition rate. Since the main focus of this system is a single speaker, there is significant performance degradation when the system is tested on other speakers (in a very inconclusive experiment, the recognition rate dropped to about 70–80%).

[TABLE I. Simulation results with different sizes of feature vectors.]

[TABLE II. Confusion matrix for Case 3 of Table I, with 17 samples for each word.]

V. CONCLUSION

In this work, we showed that visual speech recognition is feasible with color video (without any acoustic data) for a certain vocabulary. Unlike other lipreading systems, ours does not require the speaker to put on any special marker or lipstick. Although the system was designed for lipreading, we believe the framework, with minor modifications, can be applied to other applications such as eyelid motion recognition (for detecting a sleepy driver), heart movement classification, etc.

REFERENCES

[1] C. Bregler and Y. Konig, "'Eigenlips' for robust speech recognition," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP'94), Adelaide, Australia, 1994, pp. II-669–II-672.
[2] G. I. Chiou and J. N. Hwang, "Image sequence classification using a neural network based active contour model and a hidden Markov model," in Proc. IEEE Int. Conf. Image Processing, Austin, TX, Nov. 1994, pp. III-926–III-930.
[3] ——, "A neural network based stochastic active contour model (NNS-SNAKE) for contour finding of distinct features," IEEE Trans. Image Processing, vol. 4, pp. 1407–1416, Oct. 1995.
[4] ——, "Lipreading from color motion video," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Atlanta, GA, May 1996, pp. IV-2156–IV-2159.
[5] L. D. Cohen and I. Cohen, "Finite-element methods for active contour models and balloons for 2-D and 3-D images," IEEE Trans. Pattern Anal. Machine Intell., vol. 15, pp. 1131–1141, Nov. 1993.
[6] T. F. Cootes and C. J. Taylor, "Active shape models—'smart snakes'," in Proc. Brit. Machine Vision Conf., 1992, pp. 266–275.
[7] C. G. Fisher, "Confusions among visually perceived consonants," J. Speech Hearing Res., vol. 11, pp. 796–803, 1968.
[8] A. J. Goldschen, O. N. Garcia, and E. Petajan, "Continuous optical automatic speech recognition by lipreading," in Proc. 28th Asilomar Conf. Signals, Systems, and Computers, Pacific Grove, CA, 1994, pp. 572–577.
[9] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," Int. J. Comput. Vis., vol. 1, pp. 321–331, Jan. 1988.
[10] K. Mase and A. Pentland, "Automatic lipreading by optical-flow analysis," Syst. Comput. Jpn., vol. 22, pp. 67–76, 1991.
[11] K. Okada, C. Ohira, and H. Nakamura, "A method for lip shape extraction," Trans. Inst. Electron., Inform., Commun. Eng. D-II, vol. J72-D-II, pp. 1582–1583, Sept. 1989.
[12] E. Petajan, B. Bischoff, D. Bodoff, and N. Brooke, "An improved automatic lipreading system to enhance speech recognition," in Proc. ACM SIGCHI, 1988, pp. 19–25.
[13] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[14] P. L. Silsbee and A. C. Bovik, "Computer lipreading for improved accuracy in automatic speech recognition," IEEE Trans. Speech Audio Processing, vol. 4, pp. 337–351, Sept. 1996.
[15] T. Starner and A. Pentland, "Visual recognition of American sign language using hidden Markov models," in Proc. Int. Workshop Automatic Face and Gesture Recognition, June 26–28, 1995.
[16] D. G. Stork, G. Wolff, and E. Levine, "Neural network lipreading system for improved speech recognition," in Proc. Int. Joint Conf. Neural Networks, 1992, pp. 285–295.
[17] S. Tsurufuji, H. Ohnishi, M. Iida, R. Suzuki, and Y. Sumi, "A voice activated car audio system," IEEE Trans. Consum. Electron., vol. 37, pp. 592–597, Aug. 1991.
[18] M. Turk and A. Pentland, "Eigenfaces for recognition," J. Cognit. Neurosci., vol. 3, pp. 71–86, 1991.
[19] B. P. Yuhas, M. H. Goldstein, Jr., and T. J. Sejnowski, "Integration of acoustic and visual speech signals using neural networks," IEEE Commun. Mag., pp. 65–71, Nov. 1989.