IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 6, NO. 8, AUGUST 1997
Lipreading from Color Video

Fig. 1. Color video recognition using Snakes, KLT, and HMM's.
TABLE II
CONFUSION MATRIX FOR CASE 3 OF TABLE I WITH 17 SAMPLES FOR EACH WORD
a color video (a sequence of mouth images sampled at 20 frames/s). Our snake model was applied to every frame to extract snake radii as contour features, and the KLT was applied to every frame to extract the projection weights of the principal components in the eigenspace. The sequence of combined feature vectors is then classified by the HMM's to perform isolated word recognition. To emphasize the temporal change of the moving lips, we actually used the difference between the feature vectors of consecutive frames.

A. Motion Tracking of Lips

As the lips move, the snake contour can shrink or expand with them by tracking the local minimum of the energy function. Because of this tracking capability, the initial contour for each frame is determined from the converged snake of the previous frame. Since our snake model uses outward pushing forces, the initial contour of each new frame is obtained by shrinking the converged snake from the previous frame. Only outward pushing forces are used because, for every frame, the algorithm assumes the initial contour (after shrinking) to lie inside the target contour. For the first frame, however, the initial contour is determined by placing a small circle at the centroid of the binary thresholded image, which is generated by applying an R/G threshold (see Section II-B) to the color image.

B. Recognition

After both the KLT and the HMM's are trained, the lipreading system is ready for visual speech recognition. When the speaker "talks" in front of the camera, a silent color movie is captured. The snake algorithm and the KLT then extract two sets of independent feature vectors for every frame of the video (see Fig. 1). These feature vectors are combined and concatenated to form a sequence. Finally, the HMM recognizer identifies the word by searching for the HMM model with the highest likelihood for the sequence.

C. Capabilities and Limitations

The current system is speaker-dependent and recognizes isolated words. When recording a video for a word, the speaker "talks" in front of the camera at a normal rate and without any exaggeration of the lip motions. The speaker hits keys to start and stop recording for a word (i.e., word segmentation by hand). Each utterance contains only one word. Because HMM classifiers are used, there is no need to manually remove the silent portions of the utterance, since the silence states are automatically absorbed by the HMM's. During recording, the speaker roughly positions his mouth in front of the camera [so that the frame looks like Fig. 3(a)] and keeps head motion to a minimum. For each video recording session, small variations of the speaker's position and orientation are allowed. The current system requires fixed lighting and does not perform any preprocessing to center and scale the mouth image. (The KLT is performed on the whole frame.)

The system was not optimized for real-time processing; it took about 4–6 s to recognize a word on an UltraSparc 2 running Solaris 2.5. Neither was the system trained for speaker-independent operation, which requires a much more complete database and is beyond the scope of this study.

IV. SIMULATION RESULTS

One application that can benefit from lipreading technology is command and control in highly noisy environments, for example, a voice-activated car audio system [17]. The following noise sources make acoustic speech recognition difficult for a voice-activated car audio system: road noise, engine noise, jittering noise from outside the car, audio sound (music or human speech) from the car audio system, and conversation between the driver and the passengers. We chose ten words that can be used for such systems and collected the following data from a single speaker:

• ten isolated words—on, off, yes, no, up, down, forward, rewind, radio, tape;
• each word has 17 examples (video sequences);
• each video sequence has about 30 to 70 color video frames;
• each image frame contains R, G, B components of size 80 × 60 (i.e., an 80 × 60 × 3 array).

The following steps show how we performed the leave-one-set-out cross validation to train and test the 170 video sequences on the lipreading system.

1) Train the KLT with 1350 color images (three randomly chosen training sets with 30 video sequences).
2) Use the snake algorithm and the KLT to extract feature vectors for all 170 sequences.
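The per-frame feature construction and its temporal differencing can be sketched as follows. This is a minimal illustration: the dimensions (eight snake radii, 32 KLT weights per frame) follow the text, but the function names are ours, not the authors'.

```python
import numpy as np

def combine_features(snake_radii, klt_weights):
    """Concatenate per-frame contour features (snake radii) with
    appearance features (KLT projection weights).
    snake_radii: (T, 8), klt_weights: (T, 32)  ->  (T, 40)."""
    return np.concatenate([snake_radii, klt_weights], axis=1)

def temporal_difference(features):
    """Difference between feature vectors of consecutive frames,
    emphasizing the temporal change of the moving lips.
    (T, 40)  ->  (T - 1, 40)."""
    return np.diff(features, axis=0)
```

An utterance of T frames thus yields a sequence of T − 1 forty-dimensional difference vectors for the HMM recognizer.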
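The contour initialization just described can be sketched in NumPy as below. The channel order, the R/G ratio threshold, the circle radius, and the shrink factor are illustrative guesses, not values from the paper.

```python
import numpy as np

def rg_threshold_mask(frame, ratio=1.2):
    """Binary lip mask: keep pixels whose R/G ratio exceeds a threshold
    (lips are redder than the surrounding skin).  Assumes RGB channel
    order; the ratio value is a guess."""
    r = frame[..., 0].astype(float)
    g = frame[..., 1].astype(float) + 1e-6  # avoid division by zero
    return (r / g) > ratio

def initial_contour_first_frame(frame, n_points=8, radius=5.0):
    """First frame: place a small circle of snake points at the centroid
    of the binary thresholded image."""
    ys, xs = np.nonzero(rg_threshold_mask(frame))
    cy, cx = ys.mean(), xs.mean()
    angles = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    return np.stack([cy + radius * np.sin(angles),
                     cx + radius * np.cos(angles)], axis=1)

def initial_contour_next_frame(converged, shrink=0.8):
    """Later frames: shrink the previous frame's converged snake toward
    its centroid, so the outward pushing forces start inside the target."""
    centroid = converged.mean(axis=0)
    return centroid + shrink * (converged - centroid)
```

In both cases the snake then evolves outward from this initial contour to the local minimum of its energy function.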
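Recognition amounts to scoring the observation sequence under each word's trained HMM and returning the word with the highest likelihood. The paper does not specify its HMM topology or observation model, so the sketch below uses a generic discrete-observation forward algorithm purely for illustration; the authors' features are continuous vectors.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm for a discrete-observation HMM.
    pi: (N,) initial state probs, A: (N, N) transitions,
    B: (N, M) emission probs, obs: sequence of symbol indices.
    Returns log P(obs | model)."""
    alpha = pi * B[:, obs[0]]
    ll = 0.0
    for t in range(1, len(obs)):
        c = alpha.sum()           # rescale to avoid underflow
        ll += np.log(c)
        alpha = (alpha / c) @ A * B[:, obs[t]]
    return ll + np.log(alpha.sum())

def recognize(obs, models):
    """Pick the word whose HMM assigns the sequence the highest
    likelihood.  models: word -> (pi, A, B)."""
    return max(models, key=lambda w: log_likelihood(obs, *models[w]))
```

Because likelihoods of long sequences underflow double precision, the forward variables are renormalized at each step and the log of the normalizer is accumulated.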
3) Construct 170 sequences of feature vectors.
4) Pick one test set that has ten video sequences representing all ten words in the vocabulary.
5) Train the HMM recognizer, using the temporal difference of the feature vectors, with the remaining 16 sets (160 sequences).
6) Perform the recognition task on the test set.
7) Fill in the confusion matrix (see Table II).
8) Repeat Steps 4) to 7) for all 17 sets.
9) Calculate the overall recognition rate from the confusion matrix.

We were able to achieve 94% recognition accuracy based on the temporal difference vectors with eight snake radii and 32 KLT weights. Table I shows our simulation results for different sizes of feature vectors. Table II shows the confusion matrix corresponding to the best performance. We also noted that if the word "radio," which has the lowest recognition accuracy, is discarded from the vocabulary, the overall recognition rate increases to 97.4% for nine isolated words. Due to hardware limitations, the video was sampled at 20 frames/s. We believe that if the sampling rate were increased to 30 frames/s, our lipreading system could achieve an even higher recognition rate. Since the main focus is a single-speaker system, there is significant performance degradation when the system is tested on other speakers (in a very inconclusive experiment, the recognition rate dropped to about 70% to 80%).

V. CONCLUSION

In this work, we showed that visual speech recognition is feasible with color video (without any acoustic data) for a certain vocabulary. Unlike other lipreading systems, ours does not require the speaker to put on any special marker or lipstick. Although the system was designed for lipreading, we believe the framework, with minor modification, can be applied to other applications such as eyelid motion recognition (for detecting a sleepy driver), heart movement classification, etc.

REFERENCES

[11] K. Okada, C. Ohira, and H. Nakamura, "A method for lip shape extraction," Trans. Inst. Electron., Inform., Commun. Eng. D-II, vol. J72-D-II, pp. 1582–1583, Sept. 1989.
[12] E. Petajan, B. Bischoff, D. Bodoff, and N. Brooke, "An improved automatic lipreading system to enhance speech recognition," ACM SIGCHI, pp. 19–25, 1988.
[13] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[14] P. L. Silsbee and A. C. Bovik, "Computer lipreading for improved accuracy in automatic speech recognition," IEEE Trans. Speech Audio Processing, vol. 4, pp. 337–351, Sept. 1996.
[15] T. Starner and A. Pentland, "Visual recognition of American sign language using hidden Markov models," in Proc. Int. Workshop Automatic Face and Gesture Recognition, June 26–28, 1995.
[16] D. G. Stork, G. Wolff, and E. Levine, "Neural network lipreading system for improved speech recognition," Int. Joint Conf. Neural Networks, pp. 285–295, 1992.
[17] S. Tsurufuji, H. Ohnishi, M. Iida, R. Suzuki, and Y. Sumi, "A voice activated car audio system," IEEE Trans. Consum. Electron., vol. 37, pp. 592–597, Aug. 1991.
[18] M. Turk and A. Pentland, "Eigenfaces for recognition," J. Cognit. Neurosci., vol. 3, pp. 71–86, 1991.
[19] B. P. Yuhas, M. H. Goldstein, Jr., and T. J. Sejnowski, "Integration of acoustic and visual speech signals using neural networks," IEEE Commun. Mag., pp. 65–71, Nov. 1989.

Segmentation of Handwritten Interference Marks Using Multiple Directional Stroke Planes and Reformalized Morphological Approach

Su Liang, M. Ahmadi, and M. Shridhar

Abstract—A new algorithm for the extraction of words from printed documents that have interference marks and other strokes cutting across text is presented. Morphological operations based on multiple-direction projection planes and skeleton images are adopted here to prevent the "flooding water" effect of conventional morphological operations. Test results indicate the feasibility of the proposed approach.
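The leave-one-set-out protocol of Steps 4)–9) can be sketched generically as below; `train` and `classify` stand in for the HMM training and scoring stages, and the set assignment assumes the 17 examples of each word are stored consecutively.

```python
import numpy as np

def overall_recognition_rate(confusion):
    """Step 9): correctly recognized / total, where rows index the
    spoken word and columns the recognized word."""
    confusion = np.asarray(confusion, dtype=float)
    return confusion.trace() / confusion.sum()

def leave_one_set_out(sequences, labels, n_sets, n_words, train, classify):
    """Steps 4)-8): each round holds out one set (one example of every
    word), trains on the remaining sets, tests on the held-out set, and
    accumulates the confusion matrix."""
    set_id = np.arange(len(labels)) % n_sets
    confusion = np.zeros((n_words, n_words), dtype=int)
    for held in range(n_sets):
        train_idx = np.nonzero(set_id != held)[0]
        test_idx = np.nonzero(set_id == held)[0]
        model = train([sequences[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        for i in test_idx:
            confusion[labels[i], classify(model, sequences[i])] += 1
    return confusion
```

With ten words, 17 examples each, and one misrecognition per word, this yields a trace of 160 over 170 trials, i.e., the roughly 94% rate reported above.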