
Proceedings of the 2011 IEEE International Conference on Mechatronics and Automation, August 7–10, Beijing, China

Real-time Structured Light 3D Scanning for Hand Tracking


Renju Li and Hongbin Zha
Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, China. {lirenju, zha}@cis.pku.edu.cn
Abstract: Hand tracking is widely used in virtual reality systems, and it remains a critical problem in computer vision and graphics. In this paper, a hand tracking system based on structured light scanning is proposed. The pattern designed for the real-time scanning system is composed of color stripes with a quadratic intensity distribution within each stripe. After camera and projector calibration, 3D data of a moving hand can be acquired with the real-time system. For each frame of the sequence, the fingertips are detected using the boundary of the precise 3D data. In comparison with methods based on passive vision only, the proposed system is more robust and better suited to bare hand tracking. Experiments have been performed to evaluate the effectiveness of the proposed method.

Index Terms: structured light; real-time 3D scanning; hand tracking; virtual reality

I. INTRODUCTION

Hand tracking is widely used in human-computer interaction and remains one of the hot research topics in the field. Traditional tracking methods capture the motion of hands using data gloves, retro-reflective targets or color markers. Some bare hand tracking systems based on passive vision have also been developed; however, as 3D estimation is a difficult problem, their robustness is limited to some extent. In this paper, a robust and accurate hand tracking system using real-time structured light scanning is implemented. As no markers, gloves or other auxiliary devices need to be worn, the method is suitable for bare hand tracking. With effective tracking of the fingertip positions, the method has been successfully utilized in some virtual input systems. The tracking result is more reliable using structured light scanning than with methods based on passive vision only. The remainder of the paper is organized as follows. After introducing the related work in Section II, real-time structured light 3D scanning is described in Section III and hand tracking based on 3D data is given in Section IV. Experimental results illustrating the performance of the method are presented in Section V. We draw conclusions and discuss future work in Section VI.

II. RELATED WORK

Most motion capture systems can be used for tracking hand position and pose. These systems can be classified into two categories: non-optical and optical methods. Non-optical methods use data gloves, while optical methods make use of retro-reflective markers, color gloves or image features. Data gloves capture 3D data of key points using several sensors on the glove; commercial products include the P5 data glove and the Immersion CyberGlove. Because the sensors make the devices cumbersome, they cannot be used for bare hand tracking. As a typical optical method, motion capture using markers is widely used: multiple synchronized cameras capture images of retro-reflective markers attached to the key points of the hand, and hand pose is estimated from the real-time 3D data of the markers. As ambient light usually affects the robustness of such systems and markers are required, this kind of system is not an ideal human-computer interaction tool. Typical markers include retro-reflective targets or LEDs [1]. By detecting and tracking color blocks designed on gloves [2] [3], hand motion can also be recovered. As with data gloves, the need to wear gloves is uncomfortable for users. Bare hand tracking is a hot topic in this field. Commonly used features include edges and boundaries, which have been employed in some gesture recognition systems [4]. Fingertip tracking is a fundamental problem of bare hand tracking and has been studied by several researchers [5] [6] [7]; the accuracy and robustness are limited due to the difficulty of 3D calculation. To improve the robustness of bare hand tracking, depth information obtained with a time-of-flight range camera has been used [8] [9]. A range camera can obtain 3D information in real time, but its low accuracy and noise make it difficult to segment the object of interest from the background. The proposed hand tracking system based on real-time structured light scanning can greatly improve the matching robustness and the accuracy of the 3D calculation. A key issue for structured light scanning is the coding and decoding strategy.
This work was supported in part by the NHTRDP 863 Grant No. 2009AA01Z329 and the NHTRDP 863 Grant No. 2009AA012105.



Fig. 1. Pipeline of the structured light scanning based hand tracking system.

Fig. 2. Real-time 3D scanning system.

Salvi et al. [10] gave a comprehensive review of pattern codification strategies. For static objects, temporal coding methods such as Gray code, phase shift or their combination are utilized. Fast 3D acquisition with multiple images was proposed by Zhang and Huang [11], with phase-shifting patterns encoded in the three channels of one image. Koninckx and Van Gool [12] proposed a method based on a set of stripes, using graph cut to obtain the relative numbering of the stripes. Sagawa et al. [13] presented a single grid pattern in which vertical and horizontal stripes of two colors are used. Ulusoy et al. [14] proposed a similar codification method based on a grid pattern. A color stripe pattern was proposed in our previous work [15].

The pipeline of the proposed hand tracking system is shown in Figure 1. First, real-time scanning is performed with the designed pattern and the calibrated camera-projector system. In each frame, fingertips are detected using the boundary of the precise 3D scanned data. For manipulating an object in the virtual scene, its digital model can be generated by static 3D scanning or manual modeling. By aligning the real-time scanned data to the virtual model, collision detection between the fingertips and the virtual model yields the hand tracking results.

III. REAL-TIME 3D SCANNING

The designed real-time 3D scanning system consists mainly of one camera and one projector, as shown in Figure 2. The steps of real-time 3D scanning are as follows. First, the parameters of the system configuration are obtained by calibrating the camera and projector. A specially designed pattern is then projected onto the object and the images are captured. Next, pattern decoding gives the correspondences between pixels in the captured image and points in the designed pattern. Finally, 3D reconstruction is performed based on the collinearity equations.
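These steps repeat for every frame. The following Python sketch outlines the per-frame loop; the callables (capture, decode, reconstruct, detect_tips, collide) are hypothetical stand-ins for the steps detailed in Sections III and IV, not the authors' actual interfaces.

def run_pipeline(capture, decode, reconstruct, detect_tips, collide,
                 max_frames=1000):
    # Hypothetical outline of the per-frame loop of Figure 1.
    prev_labels = None  # color labels of the previous frame (Section III-C)
    for _ in range(max_frames):
        image = capture()                                 # frame under projection
        peaks, prev_labels = decode(image, prev_labels)   # Section III-B
        cloud = reconstruct(peaks)                        # Eqs. (5)-(7)
        tips = detect_tips(cloud)                         # Section IV-B
        yield collide(tips)                               # pressed buttons, IV-C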

A. System Calibration

The calibration board is a black plane on which white circles are printed, as shown in Figure 3. Some special circles with a larger radius are included so that the row and column number of each circle in the captured image can be obtained easily.

Fig. 3. Calibration board.

In the designed system, both the camera and the projector need to be calibrated. The system calibration process obtains the projective relationship between a 3D point $X_w = [X_w\ Y_w\ Z_w\ 1]^T$, its corresponding image point $x_c = [u_c\ v_c\ 1]^T$ and its projector point $x_p = [u_p\ v_p\ 1]^T$, which are defined as follows:

$$x_c = K_c [R_c \,|\, T_c] X_w, \qquad (1)$$

$$x_p = K_p [R_p \,|\, T_p] X_w, \qquad (2)$$

where $K_c$, $R_c$ and $T_c$ denote the intrinsic parameter matrix, rotation matrix and translation vector of the camera, and $K_p$, $R_p$ and $T_p$ are the corresponding quantities for the projector. Based on the 3D-2D correspondences, camera calibration is performed using the method described in [16]. As the projector cannot capture images, it is treated as an inverse camera for calibration. A coding method combining Gray code and phase shift is utilized for robust matching between the pattern and the captured image. Patterns with horizontal stripes and with vertical stripes are projected onto the calibration board and the images are captured by the camera, similar to the method of [17]. Figure 4(a) shows one of the patterns using horizontal stripes and Figure 4(b) shows one of the patterns with vertical stripes.
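As an illustration of the projective model in Equations (1) and (2), the following Python sketch maps a world point through $K[R\,|\,T]$ for the camera; the numeric intrinsics are made-up placeholders, not calibration results from the paper.

import numpy as np

def project(K, R, T, Xw):
    # Map a 3D world point to pixel coordinates via x = K (R Xw + T),
    # which is Equation (1)/(2) written without homogeneous padding.
    x = K @ (R @ Xw + T)
    return x[:2] / x[2]          # perspective division -> (u, v)

K_c = np.array([[1500.0, 0.0, 640.0],    # [f_c/d_uc, 0, u_0c]
                [0.0, 1500.0, 512.0],    # [0, f_c/d_vc, v_0c]
                [0.0, 0.0, 1.0]])
R_c, T_c = np.eye(3), np.zeros(3)        # world frame = camera frame here
print(project(K_c, R_c, T_c, np.array([100.0, 50.0, 800.0])))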



Fig. 4. Projector calibration. (a) One horizontal phase-shifting pattern. (b) One vertical phase-shifting pattern.

After calculating the phases of the circle centers, their corresponding vertical and horizontal coordinates in the pattern can be obtained. Then the projector parameters are calibrated with the same method as the camera.

B. Pattern Design and Decoding

The pattern used in the system combines periodic quadratic stripes with the color coding method proposed in our previous work [15]. Color coding encodes the period number, and the local quadratic coding is used for the peak calculation within each period. The periodic quadratic pattern with vertical stripes has constant intensity $I_p(x, y)$ along each vertical line, and the intensity along each horizontal line is defined as

$$X_l(x, y) = X(x, y) - nT, \qquad (3)$$

$$I_p(x, y) = 255\left(1 - \left(\frac{X_l(x, y)}{T/2}\right)^2\right), \qquad (4)$$

where $T$ is the width of each period, $n$ is the period number, $X_l(x, y)$ denotes the local x position of point $(x, y)$ in the range $[-T/2, T/2]$, and $X(x, y)$ is the absolute x position, which acts as the coding value designed in the pattern. As the pattern is periodic, it is difficult to obtain the absolute x position from such a pattern alone for 3D reconstruction. Color coding is therefore utilized to encode the period number. De Bruijn sequences have been used for color coding in several one-shot scanning systems. A De Bruijn sequence B(k, n) of order n over an alphabet A of size k is a cyclic sequence in which every possible subsequence of length n appears as consecutive characters exactly once. We take the three primary colors and their complementary colors as the alphabet, and the order of the sequence is three. In the designed pattern, each stripe has one color and the same width, corresponding to one period of the pattern; no two consecutive stripes share the same color. The designed pattern has local maximum intensity at the center of each period and local minimum intensity at the boundary between two periods. We call the stripe with local maximum intensity a bright stripe. In the real-time hand tracking system, the bright stripes are reconstructed in real time and fingertip detection is performed based on their 3D data.
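A minimal Python sketch of such a pattern follows, with the bright stripe peaking at the center of each period as described above; the color sequence is a short illustrative stand-in, not the exact De Bruijn sequence used in the system.

import numpy as np

def make_pattern(width=1024, height=768, T=16, sequence="RGBCMYRBGYMC"):
    # Vertical stripes with the quadratic profile of Eqs. (3)-(4), each
    # period tinted by one color of the (illustrative) sequence.
    colors = {"R": (255, 0, 0), "G": (0, 255, 0), "B": (0, 0, 255),
              "C": (0, 255, 255), "M": (255, 0, 255), "Y": (255, 255, 0)}
    pattern = np.zeros((height, width, 3), dtype=np.uint8)
    for x in range(width):
        n = x // T                               # period number
        xl = (x - n * T) - T / 2.0               # local position in [-T/2, T/2)
        intensity = 1.0 - (xl / (T / 2.0)) ** 2  # Eq. (4), maximal mid-period
        pattern[:, x] = np.round(
            intensity * np.array(colors[sequence[n % len(sequence)]]))
    return pattern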
Pattern decoding recovers the designed x position in the pattern for each pixel in the captured images. First, we preprocess the input images by segmenting foreground from background using an empirical threshold. Then we extract the bright points by comparing the intensity of each pixel with that of its neighboring pixels; these bright points reveal the period distribution of the pattern. Next, we transform the original RGB color space to the HSI color space and take hue as the criterion to distinguish the colors. The mean hue of each color can be calibrated by projecting a pattern of that color onto the object before real-time scanning begins. Graph cut [18] is then utilized to obtain the color of each bright point, with the data cost based on the error between the designed color in the pattern and the captured color. The period number of each point is calculated from the colors of three consecutive periods. Finally, the sub-pixel positions of the bright stripes are obtained by quadratic fitting in a local window, as the pattern has a local quadratic intensity distribution.
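The sketch below illustrates the peak extraction, hue classification and sub-pixel refinement for one scanline; it assumes OpenCV is available for the color conversion, uses a nearest-mean-hue rule in place of the graph-cut labeling, and omits the period numbering.

import numpy as np
import cv2

def decode_scanline(row_rgb, mean_hues, fg_thresh=40.0):
    # row_rgb: (W, 3) uint8 scanline; mean_hues: {color: calibrated hue}.
    gray = row_rgb.astype(float).mean(axis=1)
    hsv = cv2.cvtColor(row_rgb[None, :, :], cv2.COLOR_RGB2HSV)[0]
    peaks = []
    for x in range(1, len(gray) - 1):
        if gray[x] < fg_thresh:                       # background pixel
            continue
        if gray[x] >= gray[x - 1] and gray[x] > gray[x + 1]:   # local max
            # sub-pixel peak from a parabola through the three samples
            denom = gray[x - 1] - 2.0 * gray[x] + gray[x + 1]
            dx = 0.5 * (gray[x - 1] - gray[x + 1]) / denom if denom else 0.0
            hue = float(hsv[x, 0])                    # hue wrap-around ignored
            color = min(mean_hues, key=lambda c: abs(mean_hues[c] - hue))
            peaks.append((x + dx, color))
    return peaks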
C. Real-time 3D Reconstruction

After system calibration, the camera and projector parameters are known. We denote the intrinsic and extrinsic parameters as follows:

$$K_c = \begin{bmatrix} f_c/d_{uc} & 0 & u_{0c} \\ 0 & f_c/d_{vc} & v_{0c} \\ 0 & 0 & 1 \end{bmatrix}, \qquad K_p = \begin{bmatrix} f_p/d_{up} & 0 & u_{0p} \\ 0 & f_p/d_{vp} & v_{0p} \\ 0 & 0 & 1 \end{bmatrix},$$

$$R_c = \begin{bmatrix} R_c^{11} & R_c^{12} & R_c^{13} \\ R_c^{21} & R_c^{22} & R_c^{23} \\ R_c^{31} & R_c^{32} & R_c^{33} \end{bmatrix}, \qquad R_p = \begin{bmatrix} R_p^{11} & R_p^{12} & R_p^{13} \\ R_p^{21} & R_p^{22} & R_p^{23} \\ R_p^{31} & R_p^{32} & R_p^{33} \end{bmatrix},$$

$$T_c = \begin{bmatrix} T_c^{1} \\ T_c^{2} \\ T_c^{3} \end{bmatrix}, \qquad T_p = \begin{bmatrix} T_p^{1} \\ T_p^{2} \\ T_p^{3} \end{bmatrix},$$

where $f_c$, $d_{uc}$, $d_{vc}$ and $(u_{0c}, v_{0c})$ are the focal length, the pixel sizes in the horizontal and vertical directions and the principal point of the camera, and $f_p$, $d_{up}$, $d_{vp}$ and $(u_{0p}, v_{0p})$ are the corresponding parameters of the projector. In our method, only vertical stripes are used in real-time scanning. For a pixel $(u_c, v_c)$ in the captured image, its corresponding stripe position $u_p$ in the designed pattern is obtained from the decoding process. Thus, we can rewrite Equations 1 and 2 as

$$u_c = \frac{f_c}{d_{uc}} \cdot \frac{R_c^{11} X_w + R_c^{12} Y_w + R_c^{13} Z_w + T_c^{1}}{R_c^{31} X_w + R_c^{32} Y_w + R_c^{33} Z_w + T_c^{3}} + u_{0c}, \qquad (5)$$

$$v_c = \frac{f_c}{d_{vc}} \cdot \frac{R_c^{21} X_w + R_c^{22} Y_w + R_c^{23} Z_w + T_c^{2}}{R_c^{31} X_w + R_c^{32} Y_w + R_c^{33} Z_w + T_c^{3}} + v_{0c}, \qquad (6)$$


Fig. 5. Real-time 3D scanning. (a) One captured image. (b) Color decoding result. (c) Reconstructed 3D data.
$$u_p = \frac{f_p}{d_{up}} \cdot \frac{R_p^{11} X_w + R_p^{12} Y_w + R_p^{13} Z_w + T_p^{1}}{R_p^{31} X_w + R_p^{32} Y_w + R_p^{33} Z_w + T_p^{3}} + u_{0p}. \qquad (7)$$
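Equations (5)-(7) are three linear equations in the unknowns $(X_w, Y_w, Z_w)$; each can be rearranged into the form $(r_i - a\,r_3) \cdot X = a\,t_3 - t_i$, where $a$ is the normalized image coordinate. A minimal Python sketch of the resulting 3x3 solve, using the notation above:

import numpy as np

def triangulate(uc, vc, up, Kc, Rc, Tc, Kp, Rp, Tp):
    # Solve Eqs. (5)-(7) for the world point seen at camera pixel (uc, vc)
    # whose decoded stripe position in the pattern is up.
    A, b = [], []
    for a, i in (((uc - Kc[0, 2]) / Kc[0, 0], 0),     # Eq. (5)
                 ((vc - Kc[1, 2]) / Kc[1, 1], 1)):    # Eq. (6)
        A.append(Rc[i] - a * Rc[2])
        b.append(a * Tc[2] - Tc[i])
    a = (up - Kp[0, 2]) / Kp[0, 0]                    # Eq. (7)
    A.append(Rp[0] - a * Rp[2])
    b.append(a * Tp[2] - Tp[0])
    return np.linalg.solve(np.array(A), np.array(b))  # (Xw, Yw, Zw)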

Fig. 6. Real-time hand tracking. (a) A telephone with markers. (b) Virtual model of the telephone. (c) Hand tracking result.

Thus the 3D coordinates $(X_w, Y_w, Z_w)^T$ can be calculated for all the detected bright points in the image. Using the method described in this section, real-time 3D data can be obtained for each frame of the captured sequence. The color distribution of the previous frame can serve as the initialization for the current frame, which makes the color decoding much faster. Figure 5 shows the captured fringe image, the recovered colors and the reconstructed 3D data of one frame from a real-time scanning sequence.

IV. REAL-TIME HAND TRACKING

Depending on the application, the requirements of hand tracking may vary. Here we focus on fingertip tracking, which is used in several virtual reality systems. Different from other fingertip tracking systems, in which fingertips are detected in 2D images and 3D estimation is performed using two or more calibrated cameras, the precise 3D data obtained by real-time structured light scanning makes the detection of fingertips more robust. In our implementation, we use hand tracking for a virtual keyboard. The scanned data is registered to the virtual model before tracking begins, and collision detection between the fingertips and the virtual buttons gives the tracking results.

A. Data Registration

Figure 6(a) shows a telephone with some markers, and its virtual 3D model obtained by static scanning is given in Figure 6(b). The 3D coordinates of the markers were also captured and stored in the virtual scene during the digital model construction; they are illustrated as the red points.

In real-time scanning, the scanned data is in a different coordinate system from the virtual 3D model. Before tracking begins, the markers are first recognized and registered to the virtual model to generate the transformation matrix. Assume we have M markers in the virtual scene and capture N points when scanning the object. First, based on the distances between the points in the two point sets, the correspondences can be obtained. Then, the rotation matrix and translation vector can be calculated from the matched point sets using the quaternion method or singular value decomposition. During real-time scanning, this transformation is applied to the scanned data. In Figure 6(c), the 3D data of a hand is registered to the digital model of the telephone.
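A minimal sketch of the SVD variant of this alignment (the Kabsch method) follows; the distance-based correspondence search is assumed to be done already, so P and Q are matched 3xN point sets.

import numpy as np

def rigid_transform(P, Q):
    # Least-squares R, t such that Q ~ R @ P + t.
    cp = P.mean(axis=1, keepdims=True)
    cq = Q.mean(axis=1, keepdims=True)
    H = (P - cp) @ (Q - cq).T                           # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # guard reflections
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t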


B. Fingertip Detection

Fingertip detection using the curvature of the boundary in 2D images was presented in [5]. The shortcoming of that method is that a threshold determining the scale must be set for the curvature calculation, since a 2D image carries no size information. Here we present a fingertip detection method that makes use of the 3D data. From a 3D point cloud, we can easily obtain the boundary, as shown in Figure 7. For a point P on the boundary, we select P1 and P2 at distance d along the boundary from P. We then calculate the angle between the vectors PP1 and PP2; if the angle is less than a threshold, P is a candidate fingertip point. For fingertip detection, d is set to 10 mm, and the results are very robust in our practice. For each finger, several consecutive candidates are detected on the boundary, and their centroid is taken as the fingertip point. As the valleys between the fingers may also be detected by this method, we filter them based on the distance between the detected candidate points and the palm center, shown as the center of the circle in Figure 7. The palm center is marked manually in the first frame for initialization and is updated in later frames using circle fitting.
Fig. 7. Fingertip detection.
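A minimal Python sketch of this boundary-angle test follows; the boundary is assumed to be an ordered Nx3 array in millimeters, the 60-degree angle threshold is an assumed value (the paper does not state one), and the candidate grouping, palm-center filtering and circle fitting described above are omitted.

import numpy as np

def fingertip_candidates(boundary, d=10.0, max_angle_deg=60.0):
    # Keep boundary points whose neighbors d mm along the boundary on
    # either side subtend an angle below the threshold.
    seg = np.linalg.norm(np.diff(boundary, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])   # cumulative arc length
    candidates = []
    for i, P in enumerate(boundary):
        j = int(np.searchsorted(s, s[i] - d))     # point ~d behind P
        k = int(np.searchsorted(s, s[i] + d))     # point ~d ahead of P
        if j >= i or k >= len(boundary):
            continue
        v1, v2 = boundary[j] - P, boundary[k] - P
        cos_a = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
        if np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))) < max_angle_deg:
            candidates.append(i)
    return candidates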

C. Collision Detection

In our hand tracking system, the fingertip detection result is used for keyboard input. Collision detection is performed between the fingertip positions and the bounding boxes of the virtual buttons. Based on the collision detection result, it can be determined whether a button is pressed.

V. EXPERIMENTAL RESULTS

To test the performance of the system, we conducted experiments in which the hand tracking result is used to control a virtual phone composed of twelve keys and a virtual keyboard with five letters. The system has a DH-HV1303UC color camera with a resolution of 1280 × 1024 and an LG HX300 projector with a resolution of 1024 × 768. Multi-thread programming is utilized in the implemented system: one thread captures the real-time images while another performs the calculation and rendering. The presented method was tested on an Intel Core 2 Duo at 2.66 GHz with 4 GB RAM and an NVIDIA GeForce 9600GT with 512 MB of memory. The running time is about 100 ms per frame: 50 ms for pattern decoding, 30 ms for 3D calculation and fingertip detection, and 20 ms for rendering.

In the virtual phone experiment, one finger presses a key at a time. Figure 6(c) shows one frame of the virtual phone sequence, in which the captured image, the real-time 3D data, the virtual phone and the detection result (illustrated in the dialog) are shown. Figure 8 shows an example of using the proposed method for virtual keyboard input. We captured a sequence of 2158 frames in which different numbers of buttons were pressed. Figure 8(a) shows a frame with no button pressed, and Figures 8(b,c), 8(d,e), 8(f,g) and 8(h,i) illustrate frames with one, two, three and four buttons pressed respectively. When a button is pressed, it turns gray in the tracking result dialog, shown in the bottom-left part of each figure. The tracking results correspond with the 3D data shown on the right of each figure.

Fig. 8. Hand tracking as virtual keyboard input. (a) No buttons pressed. (b,c) One button pressed. (d,e) Two buttons pressed. (f,g) Three buttons pressed. (h,i) Four buttons pressed.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we introduce a hand tracking system based on active structured light scanning. The strength of structured light scanning is its robustness in obtaining precise 3D information, and it has been widely used in reverse engineering, quality control, entertainment and medical treatment. The designed real-time 3D scanning system can obtain 3D data of a moving hand at high speed, and the fingertip tracking results are robust enough for bare hand tracking. The proposed method can be applied in virtual reality systems such as virtual keyboards and virtual driving. A drawback of the color codification strategy is that the original hand color is lost. This problem can be solved by fast switching between pattern projection and no projection, or by utilizing other real-time structured light coding methods. In our implementation, the calculation is performed entirely on the CPU. GPUs have shown great potential for increasing calculation speed, especially for image processing, so in future work we will move some of the computation to the GPU to improve the real-time performance. Another application of hand tracking is gesture recognition, which is promising because gestures convey human intentions, with applications ranging from simulation and robot teaching to graphical interface control and device control. In future work, we will perform non-rigid registration of the real-time point cloud to a predefined template and estimate the pose of the hand precisely. Based on that, gestures can be recognized for different kinds of applications.

REFERENCES
[1] J. Park and Y. Yoon, "LED-glove based interactions in multi-modal displays for teleconferencing," in Proc. International Conference on Artificial Reality and Telexistence Workshops (ICAT), pp. 395–399, 2006.
[2] R. Y. Wang and J. Popović, "Real-time hand-tracking with a color glove," ACM Transactions on Graphics, vol. 28, no. 3, pp. 1–8, 2009.
[3] C. Theobalt, I. Albrecht, J. Haber, M. Magnor, and H.-P. Seidel, "Pitching a baseball: tracking high-speed motion with multi-exposure images," in Proc. ACM SIGGRAPH 2004, pp. 540–547, 2004.




[4] P. Dhawale, M. Masoodian, and B. Rogers, "Bare hand 3D gesture input to interactive systems," in Proc. International Conference on Computer-Human Interaction: Design Centered HCI, pp. 25–32, 2006.
[5] J. Segen and S. Kumar, "Human-computer interaction using gesture recognition and 3D hand tracking," in Proc. IEEE International Conference on Image Processing, 1998.
[6] K. Oka, Y. Sato, and H. Koike, "Real-time fingertip tracking and gesture recognition," IEEE Computer Graphics and Applications, vol. 22, pp. 64–71, 2002.
[7] T. Lee and T. Höllerer, "Handy AR: markerless inspection of augmented reality objects using fingertip tracking," in Proc. IEEE International Symposium on Wearable Computers (ISWC), pp. 83–90, October 2007.
[8] P. Breuer, C. Eckes, and S. Müller, "Hand gesture recognition with a novel IR time-of-flight range camera: a pilot study," Computer Vision/Computer Graphics Collaboration Techniques, pp. 247–260, 2007.
[9] Z. Li and R. Jarvis, "Real time hand gesture recognition using a range camera," in Proc. Australasian Conference on Robotics and Automation, 2009.
[10] J. Salvi, J. Pagès, and J. Batlle, "Pattern codification strategies in structured light systems," Pattern Recognition, vol. 37, no. 4, pp. 827–849, 2004.
[11] S. Zhang and P. S. Huang, "High-resolution, real-time three-dimensional shape measurement," Optical Engineering, vol. 45, no. 12, p. 123601, December 2006.
[12] T. P. Koninckx and L. Van Gool, "Real-time range acquisition by adaptive structured light," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 432–445, March 2006.

[13] R. Sagawa, Y. Ota, Y. Yagi, R. Furukawa, and N. Asada, "Dense 3D reconstruction method using a single pattern for fast moving object," in Proc. International Conference on Computer Vision, pp. 1779–1786, 2009.
[14] A. O. Ulusoy, F. Calakli, and G. Taubin, "One-shot scanning using De Bruijn spaced grids," in Proc. 2009 International Workshops on 3D Imaging and Modeling, pp. 1786–1792, 2009.
[15] R. Li and H. Zha, "One-shot scanning using a color stripe pattern," in Proc. 20th International Conference on Pattern Recognition, pp. 1666–1669, 2010.
[16] R. Y. Tsai, "A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses," IEEE Journal of Robotics and Automation, vol. RA-3, no. 4, pp. 323–344, August 1987.
[17] S. Zhang and P. S. Huang, "Novel method for structured light system calibration," Optical Engineering, vol. 45, no. 8, p. 083601, August 2006.
[18] Y. Boykov, O. Veksler, and R. Zabih, "Efficient approximate energy minimization via graph cuts," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1222–1239, November 2001.

