Efficient Annotation and Learning for 3D Hand Pose Estimation: A Survey
Takehiko Ohkawa · Ryosuke Furuta · Yoichi Sato
https://doi.org/10.1007/s11263-023-01856-0
Received: 9 June 2022 / Accepted: 12 July 2023 / Published online: 7 August 2023
© The Author(s) 2023
Abstract
In this survey, we present a systematic review of 3D hand pose estimation from the perspective of efficient annotation and
learning. 3D hand pose estimation has been an important research area owing to its potential to enable various applications,
such as video understanding, AR/VR, and robotics. However, the performance of models is tied to the quality and quantity
of annotated 3D hand poses. Under the status quo, acquiring such annotated 3D hand poses is challenging, e.g., due to
the difficulty of 3D annotation and the presence of occlusion. To reveal this problem, we review the pros and cons of
existing annotation methods classified as manual, synthetic-model-based, hand-sensor-based, and computational approaches.
Additionally, we examine methods for learning 3D hand poses when annotated data are scarce, including self-supervised
pretraining, semi-supervised learning, and domain adaptation. Based on the study of efficient annotation and learning, we
further discuss limitations and possible future directions in this field.
Keywords Hand pose estimation · Efficient annotation · Learning with limited labels
Fig. 2 Formulation and modeling of single-view 3D hand pose estimation. For input, we use either RGB or depth images cropped to the hand region. The model learns to produce a 3D hand pose defined by 3D coordinates. Some works additionally estimate hand shape using a 3D hand template model. For modeling, there are three major designs: A 2D heatmap regression and depth regression, B extended three-dimensional heatmap regression called 2.5D heatmaps, and C direct regression of 3D coordinates
…shape parameters together (Boukhayma et al., 2019; Ge et al., 2019; Mueller et al., 2019; Zhou et al., 2016). In evaluation, the produced prediction is compared with the ground truth, e.g., in the space of world or image coordinates. Two metrics are commonly used: mean per joint position error (MPJPE) in millimeters, and the area under the curve of the percentage of correct keypoints (PCK-AUC).
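To make the two metrics concrete, the following NumPy sketch computes MPJPE and PCK-AUC over batches of root-relative 3D joints. The 21-joint layout and the 0–50 mm threshold range are illustrative assumptions; benchmarks differ in alignment protocol and threshold choices.

```python
# Minimal sketch of the two standard metrics, assuming (N, 21, 3) arrays
# of root-relative joint positions in millimeters.
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error (mm)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pck_auc(pred, gt, thresholds=np.linspace(0.0, 50.0, 101)):
    """Area under the PCK curve: fraction of joints within each threshold."""
    errors = np.linalg.norm(pred - gt, axis=-1).ravel()         # per-joint errors
    pck = np.array([(errors <= t).mean() for t in thresholds])  # PCK at each t
    return float(np.trapz(pck, thresholds) / (thresholds[-1] - thresholds[0]))
```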
Modeling. Classic methods estimate a hand pose by finding the closest sample from a large set of hand poses, e.g., synthetic hand pose sets. Some works formulate the task as nearest neighbor search (Rogez et al., 2015; Romero et al., 2010) while others solve pose classification given predefined hand pose classes and an SVM classifier (Rogez et al., 2014, 2015; Sridhar et al., 2013).
Recent studies have adopted an end-to-end training manner where models learn the correspondence between the input image and its label of the 3D hand pose. Standard single-view methods from an RGB image (Cai et al., 2018; Ge et al., 2019; Zimmermann & Brox, 2017) consist of (A) the estimation of 2D hand poses by heatmap regression and depth regression for each 2D keypoint (see Fig. 2). The 2D keypoints are learned by optimizing heatmaps centered on each 2D hand joint position. An additional regression network predicts the depth distance of detected 2D hand keypoints. Other works use (B) extended 2.5D heatmap regression with a depth-wise heatmap in addition to the 2D heatmaps (Iqbal et al., 2018; Moon et al., 2020), so it does not require a depth regression branch. Depth-based hand pose estimation also utilizes such heatmap regression (Huang et al., 2020; Ren et al., 2019; Xiong et al., 2019). Instead of the heatmap training, other methods learn to (C) directly regress keypoint coordinates (Santavas et al., 2021; Spurr et al., 2018).

For the architecture of the backbone network, CNNs [e.g., ResNet (He et al., 2016)] are a basic choice, while recent Transformer-based methods have been proposed (Hampali et al., 2022; Huang et al., 2020). To generate feasible hand poses, regularization is a key trick in correcting predicted 3D hand poses. Based on the anatomical study of hands, biomechanical constraints are imposed to limit predicted bone lengths and joint angles (Spurr et al., 2020; Chen et al., 2021; Liu et al., 2021).
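As a concrete illustration of design (B), the sketch below decodes a 3D pose from per-joint 2D heatmaps and depth-wise readouts, back-projecting each 2D peak with the camera intrinsics. The shapes and the root-relative depth convention are assumptions for exposition; published 2.5D parametrizations differ in detail.

```python
# Schematic 2.5D decoder: take each joint's 2D heatmap peak, read out a
# root-relative depth there, and back-project with intrinsics K.
import numpy as np

def decode_25d(heatmaps, depth_maps, K, z_root):
    """heatmaps, depth_maps: (J, H, W); K: (3, 3); z_root: root depth in mm."""
    J, H, W = heatmaps.shape
    joints = np.zeros((J, 3))
    for j in range(J):
        v, u = np.unravel_index(np.argmax(heatmaps[j]), (H, W))  # 2D peak (row, col)
        z = z_root + depth_maps[j, v, u]                         # absolute depth
        x = (u - K[0, 2]) * z / K[0, 0]                          # pinhole back-projection
        y = (v - K[1, 2]) * z / K[1, 1]
        joints[j] = (x, y, z)
    return joints
```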
3 Challenges in Dataset Construction

Task formulation and algorithms for estimating 3D hand poses are outlined in Sect. 2. During training, it is necessary to build a large amount of training data with diverse hand poses, viewpoints, and backgrounds. However, obtaining such massive hand data with accurate annotations has been challenging for the following reasons.

Difficulty of 3D annotation. Annotating the 3D position of hand joints from a single RGB image is inherently impossible without any prior information or additional sensors due to an ill-posed condition. To assign accurate hand pose labels, hand-marker-based annotation using magnetic sensors (Garcia-Hernando et al., 2018; Wetzler et al., 2015; Yuan et al., 2017), motion capture systems (Miyata et al., 2004; Schröder et al., 2015; Taheri et al., 2020), or hand gloves (Bianchi et al., 2013; Glauser et al., 2019; Wang & Popovic, 2009) has been studied. These sensors can provide 6-DoF information (i.e., location and orientation) of attached markers and enable us to calculate the coordinates of full hand joints from the tracked markers. However, their setups are …
Table 1 Taxonomy of methods for annotating 3D hand poses

| Annotation | Dataset | Year | Modality | Resolution | #Frame | #Subj./#Obj. | #View | Motion | Obj. pose | Sensor/engine |
|---|---|---|---|---|---|---|---|---|---|---|
| Manual | MSRA (Qian et al., 2014) | 2014 | Depth | 320 × 240 | 2K | 6/– | 1 (3rd) | × | × | Creative Senz3D |
| Manual | Dexter+Object (Sridhar et al., 2016) | 2016 | RGB-D | 640 × 480 | 3K | 2/2 | 1 (3rd) | ✓ | ✓ | Creative Senz3D |
| Manual | EgoDexter (Mueller et al., 2017) | 2017 | RGB-D | 640 × 480 | 3K | 4/– | 1 (1st) | ✓ | × | RealSense SR300 |
| Synthetic | SynthHands (Mueller et al., 2017) | 2017 | RGB-D | 640 × 480 | 220K | –/7 | 5 (1st) | × | × | Unity |
| Synthetic | RHD (Zimmermann & Brox, 2017) | 2017 | RGB | 320 × 320 | 43K | 20/– | 1 (3rd) | × | × | Unity |
| Synthetic | GANerated (Mueller et al., 2018) | 2018 | RGB | 256 × 256 | 331K | –/– | 1 (3rd) | × | × | Unity |
| Synthetic | ObMan (Hasson et al., 2019) | 2019 | RGB-D | 256 × 256 | 154K | 20/3K | 1 (3rd) | × | ✓ | Blender/GraspIt (Miller & Allen, 2005) |
| Synthetic | MVHM (Chen et al., 2021) | 2021 | RGB-D | 256 × 256 | 320K | –/– | 8 (3rd) | × | × | Blender |
| Hand marker | BigHand2.2M (Yuan et al., 2017) | 2017 | Depth | 640 × 480 | 2200K | 10/– | 1 (1st + 3rd) | ✓ | × | NDI trakSTAR |
| Hand marker | FPHA (Garcia-Hernando et al., 2018) | 2018 | RGB-D | 640 × 480 | 105K | 6/4 | 1 (1st) | ✓ | ✓ | NDI trakSTAR |
| Hand marker | GRAB (Taheri et al., 2020) | 2020 | MoCap | – | 1624K | 10/51 | – | ✓ | ✓ | VICON Vantage 16 |
| Computational: model fitting | ICVL (Tang et al., 2014) | 2014 | Depth | 320 × 240 | 180K | 10/– | 1 (3rd) | × | × | Creative Senz3D |
| Computational: model fitting | NYU (Tompson et al., 2014) | 2014 | RGB-D | 640 × 480 | 81K | 2/– | 1 (3rd) | × | × | PrimeSense |
| Computational: model fitting | FreiHAND (Zimmermann et al., 2019) | 2019 | RGB | 224 × 224 | 37K | 32/27 | 8 (3rd) | × | × | Basler ace |
| Computational: model fitting | YouTube3DHands (Kulon et al., 2020) | 2020 | RGB | 640 × 480 | 47K | (109)/– | 1 (3rd) | ✓ | × | – |
| Computational: model fitting | HO-3D (Hampali et al., 2020) | 2020 | RGB-D | 640 × 480 | 103K | 10/10 | 1–5 (3rd) | ✓ | ✓ | – |
| Computational: model fitting | DexYCB (Chao et al., 2021) | 2021 | RGB-D | 640 × 480 | 582K | 10/20 | 8 (3rd) | ✓ | ✓ | RealSense D415 |
| Computational: model fitting | H2O (Kwon et al., 2021) | 2021 | RGB-D | 1280 × 720 | 571K | 4/10 | 5 (1st & 3rd) | ✓ | ✓ | Azure Kinect |
| Computational: triangulation | Panoptic Studio (Simon et al., 2017) | 2017 | RGB | 1920 × 1080 | 15K | –/– | 31 (3rd) | ✓ | × | HD camera |
| Computational: triangulation | InterHand2.6M (Moon et al., 2020) | 2020 | RGB | 512 × 334 | 2590K | 27/– | 80–140 (3rd) | ✓ | × | HD camera |
| Computational: triangulation | AssemblyHands (Ohkawa et al., 2023) | 2023 | RGB/Mono | 636 × 480 | 3030K | 34/– | 12 (1st & 3rd) | ✓ | × | HD camera/Meta Quest |

We categorize the annotation methods as manual, synthetic-model-based, hand-marker-based, and computational annotation
…tions of 3D coordinates (i.e., 2D position and depth) when hand joints are fully visible. However, this approach does not scale to a large number of frames due to the high annotation cost. In addition, since it is not robust to occluded keypoints, this approach only allows fingertip annotation, instead of full hand joints. Because of these limitations, these datasets provide a small amount of data (≈ 3K images) used for evaluation only. Additionally, these single-view datasets can produce view-dependent annotation errors because a single depth camera captures the distance to the hand skin surface, not the true joint position. To reduce such unavoidable errors, subsequent annotation methods based on multi-camera setups provide more accurate annotations (see Sect. 4.4).
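Although the multi-camera pipelines themselves are covered in Sect. 4.4, their core geometric step is easy to state: a keypoint detected in several calibrated views is lifted to 3D by least-squares (DLT) triangulation. The sketch below is a generic version under assumed inputs (3 × 4 projection matrices and 2D detections); real pipelines add outlier rejection and refinement.

```python
# Generic DLT triangulation of one keypoint from V calibrated views.
import numpy as np

def triangulate(projections, points_2d):
    """projections: list of (3, 4) camera matrices; points_2d: (V, 2) pixels."""
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])   # each view adds two linear
        rows.append(v * P[2] - P[1])   # constraints on the 3D point
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]                         # null-space solution (homogeneous)
    return X[:3] / X[3]
```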
4.2 Synthetic-Model-Based Annotation

To acquire large-scale hand images and labels, synthetic methods based on synthetic hand and full-body models (Loper et al. 2015; Rogez et al. 2014; Romero et al. 2017; Šarić 2011) have been proposed. SynthHands (Mueller et al., 2017) and RHD (Zimmermann & Brox, 2017) render synthetic hand images with randomized real backgrounds from either a first- or third-person view. MVHM (Chen et al., 2021) generates multi-view synthetic hand data rendered from eight viewpoints. These datasets have succeeded in providing accurate hand keypoint labels on a large scale. Although they can generate various background patterns inexpensively, the lighting and texture of hands are not well simulated, and the simulation of hand-object interaction is not considered in the data generation process.
To handle these issues, GANerated (Mueller et al., 2018) utilizes GAN-based image translation to stylize synthetic hands more realistically. Furthermore, ObMan (Hasson et al., 2019) simulates the hand-object interaction in data generation using a hand grasp simulator (GraspIt (Miller & Allen, 2005)) with known 3D object models (ShapeNet (Chang et al., 2015)). Ohkawa et al. proposed foreground-aware image stylization to convert the simulation texture in the ObMan data to a more realistic one while separating the hand regions and backgrounds (Ohkawa et al., 2021). Corona et al. attempted to synthesize more natural hand grasps with affordance classification and the refinement of fingertip locations (Corona et al., 2020). However, the ObMan data only provide static hand images with hand-held objects, not hand motion. The hand motion simulation while approaching the object remains an open problem.

4.3 Hand-Marker-Based Annotation

As shown in Fig. 5, hand-marker-based annotation automatically tracks attached hand markers and then calculates the coordinates of hand joints. Initially, Wetzler et al. attached magnetic sensors to fingertips that provide 6-DoF information of the markers (Wetzler et al., 2015). While this scheme can annotate fingertips only, recent datasets, BigHand2.2M (Yuan et al., 2017) and FPHA (Garcia-Hernando et al., 2018), use these sensors to offer the annotation of the full 21 hand joints. Figure 6 shows how to compute the joint positions given six magnetic sensors. It uses inverse kinematics to infer all 21 hand joints, which fits a hand skeleton …

Fig. 5 Illustration of a hand marker setup (Yuan et al., 2017)
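The inverse-kinematics fitting just described can be sketched in miniature: find the joint angles whose forward kinematics reproduces the measured marker positions. The toy below fits a planar two-bone finger to a single fingertip marker with SciPy; the bone lengths and the single chain are illustrative stand-ins for fitting a full 21-joint skeleton to six 6-DoF markers.

```python
# Toy IK fit: recover two joint angles of a planar finger chain from a
# measured fingertip marker, via nonlinear least squares.
import numpy as np
from scipy.optimize import least_squares

BONES = np.array([40.0, 25.0])  # assumed bone lengths in mm

def forward_kinematics(angles):
    """Joint positions of a planar chain rooted at the origin."""
    p, heading, joints = np.zeros(2), 0.0, []
    for length, angle in zip(BONES, angles):
        heading += angle
        p = p + length * np.array([np.cos(heading), np.sin(heading)])
        joints.append(p)
    return np.array(joints)

def fit_angles(marker_xy):
    """Find angles whose fingertip matches the measured marker."""
    residual = lambda a: forward_kinematics(a)[-1] - marker_xy
    return least_squares(residual, x0=np.array([0.3, 0.3])).x

angles = fit_angles(np.array([55.0, 30.0]))  # hypothetical marker reading
```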
…of hand images and how to design embedding techniques. Spurr et al. proposed to geometrically align two features generated from differently augmented instances (Spurr et al., 2021). Zimmermann et al. found that multi-view images representing the same hand pose can be effective pair supervision (Zimmermann et al., 2021).

Other works utilize the scheme of self-supervised learning that solves an auxiliary task, instead of the target task of hand pose estimation. Given the prediction on an unlabeled depth image, some methods render a synthetic depth image and penalize the mismatch between the input image and the one generated from the prediction (Oberweger et al., 2015; Wan et al., 2019). This auxiliary loss by image synthesis is informative even when annotations are scarce.
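A minimal form of this render-and-compare objective can be sketched as follows; the differentiable "renderer" here just splats each predicted joint's depth as a soft blob, a crude stand-in for the model-based synthesizers used in the cited works, so all shapes and names are illustrative.

```python
# Render-and-compare auxiliary loss (sketch): synthesize a coarse depth
# image from the predicted joints and penalize its mismatch with the input.
import torch

def splat_depth(joints_2d, joints_z, size=64, sigma=2.0):
    """Soft depth rendering: each pixel takes a softmax-weighted joint depth."""
    ys, xs = torch.meshgrid(torch.arange(size, dtype=torch.float32),
                            torch.arange(size, dtype=torch.float32),
                            indexing="ij")
    d2 = (xs[None] - joints_2d[:, 0, None, None]) ** 2 \
       + (ys[None] - joints_2d[:, 1, None, None]) ** 2
    w = torch.softmax(-d2.flatten(1) / (2 * sigma ** 2), dim=0).view(-1, size, size)
    return (w * joints_z[:, None, None]).sum(0)          # (H, W) depth image

def render_and_compare_loss(pred_2d, pred_z, observed_depth):
    return torch.nn.functional.l1_loss(splat_depth(pred_2d, pred_z), observed_depth)
```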
5.2 Semi-Supervised Learning

As shown in Fig. 11, semi-supervised learning is used to learn from small labeled data and large unlabeled data simultaneously. Liu et al. proposed a pseudo-labeling method that learns unlabeled instances with pseudo-ground-truth given from the model's prediction (Liu et al., 2021). This pseudo-label training is applied only when the prediction satisfies spatial and temporal constraints. The spatial constraints check the correspondence of a 2D hand pose and the 2D pose projected from the 3D hand pose prediction. In addition, they include a constraint based on biomechanical feasibility, such as bone lengths and joint angles. The temporal constraints indicate the smoothness of hand pose and mesh predictions over time.
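The gating logic behind such constraint-based pseudo-labeling can be summarized in a few lines: a prediction becomes pseudo ground truth only if it passes reprojection, biomechanical, and temporal checks. The thresholds, bone indices, and projection helper below are illustrative stand-ins, not the cited method's settings.

```python
# Sketch of pseudo-label gating with spatial, biomechanical, and temporal checks.
import numpy as np

BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]      # assumed finger-chain indices
BONE_RANGE_MM = (15.0, 120.0)                  # assumed feasible bone lengths

def project(joints_3d, K):
    uvw = joints_3d @ K.T
    return uvw[:, :2] / uvw[:, 2:3]            # perspective division

def accept_pseudo_label(pose_3d, pose_2d, prev_pose_3d, K,
                        tol_2d=5.0, tol_motion=20.0):
    reproj_ok = np.linalg.norm(project(pose_3d, K) - pose_2d, axis=1).max() < tol_2d
    lengths = [np.linalg.norm(pose_3d[i] - pose_3d[j]) for i, j in BONES]
    bio_ok = all(BONE_RANGE_MM[0] < l < BONE_RANGE_MM[1] for l in lengths)
    smooth_ok = np.linalg.norm(pose_3d - prev_pose_3d, axis=1).max() < tol_motion
    return reproj_ok and bio_ok and smooth_ok
```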
Yang et al. proposed the combination of pseudo-labeling and consistency training (Yang et al., 2021). In pseudo-labeling, the generated pseudo-labels are corrected by fitting the hand template model. In addition, the work enforces consistency losses between the predictions of differently augmented instances and between the modalities of 2D hand poses and hand masks.

Spurr et al. applied adversarial training to a sequence of predicted hand poses (Spurr et al., 2021). The encoder network is expected to be improved by fooling a discriminator that distinguishes between plausible and invalid hand poses.

5.3 Domain Adaptation

Domain adaptation aims to improve model performance on target data by learning from labeled source data and target data with limited labels. This line of research has addressed two types of underlying domain gaps: between different datasets and between different modalities.

The former problem between different datasets is a common domain adaptation problem where the source and target data are sampled from two datasets with different image statistics, e.g., sim-to-real adaptation (Jiang et al., 2021; Tang et al., 2013) (see Fig. 12). The model has access to readily available synthetic images with labels and target real images without labels. The latter problem between different modalities is characterized as modality transfer where the source and target data represent the same scene, but their modalities are different, e.g., depth vs. RGB (see Fig. 13). This aims to utilize information-rich source data, e.g., depth images contain 3D structural information, for inferring easily available target data (e.g., RGB images).

Fig. 12 Poor generalization to an unknown domain (Jiang et al., 2021). The models trained on synthetic images (source) exhibit a limited capacity for inferring poses on real images (target)

To reduce the gap between the two datasets, two major approaches have been proposed: generative methods and adversarial methods. In generative methods, Qi et al. proposed an image translation method to alter the synthetic textures to realistic ones and train a model on generated real-like synthetic data (Qi et al., 2020).

Adversarial methods enforce matching two domains' features so that the feature extractor can encode features even from the target domain. However, in addition to the domain gap in an input space (e.g., the difference in backgrounds), the gap in a label space also exists in this task, which is
not assumed in typical adversarial methods (Ganin & Lempitsky, 2015; Tzeng et al., 2017). Zhang et al. developed a feature matching method based on Wasserstein distance and proposed adaptive weighting to enable matching only for features related to hand characteristics, except for label information (Zhang et al., 2020). Jiang et al. utilized an adversarial regressor and optimized the domain disparity by a minimax game (Jiang et al., 2021). Such minimax of disparity is effective in domain adaptation of regression tasks, including hand pose estimation.
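The basic mechanics of such adversarial alignment, in the spirit of Ganin and Lempitsky (2015), fit in a short PyTorch sketch: a gradient-reversal layer lets a domain classifier push the feature extractor toward domain-invariant features. The two small modules are placeholders, not any published architecture.

```python
# Adversarial feature alignment via gradient reversal (sketch).
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None   # reversed gradient flows into the encoder

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU())
domain_head = nn.Linear(128, 2)        # source-vs-target classifier

def domain_loss(images, domain_labels, lam=0.1):
    feats = GradReverse.apply(encoder(images), lam)
    return nn.functional.cross_entropy(domain_head(feats), domain_labels)
```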
As for the modality transfer problem, Yuan et al. and Rad et al. attempted to use depth images as auxiliary information during training and test the model on RGB images (Rad et al., 2018; Yuan et al., 2019). They observed that learned features from depth images could support RGB-based hand pose estimation. Park et al. transferred the knowledge from depth images to infrared (IR) images that have less motion blur (Park et al., 2020). Their training is facilitated by matching two features from paired images, e.g., (RGB, depth) and (depth, IR). Baek et al. newly defined the domain of hand-only images where a hand-held object is removed. The work translates hand-object images to hand-only images by using a GAN and a mesh renderer (Baek et al., 2020). Given a hand-object image with an unknown object, this method can generate hand-only images, from which hand pose estimation is more tractable.
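The paired feature matching used in these modality-transfer works can be illustrated with a toy two-branch setup: on paired inputs, the RGB branch is pulled toward a frozen depth branch, so depth acts as privileged information at training time only. Both encoders are placeholders under stated assumptions.

```python
# Sketch of cross-modal feature matching on paired (RGB, depth) images.
import torch
from torch import nn

rgb_encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                            nn.AdaptiveAvgPool2d(1), nn.Flatten())
depth_encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                              nn.AdaptiveAvgPool2d(1), nn.Flatten())

def modality_matching_loss(rgb, depth):
    with torch.no_grad():              # the depth branch provides the target
        target = depth_encoder(depth)
    return nn.functional.mse_loss(rgb_encoder(rgb), target)

loss = modality_matching_loss(torch.randn(4, 3, 64, 64), torch.randn(4, 1, 64, 64))
```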
6 Future Directions

6.1 Flexible Camera Systems

We believe that hand image capture will feature more flexible camera systems, such as using first-person cameras. To reduce the occlusion effect without the need for hand markers, recently published hand datasets have been acquired by multi-camera setups, e.g., DexYCB (Chao et al., 2021), InterHand2.6M (Moon et al., 2020), and FreiHAND (Zimmermann et al., 2019). These setups are static and not suitable for capturing dynamic user behavior. To address this, a first-person camera attached to the user's head or body is useful because it mostly captures close-up hands even when the user moves around. However, as shown in Table 1, existing first-person benchmarks have a very limited variety due to heavy occlusion, motion blur, and a narrow field-of-view.

One promising direction is a joint camera setup with first-person and third-person cameras, such as H2O (Kwon et al., 2021) and AssemblyHands (Ohkawa et al., 2023). This results in flexibly capturing the user's hands from the first-person camera while taking the benefits of multiple third-person cameras (e.g., mitigating the occlusion effect). However, the first-person camera wearer doesn't always have to be alone. Image capture with multiple first-person camera wearers in a static camera setup will advance the analysis of multi-person cooperation and interaction, e.g., game playing and construction with multiple people.

6.2 Various Types of Activities

We believe that increasing the type of activities is an important direction for generalizing models to various situations with hand-object interaction. A major limitation of existing hand datasets is the narrow variation of users' performing tasks and grasping objects. To avoid object occlusion, some works did not capture hand-object interaction (Moon et al., 2020; Yuan et al., 2017; Zimmermann & Brox, 2017). Others (Chao et al., 2021; Hasson et al., 2019; Hampali et al., 2020) used pre-registered 3D object models (e.g., YCB (Çalli et al., 2015)) to simplify in-hand object pose estimation. User action is also very simple in these benchmarks, such as pick and place.

From an affordance perspective (Hassanin et al., 2021), diversifying the object category will result in increasing hand pose variation. Potential future works will capture goal-oriented and procedural activities that naturally occur in our daily life (Damen et al., 2021; Grauman et al., 2022; Sener et al., 2022), such as cooking, art and craft, and assembly. To enable this, we need to develop portable camera systems and robust annotation methods for complex backgrounds and unknown objects. In addition, occurring hand poses are constrained to the context of the activity. Thus, pose estimators conditioned by actions, objects, or textual descriptions of the scene will improve estimation in various activities.

6.3 Towards Minimal Human Effort

Sections 4 and 5 separately explain efficient annotation and learning. To minimize the effort of human intervention, utilizing findings from both annotation and learning perspectives is one of the promising directions. Feng et al. exploited active learning that optimizes which unlabeled instance should be annotated and semi-supervised learning that jointly utilizes labeled data and large unlabeled data (Feng et al., 2021). However, this method is constrained to triangulation-based 3D pose estimation. As we mentioned in Sect. 4.4, another major computational annotation is model fitting; thus, we still need to consider such a collaborative approach in the annotation based on model fitting.

Zimmermann et al. also proposed a framework of human-in-the-loop annotation that inspects the annotation quality manually while updating annotation networks on the inspected annotations (Zimmermann & Brox, 2017). However, this human check will be a bottleneck in large dataset construction.
The evaluation of annotation quality on the fly is a necessary technique to scale up the combination of annotation and learning.
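One way to picture such an annotation-learning loop is an uncertainty-driven acquisition step: rank unlabeled frames by a model-uncertainty proxy and send only the most uncertain ones to annotators. The entropy-over-heatmaps proxy and the budget k below are illustrative; the cited work couples acquisition with multi-view triangulation.

```python
# Sketch of uncertainty-based sample selection for annotation.
import numpy as np

def heatmap_entropy(heatmaps):
    """Mean per-joint entropy of (J, H, W) non-negative heatmaps."""
    p = np.clip(heatmaps.reshape(heatmaps.shape[0], -1), 1e-12, None)
    p = p / p.sum(axis=1, keepdims=True)
    return float(-(p * np.log(p)).sum(axis=1).mean())

def select_for_annotation(unlabeled_heatmaps, k=100):
    """Pick the k most uncertain frames for manual labeling."""
    scores = np.array([heatmap_entropy(h) for h in unlabeled_heatmaps])
    return np.argsort(-scores)[:k]
```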
References

Corona, E., Pumarola, A., Alenyà, G., Moreno-Noguer, F. & Rogez, G. (2020). GanHand: Predicting human grasp affordances in multi-object scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 5030–5040).

Damen, D., Doughty, H., Farinella, G. M., Furnari, A., Ma, J., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W. & Wray, M. (2021). Rescaling egocentric vision. International Journal of Computer Vision (IJCV), early access.

Doosti, B. (2019). Hand pose estimation: A survey. CoRR, arXiv:1903.01013.

Erol, A., Bebis, G., Nicolescu, M., Boyle, R. D., & Twombly, X. (2007). Vision-based hand pose estimation: A review. Computer Vision and Image Understanding (CVIU), 108(1–2), 52–73.

Feng, Q., He, K., Wen, H., Keskin, C., & Ye, Y. (2021). Active learning with pseudo-labels for multi-view 3D pose estimation. CoRR, arXiv:2112.13709.

Ganin, Y. & Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In Proceedings of the international conference on machine learning (ICML) (pp. 1180–1189).

Garcia-Hernando, G., Yuan, S., Baek, S. & Kim, T.-K. (2018). First-person hand action benchmark with RGB-D videos and 3D hand pose annotations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 409–419).

Ge, L., Ren, Z., Li, Y., Xue, Z., Wang, Y., Cai, J. & Yuan, J. (2019). 3D hand shape and pose estimation from a single RGB image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 10833–10842).

Glauser, O., Wu, S., Panozzo, D., Hilliges, O., & Sorkine-Hornung, O. (2019). Interactive hand pose estimation using a stretch-sensing soft glove. ACM Transactions on Graphics (ToG), 38(4), 41:1–41:15.

Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., Martin, M., Nagarajan, T., Radosavovic, I., Ramakrishnan, S. K., Ryan, F., Sharma, J., Wray, M., Xu, M., Xu, E. Z., Zhao, C., Bansal, S., Batra, D., Cartillier, V., Crane, S., Do, T., Doulaty, M., Erapalli, A., Feichtenhofer, C., Fragomeni, A., Fu, Q., Fuegen, C., Gebreselasie, A., Gonzalez, C., Hillis, J., Huang, X., Huang, Y., Jia, W., Khoo, W., Kolar, J., Kottur, S., Kumar, A., Landini, F., Li, C., Li, Y., Li, Z., Mangalam, K., Modhugu, R., Munro, J., Murrell, T., Nishiyasu, T., Price, W., Puentes, P. R., Ramazanova, M., Sari, L., Somasundaram, K., Southerland, A., Sugano, Y., Tao, R., Vo, M., Wang, Y., Wu, X., Yagi, T., Zhu, Y., Arbelaez, P., Crandall, D., Damen, D., Farinella, G. M., Ghanem, B., Ithapu, V. K., Jawahar, C. V., Joo, H., Kitani, K., Li, H., Newcombe, R., Oliva, A., Park, H. S., Rehg, J. M., Sato, Y., Shi, J., Shou, M. Z., Torralba, A., Torresani, L., Yan, M., & Malik, J. (2022). Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 18973–18990).

Hampali, S., Rad, M., Oberweger, M. & Lepetit, V. (2020). HOnnotate: A method for 3D annotation of hand and object poses. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 3196–3206).

Hampali, S., Sarkar, S. D., Rad, M. & Lepetit, V. (2022). Keypoint transformer: Solving joint identification in challenging hands and object interactions for accurate 3D pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 11080–11090).

Han, S., Liu, B., Cabezas, R., Twigg, C. D., Zhang, P., Petkau, J., Yu, T.-H., Tai, C.-J., Akbay, M., Wang, Z., Nitzan, A., Dong, G., Ye, Y., Tao, L., Wan, C., & Wang, R. (2020). MEgATrack: Monochrome egocentric articulated hand-tracking for virtual reality. ACM Transactions on Graphics (ToG), 39(4), 87.

Han, S., Wu, P.-C., Zhang, Y., Liu, B., Zhang, L., Wang, Z., Si, W., Zhang, P., Cai, Y., Hodan, T., Cabezas, R., Tran, L., Akbay, M., Yu, T.-H., Keskin, C. & Wang, R. (2022). UmeTrack: Unified multi-view end-to-end hand tracking for VR. In Proceedings of the ACM SIGGRAPH Asia conference (pp. 50:1–50:9).

Handa, A., Wyk, K. V., Yang, W., Liang, J., Chao, Y.-W., Wan, Q., Birchfield, S., Ratliff, N. & Fox, D. (2020). DexPilot: Vision-based teleoperation of dexterous robotic hand-arm system. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 9164–9170).

Hassanin, M., Khan, S., & Tahtali, M. (2021). Visual affordance and function understanding: A survey. ACM Computing Surveys, 54(3), 47:1–47:35.

Hasson, Y., Varol, G., Tzionas, D., Kalevatykh, I., Black, M. J., Laptev, I. & Schmid, C. (2019). Learning joint reconstruction of hands and manipulated objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 11807–11816).

He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. B. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 9726–9735).

He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 770–778).

Hidalgo, G., Cao, Z., Simon, T., Wei, S.-E., Raaj, Y., Joo, H. & Sheikh, Y. (2018). OpenPose. https://github.com/CMU-Perceptual-Computing-Lab/openpose

Huang, L., Tan, J., Liu, J., & Yuan, J. (2020). Hand-Transformer: Non-autoregressive structured modeling for 3D hand pose estimation. In Proceedings of the European conference on computer vision (ECCV) (Vol. 12370, pp. 17–33).

Huang, W., Ren, P., Wang, J., Qi, Q. & Sun, H. (2020). AWR: Adaptive weighting regression for 3D hand pose estimation. In Proceedings of the AAAI conference on artificial intelligence (AAAI) (pp. 11061–11068).

Iqbal, U., Garbade, M. & Gall, J. (2017). Pose for action–action for pose. In Proceedings of the IEEE international conference on automatic face & gesture recognition (FG) (pp. 438–445).

Iqbal, U., Molchanov, P., Breuel, T. M., Gall, J. & Kautz, J. (2018). Hand pose estimation via latent 2.5D heatmap regression. In Proceedings of the European conference on computer vision (ECCV) (pp. 125–143).

Iskakov, K., Burkov, E., Lempitsky, V. & Malkov, Y. (2019). Learnable triangulation of human pose. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 7718–7727).

Jiang, J., Ji, Y., Wang, X., Liu, Y., Wang, J. & Long, M. (2021). Regressive domain adaptation for unsupervised keypoint detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 6780–6789).

Kulon, D., Güler, R. A., Kokkinos, I., Bronstein, M. M. & Zafeiriou, S. (2020). Weakly-supervised mesh-convolutional hand reconstruction in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 4989–4999).

Kwon, T., Tekin, B., Stühmer, J., Bogo, F. & Pollefeys, M. (2021). H2O: Two hands manipulating objects for first person interaction recognition. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 10118–10128).

Le, V.-H., & Nguyen, H.-C. (2020). A survey on 3D hand skeleton and pose estimation by convolutional neural network. Advances in Science, Technology and Engineering Systems Journal (ASTES), 5(4), 144–159.

Lepetit, V. (2020). Recent advances in 3D object and hand pose estimation. CoRR, arXiv:2006.05927.

Liang, H., Yuan, J., Thalmann, D. & Magnenat-Thalmann, N. (2015). AR in hand: Egocentric palm pose tracking and gesture recognition for augmented reality applications. In Proceedings of the ACM international conference on multimedia (MM) (pp. 743–744).
Liu, S., Jiang, H., Xu, J., Liu, S. & Wang, X. (2021). Semi-supervised 3D hand-object poses estimation with interactions in time. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 14687–14697).

Liu, Y., Jiang, J. & Sun, J. (2021). Hand pose estimation from RGB images based on deep learning: A survey. In IEEE international conference on virtual reality (ICVR) (pp. 82–89).

Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., & Black, M. J. (2015). SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (ToG), 34(6), 248:1–248:16.

Lu, S., Metaxas, D. N., Samaras, D. & Oliensis, J. (2003). Using multiple cues for hand tracking and model refinement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 443–450).

Mandikal, P. & Grauman, K. (2021). DexVIP: Learning dexterous grasping with human hand pose priors from video. In Proceedings of the conference on robot learning (CoRL) (pp. 651–661).

Melax, S., Keselman, L., & Orsten, S. (2013). Dynamics based 3D skeletal hand tracking. In Proceedings of the graphics interface (GI) (pp. 63–70).

Miller, A., & Allen, P. (2005). GraspIt!: A versatile simulator for robotic grasping. IEEE Robotics and Automation Magazine (RAM), 11, 110–122.

Miyata, N., Kouchi, M., Kurihara, T. & Mochimaru, M. (2004). Modeling of human hand link structure from optical motion capture data. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 2129–2135).

Moon, G., Yu, S.-I., Wen, H., Shiratori, T. & Lee, K. M. (2020). InterHand2.6M: A dataset and baseline for 3D interacting hand pose estimation from a single RGB image. In Proceedings of the European conference on computer vision (ECCV) (pp. 548–564).

Mueller, F., Bernard, F., Sotnychenko, O., Mehta, D., Sridhar, S., Casas, D. & Theobalt, C. (2018). GANerated hands for real-time 3D hand tracking from monocular RGB. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 49–59).

Mueller, F., Davis, M., Bernard, F., Sotnychenko, O., Verschoor, M., Otaduy, M. A., Casas, D., & Theobalt, C. (2019). Real-time pose and shape reconstruction of two interacting hands with a single depth camera. ACM Transactions on Graphics (ToG), 38(4), 49:1–49:13.

Mueller, F., Mehta, D., Sotnychenko, O., Sridhar, S., Casas, D. & Theobalt, C. (2017). Real-time hand tracking under occlusion from an egocentric RGB-D sensor. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 1163–1172).

Oberweger, M., Riegler, G., Wohlhart, P. & Lepetit, V. (2016). Efficiently creating 3D training data for fine hand pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 4957–4965).

Oberweger, M., Wohlhart, P. & Lepetit, V. (2015). Training a feedback loop for hand pose estimation. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 3316–3324).

Ohkawa, T., He, K., Sener, F., Hodan, T., Tran, L. & Keskin, C. (2023). AssemblyHands: Towards egocentric activity understanding via 3D hand pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR).

Ohkawa, T., Li, Y.-J., Fu, Q., Furuta, R., Kitani, K. M. & Sato, Y. (2022). Domain adaptive hand keypoint and pixel localization in the wild. In Proceedings of the European conference on computer vision (ECCV) (pp. 68–87).

Ohkawa, T., Yagi, T., Hashimoto, A., Ushiku, Y., & Sato, Y. (2021). Foreground-aware stylization and consensus pseudo-labeling for domain adaptation of first-person hand segmentation. IEEE Access, 9, 94644–94655.

Oikonomidis, I., Kyriazis, N. & Argyros, A. A. (2011). Efficient model-based 3D tracking of hand articulations using Kinect. In Proceedings of the British machine vision conference (BMVC) (pp. 1–11).

Oikonomidis, I., Kyriazis, N. & Argyros, A. A. (2012). Tracking the articulated motion of two strongly interacting hands. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 1862–1869).

Park, G., Kim, T.-K. & Woo, W. (2020). 3D hand pose estimation with a single infrared camera via domain transfer learning. In Proceedings of the IEEE international symposium on mixed and augmented reality (ISMAR) (pp. 588–599).

Qi, M., Remelli, E., Salzmann, M. & Fua, P. (2020). Unsupervised domain adaptation with temporal-consistent self-training for 3D hand-object joint reconstruction. CoRR, arXiv:2012.11260.

Qian, C., Sun, X., Wei, Y., Tang, X. & Sun, J. (2014). Realtime and robust hand tracking from depth. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 1106–1113).

Qin, Y., Wu, Y.-H., Liu, S., Jiang, H., Yang, R., Fu, Y., & Wang, X. (2022). DexMV: Imitation learning for dexterous manipulation from human videos. In Proceedings of the European conference on computer vision (ECCV) (Vol. 13699, pp. 570–587).

Rad, M., Oberweger, M., & Lepetit, V. (2018). Domain transfer for 3D pose estimation from color images without manual annotations. In Proceedings of the Asian conference on computer vision (ACCV) (Vol. 11365, pp. 69–84).

Ren, P., Sun, H., Qi, Q., Wang, J. & Huang, W. (2019). SRN: Stacked regression network for real-time 3D hand pose estimation. In Proceedings of the British machine vision conference (BMVC).

Rogez, G., Khademi, M., Supancic, J. S., III, Montiel, J. M. M., & Ramanan, D. (2014). 3D hand pose detection in egocentric RGB-D images. In Proceedings of the European conference on computer vision workshops (ECCVW) (Vol. 8925, pp. 356–371).

Rogez, G., Supancic, J. S., III & Ramanan, D. (2015). First-person pose recognition using egocentric workspaces. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 4325–4333).

Rogez, G., Supancic, J. S., III & Ramanan, D. (2015). Understanding everyday hands in action from RGB-D images. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 3889–3897).

Romero, J., Kjellström, H. & Kragic, D. (2010). Hands in action: Real-time 3D reconstruction of hands in interaction with objects. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 458–463).

Romero, J., Tzionas, D., & Black, M. J. (2017). Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (ToG), 36(6), 245:1–245:17.

Santavas, N., Kansizoglou, I., Bampis, L., Karakasis, E. G., & Gasteratos, A. (2021). Attention! A lightweight 2D hand pose estimation approach. IEEE Sensors, 21(10), 11488–11496.

Šarić, M. (2011). Libhand: A library for hand articulation. Version 0.9.

Schröder, M., Maycock, J. & Botsch, M. (2015). Reduced marker layouts for optical motion capture of hands. In Proceedings of the ACM SIGGRAPH conference on motion in games (MIG) (pp. 7–16). ACM.

Sener, F., Chatterjee, D., Shelepov, D., He, K., Singhania, D., Wang, R. & Yao, A. (2022). Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 21096–21106).

Sharp, T., Keskin, C., Robertson, D. P., Taylor, J., Shotton, J., Kim, D., Rhemann, C., Leichter, I., Vinnikov, A., Wei, Y., Freedman, D., Kohli, P., Krupka, E., Fitzgibbon, A. W. & Izadi, S. (2015). Accurate, robust, and flexible real-time hand tracking. In Proceedings of the SIGCHI conference on human factors in computing systems (CHI) (pp. 3633–3642).
Simon, T., Joo, H., Matthews, I. & Sheikh, Y. (2017). Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 4645–4653).

Spurr, A., Dahiya, A., Wang, X., Zhang, X. & Hilliges, O. (2021). Self-supervised 3D hand pose estimation from monocular RGB via contrastive learning. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 11210–11219).

Spurr, A., Iqbal, U., Molchanov, P., Hilliges, O. & Kautz, J. (2020). Weakly supervised 3D hand pose estimation via biomechanical constraints. In Proceedings of the European conference on computer vision (ECCV) (pp. 211–228).

Spurr, A., Molchanov, P., Iqbal, U., Kautz, J. & Hilliges, O. (2021). Adversarial motion modelling helps semi-supervised hand pose estimation. CoRR, arXiv:2106.05954.

Spurr, A., Song, J., Park, S. & Hilliges, O. (2018). Cross-modal deep variational hand pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 89–98).

Sridhar, S., Mueller, F., Zollhoefer, M., Casas, D., Oulasvirta, A. & Theobalt, C. (2016). Real-time joint tracking of a hand manipulating an object from RGB-D input. In Proceedings of the European conference on computer vision (ECCV) (pp. 294–310).

Sridhar, S., Oulasvirta, A. & Theobalt, C. (2013). Interactive markerless articulated hand motion tracking using RGB and depth data. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 2456–2463).

Supancic, J. S., III, Rogez, G., Yang, Y., Shotton, J., & Ramanan, D. (2018). Depth-based hand pose estimation: Methods, data, and challenges. International Journal of Computer Vision (IJCV), 126(11), 1180–1198.

Taheri, O., Ghorbani, N., Black, M. J. & Tzionas, D. (2020). GRAB: A dataset of whole-body human grasping of objects. In Proceedings of the European conference on computer vision (ECCV) (pp. 581–600).

Tang, D., Chang, H. J., Tejani, A. & Kim, T.-K. (2014). Latent regression forest: Structured estimation of 3D articulated hand posture. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 3786–3793).

Tang, D., Yu, T.-H. & Kim, T.-K. (2013). Real-time articulated hand pose estimation using semi-supervised transductive regression forests. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 3224–3231).

Tekin, B., Bogo, F. & Pollefeys, M. (2019). H+O: Unified egocentric recognition of 3D hand-object poses and interactions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 4511–4520).

Tompson, J., Stein, M., LeCun, Y., & Perlin, K. (2014). Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics (ToG), 33(5), 169:1–169:10.

Tzeng, E., Hoffman, J., Saenko, K. & Darrell, T. (2017). Adversarial discriminative domain adaptation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 2962–2971).

Wan, C., Probst, T., Gool, L. V. & Yao, A. (2019). Self-supervised 3D hand pose estimation through training by fitting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 10853–10862).

Wang, R. Y., & Popovic, J. (2009). Real-time hand-tracking with a color glove. ACM Transactions on Graphics (ToG), 28(3), 63.

Wetzler, A., Slossberg, R. & Kimmel, R. (2015). Rule of thumb: Deep derotation for improved fingertip detection. In Proceedings of the British machine vision conference (BMVC) (pp. 33.1–33.12).

Wu, M.-Y., Ting, P.-W., Tang, Y.-H., Chou, E. T., & Fu, L.-C. (2020). Hand pose estimation in object-interaction based on deep learning for virtual reality applications. Journal of Visual Communication and Image Representation, 70, 102802.

Wuu, C., Zheng, N., Ardisson, S., Bali, R., Belko, D., Brockmeyer, E., Evans, L., Godisart, T., Ha, H., Hypes, A., Koska, T., Krenn, S., Lombardi, S., Luo, X., McPhail, K., Millerschoen, L., Perdoch, M., Pitts, M., Richard, A., Saragih, J. M., Saragih, J., Shiratori, T., Simon, T., Stewart, M., Trimble, A., Weng, X., Whitewolf, D., Wu, C., Yu, S. & Sheikh, Y. (2022). Multiface: A dataset for neural face rendering. CoRR, arXiv:2207.11243.

Xiong, F., Zhang, B., Xiao, Y., Cao, Z., Yu, T., Zhou, J. T. & Yuan, J. (2019). A2J: Anchor-to-joint regression network for 3D articulated pose estimation from a single depth image. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 793–802).

Xu, C. & Cheng, L. (2013). Efficient hand pose estimation from a single depth image. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 3456–3462).

Yang, L., Chen, S. & Yao, A. (2021). SemiHand: Semi-supervised hand pose estimation with consistency. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 11364–11373).

Yuan, S., Garcia-Hernando, G., Stenger, B., Moon, G., Chang, J. Y., Lee, K. M., Molchanov, P., Kautz, J., Honari, S., Ge, L., Yuan, J., Chen, X., Wang, G., Yang, F., Akiyama, K., Wu, Y., Wan, Q., Madadi, M., Escalera, S., Li, S., Lee, D., Oikonomidis, I., Argyros, A. A. & Kim, T.-K. (2018). Depth-based 3D hand pose estimation: From current achievements to future goals. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 2636–2645).

Yuan, S., Stenger, B. & Kim, T.-K. (2019). RGB-based 3D hand pose estimation via privileged learning with depth images. In Proceedings of the IEEE/CVF international conference on computer vision workshops (ICCVW).

Yuan, S., Ye, Q., Stenger, B., Jain, S. & Kim, T.-K. (2017). BigHand2.2M benchmark: Hand pose dataset and state of the art analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 2605–2613).

Zhang, Y., Chen, L., Liu, Y., Zheng, W. & Yong, J. (2020). Adaptive Wasserstein hourglass for weakly supervised RGB 3D hand pose estimation. In Proceedings of the ACM international conference on multimedia (MM) (pp. 2076–2084).

Zhou, X., Wan, Q., Zhang, W., Xue, X. & Wei, Y. (2016). Model-based deep hand pose estimation. In Proceedings of the international joint conference on artificial intelligence (IJCAI) (pp. 2421–2427).

Zimmermann, C., Argus, M., & Brox, T. (2021). Contrastive representation learning for hand shape estimation. In Proceedings of the DAGM German conference on pattern recognition (GCPR) (Vol. 13024, pp. 250–264).

Zimmermann, C. & Brox, T. (2017). Learning to estimate 3D hand pose from single RGB images. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 4913–4921).

Zimmermann, C., Ceylan, D., Yang, J., Russell, B., Argus, M. J. & Brox, T. (2019). FreiHAND: A dataset for markerless capture of hand pose and shape from single RGB images. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 813–822).

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.