
International Journal of Computer Vision (2023) 131:3193–3206
https://doi.org/10.1007/s11263-023-01856-0

Efficient Annotation and Learning for 3D Hand Pose Estimation: A Survey

Takehiko Ohkawa · Ryosuke Furuta · Yoichi Sato
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan
Corresponding author: Takehiko Ohkawa (ohkawa-t@iis.u-tokyo.ac.jp)

Received: 9 June 2022 / Accepted: 12 July 2023 / Published online: 7 August 2023
© The Author(s) 2023
Communicated by Dima Damen

Abstract
In this survey, we present a systematic review of 3D hand pose estimation from the perspective of efficient annotation and
learning. 3D hand pose estimation has been an important research area owing to its potential to enable various applications,
such as video understanding, AR/VR, and robotics. However, the performance of models is tied to the quality and quantity
of annotated 3D hand poses. Under the status quo, acquiring such annotated 3D hand poses is challenging, e.g., due to
the difficulty of 3D annotation and the presence of occlusion. To reveal this problem, we review the pros and cons of
existing annotation methods classified as manual, synthetic-model-based, hand-sensor-based, and computational approaches.
Additionally, we examine methods for learning 3D hand poses when annotated data are scarce, including self-supervised
pretraining, semi-supervised learning, and domain adaptation. Based on the study of efficient annotation and learning, we
further discuss limitations and possible future directions in this field.

Keywords Hand pose estimation · Efficient annotation · Learning with limited labels

1 Introduction

The acquisition of 3D hand pose annotations has presented a significant challenge in the study of 3D hand pose estimation. (We denote 3D pose as the 3D keypoint coordinates of hand joints, P_3D ∈ R^{J×3}, where J is the number of joints.) This makes it difficult to construct large training datasets and develop models for various target applications, such as hand-object interaction analysis (Boukhayma et al., 2019; Hampali et al., 2020), pose-based action recognition (Iqbal et al., 2017; Tekin et al., 2019; Sener et al., 2022), augmented and virtual reality (Liang et al., 2015; Han et al., 2022; Wu et al., 2020), and robot learning from human demonstration (Ciocarlie & Allen, 2009; Handa et al., 2020; Qin et al., 2022; Mandikal & Grauman, 2021). In these application scenarios, we must consider methods for annotating hand data, and select an appropriate learning method according to the amount and quality of the annotations. However, there is currently no established methodology that can give annotations efficiently and learn even from imperfect annotations. This motivates us to review methods for building training datasets and developing models in the presence of these challenges in the annotation process.

During the annotations, we encounter several obstacles including the difficulty of 3D measurement, occlusion, and dataset bias. As for the first obstacle, annotating 3D points from a single RGB image is an ill-posed problem. While annotation methods using hand markers, depth sensors, or multi-view cameras can provide 3D positional labels, these setups require a controlled environment, which limits available scenarios. As for the second obstacle, occlusion hinders annotators from accurately localizing the positions of hand joints. As for the third obstacle, annotated data are biased to a specific condition constrained by the annotation method. For instance, annotation methods based on hand markers or multi-view setups are usually installed in laboratory settings, resulting in a bias toward a limited variety of backgrounds and interacting objects.


Fig. 1 Our survey on 3D hand pose estimation is organized from two aspects: (i) obtaining 3D hand pose annotation and (ii) learning even with a limited amount of annotated data. These two issues will be considered in the scenarios of practical applications where we work on dataset construction and model development with limited resources. The figure is adapted from Zimmermann and Brox (2017)

Given such challenges in annotation, we conduct a systematic review of the literature on 3D hand pose estimation from two distinct perspectives: efficient annotation and efficient learning (see Fig. 1). The former view highlights how existing methods assign reasonable annotations in a cost-effective way, covering a range of topics: the availability and quality of annotations and the limitations when deploying the annotation methods. The latter view focuses on how models can be developed in scenarios where annotation setups cannot be implemented or available annotations are insufficient.

In contrast to existing surveys on network architecture and modeling (Chatzis et al., 2020; Doosti, 2019; Le & Nguyen, 2020; Lepetit, 2020; Liu et al., 2021), our survey delves into another fundamental direction that arises from the annotation issues, namely, dataset construction with cost-effective annotation and model development with limited resources. In particular, our survey includes benchmarks, datasets, image capture setups, automatic annotation, learning with limited labels, and transfer learning. Finally, we discuss potential future directions of this field beyond the current state of the art.

For the study of annotation, we categorize existing methods into manual (Chao et al., 2021; Mueller et al., 2017; Sridhar et al., 2016), synthetic-model-based (Chen et al., 2021; Hasson et al., 2019; Mueller et al., 2017, 2018; Zimmermann & Brox, 2017), hand-marker-based (Garcia-Hernando et al., 2018; Taheri et al., 2020; Yuan et al., 2017), and computational approaches (Hampali et al., 2020; Kulon et al., 2020; Kwon et al., 2021; Moon et al., 2020; Simon et al., 2017; Zimmermann et al., 2019). While manual annotation requires querying human annotators, hand markers automate the annotation process by tracking sensors attached to a hand. Synthetic methods utilize computer graphics engines to render plausible hand images with precise keypoint coordinates. Computational methods assign labels by fitting a hand template model to the observed data or using multi-view geometry. We find these annotation methods have their own constraints, such as the necessity of human effort, the sim-to-real gap, the changes in hand appearance, and the limited portability of the camera setups. Thus, these annotation methods may not always be adopted for every application.

Due to the problems and constraints of each annotation method, we need to consider how to develop models even when we do not have enough annotations. Therefore, learning with a small amount of labels is another important topic. For learning from limited annotated data, leveraging a large pool of unlabeled hand images as well as labeled images is a primary interest, e.g., in self-supervised pretraining, semi-supervised learning, and domain adaptation. Self-supervised pretraining encourages the hand pose estimator to learn from unlabeled hand images, so it enables building a strong feature extractor before performing supervised learning. While semi-supervised learning trains the estimator with labeled and unlabeled hand images collected from the same environment, domain adaptation further solves the so-called problem of domain gap between the two image sets, e.g., the difference between synthetic data and real data.

The rest of this survey is organized as follows. In Sect. 2, we introduce the formulation and modeling of 3D hand pose estimation. In Sect. 3, we present open challenges in the construction of hand pose datasets involving depth measurement, occlusion, and dataset bias. In Sect. 4, we cover existing methods of 3D hand pose annotation, namely manual, synthetic-model-based, hand-marker-based, and computational approaches. In Sect. 5, we provide learning methods from a limited amount of annotated data, namely self-supervised pretraining, semi-supervised learning, and domain adaptation. In Sect. 6, we finally show promising future directions of 3D hand pose estimation.

2 Overview of 3D Hand Pose Estimation

Fig. 2 Formulation and modeling of single-view 3D hand pose estimation. For input, we use either RGB or depth images cropped to the hand region. The model learns to produce a 3D hand pose defined by 3D coordinates. Some works additionally estimate hand shape using a 3D hand template model. For modeling, there are three major designs; A 2D heatmap regression and depth regression, B extended three-dimensional heatmap regression called 2.5D heatmaps, and C direct regression of 3D coordinates

Task setting. As shown in Fig. 2, 3D hand pose estimation is typically formulated as the estimation from a monocular RGB/depth image (Erol et al., 2007; Supancic et al., 2018; Yuan et al., 2018). The output is parameterized by the hand joint positions with 14, 16, or 21 keypoints, which are introduced in Tompson et al. (2014), Tang et al. (2014), and Qian et al. (2014), respectively. The dense representation of 21 hand joints (five end keypoints are fingertips, not strictly called joints) has been popularly used as it contains more precise information about hand structure. For a single RGB image in which depth and scale are ambiguous, the 3D coordinates of the hand joint relative to the hand root are estimated from a scale-normalized hand image (Cai et al., 2018; Ge et al., 2019; Zimmermann & Brox, 2017).
Recent works additionally estimate hand shape by regressing 3D hand pose and shape parameters together (Boukhayma et al., 2019; Ge et al., 2019; Mueller et al., 2019; Zhou et al., 2016). In evaluation, produced prediction is compared with ground truth, e.g., in the space of world or image coordinates. These two metrics are often used: mean per joint position error (MPJPE) in millimeters, and area under curve of percentage of correct keypoints (PCK-AUC).
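As a concrete reference for these two metrics, the following sketch (ours, not taken from the cited works) computes MPJPE and a normalized PCK-AUC for a single 21-joint prediction; the 0–50 mm threshold range is an assumption chosen only for illustration.

```python
# Minimal sketch of the two common evaluation metrics, assuming predictions
# and ground truth are root-relative 3D keypoints given in millimeters.
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error in mm; pred, gt: (J, 3) arrays."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pck_auc(pred, gt, thresholds_mm=np.linspace(0, 50, 101)):
    """Area under the PCK curve, normalized to [0, 1]."""
    errors = np.linalg.norm(pred - gt, axis=-1)              # per-joint error (J,)
    pck = [(errors <= t).mean() for t in thresholds_mm]      # PCK at each threshold
    return np.trapz(pck, thresholds_mm) / thresholds_mm[-1]  # normalized AUC

# Example with J = 21 joints and synthetic noise.
gt = np.random.rand(21, 3) * 100
pred = gt + np.random.randn(21, 3) * 5
print(mpjpe(pred, gt), pck_auc(pred, gt))
```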
Modeling. Classic methods estimate a hand pose by finding the closest sample from a large set of hand poses, e.g., synthetic hand pose sets. Some works formulate the task as nearest neighbor search (Rogez et al., 2015; Romero et al., 2010) while others solve pose classification given predefined hand pose classes and a SVM classifier (Rogez et al., 2014, 2015; Sridhar et al., 2013).

Recent studies have adopted an end-to-end training manner where models learn the correspondence between the input image and its label of the 3D hand pose. Standard single-view methods from an RGB image (Cai et al., 2018; Ge et al., 2019; Zimmermann & Brox, 2017) consist of (A) the estimation of 2D hand poses by heatmap regression and depth regression for each 2D keypoint (see Fig. 2). The 2D keypoints are learned by optimizing heatmaps centered on each 2D hand joint position. An additional regression network predicts the depth distance of detected 2D hand keypoints. Other works use (B) extended 2.5D heatmap regression with a depth-wise heatmap in addition to the 2D heatmaps (Iqbal et al., 2018; Moon et al., 2020), so it does not require a depth regression branch. Depth-based hand pose estimation also utilizes such heatmap regression (Huang et al., 2020; Ren et al., 2019; Xiong et al., 2019). Instead of the heatmap training, other methods learn to (C) directly regress keypoint coordinates (Santavas et al., 2021; Spurr et al., 2018).
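To make design A more tangible, the minimal sketch below (our illustration, with an assumed 64 × 64 heatmap resolution and Gaussian width) builds a target heatmap for one joint and decodes a predicted heatmap back to a 2D position with a differentiable expectation; the per-joint depth would be predicted by a separate regression branch as described above.

```python
# Sketch of Gaussian heatmap targets and expectation-based decoding.
import numpy as np

def gaussian_heatmap(uv, size=64, sigma=2.0):
    """Target heatmap with a Gaussian centered at the 2D joint position uv = (x, y)."""
    xs, ys = np.meshgrid(np.arange(size), np.arange(size))
    d2 = (xs - uv[0]) ** 2 + (ys - uv[1]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def expected_position(heatmap):
    """Decode the expected 2D position from a (normalized) heatmap."""
    prob = heatmap / (heatmap.sum() + 1e-8)
    xs, ys = np.meshgrid(np.arange(heatmap.shape[1]), np.arange(heatmap.shape[0]))
    return np.array([(prob * xs).sum(), (prob * ys).sum()])

hm = gaussian_heatmap(np.array([20.0, 40.0]))
print(expected_position(hm))  # approximately (20, 40)
```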
For the architecture of the backbone network, CNNs [e.g., ResNet (He et al., 2016)] are a basic choice while recent Transformer-based methods have been proposed (Hampali et al., 2022; Huang et al., 2020). To generate feasible hand poses, regularization is a key trick in correcting predicted 3D hand poses. Based on the anatomical study of hands, bio-mechanical constraints are imposed to limit predicted bone lengths and joint angles (Spurr et al., 2020; Chen et al., 2021; Liu et al., 2021).
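A bone-length regularizer of this kind can be written compactly; the sketch below is illustrative only, and the bone pairs and permissible length range are assumptions rather than the values used in the cited works.

```python
# Hedged sketch of a bone-length penalty on predicted 3D joints.
import torch

# (parent, child) index pairs defining bones of a 21-joint skeleton (wrist = 0);
# only the thumb chain is listed here for brevity.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]
MIN_LEN, MAX_LEN = 0.015, 0.08  # assumed plausible bone lengths in meters

def bone_length_penalty(joints):
    """joints: (B, 21, 3) predicted 3D joints; returns a scalar penalty."""
    loss = 0.0
    for p, c in BONES:
        length = (joints[:, c] - joints[:, p]).norm(dim=-1)
        loss = loss + torch.relu(MIN_LEN - length).mean() \
                    + torch.relu(length - MAX_LEN).mean()
    return loss
```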
3 Challenges in Dataset Construction

Task formulation and algorithms for estimating 3D hand poses are outlined in Sect. 2. During training, it is necessary to build a large amount of training data with diverse hand poses, viewpoints, and backgrounds. However, obtaining such massive hand data with accurate annotations has been challenging for the following reasons.


Difficulty of 3D annotation. Annotating the 3D position of hand joints from a single RGB image is inherently impossible without any prior information or additional sensors due to an ill-posed condition. To assign accurate hand pose labels, hand-marker-based annotation using magnetic sensors (Garcia-Hernando et al., 2018; Wetzler et al., 2015; Yuan et al., 2017), motion capture systems (Miyata et al., 2004; Schröder et al., 2015; Taheri et al., 2020), or hand gloves (Bianchi et al., 2013; Glauser et al., 2019; Wang & Popovic, 2009) has been studied. These sensors can provide 6-DoF information (i.e., location and orientation) of attached markers and enable us to calculate the coordinates of full hand joints from the tracked markers. However, their setups are expensive and need good calibration, which constrains available scenarios.

On the contrary, depth sensors (e.g., RealSense) or multi-view camera studios (Chao et al., 2021; Hampali et al., 2020; Moon et al., 2020; Simon et al., 2017; Zimmermann et al., 2019) make it possible to obtain depth information near hand regions. Given 2D keypoints for an image, these setups enable annotation of 3D hand poses by measuring the depth distance at each 2D keypoint. However, these annotation methods do not always produce satisfactory 3D annotations, e.g., due to an occlusion problem (detailed in the next section). In addition, depth images are significantly affected by the sensor noise, such as unknown depth values in some regions and ghost shadows around object boundaries (Xu & Cheng, 2013). Due to the limited depth distance that depth cameras can capture, the depth measurement becomes inaccurate when the hands are far from the sensor.
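For reference, lifting an annotated 2D keypoint to a 3D point with a depth sensor reduces to pinhole back-projection; the snippet below is a generic sketch, and the intrinsic parameters are placeholder values rather than those of any particular camera.

```python
# Back-projection of a pixel (u, v) with measured depth to 3D camera coordinates.
import numpy as np

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Return the 3D point (meters, camera frame) for pixel (u, v) at depth depth_m."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Example with assumed intrinsics (focal lengths and principal point in pixels).
print(backproject(320, 240, 0.45, fx=615.0, fy=615.0, cx=320.0, cy=240.0))
```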
Fig. 3 Difficulty of hand pose annotation in a single RGB image (Simon et al., 2017). Occlusion of hand joints is caused by a articulation, b viewpoint bias, and c grasping objects

Occlusion. Hand images often contain complex occlusions that distract human annotators from localizing hand keypoints. Examples of possible occlusions are shown in Fig. 3. In figure (a), articulation causes a self-occlusion that makes some hand joints (e.g., fingertips) invisible due to the overlap with the other parts of the hand. In figure (b), such self-occlusion depends on a specific camera viewpoint. In figure (c), hand-held objects induce occlusion that hides the hand joints by the object during the interaction.

To address this issue, hand-marker-based tracking (Garcia-Hernando et al., 2018; Taheri et al., 2020; Wetzler et al., 2015; Yuan et al., 2017) and multi-view camera studios (Chao et al., 2021; Hampali et al., 2020; Moon et al., 2020; Simon et al., 2017; Zimmermann et al., 2019) have been studied. The hand markers offer 6-DoF information during these occlusions, so the hand-marker-based annotation is robust to the occlusion. For multi-camera settings, the effect of occlusion can be reduced when many cameras are densely arranged.

Dataset bias. While hands are a common entity in various image capture settings, the category of objects, including hand-held objects (i.e., foregrounds) and backgrounds, is potentially diverse. In order to improve the generalization ability of hand pose estimators, hand images must be annotated under various imaging conditions (e.g., lighting, viewpoints, hand poses, and backgrounds). However, it is challenging to create such large and diverse datasets nowadays due to the aforementioned problems. Rather, existing hand pose datasets exhibit a bias to a particular imaging condition constrained by the annotation method.

Fig. 4 Example of major data collection setups. The synthetic image on the left (ObMan (Hasson et al., 2019)) can be generated inexpensively, but they exhibit unrealistic hand texture. The hand markers on the middle (FPHA (Garcia-Hernando et al., 2018)) enable automatic tracking of hand joints, although the markers distort the appearance of hands. The in-lab setup on the right (DexYCB (Chao et al., 2021)) uses a black background to make it easier to recognize hands and objects, but it limits data variation in environments

As shown in Fig. 4, generating data using synthetic models (Chen et al., 2021; Hasson et al., 2019; Mueller et al., 2017, 2018; Zimmermann & Brox, 2017) is cost-effective, but it creates unrealistic hand texture (Ohkawa et al., 2021). Although the hand-marker-based annotation (Garcia-Hernando et al., 2018; Taheri et al., 2020; Wetzler et al., 2015; Yuan et al., 2017) can automatically track the hand joints from the information of hand sensors, the sensors distort the hand appearance and hinder the natural hand movement. In-lab data acquired by multi-camera setups (Chao et al., 2021; Hampali et al., 2020; Moon et al., 2020; Simon et al., 2017; Zimmermann et al., 2019) make the annotation easier because they can reduce the occlusion effect. However, the variations in environments (e.g., backgrounds and interacting objects) are limited because the setups are not easily portable.

4 Annotation Methods

Given the above challenges concerning the construction of hand pose datasets, we review existing 3D hand pose datasets in terms of annotation design. As shown in Table 1, we categorize the annotation methods as manual, synthetic-model-based, hand-marker-based, and computational approaches. We then study the pros and cons of each annotation method in Table 2.

Table 1 Taxonomy of methods for annotating 3D hand poses

Annotation | Dataset | Year | Modality | Resolution | #Frame | #Subj./#Obj. | #View | Motion | Obj. pose | Sensor/engine
Manual | MSRA (Qian et al., 2014) | 2014 | Depth | 320 × 240 | 2K | 6/– | 1 (3rd) | × | × | Creative Senz3D
 | Dexter+Object (Sridhar et al., 2016) | 2016 | RGB-D | 640 × 480 | 3K | 2/2 | 1 (3rd) | ✓ | ✓ | Creative Senz3D
 | EgoDexter (Mueller et al., 2017) | 2017 | RGB-D | 640 × 480 | 3K | 4/– | 1 (1st) | ✓ | × | RealSense SR300
Synthetic | SynthHands (Mueller et al., 2017) | 2017 | RGB-D | 640 × 480 | 220K | –/7 | 5 (1st) | × | × | Unity
 | RHD (Zimmermann & Brox, 2017) | 2017 | RGB | 320 × 320 | 43K | 20/– | 1 (3rd) | × | × | Unity
 | GANerated (Mueller et al., 2018) | 2018 | RGB | 256 × 256 | 331K | –/– | 1 (3rd) | × | × | Unity
 | ObMan (Hasson et al., 2019) | 2019 | RGB-D | 256 × 256 | 154K | 20/3K | 1 (3rd) | × | ✓ | Blender/GraspIt (Miller & Allen, 2005)
 | MVHM (Chen et al., 2021) | 2021 | RGB-D | 256 × 256 | 320K | –/– | 8 (3rd) | × | × | Blender
Hand marker | BigHand2.2M (Yuan et al., 2017) | 2017 | Depth | 640 × 480 | 2200K | 10/– | 1 (1st + 3rd) | ✓ | × | NDI trakSTAR
 | FPHA (Garcia-Hernando et al., 2018) | 2018 | RGB-D | 640 × 480 | 105K | 6/4 | 1 (1st) | ✓ | ✓ | NDI trakSTAR
 | GRAB (Taheri et al., 2020) | 2020 | MoCap | – | 1624K | 10/51 | – | ✓ | ✓ | VICON Vantage 16
Computational (model fitting) | ICVL (Tang et al., 2014) | 2014 | Depth | 320 × 240 | 180K | 10/– | 1 (3rd) | × | × | Creative Senz3D
 | NYU (Tompson et al., 2014) | 2014 | RGB-D | 640 × 480 | 81K | 2/– | 1 (3rd) | × | × | PrimeSense
 | FreiHAND (Zimmermann et al., 2019) | 2019 | RGB | 224 × 224 | 37K | 32/27 | 8 (3rd) | × | × | Basler ace
 | YouTube3DHands (Kulon et al., 2020) | 2020 | RGB | 640 × 480 | 47K | (109)/– | 1 (3rd) | ✓ | × | –
 | HO-3D (Hampali et al., 2020) | 2020 | RGB-D | 640 × 480 | 103K | 10/10 | 1–5 (3rd) | ✓ | ✓ |
 | DexYCB (Chao et al., 2021) | 2021 | RGB-D | 640 × 480 | 582K | 10/20 | 8 (3rd) | ✓ | ✓ | RealSense D415
 | H2O (Kwon et al., 2021) | 2021 | RGB-D | 1280 × 720 | 571K | 4/10 | 5 (1st & 3rd) | ✓ | ✓ | Azure Kinect
Computational (triangulation) | Panoptic Studio (Simon et al., 2017) | 2017 | RGB | 1920 × 1080 | 15K | –/– | 31 (3rd) | ✓ | × | HD camera
 | InterHand2.6M (Moon et al., 2020) | 2020 | RGB | 512 × 334 | 2590K | 27/– | 80–140 (3rd) | ✓ | × | HD camera
 | AssemblyHands (Ohkawa et al., 2023) | 2023 | RGB/Mono | 636 × 480 | 3030K | 34/– | 12 (1st & 3rd) | ✓ | × | HD camera/Meta Quest

We categorize the annotation methods as manual, synthetic-model-based, hand-marker-based, and computational annotation
Table 2 Pros and cons of each annotation approach

Annotation | Pros | Cons
Manual | Reasonable accuracy | Labor intensive; hard to address occlusion
Synthetic | Large scale; high diversity; low cost | Sim-to-real gap; hard to simulate motion
Hand marker | Robust to occlusion; low annotation cost | Requires special sensors; changes visual modality; prevents natural motion
Computational | Natural motion; low annotation cost | Lacks diversity; hard to evaluate quality; needs multi-camera setups

4.1 Manual Annotation

MSRA (Qian et al., 2014), Dexter+Object (Sridhar et al., 2016), and EgoDexter (Mueller et al., 2017) manually annotate 2D hand keypoints on the depth images and determine the depth distance from the depth value of the images on the 2D point. This method enables assigning reasonable annotations of 3D coordinates (i.e., 2D position and depth) when hand joints are fully visible.

However, it is not extensively available according to the number of frames due to the high annotation cost. In addition, since it is not robust for occluded keypoints, this approach only allows fingertip annotation, instead of full hand joints. For these limitations, these datasets provide a small amount of data (≈ 3K images) used for evaluation only. Additionally, these single-view datasets can produce view-dependent annotation errors because a single-depth camera captures the distance to the hand skin surface, not the true joint position. To reduce such unavoidable errors, subsequent annotation methods based on multi-camera setups provide further accurate annotations (see Sect. 4.4).

4.2 Synthetic-Model-Based Annotation

To acquire large-scale hand images and labels, synthetic methods based on synthetic hand and full-body models (Loper et al. 2015; Rogez et al. 2014; Romero et al. 2017; Šarić 2011) have been proposed. SynthHands (Mueller et al., 2017) and RHD (Zimmermann & Brox, 2017) render synthetic hand images with randomized real backgrounds from either a first- or third-person view. MVHM (Chen et al., 2021) generates multi-view synthetic hand data rendered from eight viewpoints. These datasets have succeeded in providing accurate hand keypoint labels on a large scale. Although they can generate various background patterns inexpensively, the lighting and texture of hands are not well simulated, and the simulation of hand-object interaction is not considered in the data generation process.

To handle these issues, GANerated (Mueller et al., 2018) utilizes GAN-based image translation to stylize synthetic hands more realistically. Furthermore, ObMan (Hasson et al., 2019) simulates the hand-object interaction in data generation using a hand grasp simulator (Graspit (Miller & Allen, 2005)) with known 3D object models (ShapeNet (Chang et al., 2015)). Ohkawa et al. proposed foreground-aware image stylization to convert the simulation texture in the ObMan data to a more realistic one while separating the hand regions and backgrounds (Ohkawa et al., 2021). Corona et al. attempted to synthesize more natural hand grasps with affordance classification and the refinement of fingertip locations (Corona et al., 2020). However, the ObMan data only provide static hand images with hand-held objects, not hand motion. The hand motion simulation while approaching the object remains an open problem.

4.3 Hand-Marker-Based Annotation

Fig. 5 Illustration of a hand marker setup (Yuan et al., 2017)

As shown in Fig. 5, hand-marker-based annotation automatically tracks attached hand markers and then calculates the coordinates of hand joints. Initially, Wetzler et al. attached magnetic sensors to fingertips that provide 6-DoF information of the markers (Wetzler et al., 2015). While this scheme can annotate fingertips only, recent datasets, BigHand2.2M (Yuan et al., 2017) and FPHA (Garcia-Hernando et al., 2018), use these sensors to offer the annotation of the full 21 hand joints. Figure 6 shows how to compute the joint positions given six magnetic sensors. It uses inverse kinematics to infer all 21 hand joints, which fits a hand skeleton with the constraints of the marker positions and user-specific bone length manually measured beforehand.

Fig. 6 Calculation of joint positions from tracked markers (Yuan et al., 2017). S_i denotes the position of the markers, and W, M_i, P_i, D_i, and T_i are the positions of hand joints listed from the wrist to the fingertips

However, these sensors obstruct natural hand movement and distort the appearance of the hand. Due to the changes in hand appearance, these datasets have been proposed for the benchmark of depth-based estimation, not the RGB-based task. On the contrary, GRAB (Taheri et al., 2020) is built with a motion capture system for human hands and body, but it does not possess visual modality, e.g., RGB images.

4.4 Computational Annotation

Computational annotation is categorized into two major approaches: hand model fitting and triangulation. Unlike hand-marker-based annotation, these methods can capture natural hand motion without attaching hand markers.

Model fitting (depth). Early works of computational annotation utilize model fitting on depth images (Supancic et al., 2018; Yuan et al., 2018). Since a depth image provides 3D structural information, their works fit a 3D hand model, from which joint positions can be obtained, to the depth image. ICVL (Tang et al., 2014) fits a convex rigid body model by solving a linear complementary problem with physical constraints (Melax et al., 2013). NYU (Tompson et al., 2014) uses a hand model defined by spheres and cylinders and formulates the model fitting as a kind of particle swarm optimization (Oikonomidis et al., 2011, 2012). The use of other cues for the model fitting is also studied (Ballan et al., 2012; Lu et al., 2003), such as edges, optical flow, shading, and collisions. Sharp et al. paint hands to obtain hand part labels by color segmentation on RGB images and the proxy cue of hand parts further helps the depth-based model fitting (Sharp et al., 2015).

Using these depth datasets, several more accurate labeling methods have been proposed. Rogez et al. gave manual annotation to a few joints and searched the closest 3D pose from a pool of synthetic hand pose data (Rogez et al., 2014). Oberweger et al. considered model fitting with temporal coherence (Oberweger et al., 2016). This method selects reference frames from a depth video and asks annotators for manual labeling. Model fitting is done separately for annotated reference frames and unlabeled non-reference frames. Finally, all sequential poses are optimized to satisfy temporal smoothness.

Fig. 7 Illustration of a multi-camera setup (Zimmermann et al., 2019)

Fig. 8 Illustration of a many-camera setup (Wuu et al., 2022). This setup has about 100 synchronized cameras and is used to create the InterHand2.6M dataset (Moon et al., 2020)

Triangulation (RGB). For the annotation of RGB images, a multi-camera studio is often used to compute 3D points by multi-view geometry, i.e., triangulation (see Fig. 7). Panoptic Studio (Simon et al., 2017) and InterHand2.6M (Moon et al., 2020) triangulate a 3D hand pose from multiple 2D hand keypoints provided by an open source library, OpenPose (Hidalgo et al., 2018), or human annotators. The generated 3D hand pose is reprojected onto the image planes of other cameras to annotate hand images with novel viewpoints. This multi-view annotation scheme is beneficial when many cameras are installed (see Fig. 8). For instance, the InterHand2.6M manually annotates keypoints from 6 views and reprojects the triangulated points to the other many views (100+). This setup can produce over 100 training images for every single annotation. The InterHand2.6M has million-scale training data.
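The point-level scheme can be summarized in a few lines of linear algebra; the following sketch (ours, assuming known 3 × 4 projection matrices per camera) triangulates one joint from several 2D detections via the direct linear transform and reprojects it into a new view for label propagation.

```python
# Direct linear transform (DLT) triangulation and reprojection of one joint.
import numpy as np

def triangulate(points_2d, proj_mats):
    """points_2d: list of (u, v); proj_mats: list of (3, 4) camera matrices."""
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])      # two linear constraints per view
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]                            # homogeneous solution
    return X[:3] / X[3]

def reproject(X, P):
    """Project a 3D point into a new view to annotate its image plane."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]
```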


This point-level triangulation method works quite well when many cameras (30+) are arranged (Moon et al., 2020; Simon et al., 2017). However, the AssemblyHands setup (Ohkawa et al., 2023) has only eight static cameras, and then the predicted 2D keypoints to be triangulated tend to be suboptimal due to hand-object occlusion during the assembly task. To improve the accuracy of triangulation in such sparse camera settings, Ohkawa et al. adopt multi-view aggregation of encoded features by the 2D keypoint detector and compute 3D coordinates from constructed 3D volumetric features (Bartol et al., 2022; Iskakov et al., 2019; Ohkawa et al., 2023; Zimmermann et al., 2019). This feature-level triangulation provides better accuracy than the point-level method, achieving an average keypoint error of 4.20 mm, which is 85% lower than the error of the original annotations in Assembly101 (Sener et al., 2022).

Model fitting (RGB). Model fitting is also used in RGB-based pose annotation. FreiHAND (Zimmermann et al., 2019, 2021) utilizes a 3D hand template (MANO (Romero et al., 2017)) fitting to multi-view hand images with sparse 2D keypoint annotation. The dataset increases the variation of training images by randomly synthesizing the background and using captured real hands as the foreground. YouTube3DHands (Kulon et al., 2020) uses the MANO model fitting to estimated 2D hand poses in YouTube videos. HO-3D (Hampali et al., 2020), DexYCB (Chao et al., 2021), and H2O (Kwon et al., 2021) jointly annotate 3D hand and object poses to facilitate a better understanding of hand-object interaction. Using estimated or manually annotated 2D keypoints, their datasets fit the MANO model and 3D object models to the hand images with objects.
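Fitting-based annotation of this kind is essentially an optimization of hand model parameters against 2D evidence. The sketch below is only schematic: mano_layer stands for any differentiable hand model returning 21 joints (its interface here is an assumption, not the API of a specific library), and project is a user-supplied differentiable camera projection.

```python
# Schematic fitting of hand model parameters to sparse 2D keypoint annotations.
import torch

def fit_to_2d(mano_layer, kps_2d, project, steps=200):
    """kps_2d: (21, 2) annotated keypoints; project maps (1, 21, 3) -> (1, 21, 2)."""
    pose = torch.zeros(1, 48, requires_grad=True)   # axis-angle pose parameters (assumed size)
    shape = torch.zeros(1, 10, requires_grad=True)  # shape coefficients (assumed size)
    opt = torch.optim.Adam([pose, shape], lr=0.01)
    for _ in range(steps):
        joints_3d = mano_layer(pose, shape)         # (1, 21, 3) model joints
        loss = ((project(joints_3d) - kps_2d) ** 2).mean()  # reprojection error
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pose.detach(), shape.detach()
```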
While most methods capture hands from static third-person cameras, H2O and AssemblyHands install first-person cameras that are synchronized with static third-person cameras (see Fig. 9). With camera calibration and head-mounted camera tracking, such camera systems can offer 3D hand pose annotations for first-person images by projecting annotated keypoints from third-person cameras onto first-person image planes. This reduces the cost of annotating first-person images, which is considered expensive because the image distribution changes drastically over time and the hands are sometimes out of view.

Fig. 9 Synchronized multi-camera setup with first-person and third-person cameras (Kwon et al., 2021)

These computational methods can generate labels with little human effort, although the camera system itself is costly. However, assessing the quality of the labels is still difficult. In fact, the annotation quality depends on the number of cameras and their arrangement, the accuracy of hand detection and the estimation of 2D hand poses, and the performance of triangulation and fitting algorithms.

5 Learning with Limited Labels

As explained in Sect. 4, existing annotation methods have certain pros and cons. Since perfect annotation in terms of amount and quality cannot be assumed, training 3D hand pose estimators with limited annotated data is another important study. Accordingly, we introduce learning methods using unlabeled data in this section, namely self-supervised pretraining, semi-supervised learning, and domain adaptation.

5.1 Self-Supervised Pretraining and Learning

Fig. 10 Self-supervised pretraining of 3D hand pose estimation (Zimmermann et al., 2021). The pretraining phase (step 1) aims to construct an improved encoder network by using many unlabeled data before supervised learning (step 2). The work uses MoCo (He et al., 2020) as a method of self-supervised learning

Self-supervised pretraining aims to utilize massive unlabeled hand images and build an improved encoder network before supervised learning with labeled images. As shown in Fig. 10, recent works (Spurr et al., 2021; Zimmermann et al., 2021) first pretrain an encoder network that extracts image features by using contrastive learning [e.g., MoCo (He et al., 2020) and SimCLR (Chen et al., 2020)] and then fine-tune the whole network in a supervised manner. The core idea of contrastive learning is to push a pair of similar instances closer together in an embedding space while unrelated instances are pushed apart. This approach focuses on how to define the similarity of hand images and how to design embedding techniques. Spurr et al. proposed to geometrically align two features generated from differently augmented instances (Spurr et al., 2021). Zimmermann et al. found that multi-view images representing the same hand pose can be effective pair supervision (Zimmermann et al., 2021).
Other works utilize the scheme of self-supervised learning that solves an auxiliary task, instead of the target task of hand pose estimation. Given the prediction on an unlabeled depth image, some works (Oberweger et al., 2015; Wan et al., 2019) render a synthetic depth image and penalize the matching between the input image and the one generated from the prediction. This auxiliary loss by image synthesis is informative even when annotations are scarce.

5.2 Semi-Supervised Learning

Fig. 11 Semi-supervised learning of 3D hand pose estimation (Liu et al., 2021). The model is trained jointly on annotated data and unlabeled data with pseudo-labels

As shown in Fig. 11, semi-supervised learning is used to learn from small labeled data and large unlabeled data simultaneously. Liu et al. proposed a pseudo-labeling method that learns unlabeled instances with pseudo-ground-truth given from the model's prediction (Liu et al., 2021). This pseudo-label training is applied only when its prediction satisfies spatial and temporal constraints. The spatial constraints check the correspondence of a 2D hand pose and the 2D pose projected from 3D hand pose prediction. In addition, they include a constraint based on bio-mechanical feasibility, such as bone lengths and joint angles. The temporal constraints indicate the smoothness of hand pose and mesh predictions over time.
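A toy version of such a filtered pseudo-labeling step is sketched below; the reprojection-agreement test is a simplified stand-in for the spatial constraint described above, and the pixel tolerance is an assumed value rather than one reported by the cited work.

```python
# Accept a prediction on an unlabeled image as a pseudo-label only if the
# projected 3D pose agrees with the independently predicted 2D pose.
import numpy as np

def accept_pseudo_label(pose_3d, pose_2d, project, tol_px=5.0):
    """pose_3d: (J, 3); pose_2d: (J, 2); project: function mapping one 3D point to 2D."""
    reproj = np.stack([project(p) for p in pose_3d])       # (J, 2) reprojected joints
    err = np.linalg.norm(reproj - pose_2d, axis=-1).mean()
    return err < tol_px  # keep the sample only when the two predictions agree
```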
Yang et al. proposed the combination of pseudo-labeling and consistency training (Yang et al., 2021). In pseudo-labeling, the generated pseudo-labels are corrected by fitting the hand template model. In addition, the work enforces consistency losses between the predictions of differently augmented instances and between the modalities of 2D hand poses and hand masks.

Spurr et al. applied adversarial training to a sequence of predicted hand poses (Spurr et al., 2021). The encoder network is expected to be improved by fooling a discriminator that distinguishes between plausible and invalid hand poses.

5.3 Domain Adaptation

Domain adaptation aims to improve model performance on target data by learning from labeled source data and target data with limited labels. This study has addressed two types of underlying domain gaps: between different datasets and between different modalities.

Fig. 12 Poor generalization to an unknown domain (Jiang et al., 2021). The models trained on synthetic images (source) exhibit a limited capacity for inferring poses on real images (target)

Fig. 13 Example of modality transfer. During training, RGB and depth images are accessible and RGB images are given in the test phase. The training aims to utilize the support of depth information to improve RGB-based hand pose estimation

The former problem between different datasets is a common domain adaptation problem where the source and target data are sampled from two datasets with different image statistics, e.g., sim-to-real adaptation (Jiang et al., 2021; Tang et al., 2013) (see Fig. 12). The model has access to readily available synthetic images with labels and target real images without labels. The latter problem between different modalities is characterized as modality transfer where the source and target data represent the same scene, but their modalities are different, e.g., depth vs. RGB (see Fig. 13). This aims to utilize information-rich source data, e.g., depth images contain 3D structural information, for inferring easily available target data (e.g., RGB images).

To reduce the gap between the two datasets, two major approaches have been proposed: generative methods and adversarial methods. In generative methods, Qi et al. proposed an image translation method to alter the synthetic textures to realistic ones and train a model on generated real-like synthetic data (Qi et al., 2020).


Adversarial methods enforce matching two domains' features so that the feature extractor can encode features even from the target domain. However, in addition to the domain gap in an input space (e.g., the difference in backgrounds), the gap in a label space also exists in this task, which is not assumed in typical adversarial methods (Ganin & Lempitsky, 2015; Tzeng et al., 2017). Zhang et al. developed a feature matching method based on Wasserstein distance and proposed adaptive weighting to enable matching only for features related to hand characteristics, except for label information (Zhang et al., 2020). Jiang et al. utilized an adversarial regressor and optimized the domain disparity by a minimax game (Jiang et al., 2021). Such minimax of disparity is effective in domain adaptation of regression tasks, including hand pose estimation.
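Generic adversarial feature alignment of this kind can be sketched as follows; the discriminator architecture is an illustrative assumption, separate optimizers are used for the encoder and discriminator in practice, and the cited methods additionally handle the label-space gap, which this sketch omits.

```python
# Sketch of adversarial feature alignment between a labeled source domain
# and an unlabeled target domain.
import torch
import torch.nn as nn

feat_dim = 256  # assumed feature dimension of the pose estimator's encoder
discriminator = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

def domain_losses(f_source, f_target):
    """f_source, f_target: (B, feat_dim) features from the shared encoder."""
    # Discriminator term: separate source (1) from target (0); features are
    # detached so this term only updates the discriminator.
    logits = torch.cat([discriminator(f_source.detach()), discriminator(f_target.detach())])
    labels = torch.cat([torch.ones(len(f_source), 1), torch.zeros(len(f_target), 1)])
    d_loss = bce(logits, labels)
    # Encoder term: fool the discriminator on target features.
    enc_loss = bce(discriminator(f_target), torch.ones(len(f_target), 1))
    return d_loss, enc_loss
```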
As for the modality transfer problem, Yuan et al. and Rad et al. attempted to use depth images as the auxiliary information during training and test the model on RGB images (Rad et al., 2018; Yuan et al., 2019). They observed that learned features from depth images could support RGB-based hand pose estimation. Park et al. transferred the knowledge from depth images to infrared (IR) images that have less motion blur (Park et al., 2020). Their training is facilitated by matching two features from paired images, e.g., (RGB, depth) and (depth, IR). Baek et al. newly defined the domain of hand-only images where a hand-held object is removed. The work translates hand-object images to hand-only images by using GAN and mesh renderer (Baek et al., 2020). Given a hand-object image with an unknown object, this method can generate hand-only images, from which hand pose estimation is more tractable.

6 Future Directions

6.1 Flexible Camera Systems

We believe that hand image capture will feature more flexible camera systems, such as using first-person cameras. To reduce the occlusion effect without the need for hand markers, recently published hand datasets have been acquired by multi-camera setups, e.g., DexYCB (Chao et al., 2021), InterHand2.6M (Moon et al., 2020), and FreiHAND (Zimmermann et al., 2019). These setups are static and not suitable for capturing dynamic user behavior. To address this, a first-person camera attached to the user's head or body is useful because it mostly captures close-up hands even when the user moves around. However, as shown in Table 1, existing first-person benchmarks have a very limited variety due to heavy occlusion, motion blur, and a narrow field-of-view.

One promising direction is a joint camera setup with first-person and third-person cameras, such as H2O (Kwon et al., 2021) and AssemblyHands (Ohkawa et al., 2023). This results in flexibly capturing the user's hands from the first-person camera while taking the benefits of multiple third-person cameras (e.g., mitigating the occlusion effect). However, the first-person camera wearer doesn't always have to be alone. Image capture with multiple first-person camera wearers in a static camera setup will advance the analysis of multi-person cooperation and interaction, e.g., game playing and construction with multiple people.

6.2 Various Types of Activities

We believe that increasing the type of activities is an important direction for generalizing models to various situations with hand-object interaction. A major limitation of existing hand datasets is the narrow variation of users' performing tasks and grasping objects. To avoid object occlusion, some works did not capture hand-object interaction (Moon et al., 2020; Yuan et al., 2017; Zimmermann & Brox, 2017). Others (Chao et al., 2021; Hasson et al., 2019; Hampali et al., 2020) used pre-registered 3D object models (e.g., YCB (Çalli et al., 2015)) to simplify in-hand object pose estimation. User action is also very simple in these benchmarks, such as pick and place.

From an affordance perspective (Hassanin et al., 2021), diversifying the object category will result in increasing hand pose variation. Potential future works will capture goal-oriented and procedural activities that naturally occur in our daily life (Damen et al., 2021; Grauman et al., 2022; Sener et al., 2022), such as cooking, art and craft, and assembly.

To enable this, we need to develop portable camera systems and robust annotation methods for complex backgrounds and unknown objects. In addition, occurring hand poses are constrained to the context of the activity. Thus, pose estimators conditioned by actions, objects, or textual descriptions of the scene will improve estimation in various activities.

6.3 Towards Minimal Human Effort

Sections 4 and 5 separately explain efficient annotation and learning. To minimize the effort of human intervention, utilizing findings from both annotation and learning perspectives is one of the promising directions. Feng et al. exploited active learning that optimizes which unlabeled instance should be annotated and semi-supervised learning that jointly utilizes labeled data and large unlabeled data (Feng et al., 2021). However, this method is constrained to triangulation-based 3D pose estimation. As we mentioned in Sect. 4.4, another major computational annotation is model fitting; thus, we still need to consider such a collaborative approach in the annotation based on model fitting.

Zimmermann et al. also proposed a framework of human-in-loop annotation that inspects the annotation quality manually while updating annotation networks on the inspected annotations (Zimmermann & Brox, 2017). However, this human check will be a bottleneck in large dataset construction. The evaluation of annotation quality on the fly is a necessary technique to scale up the combination of annotation and learning.


6.4 Generalization and Adaptation

Increasing the generalization ability across different datasets or adapting models to a specific domain is a remaining issue. The bias of existing training datasets hinders the estimators from inferring test images captured under very different imaging conditions. In fact, as reported in Han et al. (2020); Zimmermann et al. (2019), models trained on existing hand pose datasets poorly generalize to other datasets. For real-world applications (e.g., AR), it is crucial to transfer models from indoor hand datasets to outdoor videos because common multi-camera setups are not available outdoors (Ohkawa et al., 2022). Thus, aggregating multiple annotated yet biased datasets for generalization and robustly adapting to very different environments are important future tasks.

7 Summary

We presented the survey of 3D hand pose estimation from the standpoint of efficient annotation and learning. We provided a comprehensive overview of this task and modeling, and open challenges during dataset construction. We investigated annotation methods categorized as manual, synthetic-model-based, hand-marker-based, and computational approaches, and examined their respective strengths and weaknesses. In addition, we studied learning methods that can be applied even when annotations are scarce, namely self-supervised pretraining, semi-supervised learning, and domain adaptation. Finally, we discussed potential future advancements in 3D hand pose estimation, including next-generation camera setups, increased object and action variation, jointly optimized annotation and learning techniques, and generalization and adaptation.

Acknowledgements This work was supported by JST ACT-X Grant Number JPMJAX2007, JSPS KAKENHI Grant Number JP22KJ0999, and JST AIP Acceleration Research Grant Number JPMJCR20U1, Japan.

Funding Open access funding provided by The University of Tokyo.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Baek, S., Kim, K. I., & Kim, T.-K. (2020). Weakly-supervised domain adaptation via GAN and mesh model for estimating 3d hand poses interacting objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 6120–6130).
Ballan, L., Taneja, A., Gall, J., Gool, L. V., & Pollefeys, M. (2012). Motion capture of hands in action using discriminative salient points. In Proceedings of the European conference on computer vision (ECCV) (Vol. 7577, pp. 640–653).
Bartol, K., Bojanić, D., Petković, T., & Pribanić, T. (2022). Generalizable human pose triangulation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 11018–11027).
Bianchi, M., Salaris, P., & Bicchi, A. (2013). Synergy-based hand pose sensing: Optimal glove design. The International Journal of Robotics Research (IJRR), 32(4), 407–424.
Boukhayma, A., de Bem, R., & Torr, P. H. S. (2019). 3D hand shape and pose from images in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 10843–10852).
Cai, Y., Ge, L., Cai, J., & Yuan, J. (2018). Weakly-supervised 3D hand pose estimation from monocular RGB images. In Proceedings of the European conference on computer vision (ECCV) (pp. 678–694).
Çalli, B., Walsman, A., Singh, A., Srinivasa, S. S., Abbeel, P., & Dollar, A. M. (2015). Benchmarking in manipulation research: Using the Yale-CMU-Berkeley object and model set. IEEE Robotics Automation Magazine, 22(3), 36–52.
Chang, A. X., Funkhouser, T. A., Guibas, L. J., Hanrahan, P., Huang, Q.-X., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., & Yu, F. (2015). ShapeNet: An information-rich 3d model repository. CoRR, arXiv:1512.03012
Chao, Y.-W., Yang, W., Xiang, Y., Molchanov, P., Handa, A., Tremblay, J., Narang, Y. S., Van Wyk, K., Iqbal, U., Birchfield, S., Kautz, J., & Fox, D. (2021). DexYCB: A benchmark for capturing hand grasping of objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 9044–9053).
Chatzis, T., Stergioulas, A., Konstantinidis, D., Dimitropoulos, K., & Daras, P. (2020). A comprehensive study on deep learning-based 3d hand pose estimation methods. Applied Sciences, 10, 6850.
Chen, L., Lin, S.-Y., Xie, Y., Lin, Y.-Y., & Xie, X. (2021). MVHM: A large-scale multi-view hand mesh benchmark for accurate 3d hand pose estimation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (WACV) (pp. 836–845).
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. E. (2020). A simple framework for contrastive learning of visual representations. In Proceedings of the international conference on machine learning (ICML) (Vol. 119, pp. 1597–1607).
Chen, Y., Tu, Z., Kang, D., Bao, L., Zhang, Y., Zhe, X., Chen, R., & Yuan, J. (2021). Model-based 3d hand reconstruction via self-supervised learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 10451–10460).
Ciocarlie, M. T., & Allen, P. K. (2009). Hand posture subspaces for dexterous robotic grasping. The International Journal of Robotics Research (IJRR), 28(7), 851–867.
Corona, E., Pumarola, A., Alenyà, G., Moreno-Noguer, F., & Rogez, G. (2020). GanHand: Predicting human grasp affordances in multi-object scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 5030–5040).


Damen, D., Doughty, H., Farinella, G. M., Furnari, A., Ma, J., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., & Wray, M. (2021). Rescaling egocentric vision. International Journal of Computer Vision (IJCV), early access.
Doosti, B. (2019). Hand pose estimation: A survey. CoRR, arXiv:1903.01013
Erol, A., Bebis, G., Nicolescu, M., Boyle, R. D., & Twombly, X. (2007). Vision-based hand pose estimation: A review. Computer Vision and Image Understanding (CVIU), 108(1–2), 52–73.
Feng, Q., He, K., Wen, H., Keskin, C., & Ye, Y. (2021). Active learning with pseudo-labels for multi-view 3d pose estimation. CoRR, arXiv:2112.13709
Ganin, Y., & Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In Proceedings of the international conference on machine learning (ICML) (pp. 1180–1189).
Garcia-Hernando, G., Yuan, S., Baek, S., & Kim, T.-K. (2018). First-person hand action benchmark with RGB-D videos and 3D hand pose annotations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 409–419).
Ge, L., Ren, Z., Li, Y., Xue, Z., Wang, Y., Cai, J., & Yuan, J. (2019). 3D hand shape and pose estimation from a single RGB image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 10833–10842).
Glauser, O., Wu, S., Panozzo, D., Hilliges, O., & Sorkine-Hornung, O. (2019). Interactive hand pose estimation using a stretch-sensing soft glove. ACM Transactions on Graphics (ToG), 38(4), 41:1-41:15.
Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., et al. (2022). Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 18973–18990).
Hampali, S., Rad, M., Oberweger, M., & Lepetit, V. (2020). Honnotate: A method for 3D annotation of hand and object poses. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 3196–3206).
Hampali, S., Sarkar, S. D., Rad, M., & Lepetit, V. (2022). Keypoint transformer: Solving joint identification in challenging hands and object interactions for accurate 3d pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 11080–11090).
Han, S., Liu, B., Cabezas, R., Twigg, C. D., Zhang, P., Petkau, J., Yu, T.-H., Tai, C.-J., Akbay, M., Wang, Z., Nitzan, A., Dong, G., Ye, Y., Tao, L., Wan, C., & Wang, R. (2020). MEgATrack: Monochrome egocentric articulated hand-tracking for virtual reality. ACM Transactions on Graphics (ToG), 39(4), 87.
Han, S., Wu, P.-C., Zhang, Y., Liu, B., Zhang, L., Wang, Z., Si, W., Zhang, P., Cai, Y., Hodan, T., Cabezas, R., Tran, L., Akbay, M., Yu, T.-H., Keskin, C., & Wang, R. (2022). UmeTrack: Unified multi-view end-to-end hand tracking for VR. In Proceedings of the ACM SIGGRAPH Asia conference (pp. 50:1–50:9).
Handa, A., Wyk, K. V., Yang, W., Liang, J., Chao, Y.-W., Wan, Q., Birchfield, S., Ratliff, N., & Fox, D. (2020). DexPilot: Vision-based teleoperation of dexterous robotic hand-arm system. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 9164–9170).
Hassanin, M., Khan, S., & Tahtali, M. (2021). Visual affordance and function understanding: A survey. ACM Computing Survey, 54(3), 47:1-47:35.
Hasson, Y., Varol, G., Tzionas, D., Kalevatykh, I., Black, M. J., Laptev, I., & Schmid, C. (2019). Learning joint reconstruction of hands and manipulated objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 11807–11816).
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. B. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 9726–9735).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 770–778).
Hidalgo, G., Cao, Z., Simon, T., Wei, S.-E., Raaj, Y., Joo, H., & Sheikh, Y. (2018). OpenPose. https://github.com/CMU-Perceptual-Computing-Lab/openpose
Huang, L., Tan, J., Liu, J., & Yuan, J. (2020). Hand-transformer: Non-autoregressive structured modeling for 3d hand pose estimation. In Proceedings of the European conference on computer vision (ECCV) (Vol. 12370, pp. 17–33).
Huang, W., Ren, P., Wang, J., Qi, Q., & Sun, H. (2020). AWR: Adaptive weighting regression for 3D hand pose estimation. In Proceedings of the AAAI conference on artificial intelligence (AAAI) (pp. 11061–11068).
Iqbal, U., Garbade, M., & Gall, J. (2017). Pose for action–action for pose. In Proceedings of the IEEE international conference on automatic face & gesture recognition, FG (pp. 438–445).
Iqbal, U., Molchanov, P., Breuel, T. M., Gall, J., & Kautz, J. (2018). Hand pose estimation via latent 2.5D heatmap regression. In Proceedings of the European conference on computer vision (ECCV) (pp. 125–143).
Iskakov, K., Burkov, E., Lempitsky, V., & Malkov, Y. (2019). Learnable triangulation of human pose. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 7718–7727).
Jiang, J., Ji, Y., Wang, X., Liu, Y., Wang, J., & Long, M. (2021). Regressive domain adaptation for unsupervised keypoint detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 6780–6789).
Kulon, D., Güler, R. A., Kokkinos, I., Bronstein, M. M., & Zafeiriou, S. (2020). Weakly-supervised mesh-convolutional hand reconstruction in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 4989–4999).
Kwon, T., Tekin, B., Stühmer, J., Bogo, F., & Pollefeys, M. (2021). H2O: Two hands manipulating objects for first person interaction recognition. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 10118–10128).
Le, V.-H., & Nguyen, H.-C. (2020). A survey on 3d hand skeleton and pose estimation by convolutional neural network. Advances in Science, Technology and Engineering Systems Journal (ASTES), 5(4), 144–159.
Lepetit, V. (2020). Recent advances in 3d object and hand pose estimation. CoRR, arXiv:2006.05927
Liang, H., Yuan, J., Thalmann, D., & Magnenat-Thalmann, N. (2015). AR in hand: Egocentric palm pose tracking and gesture recognition for augmented reality applications. In Proceedings of the ACM international conference on multimedia (MM) (pp. 743–744).


Liu, S., Jiang, H., Xu, J., Liu, S. & Wang, X. (2021). Semi-supervised 3D hand-object poses estimation with interactions in time. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 14687–14697).
Liu, Y., Jiang, J. & Sun, J. (2021). Hand pose estimation from RGB images based on deep learning: A survey. In IEEE international conference on virtual reality (ICVR) (pp. 82–89).
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., & Black, M. J. (2015). SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (ToG), 34(6), 248:1-248:16.
Lu, S., Metaxas, D. N., Samaras, D. & Oliensis, J. (2003). Using multiple cues for hand tracking and model refinement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 443–450).
Mandikal, P. & Grauman, K. (2021). DexVIP: Learning dexterous grasping with human hand pose priors from video. In Proceedings of the conference on robot learning (CoRL) (pp. 651–661).
Melax, S., Keselman, L., & Orsten, S. (2013). Dynamics based 3D skeletal hand tracking. In Proceedings of the graphics interface (GI) (pp. 63–70).
Miller, A., & Allen, P. (2005). Graspit!: A versatile simulator for robotic grasping. IEEE Robotics and Automation Magazine (RAM), 11, 110–122.
Miyata, N., Kouchi, M., Kurihara, T. & Mochimaru, M. (2004). Modeling of human hand link structure from optical motion capture data. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 2129–2135).
Moon, G., Yu, S.-I., Wen, H., Shiratori, T. & Lee, K. M. (2020). InterHand2.6M: A dataset and baseline for 3D interacting hand pose estimation from a single RGB image. In Proceedings of the European conference on computer vision (ECCV) (pp. 548–564).
Mueller, F., Bernard, F., Sotnychenko, O., Mehta, D., Sridhar, S., Casas, D. & Theobalt, C. (2018). GANerated Hands for real-time 3D hand tracking from monocular RGB. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 49–59).
Mueller, F., Davis, M., Bernard, F., Sotnychenko, O., Verschoor, M., Otaduy, M. A., Casas, D., & Theobalt, C. (2019). Real-time pose and shape reconstruction of two interacting hands with a single depth camera. ACM Transactions on Graphics (ToG), 38(4), 49:1-49:13.
Mueller, F., Mehta, D., Sotnychenko, O., Sridhar, S., Casas, D. & Theobalt, C. (2017). Real-time hand tracking under occlusion from an egocentric RGB-D sensor. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 1163–1172).
Oberweger, M., Riegler, G., Wohlhart, P. & Lepetit, V. (2016). Efficiently creating 3D training data for fine hand pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 4957–4965).
Oberweger, M., Wohlhart, P. & Lepetit, V. (2015). Training a feedback loop for hand pose estimation. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 3316–3324).
Ohkawa, T., He, K., Sener, F., Hodan, T., Tran, L. & Keskin, C. (2023). AssemblyHands: Towards egocentric activity understanding via 3D hand pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR).
Ohkawa, T., Li, Y.-J., Fu, Q., Furuta, R., Kitani, K. M. & Sato, Y. (2022). Domain adaptive hand keypoint and pixel localization in the wild. In Proceedings of the European conference on computer vision (ECCV) (pp. 68–87).
Ohkawa, T., Yagi, T., Hashimoto, A., Ushiku, Y., & Sato, Y. (2021). Foreground-aware stylization and consensus pseudo-labeling for domain adaptation of first-person hand segmentation. IEEE Access, 9, 94644–94655.
Oikonomidis, I., Kyriazis, N. & Argyros, A. A. (2011). Efficient model-based 3D tracking of hand articulations using Kinect. In Proceedings of the British machine vision conference (BMVC) (pp. 1–11).
Oikonomidis, I., Kyriazis, N. & Argyros, A. A. (2012). Tracking the articulated motion of two strongly interacting hands. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 1862–1869).
Park, G., Kim, T.-K. & Woo, W. (2020). 3D hand pose estimation with a single infrared camera via domain transfer learning. In Proceedings of the IEEE international symposium on mixed and augmented reality (ISMAR) (pp. 588–599).
Qi, M., Remelli, E., Salzmann, M. & Fua, P. (2020). Unsupervised domain adaptation with temporal-consistent self-training for 3D hand-object joint reconstruction. CoRR, arXiv:2012.11260.
Qian, C., Sun, X., Wei, Y., Tang, X. & Sun, J. (2014). Realtime and robust hand tracking from depth. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 1106–1113).
Qin, Y., Wu, Y.-H., Liu, S., Jiang, H., Yang, R., Fu, Y., & Wang, X. (2022). DexMV: Imitation learning for dexterous manipulation from human videos. In Proceedings of the European conference on computer vision (ECCV) (Vol. 13699, pp. 570–587).
Rad, M., Oberweger, M., & Lepetit, V. (2018). Domain transfer for 3D pose estimation from color images without manual annotations. In Proceedings of the Asian conference on computer vision (ACCV) (Vol. 11365, pp. 69–84).
Ren, P., Sun, H., Qi, Q., Wang, J. & Huang, W. (2019). SRN: Stacked regression network for real-time 3D hand pose estimation. In Proceedings of the British machine vision conference (BMVC).
Rogez, G., Khademi, M., Supancic, J. S., III., Montiel, J. M. M., & Ramanan, D. (2014). 3D hand pose detection in egocentric RGB-D images. In Proceedings of the European conference on computer vision workshops (ECCVW) (Vol. 8925, pp. 356–371).
Rogez, G., Supancic III, J. S. & Ramanan, D. (2015). First-person pose recognition using egocentric workspaces. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 4325–4333).
Rogez, G., Supancic III, J. S. & Ramanan, D. (2015). Understanding everyday hands in action from RGB-D images. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 3889–3897).
Romero, J., Kjellström, H. & Kragic, D. (2010). Hands in action: Real-time 3D reconstruction of hands in interaction with objects. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 458–463).
Romero, J., Tzionas, D., & Black, M. J. (2017). Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (ToG), 36(6), 245:1-245:17.
Santavas, N., Kansizoglou, I., Bampis, L., Karakasis, E. G., & Gasteratos, A. (2021). Attention! A lightweight 2D hand pose estimation approach. IEEE Sensors, 21(10), 11488–11496.
Šarić, M. (2011). Libhand: A library for hand articulation. Version 0.9.
Schröder, M., Maycock, J. & Botsch, M. (2015). Reduced marker layouts for optical motion capture of hands. In Proceedings of the ACM SIGGRAPH conference on motion in games (MIG) (pp. 7–16). ACM.
Sener, F., Chatterjee, D., Shelepov, D., He, K., Singhania, D., Wang, R. & Yao, A. (2022). Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 21096–21106).
Sharp, T., Keskin, C., Robertson, D. P., Taylor, J., Shotton, J., Kim, D., Rhemann, C., Leichter, I., Vinnikov, A., Wei, Y., Freedman, D., Kohli, P., Krupka, E., Fitzgibbon, A. W. & Izadi, S. (2015). Accurate, robust, and flexible real-time hand tracking. In Proceedings of the SIGCHI conference on human factors in computing systems (CHI) (pp. 3633–3642).
Simon, T., Joo, H., Matthews, I. & Sheikh, Y. (2017). Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 4645–4653).
Spurr, A., Dahiya, A., Wang, X., Zhang, X. & Hilliges, O. (2021). Self-supervised 3D hand pose estimation from monocular RGB via contrastive learning. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 11210–11219).
Spurr, A., Iqbal, U., Molchanov, P., Hilliges, O. & Kautz, J. (2020). Weakly supervised 3D hand pose estimation via biomechanical constraints. In Proceedings of the European conference on computer vision (ECCV) (pp. 211–228).
Spurr, A., Molchanov, P., Iqbal, U., Kautz, J. & Hilliges, O. (2021). Adversarial motion modelling helps semi-supervised hand pose estimation. CoRR, arXiv:2106.05954.
Spurr, A., Song, J., Park, S. & Hilliges, O. (2018). Cross-modal deep variational hand pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 89–98).
Sridhar, S., Mueller, F., Zollhoefer, M., Casas, D., Oulasvirta, A. & Theobalt, C. (2016). Real-time joint tracking of a hand manipulating an object from RGB-D input. In Proceedings of the European conference on computer vision (ECCV) (pp. 294–310).
Sridhar, S., Oulasvirta, A. & Theobalt, C. (2013). Interactive markerless articulated hand motion tracking using RGB and depth data. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 2456–2463).
Supancic, J. S., III., Rogez, G., Yang, Y., Shotton, J., & Ramanan, D. (2018). Depth-based hand pose estimation: Methods, data, and challenges. International Journal of Computer Vision (IJCV), 126(11), 1180–1198.
Taheri, O., Ghorbani, N., Black, M. J. & Tzionas, D. (2020). GRAB: A dataset of whole-body human grasping of objects. In Proceedings of the European conference on computer vision (ECCV) (pp. 581–600).
Tang, D., Chang, H. J., Tejani, A. & Kim, T.-K. (2014). Latent regression forest: Structured estimation of 3D articulated hand posture. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 3786–3793).
Tang, D., Yu, T.-H. & Kim, T.-K. (2013). Real-time articulated hand pose estimation using semi-supervised transductive regression forests. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 3224–3231).
Tekin, B., Bogo, F. & Pollefeys, M. (2019). H+O: Unified egocentric recognition of 3D hand-object poses and interactions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 4511–4520).
Tompson, J., Stein, M., LeCun, Y., & Perlin, K. (2014). Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics (ToG), 33(5), 169:1-169:10.
Tzeng, E., Hoffman, J., Saenko, K. & Darrell, T. (2017). Adversarial discriminative domain adaptation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 2962–2971).
Wan, C., Probst, T., Gool, L. V. & Yao, A. (2019). Self-supervised 3D hand pose estimation through training by fitting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 10853–10862).
Wang, R. Y., & Popovic, J. (2009). Real-time hand-tracking with a color glove. ACM Transactions on Graphics (ToG), 28(3), 63.
Wetzler, A., Slossberg, R. & Kimmel, R. (2015). Rule of thumb: Deep derotation for improved fingertip detection. In Proceedings of the British machine vision conference (BMVC) (pp. 33.1–33.12).
Wu, M.-Y., Ting, P.-W., Tang, Y.-H., Chou, E. T., & Fu, L.-C. (2020). Hand pose estimation in object-interaction based on deep learning for virtual reality applications. Journal of Visual Communication and Image Representation, 70, 102802.
Wuu, C., Zheng, N., Ardisson, S., Bali, R., Belko, D., Brockmeyer, E., Evans, L., Godisart, T., Ha, H., Hypes, A., Koska, T., Krenn, S., Lombardi, S., Luo, X., McPhail, K., Millerschoen, L., Perdoch, M., Pitts, M., Richard, A., Saragih, J. M., Saragih, J., Shiratori, T., Simon, T., Stewart, M., Trimble, A., Weng, X., Whitewolf, D., Wu, C., Yu, S. & Sheikh, Y. (2022). Multiface: A dataset for neural face rendering. CoRR, arXiv:2207.11243.
Xiong, F., Zhang, B., Xiao, Y., Cao, Z., Yu, T., Zhou, J. T. & Yuan, J. (2019). A2J: Anchor-to-joint regression network for 3D articulated pose estimation from a single depth image. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 793–802).
Xu, C. & Cheng, L. (2013). Efficient hand pose estimation from a single depth image. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 3456–3462).
Yang, L., Chen, S. & Yao, A. (2021). SemiHand: Semi-supervised hand pose estimation with consistency. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 11364–11373).
Yuan, S., Garcia-Hernando, G., Stenger, B., Moon, G., Chang, J. Y., Lee, K. M., Molchanov, P., Kautz, J., Honari, S., Ge, L., Yuan, J., Chen, X., Wang, G., Yang, F., Akiyama, K., Wu, Y., Wan, Q., Madadi, M., Escalera, S., Li, S., Lee, D., Oikonomidis, I., Argyros, A. A. & Kim, T.-K. (2018). Depth-based 3D hand pose estimation: From current achievements to future goals. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 2636–2645).
Yuan, S., Stenger, B. & Kim, T.-K. (2019). RGB-based 3D hand pose estimation via privileged learning with depth images. In Proceedings of the IEEE/CVF international conference on computer vision workshops (ICCVW).
Yuan, S., Ye, Q., Stenger, B., Jain, S. & Kim, T.-K. (2017). BigHand2.2M benchmark: Hand pose dataset and state of the art analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 2605–2613).
Zhang, Y., Chen, L., Liu, Y., Zheng, W. & Yong, J. (2020). Adaptive Wasserstein hourglass for weakly supervised RGB 3D hand pose estimation. In Proceedings of the ACM international conference on multimedia (MM) (pp. 2076–2084).
Zhou, X., Wan, Q., Zhang, W., Xue, X. & Wei, Y. (2016). Model-based deep hand pose estimation. In Proceedings of the international joint conference on artificial intelligence (IJCAI) (pp. 2421–2427).
Zimmermann, C., Argus, M., & Brox, T. (2021). Contrastive representation learning for hand shape estimation. In Proceedings of the DAGM German conference on pattern recognition (GCPR) (Vol. 13024, pp. 250–264).
Zimmermann, C. & Brox, T. (2017). Learning to estimate 3D hand pose from single RGB images. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 4913–4921).
Zimmermann, C., Ceylan, D., Yang, J., Russell, B., Argus, M. J. & Brox, T. (2019). FreiHAND: A dataset for markerless capture of hand pose and shape from single RGB images. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (pp. 813–822).

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.