VolNet: Estimating Human Body Part Volumes From A Single RGB Image
Abstract

Human body volume estimation from a single RGB image is a challenging problem that has received only minimal attention from the research community. However, VolNet, an architecture leveraging 2D and 3D pose estimation, body part segmentation, and volume regression from a single 2D RGB image, combined with the subject's body height, can be used to estimate the total body volume. VolNet is designed to predict the 2D and 3D pose as well as the body part segmentation in intermediate tasks. We generated a synthetic, large-scale dataset of photo-realistic images of human bodies with a wide range of body shapes and realistic poses, called SURREALvols¹. By using VolNet and combining multiple stacked hourglass networks together with ResNeXt, our model correctly predicted the volume in ∼82% of cases at a 10% tolerance threshold. This is a considerable improvement over state-of-the-art solutions such as BodyNet, which reaches only a ∼38% success rate.

¹ https://github.com/fleinen/SURREALvols
Body Height       157 cm          168 cm          182 cm
Head              7.97 / 7.84     7.75 / 7.53     7.19 / 6.65
Torso             42.74 / 42.05   54.24 / 52.56   33.22 / 26.72
Left Upper Arm    1.70 / 1.70     2.33 / 2.28     1.50 / 1.16
Left Fore Arm     1.50 / 1.53     1.60 / 1.66     1.28 / 1.18
Left Hand         0.64 / 0.71     0.66 / 0.74     0.71 / 0.83
Right Upper Arm   2.05 / 2.02     2.85 / 2.63     1.65 / 1.31
Right Fore Arm    1.45 / 1.47     1.63 / 1.70     1.40 / 1.26
Right Hand        0.65 / 0.75     0.66 / 0.73     0.70 / 0.81
Left Up Leg       5.79 / 5.18     7.30 / 7.11     5.72 / 3.89
Left Lower Leg    2.83 / 2.56     3.57 / 3.43     3.02 / 2.07
Left Foot         1.43 / 1.43     1.75 / 1.88     1.82 / 1.72
Right Up Leg      5.73 / 5.21     7.29 / 7.05     5.47 / 3.55
Right Lower Leg   2.79 / 2.54     3.57 / 3.45     2.97 / 2.11
Right Foot        1.47 / 1.44     1.75 / 1.83     1.89 / 1.78
Total             78.72 / 76.41   96.95 / 94.57   68.55 / 55.04

Figure 1: Two good and one bad prediction with all inputs, outputs and intermediate results (the figure rows show, per subject, the input RGB image, 2D pose, segmentation, and 3D pose). The volumes are given in dm³. In each case, the first value represents the predicted value, the second one the ground truth value.

1. Introduction

Reconstruction of 3D poses and shapes from a single 2D image has received increasing attention from the deep learning community [9, 46, 31, 21, 25, 40, 11]. Currently, limited work has been done on estimating the volume of objects and, especially, human body parts. In fact, similar work is limited to estimating the shape of an object, e.g. by estimating the parameters of a model such as the Skinned Multi-Person Linear Model (SMPL) [24]. However, there are many applications in fields such as ergonomics, virtual try-on, and medicine where body volume fractions are crucial. Furthermore, the weight of individual body parts can be calculated from volume and density. Our study shows that the volume of body parts can be estimated from a single 2D RGB image. To do this, we designed and implemented a novel custom neural network architecture based on the stacked hourglass [30] approach. Our main contributions are the following:

• We extended the SURREAL dataset [38] with volumes of 14 different body parts, leading to a novel benchmark dataset called SURREALvols.

• We introduced a novel network architecture that can estimate the total volume of a human being as well as the volumes of 14 individual body parts from a single RGB image, with an accuracy of ∼82% at a tolerance threshold of 10% of the body volume.

• We show how to improve the volume regression for single body parts by fine-tuning the network, which leads to an additional ∼2% improvement in Mean Absolute Percentage Error (MAPE).
2. Related Work

Pose and 3D Body Representation. Estimating human pose from an individual 2D image is possible by leveraging neural networks based on a stacked hourglass architecture [36, 30, 16, 6]. Typically, multiple hourglass modules are stacked together, enabling the network to reevaluate initial guesses. Beyond the pose, many applications require a digital 3D representation of the human body, for example VR and AR [23, 17], computer animations [7, 34, 27], or in-car monitoring. SMPL is a generative human body model trained on real body scans [24]. The shape β ∈ R^10 is obtained from the first coefficients of a PCA, while the pose θ ∈ R^(3K+3) is modeled by the relative rotations of the K = 23 joints in axis-angle representation. Overall, SMPL provides a male and a female template mesh, each consisting of 6,890 vertices and 23 joints. Thus, a considerable number of realistic human body shapes can be simulated in different poses.
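To make this parameterization concrete, the following minimal sketch samples a SMPL body with the smplx Python package; the package choice, model path, and sampling scale are our assumptions for illustration, not part of the paper.

import torch
import smplx

# Load the SMPL model; "models/" is a placeholder path to locally
# downloaded SMPL model files (an assumption, not from the paper).
model = smplx.create("models/", model_type="smpl", gender="female")

betas = torch.randn(1, 10) * 0.8      # shape: the first 10 PCA coefficients (β ∈ R^10)
body_pose = torch.zeros(1, 23 * 3)    # relative axis-angle rotations of the K = 23 joints
global_orient = torch.zeros(1, 3)     # root orientation, also in axis-angle

output = model(betas=betas, body_pose=body_pose, global_orient=global_orient)
print(output.vertices.shape)          # (1, 6890, 3): the deformed template mesh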
The SURREAL dataset is organized into clips of at most 100 frames in which body shape, texture, background, lighting, and camera position are constant, and only the person's pose and location change. Subjects are based on SMPL, which allows the generation of people with a wide range of poses and shapes. To ensure that the poses and motions look realistic, positions are taken from the CMU MoCap database [1], which contains 3D positions of MoCap markers for more than 2,000 sequences. Fitting SMPL to these markers leads directly to the SMPL pose parameters. At the beginning of each clip, the person is placed at the center of the picture and is allowed to move away within the clip. SMPL's ten shape parameters are sampled randomly and not changed within a clip. To achieve a photo-realistic appearance of the resulting RGB data, a human texture is added to the mesh. Textures were extracted from CAESAR [32] scans showing people in tight-fitting as well as usual clothing. The textured person is rendered in front of a real-world photo randomly sampled from the LSUN dataset [45], using only images from the kitchen, living room, bedroom, and dining room categories. The background and the person have no relation to each other, leading to people floating and intersecting with invisible objects. To further improve variance, camera pose and lighting were randomly sampled.

3D reconstruction. Reconstructing the 3D position and 3D shape of arbitrary and specific objects has received increasing attention. Navaneet et al. directly regressed 3D features from a 2D image using a deep neural network [29]. For estimating the 3D shape of a human body, the typical technique is to regress the parameters of a predefined human shape model such as SMPL. Bogo et al. [10] extracted keypoints from RGB images and fit them to SMPL. Lassner et al. extended this approach to use 2D silhouettes [22]. BodyNet [37] represents the 3D body in the form of voxels, differing from previous approaches that directly estimate the 3D body pose and shape. The BodyNet architecture uses multiple stacked hourglass networks to predict the 2D and 3D pose as well as the body part segmentation before estimating the actual 3D shape in a voxel grid. The output is then optionally fit to the SMPL model. BodyNet is trained on the SURREAL [38] and Unite the People [22] datasets. Xu et al. used DensePose [15] to estimate an IUV map as a proxy representation to fit SMPL [43]. To improve the training process, they use a render-and-compare approach. Alldieck et al. proposed an approach that predicts 3D human shapes in a canonical pose from an input video showing the subject from all sides [4]; a deep convolutional network predicted only the shape parameters. Like Xu et al., they improved their training process using render-and-compare on a few frames. In subsequent work they introduced Tex2Shape, which generates a detailed 3D mesh from only a single RGB image [5]. It uses DensePose's UV coordinates to first regress the SMPL shape parameters β with a simple convolutional network architecture and then generates normals from DensePose's texture maps using a GAN architecture. The normals are then used to refine the 3D mesh, resulting in a detailed 3D mesh of people wearing clothes.

Visual Body Weight Estimation. Velardo et al. estimated human body weight using a linear regression model that maps seven body measurements (body height, upper leg length, calf circumference, upper arm length, upper arm circumference, waist circumference, and upper leg circumference) to the body weight [39]. Data from the National Health and Nutrition Examination Survey (NHANES) [13] was used for optimization. Due to the lack of data for human bodies with a total body weight below 35 kg or above 130 kg, only 27k data points were available. The resulting model predicted the weight of 60% of the samples from the test set with a maximum error of ±5%, and 93% of the samples with a maximum error of ±10%. Note, however, that this model is biased, since it was constructed with data collected only in the USA.
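A minimal sketch of such a measurement-to-weight regression follows; the data is synthetic stand-in data (not NHANES), and the coefficients are invented purely so the example runs.

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the seven measurements (in cm): body height,
# upper leg length, calf circumference, upper arm length, upper arm
# circumference, waist circumference, upper leg circumference.
rng = np.random.default_rng(0)
X = rng.uniform(low=[150, 35, 30, 25, 24, 60, 40],
                high=[200, 55, 48, 40, 42, 120, 70],
                size=(1000, 7))
# Invented ground-truth relationship, only to make the sketch runnable.
y = 0.45 * X[:, 0] + 0.55 * X[:, 5] + rng.normal(0.0, 3.0, size=1000) - 50.0

model = LinearRegression().fit(X, y)
within_5 = np.mean(np.abs(model.predict(X) - y) / y <= 0.05)
print(f"samples within ±5%: {within_5:.2%}")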
To boost the ability to generalize, the RGB images are augmented.
Figure 4: Architecture of VolNet. The 2D pose is predicted from a single RGB image. Next, the RGB image and the 2D pose are used as input for the body part segmentation. Together, these predict the 3D pose. All four intermediate results, combined with the body height, are then used to predict the volume of each individual body part. (Diagram inputs: RGB image and body height; outputs: body part volumes.)
Figure 5: Structure of each of the first three subnetworks. The front module and both hourglasses are adapted from Newell et al. [30]. Final results from the last stack are upsampled to 256 × 256 again. Losses are applied after the blue blocks. (The diagram is built from residual blocks, 1×1 convolutions, batch normalization, and 2×2 upsampling layers, joined by add and concatenation connections.)
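As an illustration of the stacking-with-intermediate-supervision idea, here is a toy PyTorch sketch; the module sizes and internal wiring are simplified assumptions, not the exact VolNet blocks.

import torch
import torch.nn as nn

class TinyHourglass(nn.Module):
    """Heavily simplified hourglass: downsample, process, upsample, plus a skip path."""
    def __init__(self, ch):
        super().__init__()
        self.down = nn.Sequential(nn.MaxPool2d(2),
                                  nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2)
        self.skip = nn.Conv2d(ch, ch, 1)
    def forward(self, x):
        return self.skip(x) + self.up(self.down(x))

class StackedHourglass(nn.Module):
    """Two stacked hourglasses; each stack emits a prediction so a loss can be
    applied to every intermediate output (intermediate supervision)."""
    def __init__(self, ch=64, out_ch=16):
        super().__init__()
        self.hg1, self.hg2 = TinyHourglass(ch), TinyHourglass(ch)
        self.head1 = nn.Conv2d(ch, out_ch, 1)
        self.head2 = nn.Conv2d(ch, out_ch, 1)
        self.remap = nn.Conv2d(out_ch, ch, 1)  # feed the prediction back into the features
    def forward(self, x):
        f1 = self.hg1(x)
        p1 = self.head1(f1)                    # intermediate prediction (gets its own loss)
        f2 = self.hg2(f1 + self.remap(p1))     # second stack reevaluates the initial guess
        p2 = self.head2(f2)                    # final prediction
        return p1, p2

x = torch.randn(1, 64, 64, 64)
p1, p2 = StackedHourglass()(x)                 # both (1, 16, 64, 64); a loss is applied to each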
The 3D pose subnetwork used as inputs the RGB image, the 2D pose, and the body part segmentation, which were concatenated along their channel dimension, resulting in a tensor with 34 channels. As in BodyNet, the 3D pose is represented as 16 voxel grids, one for each joint, each containing a multivariate Gaussian with its mean at the position of the joint. BodyNet assumes that the depth of a body roughly fits within 85 cm, split into 19 bins where each bin represents a range of 5 cm. For our approach, we instead used the relative depth in an orthogonal projection and split the depth into 12 bins rather than 19.
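A sketch of this per-joint voxel representation follows; the spatial resolution and the Gaussian width σ below are illustrative choices of ours, not values from the paper.

import numpy as np

def joint_voxel_grid(u, v, depth_bin, size=64, depth_bins=12, sigma=1.5):
    """One voxel grid for one joint: a separable 3D Gaussian centered at the
    joint's spatial position (u, v) and its relative-depth bin."""
    xs = np.arange(size)
    zs = np.arange(depth_bins)
    gx = np.exp(-((xs - u) ** 2) / (2 * sigma ** 2))
    gy = np.exp(-((xs - v) ** 2) / (2 * sigma ** 2))
    gz = np.exp(-((zs - depth_bin) ** 2) / (2 * sigma ** 2))
    # Outer product -> 3D Gaussian of shape (depth_bins, size, size).
    return gz[:, None, None] * gy[None, :, None] * gx[None, None, :]

# 16 joints -> 16 such grids stacked along the channel dimension.
grid = joint_voxel_grid(u=30, v=40, depth_bin=6)
print(grid.shape)  # (12, 64, 64)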
This subnetwork mainly focused on estimating the depth value for each joint, since the spatial position was already known from the first subnetwork. As with the 2D pose estimation, this subnetwork was trained using the RMSprop optimizer with a learning rate of 2.5 × 10⁻⁴ and an MSE loss. For evaluation, the accuracy with a fixed threshold of 12 pixels for the spatial resolution and two bins for the depth was used. A maximum accuracy of 82.55% was reached.

4.2.2 Volume Regression

The last step in VolNet predicts the volume of 14 predefined body parts. It uses the RGB image combined with the results from the previously mentioned subtasks, concatenated along their channel dimension. This results in a single 256 × 256 × 226 tensor. The body height in centimeters is used as an additional input to the network. Hence, the subnetwork must deal with spatially dependent data as well as single scalar data. For the convolutional component, we used a standard ResNeXt-50 [41] with a cardinality of 32. The body height is concatenated to the 2,048-dimensional feature vector produced by the convolutions. The result is passed to a fully connected portion with only two fully connected layers: one hidden and one output layer. The hidden layer has 1,024 neurons and uses leaky ReLU activation with α = 0.1. Since there are 14 body parts, the output layer has 14 neurons with linear activation. Because body part volumes are strongly influenced by body height and are sensitive to small changes, batch normalization was not used, in order to preserve the exact body height.

We calculated the MSE between the ground truth and the predicted body part volumes as the loss, since it automatically gives greater weight to large and influential errors. Such errors usually occur with greater frequency when estimating the volume of large body parts like the torso. Here, too, we use Adam with default parameters and a learning rate of α = 2.5 × 10⁻⁴.
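The following sketch reconstructs this regression subnetwork from the description above using torchvision's ResNeXt-50; the layer wiring and names are our reconstruction, not the authors' code.

import torch
import torch.nn as nn
from torchvision.models import resnext50_32x4d

class VolumeRegressor(nn.Module):
    def __init__(self, in_channels=226, num_parts=14):
        super().__init__()
        backbone = resnext50_32x4d(weights=None)   # cardinality 32, as described
        # Adapt the stem to the 226-channel concatenated input tensor.
        backbone.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        backbone.fc = nn.Identity()                # keep the 2048-d feature vector
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(2048 + 1, 1024),             # +1 for the body height in cm
            nn.LeakyReLU(0.1),
            nn.Linear(1024, num_parts),            # linear outputs: 14 body part volumes
        )
    def forward(self, x, height_cm):
        feat = self.backbone(x)                    # (B, 2048); no batch norm added here
        feat = torch.cat([feat, height_cm], dim=1)
        return self.head(feat)

model = VolumeRegressor()
x = torch.randn(2, 226, 256, 256)
h = torch.tensor([[172.0], [158.0]])
volumes = model(x, h)                              # (2, 14), volumes in dm³

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)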
                  AE [dm³]       APE [%]
                  µ      σ       µ      σ
head              0.4    0.4     6.2    5.2
torso             3.0    3.2     7.8    8.1
left upper arm    0.2    0.2     11.3   10.3
left fore arm     0.1    0.1     8.6    7.1
left hand         0.1    0.0     10.0   8.4
right upper arm   0.2    0.2     10.4   10.2
right fore arm    0.1    0.1     8.6    7.5
right hand        0.1    0.0     10.0   6.8
left up leg       0.5    0.5     9.8    9.3
left lower leg    0.2    0.2     8.7    7.1
left foot         0.1    0.1     6.8    4.7
right up leg      0.5    0.5     9.7    9.3
right lower leg   0.2    0.2     8.3    7.0
right foot        0.1    0.1     6.4    4.6
total volume      4.7    4.9     6.2    6.2

Table 2: Performance of VolNet on the test set, clearly showing that our architecture is able to predict the subject's body part volumes accurately.

Figure 6: Cumulative error of VolNet with the volume regressor on the test set. (Plot: ratio of samples within a given MAPE, for MAPE from 0 to 100, shown for the total volume and for single body parts.)

Figure 7: Volume error in percent when rotating the person around its own axes: (a) around the y-axis ("front flip") and (b) around the z-axis ("pirouette"), each from 0 to 2π. π is always the neutral position of a person standing straight facing the camera.

5. Evaluation