
Deep High-Resolution Representation Learning for Human Pose Estimation

Ke Sun1,2∗† Bin Xiao2∗ Dong Liu1 Jingdong Wang2‡


1 University of Science and Technology of China    2 Microsoft Research Asia
sunk@mail.ustc.edu.cn, dongeliu@ustc.edu.cn, {Bin.Xiao,jingdw}@microsoft.com

∗ Equal contribution.
† This work was done when Ke Sun was an intern at Microsoft Research, Beijing, P.R. China.
‡ Corresponding author.

Abstract

In this paper, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from the other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset. In addition, we show the superiority of our network in pose tracking on the PoseTrack dataset. The code and models are publicly available at https://github.com/leoxiaobin/deep-high-resolution-net.pytorch.

Figure 1. Illustrating the architecture of the proposed HRNet. It consists of parallel high-to-low resolution subnetworks with repeated information exchange across multi-resolution subnetworks (multi-scale fusion). The horizontal and vertical directions correspond to the depth of the network and the scale of the feature maps, respectively.

1. Introduction

2D human pose estimation has been a fundamental yet challenging problem in computer vision. The goal is to localize human anatomical keypoints (e.g., elbow, wrist, etc.) or parts. It has many applications, including human action recognition, human-computer interaction, animation, etc. This paper is interested in single-person pose estimation, which is the basis of other related problems, such as multi-person pose estimation [6, 26, 32, 38, 46, 55, 40, 45, 17, 69], video pose estimation and tracking [48, 70], etc.

The recent developments show that deep convolutional neural networks have achieved state-of-the-art performance. Most existing methods pass the input through a network, typically consisting of high-to-low resolution subnetworks that are connected in series, and then raise the resolution. For instance, Hourglass [39] recovers the high resolution through a symmetric low-to-high process. SimpleBaseline [70] adopts a few transposed convolution layers for generating high-resolution representations. In addition, dilated convolutions are also used to blow up the later layers of a high-to-low resolution network (e.g., VGGNet or ResNet) [26, 74].

We present a novel architecture, namely High-Resolution Net (HRNet), which is able to maintain high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions by exchanging the information across the parallel multi-resolution subnetworks over and over through the whole process. We estimate the keypoints over the high-resolution representations output by our network. The resulting network is illustrated in Figure 1.


Figure 2. Illustration of representative pose estimation networks that rely on the high-to-low and low-to-high framework. (a) Hourglass [39].
(b) Cascaded pyramid networks [11]. (c) SimpleBaseline [70]: transposed convolutions for low-to-high processing. (d) Combination with
dilated convolutions [26]. Bottom-right legend: reg. = regular convolution, dilated = dilated convolution, trans. = transposed convolution,
strided = strided convolution, concat. = concatenation. In (a), the high-to-low and low-to-high processes are symmetric. In (b), (c) and
(d), the high-to-low process, a part of a classification network (ResNet or VGGNet), is heavy, and the low-to-high process is light. In (a)
and (b), the skip-connections (dashed lines) between the same-resolution layers of the high-to-low and low-to-high processes mainly aim
to fuse low-level and high-level features. In (b), the right part, refinenet, combines the low-level and high-level features that are processed
through convolutions.
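As a hedged illustration of panel (c), the snippet below sketches a SimpleBaseline-style low-to-high head (illustrative only; the layer settings are assumptions, not the released implementation): a few transposed convolutions raise the resolution of low-resolution backbone features before a 1 × 1 convolution predicts the K keypoint heatmaps.

```python
import torch
from torch import nn

# Illustrative sketch of a heavy high-to-low / light low-to-high design in the
# style of SimpleBaseline [70]: low-resolution backbone features are upsampled
# by a few transposed convolutions, then a 1x1 convolution predicts K heatmaps.
def deconv_head(in_channels=2048, num_deconv=3, deconv_channels=256, num_keypoints=17):
    layers = []
    channels = in_channels
    for _ in range(num_deconv):  # each transposed convolution doubles the resolution
        layers += [
            nn.ConvTranspose2d(channels, deconv_channels, kernel_size=4,
                               stride=2, padding=1),
            nn.BatchNorm2d(deconv_channels),
            nn.ReLU(inplace=True),
        ]
        channels = deconv_channels
    layers.append(nn.Conv2d(channels, num_keypoints, kernel_size=1))
    return nn.Sequential(*layers)

# Example: 8x8 backbone features (e.g., a 256x256 input downsampled 32x)
# are raised to 64x64 keypoint heatmaps.
head = deconv_head()
heatmaps = head(torch.randn(1, 2048, 8, 8))  # -> (1, 17, 64, 64)
```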

Our network has two benefits in comparison to existing widely-used networks [39, 26, 74, 70] for pose estimation. (i) Our approach connects high-to-low resolution subnetworks in parallel rather than in series as done in most existing solutions. Thus, our approach is able to maintain the high resolution instead of recovering the resolution through a low-to-high process, and accordingly the predicted heatmap is potentially spatially more precise. (ii) Most existing fusion schemes aggregate low-level and high-level representations. Instead, we perform repeated multi-scale fusions to boost the high-resolution representations with the help of the low-resolution representations of the same depth and similar level, and vice versa, so that the high-resolution representations are also rich for pose estimation. Consequently, our predicted heatmap is potentially more accurate.

We empirically demonstrate the superior keypoint detection performance over two benchmark datasets: the COCO keypoint detection dataset [35] and the MPII Human Pose dataset [2]. In addition, we show the superiority of our network in video pose tracking on the PoseTrack dataset [1].

2. Related Work

Most traditional solutions to single-person pose estimation adopt the probabilistic graphical model or the pictorial structure model [76, 49], which is recently improved by exploiting deep learning for better modeling the unary and pair-wise energies [9, 63, 44] or imitating the iterative inference process [13]. Nowadays, deep convolutional neural networks provide dominant solutions [20, 34, 60, 41, 42, 47, 56, 16]. There are two mainstream methods: regressing the position of keypoints [64, 7], and estimating keypoint heatmaps [13, 14, 75] followed by choosing the locations with the highest heat values as the keypoints.

Most convolutional neural networks for keypoint heatmap estimation consist of a stem subnetwork similar to the classification network, which decreases the resolution; a main body producing the representations with the same resolution as its input; and a regressor estimating the heatmaps, where the keypoint positions are estimated and then transformed to the full resolution. The main body mainly adopts the high-to-low and low-to-high framework, possibly augmented with multi-scale fusion and intermediate (deep) supervision.

High-to-low and low-to-high. The high-to-low process aims to generate low-resolution and high-level representations, and the low-to-high process aims to produce high-resolution representations [4, 11, 22, 70, 39, 60]. Both processes are possibly repeated several times for boosting the performance [74, 39, 14].

Representative network design patterns include: (i) Symmetric high-to-low and low-to-high processes. Hourglass and its follow-ups [39, 14, 74, 30] design the low-to-high process as a mirror of the high-to-low process. (ii) Heavy high-to-low and light low-to-high.

The high-to-low process is based on the ImageNet classification network, e.g., the ResNet adopted in [11, 70], and the low-to-high process is simply a few bilinear-upsampling [11] or transposed convolution [70] layers. (iii) Combination with dilated convolutions. In [26, 50, 34], dilated convolutions are adopted in the last two stages of the ResNet or VGGNet to eliminate the spatial resolution loss, followed by a light low-to-high process to further increase the resolution, avoiding the expensive computation cost of only using dilated convolutions [11, 26, 50]. Figure 2 depicts four representative pose estimation networks.

Multi-scale fusion. The straightforward way is to feed multi-resolution images separately into multiple networks and aggregate the output response maps [62]. Hourglass [39] and its extensions [74, 30] combine low-level features in the high-to-low process into the same-resolution high-level features in the low-to-high process progressively through skip connections. In the cascaded pyramid network [11], a globalnet combines low-to-high level features in the high-to-low process progressively into the low-to-high process, and then a refinenet combines the low-to-high level features that are processed through convolutions. Our approach repeats multi-scale fusion, which is partially inspired by deep fusion and its extensions [65, 71, 57, 77, 79].

Intermediate supervision. Intermediate supervision or deep supervision, early developed for image classification [33, 59], is also adopted for helping deep network training and improving the heatmap estimation quality, e.g., [67, 39, 62, 3, 11]. The hourglass approach [39] and the convolutional pose machine approach [67] process the intermediate heatmaps as the input or a part of the input of the remaining subnetwork.

Our approach. Our network connects high-to-low subnetworks in parallel. It maintains high-resolution representations through the whole process for spatially precise heatmap estimation. It generates reliable high-resolution representations through repeatedly fusing the representations produced by the high-to-low subnetworks. Our approach differs from most existing works, which need a separate low-to-high upsampling process and aggregate low-level and high-level representations. Our approach, without using intermediate heatmap supervision, is superior in keypoint detection accuracy and efficient in computation complexity and parameters.

There are related multi-scale networks for classification and segmentation [5, 8, 72, 78, 29, 73, 53, 54, 23, 80, 53, 51, 18]. Our work is partially inspired by some of them [54, 23, 80, 53], and there are clear differences making them not applicable to our problem. Convolutional neural fabrics [54] and interlinked CNNs [80] fail to produce high-quality segmentation results because of a lack of proper design on each subnetwork (depth, batch normalization) and multi-scale fusion. The grid network [18], a combination of many weight-shared U-Nets, consists of two separate fusion processes across multi-resolution representations: in the first stage, information is only sent from high resolution to low resolution; in the second stage, information is only sent from low resolution to high resolution, and it is thus less competitive. Multi-scale densenets [23] do not target, and cannot generate, reliable high-resolution representations.

3. Approach

Human pose estimation, a.k.a. keypoint detection, aims to detect the locations of K keypoints or parts (e.g., elbow, wrist, etc.) from an image I of size W × H × 3. The state-of-the-art methods transform this problem into estimating K heatmaps of size W' × H', {H_1, H_2, ..., H_K}, where each heatmap H_k indicates the location confidence of the kth keypoint.

We follow the widely-adopted pipeline [39, 70, 11] to predict human keypoints using a convolutional network, which is composed of a stem consisting of two strided convolutions decreasing the resolution, a main body outputting the feature maps with the same resolution as its input feature maps, and a regressor estimating the heatmaps where the keypoint positions are chosen and transformed to the full resolution. We focus on the design of the main body and introduce our High-Resolution Net (HRNet) that is depicted in Figure 1.

Sequential multi-resolution subnetworks. Existing networks for pose estimation are built by connecting high-to-low resolution subnetworks in series, where each subnetwork, forming a stage, is composed of a sequence of convolutions, and there is a down-sampling layer across adjacent subnetworks to halve the resolution.

Let N_sr be the subnetwork in the sth stage and r be the resolution index (its resolution is 1/2^(r-1) of the resolution of the first subnetwork). The high-to-low network with S (e.g., 4) stages can be denoted as:

    N_11 → N_22 → N_33 → N_44.                            (1)

Parallel multi-resolution subnetworks. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one, forming new stages, and connect the multi-resolution subnetworks in parallel. As a result, the resolutions for the parallel subnetworks of a later stage consist of the resolutions from the previous stage and an extra lower one.

An example network structure, containing 4 parallel subnetworks, is given as follows:

    N_11 → N_21 → N_31 → N_41
         ց N_22 → N_32 → N_42
              ց N_33 → N_43                                (2)
                   ց N_44.

Figure 3. Illustrating how the exchange unit aggregates the information for high, medium and low resolutions from the left to the right, respectively. Right legend: strided 3 × 3 = strided 3 × 3 convolution; up samp. 1 × 1 = nearest-neighbor up-sampling followed by a 1 × 1 convolution.

Repeated multi-scale fusion. We introduce exchange units across parallel subnetworks such that each subnetwork repeatedly receives the information from the other parallel subnetworks. Here is an example showing the scheme of exchanging information. We divide the third stage into several (e.g., 3) exchange blocks, and each block is composed of 3 parallel convolution units with an exchange unit across the parallel units, which is given as follows:

    C^1_31 ց        ր C^2_31 ց        ր C^3_31 ց
    C^1_32 → E^1_3 → C^2_32 → E^2_3 → C^3_32 → E^3_3,      (3)
    C^1_33 ր        ց C^2_33 ր        ց C^3_33 ր

where C^b_sr represents the convolution unit in the rth resolution of the bth block in the sth stage, and E^b_s is the corresponding exchange unit.

We illustrate the exchange unit in Figure 3 and present the formulation in the following. We drop the subscript s and the superscript b for discussion convenience. The inputs are s response maps: {X_1, X_2, ..., X_s}. The outputs are s response maps: {Y_1, Y_2, ..., Y_s}, whose resolutions and widths are the same as the inputs. Each output is an aggregation of the input maps, Y_k = Σ_{i=1}^{s} a(X_i, k). The exchange unit across stages has an extra output map Y_{s+1}: Y_{s+1} = a(Y_s, s + 1).

The function a(X_i, k) consists of upsampling or downsampling X_i from resolution i to resolution k. We adopt strided 3 × 3 convolutions for downsampling: for instance, one strided 3 × 3 convolution with stride 2 for 2× downsampling, and two consecutive strided 3 × 3 convolutions with stride 2 for 4× downsampling. For upsampling, we adopt simple nearest-neighbor sampling followed by a 1 × 1 convolution for aligning the number of channels. If i = k, a(·, ·) is just an identity connection: a(X_i, k) = X_i.

Heatmap estimation. We regress the heatmaps simply from the high-resolution representations output by the last exchange unit, which empirically works well. The loss function, defined as the mean squared error, is applied for comparing the predicted heatmaps and the groundtruth heatmaps. The groundtruth heatmaps are generated by applying a 2D Gaussian with standard deviation of 1 pixel centered on the groundtruth location of each keypoint.

Network instantiation. We instantiate the network for keypoint heatmap estimation by following the design rule of ResNet to distribute the depth to each stage and the number of channels to each resolution.

The main body, i.e., our HRNet, contains four stages with four parallel subnetworks, whose resolution is gradually decreased to a half and accordingly the width (the number of channels) is increased to the double. The first stage contains 4 residual units where each unit, the same as in ResNet-50, is formed by a bottleneck with the width 64, and is followed by one 3 × 3 convolution reducing the width of the feature maps to C. The 2nd, 3rd, 4th stages contain 1, 4, 3 exchange blocks, respectively. One exchange block contains 4 residual units, where each unit contains two 3 × 3 convolutions in each resolution, and an exchange unit across resolutions. In summary, there are in total 8 exchange units, i.e., 8 multi-scale fusions are conducted.

In our experiments, we study one small net and one big net: HRNet-W32 and HRNet-W48, where 32 and 48 represent the widths (C) of the high-resolution subnetworks in the last three stages, respectively. The widths of the other three parallel subnetworks are 64, 128, 256 for HRNet-W32, and 96, 192, 384 for HRNet-W48.

4. Experiments

4.1. COCO Keypoint Detection

Dataset. The COCO dataset [35] contains over 200,000 images and 250,000 person instances labeled with 17 keypoints. We train our model on the COCO train2017 dataset, including 57K images and 150K person instances. We evaluate our approach on the val2017 set and test-dev2017 set, containing 5000 images and 20K images, respectively.

Evaluation metric. The standard evaluation metric is based on Object Keypoint Similarity (OKS): OKS = Σ_i exp(−d_i² / (2 s² k_i²)) δ(v_i > 0) / Σ_i δ(v_i > 0). Here d_i is the Euclidean distance between the detected keypoint and the corresponding ground truth, v_i is the visibility flag of the ground truth, s is the object scale, and k_i is a per-keypoint constant that controls falloff. We report standard average precision and recall scores¹: AP50 (AP at OKS = 0.50), AP75, AP (the mean of AP scores at 10 positions, OKS = 0.50, 0.55, ..., 0.90, 0.95), APM for medium objects, APL for large objects, and AR at OKS = 0.50, 0.55, ..., 0.90, 0.95.

¹ http://cocodataset.org/#keypoints-eval
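As a hedged companion to the OKS definition above, the following minimal NumPy sketch (illustrative only, not the official COCO evaluation code) computes OKS for one predicted pose against its ground truth:

```python
import numpy as np

def oks(pred, gt, vis, area, kappas):
    """Object Keypoint Similarity for one person instance (illustrative sketch).

    pred, gt: (K, 2) arrays of keypoint coordinates.
    vis:      (K,) visibility flags of the ground truth (> 0 means labeled).
    area:     object segment area; s^2 in the formula is taken as this area.
    kappas:   (K,) per-keypoint constants controlling falloff.
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)                 # squared distances d_i^2
    e = d2 / (2.0 * area * kappas ** 2 + np.spacing(1))   # d_i^2 / (2 s^2 k_i^2)
    labeled = vis > 0
    if labeled.sum() == 0:
        return 0.0
    return float(np.exp(-e[labeled]).mean())              # average over labeled keypoints
```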

Table 1. Comparisons on the COCO validation set. Pretrain = pretrain the backbone on the ImageNet classification task. OHKM = online
hard keypoints mining [11].
Method Backbone Pretrain Input size #Params GFLOPs AP AP50 AP75 APM APL AR
8-stage Hourglass [39] 8-stage Hourglass N 256 × 192 25.1M 14.3 66.9 − − − − −
CPN [11] ResNet-50 Y 256 × 192 27.0M 6.20 68.6 − − − − −
CPN + OHKM [11] ResNet-50 Y 256 × 192 27.0M 6.20 69.4 − − − − −
SimpleBaseline [70] ResNet-50 Y 256 × 192 34.0M 8.90 70.4 88.6 78.3 67.1 77.2 76.3
SimpleBaseline [70] ResNet-101 Y 256 × 192 53.0M 12.4 71.4 89.3 79.3 68.1 78.1 77.1
SimpleBaseline [70] ResNet-152 Y 256 × 192 68.6M 15.7 72.0 89.3 79.8 68.7 78.9 77.8
HRNet-W32 HRNet-W32 N 256 × 192 28.5M 7.10 73.4 89.5 80.7 70.2 80.1 78.9
HRNet-W32 HRNet-W32 Y 256 × 192 28.5M 7.10 74.4 90.5 81.9 70.8 81.0 79.8
HRNet-W48 HRNet-W48 Y 256 × 192 63.6M 14.6 75.1 90.6 82.2 71.5 81.8 80.4
SimpleBaseline [70] ResNet-152 Y 384 × 288 68.6M 35.6 74.3 89.6 81.1 70.5 79.7 79.7
HRNet-W32 HRNet-W32 Y 384 × 288 28.5M 16.0 75.8 90.6 82.7 71.9 82.8 81.0
HRNet-W48 HRNet-W48 Y 384 × 288 63.6M 32.9 76.3 90.8 82.9 72.3 83.4 81.2

Table 2. Comparisons on the COCO test-dev set. #Params and FLOPs are calculated for the pose estimation network, and those for human
detection and keypoint grouping are not included.
Method Backbone Input size #Params GFLOPs AP AP50 AP75 APM APL AR
Bottom-up: keypoint detection and grouping
OpenPose [6] − − − − 61.8 84.9 67.5 57.1 68.2 66.5
Associative Embedding [38] − − − − 65.5 86.8 72.3 60.6 72.6 70.2
PersonLab [45] − − − − 68.7 89.0 75.4 64.1 75.5 75.4
MultiPoseNet [32] − − − − 69.6 86.3 76.6 65.0 76.3 73.5
Top-down: human detection and single-person keypoint detection
Mask-RCNN [21] ResNet-50-FPN − − − 63.1 87.3 68.7 57.8 71.4 −
G-RMI [46] ResNet-101 353 × 257 42.6M 57.0 64.9 85.5 71.3 62.3 70.0 69.7
Integral Pose Regression [58] ResNet-101 256 × 256 45.0M 11.0 67.8 88.2 74.8 63.9 74.0 −
G-RMI + extra data [46] ResNet-101 353 × 257 42.6M 57.0 68.5 87.1 75.5 65.8 73.3 73.3
CPN [11] ResNet-Inception 384 × 288 − − 72.1 91.4 80.0 68.7 77.2 78.5
RMPE [17] PyraNet [74] 320 × 256 28.1M 26.7 72.3 89.2 79.1 68.0 78.6 −
CFN [24] − − − − 72.6 86.1 69.7 78.3 64.1 −
CPN (ensemble) [11] ResNet-Inception 384 × 288 − − 73.0 91.7 80.9 69.5 78.1 79.0
SimpleBaseline [70] ResNet-152 384 × 288 68.6M 35.6 73.7 91.9 81.1 70.3 80.0 79.0
HRNet-W32 HRNet-W32 384 × 288 28.5M 16.0 74.9 92.5 82.8 71.3 80.9 80.1
HRNet-W48 HRNet-W48 384 × 288 63.6M 32.9 75.5 92.5 83.3 71.9 81.5 80.5
HRNet-W48 + extra data HRNet-W48 384 × 288 63.6M 32.9 77.0 92.7 84.5 73.4 83.1 82.0

Training. We extend the human detection box in height or width to a fixed aspect ratio, height : width = 4 : 3, and then crop the box from the image, which is resized to a fixed size, 256 × 192 or 384 × 288. The data augmentation includes random rotation ([−45°, 45°]), random scale ([0.65, 1.35]), and flipping. Following [66], half-body data augmentation is also involved.

We use the Adam optimizer [31]. The learning schedule follows the setting in [70]. The base learning rate is set as 1e−3, and is dropped to 1e−4 and 1e−5 at the 170th and 200th epochs, respectively. The training process is terminated within 210 epochs.

Testing. The two-stage top-down paradigm similar to [46, 11, 70] is used: detect the person instance using a person detector, and then predict the keypoints. We use the same person detectors provided by SimpleBaseline² for both the validation set and the test-dev set. Following [70, 39, 11], we compute the heatmap by averaging the heatmaps of the original and flipped images. Each keypoint location is predicted by adjusting the highest heatvalue location with a quarter offset in the direction from the highest response to the second highest response.

² https://github.com/Microsoft/human-pose-estimation.pytorch
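A hedged sketch of the testing-time decoding described above (illustrative only, not the released implementation; the quarter-offset step here inspects the peak's immediate neighbors, which is one common realization of the rule):

```python
import numpy as np

def decode_keypoint(heatmap, heatmap_flipped):
    """Decode one keypoint location from a (H, W) heatmap (illustrative sketch).

    heatmap_flipped is the heatmap predicted on the horizontally flipped image,
    already flipped back to the original coordinate frame.
    """
    h = (heatmap + heatmap_flipped) / 2.0            # flip-test averaging
    H, W = h.shape
    y, x = np.unravel_index(np.argmax(h), h.shape)   # highest response
    dx = dy = 0.0
    if 0 < x < W - 1:                                # quarter offset toward the
        dx = 0.25 * np.sign(h[y, x + 1] - h[y, x - 1])  # larger neighboring response
    if 0 < y < H - 1:
        dy = 0.25 * np.sign(h[y + 1, x] - h[y - 1, x])
    return x + dx, y + dy, h[y, x]                   # (x, y, confidence)
```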

Results on the validation set. We report the results of our method and other state-of-the-art methods in Table 1. Our small network, HRNet-W32, trained from scratch with the input size 256 × 192, achieves a 73.4 AP score, outperforming other methods with the same input size. (i) Compared to Hourglass [39], our small network improves AP by 6.5 points, and the GFLOPs of our network is much lower, less than half, while the number of parameters is similar and ours is slightly larger. (ii) Compared to CPN [11] w/o and w/ OHKM, our network, with slightly larger model size and slightly higher complexity, achieves 4.8 and 4.0 points gain, respectively. (iii) Compared to the previous best-performing SimpleBaseline [70], our HRNet-W32 obtains significant improvements: 3.0 points gain for the backbone ResNet-50 with a similar model size and GFLOPs, and 1.4 points gain for the backbone ResNet-152, whose model size (#Params) and GFLOPs are twice as many as ours.

Our nets can benefit from (i) training from the model pretrained on ImageNet: the gain is 1.0 points for HRNet-W32; and (ii) increasing the capacity by increasing the width: HRNet-W48 gets 0.7 and 0.5 points gain for the input sizes 256 × 192 and 384 × 288, respectively.

Considering the input size 384 × 288, our HRNet-W32 and HRNet-W48 get 75.8 and 76.3 AP, which are 1.4 and 1.2 points higher than with the input size 256 × 192. In comparison to the SimpleBaseline [70] that uses ResNet-152 as the backbone, our HRNet-W32 and HRNet-W48 attain 1.5 and 2.0 points gain in terms of AP at 45% and 92.4% computational cost, respectively.

Results on the test-dev set. Table 2 reports the pose estimation performances of our approach and the existing state-of-the-art approaches. Our approach is significantly better than bottom-up approaches. On the other hand, our small network, HRNet-W32, achieves an AP of 74.9. It outperforms all the other top-down approaches, and is more efficient in terms of model size (#Params) and computation complexity (GFLOPs). Our big model, HRNet-W48, achieves the highest 75.5 AP. Compared to the SimpleBaseline [70] with the same input size, our small and big networks receive 1.2 and 1.8 improvements, respectively. With additional data from AI Challenger [68] for training, our single big network can obtain an AP of 77.0.

4.2. MPII Human Pose Estimation

Dataset. The MPII Human Pose dataset [2] consists of images taken from a wide range of real-world activities with full-body pose annotations. There are around 25K images with 40K subjects, where there are 12K subjects for testing and the remaining subjects for the training set. The data augmentation and the training strategy are the same as for MS COCO, except that the input size is cropped to 256 × 256 for fair comparison with other methods.

Testing. The testing procedure is almost the same as that in COCO, except that we adopt the standard testing strategy to use the provided person boxes instead of detected person boxes. Following [14, 74, 60], a six-scale pyramid testing procedure is performed.

Evaluation metric. The standard metric [2], the PCKh (head-normalized probability of correct keypoint) score, is used. A joint is correct if it falls within αl pixels of the groundtruth position, where α is a constant and l is the head size that corresponds to 60% of the diagonal length of the ground-truth head bounding box. The PCKh@0.5 (α = 0.5) score is reported.
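As a hedged companion to the PCKh definition above (illustrative only):

```python
import numpy as np

def pckh(pred, gt, head_sizes, alpha=0.5):
    """PCKh score over a set of joints (illustrative sketch).

    pred, gt:    (N, 2) predicted and ground-truth joint coordinates.
    head_sizes:  (N,) head size l for each joint's person (60% of the diagonal
                 of the ground-truth head bounding box).
    A joint is counted as correct if its distance to the ground truth is
    within alpha * l pixels.
    """
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= alpha * head_sizes))
```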

Table 3. Comparisons on the MPII test set (PCKh@0.5).
Method                    Hea   Sho   Elb   Wri   Hip   Kne   Ank   Total
Insafutdinov et al. [26]  96.8  95.2  89.3  84.4  88.4  83.4  78.0  88.5
Wei et al. [67]           97.8  95.0  88.7  84.0  88.4  82.8  79.4  88.5
Bulat et al. [4]          97.9  95.1  89.9  85.3  89.4  85.7  81.7  89.7
Newell et al. [39]        98.2  96.3  91.2  87.1  90.1  87.4  83.6  90.9
Sun et al. [56]           98.1  96.2  91.2  87.2  89.8  87.4  84.1  91.0
Tang et al. [61]          97.4  96.4  92.1  87.7  90.2  87.7  84.3  91.2
Ning et al. [43]          98.1  96.3  92.2  87.8  90.6  87.6  82.7  91.2
Luvizon et al. [36]       98.1  96.6  92.0  87.5  90.6  88.0  82.7  91.2
Chu et al. [14]           98.5  96.3  91.9  88.1  90.6  88.0  85.0  91.5
Chou et al. [12]          98.2  96.8  92.2  88.0  91.3  89.1  84.9  91.8
Chen et al. [10]          98.1  96.5  92.5  88.5  90.2  89.6  86.0  91.9
Yang et al. [74]          98.5  96.7  92.5  88.7  91.1  88.6  86.0  92.0
Ke et al. [30]            98.5  96.8  92.7  88.4  90.6  89.3  86.3  92.1
Tang et al. [60]          98.4  96.9  92.6  88.7  91.8  89.4  86.2  92.3
SimpleBaseline [70]       98.5  96.6  91.9  87.6  91.1  88.1  84.1  91.5
HRNet-W32                 98.6  96.9  92.8  89.0  91.5  89.0  85.7  92.3

Table 4. #Params and GFLOPs of some top-performed methods reported in Table 3. The GFLOPs is computed with the input size 256 × 256.
Method                    #Params  GFLOPs  PCKh@0.5
Insafutdinov et al. [26]  42.6M    41.2    88.5
Newell et al. [39]        25.1M    19.1    90.9
Yang et al. [74]          28.1M    21.3    92.0
Tang et al. [60]          15.5M    15.6    92.3
SimpleBaseline [70]       68.6M    20.9    91.5
HRNet-W32                 28.5M    9.5     92.3

Results on the test set. Tables 3 and 4 show the PCKh@0.5 results, the model size, and the GFLOPs of the top-performed methods. We reimplement the SimpleBaseline [70] by using ResNet-152 as the backbone with the input size 256 × 256. Our HRNet-W32 achieves a 92.3 PCKh@0.5 score, and outperforms the stacked hourglass approach [39] and its extensions [56, 14, 74, 30, 60]. Our result is the same as the best one [60] among the previously-published results on the leaderboard of Nov. 16th, 2018³. We would like to point out that the approach [60], complementary to our approach, exploits a compositional model to learn the configuration of human bodies and adopts multi-level intermediate supervision, from which our approach can also benefit. We also tested our big network, HRNet-W48, and obtained the same result, 92.3. The reason might be that the performance on this dataset tends to saturate.

³ http://human-pose.mpi-inf.mpg.de/#results

4.3. Application to Pose Tracking

Dataset. PoseTrack [27] is a large-scale benchmark for human pose estimation and articulated tracking in video. The dataset, based on the raw videos provided by the popular MPII Human Pose dataset, contains 550 video sequences with 66,374 frames. The video sequences are split into 292, 50, 208 videos for training, validation, and testing, respectively. The length of the training videos ranges between 41 − 151 frames, and 30 frames from the center of the video are densely annotated. The number of frames in the validation/testing videos ranges between 65 − 298 frames. The 30 frames around the keyframe from the MPII Pose dataset are densely annotated, and afterwards every fourth frame is annotated. In total, this constitutes roughly 23,000 labeled frames and 153,615 pose annotations.

Evaluation metric. We evaluate the results from two aspects: frame-wise multi-person pose estimation, and multi-person pose tracking. Pose estimation is evaluated by the mean Average Precision (mAP) as done in [50, 27]. Multi-person pose tracking is evaluated by the multi-object tracking accuracy (MOTA) [37, 27]. Details are given in [27].

Training. We train our HRNet-W48 for single-person pose estimation on the PoseTrack2017 training set, where the network is initialized by the model pre-trained on the COCO dataset. We extract the person box, as the input of our network, from the annotated keypoints in the training frames by extending the bounding box of all the keypoints (for one single person) by 15% in length. The training setup, including data augmentation, is almost the same as that for COCO, except that the learning schedule is different (as now it is for fine-tuning): the learning rate starts from 1e−4, drops to 1e−5 at the 10th epoch, and to 1e−6 at the 15th epoch; the iteration ends within 20 epochs.

Testing. We follow [70] to track poses across frames. It consists of three steps: person box detection and propagation, human pose estimation, and pose association across nearby frames. We use the same person box detector as used in SimpleBaseline [70], and propagate the detected box into nearby frames by propagating the predicted keypoints according to the optical flows computed by FlowNet 2.0 [25]⁴, followed by non-maximum suppression for box removal. The pose association scheme is based on the object keypoint similarity between the keypoints in one frame and the keypoints propagated from the nearby frame according to the optical flows. The greedy matching algorithm is then used to compute the correspondence between keypoints in nearby frames. More details are given in [70].

⁴ https://github.com/NVIDIA/flownet2-pytorch

Results on the PoseTrack2017 test set. Table 5 reports the results. Compared with the second best approach, FlowTrack in SimpleBaseline [70], which uses ResNet-152 as the backbone, our approach gets 0.3 and 0.1 points gain in terms of mAP and MOTA, respectively. The superiority over FlowTrack [70] is consistent with that on the COCO keypoint detection and MPII human pose estimation datasets. This further implies the effectiveness of our pose estimation network.

Table 5. Results of pose tracking on the PoseTrack2017 test set.
Entry            Additional training data  mAP   MOTA
ML-LAB [81]      COCO+MPII-Pose            70.3  41.8
SOPT-PT [52]     COCO+MPII-Pose            58.2  42.0
BUTD2 [28]       COCO                      59.2  50.6
MVIG [52]        COCO+MPII-Pose            63.2  50.7
PoseFlow [52]    COCO+MPII-Pose            63.0  51.0
ProTracker [19]  COCO                      59.6  51.8
HMPT [52]        COCO+MPII-Pose            63.7  51.9
JointFlow [15]   COCO                      63.6  53.1
STAF [52]        COCO+MPII-Pose            70.3  53.8
MIPAL [52]       COCO                      68.8  54.5
FlowTrack [70]   COCO                      74.6  57.8
HRNet-W48        COCO                      74.9  57.9

Table 6. Ablation study of exchange units that are used in repeated multi-scale fusion. Int. exchange across = intermediate exchange across stages, Int. exchange within = intermediate exchange within stages.
Method  Final exchange  Int. exchange across  Int. exchange within  AP
(a)     ✓               –                     –                     70.8
(b)     ✓               ✓                     –                     71.9
(c)     ✓               ✓                     ✓                     73.4

4.4. Ablation Study

We study the effect of each component in our approach on the COCO keypoint detection dataset. All results are obtained over the input size of 256 × 192, except the study about the effect of the input size.

Repeated multi-scale fusion. We empirically analyze the effect of the repeated multi-scale fusion. We study three variants of our network. (a) W/o intermediate exchange units (1 fusion): there is no exchange between multi-resolution subnetworks except the last exchange unit. (b) W/ across-stage exchange units (3 fusions): there is no exchange between parallel subnetworks within each stage. (c) W/ both across-stage and within-stage exchange units (totally 8 fusions): this is our proposed method. All the networks are trained from scratch. The results on the COCO validation set given in Table 6 show that the multi-scale fusion is helpful and more fusions lead to better performance.

Resolution maintenance. We study the performance of a variant of the HRNet: all the four high-to-low resolution subnetworks are added at the beginning and the depths are the same; the fusion schemes are the same as ours. Both our HRNet-W32 and this variant (with similar #Params and GFLOPs) are trained from scratch and tested on the COCO validation set. The variant achieves an AP of 72.5, which is lower than the 73.4 AP of HRNet-W32.

Figure 4. Qualitative results of some example images in the MPII (top) and COCO (bottom) datasets: containing viewpoint and appearance
change, occlusion, multiple persons, and common imaging artifacts.


Figure 5. Ablation study of high- and low-resolution representations. 1×, 2×, 4× correspond to the representations of the high, medium, and low resolutions, respectively.

Figure 6. Illustrating how the performances of our HRNet and SimpleBaseline [70] are affected by the input size.

We believe that the reason is that the low-level features extracted from the early stages over the low-resolution subnetworks are less helpful. In addition, a simple high-resolution network of similar parameters and GFLOPs without low-resolution parallel subnetworks shows much lower performance.

Representation resolution. We study how the representation resolution affects the pose estimation performance from two aspects: checking the quality of the heatmap estimated from the feature maps of each resolution from high to low, and studying how the input size affects the quality.

We train our small and big networks initialized by the model pretrained for ImageNet classification. Our network outputs four response maps from high to low resolutions. The quality of heatmap prediction over the lowest-resolution response map is too low, and the AP score is below 10 points. The AP scores over the other three maps are reported in Figure 5. The comparison implies that the resolution does impact the keypoint prediction quality.

Figure 6 shows how the input size affects the performance in comparison with SimpleBaseline (ResNet-50). Thanks to maintaining the high resolution through the whole process, the improvement for the smaller input sizes is more significant than for the larger input sizes; e.g., the improvement is 4.0 points for 256 × 192 and 6.3 points for 128 × 96. This implies that our approach is more advantageous in real applications where the computation cost is also an important factor. On the other hand, our approach with the input size 256 × 192 outperforms the SimpleBaseline with the large input size of 384 × 288.

5. Conclusion and Future Works

In this paper, we present a high-resolution network for human pose estimation, yielding accurate and spatially precise keypoint heatmaps. The success stems from two aspects: (i) maintaining the high resolution through the whole process without the need of recovering the high resolution; and (ii) fusing multi-resolution representations repeatedly, rendering reliable high-resolution representations.

Future works include the applications to other dense prediction tasks⁵, e.g., face alignment, object detection, and semantic segmentation, as well as the investigation on aggregating multi-resolution representations in a less light way.

Acknowledgements. The authors thank Dianqi Li and Lei Zhang for helpful discussions. Dr. Liu was partially supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDB06040900.

⁵ The project page is https://jingdongwang2017.github.io/Projects/HRNet/

References

[1] Mykhaylo Andriluka, Umar Iqbal, Anton Milan, Eldar Insafutdinov, Leonid Pishchulin, Juergen Gall, and Bernt Schiele. Posetrack: A benchmark for human pose estimation and tracking. In CVPR, pages 5167–5176, 2018. 2
[2] Mykhaylo Andriluka, Leonid Pishchulin, Peter V. Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, pages 3686–3693, 2014. 2, 6
[3] Vasileios Belagiannis and Andrew Zisserman. Recurrent human pose estimation. In FG, pages 468–475, 2017. 3
[4] Adrian Bulat and Georgios Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, volume 9911 of Lecture Notes in Computer Science, pages 717–732. Springer, 2016. 2, 6
[5] Zhaowei Cai, Quanfu Fan, Rogério Schmidt Feris, and Nuno Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, pages 354–370, 2016. 3
[6] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, pages 1302–1310, 2017. 1, 5
[7] João Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik. Human pose estimation with iterative error feedback. In CVPR, pages 4733–4742, 2016. 2
[8] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834–848, 2018. 3
[9] Xianjie Chen and Alan L. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS, pages 1736–1744, 2014. 2
[10] Yu Chen, Chunhua Shen, Xiu-Shen Wei, Lingqiao Liu, and Jian Yang. Adversarial posenet: A structure-aware convolutional network for human pose estimation. In ICCV, pages 1221–1230, 2017. 6
[11] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. CoRR, abs/1711.07319, 2017. 2, 3, 5
[12] Chia-Jung Chou, Jui-Ting Chien, and Hwann-Tzong Chen. Self adversarial training for human pose estimation. CoRR, abs/1707.02439, 2017. 6
[13] Xiao Chu, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Structured feature learning for pose estimation. In CVPR, pages 4715–4723, 2016. 2
[14] Xiao Chu, Wei Yang, Wanli Ouyang, Cheng Ma, Alan L. Yuille, and Xiaogang Wang. Multi-context attention for human pose estimation. In CVPR, pages 5669–5678, 2017. 2, 6
[15] Andreas Doering, Umar Iqbal, and Juergen Gall. Joint flow: Temporal flow fields for multi person tracking, 2018. 7
[16] Xiaochuan Fan, Kang Zheng, Yuewei Lin, and Song Wang. Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation. In CVPR, pages 1347–1355, 2015. 2
[17] Haoshu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. RMPE: regional multi-person pose estimation. In ICCV, pages 2353–2362, 2017. 1, 5
[18] Damien Fourure, Rémi Emonet, Élisa Fromont, Damien Muselet, Alain Trémeau, and Christian Wolf. Residual conv-deconv grid network for semantic segmentation. In BMVC, 2017. 3
[19] Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, and Du Tran. Detect-and-track: Efficient pose estimation in videos. In CVPR, pages 350–359, 2018. 7
[20] Georgia Gkioxari, Alexander Toshev, and Navdeep Jaitly. Chained predictions using convolutional neural networks. In ECCV, pages 728–743, 2016. 2
[21] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In ICCV, pages 2980–2988, 2017. 5
[22] Peiyun Hu and Deva Ramanan. Bottom-up and top-down reasoning with hierarchical rectified gaussians. In CVPR, pages 5600–5609, 2016. 2
[23] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Q. Weinberger. Multi-scale dense convolutional networks for efficient prediction. CoRR, abs/1703.09844, 2017. 3
[24] Shaoli Huang, Mingming Gong, and Dacheng Tao. A coarse-fine network for keypoint localization. In ICCV, pages 3047–3056. IEEE Computer Society, 2017. 5
[25] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, pages 1647–1655, 2017. 7
[26] Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, pages 34–50, 2016. 1, 2, 3, 6
[27] Umar Iqbal, Anton Milan, and Juergen Gall. Posetrack: Joint multi-person pose estimation and tracking. In CVPR, pages 4654–4663, 2017. 6, 7
[28] Sheng Jin, Xujie Ma, Zhipeng Han, Yue Wu, Wei Yang, Wentao Liu, Chen Qian, and Wanli Ouyang. Towards multi-person pose tracking: Bottom-up and top-down methods. In ICCV PoseTrack Workshop, 2017. 7
[29] Angjoo Kanazawa, Abhishek Sharma, and David W. Jacobs. Locally scale-invariant convolutional neural networks. CoRR, abs/1412.5104, 2014. 3
[30] Lipeng Ke, Ming-Ching Chang, Honggang Qi, and Siwei Lyu. Multi-scale structure-aware network for human pose estimation. CoRR, abs/1803.09894, 2018. 2, 3, 6
[31] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. 5
[32] Muhammed Kocabas, Salih Karagoz, and Emre Akbas. Multiposenet: Fast multi-person pose estimation using pose residual network. In ECCV, volume 11215 of Lecture Notes in Computer Science, pages 437–453. Springer, 2018. 1, 5
[33] Chen-Yu Lee, Saining Xie, Patrick W. Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In AISTATS, 2015. 3
[34] Ita Lifshitz, Ethan Fetaya, and Shimon Ullman. Human pose estimation using deep consensus voting. In ECCV, pages 246–260, 2016. 2, 3
[35] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In ECCV, pages 740–755, 2014. 2, 4
[36] Diogo C. Luvizon, Hedi Tabia, and David Picard. Human pose regression by combining indirect part detection and contextual information. CoRR, abs/1710.02322, 2017. 6
[37] Anton Milan, Laura Leal-Taixé, Ian D. Reid, Stefan Roth, and Konrad Schindler. MOT16: A benchmark for multi-object tracking. CoRR, abs/1603.00831, 2016. 7
[38] Alejandro Newell, Zhiao Huang, and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS, pages 2274–2284, 2017. 1, 5
[39] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, pages 483–499, 2016. 1, 2, 3, 5, 6
[40] Xuecheng Nie, Jiashi Feng, Junliang Xing, and Shuicheng Yan. Pose partition networks for multi-person pose estimation. In ECCV, September 2018. 1
[41] Xuecheng Nie, Jiashi Feng, and Shuicheng Yan. Mutual learning to adapt for joint human parsing and pose estimation. In ECCV, September 2018. 2
[42] Xuecheng Nie, Jiashi Feng, Yiming Zuo, and Shuicheng Yan. Human pose estimation with parsing induced learner. In CVPR, June 2018. 2
[43] Guanghan Ning, Zhi Zhang, and Zhiquan He. Knowledge-guided deep fractal neural networks for human pose estimation. IEEE Trans. Multimedia, 20(5):1246–1259, 2018. 6
[44] Wanli Ouyang, Xiao Chu, and Xiaogang Wang. Multi-source deep learning for human pose estimation. In CVPR, pages 2337–2344, 2014. 2
[45] George Papandreou, Tyler Zhu, Liang-Chieh Chen, Spyros Gidaris, Jonathan Tompson, and Kevin Murphy. Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In ECCV, September 2018. 1, 5
[46] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, pages 3711–3719, 2017. 1, 5
[47] Xi Peng, Zhiqiang Tang, Fei Yang, Rogerio S. Feris, and Dimitris Metaxas. Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation. In CVPR, June 2018. 2
[48] Tomas Pfister, James Charles, and Andrew Zisserman. Flowing convnets for human pose estimation in videos. In ICCV, pages 1913–1921, 2015. 1
[49] Leonid Pishchulin, Mykhaylo Andriluka, Peter V. Gehler, and Bernt Schiele. Poselet conditioned pictorial structures. In CVPR, pages 588–595, 2013. 2
[50] Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V. Gehler, and Bernt Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In CVPR, pages 4929–4937, 2016. 3, 7
[51] Tobias Pohlen, Alexander Hermans, Markus Mathias, and Bastian Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In CVPR, 2017. 3
[52] PoseTrack. PoseTrack Leader Board. https://posetrack.net/leaderboard.php. 7
[53] Mohamed Samy, Karim Amer, Kareem Eissa, Mahmoud Shaker, and Mohamed ElHelw. Nu-net: Deep residual wide field of view convolutional neural network for semantic segmentation. In CVPRW, June 2018. 3
[54] Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In NIPS, pages 4053–4061, 2016. 3
[55] Taiki Sekii. Pose proposal networks. In ECCV, September 2018. 1
[56] Ke Sun, Cuiling Lan, Junliang Xing, Wenjun Zeng, Dong Liu, and Jingdong Wang. Human pose estimation using global and local normalization. In ICCV, pages 5600–5608, 2017. 2, 6
[57] Ke Sun, Mingjie Li, Dong Liu, and Jingdong Wang. IGCV3: interleaved low-rank group convolutions for efficient deep neural networks. In BMVC, page 101. BMVA Press, 2018. 3
[58] Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In ECCV, pages 536–553, 2018. 5
[59] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015. 3
[60] Wei Tang, Pei Yu, and Ying Wu. Deeply learned compositional models for human pose estimation. In ECCV, September 2018. 2, 6
[61] Zhiqiang Tang, Xi Peng, Shijie Geng, Lingfei Wu, Shaoting Zhang, and Dimitris N. Metaxas. Quantized densely connected u-nets for efficient landmark localization. In ECCV, pages 348–364, 2018. 6
[62] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In CVPR, pages 648–656, 2015. 3
[63] Jonathan J. Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, pages 1799–1807, 2014. 2
[64] Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In CVPR, pages 1653–1660, 2014. 2
[65] Jingdong Wang, Zhen Wei, Ting Zhang, and Wenjun Zeng. Deeply-fused nets. CoRR, abs/1605.07716, 2016. 3
[66] Zhicheng Wang, Wenbo Li, Binyi Yin, Qixiang Peng, Tianzi Xiao, Yuming Du, Zeming Li, Xiangyu Zhang, Gang Yu, and Jian Sun. Mscoco keypoints challenge 2018. In Joint Recognition Challenge Workshop at ECCV 2018, 2018. 5
[67] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In CVPR, pages 4724–4732, 2016. 3, 6
[68] Jiahong Wu, He Zheng, Bo Zhao, Yixin Li, Baoming Yan, Rui Liang, Wenjia Wang, Shipei Zhou, Guosen Lin, Yanwei Fu, et al. Ai challenger: A large-scale dataset for going deeper in image understanding. arXiv preprint arXiv:1711.06475, 2017. 6

[69] Fangting Xia, Peng Wang, Xianjie Chen, and Alan L. Yuille.
Joint multi-person pose estimation and semantic part seg-
mentation. In CVPR, pages 6080–6089, 2017. 1
[70] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines
for human pose estimation and tracking. In ECCV, pages
472–487, 2018. 1, 2, 3, 5, 6, 7, 8
[71] Guotian Xie, Jingdong Wang, Ting Zhang, Jianhuang Lai,
Richang Hong, and Guo-Jun Qi. Interleaved structured
sparse convolutional neural networks. In CVPR, pages 8847–
8856. IEEE Computer Society, 2018. 3
[72] Saining Xie and Zhuowen Tu. Holistically-nested edge de-
tection. In ICCV, pages 1395–1403, 2015. 3
[73] Yichong Xu, Tianjun Xiao, Jiaxing Zhang, Kuiyuan Yang,
and Zheng Zhang. Scale-invariant convolutional neural net-
works. CoRR, abs/1411.6369, 2014. 3
[74] Wei Yang, Shuang Li, Wanli Ouyang, Hongsheng Li, and
Xiaogang Wang. Learning feature pyramids for human pose
estimation. In ICCV, pages 1290–1299, 2017. 1, 2, 3, 5, 6
[75] Wei Yang, Wanli Ouyang, Hongsheng Li, and Xiaogang
Wang. End-to-end learning of deformable mixture of parts
and deep convolutional neural networks for human pose es-
timation. In CVPR, pages 3073–3082, 2016. 2
[76] Yi Yang and Deva Ramanan. Articulated pose estimation
with flexible mixtures-of-parts. In CVPR, pages 1385–1392,
2011. 2
[77] Ting Zhang, Guo-Jun Qi, Bin Xiao, and Jingdong Wang. In-
terleaved group convolutions. In ICCV, pages 4383–4392,
2017. 3
[78] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang
Wang, and Jiaya Jia. Pyramid scene parsing network. In
CVPR, pages 6230–6239, 2017. 3
[79] Liming Zhao, Mingjie Li, Depu Meng, Xi Li, Zhaoxi-
ang Zhang, Yueting Zhuang, Zhuowen Tu, and Jingdong
Wang. Deep convolutional neural networks with merge-and-
run mappings. In IJCAI, pages 3170–3176, 2018. 3
[80] Yisu Zhou, Xiaolin Hu, and Bo Zhang. Interlinked convo-
lutional neural networks for face parsing. In ISNN, pages
222–231, 2015. 3
[81] Xiangyu Zhu, Yingying Jiang, and Zhenbo Luo. Multi-
person pose estimation for posetrack with enhanced part
affinity fields. In ICCV PoseTrack Workshop, 2017. 7

