
Microprocessors and Microsystems 96 (2023) 104739


A simple and effective multi-person pose estimation model for low power embedded system

Hua Li a,∗, Shiping Wen b,∗, Kaibo Shi c

a School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
b Australian AI Institute, University of Technology Sydney, Ultimo, NSW 2007, Australia
c School of Information Science and Engineering, Chengdu University, Chengdu, Sichuan, China

∗ Corresponding authors. E-mail addresses: huali@std.uestc.edu.cn (H. Li), shiping.wen@uts.edu.au (S. Wen), skbs111@163.com (K. Shi).

https://doi.org/10.1016/j.micpro.2022.104739
Received 22 September 2021; Received in revised form 19 July 2022; Accepted 4 December 2022; Available online 5 December 2022
0141-9331/© 2022 Elsevier B.V. All rights reserved.

ARTICLE INFO

Keywords: Multi-person pose estimation; Low power embedded system; Simple and effective network; Internet of things

ABSTRACT

In recent years, algorithms based on human pose estimation have been applied more and more in low power embedded systems. However, keypoint detection under occlusion is not well solved, resulting in poor performance in practical applications on embedded devices. In this paper, we propose a novel Simple and Effective Network (SEN) to deal with the multi-person pose estimation problem on low power embedded systems by detecting occluded keypoints well to a certain extent. This model is easy to apply to embedded devices, and has the characteristics of simplicity, strong expansibility and wide applicability, which are very important in the world of the Internet of things. Our model contains three novel modules: the Feature Fusion Module (FFM), the Channel Enhancement Attention Module (CEAM), and the Feature Enhancement Module (FEM). The FFM fuses the shallow and deep feature maps, bringing rich context information to the model. At the same time, it can alleviate the problem of information loss caused by downsampling operations and locate the keypoints more accurately. The FEM and the CEAM act on the deep feature maps of the network, which helps to infer keypoints that are occluded or invisible. Experiments show that the proposed method is effective and achieves superior performance on two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset.

1. Introduction

Multi-person pose estimation is the basis of many computer vision tasks, and is itself particularly challenging. The goal is to locate all human joints (eyes, wrists, knees, etc.) in images of any scene. Its application scenarios are particularly rich, including human performance analysis [1], emotion analysis [2], human action recognition [3,4], human–computer interaction [5] and human pose tracking [6]. It is worth mentioning that there are many applications based on human pose estimation on embedded devices, such as novel view generation for humans, virtual fitness coaching and human motion transfer. Thanks to the improvement of the computing power of embedded devices and the continuous optimization of algorithm performance, our environment has become more and more intelligent. Researchers are putting more energy into research on the Internet of things, smart homes and smart coaching. This paper focuses on human pose estimation, which has been widely used in low power embedded systems. Further research on it will greatly expand its application on embedded devices.

In recent years, the convolutional neural network (CNN) [7] has been playing an important role in various computer vision tasks. It has become more and more popular in multi-person pose estimation and has obtained excellent results [8–11]. Most existing methods are devoted to designing complex network structures, which leads to increasingly complex network models. For example, in [11], an hourglass module for continuous upsampling and downsampling of features is proposed, and the hourglass module is repeatedly stacked to complete the pose estimation. The Cascaded Pyramid Network (CPN) [9] includes two stages: GlobalNet and RefineNet. GlobalNet locates the simpler keypoints, while RefineNet locates the keypoints that are more difficult. Although current methods have achieved promising results on common benchmarks such as COCO [12] and MPII [13], there are still some challenging problems that have not been solved well. For instance, keypoints in complex backgrounds, keypoints where multiple people occlude each other, and keypoints of complex actions cannot be accurately located.

The main reasons are three-fold: (1) The shallow and deep feature maps are not effectively used in some existing CNN-based methods [7]. Shallow layers have high-resolution feature maps and rich local information, which can help us accurately locate keypoints. Deep layers have low-resolution feature maps and rich semantic information, which can help us infer the locations of occluded and invisible keypoints. So the shallow and deep feature maps are extremely important. (2) Downsampling causes some information to be lost, and most methods do not recover it well. Most methods use downsampling operations, which may cause important information to be lost, resulting in inaccurate positioning of keypoints. How to compensate for the lost information is the key to improving performance. (3) Some methods treat all spatial positions and channel feature maps equally, which results in the network model not focusing on the joints that need to be detected. Focusing on the area that needs to be detected is helpful for detecting keypoints in the presence of occlusion.


In order to simply and effectively solve the problems mentioned above, so as to better locate keypoints in the presence of occlusion and invisibility, we propose the Simple and Effective Network (SEN). Our network structure is inspired by SimpleBaseline [8]. Their work shows that simple models also perform well in human pose estimation. On the basis of SimpleBaseline [8], we introduce the Feature Fusion Module (FFM), the Feature Enhancement Module (FEM) and the Channel Enhancement Attention Module (CEAM). The FFM fuses the shallow and deep feature maps, bringing rich context information to the model. It can alleviate the problem of information loss caused by downsampling operations and locate the keypoints more accurately. The FEM and the CEAM act on the deep feature maps of the network, which helps to infer keypoints that are occluded or invisible.

Based on our Simple and Effective Network, we use the top-down pipeline to deal with the problem of multi-person pose estimation. Firstly, a human body detector is used to detect all the human bounding boxes; then the keypoints of each person in the bounding boxes are located using our SEN model. The characteristic of our SEN model is that it is simple and effective. Our experiments are based on two common human pose estimation benchmarks: MS-COCO [12] and MPII [13]. On these two benchmarks, our model is significantly improved compared with SimpleBaseline [8] under a similar backbone and input size. On the COCO test-dev2017 dataset, our best model gains 1.4 AP points. On the MPII dataset, our best model achieves a PCKh@0.5 of 90.0.

In summary, our contributions are three-fold:

(1) We propose a novel network structure called the Simple and Effective Network (SEN). Compared with other increasingly complex network structures, our network has more advantages. This model is easy to apply to low power embedded devices, and has the characteristics of simplicity, strong expansibility and wide applicability, which are very important in the world of the Internet of things.

(2) We propose a Feature Fusion Module (FFM) to fuse low-level and high-level feature maps, so that spatial information and semantic information can be combined. In addition, we build a Feature Enhancement Module (FEM) and a Channel Enhancement Attention Module (CEAM) to help locate occluded or invisible keypoints.

(3) Our approach achieves superior performance and high efficiency on the COCO keypoint dataset [12] and the MPII human pose benchmark [13]. At the same time, our approach expands the idea of designing simple and effective models for application on embedded devices.

2. Related work

In recent years, driven by demand from embedded applications, Human Pose Estimation (HPE) has developed rapidly. Traditional methods handle the HPE problem with probabilistic graphical models [14] or pictorial structure models [15,16]. The characteristic of these methods is that they rely on manually extracted features to predict the locations of keypoints. CNNs [17] can not only extract features automatically, but also show excellent performance in many tasks, so the mainstream methods now deal with HPE using CNN models [8,10,11,18–20]. Our method also utilizes a CNN model [17]. This topic is classified into single-person and multi-person pose estimation. The former uses images cropped according to a human detector to predict the keypoints of a single person. The latter needs to locate the joint points of everyone in a picture.

Multi-Person Pose Estimation. Lately, researchers have become more enthusiastic about the study of multi-person pose estimation due to urgent needs of the market. However, it remains full of challenges, as there may be mutual occlusion among multiple people. There are two main means to realize multi-person pose estimation: top-down approaches and bottom-up approaches.

Bottom-Up Approaches. The first step of bottom-up approaches [21–27] is to predict all the joint points, and the second step is to divide them among the individuals according to a certain algorithm. OpenPose [24] uses the "part affinity fields" (PAFs) algorithm to map the relationships between different joint points, thus assigning the joint points to different people. Newell et al. [25] apply the "stacked hourglass network" [11] for keypoint prediction; at the same time, the keypoints are grouped by associative embedding. MultiPoseNet [27] establishes a dual-branch network to complete multi-person pose estimation. The keypoint subnet is used to predict human joints, and the person detection subnet is responsible for detecting the positions of the humans. Finally, the pose residual net (PRN) is used to assign the keypoints to different people.

Top-Down Approaches. Top-down approaches [8,9,28–30] divide the prediction of keypoints into two independent steps. Firstly, a human detector is used to detect all human bounding boxes in the image, and the image is cropped into fixed-size picture groups according to the bounding boxes. Then a single-person pose estimation model is used to predict the keypoints of each person. The whole pipeline is shown in Fig. 2. The "Cascaded Pyramid Network" (CPN) [9] uses the feature pyramid architecture [31] as its backbone, and divides keypoints into easy and hard levels. It uses GlobalNet to locate the easy keypoints and RefineNet to locate the hard keypoints. Mask-RCNN [30] directly adds a keypoint prediction branch to complete multi-person pose estimation. SimpleBaseline [8] detects joint points by using residual modules and deconvolution layers. It is a typical encoder–decoder model. Our work follows SimpleBaseline [8], and our core idea is to complete the pose estimation task simply and effectively.

Single Person Pose Estimation. There are two main ways to complete single-person pose estimation: direct coordinate regression [32–34] and heatmap prediction [10,35–38]. Due to the lack of spatial information, the performance of direct coordinate regression is much worse than that of heatmap prediction, so present methods almost all adopt heatmap prediction. Our method is also based on heatmap prediction. DeepPose [32] is a pioneering work in using a CNN [7] to settle the task of HPE. It trains directly on the keypoint coordinates. Tompson et al. [35] explored how to handle this problem by using a CNN [7] model to regress heatmaps of the joint points. Wei et al. [36] proposed "Convolutional Pose Machines" (CPMs) to predict the score maps of keypoints. It is a multi-stage CNN model, which uses intermediate supervision to avoid the gradient vanishing problem caused by the network being too deep. Yang et al. [10] tried to use pyramid features to deal with this problem. This method improved the performance of HPE to some extent.

Attention Mechanism. With the excellent performance of attention mechanisms in many visual tasks, such as action recognition [39,40], multi-label classification [41] and face reconstruction [42], the attention mechanism has become a research hotspot. Chu et al. [43] first applied an attention mechanism model to the task of HPE. They proposed a multi-context attention mechanism to deal with this task, which greatly advanced the performance. The most representative is the work of Su et al. [44]. They proposed two attention modules that enhance information to advance the performance of HPE. We were impressed by the superior performance of SENet [45], which won the ILSVRC 2017 classification task competition. So our work follows SENet and makes improvements based on it to improve the performance of human pose estimation.


Fig. 1. Overview of our architecture. We use ResNet as the backbone to extract features, and then use three deconvolution layers as the regressor. The Feature Fusion Module (FFM) fuses the shallow and deep feature maps, bringing rich context information to the model. The Feature Enhancement Module (FEM) and the Channel Enhancement Attention Module (CEAM) act on the deep feature maps of the network, which helps to infer keypoints that are occluded or invisible. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 2. The top-down pipeline used for HPE. For any source image, we first detect the location of each person, and the image is cropped to a fixed size according to each person's location as the input of the Convolutional Neural Network (CNN). The CNN then outputs heatmaps. Finally, the coordinates of the keypoints are obtained by applying the argmax function to the heatmaps.

Fig. 3. Human pose estimation using the simple baseline network. The orange parts represent feature maps processed by the residual modules. The brown parts represent feature maps processed by the deconvolution modules. The blue part represents the generated heatmaps of the keypoints. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

3. Approach

Our network follows SimpleBaseline [8] and builds on it by presenting the Feature Fusion Module (FFM), the Feature Enhancement Module (FEM), and the Channel Enhancement Attention Module (CEAM), which we cover in detail in this section. The network architecture is shown in Fig. 1. Before officially introducing our work, let us briefly review SimpleBaseline and the Squeeze-and-Excitation Network (SENet) [45].

3.1. Simple baseline network

The simple baseline network (SBN) [8] first uses ResNet as the backbone to extract image features, and then utilizes three deconvolution layers and a 1 × 1 convolution to generate heatmaps. The whole SBN model is shown in Fig. 3. For a clearer description, we use 𝐶2, 𝐶3, 𝐶4, 𝐶5 to represent the last four residual blocks, 𝐷1, 𝐷2, 𝐷3 to represent the deconvolution part following 𝐶5, and 𝐾 to represent the final generated heatmaps. The number of heatmaps is equal to the number of joint points. From 𝐶2 to 𝐶5, the feature map resolution is halved at each block, while the number of output channels is doubled. After 𝐶5, upsampling starts in order to generate the heatmaps; in this work, deconvolution is used to perform the upsampling.

The biggest advantage of SBN is simplicity, but it is not effective enough: it cannot locate occluded or invisible keypoints well. The reason is that there are many downsampling operations in the residual blocks, which cause some important information to be lost, while the network takes no measures to recover the lost information. More importantly, SBN generates heatmaps only through deconvolution, so the deep feature maps are not sufficiently mined. In order to make up for these shortcomings, we propose the Feature Fusion Module (FFM) to recover the lost information while providing rich context information, and we use the Feature Enhancement Module (FEM) and the Channel Enhancement Attention Module (CEAM) to mine the deep feature maps to better reason about occluded or invisible keypoints.
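To make the encoder–decoder structure concrete, the following is a minimal PyTorch sketch of an SBN-style head (a sketch based on the description above, not the authors' released code). The three deconvolution layers use the 4 × 4 kernels and 256 channels of the SimpleBaseline design; NUM_JOINTS and C5_CHANNELS are our own placeholder names.

```python
import torch.nn as nn

NUM_JOINTS = 17     # COCO labels up to 17 keypoints per person
C5_CHANNELS = 2048  # output channels of ResNet-50 block C5

class SBNHead(nn.Module):
    """D1-D3 (deconvolutions, each doubling resolution) followed by a
    1 x 1 convolution K that emits one heatmap per joint."""
    def __init__(self, in_channels=C5_CHANNELS, width=256):
        super().__init__()
        layers, prev = [], in_channels
        for _ in range(3):  # D1, D2, D3
            layers += [
                nn.ConvTranspose2d(prev, width, kernel_size=4,
                                   stride=2, padding=1, bias=False),
                nn.BatchNorm2d(width),
                nn.ReLU(inplace=True),
            ]
            prev = width
        self.deconv = nn.Sequential(*layers)
        self.final = nn.Conv2d(width, NUM_JOINTS, kernel_size=1)

    def forward(self, c5):
        return self.final(self.deconv(c5))
```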
3.2. Squeeze-and-excitation network

The squeeze-and-excitation network (SENet) won the ILSVRC 2017 classification task competition, which proves its superior performance. SENet is mainly composed of SE blocks. Here we mainly review the SE block, whose structure is shown in Fig. 4 (left). It is a typical channel attention block, which consists of two steps: a squeezing step and an excitation step.

In the squeezing step, the input feature maps of size 𝐻 × 𝑊 × 𝐶 are first passed through a global average pooling operation to obtain a channel feature descriptor of size 1 × 1 × 𝐶. A fully connected (FC) layer then reduces its dimensions to 1 × 1 × 𝐶∕𝑟, where 𝑟 is the dimensionality reduction ratio. In the excitation step, these values are first filtered using the ReLU activation function, then restored to their original size of 1 × 1 × 𝐶 using an FC layer, and finally channel importance weight values between 0 and 1 are generated by the Sigmoid function. After the squeezing and excitation steps, the feature maps input to the block and the learned channel importance weight values are multiplied to enhance the useful features (see Fig. 4).

Fig. 4. The figure on the left is the SE block of squeeze-and-excitation networks, and on the right is the Channel Enhancement Attention Module (CEAM) formed after our improvement. H, W, and C represent the height, width and number of channels of the feature maps, respectively. r is the channel compression ratio.
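For reference, the SE block just reviewed can be rendered in PyTorch roughly as follows (a common sketch, not SENet's official implementation):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze: global average pooling to 1 x 1 x C. Excitation:
    FC -> ReLU -> FC -> Sigmoid, then channel-wise multiplication."""
    def __init__(self, channels, r=16):  # r: dimensionality reduction ratio
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),  # channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # re-weight each channel of the input
```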
3.3. Feature Fusion Module

As is well known, in a convolutional neural network (CNN) [7], shallow feature maps have high resolution and rich spatial detail information, while deep feature maps have low resolution and strong semantic representation ability. If the two kinds of feature maps can be effectively combined in a CNN [7], the performance will certainly be improved. Based on this view, we advance the Feature Fusion Module (FFM), whose structure is shown in Fig. 1. The FFM directly adds the feature maps obtained from residual block 𝐶2 and deconvolution 𝐷3, which have the same channel dimension and resolution. The reasons why we only add the feature maps of these two parts are as follows: (1) This module is particularly simple, so it follows the principle of simple network structure design. (2) The feature maps obtained by 𝐶2 contain the most complete spatial information, since subsequent downsampling operations cause information loss, while the feature maps obtained by 𝐷3 contain the richest semantic information; adding these two parts together is the most cost-effective. The FFM can not only provide rich spatial and semantic information, but also recover the information lost due to downsampling, thus helping us locate keypoints more accurately.
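Since 𝐶2 and 𝐷3 share the same channel count and resolution, the fusion reduces to a single element-wise addition. A minimal sketch, with c2 and d3 standing in for the corresponding feature tensors:

```python
import torch

def feature_fusion(c2: torch.Tensor, d3: torch.Tensor) -> torch.Tensor:
    """FFM sketch: add the shallow C2 features (spatial detail) to the
    deep D3 features (semantics)."""
    assert c2.shape == d3.shape, "C2 and D3 must match in shape"
    return c2 + d3
```

Note that addition introduces no extra parameters, which is consistent with the simple-design principle stated above.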
3.4. Channel Enhancement Attention Module

The Channel Enhancement Attention Module (CEAM) is an improvement on the SE block. Its structure is shown in Fig. 4 (right). In the SE block, channel importance weight values in the range of 0 to 1 are generated by the Sigmoid function, and the weight values are then multiplied with the input feature maps. However, since the weight values are all less than 1, multiplying them with the input feature maps attenuates all features; this problem is particularly prominent in multi-attention systems [44,46,47]. To deal with this problem, and thus better serve HPE, we propose the Channel Enhancement Attention Module. We first change the Sigmoid function in the SE block to the ReLU function so that the generated channel importance weight values are not limited to the range of 0 to 1; another benefit is alleviating the gradient vanishing problem during model training. The final multiplication operation is then changed to an addition operation, which is more conducive to enhancing the expression of useful features.
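Under this description, the CEAM is an SE block with exactly those two changes; the following PyTorch sketch is our reading of the text, not the authors' code:

```python
import torch.nn as nn

class CEAM(nn.Module):
    """SE block variant: the final Sigmoid becomes ReLU (weights are no
    longer capped at 1) and the final multiplication becomes an addition,
    so useful channels are enhanced instead of attenuated."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.ReLU(inplace=True),  # replaces the Sigmoid
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x + w  # addition replaces multiplication
```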
3.5. Feature Enhancement Module

The Feature Enhancement Module (FEM) is shown in the purple dotted frame in Fig. 1. We add a new branch after the 𝐶5 stage of the residual blocks of SimpleBaseline. This branch first reduces the dimension of the feature channels to the same as 𝐷1 through a 1 × 1 convolution; note that the numbers of channels of 𝐷1, 𝐷2, and 𝐷3 are the same. Then the feature maps are continuously upsampled, with bilinear interpolation used to perform the upsampling. Upsampling is performed 3 times in total, recorded here as 𝑈1, 𝑈2, 𝑈3. Each upsampling doubles the resolution of the feature maps, finally restoring the resolution to the size of the heatmaps. After 𝑈1, 𝑈2, and 𝑈3, the resulting features are added to the features obtained by 𝐷1, 𝐷2, and 𝐷3, respectively.

As mentioned above, the feature maps processed by the residual block 𝐶5 contain rich semantic information, which is especially important for predicting keypoints that are occluded or invisible. In SimpleBaseline, these feature maps are only used through three deconvolution layers, which does not fully utilize them. It is to make up for this shortcoming that we propose the FEM. We use bilinear interpolation as a means of upsampling that complements the deconvolution operation, to enhance the expression of features. At the same time, we use the CEAM to focus on the information that is important to us and further strengthen the ability to express features. This is particularly effective for detecting occluded and invisible keypoints.

The Feature Fusion Module (FFM), Channel Enhancement Attention Module (CEAM) and Feature Enhancement Module (FEM) will be further discussed in the ablation experiments.
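A sketch of the FEM branch under the above description is given below. The exact position of the CEAM inside the branch is not fully specified in the text, so applying it once after the 1 × 1 reduction is our assumption; CEAM refers to the sketch in Section 3.4.

```python
import torch.nn as nn
import torch.nn.functional as F

class FEM(nn.Module):
    """1 x 1 convolution to match the deconvolution width, then three
    bilinear upsamplings U1-U3 (each doubling resolution) whose outputs
    are added to D1, D2 and D3 respectively."""
    def __init__(self, c5_channels=2048, width=256):
        super().__init__()
        self.reduce = nn.Conv2d(c5_channels, width, kernel_size=1)
        self.ceam = CEAM(width)  # placement of the CEAM is our assumption

    def forward(self, c5, d1, d2, d3):
        u = self.ceam(self.reduce(c5))
        fused = []
        for d in (d1, d2, d3):  # U1, U2, U3
            u = F.interpolate(u, scale_factor=2, mode='bilinear',
                              align_corners=False)
            fused.append(d + u)  # fuse with the deconvolution path
        return fused
```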
4. Experiments

The framework we use follows the top-down pipeline for multi-person pose estimation. First, we need to detect all people in the picture, crop each person from the image and resize the crop to the input size of our Simple and Effective Network model, and then predict the joint points of each person with our model. Our method is assessed on two mainstream HPE datasets: the COCO Keypoints Challenge dataset [12] and the MPII dataset [13].
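For concreteness, the whole pipeline can be sketched as below; detector, sen and crop_and_resize are hypothetical stand-ins for the person detector, the trained SEN model and a crop-and-resize helper, respectively.

```python
import torch

def estimate_poses(image, detector, sen, input_size=(256, 192)):
    """Top-down pipeline sketch (see Fig. 2): detect people, crop each
    person to a fixed size, predict heatmaps, take the per-joint argmax."""
    poses = []
    for box in detector(image):
        person = crop_and_resize(image, box, input_size)  # hypothetical helper
        heatmaps = sen(person.unsqueeze(0))[0]  # (K, H, W)
        k, h, w = heatmaps.shape
        flat = heatmaps.view(k, -1).argmax(dim=1)
        # (x, y) joint coordinates in heatmap space
        poses.append(torch.stack((flat % w, flat // w), dim=1))
    return poses
```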


4.1. COCO keypoint detection

Datasets. First, we validate our proposed approach on the COCO dataset. Each human body instance in this dataset is labeled with up to 17 keypoints (eyes, wrists, knees, etc.). The training set includes 57k images and 150k human body instances; 5k and 20k images make up the validation set and test set. Our model does not use other data for training; the data used are all from the COCO training set. By default, all ablation experiments are based on the COCO validation set. For the COCO dataset, we use the mainstream evaluation method, which is based on the "Object Keypoint Similarity" (OKS) [12]:

$\mathrm{OKS} = \dfrac{\sum_i \exp(-d_i^2 / 2 s^2 k_i^2)\, \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$   (1)

where $d_i$ represents the Euclidean distance between a ground-truth keypoint and the corresponding detected keypoint, $s$ is the object scale, $k_i$ is a constant that controls the falloff, and $v_i$ reflects the annotation and visibility of the ground-truth keypoint. The possible values of $v_i$ are 0, 1, 2: $v_i = 0$ indicates that the keypoint is not labeled, $v_i = 1$ indicates that the keypoint is labeled but not visible, and $v_i = 2$ indicates that the keypoint is labeled and visible. We calculate Average Precision (AP) based on OKS to evaluate the accuracy of keypoint localization.
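Eq. (1) translates directly into a few lines of NumPy; this sketch assumes d, k and v are per-keypoint arrays as defined above:

```python
import numpy as np

def oks(d, s, k, v):
    """d[i]: distance between ground-truth and detected keypoint i;
    s: object scale; k[i]: falloff constant; v[i]: visibility (0, 1, 2).
    Only labeled keypoints (v > 0) contribute."""
    labeled = v > 0
    e = np.exp(-d[labeled] ** 2 / (2 * s ** 2 * k[labeled] ** 2))
    return e.sum() / labeled.sum()
```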
Training. Our model training process is end-to-end. The Adam optimizer [48] and a minibatch size of 32 are used to update the parameters. The learning rate is initially set to 5e−4, and then decreased by a factor of 0.1 at epochs 90 and 120; a total of 140 epochs are trained. Our ResNet backbone parameters are initialized with a model pre-trained on ImageNet [49]. ResNet backbones with 50, 101 and 152 layers were used in the experiments. An input size of 256 × 192 and the ResNet-50 backbone are used by default.

We use the same data processing method as SimpleBaseline [8]. The human bounding box is extended to a fixed aspect ratio (e.g., height : width = 4 : 3), the box is then cropped from the picture, and finally it is resized to a fixed size (256 × 192) as the input image. The data augmentation includes flipping, random scaling (±40%) and random rotation (±30°). All our experiments are done with PyTorch [50] on two NVIDIA 1080Ti GPUs under the Ubuntu 16.04 operating system.
Testing. Our methods follow the top-down pipeline. For a fair comparison, we directly use the COCO val2017 and COCO test-dev2017 human bounding boxes provided by SimpleBaseline [8], where the boxes for COCO val2017 are detected using the Faster-RCNN algorithm [51]. Following conventional practice [9,11], the final heatmaps are obtained by averaging the heatmaps of the original image and those of the flipped image. We use the state-of-the-art method of [52] to obtain the coordinates of each keypoint from the heatmaps.
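The flip-averaging step can be sketched as follows (ignoring the one-pixel shift that some implementations apply after flipping); FLIP_PAIRS lists the left/right channel indices in the standard COCO keypoint order:

```python
import torch

FLIP_PAIRS = [(1, 2), (3, 4), (5, 6), (7, 8),
              (9, 10), (11, 12), (13, 14), (15, 16)]

def flip_test(model, images):
    """Average the heatmaps of the original and horizontally flipped input."""
    heat = model(images)
    heat_f = model(torch.flip(images, dims=[3]))  # flip input left-right
    heat_f = torch.flip(heat_f, dims=[3])         # flip heatmaps back
    for a, b in FLIP_PAIRS:                       # swap mirrored joints
        heat_f[:, [a, b]] = heat_f[:, [b, a]]
    return (heat + heat_f) / 2
```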
Results on COCO val2017. As shown in Table 1, our model is compared with the stacked Hourglass [11], CPN [9] and SBN [8] on the COCO val2017 dataset. The top-down pipeline is used by all of the above techniques, so an object detector is necessary. For a fair comparison, we use the same human bounding box detector as SimpleBaseline [8], namely the Faster-RCNN [51] detector, which has a human detection AP of 56.4. The human detection AP of the detector used by CPN [9] and the stacked Hourglass [11] is 55.3.

Table 1
Comparison with other mainstream methods on the COCO val2017 dataset. OHKM means "Online Hard Keypoints Mining" [9].

Method | Backbone | Input size | OHKM | AP(OKS)
8-stage Hourglass [11] | – | 256 × 192 | × | 66.9
8-stage Hourglass [11] | – | 256 × 256 | × | 67.1
CPN [9] | ResNet-50 | 256 × 192 | × | 68.6
CPN [9] | ResNet-50 | 384 × 288 | × | 70.6
CPN [9] | ResNet-50 | 256 × 192 | ✓ | 69.4
CPN [9] | ResNet-50 | 384 × 288 | ✓ | 71.6
SBN [8] | ResNet-50 | 256 × 192 | × | 70.4
SBN [8] | ResNet-50 | 384 × 288 | × | 72.2
SEN (Ours) | ResNet-50 | 256 × 192 | × | 71.6
SEN (Ours) | ResNet-50 | 384 × 288 | × | 73.5

In Table 1, our models outperform the 8-stage Hourglass [11], CPN [9] and SBN [8]. Under the same input size and backbone conditions, our model achieves an AP score of 71.6, which is 4.7, 3.0 and 1.2 points higher than the 8-stage Hourglass, CPN and SBN respectively. In CPN, using "Online Hard Keypoints Mining" (OHKM) [9] helps the model improve its AP by 0.8 (from 68.6 to 69.4), but it is still 2.2 points lower than our model. (i) Compared with the 8-stage Hourglass [11], its AP improves when the input size is 256 × 256, but our 256 × 192 input model still outperforms it. (ii) Compared with CPN [9], our models show excellent performance. When CPN uses a 384 × 288 input with OHKM, its AP equals that of our model with a 256 × 192 input and without OHKM; when our input is 384 × 288, our model is 2.9 and 1.9 points higher than CPN without and with OHKM respectively. (iii) Compared with SBN [8], our models improve by 1.2 and 1.3 points, respectively, when the input size is 256 × 192 and 384 × 288.

Results on COCO test-dev2017. Table 2 shows the results of our models and of methods from recent years on COCO test-dev2017. It can be seen that our method outperforms most existing methods. Our small model, with a ResNet-50 backbone and an input resolution of 256 × 192, reaches an AP of 71.4, which is better than most models in the same setting, and even better than some models that use deeper backbones and larger input sizes. Compared with the SBN models [8], under the same backbone and input size conditions, our models achieve AP improvements of 1.4, 0.7, and 0.6 respectively. It is particularly worth mentioning that our small model, with a ResNet-50 backbone and a resolution of 384 × 288, achieves an AP of 72.8, showing excellent performance.

4.2. MPII dataset

Datasets and evaluation metric. The MPII dataset contains about 25k images and more than 40k human instances. Each human body instance is labeled with at most 16 keypoints. The images are all captured from YouTube videos of human activities, which cover more than 410 actions. Like most existing methods, we use 22k images for training and 3k images for testing. On the MPII dataset, we use the standard metric "Percentage of Correct Keypoints with respect to head" (PCKh) [13] to evaluate performance. A prediction is counted as correct when

$\dfrac{\| y_i - \hat{y}_i \|_2}{\| y_{rhip} - y_{lsho} \|_2} \le r$   (2)

where $y_i$ and $\hat{y}_i$ represent the $i$th ground-truth keypoint coordinates and predicted keypoint coordinates respectively, and $y_{lsho}$ and $y_{rhip}$ are the ground-truth coordinates of the left shoulder and right hip respectively. $r$ is a threshold value in the range of 0 to 1, and $\| y_{rhip} - y_{lsho} \|_2$ is the torso distance. In our results, PCKh@0.5 ($r = 0.5$) is reported.
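A NumPy sketch of Eq. (2); the MPII indices of the right hip and left shoulder are assumptions here, and visibility handling is omitted:

```python
import numpy as np

RHIP, LSHO = 2, 13  # assumed MPII joint indices

def pckh(pred, gt, r=0.5):
    """pred, gt: arrays of shape (N, 16, 2). A joint is correct when its
    error is at most r times the torso (right hip to left shoulder) distance."""
    ref = np.linalg.norm(gt[:, RHIP] - gt[:, LSHO], axis=1, keepdims=True)
    err = np.linalg.norm(pred - gt, axis=2)
    return (err <= r * ref).mean()
```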
Training. Similar to the experiments on the COCO dataset, the Adam optimizer and a batch size of 32 are used to update the parameters. The initial learning rate is set to 5e−4, and the learning rate adjustment strategy and number of training epochs are the same as those on the COCO dataset. The only difference is that the size of the input human bounding box is 256 × 256. The data augmentation includes flipping, random scaling (±40%) and random rotation (±30°).

Testing. The testing method is the same as on the COCO dataset. The only difference is that we follow the commonly adopted approach of using the human position information provided by the dataset itself instead of positions detected by an object detector.

Results on MPII dataset. The PCKh@0.5 results on the MPII dataset are shown in Table 3.


Table 2
Comparison of final results on the COCO test-dev dataset. The methods compared in the table are trained only on the COCO trainval dataset; no additional data is used.

Method | Backbone | Input size | AP | AP.5 | AP.75 | AP(M) | AP(L) | AR
CMU-Pose [24] | – | – | 61.8 | 84.9 | 67.5 | 57.1 | 68.2 | 66.5
Mask-RCNN [30] | ResNet-50-FPN | – | 63.1 | 87.3 | 68.7 | 57.8 | 71.4 | –
Associative embedding [25] | – | 512 × 512 | 65.5 | 86.8 | 72.3 | 60.6 | 72.6 | 70.2
G-RMI [29] | ResNet-101 | 353 × 257 | 64.9 | 85.5 | 71.3 | 62.3 | 70.0 | 69.7
CPN [9] | ResNet-Inception | 384 × 288 | 72.1 | 91.4 | 80.0 | 68.7 | 77.2 | 78.5
RMPE [53] | PyraNet | 320 × 256 | 72.3 | 89.2 | 79.1 | 68.0 | 78.6 | –
Integral pose regression [54] | ResNet-101 | 256 × 256 | 67.8 | 88.2 | 74.8 | 63.9 | 74.0 | –
MultiPoseNet [27] | – | – | 69.6 | 86.3 | 76.6 | 65.0 | 76.3 | 73.5
CSANet [44] | ResNet-101 | 384 × 288 | 73.8 | 91.7 | 81.4 | 70.4 | 79.6 | 80.3
SBN [8] | ResNet-50 | 256 × 192 | 70.0 | 90.9 | 77.9 | 66.8 | 75.8 | 75.6
SBN [8] | ResNet-101 | 384 × 288 | 73.2 | 91.4 | 80.9 | 69.7 | 79.5 | 78.5
SBN [8] | ResNet-152 | 384 × 288 | 73.7 | 91.9 | 81.1 | 70.3 | 80.0 | 79.0
SEN (Ours) | ResNet-50 | 256 × 192 | 71.4 | 91.1 | 79.3 | 68.1 | 77.5 | 77.0
SEN (Ours) | ResNet-50 | 384 × 288 | 72.8 | 91.3 | 80.0 | 69.1 | 79.3 | 78.0
SEN (Ours) | ResNet-101 | 384 × 288 | 73.9 | 91.6 | 81.1 | 70.3 | 80.3 | 79.0
SEN (Ours) | ResNet-152 | 384 × 288 | 74.3 | 92.1 | 81.5 | 70.8 | 80.6 | 79.5

We reproduced the scores of SBN [8], which uses ResNet-152 as the backbone with an input resolution of 256 × 256. Our model attains a PCKh@0.5 score of 90.0, which outperforms SimpleBaseline [8] and other approaches [19,23,36,55,56]. In particular, our SEN model gains 0.4 PCKh@0.5 points over SBN [8], proving the effectiveness of our proposed model.

Table 3
Results on the MPII dataset. SBN-152 and SEN-152 indicate that the backbone used is ResNet-152.

Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Total
Pishchulin et al. [57] | 74.3 | 49.0 | 40.8 | 34.1 | 36.5 | 34.4 | 35.2 | 44.1
Tompson et al. [35] | 95.8 | 90.3 | 80.5 | 74.3 | 77.6 | 69.7 | 62.8 | 79.6
Carreira et al. [37] | 95.7 | 91.7 | 81.7 | 72.4 | 82.8 | 73.2 | 66.4 | 81.3
Tompson et al. [58] | 96.1 | 91.9 | 83.9 | 77.8 | 80.9 | 72.3 | 64.8 | 82.0
Hu et al. [59] | 95.0 | 91.6 | 83.0 | 76.6 | 81.9 | 74.5 | 69.5 | 82.4
Pishchulin et al. [22] | 94.1 | 90.2 | 83.4 | 77.3 | 82.6 | 75.7 | 68.6 | 82.4
Lifshitz et al. [60] | 97.8 | 93.3 | 85.7 | 80.4 | 85.3 | 76.6 | 70.2 | 85.0
Gkioxary et al. [19] | 96.2 | 93.1 | 86.7 | 82.1 | 85.2 | 81.4 | 74.1 | 86.1
Rafi et al. [56] | 97.2 | 93.9 | 86.4 | 81.3 | 86.8 | 80.6 | 73.4 | 86.3
Belagiannis et al. [55] | 97.7 | 95.0 | 88.2 | 83.0 | 87.9 | 82.6 | 78.4 | 88.1
Insafutdinov et al. [23] | 96.8 | 95.2 | 89.3 | 84.4 | 88.4 | 83.4 | 78.0 | 88.5
Wei et al. [36] | 97.8 | 95.0 | 88.7 | 84.0 | 88.4 | 82.8 | 79.4 | 88.5
SBN-152 [8] | 97.0 | 95.9 | 90.0 | 85.0 | 89.2 | 85.3 | 81.3 | 89.6
SEN-152 (Ours) | 97.0 | 95.9 | 90.4 | 85.3 | 89.4 | 86.0 | 82.4 | 90.0

4.3. Ablation study

We first verify the performance of our proposed modules, and then explore the influence of different backbones and input resolutions on the final results. All the experimental data are obtained on the COCO val2017 dataset. Unless otherwise specified, the default backbone of our model is ResNet-50, and the default input resolution is 256 × 192 (see Table 4).

Table 4
Ablation study of the FFM, CEAM and FEM modules on the COCO validation set. FFM means Feature Fusion Module, CEAM means Channel Enhancement Attention Module, and FEM means Feature Enhancement Module.

Method | FFM | CEAM | FEM | AP(OKS)
SBN (baseline) [8] | | | | 70.4
SBN+FFM [8] | ✓ | | | 70.8
SBN+FEMᵃ [8] | | | ✓ | 71.2
SBN+FEMᵇ [8] | | ✓ | ✓ | 71.4
SBN+FFM+CEAM+FEM [8] | ✓ | ✓ | ✓ | 71.6

ᵃ Indicates that the FEM does not include the CEAM module.
ᵇ Indicates that the FEM includes the CEAM module.

Feature Fusion Module: HPE is a fine, pixel-level assignment task. Like semantic segmentation, it needs to operate in the image pixel space. It is more challenging than other jobs, so it needs more in-depth study. In order to locate the keypoints of the human body more accurately, we propose the Feature Fusion Module (FFM). We use SBN [8] as the baseline method, add the FFM on this basis, and keep the other parts unchanged to assess the performance of our proposed module. The AP performance on the COCO minival dataset is shown in Table 4. As can be seen from the table, the AP score of SBN [8] is 70.4; when the FFM is added, the AP score reaches 70.8, an increase of 0.4 points. This shows that the FFM can bring a performance gain to the model. At the same time, such a simple operation is quite cost-effective.

Feature Enhancement Module: In a convolutional neural network (CNN) [7], the deep feature maps contain rich semantic information, which is especially important for predicting keypoints that are obscured or invisible. To take full advantage of the deep feature maps to detect invisible or obscured human keypoints, we put forward the Feature Enhancement Module (FEM). It exploits bilinear interpolation for upsampling and complements the deconvolution operation in SBN [8]. We use SBN [8] as the baseline method, add the FEM on this basis, and keep the other parts unchanged to assess the performance of our proposed module. The results on the COCO minival dataset are shown in Table 4. As can be seen from the table, with the addition of the FEM, the AP score reaches 71.2, which is 0.8 higher than SBN [8]. The performance improvement of the FEM shows that this module is particularly useful, especially for predicting invisible or occluded keypoints.

Channel Enhancement Attention Module: As is well known, SENet [45] is a typical channel attention model. The SE block can judge the importance of each channel for a specific task, so that the model focuses on the channels with higher importance. However, the SE block uses the Sigmoid function to generate channel importance weights with values between 0 and 1, which degrade the features after being multiplied with the original feature maps, thus affecting the performance of the model. Based on the SE block, we put forward the Channel Enhancement Attention Module (CEAM) to strengthen the expression ability of the feature maps. Because the CEAM relies on the FEM, when verifying the effectiveness of the CEAM we compare the AP of SBN [8] plus the FEM module in the presence and in the absence of the CEAM. The AP performance on the COCO minival dataset is shown in Table 4. When only the FEM is added, the AP score is 71.2. When the FEM and CEAM are added at the same time, the model AP reaches 71.4, which is 0.2 points higher. This shows that the proposed CEAM is effective.

As shown in Table 4, when we add the FFM, CEAM and FEM to the SBN [8] model at the same time, the AP score reaches 71.6, which outperforms the baseline SBN [8] by 1.2 AP. The improvement in our model's AP is particularly obvious, which fully proves that our model is useful.

Effect of Backbone Network: In our work, we also explore the influence of different backbone networks on human pose estimation. The AP scores for the different backbones are shown in Table 5. The AP scores of ResNet-50, ResNet-101 and ResNet-152 are 71.6, 72.5 and 73.1, respectively. Going from ResNet-50 to ResNet-101 and from ResNet-101 to ResNet-152 increases the AP by 0.9 and 0.6, respectively. From the above results, it can be concluded that after the network reaches a certain depth, the performance improvement brought by further increasing the depth is limited. Therefore, across different tasks, the commonly used backbones are ResNet-50, ResNet-101 and ResNet-152 rather than deeper ones. On the whole, the ResNet-50 backbone is the most cost-effective.

Table 5
Comparison of the results of different backbones on the COCO minival dataset.

Backbone | AP(OKS)
SEN (ResNet-50) [17] | 71.6
SEN (ResNet-101) [17] | 72.5
SEN (ResNet-152) [17] | 73.1

Effect of input size: In our work, we also explore the influence of different input sizes on the AP score of the final model. In theory, in order to pursue better performance, we would use a larger input resolution. But a larger input resolution means a higher amount of computation, which significantly increases model training time. Therefore, it is necessary to balance the performance and training time of the model, and the input resolution generally used is 256 × 192. Here, we compare the model performance under three input sizes: 128 × 96, 256 × 192 and 384 × 288. The AP results for different input sizes on the COCO minival dataset are shown in Table 6. The AP scores of our model with input sizes of 128 × 96, 256 × 192 and 384 × 288 are 62.6, 71.6 and 73.5, respectively. We can see that when the input size is reduced from 256 × 192 to 128 × 96, the AP score decreases significantly, by a total of 9 points, which is very bad. When the input resolution is increased from 256 × 192 to 384 × 288, the AP score of the model also improves significantly, by a total of 1.9 points. At the same time, it can be seen that the SBN [8] model with a 384 × 288 input is also 1.8 points higher in AP than in the 256 × 192 case. In human pose estimation tasks, the commonly used input resolutions are 256 × 192 and 384 × 288.

Table 6
Comparison of the results of different input sizes on the COCO minival dataset.

Models | Input size | AP(OKS)
SBN [8] | 256 × 192 | 70.4
SBN [8] | 384 × 288 | 72.2
SEN (Ours) | 128 × 96 | 62.6
SEN (Ours) | 256 × 192 | 71.6
SEN (Ours) | 384 × 288 | 73.5

Fig. 5. Qualitative examples of the predictions of our network on the COCO dataset. The two images in the first column have multiple people occluding each other. The two images in the second column have other objects occluding the human keypoints. The two pictures in the third column show complex outdoor activities.

4.4. Qualitative results

As shown in Fig. 5, we present some visual results of our Simple and Effective Network (SEN) on the COCO dataset [12]. As we can see, there are six sub-images in this figure, each of which contains multiple people. On the whole, our SEN model shows excellent performance on the multi-person pose estimation task. It can detect the keypoints of all people in the image very well, including keypoints involved in complex actions, mutual occlusion and occlusion by other objects. Looking at the details, the two pictures in the first column have multiple people occluding each other. It is very challenging to detect such keypoints, but the SEN model we propose alleviates this problem to a certain extent: the mutually occluded keypoints are well detected. The two pictures in the second column show situations where the keypoints of people are invisible due to other objects. In particular, in the picture in the second column of the first row, many keypoints of the girl in the middle are blocked by the umbrella. In this case, our model also shows robust performance, inferring the severely occluded keypoints very well. The people in the two pictures in the third column are engaged in complex outdoor sports, and our model can also detect their keypoints very well. In summary, our model can alleviate the problem of detecting keypoints of the human body under occluded or invisible conditions to a certain extent. Its biggest feature is that it is simple and effective, which has certain reference significance for the future design of simple and lightweight models.

4.5. Remark

Our proposed HPE network has broad application prospects for low power embedded devices. For example, it can be used in smart cities to monitor in real time whether elderly people fall in the street. It can also be used as a virtual fitness trainer on smart phones to correct our movements. More importantly, it can be used in the field of security. The ability to detect abnormal human behavior using smart cameras


has important implications for maintaining social peace and stability. In short, HPE technology is used in human interaction scenarios, and it has a huge application space in low-power embedded devices such as smartphones, smart cameras, and smart watches.

5. Conclusion

In this paper, we use the top-down pipeline to deal with the multi-person pose estimation problem. We design a novel Simple and Effective Network (SEN) to locate human keypoints, especially those under occlusion. This model is easy to apply to low power embedded devices, and has the characteristics of simplicity, strong expansibility and wide applicability, which are very important in the world of the Internet of things. In SEN, we propose the Feature Fusion Module (FFM) to settle the problem of information loss in the process of downsampling and to integrate spatial information and semantic information, which can help the model locate the keypoints more accurately. At the same time, we also propose the Channel Enhancement Attention Module (CEAM) to strengthen the expression ability of the feature maps and highlight the features useful for locating occluded keypoints. In addition, we also propose the Feature Enhancement Module (FEM) to take full advantage of the high-level feature maps, which are particularly useful for model inference of occluded or invisible keypoints. In general, our model achieves outstanding performance on the COCO dataset and the MPII dataset. We hope that our work can provide a certain reference value for designing simple, effective and lightweight human pose estimation models. In this way, more and more applications will appear on embedded devices, making our environment more intelligent.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

References

[1] C. Torres, J. Fried, K. Rose, B. Manjunath, A multiview multimodal system for monitoring patient sleep, IEEE Trans. Multimed. 20 (11) (2018) 3057–3068.
[2] P. Buitelaar, I. Wood, S. Negi, M. Arcan, J. McCrae, A. Abele, C. Robin, V. Andryushechkin, H. Ziad, H. Sagha, Mixedemotions: An open-source toolbox for multimodal emotion analysis, IEEE Trans. Multimed. 20 (9) (2018) 2454–2465.
[3] X. Cai, W. Zhou, L. Wu, J. Luo, H. Li, Effective active skeleton representation for low latency human action recognition, IEEE Trans. Multimed. 18 (2) (2015) 141–154.
[4] Z. Fan, X. Zhao, T. Lin, H. Su, Attention-based multiview re-observation fusion network for skeletal action recognition, IEEE Trans. Multimed. 21 (2) (2018) 363–374.
[5] R. Marcos, D. Pizarro, R. Marron, P. Gatica, Let your body speak: Communicative cue extraction on natural interaction using RGBD data, IEEE Trans. Multimed. 17 (10) (2015) 1721–1732.
[6] N. Cho, A. Yuille, S. Lee, Adaptive occlusion state estimation for human pose tracking under self-occlusions, Pattern Recognit. 46 (3) (2013) 649–661.
[7] A. Krizhevsky, I. Sutskever, G. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[8] B. Xiao, H. Wu, Y. Wei, Simple baselines for human pose estimation and tracking, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 466–481.
[9] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, J. Sun, Cascaded pyramid network for multi-person pose estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7103–7112.
[10] W. Yang, S. Li, W. Ouyang, H. Li, X. Wang, Learning feature pyramids for human pose estimation, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1281–1290.
[11] A. Newell, K. Yang, J. Deng, Stacked hourglass networks for human pose estimation, in: European Conference on Computer Vision, Springer, 2016, pp. 483–499.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, C. Zitnick, Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[13] M. Andriluka, L. Pishchulin, P. Gehler, B. Schiele, 2D human pose estimation: New benchmark and state of the art analysis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3686–3693.
[14] X. Chen, A. Yuille, Articulated pose estimation by a graphical model with image dependent pairwise relations, in: Advances in Neural Information Processing Systems, 2014, pp. 1736–1744.
[15] M. Andriluka, S. Roth, B. Schiele, Pictorial structures revisited: People detection and articulated pose estimation, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 1014–1021.
[16] M. Fischler, R. Elschlager, The representation and matching of pictorial structures, IEEE Trans. Comput. 100 (1) (1973) 67–92.
[17] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[18] X. Fan, K. Zheng, Y. Lin, S. Wang, Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1347–1355.
[19] G. Gkioxari, A. Toshev, N. Jaitly, Chained predictions using convolutional neural networks, in: European Conference on Computer Vision, Springer, 2016, pp. 728–743.
[20] K. Sun, B. Xiao, D. Liu, J. Wang, Deep high-resolution representation learning for human pose estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5693–5703.
[21] U. Iqbal, J. Gall, Multi-person pose estimation with local joint-to-person associations, in: European Conference on Computer Vision, Springer, 2016, pp. 627–642.
[22] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, B. Schiele, Deepcut: Joint subset partition and labeling for multi person pose estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4929–4937.
[23] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, B. Schiele, Deepercut: A deeper, stronger, and faster multi-person pose estimation model, in: European Conference on Computer Vision, Springer, 2016, pp. 34–50.
[24] Z. Cao, T. Simon, S. Wei, Y. Sheikh, Realtime multi-person 2D pose estimation using part affinity fields, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291–7299.
[25] A. Newell, Z. Huang, J. Deng, Associative embedding: End-to-end learning for joint detection and grouping, in: Advances in Neural Information Processing Systems, 2017, pp. 2277–2287.
[26] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, B. Schiele, Arttrack: Articulated multi-person tracking in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6457–6465.
[27] M. Kocabas, S. Karagoz, E. Akbas, Multiposenet: Fast multi-person pose estimation using pose residual network, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 417–433.
[28] G. Ning, Z. Zhang, Z. He, Knowledge-guided deep fractal neural networks for human pose estimation, IEEE Trans. Multimed. 20 (5) (2017) 1246–1259.
[29] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, K. Murphy, Towards accurate multi-person pose estimation in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4903–4911.
[30] K. He, G. Gkioxari, P. Dollar, R. Girshick, Mask R-CNN, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[31] T. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[32] A. Toshev, C. Szegedy, Deeppose: Human pose estimation via deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1653–1660.
[33] X. Sun, J. Shang, S. Liang, Y. Wei, Compositional human pose regression, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2602–2611.
[34] D. Luvizon, H. Tabia, D. Picard, Human pose regression by combining indirect part detection and contextual information, Comput. Graph. 85 (2019) 15–22.
[35] J. Tompson, A. Jain, Y. LeCun, C. Bregler, Joint training of a convolutional network and a graphical model for human pose estimation, in: Advances in Neural Information Processing Systems, 2014, pp. 1799–1807.
[36] S. Wei, V. Ramakrishna, T. Kanade, Y. Sheikh, Convolutional pose machines, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.
[37] J. Carreira, P. Agrawal, K. Fragkiadaki, J. Malik, Human pose estimation with iterative error feedback, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4733–4742.
[38] Y. Chen, C. Shen, X.-S. Wei, L. Liu, J. Yang, Adversarial posenet: A structure-aware convolutional network for human pose estimation, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1212–1221.


[39] J. Liu, G. Wang, P. Hu, L. Duan, A. Kot, Global context-aware attention LSTM networks for 3D action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1647–1656.
[40] J. Li, X. Liu, M. Zhang, D. Wang, Spatio-temporal deformable 3D convnets with attention for action recognition, Pattern Recognit. 98 (2020) 107037.
[41] E. Kim, K. On, J. Kim, Y. Heo, S. Choi, H. Lee, B. Zhang, Temporal attention mechanism with conditional inference for large-scale multi-label video classification, in: Proceedings of the European Conference on Computer Vision, 2018.
[42] Q. Liu, R. Jia, C. Zhao, X. Liu, H. Sun, X. Zhang, Face super-resolution reconstruction based on self-attention residual network, IEEE Access 8 (2019) 4110–4121.
[43] X. Chu, W. Yang, W. Ouyang, C. Ma, A. Yuille, X. Wang, Multi-context attention for human pose estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1831–1840.
[44] K. Su, D. Yu, Z. Xu, X. Geng, C. Wang, Multi-person pose estimation with enhanced channel-wise and spatial information, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5674–5682.
[45] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[46] S. Woo, J. Park, J. Lee, S. KweonIn, CBAM: Convolutional block attention module, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 3–19.
[47] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, H. Lu, Dual attention network for scene segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3146–3154.
[48] D. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014, arXiv preprint arXiv:1412.6980.
[49] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, Imagenet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015) 211–252.
[50] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch, 2017.
[51] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Adv. Neural Inf. Process. Syst. 28 (2015).
[52] F. Zhang, X. Zhu, H. Dai, M. Ye, C. Zhu, Distribution-aware coordinate representation for human pose estimation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7093–7102.
[53] H. Fang, S. Xie, Y. Tai, C. Lu, RMPE: Regional multi-person pose estimation, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2334–2343.
[54] X. Sun, B. Xiao, F. Wei, S. Liang, Y. Wei, Integral human pose regression, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 529–545.
[55] V. Belagiannis, A. Zisserman, Recurrent human pose estimation, in: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), IEEE, 2017, pp. 468–475.
[56] U. Rafi, B. Leibe, J. Gall, I. Kostrikov, An efficient convolutional network for human pose estimation, in: BMVC, Vol. 1, 2016, p. 2.
[57] L. Pishchulin, M. Andriluka, P. Gehler, B. Schiele, Strong appearance and expressive spatial models for human pose estimation, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 3487–3494.
[58] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, C. Bregler, Efficient object localization using convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 648–656.
[59] P. Hu, D. Ramanan, Bottom-up and top-down reasoning with hierarchical rectified gaussians, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5600–5609.
[60] I. Lifshitz, E. Fetaya, S. Ullman, Human pose estimation using deep consensus voting, in: European Conference on Computer Vision, Springer, 2016, pp. 246–260.

Hua Li received his B.Eng. degree from Western University of China, Chengdu, China in 2019. He is currently working towards the M.Eng. degree at the University of Electronic Science and Technology of China, Chengdu, China. His research interests include neural networks, computer vision, deep learning, and pattern recognition.

Shiping Wen is a Professor at the Australian Artificial Intelligence Institute, University of Technology Sydney, Australia. He received an M.Eng. degree in Control Science and Engineering from the School of Automation, Wuhan University of Technology, Wuhan, China, in 2010, and a Ph.D. degree in Control Science and Engineering from the School of Automation, Huazhong University of Science and Technology, Wuhan, China, in 2013. His research interests include memristor-based neural networks, deep learning, computer vision, and their applications in medical informatics, among others. In 2018 and 2020, he was listed as a Clarivate Analytics Highly Cited Researcher in the Cross-Field category. He received the 2017 Young Investigator Award of the Asian Pacific Neural Network Association and the 2015 Chinese Association of Artificial Intelligence Outstanding Ph.D. Dissertation Award. He currently serves as an Associate Editor for Knowledge-Based Systems, IEEE Access, and Neural Processing Letters, and has served as Leading Guest Editor of special issues in IEEE Transactions on Network Science and Engineering, Sustainable Cities and Society, and Environmental Research Letters, among others. He has also served as a general/publication chair or a member of the Technical Program Committee for various international conferences. He is a Senior Member of IEEE.

Kaibo Shi received the Ph.D. degree from the School of Automation Engineering, University of Electronic Science and Technology of China. From September 2014 to September 2015, he was a Visiting Scholar with the Department of Applied Mathematics, University of Waterloo, Waterloo, ON, Canada. He was a Research Assistant with the Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Taipa, from May 2016 to June 2016, and from January 2017 to October 2017. From December 2019 to January 2020, he was also a Visiting Scholar with the Department of Electrical Engineering, Yeungnam University, Gyeongsan, South Korea. He is currently a Professor with the School of Information Sciences and Engineering, Chengdu University. He is the author or coauthor of over 60 research articles. His current research interests include stability theory, robust control, sampled-data control systems, networked control systems, Lurie chaotic systems, stochastic systems, and neural networks. He is a very active reviewer for many international journals.
