Neural Computing and Applications (2022) 34:9537–9560
https://doi.org/10.1007/s00521-022-07157-w

REVIEW

Deep learning-based microexpression recognition: a survey


Wenjuan Gong¹ · Zhihong An¹ · Noha M. Elfiky²

¹ China University of Petroleum (East China), No. 66, Changjiangxi Road, Huangdao, Qingdao, China
² Saint Mary's College of California, 1928 St. Marys Road, Moraga, CA 94575, USA

Corresponding author: Wenjuan Gong, wenjuangong@upc.edu.cn

Received: 9 July 2021 / Accepted: 28 February 2022 / Published online: 1 April 2022
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022

Abstract
With the recent development of microexpression recognition, deep learning (DL) has been widely applied in this field. In
this paper, we provide a comprehensive survey of the current DL-based microexpression (ME) recognition methods. In
addition, we introduce a novel dataset that fuses all the existing ME datasets, and we evaluate a baseline DL method on the
microexpression recognition task. Finally, we make the new dataset and the code publicly available to the community at
https://github.com/wenjgong/microExpressionSurvey.

Keywords: Microexpression recognition · Deep learning · Survey

1 Introduction

Facial microexpressions are transient and unconscious expressions. They last between 1/25 and 1/5 of a second. Microexpressions were first discovered by Haggard and Isaacs in 1966 [24]. According to their observations, micromomentary facial expressions are "facial expressions which are so short-lived that they seem to be quicker-than-the-eye". They believed that the expression is related to self-defense mechanisms. In 1969, Ekman and Friesen also reported the discovery of microexpressions in their frame-by-frame analysis of interviews with depressed patients. They noticed that there was a vivid and extremely painful expression in two frames, and they named it the "microexpression".

Microexpression recognition is a much more difficult task than the thoroughly studied facial expression recognition problem. However, psychologists still argue about whether facial expressions reliably communicate emotions [8]. Compared with facial expressions, facial microexpressions are more reliable and more difficult to hide.

Because of its potential applications in many fields, such as clinical diagnosis, business negotiation, forensic investigation, and security systems (Ekman, 2009a; Frank et al., 2009a; Weinberger, 2010) [58], microexpression recognition has attracted growing attention. The low intensity and subtlety of microexpressions make them difficult for the human eye to perceive. In recent years, many works have been carried out on automatic microexpression recognition using machine learning methods [53, 58, 97]. However, due to the lack of data and the class imbalance problem, automatic microexpression recognition remains a great challenge.

Motivated by the considerable progress of deep learning in areas such as speech recognition [33], image recognition [22], and game playing [32], various deep learning (DL) algorithms have been applied to microexpression recognition tasks [30, 34, 86]. One of the first DL-based microexpression recognition approaches was proposed by Patel et al. [62] in 2016. Subsequently, many related methods have been explored.

Previous reviews mainly focused on traditional machine learning-based microexpression recognition methods [58]. With the recent developments in DL methods, DL-related works now exist for the microexpression recognition task [90, 98]. However, there is a lack of detailed and comprehensive studies. Therefore, this paper focuses on DL-based microexpression recognition methods.

In this work, we provide a comprehensive study of existing works on DL-based microexpression recognition methods. Due to the small sample sizes and low action intensities of microexpression datasets, data preprocessing becomes particularly important.

In particular, how to process the data to improve recognition accuracy has become an essential problem for researchers. We study the data preprocessing procedures employed for microexpression recognition (Sect. 2). The related works on microexpression recognition methods are categorized into five classes based on the types of information they capture: (1) spatial deep learning methods (Sect. 3.1), (2) temporal deep learning methods (Sect. 3.2), (3) spatial-temporal deep learning methods (Sect. 3.3), (4) spatial-contextual deep learning methods (Sect. 3.4), and (5) hybrid methods (Sect. 3.5). Spatial deep learning methods are further divided into convolutional neural network-based methods (Sect. 3.1.1) and multistream convolutional neural network-based methods (Sect. 3.1.2). Spatial-temporal deep learning methods are further divided into 3D convolutional neural network-based methods (Sect. 3.3.1) and combined spatial-temporal methods (Sect. 3.3.2). Spatial-contextual deep learning methods are further divided into capsule neural network-based methods (Sect. 3.4.1) and graph convolutional network-based methods (Sect. 3.4.2). Figure 1 provides the visual hierarchy of the microexpression recognition methods.

One of the major challenges related to microexpression studies is the lack of datasets [99]. In this paper, we studied five public microexpression datasets. Because microexpression data are difficult to collect and label, the sizes of the available datasets are usually small, which can greatly affect the accuracy of the current microexpression recognition task. To address the lack of training data, we fused the publicly available datasets in this work. In addition, we processed the newly fused dataset and benchmarked it using a convolutional neural network (CNN) method as a baseline. We also made both the dataset and the code available to the public. Therefore, the main contributions of this work are as follows:

1. We summarize the deep learning-related works on the microexpression recognition task.
2. We classify the current deep learning methods into five categories and compare their performances.
3. We introduce a newly fused dataset for microexpression recognition and evaluate it using a CNN method as the benchmark. The fused dataset and the code are available at https://github.com/wenjgong/microExpressionSurvey.

Fig. 1 Microexpression recognition methods are categorized into five classes based on the types of information they capture: spatial deep learning methods, temporal deep learning methods, spatial-temporal deep learning methods, spatial-contextual deep learning methods, and hybrid methods

2 Data preprocessing

Because microexpressions have short durations and low intensities, effective feature extraction is a major challenge. Hence, appropriate data preprocessing becomes necessary. Many data preprocessing methods have been successfully introduced into microexpression recognition problems. These include transfer learning (Sect. 2.1), temporal jittering (Sect. 2.2), apex frame spotting (Sect. 2.3), optical flow (Sect. 2.4), dynamic imaging (Sect. 2.5), motion magnification (Sect. 2.6), and data augmentation (Sect. 2.7). Figure 2 provides a visual summary of the various data preprocessing methods in the literature applied to the microexpression recognition task.

2.1 Transfer learning

Because the weights of deep neural networks are initialized randomly, the network optimizer may fall into a local optimum and fail to learn meaningful features. To overcome this problem, transfer learning techniques are utilized to transfer learned knowledge from general images (for instance, from the macroexpression domain) to the microexpression domain. In particular, deep learning models are pretrained on large macroexpression datasets designed for macroexpression recognition tasks.

Fig. 2 Visual summary of data preprocessing methods in the literature for microexpression recognition

The trained parameters are used to initialize the parameters of the microexpression recognition network. Subsequently, these parameters are fine-tuned using microexpression data.

Toward this direction, several attempts have utilized general-purpose image domain knowledge for microexpression recognition. Mayya et al. [52] used a deep CNN pretrained on ImageNet [38] to extract facial features. Zhi et al. [101] adopted transfer learning for pretraining 3D convolutional neural networks (3D-CNNs). First, the 3D-CNN is pretrained on the Oulu-CASIA facial expression dataset [72]. Afterward, the pretrained model is used for microexpression recognition. Liu et al. [49] employed the domain knowledge of the macroexpression recognition Cohn-Kanade (CK+) dataset [51] using two domain adaptive technologies. In particular, expression enlargement and reduction are used as the first domain adaptive technology: the apex frame of the microexpression is regarded as an inevitable intermediate state of the macroexpression, the intermediate frame between the start frame and the apex frame of the macroexpression video clip is selected, and the microexpressions are magnified. The second domain adaptive technique is adversarial domain adaptation, which bridges the data distribution gap between the source domain and the target domain. The input to the last fully connected layer of the classifier module is then fed to the discriminator, which consists of two fully connected layers.
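To make the pretrain-then-fine-tune recipe concrete, the following PyTorch sketch (an illustration under assumed settings, not the pipeline of any surveyed work) loads an ImageNet-pretrained ResNet-18, swaps its classification head for a small microexpression label set, and fine-tunes it with a low learning rate; the class count, batch, and learning rate are hypothetical placeholders.

```python
# Minimal transfer learning sketch: initialize from weights learned on a large source
# domain, then fine-tune on microexpression data. Illustrative assumptions throughout.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # hypothetical number of microexpression classes

# 1) Start from a backbone pretrained on a large source domain (here: ImageNet).
model = models.resnet18(pretrained=True)

# 2) Replace the final fully connected layer to match the target label set.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# 3) Fine-tune with a small learning rate so the pretrained features are only gently adjusted.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def fine_tune_step(images, labels):
    """One optimization step on a batch of microexpression frames (N, 3, 224, 224)."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Random tensors stand in for a real microexpression batch.
dummy_images = torch.randn(8, 3, 224, 224)
dummy_labels = torch.randint(0, NUM_CLASSES, (8,))
print(fine_tune_step(dummy_images, dummy_labels))
```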
2.2 Temporal jittering

Besides exploiting knowledge from other domains through transfer learning, data augmentation techniques can also be used for data preprocessing. Xia et al. [86] proposed the temporal jittering strategy, which is a data augmentation technique. In particular, in their work, they randomly selected 100%, 90%, 80%, 70%, 60%, and 50% of the frames from the original image sequences using 8 sampling methods: the nearest neighbor method, a bilinear method, a bicubic method, a box-shaped kernel-based method, a triangular kernel-based method, a cubic kernel-based method, and Lanczos-2 and Lanczos-3 kernel-based methods. Each frame sequence sampled with a different percentage has a different length. Afterward, they carried out upsampling and downsampling on each sampled frame sequence to generate a fixed-length sequence of 30 frames. Combining these two steps, they augmented the total number of sampled frame sequences by 48 times.
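The following NumPy sketch illustrates the two-step idea under the simplifying assumption of plain linear interpolation in time (the original work combines eight interpolation kernels); the `temporal_jitter` helper, the clip shape, and the ratio list are illustrative only.

```python
# Temporal jittering sketch: subsample a frame sequence, then resample it to a fixed length.
import numpy as np

def temporal_jitter(frames: np.ndarray, keep_ratio: float, target_len: int = 30) -> np.ndarray:
    """frames: (T, H, W) grayscale sequence; keep_ratio in (0, 1]; returns (target_len, H, W)."""
    t = frames.shape[0]
    # 1) Randomly keep a fraction of the frames (temporal order preserved).
    keep = np.sort(np.random.choice(t, size=max(2, int(round(t * keep_ratio))), replace=False))
    kept = frames[keep].astype(np.float32)
    # 2) Resample the shortened sequence back to a fixed length by linear interpolation in time.
    src = np.linspace(0.0, len(kept) - 1.0, num=target_len)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, len(kept) - 1)
    w = (src - lo)[:, None, None]
    return (1.0 - w) * kept[lo] + w * kept[hi]

clip = np.random.rand(45, 64, 64)              # stand-in for a microexpression clip
for ratio in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5):   # the sampling percentages used in the paper
    print(ratio, temporal_jitter(clip, ratio).shape)
```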
2.3 Apex frame spotting

Many deep learning studies use apex frames as input instead of the original video for microexpression recognition. The apex frame has the maximal variation in the whole sequence. Compared with the original video frames, apex frame inputs are less redundant and make it more straightforward to learn domain knowledge from facial macroexpression image datasets [43]. However, the dynamic variations of the microexpressions can be lost by only considering the apex frame.

Toward this direction, various methods exist to locate apex frames based on frame differences. For example, V. Quang et al. [77] proposed locating the apex frame by computing the absolute pixel intensity difference between the current frame and the onset/offset frame in the microexpression video sequence. After obtaining the per-pixel average value for each frame, the frame with the largest value was selected as the apex frame.

In addition, the apex frame can be located with the help of handcrafted features, such as the local binary pattern (LBP) [60]. For instance, Liong et al. [47] proposed the D&C-RoIs method to locate the apex frame. The method first calculates the LBP features of three facial subregions ("left eye + eyebrows", "right eye + eyebrows", and "mouth") for each image frame. Then, the difference between the LBP features of the start frame and the remaining frames is obtained based on the correlation coefficient principle. Finally, a divide-and-conquer strategy is used to search for the apex frame according to the feature difference variations. The frame with the local maximum variation is selected as the apex frame. Song et al. [71] also designed an LBP feature-based apex frame spotting solution. They chose the uniform pattern LBP (UP-LBP) method proposed by Ojala et al. [59] to describe facial textures and shape appearances with UP-LBP histograms. Similarly, the apex frame is located as the frame with the maximum feature difference (FD).

Detection-based methods are also introduced into apex frame spotting. The work of Zhou et al. [103] employed the automatic apex frame spotting strategy proposed by T. Liong et al. [47] for extracting apex frames. First, landmark detection and regions-of-interest (RoIs) extraction methods are applied to each frame. Then, the differences between the start frame and each image frame are calculated to locate the apex frame.

To incorporate temporal information into the apex frame representation, Zhao et al. [100] enhanced the apex frame spotting method proposed by Li et al. [43]. In particular, Zhao et al. proposed transforming facial expression videos to the frequency domain using a 3D fast Fourier transform (3D FFT), where they located the apex frame as the frame with the maximum amplitude.
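A minimal sketch of the frame-difference idea underlying these spotting methods is shown below: every frame is scored by its mean absolute intensity difference from the onset frame and the maximum is taken as the apex. This is a deliberate simplification of the published pipelines (no offset frame, facial subregions, or LBP features).

```python
# Apex spotting by frame differencing (illustrative simplification).
import numpy as np

def spot_apex(frames: np.ndarray) -> int:
    """frames: (T, H, W) grayscale sequence with the onset expression at index 0."""
    onset = frames[0].astype(np.float32)
    scores = [np.mean(np.abs(f.astype(np.float32) - onset)) for f in frames]
    return int(np.argmax(scores))

# Synthetic example: intensity deviation peaks at frame 12, which is recovered as the apex.
clip = np.zeros((30, 64, 64), dtype=np.float32)
for t in range(30):
    clip[t] += np.exp(-((t - 12) ** 2) / 8.0)   # bump in "muscle activation" around frame 12
print(spot_apex(clip))  # -> 12
```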
2.4 Optical flow

Optical flow computes the changes in pixel values between adjacent frames in the time domain. It is an effective encoding of motion information. Temporal information plays a major role in microexpression identification, and optical flow can greatly help the network analyze temporal information, thereby improving accuracy.

The performances of various types of optical flow features (i.e., robust local optical flow (RLOF) [70], total variation-L1 (TV-L1) [82], Farneback [18], Lucas-Kanade [2], and Horn & Schunck [26]) were compared under the same experimental setting [46]. The experimental results show that TV-L1 outperforms all other features and achieves an accuracy of 62.20% and an F1-score of 0.6179, whereas the Horn & Schunck feature (accuracy: 50.64%, F1-score: 0.4970) and the Lucas-Kanade feature (accuracy: 53.04%, F1-score: 0.5284) achieved lower accuracies.

Optical flow features can also be provided as auxiliary information. For example, Peng et al. [63] proposed a novel apex-time network (ATNet) to recognize microexpressions based on the spatial information from the apex frame together with the temporal information from the respective adjacent frames. They used the optical flow feature to extract auxiliary temporal information for the proposed ATNet. They designed an efficient statistical optical flow representation named the average direction-magnitude feature: the obtained optical flow field is divided into 8×8 feature blocks for each frame, and the average optical flow magnitude and direction for each feature block are computed. This helps ATNet to better recognize microexpressions.

More efficient ways of estimating the optical flow features were also explored by Zhou et al. [104]. In this work, they proposed using the intermediate frames instead of the apex frames for estimating the optical flow features. In particular, they split the video footage into three clips of equal length and counted the occurrences of the apex frame in each clip. They discovered that 70% of the apex frames appear in the second clip of the microexpression video. Without the apex frame spotting step, using the intermediate frames is more time-efficient.
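As a hedged illustration of how an onset-to-apex flow field can be computed, the sketch below uses OpenCV's Farneback implementation because it ships with the core package; the TV-L1 flow favored in the surveyed works lives in the opencv-contrib module instead. The face crops and parameter values are placeholders.

```python
# Dense optical flow between the onset and apex frames (Farneback variant; the surveyed
# works mostly use TV-L1, available in the opencv-contrib package under cv2.optflow).
import cv2
import numpy as np

def onset_apex_flow(onset_gray: np.ndarray, apex_gray: np.ndarray) -> np.ndarray:
    """Returns a (H, W, 2) array with horizontal and vertical displacement components."""
    return cv2.calcOpticalFlowFarneback(onset_gray, apex_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

onset = (np.random.rand(128, 128) * 255).astype(np.uint8)   # stand-ins for real face crops
apex = np.roll(onset, shift=2, axis=0)                       # apex = onset shifted downward
flow = onset_apex_flow(onset, apex)
u, v = flow[..., 0], flow[..., 1]                            # horizontal / vertical components
magnitude = np.sqrt(u ** 2 + v ** 2)
print(flow.shape, float(magnitude.mean()))
```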

2.5 Dynamic imaging

Verma et al. [79] proposed a four-stream microexpression recognition network (LEARNet) based on dynamic imaging. Dynamic imaging converts a video sequence into a single image frame while preserving its temporal and spatial information.

2.6 Motion magnification

The muscle movements of microexpressions have very low intensities and are difficult to observe with the naked eye. To this end, Wu et al. [85] proposed an Eulerian motion magnification method. The method first applies spatial filtering and decomposes the video sequence into multiresolution pyramids. To obtain several frequency bands of interest, time-domain bandpass filtering is carried out on each image. Then, the filtered images are enlarged: the signal of each frequency band is approximated by a Taylor series, and a magnification factor α is applied to the filtered spatial frequency band to linearly amplify the signal of interest. Finally, the enlarged images are fused. Many researchers use this motion magnification technology to improve recognition accuracies [49, 88, 100].

2.7 Data augmentation

Data augmentation technologies have always been an important means of overcoming the lack of data. Some data augmentation technologies in computer vision are based on affine transformations, such as rotation, scaling, and displacement. There are also data augmentation technologies based on image processing methods, such as lighting conversion and noise addition. Recently, generative adversarial networks (GANs) [21] have attracted widespread attention due to their excellent performance in generating new image data. Liong et al. [46] used an auxiliary classifier GAN (AC-GAN) [57] and a self-attention GAN (SAGAN) [96] to construct data samples. The horizontal and vertical optical flow images were processed by the GAN model separately. The microexpression classes and random noise were input into the generator, and the authentic images and the fake images produced by the generator were both passed to the discriminator. To generate any number of fake images, two individual network structures processing the horizontal and vertical optical flow features can be used.
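A small torchvision sketch of the affine and lighting augmentations mentioned above (assuming a recent torchvision release that accepts tensor inputs); the parameter ranges are arbitrary examples rather than settings from any surveyed work.

```python
# Affine data augmentation sketch for microexpression frames (rotation, scaling, translation),
# plus a mild lighting change; hypothetical parameter ranges.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05), scale=(0.9, 1.1)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # simple lighting conversion
])

frame = torch.rand(3, 112, 112)                 # stand-in for a cropped face frame
augmented = [augment(frame) for _ in range(5)]  # five augmented copies of the same frame
print(len(augmented), augmented[0].shape)
```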
3 Deep learning-based microexpression recognition

This section presents and analyzes the current methodologies of deep learning-based microexpression recognition. We categorize the current methods based on the types of information they capture: spatial deep learning methods (Sect. 3.1), temporal deep learning methods (Sect. 3.2), spatial-temporal deep learning methods (Sect. 3.3), spatial-contextual deep learning methods (Sect. 3.4), and hybrid methods (Sect. 3.5). However, we note that some methods may fall into multiple categories.

3.1 Spatial deep learning methods

The most commonly used spatial deep learning models for microexpression recognition are convolutional neural networks (Sect. 3.1.1). Multistream CNNs (Sect. 3.1.2) are also utilized to extract information, including spatial information.

3.1.1 Convolutional neural network-based methods

The convolutional neural network, also named CNN or ConvNet, is a class of feedforward neural networks. The architecture is composed of three types of layers: convolutional layers, pooling layers, and fully connected layers. CNNs have become the dominant model in many machine learning tasks, especially in computer vision. However, CNNs lack mechanisms for handling temporal information. To overcome this limitation, optical flow has been used as input to these CNNs. The network structure of CNN-based methods is summarized in Fig. 3.

Various CNN models are utilized in the literature. The most frequently used backbone network is ResNet [23]. Liu et al. [49] proposed a network based on ResNet. After detecting and aligning the face, the optical flow between the first frame and the apex frame is calculated and fed into the network. The work uses ResNet18 as its backbone network and modifies the architecture by dividing the feature maps extracted from the last convolutional layer into a top eye part and a bottom mouth part. Motion magnification methods and adversarial-based domain adaptation technology [20] are also employed to address the lacking data problem. The code is available at https://github.com/xiaobaishu0097/MEGC2019. Belaiche et al. [3] proposed a ResNet-based cost-efficient CNN architecture for microexpression recognition. They modified the ResNet architecture by removing the residual layers. They evaluated four different optical flow representations (i.e., horizontal optical flow, vertical optical flow, optical flow magnitude level, and horizontal-vertical optical flow pairs) and discovered that the vertical optical flow representations achieve the best performance.

Fig. 3 The structure of convolutional neural network-based methods

Zhou et al. [103] proposed a ResNet-based cross-database microexpression recognition method. They located the apex frame automatically, and images of different styles were aggregated with CycleGAN [105]. Auxiliary facial expression data were utilized to train a ResNet-based teacher model, and then the attention of the teacher model was transferred to train the student model on microexpression datasets with limited samples.

AlexNets [39] are also effective backbone networks for microexpression recognition. Borza et al. [4] proposed an AlexNet-based microexpression detection and recognition framework. Frame difference features are computed: the difference between the start frame and the current frame, and the difference between the current frame and the offset frame. These features are fed into the AlexNet-based network, which is composed of three convolutional layers and three fully connected layers. The proposed method iterates over the video sequence and the network output using sliding windows to eliminate false alarms. Khor et al. [34] proposed a lightweight AlexNet-based network. The proposed architecture truncates AlexNet by retaining only the first 3, 4, or more convolution blocks. This reduces overfitting by decreasing the network depth and the number of tunable parameters. The network takes optical flow, optical flow magnitude, and grayscale features as inputs and combines two different features to form heterogeneous pairs. At the last convolution block, the feature maps are merged through operations such as feature mapwise multiplication, feature mapwise addition, and depthwise concatenation. The code is available at https://github.com/IcedDoggie/DSSN-MER.

Other commonly used backbone networks include the Inception nets [4], VGGFace [73], and residual networks [80]. Zhou et al. [104] proposed a dual inception network for processing composite microexpression datasets. Features are extracted from the start frame and the apex frame of the sequence, and then the optical flow is extracted using the total variation-L1 (TV-L1) method. The horizontal and vertical components of the optical flow feature are fed into the dual inception network, and the two processed feature maps are concatenated. The network uses the Inception net as the backbone network, and filters of different sizes (1×1, 3×3, 5×5) are utilized. To reduce the number of calculations, a 1×1 convolutional layer is added before the 3×3 and 5×5 convolutional layers to reduce the number of channels. The Inception net addresses how to select the appropriate convolution kernel size when processing different images. Takalkar et al. [73] explored solutions to the lacking data problem and proposed a VGGFace [61] model-based network. The backbone VGGFace network was trained on a very large-scale face image dataset. By using VGGFace, the method leveraged the domain knowledge from large-scale face recognition datasets. The proposed method significantly enhances the recognition accuracy, but due to labeling errors and class imbalance, microexpression recognition is still a difficult problem. Wang et al. [80] proposed a residual network-based model with a microattention mechanism. The microattention mechanism enables the network to learn to focus on the facial areas of interest. The network is trained on ImageNet [38] and four macroexpression datasets.

Besides building network models based on popular backbone networks, there are also works that design their own CNN structures. For example, Verma et al. [78] proposed a hybrid local receptive field-based augmented learning network (OrigiNet). They introduced the concept of active imaging and separated the changes in the expressive regions of a video into a single frame while retaining facial appearance information. OrigiNet utilizes multiscale convolution filters to extract features of various scales. OrigiNet also introduces a new activation function, RReLU, which expands the range of the derivative. RReLU not only introduces nonlinearity but also captures the true edges with its additive and multiplicative properties. By embedding two parallel fully connected layers, the enhanced learning block also improves the learning ability.

CNNs are also exploited purely as feature extractors. For example, Li et al. [41] proposed a microexpression recognition algorithm based on a deep multitask convolutional network and a fusion convolutional network. The three channels of the detected face area are fed into the deep multitask network for face landmark detection, and the face area is divided into 36 regions of interest. Additionally, each frame in the video sequence together with the first frame is fed into FlowNet 2.0 [29] to compute the optical flow feature. Then, the method selects the features in the regions of interest and calculates the corrected histogram of oriented optical flow (HOOF) [6] feature. Finally, LIBSVM [5] with a polynomial kernel function is used for per-frame expression classification.
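The following PyTorch sketch shows the common input/output shape of these CNN-based recognizers: a small convolutional stack over a two-channel (horizontal/vertical) optical flow map followed by a classifier. The layer sizes are invented for illustration and do not correspond to any published architecture.

```python
# Minimal CNN classifier over an onset-to-apex optical flow map (2 channels: u and v).
import torch
import torch.nn as nn

class FlowCNN(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),               # global average pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, flow):                       # flow: (N, 2, H, W)
        return self.classifier(self.features(flow).flatten(1))

model = FlowCNN()
logits = model(torch.randn(4, 2, 112, 112))       # batch of onset-to-apex flow fields
print(logits.shape)                                # -> torch.Size([4, 5])
```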

3.1.2 Multistream convolutional neural network-based methods

One technique for handling the lacking data problem is to extract multiple features from the limited data. The multistream framework can combine multiple feature extraction techniques. The network structures of multistream CNN-based methods are summarized in Fig. 4.

Many multistream CNN-based methods have been proposed for microexpression recognition. For example, Liong et al. [19] proposed a two-stream apex frame network (OFF-ApexNet). They first located the apex frame and then calculated the optical flow between the start frame and the apex frame. The horizontal component and the vertical component are input to the two-stream CNN model for feature enhancement and expression classification. Later, Liong et al. [46] improved the previously proposed OFF-ApexNet and added a processing stream for other optical flow features (i.e., magnitude and direction). The outputs are multiplied to reduce the feature dimensions while maintaining the feature quality. Song et al. [71] also proposed a three-stream convolutional neural network (TSCNN). The network consists of three streams: the static-spatial stream, the local-spatial stream, and the dynamic-temporal stream. Among the three data processing streams, the static-spatial recognition stream takes the grayscale image of the apex frame as input and effectively performs action recognition using static images. The local-spatial recognition stream processes blocks taken from the grayscale apex frame. The dynamic-temporal recognition stream takes the optical flow displacement fields as input.

Verma et al. [79] proposed a four-stream microexpression recognition network (LEARNet) based on dynamic imaging. The dynamic image is generated from the microexpression sequence and then fed into the proposed lateral accretive hybrid network (LEARNet) to extract microfeatures. The architecture uses a hybrid and decoupled feature learning mechanism and filters with different template sizes to extract micro- and high-level features. These features are extracted with four independent data processing streams, and two of the data processing streams use a hybrid response map method, in which feature maps from other data processing streams are added.
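A minimal two-stream sketch in the spirit of the horizontal/vertical flow streams described above, with late fusion by concatenation; the stream depth, widths, and the fusion choice are assumptions, not the published OFF-ApexNet configuration.

```python
# Two-stream sketch: one stream per optical flow component, fused by concatenation.
import torch
import torch.nn as nn

def make_stream():
    return nn.Sequential(
        nn.Conv2d(1, 8, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(8, 16, 5, padding=2), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
    )

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.horizontal = make_stream()
        self.vertical = make_stream()
        self.classifier = nn.Linear(16 * 2, num_classes)

    def forward(self, flow_u, flow_v):             # each: (N, 1, H, W)
        fused = torch.cat([self.horizontal(flow_u), self.vertical(flow_v)], dim=1)
        return self.classifier(fused)

net = TwoStreamNet()
u = torch.randn(4, 1, 112, 112)
v = torch.randn(4, 1, 112, 112)
print(net(u, v).shape)                             # -> torch.Size([4, 5])
```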
Fig. 4 The structure of multistream convolutional neural network-based methods

3.2 Temporal deep learning methods

Many CNN-based microexpression recognition approaches use 2D CNNs to extract features. Considering the temporal constraint between consecutive frames in microexpression video sequences, capturing dynamic information is also an essential process.

Recurrent neural networks (RNNs) are designed for processing sequential data and are commonly used for processing temporal information in microexpression recognition. RNNs and CNNs are combined by incorporating recurrent connections into each convolutional layer [44]. The combined recurrent convolutional neural network (RCNN) model [102] enhances the ability to integrate contextual information. In the case of static input images, the RCNN units evolve over time in the sense that the activity of each unit is regulated by the activities of its neighbors.

RCNN models are widely used in facial microexpression recognition. For example, Xia et al. [88] proposed an RCNN-based microexpression recognition method: spatiotemporal recurrent convolutional networks. The optical flow of the cropped and enlarged facial image was calculated and fed into the recurrent convolutional layer for further processing.

Three normalization methods (i.e., samplewise, subjectwise, and datasetwise methods) were proposed to process the composite dataset. Based on this work, Xia et al. designed a novel RCNN [86] for microexpressions. The new network consisted of a feedforward convolutional layer and several recurrent convolutional layers. The first convolutional layer was the only feedforward convolutional layer without recurrent connections. The following four recurrent convolutional layers (RCLs) extracted visual features for expression recognition. A maximum pooling operation was applied between the feedforward and the recurrent connection of each RCL to reduce the dimensionality. All feature maps were concatenated using a global average pooling layer, and the final softmax layer calculates the prediction.

Xia et al. [87] proposed a spatiotemporal recurrent convolutional network (STRCN)-based microexpression recognition method. The model consisted of multiple recurrent convolutional layers for visual feature extraction and a classification layer for recognition. The method proposed two temporal connections. The type 1 connection (STRCN-A) vectorized each image channel into a column of a matrix and fed it to the STRCN to extract appearance features. The type 2 connection (STRCN-G) extracted geometric (spatial-temporal) features from the optical flow fields. Similar to an attention mechanism, the method proposed a mask generation method to locate microexpression-aware areas using the mask. Finally, they used a multiclass balanced loss to train the STRCNs.

Xia et al. [89] also proposed a recurrent convolutional network (RCN)-based microexpression recognition method for composite datasets. They evaluated different model input image scales and chose to use shallower models and low-resolution input images based on the experimental results. They also developed three parameter-free modules (i.e., wide expansion, shortcut connection, and attention unit) to enhance the representation ability. The wide expansion module consisted of several streams performing expansion operations with different dilation sizes; these streams were then combined to form one descriptor. To prevent gradient explosion, the shortcut connection module added one shortcut connection to the second layer in the RCN. The attention unit module added a soft attention mechanism to achieve better recognition performance. Then, a neural architecture search (NAS) strategy [48] was utilized to determine the best locations or combinations of these modules. The network structure of RCN-based methods is summarized in Fig. 5.

3.3 Spatial-temporal deep learning methods

The arousal of microexpressions is a continuous process, so the current datasets collect emotional videos. How to effectively integrate spatial and temporal information has naturally become one of the research focuses in microexpression recognition.

3.3.1 3D convolutional neural network-based methods

The 3D CNN model was proposed by Ji et al. [30] and is capable of performing spatiotemporal information analysis. The 3D CNN performs convolutions over 3D cubes rather than 2D image patches. Many works explore the applications of 3D CNN models to microexpression recognition. The network structure of 3D CNN-based methods is summarized in Fig. 6.

Reddy et al. [67] proposed two 3D CNN methods: MicroExpSTCNN and MicroExpFuseNet. The MicroExpSTCNN model takes the entire face as input, and the model is composed of 3D convolutional layers, 3D pooling layers, fully connected layers, activation layers, and dropout layers. The MicroExpFuseNet model takes only the eye and mouth regions and feeds them to two independent 3D spatiotemporal CNNs, which are then merged into one network. Based on the fusion strategies, two versions of MicroExpFuseNet were proposed: intermediate MicroExpFuseNet and late MicroExpFuseNet. The code is available at https://github.com/bogireddytejareddy/micro-expression-recognition.

Multiple stream processing structures are employed to process various features or video inputs with different configurations. For example, Peng et al. [64] proposed a dual-time scale convolutional neural network (DTSCNN), a dual-stream network with 3D convolution and pooling units. The method takes the CASME and CASME II datasets as inputs, and each stream processes a dataset with a specific video frame rate. Li et al. [40] proposed a 3D flow-based CNN model. The model is a three-stream 12-layer deep network consisting of 5 pairs of convolution and pooling layers, 1 pair of fully connected layers, and 1 pair of output softmax layers, and it takes the optical flow together with standard grayscale frames as input data. Liong et al. [45] proposed a shallow triple-stream 3D CNN (STSTNet) model. The model has a lower computational complexity and can extract high-level features. The model inputs the grayscale, horizontal, and vertical optical flow field features into parallel streams, and each stream is composed of a convolutional layer with a different number of kernels (namely, 3, 5, 8) and a maximum pooling layer.

Fig. 5 The structure of recurrent convolutional network-based methods

Fig. 6 The structure of 3D convolutional neural network-based methods

Then, the output features are merged into a 3D feature block. After being processed by an average pooling layer and a fully connected layer, they are assigned to the corresponding emotion classes in the final softmax layer. The code is available at https://github.com/christy1206/STSTNet.

3D CNN models combined with other methods are also explored. Some works combine 2D CNN and 3D CNN models. For example, C. Wu et al. [84] proposed a three-stream network combining 2D and 3D convolutional neural networks (TSNN). The network had two variants based on the fusion position: intermediate fusion (TSNN-IF) and late fusion (TSNN-LF). Specifically, the TSNN first extracted spatiotemporal features in different temporal domains, and the extracted features were further processed through the 2D layers to extract high-level features. Finally, the generated feature map was processed by a global average pooling layer and forwarded to the fully connected layer for microexpression classification. Others combine 3D CNNs with transfer learning methods. Zhi et al. [101] proposed a microexpression recognition method that combined a 3D convolutional neural network with transfer learning by supervised pretraining. First, the facial expression dataset Oulu-CASIA [72] was used for supervised learning. The trained 3D CNN model was then transferred to the facial microexpression domain. Finally, feature vectors extracted from the fully connected layer were sent to a linear support vector machine (SVM) to predict the expression classes. There are also works that combine 3D CNNs with attention mechanisms. For example, Chen et al. [7] proposed a convolutional block attention module network (CBAM-Net), a 3D spatiotemporal CNN with a convolutional block attention module (CBAM) [83]. First, the image sequence was input to a medium-sized CNN to extract visual features. The network included six 3D convolutional layers and six 3D pooling layers. The extracted feature map was fed into the CBAM, which included two modules: the channel attention module and the spatial attention module. These two modules learned the key information along the channel dimension and the spatial dimension and adaptively allocated the feature weights.
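The sketch below shows the defining operation of this family, a convolution over (time, height, width) cubes; the clip length, channel counts, and network depth are placeholders and do not reproduce any surveyed model.

```python
# Minimal 3D CNN sketch: spatiotemporal convolutions over a short grayscale clip.
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(3, 3, 3), padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),
            nn.Conv3d(8, 16, kernel_size=(3, 3, 3), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, clip):                        # clip: (N, 1, T, H, W)
        return self.classifier(self.features(clip).flatten(1))

model = Tiny3DCNN()
clip = torch.randn(2, 1, 16, 64, 64)                # 16 grayscale frames per sample
print(model(clip).shape)                            # -> torch.Size([2, 5])
```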

3.3.2 Combined spatial-temporal methods

Combined spatial-temporal methods can be divided into two types. The first extracts spatial and temporal features separately, i.e., extracting temporal features using an LSTM/GRU/RNN and spatial features using CNNs, after which the two types of features are fused. The second type first extracts spatial features using CNNs and then extracts the final features through an LSTM/GRU/RNN. The network structure of the combined spatial-temporal methods is summarized in Fig. 7.

By combining an LSTM [25]/GRU [11]/RNN with a CNN, the deep learning model can learn temporal contextual information from the video sequences. For example, M. Peng et al. [63] proposed a dual-stream neural network (apex-time network, ATNet), in which the spatial stream is composed of a ResNet10 network and the temporal stream uses a stacked vanilla LSTM network. The code is available at https://github.com/CodeShareBot/ACII19-Apex-Time-Network.

Sequential processing is also utilized to combine deep learning models. For example, Aouayeb et al. [1] cropped subregions of facial images (eyes, eyebrows, nose, cheeks, and mouth) and fed them to a convolutional neural network. The extracted spatial features were further processed through a long short-term memory (LSTM) network, followed by fully connected layers (FCLs) for classification. Khor et al. [35] proposed an enriched long-term recurrent convolutional network (ELRCN) model. The model used a CNN to extract features from video frames and then predicted the microexpression by passing the feature vectors through an LSTM. The framework consisted of two network variants: spatial dimension enrichment (SE) and temporal dimension enrichment (TE). SE stacks various input channels, and TE stacks various features. The combined model improved the microexpression recognition accuracies. The code is available at https://github.com/IcedDoggie/Micro-Expression-with-Deep-Learning. Choi et al. [12] proposed a CNN-LSTM-based network to process landmark feature maps (LFMs) or a compressed version of them (CLFMs). The CNN first converts each LFM/CLFM into a 1D feature vector. Then, the sequence of 1D feature vectors passes through the LSTM to determine the microexpression class. Kim et al. [36] proposed a method that learns the spatial features through a CNN and then uses an LSTM recurrent neural network to learn the temporal features. In learning the spatial features, five expression states (i.e., onset, onset to apex transition, apex, apex to offset transition, and offset) are learned for each expression, and the proposed objective function considers the expression states. Wang et al. [81] proposed a transferring long-term convolutional neural network (TLCNN), which first extracts features using a deep CNN and then feeds these features into an LSTM to learn temporal information. To address the problem of missing data, they incorporated facial expression data in training.

An ensemble of models can also be combined into one model. Nag et al. [55] proposed a spatiotemporal network composed of four main modules: a spatial network, a temporal network, a context-contrast module, and a visual memory module. The spatial network module employed the DeepLab large FOV [10], which retains a relatively high resolution and encodes context information into each pixel's representation. The temporal network is a motion pattern network (MP-Net) [76] that takes optical flow as input. The context-contrast module performs intramap difference processing on the consecutive spatial network feature maps, which greatly improves the recognition accuracy of subtle microexpressions. The memory module uses a GRU, a lightweight form of RNN, to decode microexpressions from the temporal frame sequences, so that information from past frames is incorporated to predict the expression classes.

Multistage learning is also a method of combining deep learning models. Nistor et al. [56] proposed a multistage training method for microexpression recognition. The first stage trains convolutional autoencoders (CAEs) for facial image reconstruction. The second stage initializes the learnable parameters of the encoder with the values learned in the first stage and trains the model on a macroexpression recognition task. In the final stage, recurrent neural networks process the feature sequence to identify microexpressions. The multistage method thus starts with facial image reconstruction, then macroexpression recognition, and finally microexpression recognition. Its advantage is that it can exploit a large number of facial images that do not require annotations as well as a large number of facial images with annotated emotions, which helps to address the lack of training data.
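A compact sketch of the "CNN per frame, LSTM over the sequence" pattern: each frame is encoded by a small CNN, the frame features are aggregated by an LSTM, and the last hidden state is classified. All sizes are illustrative assumptions, not any published configuration.

```python
# CNN + LSTM sketch: per-frame spatial features followed by temporal aggregation.
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, num_classes: int = 5, feat_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, feat_dim, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, clip):                                   # clip: (N, T, 1, H, W)
        n, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1)).view(n, t, -1)  # per-frame spatial features
        _, (h_n, _) = self.lstm(feats)                            # temporal aggregation
        return self.classifier(h_n[-1])

model = CNNLSTM()
print(model(torch.randn(2, 20, 1, 64, 64)).shape)              # -> torch.Size([2, 5])
```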
3.4 Spatial-contextual deep learning methods

CNNs are inspired by findings in the biological study of vision and employ restricted receptive fields and a hierarchy of layers for feature extraction. However, CNNs lose the contextual information of a data sequence. To further incorporate the spatial relationships of objects or the spatial relationships among object parts, capsule networks (Sect. 3.4.1) and graph convolutional neural networks (Sect. 3.4.2) have also been introduced to solve the microexpression recognition problem.

Fig. 7 The structure of combined spatial-temporal methods

Fig. 8 The structure of capsule neural network-based methods. "MeCaps" for microexpression recognition corresponds to "DigitCaps" for digit recognition [68]

3.4.1 Capsule neural network-based methods

The capsule neural network was proposed by Sabour et al. [68]. Capsule network-based models are introduced into microexpression recognition to encode the spatial relationships among parts. The network structure of capsule neural network-based methods is summarized in Fig. 8.

Quang et al. [77] proposed a capsule network-based microexpression recognition framework. The apex frame is detected, and the face area is cropped from the apex frame and fed into a ResNet model whose weights are initialized with the parameters of a ResNet18 model trained on the ImageNet dataset [38]. The extracted features are fed into the capsule layer for further processing. The code is available at https://github.com/quangdtsc/megc2019.

Capsule networks are also utilized as discriminators in the generative adversarial network framework. For example, Yu et al. [95] proposed an identity-aware and capsule-enhanced generative adversarial network (ICE-GAN). The network consists of an encoder-decoder generator for synthesizing identity-aware expressions and a capsule-enhanced discriminator for distinguishing real from fake face images and identifying microexpressions.

3.4.2 Graph convolutional network-based methods

Graph convolutional networks (GCNs) [37] also model spatial relationships. While the CNN extracts information around pixels in an image, the GCN learns the relationships between objects from non-Euclidean data in the form of a graph. The standard convolution operation filters image blocks, whereas graph convolution applies convolutional operations to the graph nodes. Action units (AUs) are the basic actions that reflect the movements of facial muscles. The GCN layers describe the dependencies between AU nodes to perform microexpression recognition. The network structure of GCN-based methods is summarized in Fig. 9.

Lo et al. [50] proposed an end-to-end AU-oriented GCN-based network (MER-GCN). The model uses a 3D convolutional neural network to extract AU features and uses GCN layers to describe the dependencies between AU nodes.
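A single graph convolution over AU nodes, sketched below with the standard normalized-adjacency propagation rule; the number of AU nodes, the feature size, and the adjacency matrix are hypothetical stand-ins for the AU dependency graph that such methods define or learn.

```python
# One graph convolution over action unit (AU) nodes: mix node features through a
# normalized adjacency matrix, then apply a learned linear transform.
import torch
import torch.nn as nn

class AUGraphConv(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):                        # x: (N, num_au, in_dim), adj: (num_au, num_au)
        a_hat = adj + torch.eye(adj.size(0))          # add self-loops
        deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm_adj = deg_inv_sqrt[:, None] * a_hat * deg_inv_sqrt[None, :]
        return torch.relu(self.weight(norm_adj @ x))  # propagate, then transform

num_au, feat = 12, 16                                 # e.g., 12 AU nodes with 16-d features
x = torch.randn(4, num_au, feat)                      # AU features from a CNN backbone (assumed)
adj = (torch.rand(num_au, num_au) > 0.7).float()      # hypothetical AU dependency graph
adj = ((adj + adj.t()) > 0).float()                   # make it symmetric
print(AUGraphConv(feat, 32)(x, adj).shape)            # -> torch.Size([4, 12, 32])
```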
3.5 Hybrid approaches combining traditional and deep learning methods

Compared with traditional methods, deep learning methods provide a unified feature extraction framework and avoid the tedious process of manual feature extraction.

Fig. 9 The structure of graph convolutional network-based methods

However, with a limited quantity of data, deep learning methods may suffer from overfitting; they rely on large-scale datasets to obtain reliable deep features. Many works therefore choose to combine traditional methods with deep learning methods to take advantage of both. The network structure of hybrid approaches combining traditional and deep learning methods is summarized in Fig. 10.

Some works apply traditional and deep learning methods sequentially. For example, Mayya et al. [52] proposed combining temporal interpolation with a DCNN. First, temporal interpolation is applied to the video sequences, and then the DCNN model pretrained on ImageNet [38] is used for feature extraction. The DCNN architecture uses a large number of parallel processors to greatly reduce the calculation time. Finally, microexpression labels are predicted with an SVM. Patel et al. [62] combined deep learning feature extraction with an evolutionary algorithm, which was used to search for the best deep feature set for classification.

Another school of methods combines traditional and deep learning methods through early fusion (feature fusion) or late fusion (model fusion). Takalkar et al. [74] combined handcrafted features with deep features: a CNN and local binary patterns from three orthogonal planes (LBP-TOP) extract features, which are combined and passed to SVM and softmax classifiers. Takalkar et al. [75] also employed early fusion of features; the LBP-TOP features and CNN features were combined and sent to the softmax function for classification. Hu et al. [27] also proposed combining handcrafted features with deep features using early fusion. The handcrafted features were extracted using local Gabor binary patterns from three orthogonal panels (LGBP-TOP), and the deep features were extracted using a CNN. Finally, a multitask learning framework was used to predict the microexpression classes from the fused features.

4 Datasets

Currently, there are only a limited number of microexpression datasets, including Polikovsky's dataset, the USF-HD dataset, the SMIC database [66], the SMIC2 database [42], the CASME database [92], the CASME II database [91], and the SAMM dataset [14]. These datasets contain a relatively small number of sample videos. Among these datasets, Polikovsky's dataset and the USF-HD database are elicited involuntarily and are unavailable to the public. In the following, we introduce the public datasets. The detailed configurations of the five public datasets are summarized in Table 1.

4.1 SMIC database

The SMIC database [66] was collected in 2011 by Pfister et al. from the University of Oulu. SMIC was the first spontaneous microexpression dataset in the world. The dataset collected microexpression videos from six subjects (three men and three women, with four wearing glasses). The video frame resolution is 480×640, and the frame rate is 100 fps. The dataset contains 77 spontaneous microexpressions, covering emotions including disgust, fear, misfortune, sadness, and surprise. The subjects watched 16 carefully selected movie clips to elicit microexpressions and were asked to suppress their facial expressions.

4.2 SMIC2 database

The SMIC2 database [42] was collected in 2012 by Pfister et al. from the University of Oulu. SMIC2 is an expansion of the SMIC database. The dataset includes 20 subjects, but only 16 of them (six women and ten men, eight Caucasians and eight Asians, with an average age of 28.1) were able to produce microexpressions.

Fig. 10 The structure of hybrid approaches combining traditional and deep learning methods

Table 1 Summary of the publicly available microexpression datasets

Dataset             SMIC         SMIC2        CASME                CASME II         SAMM
Resolution          640×480      640×480      640×480 / 1280×720   640×480          2040×1088
Face resolution     –            190×230      150×190              280×340          400×400
FPS                 100          100          60                   200              200
Video numbers       77           164          195                  247              159
Spontaneous/Posed   Spontaneous  Spontaneous  Spontaneous          Spontaneous      Spontaneous
Mean age (SD)       –            28.1         22.03 (SD=1.60)      22.03 (SD=1.60)  33.24 (SD=11.32)
Subjects            6            16           35                   35               32
Ethnicities         –            2            1                    1                12
Emotion classes     5            3            7                    5                7

– The corresponding information is missing for the relevant dataset

Microexpression videos of 10 subjects were captured with a 100 fps high-speed camera, and microexpression videos of another 10 subjects were captured (after 5 months) not only with the high-speed camera but also with an integrated camera box. The camera box includes a normal vision (VIS) camera and a near-infrared (NIR) camera; their frame rate is 25 fps, and their resolution is 480×640. The database contains 164 microexpression videos from 16 subjects and 71 videos from 8 subjects recorded using both the VIS and the NIR cameras. Subjects watched 16 carefully selected video clips, accompanied by audio output through loudspeakers, that could trigger strong emotions and thus generate microexpressions.

4.3 CASME database

The CASME database [92] was collected in 2013 by Fu et al. from the Institute of Psychology, Chinese Academy of Sciences. CASME consists of 35 subjects (22 males and 13 females) with an average age of 22.03 years. The dataset contains 195 video clips and is divided into two groups based on the recording configurations. The subjects in Group A were recorded by a BenQ M31 camera with a frame rate of 60 fps and a resolution of 1280×720, under natural lighting. The subjects in Group B were recorded by a Point Grey GRAS-03K2C camera with a frame rate of 60 fps and a resolution of 640×480. The onset, apex, and offset frames and the action units (AUs) are all labeled.

4.4 CASME II database

The CASME II database [91] was collected in 2014 by Fu et al. from the Institute of Psychology, Chinese Academy of Sciences. CASME II is an upgraded version of CASME. The dataset used a high-speed camera with a frame rate of 200 fps. The captured image resolution is 640×480, and the face area resolution is up to 280×340. The dataset consists of 255 microexpression videos of 7 categories (i.e., happy, disgusted, surprised, depressed, others, fearful, sad) from 26 subjects.

4.5 SAMM dataset

The SAMM dataset [14] was collected in 2016 by Davison et al. from Manchester Metropolitan University. SAMM contains 32 participants (17 White British, 3 Chinese, 2 Arab, 2 Malay, 1 African, 1 Afro-Caribbean, 1 Black British, 1 White British/Arab, 1 Indian, 1 Nepalese, 1 Pakistani, and 1 Spanish, with 16 men and 16 women), and their average age was 33.24. Seven stimuli were used to elicit the participants' microexpressions. The camera used is a Basler Ace acA2000-340km with a grayscale sensor, a recording speed of 200 fps, and a resolution of 2040×1088.

5 Method comparisons

To bring researchers from broader research areas together, the Facial MicroExpressions Grand Challenge (MEGC) [31, 69, 94] has been held in conjunction with the IEEE Conference on Automatic Face and Gesture Recognition (FG). The latest MEGC (MEGC 2021) includes two tasks: the facial microexpression generation task and the facial macro- and microexpression spotting task.

Some methods use the evaluation criteria proposed in MEGC. Table 2 summarizes the performances of the existing methods using the composite database evaluation (CDE) method, and Table 3 summarizes the performances of the existing methods using the holdout-database evaluation (HDE) method. For the CDE method, two datasets (the CASME II database and the SAMM dataset) are combined into a single composite dataset, and leave-one-subject-out (LOSO) cross-validation is used for evaluation. The merged dataset consists of 55 subjects (26 from the CASME II database and 29 from the SAMM dataset). For the HDE method, the training and test sets are sampled from different microexpression datasets; then, the training set and the test set are swapped, similarly to 2-fold cross-validation. LOSO and leave-one-video-out (LOVO) are also commonly used evaluation criteria in microexpression recognition. LOVO uses only one video at a time as the test set and trains the model on all other videos. LOSO takes all the videos of one subject as the test set and trains the model on all the videos of the other subjects. Table 4 summarizes the results using the LOSO evaluation, and Table 5 summarizes the results using the LOVO evaluation.

6 The fused datasets and the baseline

6.1 The fused datasets

In this work, five datasets (the CASME database, the CASME II database, the CAS(ME)2 database, the SMIC database, and the SAMM dataset) were fused to form two new datasets. In addition, a neural network-based microexpression recognition baseline method was evaluated on the fused datasets.

The first fused dataset divided the microexpressions into four categories, comprising 125 positive, 444 negative, 112 surprise, and 122 other samples. The second fused dataset used all the datasets except the SMIC database (because that database is labeled only as positive, negative, and surprise). In particular, we removed three small categories (namely, contempt, pain, and helpless) and reserved nine categories, comprising 63 anger, 17 sadness, 69 surprise, 16 fear, 75 happiness, 127 disgust, 66 repression, 68 tense, and 122 other samples.

To overcome the imbalanced class problem, we randomly sampled from the original datasets to ensure that all the classes had the same sample size for evaluation. In the fused 4-class dataset, the smallest sample size was 112 (from surprise); therefore, we randomly sampled 112 images from each of the other three categories. For the fused 9-class dataset, we sampled 16 images from each category, as the smallest sample number was 16 (from sadness).

6.2 Baseline performance

We implemented a baseline algorithm to evaluate the fused datasets. The baseline was a one-stream convolutional neural network, as shown in Fig. 11. The apex frame (or the intermediate frame for the SMIC database, for which the apex frames are not labeled), with dimensions 28×28×3, was input into a convolutional layer with a kernel size of 3×3. After the convolutional layer, a rectified linear unit (ReLU) was applied, followed by a batch normalization layer. Afterward, a max-pooling layer was applied, followed by an average pooling layer. In the end, a fully connected (FC) layer was applied before a final softmax classification layer. The model parameters were optimized using the Adam optimization algorithm.
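The following PyTorch sketch mirrors the baseline just described (3×3 convolution, ReLU, batch normalization, max pooling, average pooling, a fully connected layer, softmax via the cross-entropy loss, and the Adam optimizer). The number of convolution filters and the pooling sizes are assumptions, since they are not stated in the text; the released code should be consulted for the exact configuration.

```python
# One-stream baseline CNN sketch for 28x28x3 apex frames (filter count and pooling sizes assumed).
import torch
import torch.nn as nn

class BaselineCNN(nn.Module):
    def __init__(self, num_classes: int = 4):        # 4 for the first fused dataset, 9 for the second
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 16 filters is an assumed value
            nn.ReLU(),
            nn.BatchNorm2d(16),
            nn.MaxPool2d(2),
            nn.AvgPool2d(2),
        )
        self.fc = nn.Linear(16 * 7 * 7, num_classes)      # 28 -> 14 -> 7 after the two poolings

    def forward(self, x):                                 # x: (N, 3, 28, 28) apex frames
        return self.fc(self.features(x).flatten(1))       # softmax is applied inside the loss

model = BaselineCNN(num_classes=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()                         # combines log-softmax and NLL
logits = model(torch.randn(8, 3, 28, 28))
print(logits.shape)                                       # -> torch.Size([8, 4])
```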
For baseline verification, we employed the LOVO cross-validation evaluation method, in which all videos were used for training except the one left out for testing. The recognition accuracy, i.e., the percentage of correctly recognized samples over the total number of test samples, as defined in Equation (1), was used for evaluation:

Accuracy = (T / N) × 100%,    (1)

Table 2 Performance of the existing methods using the composite database evaluation (CDE) method

Method | Composite-2: Acc, F1-score, UAR, WAR | CASME II: Acc, UF1, UAR, F1-score

ATNet [63] - - - - 83.40 79.80 77.50 -


Quang et al. [77] - - - - - 70.68 70.18 -
ICE-GAN [95] - - - - - 87.60 86.80 -
Peng et al. [65] 74.70 64.00 - - 75.68 - - 65.00
SA-AT [103] - - - - - 76.07 75.52 -
Dual-Inception [104] - - - - - 86.21 85.60 -
ELRCN-SE [35] 57.00 41.07 39.00 - - - - -
ELRCN-TE [35] 47.00 36.16 33.00 - - - - -
LFM-based (68×68 LFM) [12] - - - - - 87.00 84.00 -
CLFM (21×21 LFM) [12] - - - - - 72.00 77.00 -
TSNN-IF [84] - 60.75 55.40 75.47 - - - -
TSNN-LF [84] - 61.01 57.60 69.43 - - - -
RCN-WN [88] - - - - - 36.13 36.61 -
RCN-SA [88] - - - - - 47.29 51.53 -
RCN-SU [88] - - - - - 43.11 46.41 -
RCN-DB [88] - - - - - 55.96 57.97 -
Method                          SAMM (Acc / UF1 / UAR / F1-score)    SMIC (Acc / UF1 / UAR)    Composite-3 (Acc / UF1 / UAR)

ATNet [63] 70.10 49.60 48.20 - 56.10 55.30 54.30 69.30 63.10 61.30
Quang et al. [77] 62.09 59.89 - - - 58.20 58.77 - 65.20 65.06
ICE-GAN [95] 85.50 82.30 - - - 79.00 79.10 - 84.50 84.10
Peng et al. [65] 70.59 - - 54.00 - - - - - -
SA-AT [103] - 44.76 48.68 - - 55.12 54.63 - 59.36 59.58
Dual-Inception [104] - 58.68 56.63 - - 66.45 67.26 - 73.22 72.78
ELRCN-SE [35] - - - - - - - - - -
ELRCN-TE [35] - - - - - - - - - -
LFM-based (68×68 LFM) [12]    - 67.00 60.00 - - 72.00 71.00 - 77.00 75.00
CLFM (21×21 LFM) [12]    - 65.00 51.00 - - 71.00 71.00 - 75.00 72.00
TSNN-IF [84] - - - - - - - - - -
TSNN-LF [84] - - - - - - - - - -
RCN-WN [88] - 27.76 31.32 - - 46.95 46.37 - 43.39 42.87
RCN-SA [88] - 43.28 43.14 - - 53.88 55.63 - 51.84 53.54
RCN-SU [88] - 45.85 45.86 - - 50.64 52.13 - 49.89 51.43
RCN-DB [88] - 53.03 52.34 - - 63.30 63.27 - 59.79 60.37
1 '-' denotes that the corresponding evaluation term is missing in this work.
2 The Composite-2 database consists of the CASME II and SAMM databases; the Composite-3 database consists of the CASME II, SAMM and SMIC databases.


Table 3 Performances of the existing methods using the holdout-database evaluation (HDE) method
Method                          Composite-2 (F1-score / UAR / WAR / Acc)    CASME II (UF1 / UAR / WAR / F1-score / Acc)

ATNet [63] - - - - 52.30 50.10 - - -


Peng et al. [65] - - - - - - - - -
ELRCN-SE [35] 34.11 35.22 - 43.45 - - - - -
ELRCN-TE [35] 23.89 22.21 - 23.20 - - - - -
LFM-based (68×68 LFM) [12]    - - - - - - - 60.66 63.57
CLFM (21×21 LFM) [12]    - - - - - - - 57.79 62.79
TSNN-IF [84] 45.21 39.10 49.47 - - - - - -
TSNN-LF [84] 43.67 40.22 44.69 - - - - - -
Method                          SAMM (UF1 / UAR / WAR)    SMIC (UF1 / UAR / F1-score / Acc)

ATNet [63] 42.90 42.70 - 49.70 48.90 - -


Peng et al. [65] - - - - - - -
ELRCN-SE [35] - - - - - - -
ELRCN-TE [35] - - - - - - -
LFM-based (68×68 LFM) [12]    - - - - - 58.76 58.54
CLFM (21×21 LFM) [12]    - - - - - 49.78 55.49
TSNN-IF [84] - - - - - - -
TSNN-LF [84] - - - - - - -
1 '-' denotes that the corresponding evaluation term is missing in this work.
2 The Composite-2 database consists of the CASME II and SAMM databases.

where T and N are the number of correctly predicted test samples and the total number of test samples, respectively. Table 6 summarizes the baseline performances.
The prediction accuracy is 47.32% for the 4-class fused dataset and 26.39% for the 9-class fused dataset. The recognition accuracy of the 4-class fused dataset is much better than that of the 9-class fused dataset. This is probably because the 9-class fused dataset has more categories with fewer samples per category, which makes model training more challenging.
We further illustrate the baseline performances with confusion matrices. Figure 12 shows the confusion matrix of the 4-class fused dataset. The 9-class dataset confusion matrix is shown in Fig. 13. Each row in the confusion matrix represents a ground-truth class label, each column represents a predicted class, and the diagonal values denote the correctly recognized samples.
For the sake of completeness, we also report below the number of training epochs and the training and testing times of our experiments on both the 4-class and the 9-class fused datasets. Figure 14 shows the training curves of one training instance of the 4-class fused dataset. Figure 15 shows the training curves of the 9-class fused dataset. In particular, we set the number of training epochs in our experiments to 150.
Moreover, for the 4-class fused dataset, the training time was 5.7430 s, and the test time was 0.0079 s. For the 9-class fused dataset, the training and test times were 4.4630 s and 0.0067 s, respectively. The training time difference between the two fused datasets is due to the different numbers of training samples. We carried out our experiments using an Intel i5-9600K CPU@3.70 GHz and an NVIDIA GeForce RTX 2070 GPU with 16 GB of memory.

7 Possible future works

Deep learning-based methods have advanced the current microexpression recognition research. However, microexpression recognition remains a challenging problem due to the lack of training data, imbalanced classes, and other problems. In this section, we will discuss possible future works that might contribute to the advancement of microexpression recognition.


Table 4 The performance of the existing methods using the leave-one-subject-out (LOSO) evaluation method
Method                          SMIC (UF1 / UAR / Acc / F1-score)    CASME (Acc)    CAS(ME)2 (Acc / F1-score)

Liu et al. [49] 74.61 75.30 - - - - -


Aouayeb et al. [1] 88.86 88.28 - - - - -
Zhi et al. [101] - - 66.30 - - - -
Li et al. [41] - - 60.67 - 56.60 - -
SSSN [34] - - 63.41 63.29 - - -
DSSN [34] - - 63.41 64.62 - - -
ELRCN-SE [35] - - - - - - -
ELRCN-TE [35] - - - - - - -
Li et al. [40] - - 55.49 - 54.44 - -
LFM-based (68×68 LFM) [12]    - - 71.34 71.34 - - -
CLFM (21×21 LFM) [12]    - - 71.34 71.34 - - -
STSTNet [63] 68.01 70.13 - - - - -
TSCNN-I [71] - - 72.74 72.36 - 84.47 84.21
TSCNN-II [71] - - - - - 86.22 86.18
MER-GCN [50] - - - - - - -
Kim et al. [36] - - - - - - -
OFF-ApexNet [19] - - 67.68 67.09 - - -
RCN-A [89] 63.26 64.41 - - - - -
RCN-S [89] 65.19 65.72 - - - - -
RCN-W [89] 65.84 66.00 - - - - -
RCN-F [89] 59.91 59.80 - - - - -
MagGA [43] - - - - - - -
MagSA [43] - - - - - - -
Joint+Multi-task [27]    - - 65.10 - - - -
STRCN-A [87] - - 53.10 51.40 - - -
STRCN-G [87] - - 72.30 69.50 - - -
OrigiNet [78] - - - - 66.30 54.56 -
Method                          CASME II (UF1 / UAR / F1-score / Acc)    SAMM (UF1 / UAR / F1-score / Acc)

Liu et al. [49] 82.93 82.09 - - 77.54 71.52 - -


Aouayeb et al. [1] 98.57 98.57 - - 78.55 81.03 - -
Zhi et al. [101] - - - 65.90 - - - -
Li et al. [41] - - - 56.94 - - - -
SSSN [34] - - 71.51 71.19 - - 45.13 56.62
DSSN [34] - - 72.97 70.78 - - 46.44 57.35
ELRCN-SE [35] - 38.95 45.47 47.15 - - - -
ELRCN-TE [35] - 43.96 50.00 52.44 - - - -
Li et al. [40] - - - 59.11 - - - -
LFM-based (68×68 LFM) [12]    - - 71.65 73.98 - - - -
CLFM (21×21 LFM) [12]    - - 70.26 71.54 - - - -
STSTNet [63] 83.82 86.86 - - 65.88 68.10 - -
TSCNN-I [71] - - 73.27 74.05 - - 60.65 63.53
TSCNN-II [71] - - 80.70 80.97 - - 69.42 71.76
MER-GCN [50] - - - 42.71 - - - -
Kim et al. [36] - - - 60.98 - - - -



OFF-ApexNet [19] - - 86.97 88.28 - - 54.23 68.18


RCN-A [89] 85.12 81.23 - - 76.01 67.15 - -
RCN-S [89] 83.60 79.14 - - 76.47 65.65 - -
RCN-W [89] 85.22 81.31 - - 71.64 61.64 - -
RCN-F [89] 85.63 80.87 - - 69.76 67.71 - -
MagGA [43] - - - 63.30 - - - -
MagSA [43] - - - 60.64 - - - -
Joint+Multi-task [27]    - - - 66.20 - - - -
STRCN-A [87] - - 54.20 56.00 - - 49.20 54.50
STRCN-G [87] - - 74.70 80.30 - - 74.10 78.60
OrigiNet [78] - - - 62.09 - - - 34.89

Method                          Composite-3 (UF1 / UAR / Acc / F1-score)

Liu et al. [49] 78.85 78.24 - -


Aouayeb et al. [1] 90.22 90.18 - -
Zhi et al. [101] - - - -
Li et al. [41] - - - -
SSSN [34] - - - -
DSSN [34] - - - -
ELRCN-SE [35] - - - -
ELRCN-TE [35] - - - -
Li et al. [40] - - - -
LFM-based (68×68 LFM) [12]    - - - -
CLFM (21×21 LFM) [12]    - - - -
STSTNet [63] 73.53 76.05 76.92 73.89
TSCNN-I [71] - - - -
TSCNN-II [71] - - - -
MER-GCN [50] - - - -
Kim et al. [36] - - - -
OFF-ApexNet [19] - - - -
RCN-A [89] 74.32 71.90 - -
RCN-S [89] 74.66 71.06 - -
RCN-W [89] 74.22 71.00 - -
RCN-F [89] 71.64 70.52 - -
MagGA [43] - - - -
MagSA [43] - - - -
Joint+Multi-task [27]    - - - -
STRCN-A [87] - - - -
STRCN-G [87] - - - -
OrigiNet [78] - - - -
1 '-' denotes that the corresponding evaluation term is missing in this work.
2 The Composite-3 database consists of the CASME II, SAMM and SMIC databases.


Table 5 Performances of the existing methods using the leave-one-video-out (LOVO) evaluation method

Method                          CASME II (Acc / F1-score)    SMIC (Acc / F1-score)    SAMM (Acc / F1-score)
Mayya et al. [52] 64.90 - 65.85 - - -
STRCN-A [87] 84.10 78.40 75.80 71.40 83.60 79.20
STRCN-G [87] 83.30 80.70 74.90 71.00 82.70 78.10

'-' denotes that the corresponding evaluation term is missing in this work.

Fig. 11 The structure of the proposed convolutional neural network-based baseline

Table 6 Baseline performance on the fused datasets


Evaluation Method 4-class Fused Dataset 9-class Fused Dataset

LOVO 47.32% 26.39%

Fig. 12 The confusion matrix of the 4-class fused dataset

Fig. 13 The confusion matrix of the 9-class fused dataset

7.1 Microexpression dataset collection

Deep learning has achieved considerable success in many fields. The successful application of deep learning-based methods requires a large amount of training data. Constrained by the current data collection settings, existing microexpression datasets are mostly small-scale. The lack of data hinders the application of deep learning-based methods in microexpression recognition. Future work on large-scale microexpression datasets might address this problem.
Most of the existing microexpression datasets are collected in the laboratory. Microexpressions in natural settings vary in head pose (e.g., frontal or nonfrontal), occlusion conditions, etc. Collecting and recognizing microexpression data in the wild are interesting problems worth exploring.


Fig. 14 Training curves of the 4-class fused dataset

Fig. 15 Training curves of the 9-class fused dataset

Psychological studies have found that people produce complex expressions in daily life and that the emergence of one emotion may be accompanied by others [16, 17]. These expressions are named compound expressions and are more realistic. Current microexpression datasets focus on basic microexpressions, namely, anger, sadness, surprise, fear, other, happiness, disgust, repress, and tense. Datasets on compound expressions could provide valuable evaluation data in this direction.

7.2 Optimizing network structure

Other than data enrichment, optimizing the network structure will also boost microexpression recognition performance. Auxiliary modules can be added to enhance the performance of the feature extraction module or of the whole network [13]. For example, to improve network representations of facial images, squeeze-and-excitation (SE) [28] modules are introduced to learn feature weights. The main idea of SE is to improve the expressive ability of the network by explicitly modeling the interdependence between feature channels. It will be worth exploring this type of module within microexpression recognition networks.
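As an illustration only, a minimal squeeze-and-excitation block in the spirit of [28] can be sketched as follows; the reduction ratio and where such a block would be placed inside a microexpression recognition network are assumptions, not prescriptions from the surveyed works.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation sketch: global average pooling ("squeeze") followed
    by a two-layer gating network ("excitation") that reweights feature channels."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))   # squeeze: per-channel statistics
        return x * weights.view(n, c, 1, 1)     # excitation: rescale feature channels
```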


Cascade classifiers can also be used to optimize the network structure. Cascade classifiers are often used in face detection [15, 54, 93]. The main idea of the cascade structure is to filter out negative regions using a classifier sequence. The input images are sequentially processed by a cascade of classifiers, and only the candidates that pass a previous classifier are fed into the subsequent classifier for further processing. Those that pass all classifiers are considered face regions. For microexpression recognition, we argue that cascade structures can be beneficial. For example, if we design the later-stage classifiers to handle hard negatives and the early-stage classifiers to handle the other negative examples, the cascade can be trained so that each stage learns its specific task.
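The following schematic sketch illustrates the cascade idea discussed above; the `classifiers` list and the stage names are hypothetical, and the early-rejection logic mirrors the general description rather than any specific surveyed implementation.

```python
def cascade_filter(candidates, classifiers):
    """Pass candidate regions through a sequence of classifiers; a candidate
    survives only if every stage accepts it (early stages reject cheaply,
    later stages handle the harder negatives)."""
    survivors = candidates
    for stage in classifiers:
        survivors = [region for region in survivors if stage(region)]
        if not survivors:
            break
    return survivors

# Usage sketch: stages could be trained on progressively harder negative examples.
# faces = cascade_filter(all_regions, [coarse_clf, medium_clf, hard_negative_clf])
```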
7.3 Multimodal information for microexpression recognition

The current microexpression recognition algorithms rely on one modality of data, i.e., videos or image sequences. In reality, we perceive emotions from multimodal information, including gestures and speech. Inspired by this observation, we argue that further research into multimodal microexpression recognition might bring breakthroughs to this field.
As a nonverbal communication cue, postures are also able to convey emotions. A microgesture dataset [9] was proposed in 2019, which includes 3,692 manually labeled gesture fragments collected from 40 participants. Microgestures are body movements that are generated when hidden expressions are triggered in unconstrained conditions. In future research, microgestures can be used as an additional modality of information for microexpression recognition.

8 Conclusions

In the past decade, many new methods have emerged in microexpression recognition research, but it is still a challenging problem. This paper provides an overview of public databases and studies deep learning-based microexpression recognition methods. We also fused five publicly available datasets to form two new datasets and evaluated a baseline CNN method on the newly fused datasets. Furthermore, we discuss possible research directions for the task of microexpression recognition in the future.

Declarations

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Aouayeb M, Hamidouche W, Kpalma K, Benazza-Benyahia A (2019) A spatiotemporal deep learning solution for automatic micro-expressions recognition from local facial regions. In: 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), pp 1–6. IEEE
2. Barron JL, Fleet DJ, Beauchemin SS (1994) Performance of optical flow techniques. Int J Comput Vis 12:43–77. https://doi.org/10.1007/BF01420984
3. Belaiche R, Liu Y, Migniot C, Ginhac D, Yang F (2020) Cost-effective CNNs for real-time micro-expression recognition. Appl Sci (Switzerland). https://doi.org/10.3390/app10144959
4. Borza D, Itu R, Danescu R (2018) Micro expression detection and recognition from high speed cameras using convolutional neural networks. In: VISIGRAPP 2018 - Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications 5(January):201–208. https://doi.org/10.5220/0006548802010208
5. Chang CC, Lin CJ (2011) Libsvm: a library for support vector machines. ACM Trans Intell Syst Technol (TIST) 2(3):1–27
6. Chaudhry R, Ravichandran A, Hager G, Vidal R (2009) Histograms of oriented optical flow and binet-cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp 1932–1939. IEEE
7. Chen B, Zhang Z, Liu N, Tan Y, Liu X, Chen T (2020) Spatiotemporal convolutional neural network with convolutional block attention module for micro-expression recognition. Information (Switzerland). https://doi.org/10.3390/INFO11080380
8. Chen C, Crivelli C, Garrod O, Schyns PG, Jack RE (2018) Distinct facial expressions represent pain and pleasure across cultures. Proc Natl Acad Sci 115(43):201807862
9. Chen H, Liu X, Li X, Shi H, Zhao G (2019) Analyze spontaneous gestures for emotional stress state recognition: a micro-gesture dataset and analysis with deep learning. In: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019)
10. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2014) Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062
11. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078
12. Choi DY, Song BC (2020) Facial micro-expression recognition using two-dimensional landmark feature maps. IEEE Access 8:121549–121563. https://doi.org/10.1109/ACCESS.2020.3006958
13. Dang M, Wang H (2020) A survey of facial expression recognition methods based on deep learning. Sci Technol Eng 20(24):9724
14. Davison AK, Lansley C, Costen N, Tan K, Yap MH (2016) Samm: a spontaneous micro-facial movement dataset. IEEE Trans Affect Comput 9(1):116–129
15. Diyasa G, Fauzi A, Idhom M, Setiawan A (2021) Multi-face recognition for the detection of prisoners in jail using a modified cascade classifier and cnn. J Phys Conf Ser 1844:012005
16. Du S, Martinez AM (2015) Compound facial expressions of emotion: from basic research to clinical applications. Dial Clin Neurosci 17(4):443
17. Du S, Tao Y, Martinez AM (2014) Compound facial expressions of emotion. Proc Natl Acad Sci U S A 111(15):E1454


18. Farnebäck G (2003) Two-frame motion estimation based on polynomial expansion. In: 13th Scandinavian Conference on Image Analysis (SCIA 2003)
19. Gan YS, Liong ST, Yau WC, Huang YC, Tan LK (2019) OFF-ApexNet on micro-expression recognition system. Signal Process Image Commun 74(May):129–139. https://doi.org/10.1016/j.image.2019.02.005
20. Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, Marchand M, Lempitsky V (2017) Domain-adversarial training of neural networks. J Mach Learn Res 17(1):2096
21. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial networks. In: Proceedings of Advances in Neural Information Processing Systems, pp 2672–2680
22. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
23. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
24. Heaven D (2020) Why faces don't always tell the truth about feelings. Nature 578(7796):502–504. https://doi.org/10.1038/d41586-020-00507-5
25. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
26. Horn B, Schunck BG (1981) Determining optical flow. Artificial Intell 17(1–3):185–203
27. Hu C, Jiang D, Zou H, Zuo X, Shu Y (2018) Multi-task micro-expression recognition combining deep and handcrafted features. In: 2018 24th International Conference on Pattern Recognition (ICPR), pp 946–951. IEEE
28. Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141
29. Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A, Brox T (2017) Flownet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2462–2470
30. Ji S, Xu W, Yang M, Yu K (2013) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231
31. Jingting L, Wang SJ, Yap MH, See J, Hong X, Li X (2020) Megc2020 - the third facial micro-expression grand challenge. In: 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp 777–780. IEEE
32. Justesen N, Bontrager P, Togelius J, Risi S (2020) Deep learning for video game playing. IEEE Trans Games 12(1):1–20. https://doi.org/10.1109/TG.2019.2896986
33. Karmakar P, Teng SW, Lu G (2021) Thank you for attention: a survey on attention-based artificial neural networks for automatic speech recognition. arXiv preprint arXiv:2102.07259
34. Khor HQ, See J, Liong ST, Phan RC, Lin W (2019) Dual-stream shallow networks for facial micro-expression recognition. In: 2019 IEEE International Conference on Image Processing (ICIP), pp 36–40. IEEE
35. Khor HQ, See J, Phan RCW, Lin W (2018) Elrcn. In: IEEE International Conference on Automatic Face & Gesture Recognition
36. Kim DH, Baddar WJ, Ro YM (2016) Micro-expression recognition with expression-state constrained spatio-temporal feature representations. In: MM 2016 - Proceedings of the 2016 ACM Multimedia Conference, pp 382–386. https://doi.org/10.1145/2964284.2967247
37. Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907
38. Krizhevsky A, Sutskever I, Hinton G (2012) Imagenet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems, pp 1097–1105
39. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inform Process Syst 25:1097–1105
40. Li J, Wang Y, See J, Liu W (2019) Micro-expression recognition based on 3D flow convolutional neural network. Pattern Anal Appl 22(4):1331–1339. https://doi.org/10.1007/s10044-018-0757-5
41. Li Q, Yu J, Kurihara T, Zhang H, Zhan S (2020) Deep convolutional neural network with optical flow for facial micro-expression recognition. J Circuits, Syst Comput 29(1):1–18. https://doi.org/10.1142/S0218126620500061
42. Li X, Pfister T, Huang X, Zhao G, Pietikäinen M (2013) A spontaneous micro-expression database: Inducement, collection and baseline. In: Proceedings of IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp 1–6
43. Li Y, Huang X, Zhao G (2018) Can micro-expression be recognized based on single apex frame? In: Proceedings - International Conference on Image Processing, ICIP, pp 3094–3098. https://doi.org/10.1109/ICIP.2018.8451376
44. Liang M, Hu X (2015) Recurrent convolutional neural network for object recognition. In: IEEE Conference on Computer Vision & Pattern Recognition, pp 3367–3375
45. Liong ST, Gan YS, See J, Khor HQ, Huang YC (2019) Shallow triple stream three-dimensional CNN (STSTNet) for micro-expression recognition. In: Proceedings of IEEE International Conference on Automatic Face & Gesture Recognition, pp 1–5
46. Liong ST, Gan YS, Zheng D, Li SM, Xu HX, Zhang HZ, Lyu RK, Liu KH (2019) Evaluation of the spatio-temporal features and GAN for micro-expression recognition system. J Sign Process Syst 92:705–725. https://doi.org/10.1007/s11265-020-01523-4
47. Liong ST, See J, Wong KS, Le Ngo AC, Oh YH, Phan R (2016) Automatic apex frame spotting in micro-expression database. In: Proceedings - 3rd IAPR Asian Conference on Pattern Recognition, ACPR 2015, pp 665–669. https://doi.org/10.1109/ACPR.2015.7486586
48. Liu H, Simonyan K, Yang Y (2018) Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055
49. Liu Y, Du H, Zheng L, Gedeon T (2019) A neural micro-expression recognizer. In: Proceedings - 14th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2019, pp 1–4. https://doi.org/10.1109/FG.2019.8756583
50. Lo L, Xie HX, Shuai HH, Cheng WH (2020) MER-GCN: Micro-expression recognition based on relation modeling with graph convolutional networks. In: Proceedings - 3rd International Conference on Multimedia Information Processing and Retrieval, MIPR 2020, pp 79–84. https://doi.org/10.1109/MIPR49039.2020.00023
51. Lucey P, Cohn JF, Kanade T, Saragih J, Matthews I (2010) The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In: Computer Vision & Pattern Recognition Workshops
52. Mayya V, Pai RM, Pai MM (2016) Combining temporal interpolation and DCNN for faster recognition of micro-expressions in video sequences. In: 2016 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2016, pp 699–703. https://doi.org/10.1109/ICACCI.2016.7732128


53. Merghani W, Davison AK, Yap MH (2018) A review on facial micro-expressions analysis: datasets, features and metrics. pp 1–19. http://arxiv.org/abs/1805.02397
54. Minu M, Arun K, Tiwari A, Rampuria P (2020) Face recognition system based on haar cascade classifier. Int J Adv Sci Technol 29(5):3799–3805
55. Nag S, Bhunia AK, Konwer A, Roy PP (2019) Facial micro-expression spotting and recognition using time contrasted feature with visual memory. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 2022–2026. IEEE
56. Nistor SC (2020) Multi-staged training of deep neural networks for micro-expression recognition. In: SACI 2020 - IEEE 14th International Symposium on Applied Computational Intelligence and Informatics, Proceedings, pp 29–34. https://doi.org/10.1109/SACI49304.2020.9118811
57. Odena A, Olah C, Shlens J (2017) Conditional image synthesis with auxiliary classifier gans. In: Proceedings of International Conference on Machine Learning, Vol 70, pp 2642–2651
58. Oh YH, See J, Le Ngo AC, Phan RC, Baskaran VM (2018) A survey of automatic facial micro-expression analysis: databases, methods, and challenges. Front Psychol. https://doi.org/10.3389/fpsyg.2018.01128
59. Ojala T, Pietikainen M, Maenpaa T (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans Pattern Anal Mach Intell 24(7):971–987
60. Ojala T, Pietikäinen M, Harwood D (1996) A comparative study of texture measures with classification based on feature distributions. Pattern Recogn 29(1):51–59
61. Parkhi OM, Vedaldi A, Zisserman A (2015) Deep face recognition. In: Proceedings of the British Machine Vision Conference, pp 41.1–41.12
62. Patel D, Hong X, Zhao G (2016) Selective deep features for micro-expression recognition. In: Proceedings - International Conference on Pattern Recognition, pp 2258–2263. https://doi.org/10.1109/ICPR.2016.7899972
63. Peng M, Wang C, Bi T, Chen T, Shi Y (2019) A novel apex-time network for cross-dataset micro-expression recognition. In: 8th International Conference on Affective Computing and Intelligent Interaction (ACII)
64. Peng M, Wang C, Chen T, Liu G, Fu X (2017) Dual temporal scale convolutional neural network for micro-expression recognition. Front Psychol. https://doi.org/10.3389/fpsyg.2017.01745
65. Peng M, Wu Z, Zhang Z, Chen T (2018) From macro to micro expression recognition: Deep learning on small datasets using transfer learning. In: Proceedings - 13th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2018, pp 657–661. https://doi.org/10.1109/FG.2018.00103
66. Pfister T, Li X, Zhao G, Pietikäinen M (2011) Recognising spontaneous facial micro-expressions. In: Proceedings of International Conference on Computer Vision, pp 1449–1456
67. Reddy SPT, Karri ST, Dubey SR, Mukherjee S (2019) Spontaneous facial micro-expression recognition using 3d spatiotemporal convolutional neural networks. In: Proceedings of International Joint Conference on Neural Networks, pp 1–8
68. Sabour S, Frosst N, Hinton GE (2017) Dynamic routing between capsules. arXiv preprint arXiv:1710.09829
69. See J, Yap MH, Li J, Hong X, Wang SJ (2019) MEGC 2019 - The second facial micro-expressions grand challenge. In: Proceedings - 14th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2019, pp 1–5. https://doi.org/10.1109/FG.2019.8756611
70. Senst T, Eiselein V, Sikora T (2012) Robust local optical flow for feature tracking. IEEE Trans Circuits Syst Video Technol 22(9):1377–1387
71. Song B, Li K, Zong Y, Zhu J, Zheng W, Shi J, Zhao L (2019) Recognizing spontaneous micro-expression using a three-stream convolutional neural network. IEEE Access 7:184537–184551. https://doi.org/10.1109/ACCESS.2019.2960629
72. Taini M, Zhao G, Li SZ, Pietikainen M (2008) Facial expression recognition from near-infrared video sequences. In: 19th International Conference on Pattern Recognition (ICPR 2008), December 8-11, 2008, Tampa, Florida, USA
73. Takalkar MA, Xu M (2017) Image based facial micro-expression recognition using deep learning on small datasets. In: 2017 international conference on digital image computing: techniques and applications (DICTA), pp 1–7. IEEE
74. Takalkar MA, Xu M, Chaczko Z (2020) Manifold feature integration for micro-expression recognition. Multimedia Syst 26(5):535–551. https://doi.org/10.1007/s00530-020-00663-8
75. Takalkar MA, Zhang H, Xu M (2019) Improving micro-expression recognition accuracy using twofold feature extraction. In: Proceedings of International Conference on MultiMedia Modeling, pp 652–664
76. Tokmakov P, Alahari K, Schmid C (2017) Learning motion patterns in videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3386–3394
77. Van Quang N, Chun J, Tokuyama T (2019) CapsuleNet for micro-expression recognition. In: Proceedings - 14th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2019, pp 1–7. https://doi.org/10.1109/FG.2019.8756544
78. Verma M, Vipparthi SK, Singh G (2020) Non-linearities improve OrigiNet based on active imaging for micro expression recognition. In: Proceedings of the International Joint Conference on Neural Networks. https://doi.org/10.1109/IJCNN48605.2020.9207718
79. Verma M, Vipparthi SK, Singh G, Subrahmanyam M (2019) LEARNet: Dynamic imaging network for micro expression recognition. arXiv preprint arXiv:1904.09410
80. Wang C, Peng M, Bi T, Chen T (2020) Micro-attention for micro-expression recognition. Neurocomputing 410:354–362. https://doi.org/10.1016/j.neucom.2020.06.005
81. Wang SJ, Li BJ, Liu YJ, Yan WJ, Ou X, Huang X, Xu F, Fu X (2018) Micro-expression recognition with small sample size by transferring long-term convolutional neural network. Neurocomputing 312:251–262. https://doi.org/10.1016/j.neucom.2018.05.107
82. Wedel, Pock, Braun, Franke, Cremers (2009) Duality tv-l1 flow with fundamental matrix prior. In: Image & Vision Computing New Zealand, IVCNZ International Conference
83. Woo S, Park J, Lee JY, Kweon IS (2018) Cbam: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp 3–19
84. Wu C, Guo F (2021) TSNN: three-stream combining 2D and 3D convolutional neural network for micro-expression recognition. IEEJ Trans Electric Electron Eng 16(1):98–107. https://doi.org/10.1002/tee.23272
85. Wu HY, Rubinstein M, Shih E, Guttag J, Freeman W (2012) Eulerian video magnification for revealing subtle changes in the world. In: SIGGRAPH
86. Xia Z, Feng X, Hong X, Zhao G (2019) Spontaneous facial micro-expression recognition via deep convolutional network. In: 2018 8th International Conference on Image Processing Theory, Tools and Applications, IPTA 2018 - Proceedings, pp 1–6. https://doi.org/10.1109/IPTA.2018.8608119


87. Xia Z, Hong X, Gao X, Feng X, Zhao G (2020) Spatiotemporal recurrent convolutional networks for recognizing spontaneous micro-expressions. IEEE Trans Multi 22:626–640
88. Xia Z, Liang H, Hong X, Feng X (2019) Cross-database micro-expression recognition with deep convolutional networks. In: ACM International Conference Proceeding Series, pp 56–60. https://doi.org/10.1145/3345336.3345343
89. Xia Z, Peng W, Khor HQ, Feng X, Zhao G (2020) Revealing the invisible with model and data shrinking for composite-database micro-expression recognition. IEEE Trans Image Process 29:8590–8605. https://doi.org/10.1109/TIP.2020.3018222
90. Xie HX, Lo L, Shuai HH, Cheng WH (2020) An overview of facial micro-expression analysis: Data, methodology and challenge. pp 1–20. http://arxiv.org/abs/2012.11307
91. Yan WJ, Li X, Wang SJ, Zhao G, Liu YJ, Chen YH, Fu X (2014) Casme ii: an improved spontaneous micro-expression database and the baseline evaluation. Plos One 9(1):e86041
92. Yan WJ, Wu Q, Liu YJ, Wang SJ, Fu X (2013) Casme database: A dataset of spontaneous micro-expressions collected from neutralized faces. In: IEEE International Conference & Workshops on Automatic Face & Gesture Recognition
93. Yang H, Wang XA (2016) Cascade classifier for face detection. J Algorithms Comput Technol 10(3):187–197
94. Yap MH, See J, Hong X, Wang SJ (2018) Facial micro-expressions grand challenge 2018 summary. In: Proceedings - 13th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2018, pp 675–678. https://doi.org/10.1109/FG.2018.00106
95. Yu J, Zhang C, Song Y, Cai W (2020) ICE-GAN: Identity-aware and capsule-enhanced gan for micro-expression recognition and synthesis. arXiv preprint arXiv:2005.04370
96. Zhang H, Goodfellow I, Metaxas D, Odena A (2019) Self-attention generative adversarial networks. In: Proceedings of International Conference on Machine Learning, vol 97, pp 7354–7363
97. Zhang L, Arandjelovic O (2021) Review of automatic micro-expression recognition in the past decade. Mach Learn Knowl Extract. https://doi.org/10.3390/make3020021
98. Zhang L, Arandjelović O (2021) Review of automatic microexpression recognition in the past decade. Mach Learn Knowl Extract 3(2):414–434. https://doi.org/10.3390/make3020021
99. Zhao G, Li X (2019) Automatic micro-expression analysis: open challenges. Front Psychol 10:1–4. https://doi.org/10.3389/fpsyg.2019.01833
100. Zhao Y, Xu J (2019) A convolutional neural network for compound micro-expression recognition. Sensors (Switzerland). https://doi.org/10.3390/s19245553
101. Zhi R, Xu H, Wan M, Li T (2019) Combining 3d convolutional neural networks with transfer learning by supervised pre-training for facial micro-expression recognition. IEICE Trans Inform Syst 102(5):1054–1064
102. Zhou J, Hong X, Su F, Zhao G (2016) Recurrent convolutional neural network regression for continuous pain intensity estimation in video. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 84–92
103. Zhou L, Mao Q, Xue L (2019) Cross-database micro-expression recognition: A style aggregated and attention transfer approach. In: Proceedings - 2019 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2019, pp 102–107. https://doi.org/10.1109/ICMEW.2019.00025
104. Zhou L, Mao Q, Xue L (2019) Dual-inception network for cross-database micro-expression recognition. In: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019)
105. Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp 2223–2232

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

