Accurate and Reliable Facial Expression Recognition Using Advanced Softmax Loss With Fixed Weights
Abstract—An important challenge for facial expression recognition (FER) is that real-world training data are usually imbalanced. Although many deep learning approaches have been proposed to enhance the discriminative power of deep expression features and achieve good predictive performance, few works have focused on the multiclass imbalance problem. When supervised by softmax loss (SL), which is widely used in FER, the classifier is often biased against minority categories (i.e., these categories are left with smaller interclass angular distances). In this letter, we present advanced softmax loss (ASL) to mitigate the bias induced by data imbalance and hence increase accuracy and reliability. The proposed ASL essentially magnifies the interclass diversity in the angular space to enhance discriminative power in every category. The proposed loss can easily be implemented in any deep network. Extensive experiments on the FER2013 and real-world affective faces (RAF) databases demonstrate that ASL is significantly more accurate and reliable than many state-of-the-art approaches and that it can easily be plugged into other methods to improve their performance.

Index Terms—Deep learning, convolutional neural networks, softmax loss, multiclass imbalance, facial expression recognition.

IEEE SIGNAL PROCESSING LETTERS, VOL. 27, 2020
Manuscript received January 2, 2020; revised April 4, 2020; accepted April 11, 2020. Date of publication April 22, 2020; date of current version May 21, 2020. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61702395, Grant 61972302, and Grant 61711530248, in part by the Key Program of Shaanxi Technology Committee of China under Grant 2019NY-182 and Grant 2020NY-167, and in part by the High-Level Culture Program of Yulin College under Grant 207010074. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Mylene Q. Farias. (Corresponding author: Gang Liu.)
Ping Jiang is with the School of Computer Science and Technology, Xidian University, Xi’an 710071, China, and also with the School of Information Engineering, Yulin University, Yulin 719000, China (e-mail: jiangping@yulinu.edu.cn).
Gang Liu and Quan Wang are with the School of Computer Science and Technology, Xidian University, Xi’an 710071, China (e-mail: gliu_xd@163.com; qwang@xidian.edu.cn).
Jiang Wu is with the School of Information Engineering, Yulin University, Yulin 719000, China (e-mail: wujiang@yulinu.edu.cn).
Digital Object Identifier 10.1109/LSP.2020.2989670
1070-9908 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Amrita Vishwa Vidyapeetham Chennai Campus. Downloaded on January 20,2023 at 16:25:00 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

FACIAL expression recognition (FER) is a popular topic in computer vision and machine learning because of its vast array of potential applications in human-computer interfaces [1], [2]. In recent years, the focus of FER has transitioned from controlled laboratory environments to real-world conditions due to the success of deep learning techniques [3]–[7]. FER in realistic settings is still a challenging problem, since real-world data contain additional factors that are unrelated to expressions, such as head pose, illumination, and gender.

In the FER literature, the most widely used loss function is softmax loss (SL) [8]–[12]. Under the supervision of SL, deeply learned features are likely to be separable in the angular space but not sufficiently discriminative. Hence, many deep embedding approaches have been presented to enhance the discriminative power of deep features by reducing intraclass variation and enhancing interclass differences. Wen et al. [13], [14] proposed the classical center loss (CL) to pull deep features of the same class toward their centers. Li et al. [15], [16] reduced intraclass variation by minimizing the distance between a sample and the center of its neighbors. Based on the CL and the locality-preserving loss, Luo et al. [17] designed local subclass loss to constrain intraclass variation. On the basis of SL and CL, Cai et al. [18] added island loss (IL) to simultaneously increase the interclass angular distance. The exclusive regularization in [19] is actually a special case of IL. However, almost no methods have addressed the problem that real-world expression datasets are usually imbalanced [20].

Under the supervision of SL, the angular space of minority classes is often compressed, which leads to poor discrimination results. That is, minority categories usually have less interclass diversity. Numerous equidistant prototype embeddings [21]–[23] have been used to constrain the target vectors to be equidistant in Euclidean space, but they cannot guarantee that the vectors are uniformly distributed in the angular domain. Therefore, this work proposes advanced softmax loss (ASL) to mitigate the bias against minority categories by using fixed (unlearnable) weights in which all weight vectors are uniformly distributed in the angular space (where a weight vector corresponds to a class). Our major contributions are as follows:
- An analysis of interclass diversity illustrates that it is ideal for all classes to be uniformly distributed in the angular domain.
- A novel method ensures that the deeply learned expression features are uniformly distributed and hence enhances the discriminative ability for minority classes.
- Almost all works have reported the best scores, but these results may be misleading because the recognition results vary across implementations (e.g., in our experiments, the difference between the best score and the worst is more than 2% for some methods). Thus, we present a new performance metric (reliability) to measure the variation in classification scores and employ a fairer metric (the mean score) to evaluate the conventional recognition accuracy.
- Exhaustive experiments on the FER2013 dataset [24] and real-world affective faces (RAF) database [15] demonstrate that the proposed method is more accurate and reliable than many state-of-the-art approaches and that it can easily be incorporated into other methods (such as loss functions [25]–[28] that are extensions of SL).

II. PROPOSED METHOD

A. Softmax Loss

SL is widely used in FER works and is defined as follows:

$$L_S = -\frac{1}{m} \sum_{i=1}^{m} \log \frac{e^{w_{y_i}^T x_i}}{\sum_{k \in \mathcal{N}} e^{w_k^T x_i}}, \tag{1}$$

where $\mathcal{N}$ denotes the set of sample labels and $(\cdot)^T$ denotes the transpose operator. $x_i \in \mathbb{R}^d$ is the deep feature of the $i$th sample, and $y_i \in \mathcal{N}$ is the corresponding class label. $w_k \in \mathbb{R}^d$ denotes the $k$th column (vector) of the weight parameters $W = [w_1, w_2, \ldots, w_n] \in \mathbb{R}^{d \times n}$ in the SL. Here, $n$ and $m$ are the numbers of classes and training samples, respectively.

It is well known that the deeply learned features supervised by SL exhibit a 'radial' distribution and are likely to be separable in the angular domain. For example, Fig. 1 shows the 2D features trained on the training set of MNIST [29] (with only 1,000 samples each for digits 2 and 5) using LeNet++ networks as in [13], [14], [30]. However, SL usually compresses the angular space of minority classes (see digits 2 and 5 in Fig. 1(a)) and thus leads to poor generalization to the test samples.

In well-trained convolutional neural networks (CNNs), the weight vector $w_k$ can be regarded as the cluster center of all $x_i$ with $y_i = k$ [19]. To augment the discriminative power of the CNN, different classes should be as widely separated as possible; therefore, we examine the loss function below, which is similar to IL [18].

$$L_{inter} = \sum_{k \in \mathcal{N}} \sum_{l \in \mathcal{N}, l \neq k} \frac{w_k^T w_l}{\|w_k\| \, \|w_l\|} = \sum_{k \in \mathcal{N}} \sum_{l \in \mathcal{N}, l \neq k} \hat{w}_k^T \hat{w}_l, \tag{2}$$

where $\hat{w}_k$ and $\hat{w}_l$ are normalized weight vectors and $\|\cdot\|$ denotes the Euclidean norm. $\hat{w}_k^T \hat{w}_l$ is the cosine of $A_{kl}$, the angle between $w_k$ and $w_l$, so the loss $L_{inter}$ is the sum of the cosine values of all pairwise angles. Therefore, minimizing $L_{inter}$ magnifies the interclass angular distances. Since $L_{inter} = \|\sum_{k \in \mathcal{N}} \hat{w}_k\|^2 - n$, it takes its minimal value $-n$ only when the following equation holds:

$$\hat{w}_k = -\sum_{l \in \mathcal{N}, l \neq k} \hat{w}_l. \tag{3}$$

Based on the concept of the polytope [31], [32], we know that $\hat{w}_k^T \hat{w}_l = 1/(1 - n)$ is a solution of Eq. (3), i.e., these points are the vertices of a regular $(n - 1)$-simplex (note that $d \geq n - 1$ must hold, where $d$ is the feature dimension). That is, to enhance the discriminative power between different classes, it is ideal that the weight vectors $w_k$ are uniformly distributed in the angular space (i.e., every $A_{kl}$ equals $\arccos(1/(1 - n))$). Hence, the bias due to the imbalanced training data is mitigated by forcing $w_k$ to be uniformly distributed in the angular domain.

B. Advanced Softmax Loss

As mentioned above, if every $A_{kl}$ is equal to $\arccos(1/(1 - n))$ (where $\arccos$ denotes the inverse of the cosine function), this is an optimal scenario for enhancing discriminative power; thus, we aim to ensure that the learned features are uniformly distributed in the angular domain.

Note that $\hat{w}_k^T \hat{w}_l = 1/(1 - n)$ is one solution of Eq. (3) but not the only one. For example, the vectors (1;0;0), (0;1;0), (-1;0;0), and (0;-1;0) satisfy Eq. (3) but are not the desired solution. Thus, we cannot achieve the ideal case by optimizing the loss $L_{inter}$ or by solving Eq. (3). This is why IL is not very effective (see Tables I-II).

For convenience, we suppose that $w_k$ is a unit vector (i.e., $\|w_k\| = 1$), and then we have $\cos(A_{kl}) = w_k^T w_l$. Since $A_{kl} \in [0, \pi]$, our final goal $A_{kl} = \arccos(1/(1 - n))$ is the unique solution of the equation $J = 0$, where $J$ is defined as follows:

$$J = \frac{1}{2n(n - 1)} \sum_{k \in \mathcal{N}} \sum_{l \in \mathcal{N}, l \neq k} \left( w_k^T w_l - \frac{1}{1 - n} \right)^2. \tag{4}$$

Algorithm 2: Training Strategy for ASL.
Input: Training data $\{I_i, y_i\}_{i=1}^{m}$, learning rate $\eta$, number of iterations $T$
Output: Network layer parameters $\theta$
1: Randomly initialize network layer parameters $\theta^1$, generate ASL parameters $w_k$ according to Algorithm 1
2: for $t = 1$ to $T$ do
3:   Compute the loss $L_S^t$ as in Eq. (1)
4:   Compute the back-propagation error: $\partial L_S^t / \partial x_i^t$
5:   Update the network layer parameters:
6:   $\theta^{t+1} = \theta^t - \eta \sum_{i=1}^{m} \frac{\partial L_S^t}{\partial x_i^t} \frac{\partial x_i^t}{\partial \theta^t}$
7: end for
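The role of the fixed weights and of the criterion J in Eq. (4) can be illustrated with a small sketch. It assumes one particular closed-form construction of the simplex vertices (scaled, centered one-hot vectors in n dimensions); Algorithm 1 is not reproduced in this section, so this construction is an assumption rather than the authors' procedure. The function names `simplex_weights`, `uniformity_gap`, and `softmax_loss` are hypothetical and chosen only for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def simplex_weights(n):
    """Return n unit vectors in R^n whose pairwise cosine is 1 / (1 - n).

    Assumed construction: center the one-hot vectors and rescale to unit norm.
    """
    scale = math.sqrt(n / (n - 1))
    return [[scale * ((1.0 if j == k else 0.0) - 1.0 / n) for j in range(n)]
            for k in range(n)]

def uniformity_gap(weights):
    """J of Eq. (4), evaluated for unit-norm weight vectors."""
    n = len(weights)
    target = 1.0 / (1 - n)
    total = sum((dot(weights[k], weights[l]) - target) ** 2
                for k in range(n) for l in range(n) if l != k)
    return total / (2 * n * (n - 1))

def softmax_loss(features, labels, weights):
    """Eq. (1): mean cross-entropy with logits w_k^T x_i; weights are not updated."""
    loss = 0.0
    for x, y in zip(features, labels):
        logits = [dot(w, x) for w in weights]
        top = max(logits)  # subtract the maximum for numerical stability
        log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
        loss += log_sum - logits[y]
    return loss / len(features)

# Seven classes, as an illustrative number of expression categories.
W = simplex_weights(7)
print(uniformity_gap(W))

# The four axis-aligned vectors from the text satisfy Eq. (3) but not J = 0.
axes = [[1, 0, 0], [0, 1, 0], [-1, 0, 0], [0, -1, 0]]
print(uniformity_gap(axes))
```

For the simplex vertices, J vanishes up to floating-point rounding, whereas the four axis-aligned vectors give J = 1/9, which matches the remark that solving Eq. (3) alone does not yield the desired configuration. In a training loop such as Algorithm 2, only the network parameters would receive gradient updates while `W` stays constant.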
JIANG et al.: ACCURATE AND RELIABLE FACIAL EXPRESSION RECOGNITION USING ADVANCED SOFTMAX LOSS WITH FIXED WEIGHTS 727
as in Section II-A, where one can clearly see that all classes are nearly uniformly distributed in the angular space. The ASL enlarges the interclass angular distance of minority categories (digits 2 and 5) and hence improves the recognition performance.

TABLE III
THE RESULTS OF JOINT SUPERVISION FOR THE FER2013 DATASET. THE BEST RESULTS ARE MARKED IN BOLDFACE

where $z_c$ represents the number of samples that belong to the $c$th class and are correctly classified, $n_c$ is the number of samples in the $c$th class, and $|\mathcal{N}|$ denotes the number of categories. OA is the most widely used criterion, and ACA reflects the accuracy achieved on each category. For a good classifier, both OA and ACA should be as large as possible.
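The two criteria can be sketched in a few lines. The defining equations are not reproduced in this section, so the sketch assumes the usual forms OA = (sum of z_c) / (sum of n_c) and ACA = (1 / |N|) times the sum of z_c / n_c. The function names and the class counts are hypothetical.

```python
def overall_accuracy(correct, totals):
    """Assumed OA: correctly classified samples over all samples."""
    return sum(correct) / sum(totals)

def average_class_accuracy(correct, totals):
    """Assumed ACA: unweighted mean of the per-class accuracies z_c / n_c."""
    return sum(z / n for z, n in zip(correct, totals)) / len(totals)

# Hypothetical counts for an imbalanced three-class test set.
z = [90, 8, 2]     # correctly classified samples per class
n = [100, 20, 10]  # samples per class
print(overall_accuracy(z, n), average_class_accuracy(z, n))
```

With these made-up counts, OA is dominated by the majority class, while ACA weights every class equally and is therefore much lower. This is one reason to report both values on imbalanced data.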
In our experiments, every model was tested $N$ ($N = 10$) times, yielding $N$ results. We therefore measure the recognition performance through the following two statistics of each evaluation criterion $EC$ (i.e., OA or ACA):

$$\mathrm{Mean} = \frac{1}{N} \sum_{i=1}^{N} EC_i, \tag{8}$$

$$\mathrm{STD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( EC_i - \mathrm{Mean} \right)^2}, \tag{9}$$
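As a brief illustration of Eqs. (8) and (9), the sketch below computes the mean and the standard deviation over repeated runs. It assumes that Eq. (9) is the population standard deviation (division by N under a square root). The function names and the scores are hypothetical and are not results from the experiments.

```python
import math

def mean_score(scores):
    """Eq. (8): arithmetic mean of the N evaluation results."""
    return sum(scores) / len(scores)

def std_score(scores):
    """Eq. (9), assumed to be the population standard deviation (divides by N)."""
    mu = mean_score(scores)
    return math.sqrt(sum((s - mu) ** 2 for s in scores) / len(scores))

# Hypothetical accuracies (in percent) from four repeated runs of one model.
runs = [70.0, 72.0, 71.0, 71.0]
print(mean_score(runs), std_score(runs))
```

A smaller STD indicates that the model's score varies less across runs, which is the sense in which the text speaks of reliability.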
REFERENCES
[1] R. K. Gupta and S. D. Senturia, “Real time face detection and facial expression recognition: Development and applications to human computer interaction,” in Proc. Conf. Comput. Vis. Pattern Recognit. Workshop, 2003.
[2] J. Deng, C. Pang, Z. Zhang, Z. Pang, H. Yang, and G. Yang, “CGAN based facial expression recognition for human-robot interaction,” IEEE Access, vol. 7, pp. 2169–3536, 2019.
[3] G. E. Hinton, A Practical Guide to Training Restricted Boltzmann Machines. Berlin, Germany: Springer, 2012, pp. 599–619.
[4] A. Majumder, L. Behera, S. Member, and V. K. Subramanian, “Automatic facial expression recognition system using deep network-based data fusion,” IEEE Trans. Cybern., vol. 48, no. 1, pp. 103–114, Jan. 2018.
[5] Y.-H. Lai and S.-H. Lai, “Emotion-preserving representation learning via generative adversarial network for multi-view facial expression recognition,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit. Workshops, 2018, pp. 263–270.
[6] H. Yang, Z. Zhang, and L. Yin, “Identity-adaptive facial expression recognition through expression regeneration using conditional generative adversarial networks,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit. Workshops, 2018, pp. 294–301.
[7] F. Zhang, T. Zhang, Q. Mao, and C. Xu, “Joint pose and expression modeling for facial expression recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3359–3368.
[8] S. Li and W. Deng, “Deep facial expression recognition: A survey,” Oct. 2018, arXiv:1804.08348v2.
[9] B. Fasel, “Robust face analysis using convolutional neural networks,” Object Recognit. Supported User Interact. Service Robots, vol. 2, Aug. 2002, pp. 40–43.
[10] B. Sun, L. Li, G. Zhou, and J. He, “Facial expression recognition in the wild based on multimodal texture features,” J. Elect. Imag., vol. 25, no. 6, 2016, Art. no. 061407.
[11] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[12] K. Wang, X. Peng, J. Yang, S. Lu, and Y. Qiao, “Suppressing uncertainties for large-scale facial expression recognition,” 2020, arXiv:2002.10392v2.
[13] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 499–515.
[14] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A comprehensive study on center loss for deep face recognition,” Int. J. Comput. Vision, vol. 127, pp. 668–683, Jun. 2019.
[15] S. Li, W. Deng, and J. Du, “Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild,” in Proc. Conf. Comput. Vision Pattern Recognit., 2017, pp. 2584–2593.
[16] S. Li and W. Deng, “Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition,” IEEE Trans. Image Process., vol. 28, no. 1, pp. 356–370, Jan. 2019.
[17] Z. Luo, J. Hu, and W. Deng, “Local subclass constraint for facial expression recognition in the wild,” in Proc. 24th Int. Conf. Pattern Recognit., 2018, pp. 3132–3137.
[18] J. Cai, Z. Meng, A. S. Khan, Z. Li, J. O. Reilly, and Y. Tong, “Island loss for learning discriminative features in facial expression recognition,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., 2018, pp. 302–309.
[19] K. Zhao, J. Xu, and M.-M. Cheng, “RegularFace: Deep face recognition via exclusive regularization,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2019, pp. 1136–1144.
[20] S. Li and W. Deng, “A deeper look at facial expression dataset bias,” Apr. 2019, arXiv:1904.11150v1.
[21] W. Deng, Y. Liu, J. Hu, and J. Guo, “The small sample size problem of ICA: A comparative study and analysis,” Pattern Recognit., vol. 45, pp. 4438–4450, 2012.
[22] W. Deng, J. Hu, X. Zhou, and J. Guo, “Equidistant prototypes embedding for single sample based face recognition with generic learning and incremental learning,” Pattern Recognit., vol. 47, pp. 3738–3749, 2014.
[23] M. Hayat, S. Khan, W. Zamir, J. Shen, and L. Shao, “Max-margin class imbalanced learning with Gaussian affinity,” Jan. 2019, arXiv:1901.07711v1.
[24] I. J. Goodfellow et al., “Challenges in representation learning: A report on three machine learning contests,” Neural Netw., vol. 64, pp. 59–63, Apr. 2015.
[25] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille, “Normface: L2 hypersphere embedding for face verification,” in Proc. ACM Int. Conf. MultiMed., 2017, pp. 1041–1049.
[26] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2017, pp. 6738–6746.
[27] H. Wang et al., “CosFace: Large margin cosine loss for deep face recognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2018, pp. 5265–5274.
[28] J. Deng, J. Guo, and N. Xue, “ArcFace: Additive angular margin loss for deep face recognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2019, pp. 4685–4694.
[29] Y. LeCun and C. Cortes, “The MNIST database of handwritten digits,” Oct. 2019.
[30] R. Gao, F. Yang, W. Yang, and Q. Liao, “Margin loss: Making faces more separable,” IEEE Signal Process. Lett., vol. 25, no. 2, pp. 308–312, Feb. 2018.
[31] W. Rudin, Principles of Mathematical Analysis, vol. 3. New York, NY, USA: McGraw-Hill, 1964, ch. 10.
[32] M. A. El-Gebeily and Y. A. Fiagbedzi, “On certain properties of the regular n-simplex,” Int. J. Math. Educ. Sci. Technol., vol. 35, no. 4, pp. 617–629, 2004.
[33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2016, pp. 770–778.
[34] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proc. IEEE Int. Conf. Comput. Vision, 2015, pp. 1026–1034.