
Computers in Human Behavior 127 (2022) 107063

Contents lists available at ScienceDirect

Computers in Human Behavior


journal homepage: www.elsevier.com/locate/comphumbeh

Detecting deception through facial expressions in a dataset of videotaped interviews: A comparison between human judges and machine learning models

Merylin Monaro a,*, Stéphanie Maldera a, Cristina Scarpazza a, Giuseppe Sartori a, Nicolò Navarin b

a Department of General Psychology, University of Padova, Via Venezia 8, 35131, Padova, Italy
b Department of Mathematics, University of Padova, Via Trieste 63, 35131, Padova, Italy

A R T I C L E  I N F O

Keywords:
Facial micro-expressions
Deception detection
Machine learning
Deep learning
Low-stakes lies

A B S T R A C T

In the last decades, research has claimed that facial micro-expressions are a reliable means to detect deception. However, experimental results showed that trained and naïve human judges observing facial micro-expressions can distinguish liars from truth-tellers with an accuracy just slightly above the chance level. More promising results recently came from the field of artificial intelligence, in which machine learning (ML) techniques are used to identify micro-expressions and are trained to distinguish deceptive statements from genuine ones.

In this paper, we test the ability of different feature extraction methods (i.e., improved dense trajectories, OpenFace) and ML techniques (i.e., support vector machines vs. deep neural networks) to distinguish liars from truth-tellers based on facial micro-expressions, using a new video data set collected in low-stakes situations. During the interviews, a technique to increase liars' cognitive load was applied, facilitating the emergence of cues to lying.

Results highlighted that support vector machines (SVMs) coupled with OpenFace were the best performing method (AUC = 0.72 on videos without cognitive load; AUC = 0.78 on videos with cognitive load). All the tested classifiers performed better when a cognitive load was imposed on the interviewee, confirming that the technique of increasing cognitive load during an interview facilitates deception recognition. In the same task, human judges obtained an accuracy of 57%.

Results are discussed and compared with the previous literature, confirming that artificial intelligence performs better than humans do in lie-detection tasks, even when humans have more information to make a decision.

* Corresponding author.
E-mail address: merylin.monaro@unipd.it (M. Monaro).
https://doi.org/10.1016/j.chb.2021.107063
Received 1 May 2021; Received in revised form 30 September 2021; Accepted 16 October 2021
Available online 20 October 2021
0747-5632/© 2021 Elsevier Ltd. All rights reserved.

1. Introduction

Since the first studies by Paul Ekman, facial micro-expressions have been proposed as a means to detect deception (Ekman, 2006). Micro-expressions are "full-face emotional expressions that are compressed in time, lasting only a fraction of their usual duration, so quick they are usually not seen" (Ekman, 1985). Each micro-expression is produced by a specific pattern of facial muscle activation. The Facial Action Coding System (FACS) codes all possible facial expressions by describing the pattern of action units (AUs) – the fundamental actions of individual muscles or groups of muscles that are activated during each expression (Ekman et al., 2002).

According to the inhibition hypothesis, the movements of some facial muscles are mostly involuntary, which means it is difficult to control them deliberately (Darwin, 1873; Ekman, 2006). These muscles are considered cues of deception because humans are not able to inhibit their activation with the purpose of concealing an expression (Porter et al., 2012). Among the micro-expressions that are difficult to control, some involve the forehead (e.g., when the inner corners of the eyebrows are pulled upward) and the mouth area (e.g., narrowing the lips). Conversely, facial movements such as eyebrow raising and lowering are easily falsified, as the muscles responsible for them are under voluntary control (ten Brinke et al., 2012). Many experiments on lie detection (Slowe & Govindaraju, 2007; Zhang et al., 2007) used the FACS to code the expressions and find differences in AU activation between people reporting genuine versus fake statements or emotions. For example, a true smile, called a Duchenne smile, simultaneously stimulates AU12 and AU6. AU12 corresponds to the zygomatic major muscle, which draws the angle of the mouth superiorly and posteriorly to allow one to smile. AU6, instead, corresponds to the orbicularis oculi muscle, which acts to close the eye. A faked smile (or non-Duchenne smile) stimulates only the zygomatic major muscle and not the orbicularis oculi muscle, which cannot be moved voluntarily (Slowe & Govindaraju, 2007).

In the last decades, scientists have tested the ability of human judges to detect deception from the observation of facial micro-expressions. Results consistently found that the accuracy of human beings, whether trained in micro-expression observation or naïve judges, in distinguishing liars from truth-tellers is only slightly above the chance level (Bond & DePaulo, 2006; Curci et al., 2019; DePaulo et al., 2003; Pérez-Rosas et al., 2015; Porter & Brinke, 2008). Why is it so challenging for humans to detect deception from facial cues? The previous literature assumed that this task is extremely expensive for the cognitive system, as it requires paying attention to a huge number of details in a very short time (ten Brinke et al., 2012; Vrij, 2008). The limited resources of the cognitive system do not allow a human judge to catch enough information. Moreover, humans are subject to biases that affect their ability to compute reliable evaluations (e.g., the Othello error, the Brokaw effect) (Burgoon et al., 2008; Ekman, 1985; Vrij, 2008; Vrij et al., 2010). For example, the Othello error consists in judging a truthful person as a liar, often mistaking the signs of fear of being falsely accused for fear of being discovered (Bond & Fahey, 1987); the Brokaw effect consists in underestimating individual differences in displaying facial expressions and taking the lack of cues of deception as evidence of truthfulness (Frank & Ekman, 1997).

The fact that people are generally bad at detecting deception holds not only for the analysis of facial expressions, but also when the decision is taken by analysing behavioural and verbal cues (Bond & DePaulo, 2006). This evidence seems to contrast with the evolutionary perspective: the presence of deceit should have led humans to develop the ability to detect deception to effectively determine whom and what to trust (von Hippel & Trivers, 2011). Frankenhuis et al. (2018) studied whether psychosocial adversity enhances the ability to detect deception, but found no support for this hypothesis. Some authors have argued that the discrepancy between the evolutionary need to detect deceit and people's effective ability to do it could be explained by the social cost of asserting that someone is a liar (ten Brinke et al., 2016). Branding someone as a liar can lead to the deterioration of the social relationship. Thus, humans cope with these competing costs by keeping lie-detection information outside conscious awareness. In other words, according to ten Brinke et al. (2016), the poor performance in detecting lies could be explained by the unconscious recognition of cues of lying that do not consciously emerge.

For all these reasons, some authors recently attempted to develop artificial intelligence systems to automate deception detection (Monaro et al., 2017a, b, 2020b). Concerning deception detection through facial micro-expressions, the approach uses computer vision algorithms (usually based on machine learning techniques) to identify micro-expressions from faces; then, machine learning (ML) models are trained to distinguish faked from genuine emotions or deceptive from truthful statements (Porter & Brinke, 2008; Su & Levine, 2016; Zhang et al., 2007).

1.1. Automatic detection of deception through facial micro-expressions using ML algorithms

To date, few studies have approached the issue of lie detection from facial micro-expressions using ML algorithms. Moreover, most of them analysed the same data set (Pérez-Rosas et al., 2015). Pérez-Rosas et al. (2015) collected a data set of 121 videos (61 deceptive and 60 truthful) of statements made during real court trials. Thus, these videos refer to high-stakes situations. The authors labelled video clips as deceptive or truthful according to the trial verdicts: guilty, non-guilty, and exoneration. The authors manually labelled facial micro-expressions and other non-verbal features using the multimodal coding scheme MUMIN (Allwood et al., 2007). MUMIN is a standard multimodal annotation scheme for interpersonal interaction. It considers facial expressions from different categories, such as eyebrow, eye, and mouth movements, gaze direction, and head movements; it also codes general face displays, like smiles, laughter, scowls, and others. Then, two ML classification algorithms (i.e., decision trees [DTs] and random forest [RF]) were trained to detect deception automatically, and three naïve human evaluators were asked to judge the same videos as truthful or deceptive. Results showed that the automatic classification, trained on visual contents only, reached an accuracy of 70%. By training the algorithms on visual and audio contents, the accuracy improved up to 75.20%, versus the 51% achieved by humans.

On the same data set, Jaiswal et al. (2016) used OpenFace (Baltrusaitis et al., 2018), a machine learning based tool that extracts different sets of features from videos of a person's face, including the AUs from the facial expressions manually annotated in the original study. Training a support vector machine (SVM) classifier on AUs, they obtained a classification accuracy of 67% for the visual modality. Combining the lexical, acoustic, and visual features with feature-level fusion, the accuracy reached by the classifier was approximately 78% (Jaiswal et al., 2016). Wu et al. (2018) employed improved dense trajectories (IDT) in combination with the histogram of oriented gradients (HOG), histogram of optical flow (HOF), and motion boundary histogram (MBH) to extract features from videos. The best performance was obtained by training a linear SVM (Area Under the ROC Curve [AUC] = 77%) and an RF on automatically extracted micro-expression features (AUC = 80%). On the same data set, 10 naïve human judges reached an AUC of 81% (mainly relying on audio information), which dropped to 60% when people watched the videos in silent mode (Wu et al., 2018). A different approach was adopted by Krishnamurthy et al. (2018), who extracted additional visual features with a 3D convolutional neural network (C3D). Training a multilayer perceptron (MLP), they reached an AUC of approximately 96% with the features extracted by the C3D network and of 75% with the facial micro-expressions manually labelled by Pérez-Rosas et al. (Krishnamurthy et al., 2018). On the same data set, Avola et al. (2019) recently extracted AUs using OpenFace after detecting face landmarks, estimating head poses, and computing eye gazes. Then, the authors performed AU occurrence detection with SVM and AU intensity estimation with support vector regression (SVR). Considering only visual features, they obtained an accuracy of 76.84% in the classification task (deceptive vs. truthful behaviour) with an SVM using a radial basis function (RBF) kernel (Avola et al., 2019). Rill-Garcia et al. (2019) trained a long short-term memory network (LSTM), obtaining an AUC of 56% in detecting deception through visual cues. Ding et al. (2019) tested a face-focused cross-stream network (FFCSN). This method trains a faster R-CNN for face detection and expression feature extraction. With this deep learning approach, they reached an accuracy of 93.16% when considering the visual modality only (Ding et al., 2019). Finally, Venkatesh et al. (2020) used a pretrained deep CNN, GoogleNet, to extract the features, and connected a bidirectional LSTM before the last layer of the GoogleNet, which performed the classification. The authors reported an accuracy, for the visual modality, of 100% (Venkatesh et al., 2020). However, the way the authors decided to split the training and test sets strongly influenced the evaluation procedure. In fact, videos from the same subject appear in both the training and the test sets. In this setting, the neural network can learn to recognize just the subjects and to predict the label associated with each of them during training, instead of actually evaluating the deception cues in the videos.

Su and Levine (2016) tested a different data set. The authors collected 324 high-stakes deception videos (51.23% guilty suspects and 48.77% innocents) from YouTube (Su & Levine, 2016). In these videos, suspects pled for the safe return of their missing relatives and denied their involvement in the deaths or disappearances of the victims. The videos were labelled as deceptive when the suspects were found guilty by overwhelming evidence (e.g., blood or DNA matching, security videos, or witness testimony). The authors trained an RF classifier to detect deception automatically, combining features from facial region localization, eye blinks, eyebrow motion, wrinkles, and mouth motion. Considering facial macro- and micro-expressions, they reached an accuracy of 76.92%; with only micro-expressions as features, the accuracy dropped to 56.92% (Su & Levine, 2016).

Rill-Garcia et al. (2019) collected a new data set of 42 videos, in which interviewees described their real and deceitful opinions about abortion; with this data set, analysing just visual contents, they obtained an AUC of approximately 70% with an SVM classifier (Rill-Garcia et al., 2019).

To sum up, the state of the art reports that (i) overall, artificial intelligence performs better than humans do; (ii) visual and textual transcripts seem to be the most informative modalities; and (iii) learning methods based on neural networks seem to give better results. However, almost all the studies in the literature to date applied ML techniques to the same data set, which consisted of videos of liars and truth-tellers in high-stakes situations. High-stakes situations are defined as situations that involve concrete risks or serious consequences for a person, who is likely to either suffer a disadvantage or lose an advantage. Conversely, in low-stakes situations people have nothing to lose when their lies are exposed. Generally, it is very difficult to recreate high-stakes situations in a controlled environment (i.e., the laboratory) because subjects do not perceive the consequences and losses as concrete and harmful, even when a symbolic reward or punishment is given. For this reason, some authors prefer collecting data directly from authentic high-stakes circumstances (e.g., criminal trials). Moreover, in real settings emotions are more intense and, thus, deception cues are more visible (Ekman, 2006). However, collecting data from authentic high-stakes situations has some methodological limits. One of the general issues in lie-detection studies is establishing the ground truth, that is, having videos of people who are genuinely lying or telling the truth (Frank et al., 2008). In high-stakes situations (real settings), more than in low-stakes situations (laboratory settings), it is difficult to establish the ground truth. For instance, a liar may not be identified as such by the court, or the same video can contain both true and false declarations. Thus, it is difficult to obtain a classification model trained entirely on unbiased data. By contrast, in the laboratory the experimenter has a greater degree of control over the true and false contents, as the subject is instructed to lie about precise contents. Moreover, in high-stakes situations the variability between video clips regarding the interviewee's pose, the camera resolution, and other technical aspects could introduce further sources of bias into the classification task. For example, in the data set of Pérez-Rosas et al., the video clips differed in quality, background, light, face angle, and other characteristics that could have affected the automatic elaboration of the data set. Furthermore, as multiple deceptive and truthful clips were collected from the same trials, the same persons appeared in several clips, introducing a possible bias in ML models due to person recognition, if this issue is not explicitly considered in the model evaluation. Although it is a low-stakes setting, collecting data in the laboratory allows the experimenter to overcome such limitations, reducing the variability between the videos and collecting clips from a larger number of individuals.

Finally, few studies have evaluated the accuracy of a large sample of human judges using the same data set used to train and test ML algorithms.
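The subject-overlap confound discussed above (the same person appearing in both training and test clips) is avoidable with a subject-disjoint split. As a minimal sketch of the idea — using scikit-learn's GroupKFold on synthetic placeholder data, not the data of any study cited here:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic placeholders: 12 clips from 6 subjects (2 clips each),
# 4 features per clip, binary liar (1) / truth-teller (0) labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))
y = np.array([0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1])
subjects = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])

# GroupKFold keeps all clips of a subject in the same fold, so a
# classifier cannot score well by merely re-identifying faces.
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=subjects):
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```

An ordinary random split on such data would let person-specific cues leak into the test folds, inflating the reported accuracy.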

Table 1
Overview of the state of the art about the automatic detection of deception through facial micro-expressions using ML algorithms. The table reports a comparison of the data sets used by each study, a comparison of the ML methods applied, and the related accuracy.

| Authors (year) | Data set | Visual features extraction | Algorithm | Performance (visual features only) | Performance (multimodal analysis) | Comparison with human judges |
|---|---|---|---|---|---|---|
| Pérez-Rosas et al. (2015) | 121 videos during real court trials | MUMIN scheme | DT | ACC^a = 68.59% | ACC = 75.20% | ACC = 46.50% (silent video) |
| | | | RF | ACC = 73.55% | ACC = 50.41% | ACC = 56.47% (full video) |
| Jaiswal et al. (2016) | Pérez-Rosas et al. (2015) | OpenFace | SVM | ACC = 67.20% | ACC = 78.95% | – |
| Su and Levine (2016) | 324 high-stakes deception videos from YouTube | PittPatt | RF | ACC = 76.92% | – | – |
| Gogate et al. (2018) | Pérez-Rosas et al. (2015) | – | C3D | ACC = 78.57% | ACC = 96.42% | – |
| Wu et al. (2018) | Pérez-Rosas et al. (2015) | IDT | Linear SVM | AUC = 77.31% | AUC = 87.73% | AUC lower than 60% (silent video) |
| | | | RF (on automatically extracted micro-expression features) | AUC = 80.64% | AUC = 84.77% | AUC = 81.02% (full video) |
| Krishnamurthy et al. (2018) | Pérez-Rosas et al. (2015) | C3D | MLP | ACC = 93.08%; AUC = 95.96% | ACC = 90.99%; AUC = 93.48% | – |
| Avola et al. (2019) | Pérez-Rosas et al. (2015) | MUMIN scheme, OpenFace | RBF-SVM | ACC = 76.84% | – | – |
| | | | SVM (sigmoid) | ACC = 71.99% | | |
| Rill-Garcia et al. (2019) | Pérez-Rosas et al. (2015) | OpenFace | LSTM | AUC = 56.0% | – | – |
| | | | Linear SVM | AUC = 57.40% | AUC = 67.5% | |
| | 42 videos of people giving a position towards abortion | OpenFace | LSTM | AUC = 38.40% | – | – |
| | | | Linear SVM | AUC ≥ 60% | AUC = 62.5% | |
| Ding et al. (2019) | Pérez-Rosas et al. (2015) | R–CNN | FFCSN | ACC = 93.16%; AUC = 96.71% | ACC = 97.00%; AUC = 99.78% | – |

^a ACC = accuracy.
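The studies above report two different metrics: accuracy (ACC), the fraction of correct hard decisions at a fixed threshold, and the area under the ROC curve (AUC), which scores how well the classifier's continuous outputs rank deceptive above truthful videos. A minimal illustration of the difference, using scikit-learn on toy labels and scores (not data from any of the cited studies):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy ground truth (1 = deceptive) and classifier scores for six videos.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

# Accuracy requires thresholding the scores (here at 0.5); AUC does not.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]
acc = accuracy_score(y_true, y_pred)   # 4 of 6 hard decisions correct
auc = roc_auc_score(y_true, y_score)   # 8 of 9 pos/neg pairs ranked correctly
print(acc, auc)
```

Because AUC is threshold-free, it is often the more informative summary when a classifier outputs graded scores, which is why several of the studies in Table 1 report it alongside or instead of ACC.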


Table 1 reports an overview of the state of the art.

1.2. Aim of the study

The first aim of this study is to test the ability of ML models to distinguish liars from truth-tellers, based on facial micro-expressions, in a new data set of videos collected in low-stakes situations. For this reason, here we analyse a data set of videos collected in the laboratory through a between-subject experimental design (Monaro et al., 2020a). As mentioned above, despite the data set choice being a possible limitation in terms of the study's ecology, it ensures a solid ground truth on which to train the algorithms. In this study, we compare the performance of different feature extraction methods (i.e., improved dense trajectories, automatically extracted AUs) and different ML techniques (i.e., SVMs vs. deep neural networks). Then, results are discussed and compared with the performance obtained by the methods in the previous literature. We expect machine learning methods to perform significantly better than chance in this task. Among the various ML models, we expect the ones exploiting high-level (i.e., abstract) features to be the best performing ones. Such features can be obtained in two ways: (i) extracting meaningful high-level features using other machine learning models, such as the AUs predicted using previously trained models; and (ii) learning features suited for the task directly from the raw data thanks to the end-to-end training fashion typical of deep learning, in which the neural network learns to extract high-level features relevant for the considered task directly from the raw data during training.

The second goal of this study is to investigate the ability of a sample of naïve judges to identify liars versus truth-tellers in the same data set of low-stakes situation videos. Moreover, we investigated the presence of differences in the ability to detect deception according to gender and educational level, the presence of practice effects, and the elements on which naïve judges based their decisions. Previous works revealed that men are less accurate, as well as less sensitive, than women in processing emotional faces and labelling facial expressions (Montagne et al., 2005). Hill and Craig (2004) examined individual differences in the ability to discriminate genuine and deceptive pain from facial expressions, finding significant individual differences in accuracy, with females more accurate than males (Hill & Craig, 2004). The authors assumed that men and women differ in the use of intuitive vs. analytical cue-based decision-making strategies. A few studies also investigated the link between education and emotion recognition, reporting that people with a higher education level perform better on emotion recognition tasks, probably thanks to an educational environment that exposed them to a larger number of social interactions (Trauffer et al., 2013). Considering these results, we hypothesize that the greater ability of women and highly educated people in processing facial expressions may also play a role in the ability to detect fakers through facial expression analysis.

Human performance is then compared with the ML performance and discussed considering the results obtained by human judges in previous studies. In line with the previous literature, we expect (i) humans to perform near the chance level and (ii) ML models to reach an accuracy higher than humans in detecting deception.

Finally, for the first time in the literature, we introduce the automated analysis of facial micro-expressions to detect deception during interviews in which a technique to increase liars' cognitive load is applied. Specifically, the data set of videos we analyse contains free speech and responses to unexpected questions (Hartwig and Bond, 2007; Monaro et al., 2020a). This type of question does not allow liars to prepare themselves or anticipate responses to foreseeable requests (Monaro et al., 2020b). Indeed, planning makes lying easier, and planned lies typically contain fewer cues to deceit compared to spontaneous lies (DePaulo et al., 2003). Previous studies successfully applied the technique of asking unexpected questions to detect fake versus genuine identities by analysing the examinees' behavioural responses, such as reaction times, mouse dynamics, and keystroke dynamics (Monaro et al., 2017a, b, 2019, 2020b). Results demonstrated that liars took more time to respond and showed different behavioural patterns compared to truth-tellers, especially to unexpected questions. For this reason, here we expect ML algorithms analysing facial micro-expressions to be more efficient in detecting liars when they respond to unexpected questions than during free speech.

2. Method

In this section, the data set, experimental procedure, and data analysis methodology are described. The ethics committee for psychological research of the University of Padova, in accordance with the Declaration of Helsinki, approved the experimental procedure.

2.1. Data set

In this study, we analysed the data set previously collected by Monaro et al. (2020a), which consists of 62 videos of Italian volunteer participants (43 females and 19 males, aged between 20 and 29 years) who were interviewed about a past holiday. Thirty-two participants were assigned to the "truth-teller" condition (i.e., they were asked to tell about a real holiday that happened in the last 12–18 months), and the other 30 were assigned to the "liar" condition (i.e., they were asked to report a fake holiday, or a holiday they had not experienced). This is not the first research in the literature using past holidays as a target event for lying (Sartori et al., 2008). Indeed, recalling a holiday involves the same cognitive processes as telling an alibi during a criminal investigation. Thus, recalling a past holiday is like telling a false alibi, but in a low-risk situation.

To reduce the possibility that truth-tellers produced biased memories – due to the time elapsed between the authentic holiday experience and the interview – participants were asked to recall some information about their holiday by filling in a form before the interview; they were encouraged to omit the details that they were not able to remember, and they were advised to aid their memory (e.g., with photographs and videos) to be sure that what they remembered was correct. Similarly, to limit the risk that liars inserted true details into their story, they were provided with a pre-compiled form with some pre-established information about the faked holiday that they had to pretend they had lived.

Each video consisted of three interview phases:

1) Baseline – the interviewee stated his/her autobiographical data;
2) free speech – a free telling phase, in which the holiday was freely recalled for approximately 2 min; and
3) unexpected questions – with the aim of increasing cognitive load, the interviewee was asked unexpected questions about the holiday. The unexpected questions were equal in amount and content for truth-tellers and liars.

The average length of the videos was 9.56 min. A more detailed description of the data set is reported in Monaro et al. (2020a). It should be noted that the same experimental procedure was also validated by Curci et al. (2019).

2.2. Machine learning methods

To detect deception automatically through facial micro-expressions, we decided to split each video into two parts: the part containing free speech and the one in which subjects answer the interviewer's questions. We created the "Free Speech" data set from the first part of the videos and the "Questions" data set from the second part.

We then processed both data sets using machine learning algorithms. We defined the task in such a way that it was the same for ML algorithms and human judges, namely a binary classification task over the collected videos. We asked the ML algorithms to predict whether the person in a video was a liar or a truth-teller. In this study, we explored several machine learning pipelines, one based on a standard machine learning approach and three exploiting deep learning techniques, which are a type of machine learning algorithm that has been successfully applied to several perception tasks in the last few years. Notice that the different approaches consider different representations of the input video.

The standard machine learning approach consists in computing an a-priori defined set of features that are descriptive of the input data, and then exploiting the resulting feature vector as a representation of the input examples, to be fed to ML algorithms.

In the first approach, we consider common algorithmic feature extraction techniques for videos (more details follow) feeding a linear SVM classifier, that is, a maximum-margin linear classifier with generalization guarantees rooted in statistical learning theory (Vapnik, 1999). We chose the SVM because it is a simple yet effective classifier that has proven very effective, even in the small-data setting, in many applications.

The second approach uses the same classifier, but applied to high-level features (AUs) extracted in turn by a machine learning based tool (OpenFace).

The first two approaches require defining a single feature vector that aggregates information from the whole input video.

The third approach uses the same OpenFace AU features but replaces the SVM with a more expressive (i.e., non-linear) Long Short-Term Memory (LSTM) network classifier, that is, a neural classifier suited for the classification of sequences of feature vectors. In this case, the input video is represented as a sequence of feature vectors (one for each frame).

Finally, we considered a fully neural approach based on 3D convolutional neural networks (C3D), in which the network receives as input the raw video frames, and the features representing the video are learned automatically by the neural network during training. In more detail, a neural network is organised in layers, each of which receives its input from the previous one and supplies its output to the next. Let us consider a neural network of n layers. The first n-1 layers can be seen as feature extractors that encode increasingly complex features in deeper layers. The representation of the (n − 1)-th layer can then be considered the set of features the network learned to extract from the input. The last layer then performs a logistic regression on such a representation (in the case of binary classification). Notice that the extracted features are specifically tailored to the considered task because they are learned via backpropagation and stochastic gradient descent (that is, the algorithm used to train neural networks) together with the last layer. In this case, no pre-defined feature extraction is needed, as the input to the C3D is the raw input (i.e., the sequence of frames represented as images).

For all the machine learning models, we adopted 10-fold cross validation over the subjects to estimate their predictive performances. In this way, we ensured that all the videos from the same subject were either in the training, validation, or test set. In our experimental results, we report the average AUC on the 10-fold tests. We decided to focus on the AUC

motion of objects in an image), such as the Histogram of Optical Flow (HOF) (Perš et al., 2010), which computes a quantization of them, or Motion Boundary Histograms (MBH) (Wang et al., 2013), which separate the horizontal and vertical components of the optical flow and consider their derivatives.

SVM (OpenFace AUs). We exploit expression features as a high-level video representation. We estimate such features using OpenFace (Baltrusaitis et al., 2018), an open source tool for face recognition and feature extraction from images, sequences of images, and videos based on machine learning. This method identifies the presence of different AUs, specifically AU1, AU2, AU4, AU5, AU6, AU7, AU9, AU10, AU12, AU14, AU15, AU17, AU20, AU23, AU25, AU26, AU28, and AU45. The tool predicts the intensity of each AU in each frame of the video. While the method implemented for estimating facial expressions is itself based on machine learning (Baltrusaitis et al., 2015), exploiting HOG and face landmark features (also detected by OpenFace using deep neural networks), it has been shown to perform consistently well on several datasets, making it one of the state-of-the-art approaches for such a task. For each action unit, the average frequency and intensity over the video clip have been calculated and used as features for the whole video.

LSTM (OpenFace AUs). We decided to test the same features extracted with OpenFace with the LSTM network (Hochreiter & Schmidhuber, 1997). LSTM is a representative of the class of recurrent neural networks, which are designed for the classification and elaboration of sequential data, such as the sequence of frames in a video. With LSTM, there is no need to compute an aggregation of the AUs because it can directly deal with the sequence of AU activations. In fact, this network bases its predictions on all frames of the video clips and their reciprocal positions. In particular, the LSTM network takes as input the sequence of AUs extracted by OpenFace for each frame. As the training of this kind of neural network requires a long time, the model has been validated only on the Questions data set, because we assume that it contains more evident cues of deception than Free Speech. Thus, we used the same network architecture for the Free Speech data set.

C3D (end-to-end features). C3D can process space–time information, and it was created to improve the identification of moving images and 3D images. This network allows obtaining features from video clips without the need to perform feature extraction from the data. It automatically extracts features from image sequences or video frames by considering the two spatial dimensions in each frame and the temporal dimension (i.e., working in three dimensions). In this case, although the C3D can extract features from videos, we preprocessed the data with the OpenFace tool in such a way that the videos were cut into frame sequences in which only faces were present (i.e., instead of considering a whole frame, we considered just the bounding box of the face). Each image in the sequences has been scaled to a fixed dimension of 112 × 112. Thus, each video is represented by a 3D matrix. The length of the videos in our data set made it impossible to consider all the frames from a video at the same time due to GPU memory constraints.

For this reason, we decided to split each video in subsets of 16 frames, and to produce the prediction for the whole video by aggre-
measure since it is the most popular in the related works, and it is widely gating the various predictions of the 16-frame blocks (in a number that
adopted in machine learning (Ling et al., 2003). may vary among videos but was averaged over the various predictions).
More details about each ML method follow. This choice of using the average prediction can be interpreted as
SVM (handcrafted features). Regarding the feature extraction for considering the average intensity of deception clues in the whole video.
SVM with the handcrafted feature classifier, following the work of Wu Similar to the LSTM network, we performed the validation process to
et al. (2018), we employed Improved Dense Trajectories (IDT) features build the architecture of the C3D network in the Questions data set.
(Wang & Schmid, 2013). These features have been shown to perform
well in action recognition. In a nutshell, IDT densely samples points in a 2.3. Human judgments data collection
frame and tracks them in the following ones, obtaining their respective
trajectories (after some corrections, e.g. correcting for camera motion). To test the human ability to detect deception through facial micro-
For each trajectory, several descriptors can be computed considering a expressions, we created 12 questionnaires in Google Forms. The video
region of neighbouring pixels, such as the Histogram of Oriented Gra­ clips of the data set were randomly distributed between the 12 ques­
dients (HOG) of each frame, that captures static information, and de­ tionnaires as follows: 10 questionnaires included five videos (five
scriptors based on the optical flow (Farnebäck, 2003) (i.e., the apparent questionnaires with three videos of liars and two of truth-tellers and five
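As a concrete illustration of the evaluation protocol described above, the following sketch runs a subject-grouped 10-fold cross-validation of a linear SVM on per-video AU features and reports the mean AUC. It is a minimal, hypothetical example: the AU matrices are randomly generated stand-ins for OpenFace output, and the per-video aggregation (mean and standard deviation over frames) is one simple choice, not necessarily the one used in the study.

```python
# Hypothetical sketch: subject-grouped 10-fold cross-validation of an SVM on
# per-video Action Unit (AU) features, scored with AUC. All data are synthetic
# stand-ins for OpenFace output; nothing here is the study's actual pipeline.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

n_videos, n_frames, n_aus = 60, 200, 17      # 17 AU intensity channels, as in OpenFace
subjects = np.repeat(np.arange(30), 2)       # two videos per subject
labels = np.tile([0, 1], 30)                 # 1 = liar, 0 = truth-teller (balanced)

# One AU-intensity matrix (frames x AUs) per video; aggregate over frames so
# that every video becomes a fixed-length feature vector, as the SVM requires.
videos = rng.random((n_videos, n_frames, n_aus))
X = np.concatenate([videos.mean(axis=1), videos.std(axis=1)], axis=1)

aucs = []
for train, test in GroupKFold(n_splits=10).split(X, labels, groups=subjects):
    # Grouping by subject keeps all videos of one person in the same fold,
    # so no subject appears in both training and test data.
    clf = SVC(kernel="linear").fit(X[train], labels[train])
    scores = clf.decision_function(X[test])
    aucs.append(roc_auc_score(labels[test], scores))

print(f"mean AUC over folds: {np.mean(aucs):.2f}")
```

With random features the mean AUC hovers around chance; the point of the sketch is the grouped split and the per-fold AUC averaging, not the score itself.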


2.3. Human judgments data collection

To test the human ability to detect deception through facial micro-expressions, we created 12 questionnaires in Google Forms. The video clips of the data set were randomly distributed between the 12 questionnaires as follows: 10 questionnaires included five videos (five questionnaires with three videos of liars and two of truth-tellers and five questionnaires with two videos of liars and three videos of truth-tellers), and two questionnaires included six videos (one with two videos of liars and four videos of truth-tellers and one with three videos of liars and three videos of truth-tellers).

One hundred and twenty Italian-speaking participants took part in the study (60 males and 60 females). The average age was 23.12 (SD = 2.56; range 19-42 years), and the average educational level was 15.26 years (SD = 1.45). Participants were volunteers recruited among the students of the University of Padova. The experiment took place in the laboratories of the Department of General Psychology. All participants signed the informed consent and provided their demographic information (age, gender, and educational level) before starting the questionnaire. Then, we randomly gave each participant one of the 12 questionnaires. We instructed the naïve judges to watch the videos fully and carefully one by one. After each video, we asked them to guess if the person interviewed about the holiday was a liar or a truth-teller. Finally, we asked them to indicate the elements that supported their decisions (open-ended question); to point out which interview phase (free speech or unexpected questions) was the most useful to the decision; and to indicate on which among the listed elements (multiple-choice question) they based their judgment: tone of voice, eye contact, kind of words, facial expressions, colour of cheeks, speed of speech, lip corners, number of hesitations, presence/absence of agitation, presence/absence of smiles, head position, number of repetitions, amount of information, number of reported details, hands movements, and readiness of response.

3. Results

3.1. Machine learning models performance

In this section, we analyse the performance of the considered machine learning models (presented in Section 2.2) on the data sets presented in Section 2.1.

SVM on handcrafted features results. Table 2 reports the results (in 10-fold cross-validation) of the SVM coupled with different feature extraction methods. The entry "All" in the table means the concatenation of all features extracted with IDT (MBH, HOG, and HOF). We considered the same classifiers as Wu et al. (2018). We found that, concerning the single features, the MBH features obtain the highest AUC for both data sets, giving values comparable to the literature studies that employed this method. Combining the features provides good results only on the Questions data set.

Table 2
Results of the experiment with the SVM classifier with handcrafted features.

Data set     Features    AUC
FreeSpeech   MBH         0.62 ± 0.01
             HOG         0.53 ± 0.07
             HOF         0.56 ± 0.04
             All         0.56 ± 0.07
Questions    MBH         0.65 ± 0.01
             HOG         0.58 ± 0.01
             HOF         0.56 ± 0.01
             All         0.69 ± 0.01

SVM with OpenFace AU results. When considering the same SVM method trained on the AUs extracted by OpenFace, we can observe a huge leap in the obtained predictive performance. In fact, using OpenFace AUs instead of handcrafted features results in an improvement of 0.1 (out of one) in the FreeSpeech data set and 0.09 in the Questions data set (see the second column of Table 3). Thus, AU features extracted with deep learning techniques seem to be the most discriminative features for both data sets.

Table 3
AUC results of the experiment with deep learning techniques.

Data set     OpenFace + SVM   OpenFace + LSTM   C3D
FreeSpeech   0.72 ± 0.02      0.57 ± 0.03       0.64 ± 0.03
Questions    0.78 ± 0.03      0.72 ± 0.02       0.75 ± 0.03

LSTM results. The third column in Table 3 shows the results obtained with the LSTM network. In principle, we expect this approach to outperform the SVM + OpenFace one because, in this case, we do not need to aggregate the features from different frames, which may result in a loss of information. In fact, this approach results in a good predictive performance on the Questions data set. In contrast, for the FreeSpeech data set, this network seems not to be as discriminative. These results are probably due to the small size of the data set compared to the ones usually adopted to train such networks.

C3D results. The last column in Table 3 reports the results of the C3D neural network. The results show that the C3D network obtains good results in the considered lie detection task, particularly in the Questions data set (also in this case, the performance with the FreeSpeech data set is lower). Compared to LSTM, C3D performs consistently better in both data sets.

3.2. Human performance

When the entire video data set was considered, results showed that human judges obtained an average accuracy and AUC slightly above the chance level (accuracy = 57.92% ± 20.08, AUC = 0.57 ± 0.07) in identifying liars and truth-tellers (see Table 4). A one-sample t-test revealed that the human performance statistically differs from the chance level (t(119) = 4.30, p < 0.001, d = 0.39, 95% CI for Cohen's d [0.21, 0.58]).

Table 4
Human judges' performances in the classification (liar vs. truth-teller) task. Rows report the actual condition; columns report the judged category.

                 Judged liar       Judged truth-teller
Liars            48.75% ± 9.08     51.25% ± 9.08
Truth-tellers    34.86% ± 13.43    65.14% ± 13.43

Analysing the human performance separately for the videos belonging to the two experimental conditions (truth-tellers vs. liars), participants reached a better performance in detecting truth-tellers than liars (sensitivity = 0.49, specificity = 0.65; see Table 4).

The performance seemed not to change based on gender and educational level (Table 5).

Considering that the task required judges to watch five or six videotapes, we evaluated the possibility of performance improvement due to practice. Fig. 1 compares the average accuracy obtained by participants by taking into consideration the order of presentation of the video clips (we excluded the sixth video, watched in only two questionnaires). A linear trend analysis confirms that the accuracy of human judges does not improve during the task (F = 3.08, p = 0.113, ηp² = 0.255). Overall, the accuracy obtained in the last video dropped by 11.66% compared with the performance in the first one, although this trend is not statistically significant.

Regarding the elements that participants declared useful to support their decisions in distinguishing liars from truth-tellers, the number of details of the holiday (frequency = 55.00%), the eye gaze (frequency = 33.33%), the facial expressions (frequency = 25.83%), the gestures and movements (frequency = 25.00%), and the emotions exhibited by the interviewee during the narration (frequency = 23.33%) were the most frequently reported. Supplementary Material (Table S1) contains all the elements that emerged from the open-ended questions and their relative frequencies.

In the multiple-choice questions about the video elements they watched to make their decisions, the most frequently reported element was the number of details of the holiday (frequency = 81.67%), followed by the facial expressions (frequency = 70.00%), the amount of information (frequency = 66.67%), the readiness of response (frequency = 66.67%), and the eye contact (frequency = 63.33%). Supplementary Material (Table S2) contains a complete list of the assessed elements and their frequencies.
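The chance-level comparison reported at the beginning of this section (a one-sample t-test on the judges' accuracies against 50%) can be sketched as follows. The accuracy values are simulated under an assumed 58% hit rate for illustration only; they are not the study's data.

```python
# Hypothetical sketch of the Section 3.2 chance-level analysis: each judge's
# accuracy over the videos they rated is tested against 50% with a one-sample
# t-test. The judge responses below are simulated, not the real data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_judges, n_videos = 120, 5
correct = rng.binomial(n=n_videos, p=0.58, size=n_judges)  # assumed 58% hit rate
accuracy = correct / n_videos

t_stat, p_value = stats.ttest_1samp(accuracy, popmean=0.5)
# Cohen's d for a one-sample design: mean difference divided by the sample SD.
cohens_d = (accuracy.mean() - 0.5) / accuracy.std(ddof=1)

print(f"mean accuracy = {accuracy.mean():.3f}, "
      f"t({n_judges - 1}) = {t_stat:.2f}, p = {p_value:.4f}, d = {cohens_d:.2f}")
```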


Table 5
Accuracy of human judges based on gender and educational level. The difference between males and females was tested running an independent samples t-test, and the difference between the three educational level groups was tested with a one-way independent ANOVA.

                                      Accuracy (M ± SD)   Groups' difference
Gender        Males                   55.28% ± 21.62      t(118) = 1.44, p = 0.152, Cohen's d = 0.263
              Females                 60.56% ± 18.02
Educational   High school             59.60% ± 19.91      F(2,117) = 0.156, p = 0.856, η² = 0.003
level         Bachelor's degree       57.28% ± 20.35
              Master's degree         57.33% ± 15.55

Fig. 1. Average accuracy of human judges based on the presentation order of the video clips.

Finally, we asked participants to indicate which phase of the interview (free speech vs. unexpected questions) was more helpful for their decisions. Almost half of the participants stated they found both interview phases to be useful. Fig. 2 shows the results.

Fig. 2. Distribution of the responses to the question, "Which part of the video clip was more useful for you to decide if the person was saying the truth or was lying?"

3.3. Human judges versus artificial intelligence

To compare the performance of naïve judges with the performance of artificial intelligence, we adopted the receiver-operating-characteristic (ROC) analysis. The ROC analysis uses the AUC as a metric to express accuracy in the discrimination of the genuine from the deceptive in videotaped interviews. Table 6 reports the results of humans compared with the ones obtained by machine learning techniques.

Table 6
Comparison between artificial intelligence and human performance in terms of AUC.

                      Data set         AUC
SVM IDT features      FreeSpeech       0.62 ± 0.01
                      Questions        0.69 ± 0.01
SVM OpenFace AUs      FreeSpeech       0.72 ± 0.02
                      Questions        0.78 ± 0.03
LSTM OpenFace AUs     FreeSpeech       0.57 ± 0.03
                      Questions        0.72 ± 0.02
C3D                   FreeSpeech       0.64 ± 0.03
                      Questions        0.75 ± 0.03
Human judges          Whole data set   0.57 ± 0.07

4. Discussion and conclusion

In the current study, we assessed the ability of naïve judges versus the ability of machine learning techniques to discriminate between truthful and deceitful information by watching a videotaped interview. We also aimed to identify the criteria human judges adopted to make their decision; to compare the performance of different artificial intelligence approaches; and to understand which phase of the videotaped interview (free speech or unexpected questions) makes the detection of deception easier. For these purposes, we tested machine learning models and human judges in the same lie detection task.

4.1. Artificial intelligence performance

The automatic classification with artificial intelligence algorithms has shown important differences between the methods that exploited handcrafted features (SVM with IDT features) and the approaches using deep learning-based features. Among the latter kind of models, SVM coupled with OpenFace AU features was the best performing method, probably because, in this family of methods, it is the least data-hungry. These methods mostly showed results coherent with the literature in this field. Specifically, the SVM classifier trained with the MBH features resulted in an AUC comparable (although lower) to the results found by Wu et al. (2018) (AUC = 0.62 for the FreeSpeech data set and AUC = 0.65 for the Questions data set), who used the same classifier in their study. Instead, extracting features through OpenFace, as in Jaiswal et al. (2016) and Rill-Garcia et al. (2019), this model performed better, with a higher AUC for the Questions data set (0.78) than with the FreeSpeech data set (0.72).

Training an LSTM network with OpenFace features, we obtained an AUC of 0.72 for the Questions data set, far above the AUC obtained by Rill-Garcia et al. (2019).

Testing the two data sets with a C3D network, the obtained performance was similar to the same classifier trained in Gogate et al. (2018). In this case, we also achieved better results with the Questions data set (AUC = 0.75) than with the FreeSpeech data set (AUC = 0.64).

Comparing the performance of the four classifiers trained in this study, each of them shows higher results in the Questions data set. This finding is consistent with the cognitive approach to lie detection, which argues that, during a face-to-face interview, lying involves more cognitive resources than telling the truth (Vrij, 2008). Indeed, the second phase of the interview consisted of unexpected questions about the holiday, which were aimed at increasing the cognitive load of the interviewee: liars needed more cognitive effort to give a response that was consistent with the rest of the narrative. As a consequence, liars showed more deception cues, making the lie detection task easier (Vrij et al., 2008). This mechanism explains the better performance of the automatic classifiers in the Questions data set. As in previous studies (Monaro et al., 2019, 2020b), this result shows that, by increasing the cognitive load with the unexpected question technique, we have a greater chance of spotting liars.
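The block-wise aggregation through which the C3D model produces a video-level prediction (Section 2.2) reduces to a simple computation. In this sketch the per-block scorer is a hypothetical stand-in (a plain mean over the block) rather than the trained network; only the 16-frame splitting and the averaging of block scores mirror the procedure described in the text.

```python
# Minimal sketch of the C3D video-level aggregation: split a video into
# 16-frame blocks, score each block, and average the block scores. The
# scorer here is a stand-in, NOT a real C3D forward pass.
import numpy as np

def block_scores(video: np.ndarray, block_len: int = 16) -> np.ndarray:
    """Score every complete 16-frame block of a (frames, h, w) video."""
    n_blocks = video.shape[0] // block_len   # drop the trailing partial block
    blocks = video[: n_blocks * block_len].reshape(
        n_blocks, block_len, *video.shape[1:]
    )
    return blocks.mean(axis=(1, 2, 3))       # stand-in for the network's score

def video_prediction(video: np.ndarray) -> float:
    # Averaging block scores: the "average intensity of deception cues"
    # interpretation given in Section 2.2.
    return float(block_scores(video).mean())

rng = np.random.default_rng(1)
video = rng.random((100, 112, 112))          # 100 frames of 112 x 112 face crops
print(f"{len(block_scores(video))} blocks, score = {video_prediction(video):.3f}")
```

The number of blocks varies with the video length (here 100 frames yield six complete blocks), while the averaged score stays comparable across videos.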


4.2. Human judges' performance

A sample of human judges obtained an accuracy of 57% in the same lie detection task performed by the machine learning models. This result is coherent with previous studies, which showed that the accuracy of humans in a lie detection task is around or slightly above the chance level (Bond & DePaulo, 2006; Curci et al., 2019; DePaulo et al., 2003; Pérez-Rosas et al., 2015; Porter & Brinke, 2008). Moreover, humans revealed a better performance in detecting videos of truth-tellers than liars, as already reported in the literature (Bond & DePaulo, 2006). This finding might be because of the lie aversion phenomenon (López-Pérez & Spiegelman, 2013), which consists in taking truth-telling for granted in ordinary situations (as in a holiday narration) and in thinking that others also behave in this way.

Observing the judges' performances according to the presentation order of the video clips, we found that the accuracy (from the first to the last video) decreased by approximately 11.66%. This finding can be attributed to a sort of "fatigue effect", in which participants drop their attention during the task, because the average length of each video was approximately 410 s.

Concerning the two phases of the interview, almost half of the participants said that, during the task, they focused their attention on the entire video clip. The other half was divided almost equally between the free speech phase and the unexpected question phase.

Among the criteria considered by judges as the basis for their classifications, they mostly reported the number of details in the holiday narration, the eye contact, and the facial expressions of the interviewee. Specifically, judges said that interviewees' spontaneous facial expressions when answering open-ended questions were a criterion for their decisions (33.33%). When the same question was asked with a multiple choice, this percentage increased to 70%. Other important indicators for the decision included emotions (emphasis, agitation, tension, embarrassment, etc.) and the modality of narration (confidence, spontaneity, number of repetitions, readiness of response, etc.). Generally, participants seemed to base their decisions on the content of the narration and non-verbal language. Finally, we found no substantial accuracy differences among people with different genders and education levels.

4.3. Human judges versus artificial intelligence

Consistent with previous studies, the performance comparison between machine learning techniques and human judges shows that artificial intelligence performs better than humans in the lie detection task, even when considering visual information only (Bartlett et al., 2014; Pérez-Rosas et al., 2015; Wu et al., 2018). Thus, a machine learning model trained adequately with visual features may be able to reach a high level of performance, which overcomes the human performance.

The difference between human and machine performance could be explained in terms of the limited resources of the cognitive system, which do not allow humans to catch a quite high amount of information in a short time (ten Brinke et al., 2012; Vrij, 2008). Catching facial micro-expressions is a highly demanding task for humans, as their duration is very short. Conversely, automatic classifiers can process a large amount of information very fast, with a precision a human cannot reach. In other words, the main difference between artificial intelligence and humans is that algorithms, having greater computational power, can be trained by receiving inputs from all the video frames.

It should be noted that, although the video clips processed by machine learning models and observed by naïve judges are the same, to submit the data set to the analysis of the automatic classifiers, the original data set was divided into two: one consisting of the first phase of the interview (FreeSpeech) and the other containing the second phase (Questions). Instead, human judges watched the whole interview and based their judgments on both phases (free speech and unexpected questions). Moreover, while machine learning models processed data for the visual modality only, naïve judges observed the video clip with the audio. It follows that human judges performed a multimodal analysis, whereas the machine learning models analysed data only for the visual modality. Despite the video clips watched by human judges containing more deceit indexes (both free speech and unexpected questions, and both video and audio), which gave them more information to make their decisions, machines performed better than humans did.

4.4. Limitations and future perspectives

The current study suffers from some limitations. First, the amount of data analysed by the automatic classifiers is limited. Second, the data set we analysed was created in the laboratory. On the one hand, this has a series of advantages (i.e., the ground truth is solid, and all video clips have the same quality and camera position to facilitate the automatic processing of data); on the other hand, the laboratory context did not allow us to obtain the same high-stakes lies found in real situations (e.g., a trial). However, it is reasonable to think that, if promising results are obtained with video clips recorded in an experimental setting, the automatic systems of lie detection could increase their performances in a real setting, where emotions are stronger in intensity and, thus, deception cues are more visible (Ekman, 2006). Moreover, in a meta-analysis, Hartwig and Bond (2014) observed that, independently from the setting of the study, the lie detection ability of a system is quite stable.

A third limitation regards the unfair comparison between trained machines and naïve human judges: algorithms were extensively trained, but human participants were not. Although previous studies already confirmed that human judges, even expert ones, achieve performances that are close to chance in identifying liars from the analysis of facial expressions (Bond & DePaulo, 2006; Curci et al., 2019; DePaulo et al., 2003; Pérez-Rosas et al., 2015; Porter & Brinke, 2008), a future work should compare the performance of trained and naïve human judges. This would allow assessing the hypothesis that the superiority of machine learning models is due to limitations of the human cognitive system, which is unable to catch facial expressions of such short duration, rather than the simple fact that algorithms are trained and human participants are not.

It is also worth noting that each human judge watched just five or six videos. Five trials are usually few for a psychological experiment, so the results should be carefully interpreted. However, it should also be considered that each video lasted about 9.5 min; namely, each subject looked at more than 45 min of videos. This is a very long time in terms of sustained attention. Presenting a greater number of videos would not have produced reliable results, as there would have been an inevitable loss of attention and a tiredness effect.

Another variable that should be analysed in future studies regards the personality and the morality of the subjects: people with a higher level of morality can feel intrinsically guilty when they are forced to tell a lie, even if they are in a low-stakes situation. As a consequence, they react as in a high-stakes setting, introducing a confounding variable into the research.

Other limits regard the analysed data set: interviewees belong to a limited age range. Because it has been shown that the ability of lie detection through visual cues is attenuated in the elderly compared with young people (Stanley & Blanchard-Fields, 2008), it would be interesting to extend the study to include videos of older people. Moreover, although Monaro et al. (2020a) took some precautions to avoid truth-tellers producing biased memories (asking participants to recall the holiday before the interview, omitting uncertain information, and helping memory with photos and videos), a certain risk of bias still exists. There is also no way to rule out with absolute certainty that there is not a certain degree of truth in the lies (or vice versa, i.e., false memories). However, the laboratory situation allows, at least partially, to control this problem. In the study of Monaro et al. (2020a), participants assigned to the "liar" condition were provided with a partially pre-compiled form with some information about a holiday that (it was checked) they had never experienced before, which they were asked to


learn. This guaranteed that the information that liars had to report during the interview was fake, obtaining a ground truth more solid than what can be obtained from authentic high-stakes situations (e.g., criminal trials). Future studies should take care of controlling this bias, including the phenomenon of false memories, as strongly as possible.

It should be mentioned that a limitation concerning the machine learning approach regards the explainability of the machine learning models. We don't know exactly which features are relevant for the task, especially in terms of duration of the facial expressions. In other words, we don't know if our classification models are actually based on facial micro- or macro-expressions. This is difficult to define also because there is no standard definition of micro vs. macro expression (for example, Su & Levine, 2016, define micro-expressions as facial actions that are shorter than 1/5th of a second, and macro-expressions as facial expressions that last more than that; seminal works by Paul Ekman mention generally "fractions of seconds", thus including both micro- and macro-expressions according to the definition of Su and Levine). Future works should focus on the explainability of the machine learning models to obtain reliable hints on which specific features are useful or not to detect deception from facial cues.

Finally, according to previous literature, it seems that the multimodal approach can lead to better results in identifying lies compared to the visual approach alone. Thus, as future work, we plan to include the audio features and the text transcripts in the machine learning analysis. Moreover, we plan to explore transfer learning techniques to exploit other lie detection data sets in the literature in the training of our machine learning models. With these extensions, we expect to improve the predictive performance of artificial intelligence further.

Author contributions

Nicolò Navarin: Conceptualization, Data curation, Formal analysis, Software, Writing – original draft preparation. Merylin Monaro: Conceptualization, Methodology, Data curation, Writing – original draft preparation. Stéphanie Maldera: Investigation, Data curation, Writing – original draft preparation. Cristina Scarpazza: Supervision, Funding. Giuseppe Sartori: Conceptualization, Supervision.

Funding

The present study was granted by the University of Padova (Supporting TAlent in ReSearch @ University of Padua - STARS Grants 2017). The present study was carried out within the scope of the research program "Dipartimenti di Eccellenza" (art. 1, commi 314-337, legge 232/2016), which was supported by a grant from MIUR to the Department of General Psychology, University of Padua.

Declarations of competing interest

None.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.chb.2021.107063.

References

Allwood, J., Cerrato, L., Jokinen, K., Navarretta, C., & Paggio, P. (2007). The MUMIN coding scheme for the annotation of feedback, turn management and sequencing phenomena. Language Resources and Evaluation. https://doi.org/10.1007/s10579-007-9061-5
Avola, D., Foresti, G. L., Cinque, L., & Pannone, D. (2019). Automatic deception detection in RGB videos using facial action units. In ACM international conference proceeding series. https://doi.org/10.1145/3349801.3349806
Baltrusaitis, T., Mahmoud, M., & Robinson, P. (2015). Cross-dataset learning and person-specific normalisation for automatic action unit detection. In 2015 11th IEEE international conference and workshops on automatic face and gesture recognition (FG) (pp. 1-6). https://doi.org/10.1109/FG.2015.7284869
Baltrusaitis, T., Zadeh, A., Lim, Y. C., & Morency, L. P. (2018). OpenFace 2.0: Facial behavior analysis toolkit. In Proceedings - 13th IEEE international conference on automatic face and gesture recognition, FG 2018 (pp. 59-66). https://doi.org/10.1109/FG.2018.00019
Bartlett, M. S., Littlewort, G. C., Frank, M. G., & Lee, K. (2014). Automatic decoding of facial movements reveals deceptive pain expressions. Current Biology. https://doi.org/10.1016/j.cub.2014.02.009
Bond, C. F., & DePaulo, B. M. (2006). Accuracy of deception judgments. Personality and Social Psychology Review, 10(3), 214-234. https://doi.org/10.1207/s15327957pspr1003_2
Bond, C. F., & Fahey, W. E. (1987). False suspicion and the misperception of deceit. British Journal of Social Psychology, 26(1), 41-46. https://doi.org/10.1111/j.2044-8309.1987.tb00759.x
ten Brinke, L., Porter, S., & Baker, A. (2012). Darwin the detective: Observable facial muscle contractions reveal emotional high-stakes lies. Evolution and Human Behavior, 33(4), 411-416. https://doi.org/10.1016/j.evolhumbehav.2011.12.003
ten Brinke, L., Vohs, K. D., & Carney, D. R. (2016). Can ordinary people detect deception after all? Trends in Cognitive Sciences, 20(8), 579-588. https://doi.org/10.1016/j.tics.2016.05.012
Burgoon, J. K., Blair, J. P., & Strom, R. E. (2008). Cognitive biases and nonverbal cue availability in detecting deception. Human Communication Research. https://doi.org/10.1111/j.1468-2958.2008.00333.x
Curci, A., Lanciano, T., Battista, F., Guaragno, S., & Ribatti, R. M. (2019). Accuracy, confidence, and experiential criteria for lie detection through a videotaped interview. Frontiers in Psychiatry. https://doi.org/10.3389/fpsyt.2018.00748
Darwin, C. (1873). The expression of the emotions in man and animals. In The journal of the anthropological institute of great britain and Ireland. https://doi.org/10.2307/2841467
DePaulo, B. M., Lindsay, J. J., Malone, B. E., Muhlenbruck, L., Charlton, K., & Cooper, H. (2003). Cues to deception. Psychological Bulletin. https://doi.org/10.1037//0033-2909.129.1.74
Ding, M., Zhao, A., Lu, Z., Xiang, T., & Wen, J. R. (2019). Face-focused cross-stream network for deception detection in videos. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition. https://doi.org/10.1109/CVPR.2019.00799
Ekman, P. (1985). Telling lies: Clues to deceit in the marketplace, politics, and marriage. New York: W. W. Norton.
Ekman, P. (2006). Darwin, deception, and facial expression. Annals of the New York Academy of Sciences, 1000(1), 205-221. https://doi.org/10.1196/annals.1280.010
Ekman, P., Friesen, W. V., & Hager, J. C. (2002). Facial action coding system - investigator's guide. In FACS.
Farnebäck, G. (2003). Two-frame motion estimation based on polynomial expansion. In J. Bigun, & T. Gustavsson (Eds.), Scandinavian conference on image analysis (pp. 363-370). Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-45103-X_50
Frank, M. G., & Ekman, P. (1997). The ability to detect deceit generalizes across different types of high-stake lies. Journal of Personality and Social Psychology, 72(6), 1429-1439. https://doi.org/10.1037/0022-3514.72.6.1429
Frankenhuis, W. E., Roelofs, M. F. A., & de Vries, S. A. (2018). Does exposure to psychosocial adversity enhance deception detection ability? Evolutionary Behavioral Sciences, 12(3), 218-229. https://doi.org/10.1037/ebs0000103
Frank, M. G., Menasco, M. A., & O'Sullivan, M. (2008). Human behavior and deception detection. In Wiley handbook of science and technology for homeland security. John Wiley & Sons, Inc. https://doi.org/10.1002/9780470087923.hhs299
Gogate, M., Adeel, A., & Hussain, A. (2018). Deep learning driven multimodal fusion for automated deception detection. In 2017 IEEE symposium series on computational intelligence, SSCI 2017 - Proceedings. https://doi.org/10.1109/SSCI.2017.8285382
Hartwig, M., & Bond, C. F. (2014). Lie detection from multiple cues: A meta-analysis. Applied Cognitive Psychology, 28(5), 661-676. https://doi.org/10.1002/acp.3052
Hill, M. L., & Craig, K. D. (2004). Detecting deception in facial expressions of pain. The Clinical Journal of Pain, 20(6), 415-422. https://doi.org/10.1097/00002508-200411000-00006
von Hippel, W., & Trivers, R. (2011). The evolution and psychology of self-deception. Behavioral and Brain Sciences, 34(1), 1-16. https://doi.org/10.1017/S0140525X10001354
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
Jaiswal, M., Tabibu, S., & Bajpai, R. (2016). The truth and nothing but the truth: Multimodal analysis for deception detection. In IEEE international conference on data mining workshops, ICDMW. https://doi.org/10.1109/ICDMW.2016.0137
Krishnamurthy, G., Majumder, N., Poria, S., & Cambria, E. (2018). A deep learning approach for multimodal deception detection.
Ling, C. X., Huang, J., & Zhang, H. (2003). AUC: A better measure than accuracy in comparing learning algorithms. In Y. Xiang, & B. Chaib-draa (Eds.), Advances in artificial intelligence (pp. 329-341). https://doi.org/10.1007/3-540-44886-1_25
López-Pérez, R., & Spiegelman, E. (2013). Why do people tell the truth? Experimental evidence for pure lie aversion. Experimental Economics. https://doi.org/10.1007/s10683-012-9324-x
Monaro, M., Businaro, M., Spolaor, R., Li, Q. Q., Conti, M., Gamberini, L., & Sartori, G. (2019). The online identity detection via keyboard dynamics. In Proceedings of the future technologies conference (FTC) 2018. FTC 2018. Advances in intelligent systems and computing (Vol. 881, pp. 342-357). https://doi.org/10.1007/978-3-030-02683-7_24
Monaro, M., Capuozzo, P., Ragucci, F., Maffei, A., Curci, A., Scarpazza, C., Angrilli, A., &
Baltrusaitis, T., Mahmoud, M., & Robinson, P. (2015). Cross-dataset learning and person-
Sartori, G. (2020a). In Using blink rate to detect deception: A study to validate an
specific normalisation for automatic Action Unit detection. In 2015 11th IEEE

automatic blink detector and a new dataset of videos from liars and truth-tellers (pp. 494–509). https://doi.org/10.1007/978-3-030-49065-2_35
Monaro, M., Gamberini, L., & Sartori, G. (2017a). The detection of faked identity using unexpected questions and mouse dynamics. PLoS One, 12(5), e0177851. https://doi.org/10.1371/journal.pone.0177851
Monaro, M., Spolaor, R., QianQian, L., Conti, M., Gamberini, L., & Sartori, G. (2017b). Type me the truth!: Detecting deceitful users via keystroke dynamics. In Proceedings of the 12th international conference on availability, reliability and security (ARES '17), Article 60. https://doi.org/10.1145/3098954.3104047
Monaro, M., Zampieri, I., Sartori, G., Pietrini, P., & Orrù, G. (2020b). The detection of faked identity using unexpected questions and choice reaction times. Psychological Research. https://doi.org/10.1007/s00426-020-01410-4
Montagne, B., Kessels, R. P. C., Frigerio, E., de Haan, E. H. F., & Perrett, D. I. (2005). Sex differences in the perception of affective facial expressions: Do men really lack emotional sensitivity? Cognitive Processing, 6(2), 136–141. https://doi.org/10.1007/s10339-005-0050-6
Pérez-Rosas, V., Abouelenien, M., Mihalcea, R., & Burzo, M. (2015). Deception detection using real-life trial data. In Proceedings of the 2015 ACM on international conference on multimodal interaction (ICMI '15) (pp. 59–66). https://doi.org/10.1145/2818346.2820758
Perš, J., Sulić, V., Kristan, M., Perše, M., Polanec, K., & Kovačič, S. (2010). Histograms of optical flow for efficient representation of body motion. Pattern Recognition Letters, 31(11), 1369–1376. https://doi.org/10.1016/j.patrec.2010.03.024
Porter, S., & ten Brinke, L. (2008). Reading between the lies: Identifying concealed and falsified emotions in universal facial expressions. Psychological Science. https://doi.org/10.1111/j.1467-9280.2008.02116.x
Porter, S., ten Brinke, L., & Wallace, B. (2012). Secrets and lies: Involuntary leakage in deceptive facial expressions as a function of emotional intensity. Journal of Nonverbal Behavior. https://doi.org/10.1007/s10919-011-0120-7
Rill-Garcia, R., Escalante, H. J., Villasenor-Pineda, L., & Reyes-Meza, V. (2019). High-level features for multimodal deception detection in videos. In IEEE computer society conference on computer vision and pattern recognition workshops. https://doi.org/10.1109/CVPRW.2019.00198
Sartori, G., Agosta, S., Zogmaister, C., Ferrara, S. D., & Castiello, U. (2008). How to accurately detect autobiographical events. Psychological Science, 19(8), 772–780. https://doi.org/10.1111/j.1467-9280.2008.02156.x
Slowe, T. E., & Govindaraju, V. (2007). Automatic deceit indication through reliable facial expressions. In 2007 IEEE workshop on automatic identification advanced technologies - Proceedings. https://doi.org/10.1109/AUTOID.2007.380598
Stanley, J. T., & Blanchard-Fields, F. (2008). Challenges older adults face in detecting deceit: The role of emotion recognition. Psychology and Aging. https://doi.org/10.1037/0882-7974.23.1.24
Su, L., & Levine, M. (2016). Does "lie to me" lie to you? An evaluation of facial clues to high-stakes deception. Computer Vision and Image Understanding, 147, 52–68. https://doi.org/10.1016/j.cviu.2016.01.009
Trauffer, N. M., Widen, S. C., & Russell, J. A. (2013). Education and the attribution of emotion to facial expressions. Psychological Topics, 22(2), 237–247.
Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), 988–999. https://doi.org/10.1109/72.788640
Venkatesh, S., Ramachandra, R., & Bours, P. (2020). Video based deception detection using deep recurrent convolutional neural network. In Communications in computer and information science. https://doi.org/10.1007/978-981-15-4018-9_15
Vrij, A. (2008). Detecting lies and deceit: Pitfalls and opportunities. Wiley series in the psychology of crime, policing and law. John Wiley & Sons.
Vrij, A., Granhag, P. A., & Porter, S. (2010). Pitfalls and opportunities in nonverbal and verbal lie detection. Psychological Science in the Public Interest (Supplement). https://doi.org/10.1177/1529100610390861
Vrij, A., Mann, S. A., Fisher, R. P., Leal, S., Milne, R., & Bull, R. (2008). Increasing cognitive load to facilitate lie detection: The benefit of recalling an event in reverse order. Law and Human Behavior, 32(3), 253–265. https://doi.org/10.1007/s10979-007-9103-y
Wang, H., Kläser, A., Schmid, C., & Liu, C.-L. (2013). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1), 60–79. https://doi.org/10.1007/s11263-012-0594-8
Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In 2013 IEEE international conference on computer vision (pp. 3551–3558). https://doi.org/10.1109/ICCV.2013.441
Wu, Z., Singh, B., Davis, L. S., & Subrahmanian, V. S. (2018). Deception detection in videos. In 32nd AAAI conference on artificial intelligence, AAAI 2018.
Zhang, Z., Singh, V., Slowe, T. E., Tulyakov, S., & Govindaraju, V. (2007). Real-time automatic deceit detection from involuntary facial expressions. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition. https://doi.org/10.1109/CVPR.2007.383383