Duke Spook’em: A responsive fear modulation system in a horror game environment
ANONYMOUS AUTHOR(S)

Fig. 1. Duke Spook’em iteratively senses user affect, adjusts in-game tension, and measures the adjustment’s effects. To create
this system, we collected physiological data (electrodermal activity (EDA) and photoplethysmography (PPG)) to serve as privileged
information for use in a Learning Under Privileged Information (LUPI) machine learning paradigm. We also collected non-privileged
telemetry data, and users performed post-hoc affect annotation as ground truth. In our final game system, embedded in a procedurally-
generated horror game, the LUPI student model uses only non-privileged telemetry data to predict user affect with high accuracy
and guide their experience (top) towards a pre-authored tension curve (bottom).

Recent research on player modeling employs machine learning algorithms to impressive effect. However, most
such approaches are impractical for real-world applications, usually for one of two reasons: they use intrusive sensory devices and
require their input at run-time [23], or they are not designed for real-time inference [44]. We introduce Duke Spook’em—an end-to-end
system which overcomes those limitations by combining state-of-the-art methodologies in data acquisition, affect modelling, and
prediction. Duke Spook’em closes the affective loop and creates a tailored, adaptive horror game experience which aims to maximize
player engagement. We first gather ground-truth data from 21 participants, leveraging unbounded affect annotation [39] together with
electrodermal activity (EDA) and photoplethysmography (PPG) biometric signals. We discuss the tonic driver of EDA as a feature for
predicting long-running tension. We then produce an SVM ensemble trained under the Learning Under Privileged Information (LUPI)
paradigm [65, 66], enabling runtime predictions without the need for auxiliary equipment. Using 5-fold cross-validation, our model
achieves accuracies of 80% for 2 classes, 74% for 3 classes, and 66% for 5 classes. In our final evaluation with 7 players, Duke Spook’em
was shown to be more effective than a corresponding rule-based system, making for a better and scarier experience.

CCS Concepts: • Human-centered computing → Interactive systems and tools; Empirical studies in HCI.

Additional Key Words and Phrases: affect, games, sensing, LUPI, machine learning, fear

ACM Reference Format:


Anonymous Author(s). 2022. Duke Spook’em: A responsive fear modulation system in a horror game environment. In . ACM, New
York, NY, USA, 23 pages. https://doi.org/XXXXXXX.XXXXXXX

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2022 Association for Computing Machinery.
Manuscript submitted to ACM


1 INTRODUCTION
Affective computing, a relatively new sub-field of computer science in and of itself, has recently broken out into the
world of computer games. Recognizing the power and seemingly unlimited applications of this medium, new grounds
are being broken in research each year. So-called "Serious Games" are widely utilized in education [31], mental health
and therapy [18], developing interpersonal communication skills [29], or even military training [62]. Their impact was,
is, and will continue to be widely studied and explored.
Regardless of its application, however, a video game’s power lies in its ability to fully immerse the user in its digital
world and evoke an emotional response—one that grows stronger with the player’s investment and attachment. And what
better way to improve that experience than to create virtual worlds specifically tailored to the player’s current
moods and needs? Researchers have used affective computing to inform and supplement game design [25, 52] and
to effectively model player experience as a function of the game content and player behavior, enabling generation
of personalized content [68]. 2022 also saw a commercially-released product, The Anacrusis—Left 4 Dead’s spiritual
successor—leverage an AI-powered system to moderate gameplay. Its system constantly tracks the intensity of
combat encounters as perceived by the player and adjusts them to feel rewarding without being overwhelming.
A system’s ability to respond to user emotions is, however, limited by its sensory capabilities. Many techniques for
recognizing emotion have been introduced in the literature over the years—examining voice patterns [20, 36, 50], facial
expressions [2, 17, 26], body gestures [7, 9, 30], or other physiological responses [35, 37, 56], as well as blending multiple
techniques [14, 19, 21]. All of these have shown significant promise in laboratory conditions, but quickly become
cumbersome when used in a game. The sensory equipment involved not only lowers the quality and enjoyment of the
experience, but also makes it impossible to distribute the product to end users without requiring them to have access
to specific sensor hardware. Even then, quality variance in consumer equipment (even in simple and pervasive devices,
like webcams) might heavily influence the results.
In this paper we present a solution which does away with those limitations, requiring specialized equipment only
during the game’s development and design process. We utilize the Learning Under Privileged Information (LUPI)
paradigm [65, 66] to train an SVM ensemble classification model. LUPI enables use of privileged information (in our
case, intrusive sensor data) to create a “teacher” model. This model’s output is then used to train a “student” model
that, crucially, does not have access to the privileged information. We gather information about users’ in-game actions
as input into the teacher model and contextualize them during training with measured Electrodermal Activity (EDA)
and Photoplethysmography (PPG). Eventually, the "student" model produced can predict users’ affective response
without access to their EDA and PPG signals, and efficiently use that information to drive gameplay elements and
maximize the desired effect. During runtime, our implementation uses a relatively small set of statistical data collected
from the game—e.g., players’ movements and actions—which transfers trivially to any first-person game and could be
generalized to third-person implementations with additional work. This approach sacrifices some accuracy—we do
not know specifically when a player meets a particular enemy or event in our game—in exchange for making the
technology far more portable across different games. In both first-person horror games and walking simulators,
players move around: we derive most of our input data from players’ movement-related behavior.
But, of course, a model is only as good as its training data. First, we discuss affect labels. Researchers have proposed
a variety of methods for collecting user affect data, which we discuss in more detail in Section 2.3. Ultimately, we
settled on using ordinal unbounded affect annotation for our data collection. Our work represents the first time that
this technique has been tested at any scale outside of the original work, and we also discuss how it performs as ground
truth for establishing long-running tension as opposed to “event-based manifestations of tension and stress” [39]. We
performed a data-gathering experiment with 21 participants, where we asked them to play a horror game demo, watch
a video capture of the gameplay, and then annotate their experience. We modify the testing protocol from Lopes
et al. [39]—longer play sessions, a focus on building tension rather than “jump-scares”—as we focus on moderating
subject emotion over longer periods of time. Our results show a high rank correlation between EDA and annotation,
demonstrating that approach’s merit. We also demonstrate high correlation with a previously under-explored feature:
the tonic EDA component, as explained in the next paragraph.
The other aspect of training data is the privileged information: the sensor input. As noted, we leverage both EDA
and PPG signals, which have been shown to correlate with user arousal. However, we also introduce and evaluate a
novel use of the EDA signal’s composition: while its phasic driver has been evaluated as an indicator of momentary
fear, we instead rely on its tonic driver as a cue to users’ experience of long-running tension. In our dataset, collected
from 21 users, we observed a rank correlation of 0.12 between the traditionally-used phasic component and users’
self-reported affect. More interestingly, we show markedly better results for the tonic component: 0.46.
By adapting, merging, and improving these cutting-edge techniques and technologies, we present an implementation
of an end-to-end system we call Duke Spook’em: a pre-trained model for affect recognition (built under the LUPI
paradigm) is tightly coupled with a simple system running real-time content adjustment. Duke Spook’em predicts
affect, adjusts gameplay, and gathers feedback on whether said adjustments elicit the desired affective response. With
this loop of prediction and adjustment, we believe we are one step closer to closing the affective loop, and in our final
playtest, our 7 players agreed that their AI-assisted playthrough was more engaging. In addition, we had these players
annotate their playthroughs post-hoc, and our model achieved an average 67% arousal prediction accuracy on their data.
Our contributions are therefore:

(1) The first thorough evaluation of ordinal and unbounded affect annotation against accepted EDA measures [39, 70]
(2) Introduction and evaluation of the tonic component of an EDA signal’s decomposition as a predictor of long-
running tension in subjects
(3) The first practical, real-time application of the LUPI paradigm for prediction of player affect
(4) Duke Spook’em, an end-to-end system that iteratively senses player affect and adjusts gameplay content at
runtime to match a pre-authored target tension curve

2 AFFECT
Affect, which we discuss the definition of below, has been studied by psychologists for decades; more recently, it has
been co-opted by computer science researchers in the study of affective computing. From there, further researchers have
built affective loops in video games. There is only moderate agreement on precise definitions for this messy concept;
we discuss the definitions and touchpoints that underpin our work.

2.1 Definitions and taxonomy


In this paper, we use an emotion taxonomy based on the ones presented in Jeon et al. [28] and Harley et al. [23], where
the authors distinguish between emotions, feelings, and moods. Emotions are defined as the physiological responses of
an individual’s body relative to their current state and objective. Feelings, on the other hand, are how those bodily
responses are interpreted and later represented in the individual’s consciousness. These prior works also distinguish
“moods”, but the difference is subtle [55] and not necessary for our work: we will only use “emotion.” Lastly, the broad
term "affect" is used as an encapsulation of the entirety of an individual’s affective experience (emotions, feelings,
moods, and more).

Fig. 2. Russell’s model of emotion, which we follow in this work: (a) the 2D valence-arousal model of emotion proposed by Russell [61];
(b) a trace of a subject’s self-reported arousal gathered in our experiment.
In this work we deal only with emotions—and a fairly narrow subset of them. As with the rest of the nomenclature,
how emotions should be classified is—to this day—a hotly debated topic. Overall, there is some agreement on the
existence of so-called "basic emotions"; however, their exact number and nature remain contested [27, 45]. Paul Ekman
isolates six of them: anger, fear, disgust, sadness, happiness, and surprise [16]. Robert Plutchik adds anticipation and joy
into the mix, arranging all the emotions into a color wheel—opposing emotions sit on opposite sides of the spectrum,
and their distance from the center is determined by intensity [59]. Ekman’s other work proposes a two-dimensional
scale, which was later expanded by Russell into the valence-arousal model most commonly used today [61], where
valence is the positivity/negativity of an emotion (frustrated is low valence, happy is high) and arousal is the energy
of the emotion (calm is low arousal, excited is high). There exist numerous other models that either re-invent the
taxonomy or expand on pre-existing theories, such as the 3D Lövheim Cube of Emotion [42].
We rely on Russell’s valence-arousal model (see Figure 2a) and specifically focus on the emotion of fear. According to
the model, fear is located in the low-valence-high-arousal region (quadrant II). Given that our work is grounded in a
horror-game environment, we assume that players will naturally gravitate towards negative valence, and thus we focus
only on modulating their arousal—trying to steer them towards the "sweet spot" of fear.

2.2 Affective computing and its rise in video games


Affective computing as we know it came into its own in the late 1990s with the work of Rosalind Picard [57, 58],
who defines it as “computing that relates to, arises from, or influences emotions”. We extend that definition
with Yuksel’s: "Affective Computing is the study and development of systems and devices that can recognize, interpret,
process, and simulate human affect/emotion" [71]. As we are operating in the space of interactive systems in general and
video games in particular, where influencing players is integral, we focus on systems and devices that recognize,
interpret, process, simulate, and influence human emotions.
Computer games are highly evocative and immersive experiences, embedding their players in a constant state
of interaction with their many systems. This allows players to develop complex responses to the stimuli, forming
strong emotional bonds to a game, its world, and its inhabitants. In games research, this is referred to as the “affective
loop” [6, 63]: a system that can work in the smooth cycle of eliciting, sensing, and then responding to the user’s
emotions. Video games are an ideal environment for this, as player input and behavioural data provide rich input,
while ever-improving output techniques can create entire responsive, parallel worlds. Players also seek out experiences
beyond purely positive emotions: they wilfully subject themselves to stressful or downright unpleasant situations to
experience deeper, more powerful involvement in the medium. This allows us to study a much wider spectrum of
emotions, such as tension, as we do here.

2.3 Related work


The field of affective computing is both new and fast-progressing. The rapid advances in technology over the
span of the last two decades open many exciting possibilities—we can more easily build comprehensive models [13],
while crowd-sourcing large user studies [1] can give insights on a scale previously unheard of. Most researchers use
physiological sensors to measure player affect. Examples include a brainwave-controlled game for relaxation training
[11], an emotion recognition system using facial landmarks detected from a live web camera feed [53], and a horror
experience crafted to be as anxiety-inducing as possible by measuring player heart rate in real time [54]. We also use
physiological sensors to measure and respond to player affect, but we focus on Electrodermal Activity (EDA).
EDA is a proven measure of cognitive and emotional stress levels [5, 38], making it a staple of affective
computing and a common tool in emotion modelling [48], including for video games [15]. EDA
consists of two components: tonic and phasic [8]. Its tonic driver varies slowly, and changes are usually observed in a
time frame of up to a minute. The phasic driver, on the other hand, represents rapid changes in the signal over much
shorter time intervals; typically, peaks in the phasic response can be observed within a couple of seconds after exposure
to external stimuli [5]. Historically, research has analyzed the phasic driver, studying the relationship between the
stimuli—their nature, intensity, etc.—and immediate changes in the phasic response [41]. We are not aware of any work
that has explored use of the slower-changing tonic driver, which we do here.
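For illustration, this decomposition can be done with off-the-shelf tooling. The following minimal sketch—an illustrative example, not this paper’s processing pipeline—uses the NeuroKit2 Python library to split a simulated EDA signal, sampled at 100 Hz as in our study, into its tonic and phasic drivers:

import neurokit2 as nk  # assumed dependency: the NeuroKit2 library

# Simulate 60 s of EDA at 100 Hz in lieu of real sensor data, then
# decompose the signal into its tonic and phasic drivers.
eda = nk.eda_simulate(duration=60, sampling_rate=100, scr_number=5)
components = nk.eda_phasic(eda, sampling_rate=100)

tonic = components["EDA_Tonic"]    # slow baseline: drifts over ~a minute
phasic = components["EDA_Phasic"]  # fast responses: peaks seconds after a stimulus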
In addition, all of these existing works require real-time readings from invasive sensor apparatuses; we leverage
machine learning techniques which enable us to avoid using sensors at the game’s runtime.
Vapnik’s Learning Under Privileged Information paradigm [65, 66] mimics a teacher-student approach
in a machine learning environment by training two models—one with access to both privileged and non-privileged
information, and another with direct access only to non-privileged information, but able to use the first model’s output
in its own loss function [40]. Prior work explored using LUPI in games, with physiological and in-game telemetry
data treated as privileged and pixel-level gameplay information treated as non-privileged [44]. The authors achieved
impressive results: their average accuracy was 72% for binary classification of high/low arousal. This paper is the most
closely related to what we describe here, but we go further: we leverage telemetry instead of pixels as non-privileged
information, which makes our model more easily translatable to other games and, crucially, allows it to run in
real-time rather than only in the offline evaluation task performed there. This enables us to evaluate an end-to-end
system in a double-blind study, and our model surpasses theirs in accuracy on a 2-class test (80% vs. 72%). In addition,
we explore using LUPI for more granular classifications with 3 and 5 classes and show how this information can more
meaningfully control gameplay elements and player engagement.
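To make the teacher-student coupling concrete, below is a minimal, hypothetical sketch in PyTorch of one common instantiation of this idea—generalized distillation [40]—in which the student’s loss blends ground-truth labels with the tempered predictions of a teacher trained on privileged data. This is an illustrative sketch, not the implementation evaluated in this paper:

import torch.nn.functional as F

def student_loss(student_logits, teacher_logits, labels,
                 temperature=2.0, imitation=0.5):
    # Hard term: standard cross-entropy against ground-truth affect labels.
    hard = F.cross_entropy(student_logits, labels)
    # Soft term: cross-entropy against the teacher's tempered predictions;
    # the teacher saw privileged sensor data, the student never does.
    soft_targets = F.softmax(teacher_logits.detach() / temperature, dim=1)
    log_probs = F.log_softmax(student_logits / temperature, dim=1)
    soft = -(soft_targets * log_probs).sum(dim=1).mean()
    # The imitation parameter weights how much the student trusts the teacher.
    return (1.0 - imitation) * hard + imitation * soft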

3 DUKE SPOOK’EM: IMPLEMENTATION


We now introduce Duke Spook’em—our attempt at closing the affective loop. The system is capable of continuous content
adjustments in a video game based on the user’s emotional response as predicted by an ML model. We start by collecting

preliminary training data which shows the tonic driver of EDA to be a good predictor of long running tension, and which
evaluates the work of Yannakakis et al. on ordinal annotation [39, 70] in its first real-time prediction application. We
then use these findings to generate ground-truth data for training a LUPI-enabled ensemble SVM capable of multi-label
classification with results of 80%, 74% and 66% for 2, 3, and 5 levels of player arousal, respectively. We discuss a simple,
rule-based system in our horror game which consumes these classifications and moderates in-game content to steer the
player’s emotional response in a desired way—i.e., following a pre-authored tension curve. Lastly, we evaluate Duke
Spook’em in a blind test with 7 participants and show that it meaningfully and noticeably enhanced players’ experience.

3.1 Training data collection


We first gathered ground-truth data for training our classifier. We wanted to evaluate the work of Yannakakis et al. regarding
the ordinal nature of emotions [69, 70]—they present promising results, but so far those claims have not been verified
outside of the original work, nor in any kind of real-time prediction environment such as Duke Spook’em. We also
wished to explore the decomposed EDA signal, specifically to determine whether its slow-changing tonic driver is a more
suitable predictor of long-running tension than the fast-changing phasic component previously explored by others.

3.1.1 Procedure. A researcher welcomed each participant, who was then helped to attach EDA and PPG sensors. The
participant sat in a standard chair and was asked to play a prototype of our horror game on a desktop computer. Sensors
were recorded at a 100 Hz sampling rate, and the user’s gameplay was also recorded at both the pixel and telemetry
levels. Play sessions lasted between 5–12 minutes. After the gameplay session concluded, users were asked to annotate
the arousal they experienced during gameplay using a system similar to RankTrace [39], which showed a replay of their
gameplay footage while they used an unbounded affect annotation tool controlled by the up/down keyboard arrows
to mark whether they were becoming more or less aroused. There was no required frequency of annotation;
subjects were encouraged to indicate only changes in their arousal. Overall, players averaged 3.2 seconds between
annotations.
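To illustrate how such unbounded annotation becomes a usable signal, the following hypothetical sketch (our tool’s internals may differ) accumulates timestamped up/down key events into a piecewise-constant trace and min-max normalizes it per subject:

import numpy as np

def build_trace(events, duration_s, rate_hz=10.0, step=1.0):
    # Accumulate timestamped up/down annotation events into an unbounded,
    # piecewise-constant arousal trace sampled at rate_hz.
    # `events` is an iterable of (time_s, direction), direction = +1 or -1.
    t = np.arange(0.0, duration_s, 1.0 / rate_hz)
    trace = np.empty_like(t)
    level, idx = 0.0, 0
    events = sorted(events)
    for i, now in enumerate(t):
        while idx < len(events) and events[idx][0] <= now:
            level += step * events[idx][1]
            idx += 1
        trace[i] = level
    return trace

def normalize(trace):
    # Min-max normalize a subject's trace to [0, 1]; the absolute scale of
    # an unbounded trace is arbitrary per subject (assumes a non-flat trace).
    return (trace - trace.min()) / (trace.max() - trace.min())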

3.1.2 An aside on collecting affect ground truth. There is some dispute regarding annotating video game data. There is
the obvious technical limitation—players are already using an input device (mouse, keyboard, controller) to play, and
as such are unable to use yet another tool for live annotation. There is the possibility of verbal communication with
researchers, but this could affect the subject’s immersion and the results. The de-facto standard for games-related affect
research is having subjects annotate videos captured during their attempt [46, 47]. One could argue that remembering
an emotion is not the same as experiencing it live and—while this is true—there is sufficient evidence that humans are
fairly good at remembering their emotional states. Most studies conclude that emotions actually enhance our ability to
remember details correctly [33]. Furthermore, negative emotions have a more beneficial influence on memory than
positive ones [32], which suits our application well.
Additionally, humans are better at discriminating between options than at ranking them on an absolute scale [69].
According to adaptation theory, we keep a frame of reference for each stimulus and register stimuli as deviations
from that reference. This baseline reference changes over time—being exposed to a certain emotion a lot will eventually
make a subject less susceptible to impulses from the same category [24]. Studies show significant improvement in
inter-user agreement when using ordinal annotation, with less data needed to achieve satisfactory results [70].
Thus, as noted, we implemented a tool similar to Yannakakis’s RankTrace [39], which produces a “trace” of subjects’
perceived and recalled up/down changes in affect as they watch a replay of their game session. The annotation is
continuous and unbounded—users do not require an arbitrary frame of reference and focus only on changes in
their recalled emotional response.

Fig. 3. An example of our annotation tool (window in the top-left corner) running, with an example recording of the experiment
playing in the background.

3.1.3 Participants. We recruited 21 participants (7 female, 12 male) from our university and social networks, ranging
in age from 24 to 35. Participants all had previous experience with horror video games and were not compensated for
their time.

3.1.4 Results. We used standard processing on the collected physiological data: we performed R-peak analysis of the
PPG signal, resulting in 3 different heart-rate measures (BPM, SDSD, RMSSD), and we decomposed EDA into tonic and
phasic drivers. As research suggests that the baseline EDA response varies greatly from person to person [38, 41],
we also normalized the EDA signal and extracted its gradient. Users’ annotated arousal traces were also normalized and
their gradients extracted, then binned into 2, 3, or 5 equal-width classes (used to train different classifiers). Initial
analysis showed rank correlation values of 0.7–0.9 between the annotated values and the measured EDA
response. We further extracted 9 distinct features from the collected telemetry data, capturing players’ movement patterns
and high-level game state (see Table 1).
All the data—biometrics, annotation, gameplay features—was gathered together and further processed via the "sliding
window" method. We settled on windows of length 5 seconds with a step of 0.5 seconds; preliminary exploration showed
significant accuracy drops with shorter time frames, while longer windows introduced instability in our models. To reduce
our windows to a manageable feature size and thus reduce overfitting, we extracted various statistical features
over each timeseries within the window. In particular, we took the average, maximum, minimum, and amplitude of
each of our 14 feature values (the 9 telemetry values and the 5 biometric values) within each window. This gave us 56
features to explore. We performed a similar process on the players’ annotations, taking average, minimum, maximum,
and amplitude in aligned windows of the same size. For annotation, we also considered the gradient (and its min, max,
and average) and the integral of the trace within each window.
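A minimal sketch of this windowing scheme for a single feature series (illustrative; assumes a 1-D NumPy array resampled to a common rate):

import numpy as np

def window_stats(series, fs=10, win_s=5.0, step_s=0.5):
    # Slide a 5 s window with a 0.5 s step over one feature series
    # (assumed resampled to fs Hz) and compute the four per-window
    # statistics used here: average, maximum, minimum, and amplitude.
    win, step = int(win_s * fs), int(step_s * fs)
    rows = []
    for start in range(0, len(series) - win + 1, step):
        w = series[start:start + win]
        rows.append((w.mean(), w.max(), w.min(), w.max() - w.min()))
    return np.asarray(rows)  # shape: (n_windows, 4)

# Applied to each of the 14 feature series, this yields the 56 features.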
From our 21 subjects, data from 18 was usable (the others were discarded due to sensor and other malfunctions),
amounting to a total of 11732 data points in our dataset (data points were then windowed as described).

Feature   Description
P_V       Player’s velocity vector
P_|V|     Player’s velocity magnitude
P_FWD     Player’s forward vector
P_DOT     Dot product of the velocity and movement vectors
P_CΔ      The delta of the player’s mouse input (X, Y in screen space)
P_I       Raw input (keystrokes) during the frame
P_D       Distance to the nearest in-game entity (NPC, monster)
P_L       Types of objects the player looked at during the frame
P_W       Number of walls that the player hit head-on

Table 1. Telemetry features extracted from the recorded play sessions. All data was collected at runtime and then processed after
the experiment was over.

To test our hypothesis that the EDA tonic component is a good predictor of long-running tension, we computed the
Pearson correlation coefficient and Spearman’s rank correlation coefficient between the annotated ground truth and
the two EDA signal components. This follows the rationale of Yannakakis et al. [39] that EDA is a well-established
and reliable manifestation of fear and stress.
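As a sketch, such coefficients can be computed with SciPy; the arrays below are synthetic stand-ins for the aligned per-window averages:

import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic stand-ins; in practice these are per-window averages of the
# tonic EDA component and the annotation trace, aligned window-for-window.
rng = np.random.default_rng(0)
tonic = rng.standard_normal(500)
annotation = 0.5 * tonic + rng.standard_normal(500)

rho, p = spearmanr(tonic, annotation)  # Spearman rank correlation
r, _ = pearsonr(tonic, annotation)     # Pearson correlation coefficient
print(f"Spearman rho={rho:.3f} (p={p:.3g}), Pearson r={r:.3f}")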
We observed promising results in the experiment—as presented in Table 2—showing high correlation levels between
the normalized signal of unbounded, ordinal annotation and EDA. We observed a rank correlation of 0.116 between the
phasic EDA component and annotated arousal, which is consistent with results reported by Yannakakis et al. [39].
What is more, we note better overall outcomes using the tonic driver over the phasic one—a rank correlation of 0.462.
We conclude that the ordinal, unbounded affect annotation method proposed by Yannakakis et al. is a valuable and
reliable tool for gathering ground-truth data on users’ experience of fear, even in our game scenario with its focus on
long-running tension over jump scares. We also chose to move forward with the tonic component of EDA in our LUPI
model, given the strong Spearman rank correlation between it and the annotation delta in our users’ affect traces.

3.2 Predicting player arousal


In order to optimize and evaluate the performance of Duke Spook’em, we trained and compared multiple ML models for
predicting players’ arousal. We start by comparing five non-LUPI and LUPI-based neural networks similar in structure
to those used by Yannakakis et al. to predict player affect from pixel-level gameplay recordings [44]—we include a
pixel-based LUPI predictor of our own to use as a baseline for comparison. While we test pixel-based prediction offline,
we also note that it is impractical for online usage: grabbing and processing image data at runtime is far too expensive
an operation to run smoothly alongside a video game.
• A pixel-based LUPI network with the same architecture as the one proposed by Yannakakis et al., the only
difference being the number of telemetry input features we use.
• Two non-LUPI neural networks trained with (1) only telemetry and (2) all of the available data, respectively.
They implement a simple architecture with 3 fully-connected layers and dropout to prevent overfitting.

                 Annotated   ann_delta   ann_grad   ann_max_grad   ann_min_grad   integral
EDA_Tonic_AVG    *0.462       0.010      -0.032      -0.009         -0.027        *0.461
EDA_Phasic_AVG   *0.116      -0.032       0.025       0.029         -0.001         0.116
dot_mean         *0.237      -0.010       0.023       0.020          0.011        *0.236
dot_std          *0.128      *0.095      *0.089      *0.099         -0.002        *0.128
dot_min           0.005      -0.095      -0.055      -0.067          0.011         0.004
dot_max          *0.200      *0.112      *0.074      *0.112         -0.036        *0.200
dot_amplitude    *0.128      *0.104      *0.089      *0.102         -0.007        *0.128
x_std            *0.075      -0.012      *0.085       0.046          0.059        *0.075
y_std            -0.131      -0.059       0.000      -0.039          0.041        -0.131
x_amplitude      *0.076      -0.004      *0.088       0.053          0.057        *0.076
y_amplitude      -0.130      -0.057       0.003      -0.036          0.043        -0.130
bpm               0.012      *0.085      -0.025      -0.055          0.028        -0.127
sdsd            *-0.145      -0.060      -0.008      -0.039          0.032        -0.145
rmssd           *-0.127      -0.040      -0.015      -0.023          0.002        -0.127

Table 2. Correlation coefficients (Spearman) for selected features gathered in our experiment. Features prefixed EDA represent the
physiological signal readings; features prefixed dot are derived from the dot product of the player’s forward-facing and velocity
vectors; features containing x or y are derived from the player’s mouse movements. Results similar to or better than those of the
original RankTrace [39] evaluation are marked with an asterisk (*).

• The LUPI teacher network is trained using only physiological sensor data—EDA and PPG signals. This network
follows the architecture of the previous two, with adjustments to reflect that it only receives biometric data.
• The student network is trained using telemetry as input, supplementing its loss function with predictions
from the teacher network. Its architecture follows all the previous ones, with adjustments made to account for the
fact that the student does not receive the biometric signals.
We also tried fitting a variety of simpler models to our data, with a focus on SVMs, as they are the model of choice in
Vapnik’s original LUPI work [66]. We believe them to be well-suited to our problem, as they are traditionally considered
to be among the best predictors for small but complex data sets [10]. This also has the implicit benefit of allowing
us to follow the original work more easily.
We observed results similar to Yannakakis et al. with the pixel-based classifier [44]—averaging around 75%—but
it quickly became clear that while NNs were a good fit for the pixel-based model, they did not perform as well on our
telemetry-focused data set: we achieved only 52% accuracy for binary classification, and we observed the networks to be
very unstable. The SVMs, however, had significantly higher accuracy (68–80% for 3-label classification), particularly
when used in an ensemble: with 20 parallel SVMs we achieved 80% for binary higher/lower arousal classification.
An ensemble of NNs would be prohibitively expensive to run, so we did not test this. We offer further detail on the
SVM-based models below.

3.3 Training and adding ensembles


We use the same 56 features considered in our data collection and EDA evaluation, i.e., the window-based
statistical derivations of telemetry (see Table 1) with 5 s windows and a 0.5 s step. We further examined the rank correlation
of each feature with the players’ annotations, trained single-feature SVMs to predict players’ annotations, and finally chose
one statistical feature to use for each raw telemetry feature in our final SVM predictor. We use the mean normalized affect
annotation within each time window as our labels, splitting the data into 2, 3, or 5 equal-width intervals to test different
granularities of classification.

Fig. 4. A user’s annotation trace plotted against their tonic EDA response: (a) a slow buildup of tension and a rising tonic EDA signal;
(b) an example of a more erratic signal with visible peaks and valleys. We postulate that the tonic component of the signal is a good
predictor of the long-term tension experienced by users.

We perform 10 rounds of model training with 5-fold cross-validation on each and report averaged results—importantly,
we ensured that each model was trained on exactly the same data splits.
We also tested different split methods—random split, stratified random split, even random split, withholding whole
participants for testing, and a split into time series based upon whether a player had moved sufficiently far from
their previously recorded position—to account for any implicit bias introduced by considering time-series
data, but we did not notice differences in their performance. Thus we report only the stratified random split here.
Even though the demo was designed as a non-linear experience, we observed players following similar patterns and
routes through the levels.
The pixel-based LUPI classifier performed very well, achieving average results of 75% on 2 classes, similar to the
original paper. We note, however, that in our implementation the teacher network was prone to overfitting, which also
affected the student network. We quickly discovered that neural networks were not a good fit for our particular type of
data; even a non-LUPI model produced just 52% accuracy (barely better than chance) and showed instability which we
were unable to eliminate. As such, we explored various basic models and present the results on the 3-class task in Table
3. A particular star was the SVM, which achieved 81.5% accuracy, compared to the neural net’s 41.5%. As SVMs
performed well and the original implementation of LUPI dealt exclusively with them, we settled on SVMs.

Classifier          Accuracy (ALL)   Accuracy (PRIVILEGED)   Accuracy (TELEMETRY)
SVM (RBF kernel)    0.815            0.641                   0.583
Decision Tree       0.656            0.628                   0.538
AdaBoost            0.635            0.620                   0.540
Random Forest       0.588            0.614                   0.520
Nearest Neighbors   0.573            0.704                   0.573
Naive Bayes         0.542            0.563                   0.484
Neural Net          0.415            0.565                   0.423
QDA                 0.396            0.538                   0.396

Table 3. Comparison of accuracies on 3-label classification. Results are averaged over 10 iterations—each using 5-fold cross-
validation—and reported in descending order of accuracy achieved on all data. We note the poor performance of the Neural Net
classifier and the very good results of the SVM.

We quickly noted the slow performance of SVM models—both during training and inference—due to the high dimension-
ality of our data. It is often recommended to train an SVM model on a representative subset of the data [49, 51]. We
verified the merit of that approach by training a number of classifiers on randomly selected subsets of our data set, each
consisting of 30% of the total data points. We observed an average accuracy of 69.2% across 5 distinct classifiers with an
STD of 0.4 (see Table 4a)—all reported values are averaged over 10 iterations in a 5-fold cross-validation (5-CV) scheme,
using a stratified random split. Encouraged by the results, we trained a LUPI-enabled SVM classifier—which we henceforth
refer to as SVM+—in the same way. We forked and updated the svmplus library for Python—a faithful implementation
of Vapnik’s original work—to use the newest scikit-learn API for this task. Accuracies ranged from 44–48.2% for SVM+
LUPI student models on a 3-class classification task (see Table 4b). Needless to say, this is not adequate for our task.

(a) SVM with radial basis function
               All features   Privileged   Telemetry
Classifier 1   0.695          0.604        0.604
Classifier 2   0.690          0.600        0.568
Classifier 3   0.690          0.610        0.566
Classifier 4   0.698          0.612        0.563
Classifier 5   0.687          0.602        0.561

(b) SVM+
               SVM+ (student)
Classifier 1   0.482
Classifier 2   0.458
Classifier 3   0.446
Classifier 4   0.450
Classifier 5   0.440

Table 4. Training the support vector machine classifiers on data subsets to improve efficiency and prediction time.

We suspect that this lower accuracy can be attributed to the complex nature of the model—with the teacher’s loss
function encoded directly into the student, we increase the potential points of failure when it comes to the highly
transient and contextual affect data. The problems compound as we train the models on only data subsets. We thus
explore ensemble learning to alleviate those issues while still retaining the benefits of faster training and prediction
compared to a single, large SVM [12]. We should note that normally we would not be so concerned with a model’s
runtime and could be more liberal in choosing the approach with the best results. However, Duke Spook’em is a live
inference system running alongside a video game on varying consumer-grade hardware, which begets all the practical
concerns. We implement a simple SVM-based ensemble model with majority voting, testing how different numbers of

classifiers from 5–30 influence prediction runtime, and then we compare their accuracy on 2, 3, and 5 label classification
problems (see Table 5). This followed our usual protocol, and reported accuracies are averaged over running the 5-fold
cross validation scheme 10 times.

No. of classifiers               Average accuracy
in ensemble          2 classes   3 classes   5 classes
5                    0.721       0.639       0.569
10                   0.784       0.681       0.654
20                   0.808       0.718       0.596
30                   0.801       0.741       0.666

Table 5. Results achieved by utilizing different numbers of classifiers in our ensemble model.

Simple ensemble voting enabled us to consistently achieve 80% average accuracy on the binary classification problem,
74% for 3 classes, and 66% for 5 classes, using the largest ensemble size of 30 classifiers. While the accuracy of this
variant was superior, its runtime left a lot to be desired, taking over 2 seconds to produce a single prediction. As such,
we later utilize the 3-label variant of a 20-classifier ensemble—with sub-second prediction times and still-respectable
accuracies of 80%, 72%, and 60% for 2, 3, and 5 classes, respectively.
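As a point of reference, an analogous off-the-shelf construction—not our hand-rolled voter—is scikit-learn’s BaggingClassifier (1.2+ API), which likewise fits each SVM on a random subset of the data and combines them by voting; the data here is a synthetic stand-in:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the windowed telemetry features and 3-class labels.
X, y = make_classification(n_samples=2000, n_features=14, n_classes=3,
                           n_informative=8, random_state=0)

# 20 RBF-kernel SVMs, each fit on a random 30% subset of the data,
# combined by majority voting at prediction time.
ensemble = BaggingClassifier(estimator=SVC(kernel="rbf"),
                             n_estimators=20, max_samples=0.3, n_jobs=-1)
scores = cross_val_score(ensemble, X, y, cv=5)  # 5-fold cross-validation
print(f"mean accuracy: {scores.mean():.3f}")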

3.4 Implementing the feedback loop


3.4.1 Representing desired tension. So, what exactly is the target response that we want to elicit in our players?
Traditionally, tension throughout an experience builds up constantly while allowing local fluctuations. This ensures
that—even though the stakes get progressively higher—the audience is granted some breathing room. This is common
across traditional media like movies (see the breakdown of Star Wars Episode IV in Figure 5a), as well as interactive
media like video games. Unfortunately, we cannot directly apply this reasoning in our system’s design, as our game
relies heavily on procedural generation and emergent storytelling rather than linear progression. We cannot ensure
when or in what order particular story points will happen. We therefore propose a simplification of this model: we
assume the existence of a finite set of "milestones" within the game. While we cannot ensure they will happen in
any pre-determined order, we know that, with each milestone, the stakes—and tension—should increase. In terms of
adaptation theory [24], each milestone raises the baseline for perceiving fear stimuli. As such, Duke Spook’em’s
loop is reduced to monitoring and adjusting the emotional response in a fluctuating fashion within each "milestone
interval" (see Figure 5b).

3.4.2 Creating the "action" space. While we have now solved the problems of sensing player arousal and understanding
what the target arousal is, we need to give Duke Spook’em some ammunition with which to scare the players. We abstract
in-game events designed to change the fear response into "actions" that the system is allowed to take. To give game
designers creative control over Duke Spook’em’s behavior, we store information about the available actions in a
pre-authored knowledge base—a text file whose contents follow simple semantics. Each action is represented by (see
Figure 6a):
(1) Action: A descriptive name, later used as an ID
(2) Ante: A list of prerequisites (other actions) required for the action to be available
(3) Weight: A weight parameter, arbitrarily assigned by the designer. Weights are the determining factor when
multiple actions are available at the same time; the one with the higher weight is considered first.
(4) Expected: Whether the action is expected to raise or lower the tension
(5) Conse: If a node contains both "Ante" and "Conse" fields, it is considered a logical operator determining the availability
of certain actions, in which "Conse" determines the action(s) made available if the "Ante" of the given node is
fulfilled.

Fig. 5. Examples of tension curves: (a) an example of how tension is built throughout the running time of Star Wars Episode IV:
A New Hope; (b) we assume that tension will follow a sinusoid-like function between certain story beats, with its baseline driven
up by every milestone the player completes.

We were inspired by Haddawy’s work on Bayesian Network (BN) generation [22], and we borrow the knowledge base
semantics directly from there. We also implement his algorithm for generating Bayesian Networks from pre-existing
knowledge bases, but we make no assumptions about the probability distribution of our actions and only use the
generated network as a graph structure for easy determination of the actions available at any given moment. We use the
graph generation algorithm but forgo the actual BN in our implementation, as it was an inadequate choice for
representing a game design model—the model changes constantly throughout development, and we want to be able to
extend the action space easily if needed. An example of generation can be seen in Figure 6, where we present a simple
"knowledge base" and an action graph constructed from it.

3.4.3 The memory model. We introduce a "memory model", based on a simple logging mechanism, to keep information
about actions taken by the system. Executed system actions are batched into 30-second increments in a queue; they
move further and further back and are forgotten after 2.5 minutes (see Figure 7). In each 30-second interval, we keep
track of the following information:

• What actions were taken, including the full description of the action with its expected result and weight
• An array of predicted arousal values (the inference model is queried in 1-second intervals)

Every 30 seconds, the data in the earliest bucket is discarded, everything remaining is moved one bucket over, and the most
recent bucket—now empty—starts being populated. We use the memory model mostly for adjusting action weights based

Action : Play mysterious monster sounds
Ante : MonsterSpawnedAtLeastOnce
Weight : 0.3
Expected : UP

Action : Play Signe screams
Ante : PlayerTalkedToSigne AND MonsterSpawnedAtLeastOnce
Weight : 0.5
Expected : UP

Ante : PlayerFoundAccessCard
Conse : PlayerNearAnAccessPoint

Action : Play a message from Amy
Ante : PlayerNearAnAccessPoint
Weight : 0.2
Expected : DOWN

Fig. 6. Overview of how we generate the actions to be considered by Duke Spook’em from a pre-authored knowledge base: (a) an
example of a pre-authored action knowledge base; (b) a simple action graph generated from Listing 6a.

on whether they had the desired outcome and, if so, how long it took to elicit the response: actions are prioritized
(weight increased) if they met the desired responses and deprioritized (weight decreased) otherwise.

Fig. 7. A visualisation of the memory model used by Duke Spook’em: actions taken by the system are bucketed into 30-second
increments, with actions taken more than 150 seconds ago forgotten.
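A minimal sketch of such a bucketed memory (an illustrative structure; field names are ours):

from collections import deque

N_BUCKETS = 5  # 5 buckets x 30 s = 150 s of memory

class MemoryModel:
    # Fixed-length queue of 30-second buckets. Appending a new bucket when
    # full silently discards the oldest, so logged actions are forgotten
    # after 2.5 minutes.
    def __init__(self):
        self.buckets = deque([self._empty_bucket()], maxlen=N_BUCKETS)

    @staticmethod
    def _empty_bucket():
        return {"actions": [], "arousal": []}

    def tick(self):
        # Called every 30 s: shift everything back one bucket.
        self.buckets.append(self._empty_bucket())

    def log_action(self, action):
        # Full action description, with expected result and weight.
        self.buckets[-1]["actions"].append(action)

    def log_arousal(self, label):
        # Predicted arousal; the inference model is queried every second.
        self.buckets[-1]["arousal"].append(label)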

3.4.4 Putting it all together. Duke Spook’em operates in a constant loop of prediction, adjustment, and feedback, trying to
steer the subject to follow the sine curve of expected arousal. This curve is assumed to be a sine-like function partitioned
into sections by milestones—pre-authored story beats within the game that are consistent across playthroughs. We
assume that crossing a milestone into another section increases baseline tension, thus shifting the function up (see
Figure 5b). Arousal is predicted by the inference model in 1-second intervals, while the time between actions is a
configurable parameter—in the gameplay demo we prepared for evaluation, it also ran every second (see Algorithm 1).
The feedback loop itself is not concerned with milestones and the increases in baseline tension following them—this is
reflected in the evolving set of actions, which are designed to appear in increasing order of intensity. Duke Spook’em
attempts to steer the player response within a single section between milestones to match the sine-like curve, and—when
a new milestone is reached—the system is turned off for a brief moment, allowing players an opportunity to come down to
their baseline. This baseline is assumed to have risen—following adaptation theory [24]—and when the system resumes,
its loop treats the new baseline as the lowest arousal label. This framing ensures that we can meaningfully moderate a
long-running experience with a relatively low-resolution classifier.
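For illustration, such a target curve could be expressed as below; the period, amplitude, and baseline step are placeholder parameters, not values prescribed by the system:

import math

def target_arousal(t_since_milestone, milestones_reached,
                   period_s=120.0, amplitude=0.5, baseline_step=1.0):
    # Target tension as in Figure 5b: a sine-like fluctuation riding on a
    # baseline that rises with each completed milestone.
    baseline = milestones_reached * baseline_step
    return baseline + amplitude * math.sin(
        2 * math.pi * t_since_milestone / period_s)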
There are numerous ways in which the system adjusts its action selection algorithm, scattered across different stages
of the loop. We present a brief overview of the methods contributing to the adaptability of Duke Spook’em’s
director capabilities:

• Moderated action space: only a subset of the actions (∼80%) is loaded into memory on game start, giving every
playthrough the opportunity to unfold under unique circumstances. Additionally, with each milestone, new
actions become available and some of the old ones are discarded.
• The memory model: following the well-established reasoning that exposure to a certain stimulus increases
our resistance to it [24], we adjust the weights of the available actions to reflect whether they have already been
experienced in recent memory.
• Desired response feedback: when an action is taken, it is committed to memory along with its expected
response (UP/DOWN). We track the action throughout its lifetime across our 5 memory stages and adjust its weight
for future selections in the following ways (summarized in the sketch after this list):
– If the fear response predicted by our inference model changes to meet the desired one, we adjust the weight as
\( w_{\text{adjusted}} = w_{\text{current}} \cdot \left(1 + 0.2 \cdot \frac{F}{n_{\text{bucket}} + n_{\text{actions}}}\right) \),
where \( F \) is a configurable fall-off parameter and \( n_{\text{actions}} \) is the number of actions taken since the
action in question. We thus reward quickly meeting the desired response and penalize the response being slow or
possibly influenced by other actions taken since.
– If the fear response predicted by our inference model changes to the opposite of the desired one, we adjust the
weight as \( w_{\text{adjusted}} = w_{\text{current}} \cdot \left(1 - 0.2 \cdot \frac{F}{n_{\text{bucket}} + n_{\text{actions}}}\right) \).
We penalize actions having the opposite effect, but are more lenient the further back in time they took place. We
also take into account the number of actions taken since, any of which could have actually caused the change.
– If the action reaches the end of its in-memory lifetime without any recorded change in affect, we set
\( w_{\text{adjusted}} = w_{\text{current}} \cdot 0.95 \): we slightly penalize the action for having no effect, but not as
severely as if it had provoked the opposite one.
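The three update rules above can be summarized in code; this is a direct transcription of the formulas, with outcome naming the three cases:

def adjust_weight(w_current, outcome, F, n_bucket, n_actions):
    # Adjust an action's weight given its observed outcome. F is the
    # configurable fall-off parameter, n_bucket the age of the action in
    # 30 s memory buckets, and n_actions the number of actions taken since
    # the action in question.
    if outcome == "met":       # predicted response met the desired one
        return w_current * (1 + 0.2 * F / (n_bucket + n_actions))
    if outcome == "opposite":  # predicted response went the other way
        return w_current * (1 - 0.2 * F / (n_bucket + n_actions))
    return w_current * 0.95    # expired with no recorded change in affect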

4 EVALUATION
To evaluate the performance of our system, we conducted a user study in which subjects were asked to play the entire
game in two variants: with and without the use of Duke Spook’em. The participants answered a questionnaire after
each playthrough and took part in a short free-form interview at the end of the study. We also asked each subject
to annotate the captured videos of their attempts—using the methodology introduced in Section 3.1.2—to measure the
accuracy of our inference model on previously unseen data.

4.1 Experimental protocol


The methodology remains largely the same as described in Section 3.1.2, with the following differences: the study
was conducted in two parts, carried out on two consecutive days, and the subjects were not attached to any sensors

Algorithm 1: Duke Spook’em core loop

Given:
\( I_{\min} \) = Minimum interval between consecutive actions
\( F \) = Memory fall-off parameter
\( n_{\text{bucket}} \) = For each action, the index of the "memory bucket" in which the action was last taken; −1 if the
action was not taken in the last 2.5 minutes
\( w_{\text{original}} \) = For each action, the weight derived from the generated graph
Perform:
(1) Every second: query the inference model for the subject’s predicted arousal
Every second: check the current desired emotional response
(2) If predicted != desired and at least \( I_{\min} \) seconds have passed since the last action was taken, perform action selection:
(a) Generate the action graph under the current in-game circumstances
(b) Gather all currently available actions (leaves in the graph) and filter them according to the desired response
(UP/DOWN)
(c) For actions with \( n_{\text{bucket}} \neq -1 \), adjust their weights based on their having already been performed in
recent memory: \( w_{\text{adjusted}} = w_{\text{original}} \cdot \left(1 - \frac{F}{n_{\text{bucket}}}\right) \)
(d) Select the best action and perform it
(3) Log to memory: arousal values and actions, if any have been taken

during their gameplay attempts. On each of those days we asked subjects to play one variant of our game without telling
them which variant they were given at the time. We did not disclose the exact purpose of the study and did not give
participants any indication of whether or how the variants differed. The inference model implemented in the gameplay
demo used the 3-label classification variant of a 20-model SVM+ ensemble. As in the previous study, we asked
participants to annotate their recorded gameplay videos after their playthroughs using our RankTrace-inspired interface.
We also recorded the actions taken by the system—along with their weights and predicted effects—and the system’s
second-by-second inferences of player affect.

4.1.1 Participants. We recruited 7 participants for this study (4 male, 3 female), ranging in age from 25 to 26 years.
As opposed to the previous experiment—where we aimed to create as diverse a population under test as possible—
we deliberately recruited subjects from the same class with similar profiles in regard to video game and horror media
enjoyment. This was done to minimize subjective bias as much as possible and to allow us to more
confidently draw comparisons between different players.

4.2 Fear inference model accuracy


We compared subjects’ annotated labels with predictions performed at runtime by taking the majority class within each
1-second window of an annotated trace—matching how often inference was queried—and comparing it to the value
predicted by the inference model. Each subject contributed an average of 702 entries (STD 202). We note that subjects’
second attempts were usually shorter, resulting in fewer data points—by 31 seconds on average, see Table 6—but this did
not in any way correlate with the respective accuracies. After the interviews, we attribute this difference to players
becoming familiar with
the control scheme and the game environment. Overall, the results are promising: we reached an average accuracy of
approximately 67% on the Duke Spook’em variant and 64% on the simple variant when comparing the affect predicted at
runtime with players’ post-hoc annotated arousal. We are particularly happy with the almost 70% accuracy achieved on
the longest playthrough of almost 19 minutes. All of these findings suggest that our model is a good predictor of
long-running tension.

"Duke Spook’em" Simple rule-based


variant variant
No. of entries No. of entries
Accuracy Accuracy
(seconds) (seconds)
Participant 1 0.631 381 0.589 432
Participant 2 0.679 689 0.665 624
Participant 3 0.714 543 0.648 576
Participant 4 0.691 1178 0.684 984
Participant 5 0.654 743 0.591 826
Participant 6 0.632 653 0.703 571
Participant 7 0.672 836 0.639 794
Table 6. Reported inference model accuracies - broken down by participant
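To make this comparison concrete, a sketch of the majority-vote reduction (assuming, for illustration, that the annotated trace is sampled at 10 Hz):

from collections import Counter
import numpy as np

def majority_per_second(trace_labels, fs=10):
    # Collapse a per-sample annotated label trace (fs samples per second)
    # into one majority-vote label per 1 s window, matching the 1 Hz
    # cadence at which the inference model was queried.
    out = []
    for start in range(0, len(trace_labels) - fs + 1, fs):
        window = trace_labels[start:start + fs]
        out.append(Counter(window).most_common(1)[0][0])
    return np.array(out)

# accuracy = (majority_per_second(trace_labels) == runtime_predictions).mean()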

4.3 Duke Spook’em as an end-to-end system


We now discuss results from the user questionnaires and interviews. Most importantly, players of the
Duke Spook’em-enhanced game found it scarier and more enjoyable, without feeling that there was a smart system in
the background controlling things. In addition, we found evidence that our event management system was effective
and that auditory stimuli were scarier than visual ones. Our summarized findings:
• Players playing the Duke Spook’em variant were less likely to say the game functioned as a pre-determined
system where they could discern clear rules. (This question was asked only for players’ first game experience,
whether it was rules-based or Duke Spook’em-enhanced, as we expected the question itself to bias them on
subsequent playthroughs of either system.)
• Players playing the Duke Spook’em variant were much less likely to notice the same events being repeated. This
demonstrates that our deprioritization of events present in recent memory works well.
• Across both variants, 6 out of 7 participants reported that repeating the same stimuli yielded diminished effects.
The remaining participant said repetition would have the same effect on them. This is in line with adaptation theory
[24] and supports our design choice of restricting repetition of identical or similar events.
• 5 out of 7 users reported that auditory stimuli are much more effective than other types—visual feedback,
NPC interaction, confrontation with enemies—in eliciting fear/anxiety. Following up on this in user
interviews, participants usually tied it to the fact that most of the sounds they encountered alluded to things being
present in the levels that they could not see. This feeling of a hostile presence that stays out of one’s vision was a
source of much uneasiness for the subjects.
• We asked players to rate how “scary” their experience was on a scale of 1–10. Due to our small N, we were unable
to find statistical significance in our results; both the two game conditions and the first-day versus second-day
playthroughs were indistinguishable. However, 4 out of 7 players rated their Duke Spook’em
playthrough as "scarier" than they did the regular version. 2 subjects reported equal scores, and the final subject
rated the Duke Spook’em variant lower.

Users also projected sentience onto elements of the game which were not part of the Duke Spook’em system, but which
instead followed very simple rules or were completely random (such as the monster and how/when it appeared).

5 DISCUSSION AND LIMITATIONS


5.1 Relatively small test groups
We performed a single data-gathering experiment with 21 participants and then a qualitative study with 7 subjects. This
is on par with other research into affect in games. While we are confident in our results, given how consistent they are
between the two experiments, we hesitate to generalize any of them to the broader population. Our study would
benefit from a larger and more diverse test group for gathering model-training data, and the end-to-end evaluation
could certainly use more participants.

5.2 Expanding in-game telemetry


Originally, our goal was to create a relatively game-agnostic system—none of the metrics used to train our inference
model are tied to any specific elements or behaviours, so the models could predict the value of arousal on any
(reasonably similar) first-person experience. Duke Spook’em is not tied to any game-specific behaviors either: in the
implementation, actions are just function callbacks with the necessary parameters (see the sketch below). However,
this generality came at a price: it is difficult to gather information on players’ behaviour without context. The
original work of Makantasis et al. [44] also leverages high-level abstract heuristics such as “detrimental/helpful
events in the game” or “enemies approaching.” We believe that if we were to forgo some generalizability in favour of a
broader spectrum of measurements, we could achieve better results.
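
As an illustration of this design, the following sketch shows the kind of game-agnostic action interface described above; the names, parameters, and example actions are hypothetical stand-ins for our actual callbacks:

from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Action:
    # A game-agnostic action: the modulation system only sees a callback
    # and its parameters, never any game-specific logic.
    name: str
    weight: float                    # authored intensity weight
    callback: Callable[..., None]    # supplied by the host game
    params: dict[str, Any] = field(default_factory=dict)

    def execute(self) -> None:
        self.callback(**self.params)


# The host game registers whatever it can do; the system stays unaware
# of what the callbacks actually mean in-game.
actions = [
    Action("distant_whisper", 0.3,
           lambda volume: print(f"whisper, volume={volume}"), {"volume": 0.6}),
    Action("flicker_lights", 0.5,
           lambda duration_s: print(f"flicker for {duration_s}s"), {"duration_s": 2.0}),
]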

5.3 More physiological data to explore


In the end, we used features extracted from only two physiological sensors as our privileged information, and there is
a variety of other possible input sources to explore, as we briefly touched upon in the Background section [43, 72].
Since our results show promise for the LUPI paradigm, it will be valuable to explore what other physiological signals
can be integrated into affect sensing models as privileged information. Affect modelling from different
features—physiological, facial, body posture, etc.—is a research-dense field with many promising results [67] waiting
to be utilized in a LUPI-enabled model. We think that—given our current experimental protocol with biometric
sensors—additional signals could easily be incorporated into the dataset. For example, both ECG (heart data) and EEG
(brainwave data) have been shown to be reliable sources for measuring an individual’s stress response [4, 60] and have
even been employed in training machine learning inference models for stress detection [64]. We also believe that other
affect sensing modalities could be explored; we are particularly interested in emotion detection based on facial
feature analysis. This is a well-explored direction in affect modelling, with strong results reported from both
traditional machine learning models [34] and more exotic approaches such as fuzzy neural networks [26], and it is a
natural fit for real-world systems that must avoid intrusive sensing at runtime.
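
As a concrete illustration of how an additional privileged signal could be folded in, the sketch below takes the generalized-distillation view of LUPI [40]: a teacher model is fit on privileged physiological features, and a telemetry-only student then imitates it. The array names and random placeholder data are assumptions, and imitating the teacher's hard labels is a simplification of full distillation, which blends soft targets with the ground truth:

import numpy as np
from sklearn.svm import SVC

# Placeholder data: X_priv stands in for physiological features (EDA,
# ECG, ...), X_tele for telemetry-only features, y for fear labels.
rng = np.random.default_rng(0)
X_priv = rng.normal(size=(200, 8))
X_tele = rng.normal(size=(200, 20))
y = rng.integers(0, 3, size=200)

# Teacher: trained with access to the privileged physiological features.
teacher = SVC().fit(X_priv, y)
teacher_labels = teacher.predict(X_priv)

# Student: sees telemetry only, so no sensors are required at runtime.
student = SVC().fit(X_tele, teacher_labels)
runtime_predictions = student.predict(X_tele[:5])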

5.4 More efficient implementation of the classifier


As we have noted throughout the paper, our more accurate models took too long to produce a prediction to be usable in
such a frequent loop. Many of these issues could be solved by using more efficient libraries and implementations—most
notably, a substantial improvement could be made by utilizing a dedicated library for SVM ensemble voting such
as ensembleSVM [12]. Another way to improve performance is dimensionality reduction of our dataset. Our
preliminary inquiries into Principal Component Analysis (PCA) showed that 98% of the variance in our data could be
explained by only two principal components. Using them to train the classifiers, we saw a 12% improvement in training
time and a 9% improvement when running predictions. A more detailed and comprehensive study could likely produce
even better results.
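
A minimal sketch of the PCA reduction described above, assuming scikit-learn and a placeholder feature matrix (the 98%/two-component figures come from our preliminary inquiry, not from this toy data):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data standing in for our telemetry features and labels.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(500, 20)), rng.integers(0, 3, size=500)

# Standardize, then keep just enough components to explain 98% of the
# variance before training the classifier.
clf = make_pipeline(StandardScaler(), PCA(n_components=0.98), SVC())
clf.fit(X, y)
print(clf.named_steps["pca"].n_components_)  # components actually kept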

5.5 More labels and the potential of regression models


While the results of our inference model are encouraging, operating on only 3 fear levels (the number of labels used in
the final gameplay demo) does not give us the fine-grained control over content that we desire, and we have to work
around that limitation by introducing auxiliary systems and rules—such as the assumption that baseline fear shifts up
after every milestone event. With the additional work mentioned above—introducing more features and building more
efficient models—we should theoretically be able to build a system capable of much more granular predictions.
What is more, we believe that our good results across different varieties of the multi-class classification problem
(2-, 3- and 5-class) might be explained by the fact that we are essentially fitting a very low-resolution regression
model to the trace function of the user’s affect. This point is further underscored by how closely the tonic responses
matched reported user affect in our experiments (see Figure 4). As such, we believe it would be interesting to explore
training actual regression models on our data. LUPI was originally developed for classification problems, but research
using it for regression does exist [3].
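
As a first step in that direction, a plain (non-LUPI) support vector regressor could be fit directly to the continuous annotation trace; a LUPI-enabled variant would follow the approach in [3]. A sketch with assumed array names and placeholder data:

import numpy as np
from sklearn.svm import SVR

# Placeholder data: X_tele stands in for windowed telemetry features,
# arousal_trace for the continuous annotation resampled to those windows.
rng = np.random.default_rng(2)
X_tele = rng.normal(size=(300, 20))
arousal_trace = np.clip(np.cumsum(rng.normal(scale=0.05, size=300)), 0.0, None)

reg = SVR(kernel="rbf").fit(X_tele, arousal_trace)
graded_fear = reg.predict(X_tele[:10])  # continuous values instead of 3 bins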

5.6 Training Duke Spook’em


Right now, Duke Spook’em is an end-to-end system which incorporates a machine learning inference model and
uses its predictions to drive the gameplay loop. The system itself works well, as shown by our results, but we can
imagine driving the whole experience with machine learning. For example, the weights in the action database could
be learned through additional experiments in which groups of players receive random presentations of the actions,
rather than being pre-authored as they are now (a sketch of this idea follows below). Indeed, in a highly systemic
game such as Breath of the Wild—which leverages physics-based interactions and the emergent gameplay that the
interaction of its many systems provides—actions themselves could even be generated through random combinations of
possible stimuli, presented to players, and logged into the action database. Levels, too, could be generated
procedurally by the system, with long hallways abruptly coming to an end as users’ tension builds to a climax.
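
A hypothetical sketch of how such weight learning could look: during calibration sessions, actions are presented at random and each weight is updated as a running average of the arousal change the action produced (everything here is illustrative, not part of the current system):

from collections import defaultdict


class ActionWeightLearner:
    # Learns how strongly each action moves arousal, as a possible
    # replacement for pre-authored weights in the action database.
    def __init__(self):
        self.counts = defaultdict(int)
        self.weights = defaultdict(float)

    def observe(self, action_name: str,
                arousal_before: float, arousal_after: float) -> None:
        # Incremental mean of the arousal delta this action produced.
        delta = arousal_after - arousal_before
        self.counts[action_name] += 1
        n = self.counts[action_name]
        self.weights[action_name] += (delta - self.weights[action_name]) / n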

5.7 Emotion detection beyond fear


Our system predicts the arousal of players and, given the horror environment of our game, we have assumed a certain
valence of their emotions—placing players in the stress/fear region. While this worked for a horror game, it would not
carry over to non-horror games or other experiences, as players might transition between different emotions, and we
currently have no way of contextualizing their arousal in such cases. Therefore, to make the system truly robust, it
could be coupled with a basic emotion recognition mechanism.

5.8 Affect prediction beyond video games


Even though our system is not at all aware of users’ valence, its ability to model arousal works well. Arousal can be
variously interpreted as the fear, mental load, or stress currently experienced by a user, and these concepts are
broadly applicable even outside of games. For example, the set of features used to fit the inference model consisted
mostly of metrics of players’ movement and mouse inputs; this suggests that the system could be interesting to test on
users’ real movements as they play a game or perform a task in virtual or augmented reality. The predicted arousal
could be used to modify a game or, for more mundane tasks, to drive the display of UI elements—simplifying them when
the user gets annoyed and slowly showing more options as they calm down (a toy sketch follows).
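
A toy sketch of the UI idea, assuming a normalized arousal prediction in [0, 1]; the mapping and bounds are illustrative:

def ui_detail_level(predicted_arousal: float,
                    min_items: int = 3, max_items: int = 12) -> int:
    # Show fewer options when the user is stressed, more as they calm down.
    calm = 1.0 - min(max(predicted_arousal, 0.0), 1.0)
    return min_items + round(calm * (max_items - min_items))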

6 CONCLUSION
This paper presented Duke Spook’em—an end-to-end system that combines the findings of cutting-edge research in the
field of video game affect modelling and puts them to effective use. Duke Spook’em closes the affective loop in a horror
video game by implementing a constant predict-adjust-feedback loop based on arousal prediction, leading to increased
player satisfaction. Along the way, we discussed and evaluated affect ground truth via ordinal and unbounded annotation,
and introduced a novel physiological marker of long-running tension: the tonic component of the EDA signal. Our
experiment showed a rank correlation of 0.46 between user-reported affect and tonic EDA, which gave us the confidence
to use the signal as privileged information for training a LUPI-enabled SVM ensemble; this ensemble is now the backbone
of Duke Spook’em. Our end-to-end study showed that Duke Spook’em made for a more immersive and scarier experience.
We believe that the work presented here lays out a course for designing robust affect detection and modulation systems
that can be practically applied in the real world.

REFERENCES
[1] 2022. Crowdsourcing research questions in science. Research Policy 51, 4 (2022), 104491. https://doi.org/10.1016/j.respol.2022.104491
[2] K. Anderson and P.W. McOwan. 2006. A real-time automated system for the recognition of human facial expressions. IEEE Transactions on Systems,
Man, and Cybernetics, Part B (Cybernetics) 36, 1 (2006), 96–105. https://doi.org/10.1109/TSMCB.2005.854502
[3] Amina Asif, Muhammad Dawood, and Fayyaz ul Amir Afsar Minhas. 2018. A generalized meta-loss function for distillation and learning using
privileged information for classification and regression. CoRR abs/1811.06885 (2018). arXiv:1811.06885 http://arxiv.org/abs/1811.06885
[4] Mahsa Bagheri and Sarah D Power. 2020. EEG-based detection of mental workload level and stress: the effect of variation in each state on
classification of the other. Journal of Neural Engineering 17, 5 (Oct. 2020), 056015. https://doi.org/10.1088/1741-2552/abbc27
[5] Mathias Benedek and Christian Kaernbach. 2010. A continuous measure of phasic electrodermal activity. J Neurosci Methods 190, 1 (May 2010),
80–91.
[6] Daniel R. Bersak, Gary McDarby, Daragh McDonnell, Brian McDonald, and Rahul Karkun. 2001. Intelligent Biofeedback using an Immersive
Competitive Environment.
[7] Nadia Bianchi-berthouze and Andrea Kleinsmith. 2003. A categorical approach to affective gesture recognition. Connection Science 15, 4 (2003),
259–269. https://doi.org/10.1080/09540090310001658793 arXiv:https://doi.org/10.1080/09540090310001658793
[8] Wolfram Boucsein. 2012. Electrodermal activity. Springer Science & Business Media.
[9] Ginevra Castellano, Santiago D. Villalba, and Antonio Camurri. 2007. Recognising Human Emotions from Body Movement and Gesture Dynamics.
In ACII.
[10] Jair Cervantes, Farid Garcia-Lamont, Lisbeth Rodríguez-Mazahua, and Asdrubal Lopez. 2020. A comprehensive survey on support vector machine
classification: Applications, challenges and trends. Neurocomputing 408 (2020), 189–215. https://doi.org/10.1016/j.neucom.2019.10.118
[11] Luca Chittaro and Riccardo Sioni. 2014. Affective computing vs. affective placebo: Study of a biofeedback-controlled game for relaxation training.
International Journal of Human-Computer Studies 72, 8 (2014), 663–673. https://doi.org/10.1016/j.ijhcs.2014.01.007 Designing for emotional wellbeing.
[12] Marc Claesen, Frank De Smet, Johan A.K. Suykens, and Bart De Moor. 2014. EnsembleSVM: A Library for Ensemble Learning Using Support Vector
Machines. Journal of Machine Learning Research 15, 4 (2014), 141–145. http://jmlr.org/papers/v15/claesen14a.html
[13] Cristina Conati and Heather Maclaren. 2009. Modeling User Affect from Causes and Effects, Vol. 5535. 4–15. https://doi.org/10.1007/978-3-642-
02247-0_4


[14] L.C. De Silva and Pei Chi Ng. 2000. Bimodal emotion recognition. In Proceedings Fourth IEEE International Conference on Automatic Face and Gesture
Recognition (Cat. No. PR00580). 332–335.
[15] Anders Drachen, Georgios Yannakakis, Lennart Nacke, and Anja Pedersen. 2010. Correlation between Heart Rate, Electrodermal Activity and Player
Experience in First-Person Shooter Games (Pre-print). https://doi.org/10.1145/1836135.1836143
[16] Paul Ekman. 1992. An argument for basic emotions. Cognition and Emotion 6, 3-4 (1992), 169–200. https://doi.org/10.1080/02699939208411068
arXiv:https://doi.org/10.1080/02699939208411068
[17] I.A. Essa and A.P. Pentland. 1997. Coding, analysis, interpretation, and recognition of facial expressions. IEEE Transactions on Pattern Analysis and
Machine Intelligence 19, 7 (1997), 757–763. https://doi.org/10.1109/34.598232
[18] Theresa M. Fleming, Lynda Bavin, Karolina Stasiak, Eve Hermansson-Webb, Sally N. Merry, Colleen Cheek, Mathijs Lucassen, Ho Ming Lau, Britta
Pollmuller, and Sarah Hetrick. 2017. Serious Games and Gamification for Mental Health: Current Status and Promising Directions. Frontiers in
Psychiatry 7 (2017). https://doi.org/10.3389/fpsyt.2016.00215
[19] N. Fragopanagos and J.G. Taylor. 2005. Emotion recognition in human–computer interaction. Neural Networks 18, 4 (2005), 389–405. https:
//doi.org/10.1016/j.neunet.2005.03.006 Emotion and Brain.
[20] Michael Grimm, Kristian Kroschel, Emily Mower, and Shrikanth Narayanan. 2007. Primitives-based evaluation and estimation of emotions in speech.
Speech Communication 49, 10 (2007), 787–800. https://doi.org/10.1016/j.specom.2007.01.010 Intrinsic Speech Variations.
[21] Hatice Gunes and Massimo Piccardi. 2007. Bi-modal emotion recognition from expressive face and body gestures. Journal of Network and Computer
Applications 30, 4 (2007), 1334–1345. https://doi.org/10.1016/j.jnca.2006.09.007 Special issue on Information technology.
[22] Peter Haddawy. 2013. Generating Bayesian Networks from Probability Logic Knowledge Bases. https://doi.org/10.48550/ARXIV.1302.6811
[23] Jason Matthew Harley. 2016. Chapter 5 - Measuring Emotions: A Survey of Cutting Edge Methodologies Used in Computer-Based Learning
Environment Research. In Emotions, Technology, Design, and Learning, Sharon Y. Tettegah and Martin Gartmeier (Eds.). Academic Press, San Diego,
89–114. https://doi.org/10.1016/B978-0-12-801856-9.00005-0
[24] Harry Helson. 1948. Adaptation-level as a basis for a quantitative theory of frames of reference. Psychological Review 55, 6 (1948), 297–313.
https://doi.org/10.1037/h0056721
[25] Eva Hudlicka. 2008. Affective computing for game design. 4th International North-American Conference on Intelligent Games and Simulation, Game-On
’NA 2008 (01 2008), 5–12.
[26] Spiros V. Ioannou, Amaryllis T. Raouzaiou, Vasilis A. Tzouvaras, Theofilos P. Mailis, Kostas C. Karpouzis, and Stefanos D. Kollias. 2005. Emotion
recognition through facial expression analysis based on a neurofuzzy network. Neural Networks 18, 4 (2005), 423–435. https://doi.org/10.1016/j.
neunet.2005.03.004 Emotion and Brain.
[27] Rachael E. Jack, Oliver G.B. Garrod, and Philippe G. Schyns. 2014. Dynamic Facial Expressions of Emotion Transmit an Evolving Hierarchy of
Signals over Time. Current Biology 24, 2 (2014), 187–192. https://doi.org/10.1016/j.cub.2013.11.064
[28] Myounghoon Jeon. 2017. Chapter 1 - Emotions and Affect in Human Factors and Human–Computer Interaction: Taxonomy, Theories, Approaches,
and Methods. In Emotions and Affect in Human Factors and Human-Computer Interaction, Myounghoon Jeon (Ed.). Academic Press, San Diego, 3–26.
https://doi.org/10.1016/B978-0-12-801851-4.00001-X
[29] Johan Jeuring, Frans Grosfeld, Bastiaan Heeren, Michiel Hulsbergen, Richta Ijntema, Vincent Jonker, Nicole Mastenbroek, Maarten van der Smagt,
Frank Wijmans, Majanne Wolters, and Henk Zeijts. 2015. Communicate! — A Serious Game for Communication Skills —. 9307 (01 2015), 513–517.
https://doi.org/10.1007/978-3-319-24258-3_49
[30] Asha Kapur, Ajay Kapur, Naznin Virji-Babul, George Tzanetakis, and Peter Driessen. 2005. Gesture-Based Affective Computing on Motion Capture
Data. Affective Computing and Intelligent Interaction 3784, 1–7. https://doi.org/10.1007/11573548_1
[31] Nuri Kara. 2021. A Systematic Review of the Use of Serious Games in Science Education. Contemporary Educational Technology 13 (01 2021), ep295.
https://doi.org/10.30935/cedtech/9608
[32] Elizabeth A. Kensinger. 2007. Negative Emotion Enhances Memory Accuracy: Behavioral and Neuroimaging Evidence. Current Directions in
Psychological Science 16, 4 (2007), 213–218. https://doi.org/10.1111/j.1467-8721.2007.00506.x arXiv:https://doi.org/10.1111/j.1467-8721.2007.00506.x
[33] Elizabeth A. Kensinger. 2009. Remembering the Details: Effects of Emotion. Emotion Review 1, 2 (2009), 99–113. https://doi.org/10.1177/
1754073908100432 arXiv:https://doi.org/10.1177/1754073908100432 PMID: 19421427.
[34] Amjad Rehman Khan. 2022. Facial Emotion Recognition Using Conventional Machine Learning and Deep Learning Methods: Current Achievements,
Analysis and Remaining Challenges. Information 13, 6 (2022). https://doi.org/10.3390/info13060268
[35] K. H. Kim, Seok Won Bang, and S. R. Kim. 2006. Emotion recognition system using short-term monitoring of physiological signals. Medical and
Biological Engineering and Computing 42 (2006), 419–427.
[36] Kang-Kue Lee, Youn-Ho Cho, and Kyu-Sik Park. 2006. Robust Feature Extraction for Mobile-Based Speech Emotion Recognition System. Springer Berlin
Heidelberg, Berlin, Heidelberg, 470–477. https://doi.org/10.1007/978-3-540-37258-5_48
[37] Christine Lisetti, Fatma Nasoz, Cynthia Lerouge, Onur Ozyer, and Kaye Alvarez. 2003. Developing multimodal intelligent affective interfaces for
tele-home health care. Int. J. Hum.-Comput. Stud. 59 (07 2003), 245–255. https://doi.org/10.1016/S1071-5819(03)00051-X
[38] Yun Liu and Siqing Du. 2017. Psychological stress level detection based on electrodermal activity. Behav Brain Res 341 (Dec. 2017), 50–53.
[39] Phil Lopes, Georgios Yannakakis, and Antonios Liapis. 2017. RankTrace: Relative and unbounded affect annotation. In 2017 Seventh International
Conference on Affective Computing and Intelligent Interaction (ACII). 158–163. https://doi.org/10.1109/ACII.2017.8273594
[40] David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. 2015. Unifying distillation and privileged information. (11 2015).

[41] Erika Lutin, Ryuga Hashimoto, Walter De Raedt, and Chris Van Hoof. 2021. Feature Extraction for Stress Detection in Electrodermal Activity.
In Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 2: BIOSIGNALS,. INSTICC,
SciTePress, 177–185. https://doi.org/10.5220/0010244601770185
[42] Hugo Lövheim. 2012. A new three-dimensional model for emotions and monoamine neurotransmitters. Medical Hypotheses 78, 2 (2012), 341–348.
https://doi.org/10.1016/j.mehy.2011.11.016
[43] Ilias Maglogiannis, Eirini Kalatha, and Efrosyni-Alkisti Paraskevopoulou-Kollia. 2014. An overview of Affective Computing from the Physiology and
Biomedical Perspective. 367–395. https://doi.org/10.1201/b17080-19
[44] Konstantinos Makantasis, David Melhart, Antonios Liapis, and Georgios N. Yannakakis. 2021. Privileged Information for Modeling Affect In The
Wild. CoRR abs/2107.10552 (2021). arXiv:2107.10552 https://arxiv.org/abs/2107.10552
[45] Suzan Mansourian, Jacob Corcoran, Anders Enjin, Christer Löfstedt, Marie Dacke, and Marcus C Stensmyr. 2016. Fecal-Derived Phenol Induces
Egg-Laying Aversion in Drosophila. Curr Biol 26, 20 (Sept. 2016), 2762–2769.
[46] David Melhart, Antonios Liapis, and Georgios N. Yannakakis. 2022. The Arousal Video Game AnnotatIoN (AGAIN) Dataset. IEEE Transactions on
Affective Computing 13, 4 (oct 2022), 2171–2184. https://doi.org/10.1109/taffc.2022.3188851
[47] David Melhart, Antonios Liapis, and Georgios Yannakakis. 2019. PAGAN: Video Affect Annotation Made Easy. In 2019 8th International Conference
on Affective Computing and Intelligent Interaction (ACII). 130–136. https://doi.org/10.1109/ACII.2019.8925434
[48] Victor Motogna, Georgina Lupu-Florian, and Eugen Lupu. 2021. Strategy For Affective Computing Based on HRV and EDA. In 2021 International
Conference on e-Health and Bioengineering (EHB). 1–4. https://doi.org/10.1109/EHB52898.2021.9657654
[49] Sara Mourad, Ahmed Tewfik, and Haris Vikalo. 2017. Data subset selection for efficient SVM training. In 2017 25th European Signal Processing
Conference (EUSIPCO). 833–837. https://doi.org/10.23919/EUSIPCO.2017.8081324
[50] R. Nakatsu, A. Solomides, and N. Tosa. 1999. Emotion recognition and its application to computer agents with spontaneous interactive capabilities.
In Proceedings IEEE International Conference on Multimedia Computing and Systems, Vol. 2. 804–808 vol.2. https://doi.org/10.1109/MMCS.1999.778589
[51] Jakub Nalepa and Michal Kawulok. 2019. Selecting training sets for support vector machines: a review. Artificial Intelligence Review 52, 2 (01 Aug
2019), 857–900. https://doi.org/10.1007/s10462-017-9611-1
[52] Yiing Y’ng Ng, Chee Weng Khong, and Robert Jeyakumar Nathan. 2018. Evaluating Affective User-Centered Design of Video Games Using Qualitative
Methods. International Journal of Computer Games Technology 2018 (04 Jun 2018), 3757083. https://doi.org/10.1155/2018/3757083
[53] Binh T. Nguyen, Minh H. Trinh, Tan V. Phan, and Hien D. Nguyen. 2017. An efficient real-time emotion detection using camera and facial landmarks.
In 2017 Seventh International Conference on Information Science and Technology (ICIST). 251–255. https://doi.org/10.1109/ICIST.2017.7926765
[54] Pedro A. Nogueira, Vasco Torres, Rui Rodrigues, Eugénio Oliveira, and Lennart E. Nacke. 2016. Vanishing scares: biofeedback modulation of affective
player experiences in a procedural horror game. Journal on Multimodal User Interfaces 10, 1 (01 Mar 2016), 31–62. https://doi.org/10.1007/s12193-
015-0208-1
[55] W. Gerrod Parrott. 2004. The Nature of Emotion. Blackwell Publishing, Malden, 5–20.
[56] R.W. Picard, E. Vyzas, and J. Healey. 2001. Toward machine emotional intelligence: analysis of affective physiological state. IEEE Transactions on
Pattern Analysis and Machine Intelligence 23, 10 (2001), 1175–1191. https://doi.org/10.1109/34.954607
[57] R. W. Picard. 1995. Affective Computing.
[58] Rosalind W. Picard. 1997. Affective Computing. MIT Press, Cambridge, MA, USA.
[59] Robert Plutchik. 1980. Chapter 1 - A General Psychoevolutionary Theory of Emotion. In Theories of Emotion, Robert Plutchik
and Henry Kellerman (Eds.). Academic Press, 3–33. https://doi.org/10.1016/B978-0-12-558701-3.50007-7
[60] Sara Pourmohammadi and Ali Maleki. 2020. Stress detection using ECG and EMG signals: A comprehensive study. Computer Methods and Programs
in Biomedicine 193 (2020), 105482. https://doi.org/10.1016/j.cmpb.2020.105482
[61] James A. Russell. 1980. A circumplex model of affect. Journal of Personality and Social Psychology 39, 6 (1980), 1161–1178. https://doi.org/10.1037/
h0077714
[62] Andreja Samčović. 2018. Serious games in military applications. Vojnotehnicki glasnik 66 (07 2018), 597–613. https://doi.org/10.5937/vojtehg66-16367
[63] Petra Sundström. 2005. Exploring the Affective Loop.
[64] Konstantinos Tzevelekakis, Zinovia Stefanidi, and George Margetis. 2021. Real-Time Stress Level Feedback from Raw Ecg Signals for Personalised,
Context-Aware Applications Using Lightweight Convolutional Neural Network Architectures. Sensors (Basel) 21, 23 (Nov. 2021).
[65] Vladimir Vapnik and Rauf Izmailov. 2015. Learning Using Privileged Information: Similarity Control and Knowledge Transfer. J. Mach. Learn. Res.
16, 1 (jan 2015), 2023–2049.
[66] Vladimir Vapnik and Akshay Vashist. 2009. A new learning paradigm: Learning using privileged information. Neural Networks 22, 5 (2009), 544–557.
https://doi.org/10.1016/j.neunet.2009.06.042 Advances in Neural Networks Research: IJCNN2009.
[67] Mincheol Whang and Joasang Lim. 2008. A Physiological Approach to Affective Computing. 5 pages. https://doi.org/10.5772/6174
[68] Georgios Yannakakis and Julian Togelius. 2011. Experience-Driven Procedural Content Generation. Affective Computing, IEEE Transactions on 2 (07
2011), 147–161. https://doi.org/10.1109/T-AFFC.2011.6
[69] Georgios N. Yannakakis, Roddy Cowie, and Carlos Busso. 2017. The ordinal nature of emotions. In 2017 Seventh International Conference on Affective
Computing and Intelligent Interaction (ACII). 248–255. https://doi.org/10.1109/ACII.2017.8273608
[70] Georgios N. Yannakakis and Héctor P. Martínez. 2015. Grounding truth via ordinal annotation. In 2015 International Conference on Affective
Computing and Intelligent Interaction (ACII). 574–580. https://doi.org/10.1109/ACII.2015.7344627

[71] Beste F. Yuksel. 2018. Introduction to Affective Computing. https://www.cs.usfca.edu/~byuksel/affectivecomputing/presentations/
IntroToAffectiveComputing.pdf. Accessed: 2022-03-03.
[72] Zhihong Zeng, Maja Pantic, Glenn I. Roisman, and Thomas S. Huang. 2009. A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous
Expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 1 (2009), 39–58. https://doi.org/10.1109/TPAMI.2008.52

