
Motivation and Emotion, Vol. 21, No. 4, 1997

Arousal and Valence in the Direct Scaling of Emotional Response to Film Clips1

Nancy Alvarado2
University of California, San Francisco

Contributions of differential attention to valence versus arousal (Feldman,
1995) in self-reported emotional response may be difficult to observe due to
(1) confounding of valence and arousal in the labeling of rating scales, and
(2) the assumption of an interval scale type. Ratings of emotional response to
film clips (Ekman, Friesen, & Ancoli, 1980) were reanalyzed as categorical
(nominal) in scale type using consensus analysis. Consensus emerged for
valence-related scales but not for arousal scales. Scales labeled Interest and
Arousal produced a distribution of idiosyncratic responses across the scale,
whereas scales labeled Happiness, Anger, Sadness, Fear, Disgust, Surprise, and
Pain produced consensual response. Magnitude of valenced response varied
with both stimulus properties and self-reported arousal.

Feldman (1995) presented evidence that individuals differ in their attention
to two orthogonal dimensions of emotion: valence (evaluation) and arousal.
These differences were noted when subjects were asked to make periodic
mood ratings using scales that confound these two aspects of affective ex-
perience. Feldman analyzed these ratings in the context of Russell's (1980)
circumplex model and Watson and Tellegen's (1985) dimensions of positive
affect (PA) and negative affect (NA) and suggested that the structure of

1Preparation of this article was supported in part by National Institute of Mental Health
(NIMH) grant MH18931 to Paul Ekman and Robert Levenson for the NIMH Postdoctoral
Training Program in Emotion Research. I thank Paul Ekman for permitting access to the
data analyzed here. I also thank Jerome Kagan and several anonymous reviewers for their
helpful comments on this manuscript.
2Address all correspondence concerning this article to Nancy Alvarado, who is now at the
Department of Psychology (0109), University of California at San Diego, 9500 Gilman Drive,
La Jolla, California 92093-0109.


affect changes with the focus of attention. She speculated that valence focus
"may be associated with the tendency to attend to environmental, particu-
larly social cues" (p. 163), whereas arousal focus may be related to internal
(somesthetic) cues, citing Blascovich (1990; Blascovich et al., 1992). This
paper presents support for Feldman's views, in a direct-scaling self-report
context where valence and arousal are reported independently and the en-
vironmental cues are held constant, using data originally collected by Ek-
man, Friesen, and Ancoli (1980).

Direct Scaling Assumptions

Direct scaling of emotional response occurs when a subject is exposed
to an affect-inducing stimulus, then asked to introspect and rate the amount
of some affect using a rating scale, often labeled with the name of an emo-
tion to be reported, and typically numbered in intervals, such as from 1 to
7. Researchers frequently anchor the endpoints of such scales with descrip-
tive phrases such as not at all angry, extremely angry, or most anger ever felt
in my life. These ratings are treated as judgments on an interval, continuous
scale. They are then averaged to produce means which are compared using
analysis of variance (ANOVA) or t-test.
There is some evidence that self-report judgments of emotional re-
sponse are consistent across time for the same individual (Larsen & Diener,
1985, 1987), that self report varies systematically with certain physiological
changes associated with emotion and thus may be a valid indicator of emo-
tional response (Levenson, 1992), and that higher ratings on a scale do
correspond to greater emotional experience for the same individual (mono-
tonicity). These findings justify assumption of an ordinal scale type during
data analysis. On the other hand, there is no evidence that the subjective
distances between adjacent numbers on every portion of the scale are equal,
as would be necessary in order to assume that the data are interval in na-
ture. Further, aggregation of data and interrater comparisons are problem-
atic because it is unclear how individual differences in emotional response
are related to individual differences in the use of rating scales. Nor have
the distances between numbers been shown to correspond to the same sub-
jective differences in response for each individual in a study.
Consider temperature as an analogy. We can use an objective scale,
such as the Fahrenheit scale, to evaluate the accuracy of subjective judg-
ments. However, if we had no such scale, but instead asked subjects to
rate temperature based upon the hottest or coldest temperatures they had
ever experienced, their subjective experience would be confounded with
variations in their devised scales. Unless we know the anchor points and
scale intervals, we cannot know whether two subjects reporting different
temperature ratings for the same stimulus are using the same scale but
experiencing the temperature differently, or experiencing the temperature
as the same but using different scales. If we ignore these difficulties and
average their ratings, we obtain a measure that is useful in certain experi-
mental contexts but insensitive to individual variations in subjective expe-
rience. Rather, we have a scale that assumes that individual differences are
unimportant or nonexistent.
No objective physical unit of measurement exists to compare against
self-reported emotional experience. Even when we supply a 7-point scale
anchored by descriptive phrases, we have no way of knowing how the in-
dividual interprets such phrases, e.g., how much anger one person has ever
felt in his or her lifetime, compared to the maximum experienced by an-
other. Further, anchoring using descriptive phrases such as most emotion
ever felt in your life invites subjects to apply a scale with unequal distances
between intervals, such that the most emotion ever felt on a 10-point scale
is not 10 times the amount felt when 1 is reported, but probably far greater.
Use of a scale with 100 rather than 10 divisions does not remedy this prob-
lem.
Use of rating scales to describe emotion is further complicated if mag-
nitude is part of the meaning of the label used to identify the scale itself.
For example, it is unclear how the difference in meaning between scale
labels such as anxiety and fear, or annoyance and fury, would affect the
judgments of magnitude made using that scale. Would an experience rated
in the middle of an annoyance scale be rated lower if the scale were labeled
frustration, anger, or rage?
Given these difficulties, the direct scaling of emotional response ap-
pears to be, at best, ordinal. As Townsend and Ashby (1984) noted, ". . .
if the strength of one's data is only ordinal, as much of that in the social
sciences seems to be, then even a comparison of group mean differences
via the standard Z or t test or by analysis of variance is illegitimate. Only
those statements and computations that are invariant under monotone (or-
der is preserved) transformations are permissible" (p. 395). When the pur-
pose of a study is merely to demonstrate a difference using self report as
a dependent variable, then the measurement concerns described above are
unlikely to affect the validity of the findings. However, when these means
tests are used to assert the equality of stimuli presented to evoke emotional
response, or the efficacy of such stimuli as an elicitor of a specific emotion,
then the concerns raised above become crucial to the findings. Everything
that follows in such a study rests upon an initial assumption that mean
self-report values are an accurate index of emotional response.

This problem is relevant to several recent studies investigating the con-
gruence between facial activity and self-reported emotional response, as
noted by Ruch (1995). In an ongoing controversy over whether smiling is
an indicator of expressed feeling, Fridlund (1991) reported that happiness
ratings did not parallel electromyographic (EMG) monitoring of smiling
among subjects viewing film clips, but seemed related instead to the so-
ciality of the viewing condition. Hess, Banse, and Kappas (1995) improved
the measurement of facial activity by monitoring Duchenne versus non-
Duchenne smiling and varied the amusement level of the film stimuli pre-
sented as well as the viewing context. They found a more complex
relationship between social context and smiling. In both studies, the crucial
comparison between facial activity and emotional response rested on the
accuracy of the self-report ratings, analyzed using an ANOVA across view-
ing conditions, and assumed to be a valid measure of emotional response.

Use of Direct Rating to Norm Film Clips

This study reanalyzes self-report ratings of emotional response to film
clips, originally collected by Ekman et al. (1980). These data have been
frequently cited by Fridlund (1994) because they contain anomalies that
he considers support for his view that smiling is related to social context
rather than emotional response. Fridlund's larger issue of the sociality of
smiling was addressed by Hess et al. (1995) and will not be discussed further
here. This discussion instead will focus upon the complexity involved in
demonstrating congruence between self-report ratings and facial activity (or
other behavior), and the need to improve methods of collecting and ana-
lyzing self-report data. The stimulus set used by Ekman et al. (1980) pro-
vides a useful illustration of the methodological and theoretical issues
discussed earlier because, unlike many similar studies, it includes both base-
line self-report ratings and concurrent ratings using multiple, separately la-
beled rating scales.
Ekman et al. (1980) compared self-report judgments for 35 subjects
with their measured facial expressions when viewing pleasant and unpleas-
ant film clips selected for their ability to evoke emotion. Fridlund (1994)
noted that facial expression and direct ratings agreed only for the film stim-
uli with social content, but not for a third film for which the mean rated
happiness was the same. At issue were three pleasant film clips: (1) a gorilla
playing in a zoo, (2) ocean waves, and (3) a puppy playing with a flower.
All three films evoked the same mean ratings when subjects were asked
to rate their response on a scale labeled Happiness. However, as Fridlund
noted, the film clips evoked differential amounts of facial activity, with the
gorilla film evoking the greatest duration and intensity of facial activity,
the puppy film showing the greatest frequency of facial activity, and only
seven subjects showing any facial response to the ocean film. From this,
Fridlund argued that the gorilla and puppy films were somehow more social
in nature, evoking more facial expression because such expressions only
arise from social antecedents. However, this is only true if the films did in
fact evoke the same emotional responses. As will be argued later, I believe
they did not.

Consensus Modeling

The assumptions of the random-effects ANOVA model are that re-
sponses are drawn from a normal distribution and that they are made using
an interval scale. The model further assumes that all individuals use the
same scale in the same manner (implicit to the assumption of equal vari-
ance).3 The point here is not whether analysis of variance has been correctly
applied in psychological research, but rather whether a model that assumes
minimal individual differences is suitable for exploring whether such indi-
vidual differences in fact exist. The analysis below applies consensus mod-
eling to explore (1) whether the averaging of ratings produced misleading
norms for the various film clips, (2) whether subject ratings were idiosyn-
cratic or consensual (as is implicitly assumed by the averaging of data),
and (3) whether subjects used all scales in an equivalent manner across
the rating contexts. Consensus analysis is a formal computational model
which uses the pattern of responses within a data set to predict the like-
lihood of correct response for each subject (called the competence rating),
provide an estimate of the homogeneity of response among subjects (the
mean competence), and provide confidence intervals for the correctness of
each potential response to a set of questions. While this model also makes
certain assumptions, discussed in greater detail below, it incorporates good-
ness-of-fit measures that permit an analysis of the extent to which those
assumptions have been met. Thus the model can be used to investigate the
nature of response using rating scales, and thereby to address the issues
raised above. A formal description of the model has been provided by
Batchelder and Romney (1988, 1989). Equations are provided in the Ap-
pendix.

3According to Hays (1988), these assumptions can be violated without greatly affecting results
when a fixed-effects model is used to test inferences about specific means. Violating
assumptions of normality and equal variance has serious consequences for a random-effects
model used to test inferences about the variance of the population effects.

Consensus modeling assumes that subjects draw upon shared latent
knowledge when making their responses. The source of this shared knowl-
edge may be cultural or may be derived from shared physiology or common
humanity. The model cannot distinguish between these sources of homo-
geneous responding. It assumes that intercorrelation of subject responses
across a data set occurs because subjects are drawing upon the same latent
answer key when making their responses. Therefore, the latent answer key
can be recreated using the pattern of intercorrelation. The model assumes
that subjects vary in their performance and in their access to shared knowl-
edge, but that subjects with higher correlation to the group are more expert
because they have greater access to shared knowledge. The answer key con-
fidence intervals are estimated using Bayes' theorem. Each subject's com-
petence score is used as a probability of correctness. Subjects who are more
expert because they agree more with the group are given greater weight
in producing the estimated answer key. Thus, consensus emerges not from
majority response to a particular question, but from patterns of agreement
across the entire data set.
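To make the weighting explicit, the estimation step can be sketched as follows (the
notation is mine and compresses the formal treatment in Batchelder and Romney, 1988,
1989, and in the Appendix). If subject i has competence d_i and each question offers L
alternatives, that subject matches the latent correct answer with probability
d_i + (1 - d_i)/L and gives any particular incorrect alternative with probability
(1 - d_i)/L. The estimated probability that alternative l is the correct answer z_k to
question k, given the observed responses x_ik, is then proportional to

\[ \Pr(z_k = l \mid X) \propto \prod_{i=1}^{N} \left( d_i + \frac{1 - d_i}{L} \right)^{\delta(x_{ik},\, l)} \left( \frac{1 - d_i}{L} \right)^{1 - \delta(x_{ik},\, l)}, \]

where \delta(x_{ik}, l) equals 1 when subject i chose alternative l and 0 otherwise.
Subjects with high competence contribute large likelihood ratios, which is why agreement
across the data set, rather than a simple majority vote on any one question, determines
the estimated key.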
For purposes of this study, the question was: "What number on this
rating scale best describes the emotional response to this film clip?" This
analysis assumes that there is a single correct number on each scale, for
each rating context, that characterizes the group. This is the same assump-
tion made when a group mean is used as a normative rating. Use of such
a mean implies that one number (e.g., 4.5) best predicts the potential re-
sponse of any individual selected at random from the population.
Using consensus analysis, we can test whether subjects assign the same
stimulus the same number on their internal subjective scales, or whether
their scales are calibrated such that the same stimulus may produce widely
varying response. This is important because it tells us something about the
consistency of emotional response across individuals. Previous studies have
also assumed that individual scales are calibrated in a similar enough man-
ner to justify the aggregation of data across subjects and the use of ANOVA
models. This approach tests whether that assumption is justified. In the
study that follows, consensus analysis results are supplemented by analysis
of the normality of the distribution of responses, and of the patterns of
correlation among the scales.

METHOD

This analysis was performed upon the original self-report data col-
lected by Ekman et al. (1980), rather than the summaries provided by the
resulting article. Additional details about the data collection procedures
were provided in that article and are omitted here, except where relevant
to the arguments presented.

Subjects

Subjects were 35 female volunteers, ages 18 to 35 years, recruited
through advertisements to participate in a study of psychophysiology.

Stimuli

Stimuli consisted of five films of 1-min duration, three intended to be
pleasant and two intended to be unpleasant. The three pleasant films (de-
scribed above) were created by Ekman and Friesen and were always shown
in the same order: gorilla, ocean, puppy. The two unpleasant films were
edited versions of a workshop accident film designed to evoke fear and
disgust. The first film depicts a man sawing off the tip of his finger. The
second shows a man dying when a plank of wood is thrust through his
chest by a circular saw. These films were always shown in this same order.

Procedure

Subjects rated their emotional responses for two baseline periods and
five film-viewing periods using a series of nine unipolar 9-point scales, la-
beled with the following terms: Interest, Anger, Disgust, Fear, Happiness,
Pain, Sadness, Surprise, and Arousal. Pain was defined for subjects as "the
experience of empathetic pain" and Arousal was explained as applying to
the total emotional state rather than to any one of the other scales pre-
sented. The other terms were not explained to subjects. Scales ranged from
0 (no emotion) to 8 (strongest feeling). Instructions explained how the ratings
were to be made (Ekman et al., 1980): "... strength of a feeling should
be viewed as a combination of (a) the number of times you felt the emo-
tion—its frequency; (b) the length of time you felt the emotion—its dura-
tion; and (c) how intense or extreme the emotions [sic] was—its intensity"
(p. 1127).
The first baseline occurred during a 20-min period in which the subject
was instructed to relax. Whether the pleasant or the unpleasant films were
shown first was counterbalanced. Ratings for all three pleasant films were made
after viewing all three films. Similarly, ratings for the two unpleasant films
were made after viewing both films. A second baseline rating was made
after rating of the first set of films, during a 5-min interval before starting
the second series of films.

RESULTS

Consensus Analysis

The following discussion is adapted from the description of consensus
modeling provided by Weller and Romney (1988). Consensus analysis pro-
vides a measure of reliability in situations where correct responses to items
are not already known. Mathematically, it closely parallels item response
theory or reliability theory, except that data are coded as given by subjects
rather than as "correct" or "incorrect," and the reliability of the subjects
is measured instead of the reliability of the items. The formal model is
described in Batchelder and Romney (1988, 1989). Additional description
of the model is provided in the Appendix. The main idea of the model is
that when correct answers exist, the answers given by subjects are likely to
be positively correlated with that correct answer key. Thus, in situations
where correct answers are unknown but assumed to exist, the pattern of
intercorrelations or agreement among subjects (called consensus) can be
used to reconstruct the latent answer key. This is similar to the idea in
reliability theory that correlations among items reflect their independent
correlation with an underlying trait or ability. Similarly, high agreement
among subjects about the answers to a set of items measuring a coherent
domain suggests the likelihood that shared knowledge exists and provides
information about what that knowledge is. In the words of Weller and Rom-
ney (1988), "A consensus analysis is a kind of reliability analysis performed
on people instead of items" (p. 75). This reliability analysis is used to make
inferences about the nature of the domain or to determine the correct an-
swers. When a correct answer key does not exist, as when subjects belong
to subcultures drawing upon different sources of shared knowledge, or
when subjects draw upon idiosyncratic knowledge, that violation of the
model's assumptions is readily apparent in the measures provided by the
model.
Ratings for each of the nine emotion-labeled scales were analyzed
separately; thus the data consisted of seven numerical ratings (one for each
rating period) for each of the 35 subjects, for each labeled scale (nine
scales). The data were treated as multiple-choice responses to the implied
question "Which number corresponds to the correct emotional response
rating for this particular film segment or baseline period?" Given the pre-
ceding discussion about scale types, it would have been preferable to ana-
lyze the data using an ordinal consensus model, but such a model has not
yet been developed. The categorical, multiple-choice model used here as-
sumes an equal probability of guessing the alternatives in its correction for
guessing. The analysis of normality (presented later) suggests that this as-
sumption is appropriate for some but not all of the rating scales. With or-
dinal data, it is more likely that guessing biases differ among the rating
alternatives (e.g., the probability of guessing 5 may be different than the
probability of guessing 0). A model incorporating such biases had not been
developed at the time this analysis was performed, but now exists (see
Klauer and Batchelder, 1996). In general, the application of a categorical
model to what we suspect is ordinal data tends to work against a finding
of consensus because subjects must agree on the exact rating number given
to each stimulus out of nine alternatives (0 to 8).
The measures used to evaluate results are (1) individual competence
scores, (2) mean competence, (3) eigenvalues produced during the principal
component analysis used to estimate the solution to the model's equations,
and (4) answer key confidence estimates. Competence scores range from
-1.00 to 1.00 and are maximum-likelihood parameter estimates. They are
best understood as estimated probabilities rather than correlation coeffi-
cients. A negative competence score indicates extreme and consistent dis-
agreement with the group across rating periods.
Batchelder and Romney (1988, 1989) established three criteria for
judging whether consensus exists in subject responses to questions about a
domain: (1) eigenvalues showing a single dominant factor (a ratio greater
than 3:1 between the first and second factors), (2) a mean competence
greater than .500, and (3) absence of negative competence scores in the
group of subjects. While failure to meet these criteria does not necessarily
rule out consensus, it can indicate a poor fit between the data and the
model.
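These criteria lend themselves to a simple mechanical check. The sketch below (in
Python; the function name and input format are my own and do not come from any
published consensus-analysis software) screens a result against the three thresholds.

def meets_consensus_criteria(eigenvalues, competences,
                             ratio_threshold=3.0, mean_threshold=0.5):
    """Screen a consensus-analysis result against the three criteria of
    Batchelder and Romney (1988, 1989).

    eigenvalues: factor eigenvalues in descending order
    competences: estimated competence scores, one per subject
    """
    single_factor = eigenvalues[0] / eigenvalues[1] > ratio_threshold
    mean_competence = sum(competences) / len(competences)
    no_negatives = all(c >= 0 for c in competences)
    return {
        "single_dominant_factor": single_factor,
        "mean_competence_above_.500": mean_competence > mean_threshold,
        "no_negative_competences": no_negatives,
        "consensus": single_factor
                     and mean_competence > mean_threshold
                     and no_negatives,
    }

Reading the paired values in Table I as first and second eigenvalues, the Anger scale
(13.348 and 1.702, mean competence .831, no negative scores) passes all three criteria,
whereas the Interest scale (1.382 and 1.144, mean competence .101, 16 negative scores)
fails all three.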
Consensus analysis results for the nine scales across the seven rating
periods are summarized in Table I. All scales except those labeled Interest
and Arousal met the criteria for consensus. In contrast, the scales for In-
terest and Arousal showed nearly half the group with negative consensus
scores, indicating severe disagreement about the correct responses on those
scales. The scales for Anger, Disgust, and Pain showed the greatest con-
sensus, with the highest mean consensus scores and with eigenvalue ratios
indicating a single dominant factor in the data. While the scales for Sadness
and Surprise each showed a single negative consensus score, the otherwise
high mean consensus scores and ratios between the eigenvalues suggest that
consensus also existed for those scales.
This finding of consensus for seven of the nine scales suggests that
subjects agreed strongly in their emotional responses to the stimuli pre-

Table I. Consensus Analysis of Nine Rating Scales Across Seven Rating Periods

Scale label   Consensus mean    SD     Eigenvalues (1st/2nd)   Negative scores    N    Confidence level
Anger              .831        .179       13.348 / 1.702              0           35       1.0000
Disgust            .795        .082       13.620 / 1.180              0           35        .9478
Fear               .699        .155        8.926 / 1.332              0           35        .9392
Happy              .580        .131        5.998 / 1.200              0           35        .9943
Pain               .793        .106       14.511 / 1.223              0           35        .9841
Sadness            .674        .290        7.136 / 1.584              1           35       1.0000
Surprise           .657        .230        8.959 / 1.096              1           35        .9838
Interest           .101        .288        1.382 / 1.144             16           35        .9363
Arousal            .150        .230        1.087 / 1.720             17           35        .8486

sented, particularly with respect to the scales labeled Anger, Disgust, and
Pain. Lesser agreement existed for Surprise and Fear, and for Happiness
and Sadness. Based upon the measures provided by this model, consensual
emotional response did not exist for the two scales labeled Arousal and
Interest. The importance of this finding will be discussed later.
Answer key confidence levels were high (M = .95), even when emo-
tional response was reported, but consensus appeared to be largely gov-
erned by agreement about the absence of negative emotion during the
pleasant film clips, and the absence of positive emotion during the unpleas-
ant film clips.4 The scales showing lower consensus (but nevertheless meet-
ing the criteria for consensus), Sadness, Happiness, and Surprise, showed
minor violations of this pattern. Because the presentation of films was
counterbalanced, half of the subjects saw pleasant films and half saw un-
pleasant films before the second baseline. From the ratings, several subjects
appeared to have carried residual negative emotional response into this
second baseline period, producing mixed ratings. They may also have car-
ried such response into the pleasant film ratings, as Ekman et al. (1980)

4This is far from a trivial finding, as several emotion theorists have hypothesized that complex
emotional responses may be blends of basic emotions and thus have insisted that multiple
scales be provided to permit subjects to express such complexity. A lack of response is thus
as meaningful as positive response on each single scale with respect to each rating context.

Table II. Predicted Emotional Responses for Nine Rating Scales

Label        Baseline 1   Gorilla   Ocean   Puppy   Baseline 2   Cut finger   Death
Anger             0          0        0       0         0            0          0
Disgust           0          0        0       0         0            0          5
Fear              0          0        0       0         0            1          8
Happiness         0          4        0       0         0            0          0
Pain              0          0        0       0         0            8          8
Sadness           0          0        0       0         0            0          0
Surprise          0          0        0       0         0            8          6
Interest          0          1        1       1         0            3          5
  Low             0          1        1       1         0            3          5
  High            4          6        7       6         2            5          7
Arousal           0          1        2       1         0            1          3
  Low             0          1        2       1         0            1          3
  Medium          1          4        1       4         1            6          5
  High            2          3        6       3         4            8          8

noted in their discussion. Nor were the pleasant films unambiguously pleas-
ant. Five subjects responded to the gorilla film with mild anger, and four
responded to the puppy film with even stronger anger (e.g., 6, 7, or 8).
Similarly, several subjects reported sadness when watching the gorilla film,
and several reported disgust while watching the puppy film. These re-
sponses may be partly explained by the content of the films. The puppy
ultimately chewed up and spit out the flower with which it was playing,
evoking disgust in some subjects. The gorilla may have aroused sadness
because it resided in a zoo. The lower consensus for the Fear and Surprise
ratings results from several subjects who claimed to have felt no surprise
or fear in response to the second workshop accident.
Model-predicted answer key responses for each of the scales during
each of the viewing periods are shown in Table II. Examination of the an-
swer key for the Happiness rating scale shows a clear difference in the
level of enjoyment among subjects for the three film clips. The gorilla film
was rated as 4, the ocean film as 0, and the puppy film as 0. The consensus
model makes these predictions by weighting each subject's response by that
subject's overall agreement with the group (the estimated probability of
correctness). Even without the model's weighting, these responses were the
modal responses among subjects for these films. It is only when all re-
sponses are averaged that higher numbers emerge for the ocean and puppy
films. To see why this occurs, consider a group in which equal numbers of
subjects give ratings of 0 and 8 and no other ratings. When these are av-
eraged to obtain a mean of 4.0, it should be evident that this rating is an
accurate portrayal of emotional response for no single subject in that group.
Nor will it be a good predictor of the response of the next subject who
views the film. The actual distribution of scores generally raises an alarm
about using the mean as an indicator of central tendency (see the analysis
of normality below).
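The arithmetic of this example is easy to verify. The short sketch below uses invented
ratings, not the Ekman et al. (1980) data, to contrast the mean with the actual
frequency distribution.

from collections import Counter
from statistics import mean

# Hypothetical ratings: half the group reports 0, the other half reports 8.
ratings = [0] * 10 + [8] * 10

print(mean(ratings))     # 4.0 -- a value that no subject actually reported
print(Counter(ratings))  # Counter({0: 10, 8: 10}) -- the bimodal reality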
During subsequent research, Ekman and Friesen edited the puppy film
to remove the portion where the puppy eats the flower, and thereby ob-
tained higher enjoyment ratings. Examination of the disgust and anger
scales provided important clues to the differing emotions evoked in indi-
vidual subjects by this particular film. The difference in content may ac-
count for the puppy film's higher frequency of smiling but lower duration
and intensity of smiling, compared to the gorilla film (Ekman et al., 1980).
Differing emotions were not reported across the nine scales for the ocean
film. This analysis shows that the ocean film was simply not as enjoyable
as the gorilla film. The finding that few subjects smiled while viewing it is
entirely consistent with the self-report ratings obtained for the ocean film.
Although responses are typically distributed across a range of response
options in any data set, even one showing strong consensus, the process of
consensus modeling permits identification of those subjects with consistently
divergent response patterns across the set of questions. These divergent sub-
jects obtain negative consensus scores during analysis. By partitioning the
data set based upon the sign of the consensus score (negative or positive),

Table III. Consensus Analysis of Interest and Arousal Subgroups Partitioned by Sign of Score

Scale label           Consensus mean    SD     Eigenvalues (1st/2nd)   Negative scores    N
Interest                   .101        .288        1.382 / 1.144             16           35
  Positive (low)           .334        .215        2.352 / 1.267              1           19
  Negative (high)          .243        .256        1.285 / 1.045              3           16
Arousal                    .150        .230        1.087 / 1.720             17           35
  Positive (low)           .402        .164        2.556 / 1.720              0           19
  Negative (high)          .232        .322        1.680 / 1.358              5           16
  Neg-posa (high)          .409        .226        2.005 / 1.304              0           11
  Neg-nega (medium)        .430        .273        2.108                      0            5

aThe Arousal negative subgroup was repartitioned and reanalyzed based
on the sign of the score. Neg = negative; Pos = positive.

Fig. 1. Examples of used and unused valenced rating scales: Ratings of Surprise for the
cut finger film clip (top) and ratings of Anger for the puppy film clip (bottom). Std. Dev.
= standard deviation.

and reanalyzing the data, it can be determined whether response is idiosyn-
cratic, or whether several divergent subjects form a coherent subgroup (per-
haps because they are members of a subculture). Partitioning and reanalysis
of the data for the Arousal and Interest scales yielded no coherent sub-
groups because the resulting partitioned data sets also failed to meet the
criteria for consensus (see Table III). Instead, responses seemed to be dis-
tributed across the range of possible responses. However, subjects with nega-
tive consensus scores on the arousal scale tended to obtain negative
consensus scores on the Interest scale as well (Goodman and Kruskal's
gamma = .54). This suggests several conclusions: (1) the scales for Arousal
and Interest do not lend themselves to this type of categorical analysis; (2)
subjects are idiosyncratic but consistent in their response using these two
scales; and (3) there is no single correct (i.e., consensual) rating response
for arousal or interest with respect to these stimuli. This suggests a quali-
tative difference in behavior among subjects when using the Arousal and
Interest scales compared to the remaining scales.

Analysis of Normality

Frequency histograms were produced for each of the nine rating scales,
by stimulus rating period. In general, a given scale was either used or un-
used (mostly 0 ratings) for a given stimulus, consistent with the consensus
analysis results described above and shown in Table II. When a scale was
used, the distribution was frequently bimodal and generally included a sub-
stantial minority reporting no affect (0 ratings), as shown in Fig. 1. In con-
trast, ratings of arousal and interest were distributed across the entire range
of scores for each rating period, as shown in Fig. 2.
Happiness ratings were spread across the entire scale for all three
pleasant films, as shown in Fig. 3. However, none of the distributions was
normal. A representative comparison of observed versus expected scores,
and detrended deviation from an expected normal distribution, are plotted
in Fig. 4. Consistent with consensus analysis, the modal response for both
the puppy and ocean films was 0. Note that although the means for the
three pleasant film clips were equal, the distributions were clearly different.
These differences, especially with respect to those reporting no affect (0
ratings), are fully consistent with the differences in smiling noted by Ekman
et al. (1980) and do not support Fridlund's interpretation that the ocean film
evoked happiness equal to the other films but elicited little smiling because it
was asocial in content.
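A detrended normal Q-Q plot of the kind shown in Fig. 4 can be generated directly from
raw ratings. The sketch below uses an invented vector of 0-8 ratings together with
standard SciPy and Matplotlib calls; it is illustrative only and does not reproduce the
original figure.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical 0-8 Happiness ratings for a single film clip.
ratings = np.array([0, 0, 0, 0, 1, 2, 2, 3, 4, 4, 5, 5, 6, 7, 8, 8])

# probplot returns theoretical normal quantiles, the ordered data, and a
# least-squares line fitted through the Q-Q points.
(osm, osr), (slope, intercept, r) = stats.probplot(ratings, dist="norm")
detrended = osr - (slope * osm + intercept)  # deviation from the fitted line

plt.scatter(osm, detrended)
plt.axhline(0.0, color="gray")
plt.xlabel("Expected normal value")
plt.ylabel("Deviation from normal")
plt.title("Detrended normal Q-Q plot (hypothetical ratings)")
plt.show()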

Patterns of Correlation

For each scale, baseline periods were significantly correlated, ratings
for pleasant films were significantly correlated, and ratings for unpleasant

Fig. 2. Ratings of Arousal for the gorilla and puppy film clips. Std. Dev. = standard deviation.

films were significantly correlated. As would be expected, arousal showed
a low correlation with scales that were unused (those with mostly 0 ratings).
For scales that were used, significant correlations were found between
arousal and valence.

Fig. 3. Ratings of Happiness for the three pleasant film clips. Std. Dev. = standard deviation.

Correlations between happiness and arousal for the seven rating pe-
riods are shown in Table IV. A significant Spearman rank-order correlation
(p < .01) was found between Arousal and Happiness for each of the three

Fig. 4. Detrended normal Q-Q plot of Happiness ratings for the gorilla film clip.

pleasant films: gorilla (r = .68), ocean (r = .68), puppy (r = .44). However,
the ocean film's Arousal and Happiness ratings were most strongly corre-
lated with the second baseline period (r = .56) rather than the other pleas-
ant films. In contrast, the Arousal ratings for the gorilla and puppy films
were not significantly correlated with the second baseline at the .05 level. This
supports the interpretation that the ocean film clip had less emotional con-
tent and served as a less affective interval between the other two pleasant
stimuli.
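Rank-order correlations of this kind are straightforward to reproduce. The sketch below
pairs invented Happiness and Arousal ratings (hypothetical values, not drawn from this
data set) and computes the Spearman coefficient with scipy.stats.spearmanr.

from scipy.stats import spearmanr

# Hypothetical 0-8 ratings from ten subjects for a single film clip.
happiness = [4, 0, 6, 2, 5, 0, 7, 3, 1, 5]
arousal   = [3, 1, 6, 2, 4, 0, 8, 2, 1, 4]

rho, p_value = spearmanr(happiness, arousal)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")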
That most subjects reported arousal even when they reported no va-
lenced emotion (e.g., during baseline periods) supports the consensus
analysis evidence that valence is experienced differently than arousal, that
it varies with the stimulus, and that it is only related to arousal when the
magnitude of the rating is considered. In other words, arousal appears to
be related to the selection of a particular value on the Happiness rating
scale, but unrelated to whether that scale was used. The strong correlation
between arousal and valence scales when a valenced emotion was reported
suggests that subjects were using the arousal and valence scales in a con-
sistent manner, on an individual basis. As a group, however, they were clearly
using the arousal and valence scales inconsistently, because consensus emerged
for valence but not for arousal.

DISCUSSION

This reanalysis suggests that (1) the mean ratings used as norms were
a misleading assessment of the happiness evoked by the film clips; (2) sub-
ject ratings were consensual, varying with stimulus properties for the rating
scales labeled using valenced emotion terms, but were idiosyncratic, varying
from a personal baseline, for the scales labeled using arousal terms; (3)
subjects appeared to use the valence-related scales differently than the
arousal-related scales across the rating contexts; and (4) the magnitude of
ratings of valence appeared related to the magnitude of arousal reported
when valenced emotion was reported (but not vice versa).
When emotional response ratings were treated as discrete, categorical
data rather than as interval-scaled continuous data, results showed strong
agreement among subjects with respect to scales labeled using emotion
terms, including those labeled with the terms Anger, Disgust, Sadness, Hap-
piness, Fear, and Surprise. Strong agreement was also found with respect
to the scale labeled Pain. Strong disagreement among subjects was shown
with respect to the scales labeled Interest and Arousal, across the spectrum
of rating contexts. Further, stimuli considered equal in their ability to evoke
Happiness ratings when responses were analyzed as interval-scaled data
were found to be quite different in their enjoyment potential when analyzed
discretely. This may account for the previously reported failure to find
equal facial expressivity in response to equally rated film clips.
The analysis of normality suggests that averaged means do not ac-
curately characterize group response for this data set. Further, substantial
minorities report no affective response to stimuli, even where consensus
suggests that such response is normative. When a group includes such
subjects, attempts to correlate scale values with objectively measurable
continuous variables such as facial movement are likely to understate
any relationship between the two (Hays, 1988). This difficulty is com-
pounded when data are aggregated across individuals. That any statistical
relationship between facial activity and self-reported emotional response
exists in the literature suggests that a strong link between the two exists
in reality, given the difficulties of measurement that must be overcome.
Due to what appears to be a stimulus-related, all-or-nothing quality to
emotional response, investigators may be justified in eliminating subjects
who report no affect until the sources of such response are better un-
derstood.
Ruch (1995) provided evidence that correlations between self-reported
affect and facial activity can be increased when methodology is improved.
Some researchers have attempted to compensate for the difficulties inher-
ent in using self-report data by standardizing ratings; by adjusting self-re-
port ratings based upon some other variable, such as measured autonomic
arousal; or by performing their correlations on a within-subject or individ-
ual-by-individual basis. However, these techniques still assume that self-re-
port ratings are interval in nature when they are likely to be ordinal, at
best. For example, standardizing ratings again assumes both a normal dis-
tribution and an interval scale and does not eliminate difficulties of inter-
rater agreement. The relevance of this difficulty to the questions at hand
should be carefully evaluated.
Even ratings on the Arousal and Interest scales are not normally
distributed in this data set (see Fig. 2). Adjusting self-report scores
for an arousal baseline would correct for the effects of arousal upon the
use of the remaining valenced scales. However, the relationship between
arousal focus and valence focus must be better understood before such
corrections can be confidently made. That arousal and valence ratings are
correlated does not mean that they necessarily report aspects of the same
experience.
Larsen and Diener (1985, 1987) have noted individual differences in
the use of self-report scales similar to those reported here. However, they
attributed such differences to variation in the subjective experience of emo-
tion. Because we have no objective measure of emotion, we do not know
whether individual differences in self-reported emotional response arise
from differences in internal experience or from consistent and stable dif-
ferences in the use of self-report rating scales. Further, we do not know
whether arousal is correlated with emotion because it is an essential part
of emotional experience, or because level of arousal has a global effect on
rating behavior, independent of what is being rated.
Feldman's hypothesis that subjects differentially attend to arousal and
valence, two dimensions of emotional experience, can be tested by asking
subjects explicitly to differentially focus on these dimensions and noting
any resulting changes in their behavior. In a sense, that is what has been
done by Ekman et al. (1980), when subjects were asked to report arousal
separately from the other labeled scales. Because this manipulation was
not the objective of the study, no control condition was provided in which
arousal and valence were confounded (i.e., when no separate Arousal scale
was provided). It may be that the results analyzed above combine the be-
havior patterns of three groups: (1) those with an exclusive arousal focus,
(2) those with a mixed focus that they did not dissociate, and (3) those
with an exclusive valence focus. Group 1 above may have produced the
unexpectedly large number of 0 ratings on the valence scales together with
non-0 ratings of arousal for all stimuli. Group 2 may have produced the
strongly correlated arousal and valence scores, albeit from differing indi-
vidual baselines. Group 3 may have produced the consensual response on
the valenced scales, based largely upon stimulus properties, with 0 or low
ratings of arousal. This speculative explanation can be confirmed by studies
that more deliberately manipulate instructions to subjects, or that apply
cognitive approaches to studying attention.
Feldman's concept of an attentional focus provides a more complete
explanation, capable of resolving difficulties encountered by theories that
consider emotion to be synonymous with arousal. For example, Mandler
(1984) viewed the intensity of an experienced emotion as a function of
autonomic nervous system arousal, and Thayer (1989, p. 134) considered
energetic arousal to be synonymous with positive affect and tense arousal
with negative affect. Thayer (1986) demonstrated that the
two dimensions of self-reported arousal, energetic arousal and tense
arousal, both correlate with psychophysiological measures of autonomic
arousal. Neither of these definitions is wholly consistent with the results
produced here because they neglect instances in which self-reports of va-
lence and arousal diverge.
Thayer (1989) mapped the items of his self-report adjective checklist
for arousal onto the dimensions of positive and negative affect suggested
by Watson and Tellegen (1985), suggesting that they are interchangeable
labels for the same phenomenon. Watson and Tellegen's self-report space
was further analyzed by Larsen and Diener (1992), who suggested a revised
labeling and interpretation of the relevant dimensions as unpleasant-
ness/pleasantness and activation. The findings reported here support ob-
servations by Larsen and Diener (1992) that the practice of labeling scales
using adjectives from different octants of the emotion circumplex will pro-
duce different rating behavior, and that the dimensions of pleasantness or
unpleasantness versus activation seem to vary independently of each other.
To support this, Larsen and Diener (1992) described findings that the
Velten mood induction techniques tend to change hedonic tone (evalu-
ation) without affecting activation. My reanalysis confirms this.
While self-reported arousal does vary with external circumstances, and
appears to have the characteristics of a state rather than a trait measurement
(Matthews, Davies, & Lees, 1990), Matthews et al. noted the following:
. . .Revelle (personal communication [to Matthews et al.], July 11, 1988) pointed
out that individuals' self-ratings of arousal may be affected by individual differences
in characteristic baseline levels of arousal, so that arousal ratings are not directly
comparable across subjects . . . . Thus, only a part of the interindividual variance
in arousal scores will reflect absolute arousal values; a second part will reflect
interindividual variation in baseline. (pp. 151-152)

The .54 gamma correlation between Arousal and Interest scores for the
same subject may exist because both scales vary from the same baseline,
not because they both measure the same construct.
The scales analyzed in this study drew their terms from different
quadrants of Larsen and Diener's (1992) two-dimensional self-report
space. The Interest, Arousal, and Surprise scales were labeled with terms
from the activation dimension. The Happiness, Sadness, Anger, and Dis-
gust scales were labeled with terms from the hedonic (pleasant/unpleas-
ant) dimension. Pain did not appear in the circumplex because it is not
usually considered an affect term, but it seems closest to terms like mis-
erable or distressed in the hedonic dimension. Fear appears midway be-
tween the activation and hedonic dimensions, in a quadrant for activated
unpleasant affect.
The analysis reported here supports Larsen and Diener's (1992)
contention that the dimensions of activation and pleasantness/unpleas-
antness are orthogonal, at least with respect to introspective monitoring
and self-report. In these results, the activation reported on the Arousal
and Interest scales appears to vary differently than the remaining rating
scales for the stimuli presented. Even the scales combining hedonic affect
and activation, i.e., the Surprise and Fear scales, show considerable con-
sensual response with strong ratings by subjects in response to the second
unpleasant film clip (where an accidental death is shown). Although con-
sidered to be located in the high activation quadrant of Larsen and Di-
ener's self-report affect circumplex, these scales nevertheless show
consensual response. Surprise and fear typically involve strong autonomic
arousal, correlated with self-report ratings of arousal. Here, there is no
greater correlation between surprise and arousal than exists between hap-
piness and arousal. However, it may be that subjects experienced less
arousal when viewing a film than they might when fear involves personal
threat. Nevertheless, this is problematic for Larsen and Diener's theory.
The qualitative differences in the use of rating scales noted in this
study invite speculation about the effects of confounding valence and
arousal in previous studies. For example, the distinction between valence
and arousal parallels the distinction traditionally made between states
that are emotional in nature and those that are not (Clore, 1992; Clore,
Ortony, & Foss, 1987). It appears that a continuous distribution of idi-
osyncratic response, directly related to autonomic or reticular arousal and
monitored with respect to a personal baseline, is more typical of the less
emotional subjective states, including those recruited by Russell (1980)
to fill out the activation quadrants of the emotion circumplex, labeled
using terms such as dull, drowsy, relaxed, content, lively, peppy, and so on.
My reanalysis suggests that these states are subjective in nature and ac-
cessible to introspection and thus to self-report, but that they vary due
to factors specific and internal to the individual, and secondarily due to
the characteristics of the stimulus or the appraisal of that stimulus, as
hypothesized by Feldman (1995) and Blascovich (1990; Blascovich et al.,
1992). Theorizing a dichotomy between shared, consensual emotional re-
sponse to a particular stimulus and a generalized, personal, idiosyncratic
level of engagement with the environment makes intuitive sense and
seems to be important to the folk definition of emotion (Clore, 1992;
Clore et al., 1987).
Whatever the reader might feel about the consensus modeling tech-
nique applied in this study, it should be clear from the distributions and
from the differing modal responses that subject use of the rating scales
was quite different for the Interest and Arousal scales than for the re-
maining rating scales. Although emotion may be expressed and experi-
enced in combination with personal activation, at least some subjects
appeared to be able to rate them independently. Further, maintaining
the distinction between arousal and valenced emotion appears to be use-
ful. When dependent variables behave differently and show a different
relationship to an eliciting stimulus, it makes little methodological sense
to include them in a single encompassing construct. Systematic investi-
gation of the contribution of attentional focus may clarify the relationship
between self-report, valenced emotion, and arousal. In the meantime,
theorizing based upon a seeming incongruity between self-report and
emotional behavior seems premature.

APPENDIX

The following description of the consensus model is adapted from Rom-
ney, Weller, and Batchelder (1986). The model uses the following notation:
d_i        the probability that subject i knows the right answer to a given question
1 - d_i    the probability that the subject does not know the answer
L          the number of response options to a given question
1/L        the probability that the subject will guess the correct answer
1 - 1/L    the probability that the subject will guess an incorrect answer
m_ij       the probability that two subjects i and j give the same answer to a given question
The parameter d_i is the subject's competence score. It is readily calculated
if the answer key is known, because it is the proportion of questions answered
correctly (T_i) with a correction for guessing:

\[ d_i = \frac{T_i - 1/L}{1 - 1/L} \tag{1} \]

If the answer key is not known, the parameters are estimated using the
following equations:

\[ m_{ij} = d_i d_j + (1 - d_i d_j)\frac{1}{L} \tag{2} \]

\[ m^{*}_{ij} = d_i d_j \tag{3} \]

where m*_ij is an empirical point estimate of the proportion of matches between
two subjects, corrected for guessing (on the assumption of no bias). Equation (3)
is solved for d via minimum residual factor analysis to yield a least squares
estimate of the d parameter (competence score) for each subject. Bayes' theorem
is then used to estimate the answer key confidence levels, given the estimated
values of d. Consensus analysis was carried out with the Anthropac software
(Borgatti, 1993).
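For readers without access to Anthropac, a rough computational sketch of the procedure
is given below in Python. The function name, the use of a single leading eigenpair in
place of a full minimum residual factor analysis, and the treatment of the diagonal are
my own simplifications; the output should therefore be read as an approximation of the
published model, not a replication of Borgatti's implementation.

import numpy as np

def consensus_sketch(X, L):
    """Approximate categorical consensus analysis (after Romney et al., 1986).

    X : (n_subjects, n_items) array of integer responses coded 0..L-1.
    L : number of response alternatives per question.
    Returns (competence estimates, eigenvalues, estimated answer key).
    """
    n, k = X.shape

    # Proportion of matching answers for every pair of subjects,
    # corrected for guessing under the assumption of no response bias.
    raw_match = (X[:, None, :] == X[None, :, :]).mean(axis=2)
    m_star = (raw_match - 1.0 / L) / (1.0 - 1.0 / L)

    # Stand-in for minimum residual factoring of m*_ij = d_i * d_j:
    # replace the uninformative diagonal with mean off-diagonal agreement,
    # then take the leading eigenpair of the adjusted matrix.
    M = m_star.copy()
    np.fill_diagonal(M, (m_star.sum(axis=1) - np.diag(m_star)) / (n - 1))
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    v = eigvecs[:, order[0]]
    if v.sum() < 0:                        # eigenvector sign is arbitrary
        v = -v
    d = np.sqrt(max(eigvals[0], 0.0)) * v  # competence estimates

    # Bayesian estimate of the answer key: weight each subject's response
    # by the probability implied by that subject's competence.
    d_prob = np.clip(d, 0.0, 1.0)
    key = np.empty(k, dtype=int)
    for item in range(k):
        log_post = np.zeros(L)
        for alt in range(L):
            hit = X[:, item] == alt
            p = np.where(hit, d_prob + (1 - d_prob) / L, (1 - d_prob) / L)
            log_post[alt] = np.log(np.clip(p, 1e-12, 1.0)).sum()
        key[item] = int(np.argmax(log_post))
    return d, eigvals, key

Applied to a 35 x 7 matrix of ratings for one scale (with L = 9), such a sketch yields
quantities analogous to the competence scores, eigenvalues, and predicted answer keys
summarized in Tables I-III, up to the approximations noted in the comments.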
The model provides three measures for evaluating the extent of con-
sensus within a group: (1) eigenvalues showing a single dominant factor,
(2) mean competence rating over .500, (3) number of negative or low com-
petence ratings in a group of subjects (more than one or two present in a
data set suggests a lack of homogeneity even if the mean exceeds .500).
Together, these criteria function similarly to a significance level, in the sense
that they are (1) established based on experience with the domain of knowl-
edge in question, (2) related to acceptable levels of error, and (3) prees-
tablished when used for hypothesis testing.
No experience using this model in this domain has been reported pre-
viously in the literature, except by this author. However, the model has
been widely used in anthropology (Romney, Batchelder, & Weller, 1987)
and in other domains within psychology. It can be used as either a formal
model investigating the nature of knowledge in a domain, or as a simple
measure of the properties of a particular data set. In this application, the
broader assumptions of the model about culture are not claimed and the
technique is used primarily to evaluate the nature of the response patterns
among a set of subjects.
The criteria listed above are those considered by Romney et al. (1986),
the developers of the model, to be indicative of consensus in other domains
of cultural knowledge (e.g., classification of disease, parenting practices).
Thus, they seem to be a reasonable standard for judging existence of con-
sensus in this context. Behavior of the model has been tested using Monte
Carlo simulation (as described by Batchelder & Romney, 1989). Compe-
tence ratings have been found to be normally distributed and differences
between them can be tested using methods like normal curve tests and
ANOVAs (Batchelder & Romney, 1989).
Answer key confidence levels depend upon the number of subjects and
the extent of consensus within a group of subjects. The number of subjects
needed to estimate an answer key with a specified level of confidence de-
pends upon the mean competence of the group and can be estimated using
the formal model (see Romney et al., 1986).

REFERENCES

Batchelder, W., & Romney, A. (1988). Test theory without an answer key. Psychometrika, 53,
71-92.
Batchelder, W., & Romney, A. (1989). New results in test theory without an answer key. In
E. Roskam (Ed.), Mathematical psychology in progress (pp. 229-248). Heidelberg,
Germany: Springer Verlag.
Blascovich, J. (1990). Individual differences in physiological arousal and perception of arousal:
Missing links in Jamesian notions of arousal-based behaviors. Personality and Social
Psychology Bulletin, 16, 665-675.
Blascovich, J., Tomaka, J., Brennan, K., Kelsey, R., Hughes, P., Coad, M. L., & Adlin, R.
(1992). Affect intensity and cardiac arousal. Journal of Personality and Social Psychology,
63, 164-174.
Borgatti, S. (1993). Anthropac 4.0. Columbia, SC: Analytic Technologies, Inc.
Clore, G. (1992). Cognitive phenomenology: Feelings and the construction of judgment. In
L. Martin & A. Tesser (Eds.), The construction of social judgments (pp. 133-163). Hillsdale,
NJ: Erlbaum.
Clore, G., Ortony, A., & Foss, M. (1987). The psychological foundations of the affective
lexicon. Journal of Personality and Social Psychology, 53, 751-766.
Ekman, P., Friesen, W., & Ancoli, S. (1980). Facial signs of emotional experience. Journal of
Personality and Social Psychology, 39, 1125-1134.
Feldman, L. (1995). Valence focus and arousal focus: Individual differences in the structure
of affective experience. Journal of Personality and Social Psychology, 69, 153-166.
Fridlund, A. (1991). Sociality of solitary smiling: Potentiation by an implicit audience. Journal
of Personality and Social Psychology, 60, 229-240.
Fridlund, A. (1994). Human facial expression: An evolutionary view. San Diego, CA: Academic
Press.
Hays, W. (1988). Statistics (4th ed.). Austin, TX: Harcourt Brace College Publishers.
Hess, U., Banse, R., & Kappas, A. (1995). The intensity of facial expression is determined
by underlying affective state and social situation. Journal of Personality and Social
Psychology, 69, 280-288.
Klauer, K., & Batchelder, W. (1996). Structural analysis of subjective categorical data.
Psychometrika, 61, 199-240.
Larsen, R., & Diener, E. (1985). A multitrait-multimethod examination of affect structure:
Hedonic level and emotional intensity. Personality and Individual Differences, 6, 631-636.
Larsen, R., & Diener, E. (1987). Affect intensity as an individual difference characteristic: A
review. Journal of Research in Personality, 21, 1-39.
Larsen, R., & Diener, E. (1992). Promises and problems with the circumplex model of
emotion. In M. Clark (Ed.), Review of personality and social psychology (pp. 25-59).
Newbury Park, CA: Sage.
Levenson, R. (1992). Autonomic nervous system differences among emotions. Psychological
Science, 3, 23-27.
Mandler, G. (1984). Mind and body: Psychology of emotion and stress. New York: Norton.
Matthews, G., Davies, D. R., & Lees, J. (1990). Arousal, extraversion, and individual
differences in resource availability. Journal of Personality and Social Psychology, 59,
150-168.
Romney, A. K., Weller, S., & Batchelder, W. (1986). Culture as consensus: A theory of cultural
and informant accuracy. American Anthropologist, 88, 313-338.
Romney, A. K., Batchelder, W., & Weller, S. (1987). Recent applications of cultural consensus
theory. American Behavioral Scientist, 31, 163-177.
Ruch, W. (1995). Will the real relationship between facial expression and affective experience
please stand up: The case of exhilaration. Cognition and Emotion, 9, 33-58.
Russell, J. (1980). A circumplex model of affect. Journal of Personality and Social Psychology,
39, 1161-1178.
Thayer, R. (1986). Activation-Deactivation Adjective Checklist: Current overview and
structural analysis. Psychological Reports, 58, 607-614.
Thayer, R. (1989). The biopsychology of mood and arousal. New York: Oxford University Press.
Townsend, J., & Ashby, F. G. (1984). Measurement scales and statistics: The misconception
misconceived. Psychological Bulletin, 96, 394-401.
Watson, D., & Tellegen, A. (1985). Towards a consensual structure of mood. Psychological
Bulletin, 98, 219-235.
Weller, S., & Romney, A. K. (1988). Systematic data collection. Newbury Park, CA: Sage.
