THE WHOLE IS NOT DIFFERENT FROM ITS PARTS: MUSIC EXCERPTS ARE REPRESENTATIVE OF SONGS

SARA J. PHILIBOTTE, STEPHEN SPIVACK, NATHANIEL H. SPILKA, IAN PASSMAN, & PASCAL WALLISCH
New York University

MUSIC PSYCHOLOGY HAS A LONG HISTORY, but the question of whether brief music excerpts are representative of whole songs has been largely unaddressed. Here, we explore whether preference and familiarity ratings in response to excerpts are predictive of these ratings in response to whole songs. We asked 643 participants to judge 3,120 excerpts of varying durations, taken from different sections of 260 songs from a broad range of genres and time periods, in terms of preference and familiarity. We found that within the range of durations commonly used in music research, responses to excerpts are strongly predictive of whole-song affect and cognition, with only minor effects of duration and location within the song. We concluded that preference and familiarity ratings in response to brief music excerpts are representative of the responses to whole songs. Even the shortest excerpt duration that is commonly used in research yields preference and familiarity ratings that are close to those for whole songs, suggesting that listeners are able to rapidly and reliably arrive at the recognition, preference, and familiarity judgments they would make for whole songs.

Received: April 20, 2018; accepted December 19, 2022.

Key words: music, individual differences, rating, recognition, methodology

MUSIC CAN EVOKE AFFECTIVE RESPONSES, which has been studied empirically. Some of the work on the elicitation of emotions with music employs whole songs (Grewe et al., 2007; Grewe et al., 2009), whereas other studies utilize brief excerpts (De Vries, 1991; Kreutz et al., 2008; Krumhansl, 1997). A related line of research pertains to the question of what determines music preference more generally. Again, some use whole songs (Schäfer & Sedlmeier, 2010) whereas others use excerpts to address this question (Rentfrow et al., 2011). There are good reasons for this diversity of approaches: the average song in popular music is about 3–4 minutes long (Allain, 2014). As there are limits on the attention and time of research participants, using whole songs in an experiment curbs the number of different pieces that can be used in any given study; indeed, such studies typically use fewer than 10 distinct pieces. This is a concern because music is inherently complex—varying in many dimensions—so picking only a few pieces runs the risk of undersampling the underlying stimulus space of music (Lundin, 1953; Prince, 2011). In contrast, using brief excerpts as the stimulus material enables experimenters to employ many different music selections in the same study, which in turn allows for a more thorough coverage of the underlying stimulus space. However, this approach runs the risk that these excerpts are not representative of the whole songs from which they were extracted. In other words, in any given study, music researchers face an inherent tradeoff between better coverage of the stimulus space of music and better coverage of individual songs.

Whether the results gained from these different approaches can inform each other depends on the answers to two questions:

1. How representative are the excerpts of the songs from which they were sampled?
2. How much do these results depend on the specific way in which the excerpts were sampled?

To our knowledge, the answer to the first question has not been studied empirically. Whereas some studies have investigated how quickly participants can make reliable aesthetic judgments about excerpts (Belfi et al., 2018), none have explicitly compared the ratings of excerpts to those of whole songs. In other words, the answer to this question remains open. Meanwhile, the field simply relies on the assumption that excerpts are in fact representative of songs. This might be the case, but there are concerns about both internal and external validity. For instance, it is conceivable that the long-term temporal structure of a song matters in a way that can—in principle—not be captured by brief excerpts. In terms of external validity, most people listen to whole songs, not excerpts, when listening to music.
Therefore, results from the study of "music" that relies solely on excerpts could be misleading if this assumption does not hold.

In terms of the second question, there is a wide range of stimulus durations commonly used in music research that utilizes excerpts (also referred to as "selections," "snippets," or "clips" in some of these publications). This duration range spans from around 5 seconds (Krumhansl, 2017; Krumhansl & Zupnick, 2013), to about 9 seconds (Peretz et al., 1998), to a "very short" 14 seconds (Mehr et al., 2018), to 15 seconds (Belfi et al., 2016; Rentfrow et al., 2011), to 30 seconds (McCown et al., 1997), and up to 2 minutes (Zentner et al., 2008), but it is unclear how this choice affects aesthetic judgments. Sometimes, even shorter (e.g., Belfi et al., 2018; Mace et al., 2012) and longer pieces are used (e.g., Vuoskoski et al., 2012; Warrenburg, 2020). In addition, sometimes the excerpt is taken from the chorus (Krumhansl, 2017; Krumhansl & Zupnick, 2013), sometimes from a "highly recognizable part of the song" (Belfi et al., 2016), and sometimes from "melodic lines" of a song (Peretz et al., 1998). At other times, the clips are simply whatever is provided by iTunes (Barrett et al., 2010). This might matter, as the sonic properties of a song could be different in different parts of the song. For example, the sonic properties of the introduction might be different from those of the chorus, which might in turn be different from those of the verse. Often, it is not specified at all how the excerpts were sampled from the song (Kreutz et al., 2008; Rentfrow et al., 2011; Vuoskoski et al., 2012), only that excerpts—with otherwise unknown properties—were used, rendering it unclear what research participants actually listened to. This is a concern in terms of the potential replicability of this research, which is an increasingly important matter (Camerer et al., 2018; Open Science Collaboration, 2015).

In this study, we aim to address these two related questions in order to quantify how representative excerpts are of their song of origin and whether affective and cognitive judgments are affected by the duration or location of the excerpts.

Method

PARTICIPANTS

Our sample consisted of New York University undergraduate students as well as residents of the greater New York City area (age range = 17 to 87 years, mean = 21.3 years, median = 20 years). Our participants (n = 643) completed the study in a single, in-person two-hour session. All participants either received course credit or were compensated with $20. Participants were recruited using the "NYU SONA subject pool" and through flyers, email lists, and advertisements in classes. We used the data from all 638 participants (> 99%) who finished the study for the analysis presented here.

MATERIALS

Song Selection

We selected 260 songs in total, with the goal of achieving a selection that was representative of music styles commonly heard by U.S. listeners, including a wide range of genres, time periods, and popularity. Thus, we included 152 pop songs from the Billboard popular music charts (2 randomly chosen "number-one songs" for each year from 1940 through 2015), 52 diverse songs that were judged by a panel of experts to be obscure (Rentfrow et al., 2011), as well as 56 "iconic" songs from eight broad music genres (classical, country, electronic, jazz, popular, rap/hip-hop, rock, and R&B/soul). For the iconic songs, a lab-wide panel composed of six researchers with varying music backgrounds determined seven sub-genres for each of the eight broad genres (such as "Renaissance" and "Baroque" for classical music, and "Bebop" and "Fusion" for jazz). For each of these 56 sub-genres, the panel voted on one song to best represent the given sub-genre.

Excerpt Selection and Creation

For each of the 260 songs, we created 12 excerpts ("clips") with durations of precisely 5s, 10s, and 15s. The clips were "nested" such that clips of all durations had the same starting point; for example, if a clip started at 1:10 minutes into the song, the 15s clip played from 1:10 minutes to 1:25 minutes, the 10s clip from 1:10 minutes to 1:20 minutes, and the 5s clip from 1:10 minutes to 1:15 minutes. We created the clips with a custom-built MATLAB program. We linearly faded the last 10% of all clips to silence (the last 1.5s, 1.0s, and 0.5s for the 15s, 10s, and 5s clips, respectively).
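To make the nesting-and-fading procedure concrete, a minimal MATLAB sketch follows. This is an illustrative reconstruction, not the original program; the file name, onset time, and variable names are assumptions (the implicit expansion in the fade step requires R2016b or later, the version used in this study).

    % Sketch: create nested 15s/10s/5s clips sharing one starting point,
    % linearly fading the last 10% of each clip to silence.
    [y, fs] = audioread('song.wav');   % hypothetical input file
    startSec = 70;                     % clip onset, e.g., 1:10 into the song
    for d = [15 10 5]                  % nested clip durations in seconds
        idx = round(startSec * fs) + (1:round(d * fs));  % sample indices
        clip = y(idx, :);
        nFade = round(0.10 * size(clip, 1));             % last 10% of samples
        ramp = linspace(1, 0, nFade)';                   % linear fade to zero
        clip(end-nFade+1:end, :) = clip(end-nFade+1:end, :) .* ramp;
        audiowrite(sprintf('clip_%02ds.wav', d), clip, fs);
    end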
To create these clips, we indexed the structure of all songs according to when the intro (I), "outro" (O), chorus (C), and "verse" (V) sections of the song began, with the "verse" sections including verses, solos, bridges, and any other portions of the song that did not fit into the other categories. Doing so, we derived a standard song structure: beginning with I, then V and C, with the V-C combination repeated a variable number of times (n), until O (the ending section of the song). This structure can be represented as I + (V*C)n + O. For "standard" songs that followed this format, another custom-built MATLAB program randomized which I/O, C, or V portion to use for the clip and when in that portion the clip should start. We used these target starting points to create the clips but manually ensured that the clips did not begin in the middle of a musical phrase.
For example, if the output of the program indicated starting the chorus clips at 0:23 minutes but this cut off a phrase, we might start the clips at 0:24.1 minutes instead, to avoid cutting off the start of the musical phrase, such as when the lyrics begin or at the downbeat of the first measure. We balanced the actual starting points, ensuring that equal numbers of actual starting points fell before and after the target starting points given by the MATLAB program, such that the deviations of the actual starting points from the suggested starting points for each song were normally distributed with a mean of zero. The actual clip starting points were all within a few seconds in either direction of the target starting points. If any of the chorus or verse sections overlapped, we created the first section using the given starting point and then re-drew the starting times for the other sections by running the sampling program again. If the intro or outro was less than 15 seconds long, we used the first or last 15 seconds of the song, respectively. At the end of this process, none of the clips drawn from I/O, C, or V overlapped at all.

Some songs did not clearly follow this standard structure (the remaining "non-standard" songs). For instance, some songs did not have a chorus or clearly delineated sections at all. For these songs, we used the temporal distribution of the clips from the songs that had clearly delineated sections in order to create a statistical distribution of starting points that matched that of the songs with a standard structure. In other words, we found sections of these songs that corresponded to the time points at which the chorus or verse typically started, had there been a chorus or verse.

For each song, 9 of the 12 clips were created in the systematic way detailed above (3 sections crossed with 3 clip durations). An additional 3 clips per song were chosen subjectively in the following manner. The members of the selection panel independently listened to all of the songs and voted on which 15s section of each song they considered to be the most representative or characteristic of the entire song. The panel discussed their chosen sections until a consensus was reached among all members. We then created the 15s, 10s, and 5s clips for these representative sections using the same nesting technique described earlier (see Figure 1).

FIGURE 1. Schematic of how the excerpts were sampled from a song, illustrated with one example (Britney Spears' ". . . Baby One More Time"). The graph is the sonogram; the red bars indicate the sampling windows. Horizontal black bars represent clip durations.
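The randomized draw of a section and a within-section target starting point can be sketched as follows. The section boundaries, field names, and the uniform draw are assumptions made for illustration; they are not taken from the original program.

    % Sketch: pick a random section type and a target clip start within it.
    % Each field holds [start, end] times in seconds, one row per instance.
    sections = struct('I', [0 12], 'C', [45 75; 120 150], ...
                      'V', [12 45; 75 120], 'O', [150 165]); % hypothetical song
    types = {'I', 'C', 'V', 'O'};
    t = types{randi(numel(types))};          % which section type to sample
    inst = sections.(t);                     % all instances of that type
    row = inst(randi(size(inst, 1)), :);     % pick one instance at random
    latest = max(row(2) - 15, row(1));       % the 15s clip should fit
    targetStart = row(1) + rand * (latest - row(1));
    fprintf('Draw a %s clip starting near %.1f s\n', t, targetStart);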
To summarize, we created 12 clips for each of the 260 songs. For standard songs, we picked clips from a chorus, verse, intro, or outro section, as well as from a subjectively chosen section, with durations of 15s, 10s, and 5s for each. For songs with a non-standard structure, we picked clips from sections of the song that statistically corresponded to the typical starting points of the chorus, verse, and intro/outro of standard songs.

Auditory "Palate Cleanser"

As each participant listened to numerous songs and clips (a total of 192) throughout the experiment, we created simple behavioral tasks that the participants completed between music presentations to avoid carryover effects from one music selection to the next. This "palate cleansing" was effective. We performed an analysis correlating the music ratings from all 192 trials with themselves, offset by one trial. This cross-correlation analysis revealed that the correlation between all ratings and those offset by one trial was statistically indistinguishable from zero for almost all (94%) of the participants. It was significant only for the remaining participants, and those effects were likely driven either by mood effects (someone being in a temporary up- or down-state) or by someone "clicking through" for a while during the task.
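A minimal sketch of this lag-1 check for a single participant follows; the placeholder rating series and the use of a Spearman correlation at lag 1 are our assumptions about how the analysis was implemented.

    % Sketch: correlate a participant's 192 trial ratings with the same
    % series offset by one trial, to test for carryover effects.
    ratings = randi(7, 192, 1);           % placeholder rating series
    [rho, p] = corr(ratings(1:end-1), ratings(2:end), 'type', 'Spearman');
    carryover = p < .005;                 % study-wide alpha of .005
    fprintf('lag-1 rho = %.3f, p = %.3f\n', rho, p);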
Procedure

Throughout the experiment, participants listened to music (full songs and clips), which they rated in terms of liking and familiarity. They were also asked to indicate whether they recognized each music selection. Each music presentation was followed by a "palate cleanser," a task requiring a behavioral response that was designed to distract participants from what they had just heard. All music was presented via Audio-Technica ATH-M20x Professional Monitor Headphones using custom-built MATLAB (2016b) software.

The songs used in each session were randomly drawn from all 260 songs, with the following constraints: each participant listened to 6 songs from the Billboard corpus, 3 from the Rentfrow corpus, and 3 "iconic" songs, for a total of 12 songs per participant and experimental session. Participants were also presented with the 144 clips (12 per song) drawn from these 12 songs as well as 36 clips that were randomly drawn from the other 248 songs in order to enrich the experience of participants, as piloting suggested that listening to only 12 songs over and over again would be too monotonous. The order of these 192 music selections was effectively randomized (we constrained the order of the music selections such that a clip from a song could not immediately precede or follow the song from which it came) and alternated with 191 behavioral "palate cleansers" for a total of 383 trials per participant.

To operationalize affective (preference) and cognitive (familiarity/recognition) judgments, respectively, participants were asked in each music trial to rate how much they liked the current song or clip on a 7-point Likert scale ("Hate it," "Strongly dislike it," "Slightly dislike it," "Indifferent," "Slightly like it," "Strongly like it," "Love it") as well as to rate their familiarity on a 5-point Likert scale, responding to the question, "How often have you heard this before?" ("Never," "Once," "More than once," "Multiple times," "Too many to count"). We used these scales because piloting indicated that these response options constitute "natural" reference points. Participants were also asked to indicate whether they recognized the song by binary choice ("Yes" or "No"). Importantly, our participants only saw the qualitative labels, to which we assigned numbers from 1 to 7 (for preference ratings) and 1 to 5 (for familiarity) in the analysis.

When instructing the participants, we emphasized that they should answer all questions with respect to the specific clip or song they had just heard, not other portions of the song or whole genres. All procedures were approved by the New York University Institutional Review Board, the University Committee on Activities Involving Human Subjects (UCAIHS).

Data Analysis

To analyze the data, we used MATLAB throughout. As we performed several statistical comparisons in the form of significance tests, we adopted a conservative significance level alpha of .005 (Benjamin et al., 2018) in order to avoid false positive results.

Before performing these tests, we needed to do a manipulation check to establish whether our participants used the scales described above properly. For instance, it was conceivable that there were floor effects, ceiling effects, or response biases, as well as the possibility that we used songs that were psychometrically not representative of the full range of music.

To accomplish this, we plotted a histogram of the number of responses as a function of the average preference rating for both clips and songs. If there were substantial deviations from normality, this could pose statistical issues as well as raise the question of whether our participants were engaged in the task or had other biases.
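A sketch of this manipulation check is given below, using placeholder matrices in place of the real participant-by-item ratings. Note that MATLAB has no built-in Shapiro-Wilk test; the commented call assumes a third-party implementation (such as swtest from the File Exchange), which is an assumption on our part.

    % Sketch: manipulation check on per-participant mean preference ratings.
    prefClips = randi(7, 638, 180);      % placeholder participants-by-clips
    prefSongs = randi(7, 638, 12);       % placeholder participants-by-songs
    meanClip = mean(prefClips, 2);       % average rating per participant
    meanSong = mean(prefSongs, 2);
    histogram(meanClip); hold on; histogram(meanSong); hold off;
    % Normality check, assuming a File Exchange Shapiro-Wilk implementation:
    % [h, pClip, swClip] = swtest(meanClip);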
However, our results showed that the overall preference ratings of our chosen music were well calibrated and that participants used the rating scale properly (see Figure 2). The mean song preference rating was 4.23, with a standard deviation of 0.95, while the mean clip preference rating was 4.13, with a standard deviation of 0.87. Neither distribution deviated significantly from a normal distribution, as tested by a Shapiro-Wilk test: SW = 0.9991, p = .99 for clips, and SW = 0.9958, p = .082 for songs. Thus, our preference rating distributions are not statistically distinguishable from normal distributions.

FIGURE 2. Ratings distributions. Each histogram represents the average preference ratings per participant, binned. Left panel: Clips. Right panel: Songs.

Results

The main question we attempted to answer in this study was whether preference and familiarity ratings in response to brief music excerpts were representative of the whole songs from which they were sampled.

To address this question, we correlated the average clip preference ratings of participants with their song preference ratings. The median Spearman rank correlation was .834 (see Figure 3 for the distribution).

FIGURE 3. Distribution of Spearman rank correlations between clip and song preference ratings. Grey bars: Binned correlations. Orange vertical line: Median correlation.

As only 12 points went into any given correlation (per participant), we were concerned whether such a median could plausibly be obtained by chance. Thus, we randomly shuffled the ratings data 500,000 times. Doing so, the highest median correlation we obtained was .0724, implying an empirical p value of 0 (i.e., p < 1/500,000). We therefore concluded that this median correlation was both statistically significant and substantial. Put differently, clip preference ratings did seem to predict song preference ratings well.
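A sketch of this shuffle test follows, assuming the ratings are organized as participant-by-song matrices and that the pairing between clip and song ratings is broken within each participant; both the data layout and the shuffling granularity are our reading of the analysis, and the direct implementation shown here is slow but transparent.

    % Sketch: permutation test for the median per-participant Spearman
    % correlation between clip and song preference ratings.
    nPart = 638; nSongs = 12; nShuffles = 500000;   % as reported in the text
    clipPref = randi(7, nPart, nSongs);             % placeholder data
    songPref = randi(7, nPart, nSongs);
    rhos = zeros(nPart, 1);
    for i = 1:nPart
        rhos(i) = corr(clipPref(i,:)', songPref(i,:)', 'type', 'Spearman');
    end
    obs = median(rhos);                             % observed median
    nullMedians = zeros(nShuffles, 1);
    for s = 1:nShuffles
        for i = 1:nPart
            perm = songPref(i, randperm(nSongs));   % break clip-song pairing
            rhos(i) = corr(clipPref(i,:)', perm', 'type', 'Spearman');
        end
        nullMedians(s) = median(rhos);
    end
    p = mean(nullMedians >= obs);                   % empirical p value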
However, there was an unaddressed confound in this analysis. We simply correlated the song preference ratings with the clip preference ratings, regardless of where in the presentation sequence the song appeared. In other words, what might appear as prediction—predicting the song preference rating from the clip preference rating—might in actuality be postdiction. This is not a trivial point, as the task (from the perspective of the participant) was quite different psychologically in the two cases. In the first case, if participants had just heard a brief excerpt, there were large parts of the song that they had not heard yet. In the second case, they simply had to realize that the clip was part of a song they had already heard. Thus, we further refined this analysis by whether the clips occurred before or after the song (see Figure 4).

FIGURE 4. Distribution of Spearman rank correlations between clip and song preference ratings. Grey bars: Binned correlations. Orange vertical lines: Median correlation. Top panel: Preference ratings from clips presented before the song. Bottom panel: Preference ratings from clips presented after the song.

Our intuition as to the different nature of the responses appeared to be correct. The median Spearman correlation for clips presented after the song was significantly higher than that for clips presented before the song, as assessed by a KS-test (Δ = 0.091, D = 0.279, p = 2.68e-22).
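A sketch of this comparison using MATLAB's two-sample Kolmogorov-Smirnov test is shown below; the correlation vectors are placeholders, and we read the Δ value as the difference between the two median correlations, an assumption consistent with the Delta column of the tables that follow.

    % Sketch: compare the distributions of per-participant clip-song
    % correlations for clips heard before vs. after the song.
    rhoBefore = 2 * rand(638, 1) - 1;     % placeholder correlation vectors
    rhoAfter  = 2 * rand(638, 1) - 1;
    [h, p, D] = kstest2(rhoBefore, rhoAfter);            % KS distance D
    deltaMed = median(rhoAfter) - median(rhoBefore);     % the Delta column
    fprintf('Delta = %.3f, D = %.3f, p = %.2e\n', deltaMed, D, p);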
Taken together, the correlation between clip and song ratings was high, indicative of high intra-song reliability of appraisal. But did this depend on the length of the clip? Intuitively, it must—from first principles, the longer an organism integrates information, the more reliable a judgment will be (Anderson, 1962). However, it is possible that humans are highly efficient integrators of this information. In other words, the accuracy of judgments might already have saturated at the beginning of the range of excerpt durations commonly used in music psychology research (around 5 s). We explore this question in Figure 5.

FIGURE 5. Distribution of Spearman rank correlations between clip and song preference ratings by clip duration. Grey bars: Binned correlations. Orange vertical lines: Median correlation. Top row: Preference ratings from clips presented before the song. Bottom row: Preference ratings from clips presented after the song. Columns from left to right: 5-second, 10-second, and 15-second clips, respectively.

Median Spearman correlations between the clip and song preference ratings were .74, .76, and .78 for clips of 5s, 10s, and 15s duration, respectively, if the clip was presented before the song. The corresponding median correlations were .84, .86, and .86 if the clip was presented after the song. We performed KS-tests to assess whether these differences are statistically significant (see Table 1).

TABLE 1. KS-test Results Corresponding to Data Presented in Figure 5

Comparison of clip durations relative to song    Delta     D       p          Significant?
5s Before vs. 10s Before                         0.011     0.062   .171       ns
5s Before vs. 15s Before                         0.034     0.088   .0143      ns
10s Before vs. 15s Before                        0.023     0.061   .176       ns
5s After vs. 10s After                           0.014     0.050   .409       ns
5s After vs. 15s After                           0.014     0.050   .389       ns
10s After vs. 15s After                          5.5e-6    0.039   .720       ns
5s Before vs. 5s After                           0.099     0.22    5.60e-14   ***
10s Before vs. 10s After                         0.101     0.26    1.82e-19   ***
15s Before vs. 15s After                         0.079     0.23    1.03e-14   ***

To summarize this pattern of results, none of the differences between clip durations—whether before or after the song—were significant. In contrast, all differences between clips presented before and after the song (for a given clip duration) were statistically significant. We concluded that clip duration was not a significant factor when predicting song preference ratings from clip preference ratings. However, just as in the analysis where we did not disaggregate by duration, ratings from clips presented after the song were more strongly correlated with the song preference ratings.

Whereas it appears that there was no significant difference between the clip durations that are commonly used in music research, one might wonder whether it matters from which part of the song the clip is sampled. For instance, it is conceivable that there could be a considerable difference in the correlation between clips and song for clips taken from the chorus versus clips taken from other sections of the song. We explored this question in Figure 6.

FIGURE 6. Distribution of Spearman rank correlations between clip and song preference ratings by song section. Grey bars: Binned correlations. Orange vertical lines: Median correlation. Top row: Preference ratings from clips presented before the song. Bottom row: Preference ratings from clips presented after the song. Columns from left to right: Clips sampled from the Intro or Outro, Chorus, Verse, or a subjectively most representative section of the song, respectively.

Median Spearman correlations between the clip and song preference ratings were .72, .76, .75, and .76 for clips from the Intro/Outro, Chorus, Verse, and a subjectively maximally representative portion of the song, respectively, if the clip was presented before the song. The corresponding median correlations were .83, .85, .84, and .85 if the clip was presented after the song. We performed KS-tests to assess whether these differences are statistically significant and present the results in Table 2.

TABLE 2. KS-test Results Corresponding to Data Presented in Figure 6

Comparison of clip sections relative to song       Delta    D        p          Significant?
I/O before vs. Chorus before                       0.0363   0.0754   .0510      ns
I/O before vs. Verse before                        0.0302   0.0628   .1565      ns
I/O before vs. Representative before               0.0432   0.1120   6.06e-04   *
Chorus before vs. Verse before                     0.0060   0.0440   .5605      ns
Chorus before vs. Representative before            0.0069   0.0508   .3755      ns
Verse before vs. Representative before             0.0130   0.0838   .0216      ns
I/O after vs. Chorus after                         0.0225   0.0750   .0530      ns
I/O after vs. Verse after                          0.0160   0.0539   .3041      ns
I/O after vs. Representative after                 0.0283   0.0982   .0039      *
Chorus after vs. Verse after                       0.0065   0.0485   .4346      ns
Chorus after vs. Representative after              0.0058   0.0469   .4779      ns
Verse after vs. Representative after               0.0124   0.0627   .1571      ns
I/O before vs. I/O after                           0.1055   0.2229   2.28e-14   ***
Chorus before vs. Chorus after                     0.0917   0.2137   3.19e-13   ***
Verse before vs. Verse after                       0.0912   0.2253   1.12e-14   ***
Representative before vs. Representative after     0.0906   0.2023   6.66e-12   ***
Again, a consistent pattern of results emerged. Broadly speaking, it did not matter from which section of the song the clips were sampled. The only exception was the comparison between clips from the Intro or Outro and the subjectively most representative clips, with the latter being slightly more strongly correlated with the song preference ratings. This is plausible, as the Intro or Outro can be quite drastically different from the rest of the song in terms of acoustic properties.

Again, there were strong and statistically significant differences for clips presented before vs. after the song, with clips presented after the song being consistently more strongly correlated with the song preference ratings, regardless of the song section from which the clip came. This is consistent with the idea that once participants have encountered a particular song, any subsequent clip can serve as a cue that evokes the memory of the feelings the whole song elicits. It is also possible that participants had encountered the whole song before the study. In that case, a clip could act as a retrieval cue to the memory of the whole song, whenever it was heard (Spivack et al., 2018). In other words, the effects laid out above might well be mediated by recognition. In fact, it is possible that what drives preference ratings is not the acoustic properties of the song per se, but the entire context and associations that were encoded when first encountering the song prior to the experiment. If this were the case, we would predict that such an effect would boost the correlation between clip and song preference ratings, as these memories could be used to rate the clip consistently with the previously encountered song. Thus, this correlation should be higher for clips where participants recognize the song than for those where they do not.

There are indications that are consistent with this model. For instance, we recorded the time it took for participants to click on the "next trial" button. Looking at the reaction times for 5s clips is fair, as participants could already report their preference ratings while the clip was still playing, but could not continue to the next trial until the clip had finished playing. Looking at the reaction times for these clips, the median reaction time for unrecognized clips was 7.16s, whereas it was 6.76s for recognized clips. This difference was statistically significant, as evaluated with a Mann-Whitney U test (ranksum = 360980, p = 9.38e-7). Thus, it took longer to rate an unrecognized song than a recognized one, perhaps because the memory serves as a shortcut. This is remarkable, as it takes some time to click the "I recognize the clip" checkbox, whereas it takes no time to simply leave it unclicked (the default); everything else being equal, we would therefore have expected the unrecognized clip trials to be completed faster.
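This reaction time comparison can be sketched with MATLAB's rank-sum test, which implements the Mann-Whitney U comparison; the reaction time vectors below are placeholders.

    % Sketch: compare reaction times for unrecognized vs. recognized 5s
    % clips with a Mann-Whitney U (Wilcoxon rank-sum) test.
    rtUnrec = 5 + 4 * rand(500, 1);   % placeholder reaction times (s)
    rtRec   = 5 + 4 * rand(500, 1);
    [p, h, stats] = ranksum(rtUnrec, rtRec);
    fprintf('medians: %.2fs vs. %.2fs, ranksum = %d, p = %.2e\n', ...
        median(rtUnrec), median(rtRec), stats.ranksum, p);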
Moreover, there is information in the clip recognition data (see Figure 7).

FIGURE 7. Logistic regression of song recognition from clip recognition. The x-axis represents the number of clips recognized. The y-axis represents whether the song was recognized. Black dots represent individual songs across all participants. The orange line is the logistic fit.

As one can see, song recognition can be predicted from the number of clips that are recognized and is well described by a logistic function (β0 = 4.12, β1 = 0.64). However, this analysis integrated information over all participants and clips and was challenging to do on a per-participant level, as recognition rates for quite a few of our songs were close to zero. This is not surprising, given that we took a substantial portion of our sample from the Rentfrow et al. (2011) corpus, which was deliberately sampled to consist of obscure music. Overall, only 20% of the clips were reported to be recognized by our participants. Therefore, instead of calculating the average clip-to-song correlation, we calculated the mean absolute deviation in this analysis (see Figure 8).
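A sketch of such a logistic fit, using glmfit with a binomial model, follows; the recognition data below are placeholders, so the fitted coefficients will not reproduce the values reported above.

    % Sketch: logistic regression of song recognition (0/1) on the number
    % of that song's clips that were recognized.
    nClipsRec = randi([0 12], 1000, 1);        % clips recognized per song
    songRec = double(rand(1000, 1) < 0.2);     % song recognized (yes/no)
    b = glmfit(nClipsRec, songRec, 'binomial', 'link', 'logit');
    % b(1) is the intercept (beta0); b(2) is the slope (beta1).
    x = 0:12;
    pHat = 1 ./ (1 + exp(-(b(1) + b(2) * x)));  % fitted logistic curve
    plot(x, pHat);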
FIGURE 8. Bar graph of mean absolute deviations as a function of whether clips were presented before or after the song, and whether they were recognized or not. Orange error bars represent the standard error of the mean. For comparison, we also included a bar that corresponds to randomly shuffled ratings.

We attempted to tease apart the effects of recognition and of whether clips were presented before or after the song. The mean absolute deviation (MAD) was calculated as the mean absolute difference between song and clip ratings. As one can see, the MAD was lowest for recognized clips presented after the song. The prediction accuracy of clips that were presented before the song but were recognized was roughly equal to that of unrecognized clips presented after the song. Unrecognized clips that were presented before the song were least predictive of the song preference rating, but still far more predictive than one would expect from random chance. We concluded that recognition and presentation order both play roughly equally strong roles in increasing song predictability from clips, with short-term and long-term memory serving as plausible mechanisms underlying this effect.
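The MAD computation and its split by recognition and presentation order can be sketched as follows; the rating and indicator vectors are placeholders.

    % Sketch: mean absolute deviation (MAD) between clip and song ratings,
    % split by recognition and by clip position relative to the song.
    clipR = randi(7, 1000, 1); songR = randi(7, 1000, 1);  % placeholder
    recognized = rand(1000, 1) < 0.2;
    afterSong  = rand(1000, 1) < 0.5;
    madFun = @(idx) mean(abs(clipR(idx) - songR(idx)));
    fprintf('recognized & after:    %.2f\n', madFun(recognized & afterSong));
    fprintf('unrecognized & before: %.2f\n', madFun(~recognized & ~afterSong));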
It is somewhat surprising that neither the duration of the excerpt (within a three-fold window of commonly used excerpt durations, from 5 to 15 seconds) nor the song section that the excerpt came from appreciably affected the predictability of song preference ratings from the preference ratings of excerpts (with the exception of the contrast between intro/outro and representative clips). With this in mind, note that we presented the excerpts in random order throughout the experiment. Given this design, it is possible that there was cross-contamination between the excerpts; for instance, if a longer excerpt was played first, that might affect how a later but shorter one was interpreted. It is also possible that if a more representative segment was played first, a later intro clip might remind the listener of it, which would artificially level the differences in predictability. This is a valid concern, and one of the key reasons Belfi et al. (2018) adopted a block design. However, such a design is also not without inherent concerns, such as order effects or temporal autocorrelations. Thus, we addressed this possibility by analyzing the predictability of song ratings based on the very first excerpt (and only if that excerpt was played before the song), as a function of duration and song section. As we could only use the very first instance of the excerpts, we computed the root mean squared error (RMSE) across all participants and songs and then bootstrapped confidence intervals. To mimic the significance level used in the rest of this publication and to guard against multiple comparison concerns, we computed 99% confidence intervals. As is evident in Figure 9, the concerns about cross-contamination between excerpts were not borne out empirically, as these results closely mirror those from the previous analyses presented above.

FIGURE 9. Predictability of song preference ratings from excerpt preference ratings based on the very first pre-song excerpt only. Upper panel: The x-axis denotes the duration of the excerpts. The y-axis indicates the root mean squared error (RMSE) between excerpt and song preference ratings. The orange error bars represent the bootstrapped 99% confidence interval. Lower panel: Like the upper panel, but the x-axis now denotes the section of the song the excerpt was extracted from.
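A sketch of the bootstrapped 99% confidence interval for the RMSE follows; the rating vectors and the number of bootstrap samples are placeholders, as the original resampling scheme is not specified beyond the confidence level.

    % Sketch: bootstrapped 99% CI for the RMSE between first-excerpt and
    % song preference ratings.
    clipR = randi(7, 600, 1); songR = randi(7, 600, 1);  % placeholder
    nBoot = 10000; n = numel(clipR);
    rmseBoot = zeros(nBoot, 1);
    for b = 1:nBoot
        idx = randi(n, n, 1);            % resample pairs with replacement
        rmseBoot(b) = sqrt(mean((clipR(idx) - songR(idx)).^2));
    end
    ci99 = prctile(rmseBoot, [0.5 99.5]);  % 99% confidence interval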
After answering the question of whether preference ratings of songs can be predicted from clip preference ratings, we turned to the question of whether song familiarity could be predicted from clip familiarity. To answer this question, we employed a similar but somewhat modified analysis, as each presentation of music—both clips and songs—changed the reported familiarity of subsequent presentations of the same music (see Figure 10).

FIGURE 10. Familiarity as a function of presentation number. x-axis: Presentation. y-axis: Mean familiarity. Red error bars represent the standard error of the mean. Left panel: Treating all music presentations—be they clips or songs—equally. Right panel: Ordering clip presentations relative to the song presentation, 0 = song.

Figure 10 shows two things: First, familiarity significantly increased as a function of the number of presentations of the same music. Second, the type of music presentation mattered; in other words, listening to the song had a much larger impact on reported familiarity than listening to a clip. This is plausible, as the average song was 21 times longer than the average clip. As both observations impact any possible relationship between clips and songs, we restricted our analysis to clips that immediately preceded a song. We believe this was a fair comparison, as their average familiarity was comparable.

Thus, to answer the question of whether song familiarity can be predicted from clip familiarity, we calculated the median Spearman correlation between songs and the corresponding clips that immediately preceded them, which was .771 (see Figure 11 for the distribution).

FIGURE 11. Distribution of Spearman rank correlations between clip and song familiarity. Grey bars: Binned correlations. Orange vertical line: Median correlation.

Following the logic of the preference ratings analysis, we considered whether there was a difference in median correlation as a function of clip duration. However, as discussed above, only the immediately preceding clip was valid as a predictor of song familiarity. This meant that each correlation was based on only a few points, which made correlations of 1, 0, and -1 much more likely, as evident in Figure 12. Thus, we paired the correlation figures with corresponding mean absolute deviation numbers and distributions.

FIGURE 12. Distribution of Spearman rank correlations and mean absolute deviations between clip and song ratings by clip duration. Grey bars: Binned correlations. Orange vertical lines: Median correlation. Top row: Spearman correlations. Bottom row: Mean absolute deviation.

Consistent with the findings from the preference ratings, the relationship between clip and song familiarity was strong, but there was no significant difference between the different clip durations (see Table 3).

To conclude our analysis, we also considered the relationship between clip and song familiarity as a function of the song section from which the clip was sampled. We show these distributions in Figure 13. As the same considerations regarding correlation apply, we also present the mean absolute deviation (MAD).

FIGURE 13. Distribution of Spearman rank correlations and mean absolute deviations between clip and song familiarity ratings by the song section the clip was sampled from. Grey bars: Binned correlations. Orange vertical lines: Median correlation. Top row: Spearman correlations. Bottom row: Mean absolute deviation.

Like the duration findings, these results are consistent with the corresponding results from the preference ratings; see Table 3 for the results of the hypothesis tests.

TABLE 3. KS-test Results Corresponding to Data Presented in Figures 11 and 12

Comparison of MAD (clip vs. song familiarity)    Delta    D        p          Significant?
5s vs. 10s                                       0.0266   0.0544   .3085      ns
5s vs. 15s                                       0.0236   0.0425   .6207      ns
10s vs. 15s                                      0.0031   0.0246   .9898      ns
Intro/Outro vs. Chorus                           0.1097   0.1022   .0032      *
Intro/Outro vs. Verse                            0.0988   0.0891   .0169      ns
Intro/Outro vs. Representative                   0.1270   0.1287   8.586e-5   **
Chorus vs. Verse                                 0.0109   0.0283   .9645      ns
Chorus vs. Representative                        0.0173   0.0728   .0727      ns
Verse vs. Representative                         0.0282   0.0676   .1228      ns
Discussion

In this study, we explored whether the psychological responses to excerpts were representative of the songs from which they were sampled. We found that this was broadly the case: Both the preference and familiarity ratings of a song could be well predicted from excerpt preference and familiarity ratings. This pattern of results is remarkably consistent. Both song preference and familiarity ratings are well predicted from clips, regardless of their duration and regardless of which section of the song they were sampled from, with the exception of a significant difference between clips from the Intro or Outro vs. those from the subjectively most representative part of the song.

One strength of this research is that it was conducted with high statistical power, as we used about an order of magnitude more participants in this study than most existing experimental studies on music perception and cognition. The only exceptions are music studies that rely on survey research or on participants from Amazon Mechanical Turk, which come with their own host of inherent problems (Buchanan & Scofield, 2018). In the age of questionable replicability, power is a considerable concern (Open Science Collaboration, 2015; Wallisch, 2015).

A limitation of this work is that we were unable to determine the concordance between clip and song ratings as a function of clip duration, as the agreement was already high at the shortest duration we used (5 seconds). This is perhaps not surprising given research on "thin slices" in music perception: participants are able to recognize snippets of songs as short as 300 ms with a 25% success rate (Krumhansl, 2010). In general, the temporal fidelity of the human auditory system is exquisite; people are able to distinguish the human voice from other sources with exposure durations as short as 2 ms (Suied et al., 2013).

Thus, the shortest duration at which song rating and recognition can be perfectly predicted from clip responses will lie somewhere below 5 seconds, perhaps considerably so. However, the question of the clip duration at which predictions about song judgments dip below perfect (given the limits of reliability imposed by judgments about the song ratings) is perhaps mostly of academic interest.
An exposure duration of 5 seconds already allows the experimenter to rapidly present excerpts from many songs, and it might take participants a while to make judgments about what they are hearing. For instance, presenting a rapid stream of clips with a duration of 1 second might still yield reliable judgments about the songs, but might be too exhausting for participants if the stream is too long. Nevertheless, determining the exact slope of this "rise of the temporal kernel" could be of interest and thus constitutes an area of future research.

Another potential limitation of this research consists in the fact that we used a fully randomized design when determining the order of clips and songs. Such an approach has many advantages, as it eliminates concerns regarding temporal autocorrelations and secular response biases. However, one downside of such a design is that if a shorter clip is presented after a longer one, participants could conceivably remember—and respond to—what they heard before, not the currently presented short clip. If this effect were strong, it might limit the interpretability of the conclusion that duration did not matter. Similar concerns apply to the conclusion that the song section of origin of an excerpt did not matter. However, this would require prodigious feats of memory on the part of the participants, to keep track of all of this information across presentations from different song sections throughout the study. Indeed, our analysis presented in Figure 9 shows that this concern is unfounded empirically; even if one considers only first exposures of song excerpts, where no such contamination between excerpts is possible, we find the same results as in the other analyses.

To summarize, we believe this research has implications for the study of music perception and cognition as a whole. First, this research shows that excerpts as short as 5 seconds are already sufficiently representative in terms of the preference and familiarity ratings of the song as a whole. This means that studies on the psychology of music can potentially present many more stimuli than used in previous work without compromising how well a song is represented by the clip. Second, there are also theoretical implications for the psychology of music. For instance, the sonic properties of a song are presumably somewhat different between the chorus, the verse, and the intro. Remarkably, this does not seem to matter much. Judgments about how much someone likes a given song seem to be largely independent of the location in the song from which the clip was sampled.

Of course, simply recognizing the song could be enough to determine how much one likes it. This is plausible, given that accurate recognition of genre is very fast (Gjerdingen & Perrott, 2008; Mace et al., 2012), and people presumably know how much they like a given genre. Our results suggest that while song recognition did play a role, it did not account for all of the effects we observed. However, one caveat is that we might not have achieved a fair apples-to-apples comparison, as so many clips were unrecognized. As this study was not designed to address this question specifically, we recommend doing so in future research, with a sample of music that is more balanced in terms of popular recognition.

The strong correlation between clip and song preference ratings, independent of where in the song the clip was sampled and independent of duration, raises the question of what actually determines one's preference for a given song. What exactly underlies this—perhaps invariant sonic properties that are present throughout the entire song, or judgments of genres as a whole—should be explored in future research. It also addresses one of the central questions of Gestalt psychology, namely whether "the whole is other than the sum of its parts" (Koffka, 1935). At least in terms of some psychological qualities of popular music as measured by preference and familiarity ratings, that does not seem to be the case, as responses to the parts seem, overall, to be a good proxy for responses to the whole. Indeed, this research suggests that popular music is close to auditory textures (Ellis et al., 2011; McDermott & Simoncelli, 2011), contrasting with popular movies, which are narrative- and plot-development driven (Wallisch & Whritner, 2017).

Finally, this research could have practical implications. Music industry platforms like iTunes, Amazon, and Pandora already provide prospective buyers with a 30-second excerpt of the song, presumably chosen by expert judgment. Instead, these samples could simply be a 5-second clip randomly picked from a song segment other than the Intro or Outro.

Author Note

We would like to thank Lucy Cranmer, Warren Ersly, and Ted Coons for helpful comments on a prior version of this manuscript. We would also like to thank the Dean's Undergraduate Research Fund (DURF) at New York University for financial support of this project and Andy Hilford for providing the space for our lab to run participants.

Correspondence concerning this article should be addressed to Pascal Wallisch, New York University, 60 Fifth Avenue, Room 210, New York, NY, 10011. E-mail: pascal.wallisch@nyu.edu
References

Allain, R. (2014). Why are songs on the radio about the same length? Wired. https://www.wired.com/2014/07/why-are-songs-on-the-radio-about-the-same-length/

Anderson, N. H. (1962). Application of an additive model to impression formation. Science, 138(3542), 817–818.

Barrett, F. S., Grimm, K. J., Robins, R. W., Wildschut, T., Sedikides, C., & Janata, P. (2010). Music-evoked nostalgia: Affect, memory, and personality. Emotion, 10(3), 390–403.

Belfi, A. M., Karlan, B., & Tranel, D. (2016). Music evokes vivid autobiographical memories. Memory, 24(7), 979–989.

Belfi, A. M., Kasdan, A., Rowland, J., Vessel, E. A., Starr, G. G., & Poeppel, D. (2018). Rapid timing of musical aesthetic judgments. Journal of Experimental Psychology: General, 147(10), 1531–1543. https://doi.org/10.1037/xge0000474

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., et al. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6–10.

Buchanan, E. M., & Scofield, J. E. (2018). Methods to detect low quality data and its implication for psychological research. Behavior Research Methods. https://doi.org/10.3758/s13428-018-1035-6

Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T. H., Huber, J., Johannesson, M., et al. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2, 637–644.

De Vries, B. (1991). Assessment of the affective response to music with Clynes's sentograph. Psychology of Music, 19(1), 46–64.

Ellis, D. P., Zeng, X., & McDermott, J. H. (2011). Classifying soundtracks with audio texture features. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5880–5883). IEEE.

Gjerdingen, R. O., & Perrott, D. (2008). Scanning the dial: The rapid recognition of music genres. Journal of New Music Research, 37(2), 93–100.

Grewe, O., Kopiez, R., & Altenmüller, E. (2009). The chill parameter: Goose bumps and shivers as promising measures in emotion research. Music Perception, 27(1), 61–74.

Grewe, O., Nagel, F., Kopiez, R., & Altenmüller, E. (2007). Listening to music as a re-creative process: Physiological, psychological, and psychoacoustical correlates of chills and strong emotions. Music Perception, 24(3), 297–314.

Koffka, K. (1935). Principles of Gestalt psychology (Vol. 44). Routledge.

Kreutz, G., Ott, U., Teichmann, D., Osawa, P., & Vaitl, D. (2008). Using music to induce emotions: Influences of musical preference and absorption. Psychology of Music, 36(1), 101–126.

Krumhansl, C. L. (1997). An exploratory study of musical emotions and psychophysiology. Canadian Journal of Experimental Psychology, 51(4), 336–353.

Krumhansl, C. L. (2010). Plink: "Thin slices" of music. Music Perception, 27(5), 337–354.

Krumhansl, C. L. (2017). Listening niches across a century of popular music. Frontiers in Psychology, 8, 431.

Krumhansl, C. L., & Zupnick, J. A. (2013). Cascading "reminiscence bumps" in popular music. Psychological Science, 24(10), 2057–2068.

Lundin, R. W. (1953). An objective psychology of music. Ronald Press.

Mace, S. T., Wagoner, C. L., Teachout, D. J., & Hodges, D. A. (2012). Genre identification of very brief musical excerpts. Psychology of Music, 40(1), 112–128.

McCown, W., Keiser, R., Mulhearn, S., & Williamson, D. (1997). The role of personality and gender in preference for exaggerated bass in music. Personality and Individual Differences, 23(4), 543–547.

McDermott, J. H., & Simoncelli, E. P. (2011). Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis. Neuron, 71(5), 926–940.

Mehr, S. A., Singh, M., York, H., Glowacki, L., & Krasnow, M. M. (2018). Form and function in human song. Current Biology, 28(3), 356–368.e5. https://doi.org/10.1016/j.cub.2017.12.042

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.

Peretz, I., Gaudreau, D., & Bonnel, A. M. (1998). Exposure effects on music preference and recognition. Memory and Cognition, 26(5), 884–902.

Prince, J. B. (2011). The integration of stimulus dimensions in the perception of music. The Quarterly Journal of Experimental Psychology, 64(11), 2125–2152.

Rentfrow, P. J., Goldberg, L. R., & Levitin, D. J. (2011). The structure of musical preferences: A five-factor model. Journal of Personality and Social Psychology, 100, 1139–1157.

Schäfer, T., & Sedlmeier, P. (2010). What makes us like music? Determinants of music preference. Psychology of Aesthetics, Creativity, and the Arts, 4(4), 223–234.

Spivack, S., Philibotte, S. J., Spilka, N. H., Passman, I., & Wallisch, P. (2018). How fleeting is fame? Collective memory for popular music. Preprint, PsyArXiv. https://doi.org/10.31234/osf.io/tdfyc

Suied, C., Agus, T. R., Thorpe, S. J., & Pressnitzer, D. (2013). Processing of short auditory stimuli: The rapid audio sequential presentation paradigm (RASP). In B. C. J. Moore (Ed.), Basic aspects of hearing (pp. 443–451). Springer.

Vuoskoski, J. K., Thompson, W. F., McIlwain, D., & Eerola, T. (2012). Who enjoys listening to sad music and why? Music Perception, 29(3), 311–317.

Wallisch, P. (2015). Brighter than the sun: Powerscape visualizations illustrate power needs in neuroscience and psychology. Preprint, arXiv. https://doi.org/10.48550/arXiv.1512.09368

Wallisch, P., & Whritner, J. A. (2017). Strikingly low agreement in the appraisal of motion pictures. Projections, 11(1), 102–120.

Warrenburg, L. A. (2020). Choosing the right tune: A review of music stimuli used in emotion research. Music Perception, 37(3), 240–258.

Zentner, M., Grandjean, D., & Scherer, K. R. (2008). Emotions evoked by the sound of music: Characterization, classification, and measurement. Emotion, 8(4), 494–521.