The Whole Is Not Different From Its Parts: Music Excerpts Are Representative of Songs
Key words: music, individual differences, rating, recognition, methodology

Music Perception, Volume 40, Issue 3, pp. 220–236, ISSN 0730-7829, electronic ISSN 1533-8312. © 2023 by The Regents of the University of California. All rights reserved. DOI: https://doi.org/10.1525/MP.2023.40.3.220

Music can evoke affective responses, which has been studied empirically. Some of the work on the elicitation of emotions with music employs whole songs (Grewe et al., 2007; Grewe et al., 2009), whereas other studies utilize brief excerpts (De Vries, 1991; Kreutz et al., 2008; Krumhansl, 1997). A related line of research pertains to the question of what determines music preference more generally. Again, some use whole songs (Schäfer & Sedlmeier, 2010), whereas others use excerpts to address this question (Rentfrow et al., 2011). There are good reasons for

To our knowledge, the answer to the first question has not been studied empirically. Whereas some studies have investigated how quickly participants can make reliable aesthetic judgments about excerpts (Belfi et al., 2018), none have explicitly compared the rating of excerpts to whole songs. In other words, the answer to this question remains open. Meanwhile, the field simply relies on the assumption that excerpts are in fact representative of songs. This might be the case, but there are concerns about both internal and external validity. For instance, it is conceivable that the long-term temporal structure of a song matters in a way that can, in principle, not be captured by brief excerpts. In terms of external validity, most people listen to whole songs, not excerpts, when listening to music. Therefore, results
Brief Clips are Reliable Estimators of Whole Songs 221
from the study of "music" that relies solely on excerpts could be misleading if this assumption does not hold.

In terms of the second question, there is a wide range of stimulus durations commonly used in music research that utilizes excerpts (also referred to as "selections," "snippets," or "clips" in some of these publications). This duration range spans from around 5 seconds (Krumhansl, 2017; Krumhansl & Zupnick, 2013), to about 9 seconds (Peretz et al., 1998), to a "very short" 14 seconds (Mehr et al., 2018), to 15 seconds (Belfi et al., 2016; Rentfrow et al., 2011), to 30 seconds (McCown et al.,

through flyers, email lists, and advertisements in classes. We used the data from all 638 participants (> 99%) who finished the study for the analysis presented here.

MATERIALS

Song Selection

We selected 260 songs in total, with the goal of achieving a selection that was representative of music styles commonly heard by U.S. listeners, including a wide range of genres, time periods, and popularity. Thus, we included 152 pop songs from the Billboard popular music charts
For example, if the output of the program indicated to start the chorus clips at 0:23 minutes but this cut off a phrase, we might start the clips at 0:24.1 minutes to avoid cutting off the start of the music phrase, such as when the lyrics begin or at the downbeat of the first measure. We balanced the actual starting points, ensuring that there were an even number of actual starting points before and after the target starting points given by the MATLAB program, such that the actual starting point deviations from the suggested starting point for each song were normally distributed with a mean of zero.

For instance, some songs did not have a chorus or clearly delineated sections at all. For these songs, we used the temporal distribution of the clips from the songs that had clearly delineated sections, in order to create a statistical distribution of starting points that matched that of the songs with a standard structure. In other words, we found sections of these songs that corresponded to the time point at which the chorus or verse typically started, if there were a chorus or verse.

For each song, 9 of the 12 clips were created in the systematic way detailed above (3 sections per clip duration).
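The starting-point balancing described above can be sketched as follows. The suggested starting points and the deviation spread (SD = 1.1 s) are made-up values for illustration, not those from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical program-suggested clip starting points (in seconds) for one
# set of clips; the values and the deviation spread are illustrative only.
suggested = np.array([23.0, 41.5, 67.2, 15.8, 30.0, 52.6])

# Draw deviation magnitudes, then alternate their signs so that equally many
# actual starting points fall before and after the suggested points, keeping
# the deviations symmetric around a mean of zero across the stimulus set.
magnitudes = np.abs(rng.normal(loc=0.0, scale=1.1, size=suggested.size))
signs = np.where(np.arange(suggested.size) % 2 == 0, 1.0, -1.0)
deviations = signs * magnitudes
actual = suggested + deviations

print((deviations > 0).sum(), (deviations < 0).sum())  # → 3 3
```

In the study itself the adjustments were made by hand to avoid cutting off musical phrases; the sketch only captures the balancing constraint.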
FIGURE 1. Schematic of how the excerpts were sampled from a song, exemplified by Britney Spears' "…Baby One More Time". The graph is the sonogram; the red bars indicate the sampling window. Horizontal black bars represent clip durations.
To summarize, we created 12 clips for each of the 260 songs. For standard songs, we picked clips from a chorus, verse, intro, or outro section, as well as a subjectively chosen section, with durations of 15s, 10s, and 5s for each. For songs with a non-standard structure, we picked clips from sections in the song that statistically corresponded to the typical starting points of chorus, verse, and intro/outro of standard songs.

Auditory "Palate Cleanser"

As each participant listened to numerous songs and clips (a total of 192) throughout the experiment, we

selections such that a clip from a song cannot immediately precede or follow the song from which it came) and alternated with 191 behavioral "palate cleansers" for a total of 383 trials per participant.

To operationalize affective (preference) and cognitive (familiarity/recognition) judgments, respectively, participants were asked in each music trial to rate how much they liked the current song or clip on a 7-point Likert scale ("Hate it," "Strongly dislike it," "Slightly dislike it," "Indifferent," "Slightly like it," "Strongly like it," "Love it") as well as to rate their familiarity on a 5-point Likert
and that participants used the rating scale properly (see Figure 2). The mean song preference rating was 4.23, with a standard deviation of 0.95, while the mean clip preference rating was 4.13, with a standard deviation of 0.87. Neither distribution significantly deviated from a normal distribution, as tested by a Shapiro-Wilk test: SW = 0.9991, p = .99 for clips, and SW = 0.9958, p = .082 for songs. Thus, our preference rating distributions are not statistically distinguishable from normal distributions.

Results

The main question we attempted to answer in this study was whether preference and familiarity ratings in response to brief music excerpts were representative of the whole songs from which they were sampled. To address this question, we correlated the average clip preference rating of participants with their song preference rating. The median Spearman rank correlation was .834 (see Figure 3 for the distribution).

As only 12 points went into any given correlation (per participant), we were concerned whether such a median could plausibly be obtained by chance. Thus, we randomly shuffled the ratings data 500,000 times. Doing so, the highest median correlation we obtained was .0724, implying an exact p value of 0 (none of the 500,000 shuffles approached the observed value). We therefore concluded that this median correlation was both statistically significant and substantial. Put differently, clip preference ratings did seem to predict song preference ratings well.

However, there was an unaddressed confound in this analysis. We simply correlated the song preference ratings with the clip preference ratings, regardless of where in the presentation sequence the song appeared. In other words, what might appear as prediction (predicting the song preference rating from the clip preference rating) might in actuality be post-diction. This is not a trivial point, as the task, from the perspective of the participant, was quite different psychologically. In the first case, there were large parts of the song that they had not heard yet, if they had just heard a brief excerpt. In the second case, they simply had to realize that the clip was part of a song they had already heard. Thus, we further refined this analysis by whether the clips occurred before or after the song (see Figure 4).

Our intuition as to the different nature of the responses appeared to be correct. The median Spearman correlation for clips presented after the song was significantly higher than that for clips presented before the song, as assessed by a KS-test (Δ = 0.091, D = 0.279, p = 2.68e-22).

Taken together, the correlation between clip and song ratings was high, indicative of high intra-song reliability of appraisal. But did this depend on the length of the clip? Intuitively, it must: from first principles, the
longer an organism integrates information, the more reliable a judgment will be (Anderson, 1962). However, it is possible that humans are highly efficient integrators of this information. In other words, the accuracy of judgments might have already saturated at the beginning of the range of excerpt durations commonly used in music psychology research (around 5 s). We explore this question in Figure 5.

Median Spearman correlations between the clip and song preference ratings were .74, .76, and .78 for clips of 5s, 10s, and 15s duration, respectively, if the clip was presented before the song. Median Spearman correlations between the clip and song preference ratings were .84, .86, and .86 for clips of 5s, 10s, and 15s duration, respectively, if the clip was presented after the song. We performed KS-tests to assess whether these differences were statistically significant (see Table 1).

To summarize this pattern of results, all differences between clip durations, whether before or after the song, were not significant. In contrast, all differences between clips presented before and after the song (for a given clip duration) were statistically significant. We concluded that clip duration was not a significant factor when predicting song preference rating from clip preference rating. However, just as in the analysis where we did not disaggregate by duration, ratings from clips presented after the song were more strongly correlated with the song preference ratings.

Whereas it appears that there was no significant difference between the clip durations that are commonly used in music research, one might wonder if it matters from which part of the song the clip is sampled. For instance, it is conceivable that there could be a considerable difference in the correlation between clips and song for clips taken from the chorus versus clips taken from other sections of the song. We explored this question in Figure 6.

Median Spearman correlations between the clip and song preference ratings were .72, .76, .75, and .76 for clips from Intro/Outro, Chorus, Verse, and a subjectively maximally representative portion of the song, respectively, if the clip was presented before the song. Median Spearman correlations between the clip and song preference ratings were .83, .85, .84, and .85 for clips from Intro/Outro, Chorus, Verse, and a subjectively maximally representative portion of the song, respectively, if the clip was presented after the song. We performed KS-tests to assess whether these differences were statistically significant and present the results in Table 2.
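A comparison of this kind can be sketched with a two-sample KS statistic over per-participant correlation distributions. The correlation values below are simulated stand-ins (the centers and spreads are assumptions), not the study's data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated per-participant clip-song correlations: "after" clips centered
# higher than "before" clips (centers and spreads are illustrative only).
r_before = np.clip(rng.normal(0.76, 0.12, size=638), -1.0, 1.0)
r_after = np.clip(rng.normal(0.85, 0.10, size=638), -1.0, 1.0)

def ks_statistic(x, y):
    """Two-sample KS statistic: maximum distance between the empirical CDFs."""
    grid = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), grid, side="right") / x.size
    cdf_y = np.searchsorted(np.sort(y), grid, side="right") / y.size
    return float(np.max(np.abs(cdf_x - cdf_y)))

D = ks_statistic(r_before, r_after)
print(D)
```

In practice one would also attach a p value to D (e.g., via scipy.stats.ks_2samp), as was done for the tests reported in the tables.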
226 Sara J. Philibotte, Stephen Spivack, Nathaniel H. Spilka, Ian Passman, & Pascal Wallisch
Again, a consistent pattern of results emerges. Broadly speaking, it did not matter from which section in the song the clips were sampled. The only exception to this was the comparison between clips from the Intro or Outro and the subjectively most representative clips, with the latter being slightly more strongly correlated with the song preference ratings. This is plausible, as the Intro or Outro can be quite drastically different from the rest of the song in terms of acoustic properties.

Again, there were strong and statistically significant differences for clips presented before vs. after the song, with clips presented after the song being consistently more strongly correlated with the song preference ratings, regardless of the song section from which the clip came.

This is consistent with the idea that once participants encountered a particular song, any subsequent clip can serve as a cue to evoke the memory of the feelings that the whole song elicits. It is also possible that participants had encountered the whole song before the study. In that case, a clip could act as a retrieval cue to the memory of the whole song, whenever that was heard (Spivack et al., 2019). In other words, the effects laid out above might well be mediated by recognition. In fact, it is possible that what drives preference ratings is not the acoustic properties of the song per se, but the entire context and associations that were encoded when first encountering the song prior to the experiment. If this were the case, we would predict that such an effect would boost the correlation between clip and song preference ratings, as these memories could be used to rate the clip, consistent with the previously encountered song. Thus, this correlation should be higher for clips where participants recognize the song compared to those where they do not.

There are indications that are consistent with this model. For instance, we recorded the time it took for participants to click on the "next trial" button. Looking at the reaction times for 5s clips is fair, as participants could already report their preference ratings while the clip was still playing, but could not continue to the next trial until the clip had finished playing. Looking at the reaction times for these clips, the median reaction time for unrecognized clips was 7.16s, whereas it was 6.76s for recognized clips. This difference was statistically significant, as evaluated with a Mann-Whitney U test (ranksum = 360980, p = 9.38e-7). Thus, it took longer to rate an unrecognized song than a recognized one,
perhaps due to the fact that the memory serves as a shortcut, which is remarkable, as it takes some time to click the "I recognize the clip" checkbox whereas it takes no time to simply leave it unclicked (the default). Thus, everything else being equal, we would expect the unrecognized clip trials to be completed faster.

TABLE 1. KS-test Results Corresponding to Data Presented in Figure 5

Comparison of clip durations relative to song   Delta    D       p          Significant?
5s Before vs. 10s Before                        0.011    0.062   .171       ns
5s Before vs. 15s Before                        0.034    0.088   .0143      ns
10s Before vs. 15s Before                       0.023    0.061   .176       ns
5s After vs. 10s After                          0.014    0.050   .409       ns
5s After vs. 15s After                          0.014    0.050   .389       ns
10s After vs. 15s After                         5.5e-6   0.039   .720       ns
5s Before vs. 5s After                          0.099    0.22    5.60e-14   ***
10s Before vs. 10s After                        0.101    0.26    1.82e-19   ***
15s Before vs. 15s After                        0.079    0.23    1.03e-14   ***

Moreover, there is information in the clip recognition data (see Figure 7). As one can see, song recognition can be predicted from the number of clips that are recognized and is well described by a logistic function (β0 = 4.12, β1 = 0.64). However, this analysis integrated information over all participants and clips and was challenging to do on a per-participant level, as recognition rates for quite a few of our songs were close to zero. This is not surprising, given that we took a substantial portion of our sample from the Rentfrow et al. (2011) corpus, which was deliberately sampled to consist of obscure music. Overall, only 20% of the clips were reported to be recognized by our participants. Therefore, instead of calculating the average clip-to-song correlation, we calculated the mean absolute deviation in this analysis (see Figure 8).

We attempted to tease apart the effects of recognition and whether clips were presented before or after the song. The mean absolute deviation (MAD) was calculated as the mean absolute difference between song and clip ratings. As one can see, the MAD was lowest for recognized clips presented after the song. The prediction accuracy of clips presented before the song but that were recognized was roughly equal to
that of unrecognized clips presented after the song. Unrecognized clips that were presented before the song were least predictive of the song preference rating, but still far more predictive than one would expect from random chance. We concluded that both recognition and presentation order seem to play about equally strong roles in increasing song predictability from clips, with short-term and long-term memory serving as plausible mechanisms underlying this effect.
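The mean absolute deviation used in this analysis is simple to compute. The rating pairs below are hypothetical examples on the 7-point scale, not data from the study.

```python
import numpy as np

# Hypothetical (song, clip) preference ratings on the 7-point scale for a
# handful of trials in one condition (e.g., recognized clips after the song).
song_ratings = np.array([5, 3, 6, 4, 7, 2], dtype=float)
clip_ratings = np.array([4, 3, 7, 4, 6, 1], dtype=float)

# Mean absolute deviation (MAD): average absolute song-clip rating difference.
mad = np.mean(np.abs(song_ratings - clip_ratings))
print(mad)  # 4/6 ≈ 0.667
```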
FIGURE 8. Bar graph of mean absolute deviations as a function of whether clips were presented before or after the song, and whether they were
recognized or not. Orange error bars represent the standard error of the mean. For comparison, we also included a bar that corresponds to randomly
shuffled ratings.
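The shuffled-ratings bar in Figure 8 corresponds to a chance baseline, which can be sketched by permuting the clip ratings before recomputing the MAD. All numbers below are simulated, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated ratings in which clip ratings genuinely track song ratings.
song = rng.integers(1, 8, size=1000).astype(float)
clip = np.clip(song + rng.normal(0.0, 1.0, size=1000), 1.0, 7.0)

observed_mad = np.mean(np.abs(song - clip))

# Chance baseline: break the song-clip pairing by shuffling the clip ratings,
# then recompute the mean absolute deviation many times.
shuffled_mads = np.array(
    [np.mean(np.abs(song - rng.permutation(clip))) for _ in range(1000)]
)

print(observed_mad < shuffled_mads.min())  # → True
```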
It is somewhat surprising that neither the duration of the excerpt (within a three-fold window of commonly used excerpt durations, from 5 to 15 seconds) nor the song section that the excerpt came from appreciably affected the predictability of song preference ratings from the preference ratings of excerpts (with the exception of the contrast between intro/outro and representative ratings). With this in mind, note that we presented the excerpts in random order throughout the experiment. Given this design, it is possible that there was cross-contamination between the excerpts; for

a function of duration and song section. As we can only use the very first instance of the excerpts, we computed the root mean-squared error (RMSE) across all participants and songs and then bootstrapped confidence intervals. To mimic the significance level used in the rest of this publication and to guard against multiple comparison concerns, we computed 99% confidence intervals. As is evident in Figure 9, the concerns about cross-contamination between excerpts were not borne out empirically, as these results closely mirror those from previous analyses presented above.
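A bootstrapped 99% confidence interval for the RMSE can be sketched as below. The rating pairs are simulated stand-ins, and the resampling count (10,000) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated song and first-excerpt preference ratings (stand-ins for data).
song = rng.integers(1, 8, size=638).astype(float)
excerpt = np.clip(song + rng.normal(0.0, 0.8, size=638), 1.0, 7.0)

errors = song - excerpt
rmse = float(np.sqrt(np.mean(errors**2)))

# Bootstrap: resample the errors with replacement, recompute the RMSE each
# time, and take the 0.5th and 99.5th percentiles as a 99% confidence interval.
boot = np.empty(10_000)
for i in range(boot.size):
    sample = errors[rng.integers(0, errors.size, size=errors.size)]
    boot[i] = np.sqrt(np.mean(sample**2))

lo, hi = np.percentile(boot, [0.5, 99.5])
print(lo < rmse < hi)  # → True
```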
FIGURE 9. Predictability of song preference ratings from excerpt preference ratings, based on the very first pre-song excerpt only. Upper panel: The x-axis denotes the duration of the excerpts. The y-axis indicates the root mean squared error (RMSE) between excerpt and song preference ratings. The orange error bars represent the bootstrapped 99% confidence interval. Lower panel: Like the upper panel, but the x-axis now denotes the section of the song the excerpt was extracted from.
mattered; in other words, listening to the song had a much larger impact on the reported familiarity than a clip. This is plausible, as the average song was 21 times longer than the average clip. As both observations impact any possible relationship between clips and songs, we restricted our analysis to clips that immediately preceded a song. We believe this was a fair comparison, as their average familiarity was comparable.

Thus, to answer the question of whether song familiarity can be predicted from clip familiarity, we calculated the median Spearman correlation between songs and the corresponding clips that immediately preceded them, which was .771 (see Figure 11 for the distribution).

Following the logic of the preference ratings analysis, we considered whether there was a difference in median correlation as a function of clip duration. However, as discussed above, only the immediately preceding clip was valid as a predictor of song familiarity. This meant that each correlation was based on only a few points, which made correlations of 1, 0, and -1 much more likely, as evident in Figure 12. Thus, we paired the correlation figures with corresponding mean absolute deviation numbers and distributions.

Consistent with the findings from the preference ratings, the relationship between clip and song familiarity was strong, but there was no significant difference between different clip durations (see Table 3).

To conclude our analysis, we also considered the relationship between clip and song familiarity as a function of the song section from which the clip was sampled. We show these distributions in Figure 13. As the same considerations regarding correlation apply, we also present the mean absolute deviation (MAD).

Like the duration findings, these results are consistent with the corresponding results from the preference ratings; see Table 3 for the results of the hypothesis tests.

Discussion

In this study, we explored whether the psychological responses to excerpts were representative of the songs from which they were sampled. We found that this was the case, broadly speaking: Both the preference and familiarity ratings of a song could be well predicted from excerpt preference and familiarity ratings. This pattern of results is remarkably consistent. Both song preference and familiarity ratings are well predicted from clips, regardless of their duration, and regardless of which section in the song they were sampled from, with the exception of a significant difference between songs from the Intro or Outro vs. those from the subjectively most representative part of the song.

One strength of this research is that it was conducted with high statistical power, as we used about an order of magnitude more participants in this study than most
FIGURE 12. Distribution of Spearman rank correlations and mean absolute deviations between clip and song ratings by clip duration. Grey bars:
Binned correlations. Orange vertical lines: Median correlation. Top row: Spearman correlations. Bottom row: Mean absolute deviation.
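The coarseness of rank correlations based on only a few points, visible in Figure 12, can be illustrated by enumeration: with three distinct-valued points, only four Spearman values are possible at all, so extreme correlations arise easily.

```python
from itertools import permutations

import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

# Enumerate every ordering of y against a fixed 3-point x: the correlation
# can only take a handful of discrete values.
x = [1, 2, 3]
values = sorted({round(spearman(x, list(p)), 2) for p in permutations([1, 2, 3])})
print(values)  # → [-1.0, -0.5, 0.5, 1.0]
```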
existing experimental studies on music perception and cognition. The only exceptions are music studies that rely on survey research or on participants from Amazon mTurk, which come with their own host of inherent problems (Buchanan & Scofield, 2017). In the age of questionable replicability, power is a considerable concern (Open Science Collaboration, 2015; Wallisch, 2015).

A limitation of this work is that we were unable to determine the concordance between clip and song ratings as a function of clip duration, as the agreement was already high at the shortest duration we used (5 seconds). This is perhaps not surprising given research on "thin slices" in music perception: participants are able to recognize snippets of songs that are as short as 300 ms with a 25% success rate (Krumhansl, 2010). In general, the temporal fidelity of the human auditory system is exquisite; people are able to distinguish the human voice from other sources with exposure durations as short as 2 ms (Suied et al., 2013).

TABLE 3. KS-test Results Corresponding to Data Presented in Figures 11 and 12

Comparison of MAD (clip vs. song familiarity)   Delta     D        p          Significant?
5s vs. 10s                                      0.0266    0.0544   .3085      ns
5s vs. 15s                                      0.0236    0.0425   .6207      ns
10s vs. 15s                                     0.0031    0.0246   .9898      ns
Intro/Outro vs. Chorus                          0.1097    0.1022   .0032      *
Intro/Outro vs. Verse                           0.0988    0.0891   .0169      ns
Intro/Outro vs. Representative                  0.1270    0.1287   8.586e-5   **
Chorus vs. Verse                                0.0109    0.0283   .9645      ns
Chorus vs. Representative                       0.0173    0.0728   .0727      ns
Verse vs. Representative                        0.0282    0.0676   .1228      ns

Thus, the shortest duration at which song rating and recognition can be perfectly predicted from clip responses will lie somewhere below 5 seconds, perhaps considerably so. However, the question of the clip duration at which predictions about song judgments dip from perfect (given the limits of reliability imposed by judgments about the song ratings) is perhaps mostly of
academic interest. An exposure duration of 5 seconds already allows the experimenter to rapidly present excerpts from many songs, and it might take participants a while to make judgments about what they are hearing. For instance, presenting a rapid stream of clips with a duration of 1 second might still yield reliable judgments about the songs, but might be too exhausting for participants if this stream is too long. Nevertheless, determining the exact slope of this "rise of the temporal kernel" could be of interest and thus constitute an area of future research.

is very fast (Gjerdingen & Perrott, 2008; Mace et al., 2012), and people presumably know how much they like a given genre. Our results suggest that while song recognition did play a role, it did not account for all of the effects we observed. However, one caveat is that we might not have achieved a fair apples-to-apples comparison, as so many clips were unrecognized. As this study was not designed to address this question specifically, we recommend doing so in future research, with a more balanced sample of music in terms of popular recognition.
References
Allain (2014). Why are songs on the radio about the same length? Wired. https://www.wired.com/2014/07/why-are-songs-on-the-radio-about-the-same-length/

Anderson, N. H. (1962). Application of an additive model to impression formation. Science, 138(3542), 817–818.

Barrett, F. S., Grimm, K. J., Robins, R. W., Wildschut, T., Sedikides, C., & Janata, P. (2010). Music-evoked nostalgia: Affect, memory, and personality. Emotion, 10(3), 390.

Kreutz, G., Ott, U., Teichmann, D., Osawa, P., & Vaitl, D. (2008). Using music to induce emotions: Influences of music preference and absorption. Psychology of Music, 36(1), 101–126.

Krumhansl, C. L. (1997). An exploratory study of music emotions and psychophysiology. Canadian Journal of Experimental Psychology, 51(4), 336–353.

Krumhansl, C. L. (2010). Plink: "Thin slices" of music. Music Perception.

Suied, C., Agus, T. R., Thorpe, S. J., & Pressnitzer, D. (2013). Processing of short auditory stimuli: The rapid audio sequential presentation paradigm (RASP). In B. C. J. Moore (Ed.), Basic aspects of hearing (pp. 443–451). Springer.

Vuoskoski, J. K., Thompson, W. F., McIlwain, D., & Eerola, T. (2012). Who enjoys listening to sad music and why? Music Perception, 29(3), 311–317.

Wallisch, P. (2015). Brighter than the sun: Powerscape visualizations illustrate power needs in neuroscience and psychology. Preprint from arXiv. https://doi.org/10.48550/arXiv.1512.09368

Wallisch, P., & Whritner, J. A. (2017). Strikingly low agreement in the appraisal of motion pictures. Projections, 11(1), 102–120.

Warrenburg, L. A. (2020). Choosing the right tune: A review of music stimuli used in emotion research. Music Perception, 37(3), 240–258.

Zentner, M., Grandjean, D., & Scherer, K. R. (2008). Emotions evoked by the sound of music: Characterization, classification, and measurement. Emotion, 8(4), 494.