Professional Documents
Culture Documents
Influence of The Listening Environment On Recognition of Immersive Reproduction of Orchestral Music Sound Scenes
Influence of The Listening Environment On Recognition of Immersive Reproduction of Orchestral Music Sound Scenes
PAPERS
S. Kim and W. Howie, “Influence of the Listening Environment
on Recognition of Immersive Reproduction of Orchestral Music Sound Scenes”
J. Audio Eng. Soc., vol. 69, no. 11, pp. 834–848, (2021 November).
DOI: https://doi.org/10.17743/jaes.2021.0035
Electrical, Computer and Telecommunication Engineering Technology, Rochester Institute of Technology, Rochester, NY
This study investigates how a listening environment (the combination of a room’s acoustics
and reproduction loudspeaker) influences a listener’s perception of reproduced sound fields.
Three distinct listening environments with different reverberation times and clarity indices were
compared for their perceptual characteristics. Binaural recordings were made of orchestral
music, mixed for 22.2 and 2-channel audio reproduction, within each of the three listening
rooms. In a subjective listening test, 48 listeners evaluate these binaural recordings in terms
of overall preference and five auditory attributes: perceived width, perceived depth, spatial
clarity, impression of being enveloped, and spectral fidelity. Factor analyses of these five
attribute ratings show that listener perception of the reproduced sound fields focused on two
salient factors, spatial and spectral fidelity, yet the attributes’ weightings in those two factors
differs depending on a listener’s previous experience with audio production and 3D immersive
audio listening. For the experienced group, the impression of being enveloped was the most
salient attribute, with spectral fidelity being the most important for the non-experienced group.
834 J. Audio Eng. Soc., Vol. 69, No. 11, 2021 November
PAPERS INFLUENCE OF THE LISTENING ENVIRONMENT
of a room’s acoustics influences their experience hearing the reproduction room, and sound reproduction medium.
reproduced sound fields. The authors call this cognitive Among those three aspects, the acoustics of the reproduc-
room conflict “room divergence.” The listener’s previous tion room and sound reproduction medium are determined
experience could form an expectation-oriented bias [8] and by an end-user’s circumstances, resulting in numerous vari-
modify their listening experience of any reproduced sound ations.
field within a given space. To avoid or at least reduce influence associated with
Kaplanis et al. conducted a systematic study [9] that in- such variations, there has been an effort to create industry
vestigated listeners’ perception and preference of reverber- standards for listening room acoustics and monitoring char-
ation in rooms for two-channel reproduction. They presume acteristics, including [13] and [14]. With those standards,
that room acoustics and associated reverberations cause the researchers and developers can conduct controllable and re-
reproduced sound scene to be distorted, “as the recorded peatable experiments (acoustic measurements and psychoa-
signal is superimposed with the spatiotemporal response of coustic observations) and provide meaningful guidelines
the reproduction room.” To determine the perceptual influ- for listeners in non-standard situations. It is worthwhile
ences of room reverberations, they captured nine distinct to mention again that the room acoustics and loudspeaker
spatial room impulse responses for two-channel reproduc- characteristics are intertwined, influencing a listener’s per-
tion from four listening rooms using a 3D vector intensity ceptual and cognitive responses. In this paper, therefore, the
probe. Subsequently the captured room impulse responses authors treat the interaction of those two aspects as a single
were processed to playback via a 43-channel loudspeaker variable, to be referred to as the “listening environment.”
array in an anechoic chamber. This “auralization” tech-
nique allowed listeners to virtually experience nine target 0.2 The New Era of Immersive Audio
room acoustics and compare them in real time. The study We are currently experiencing an extensive, fast growth
found that perception of “Reverberance” is highly corre- of “immersive” audio technologies and research.1 This is
lated with “Width & Envelopment,” negatively correlated also true for the broader area of virtual reality, which in-
with “Proximity,” and decorrelated with (perceptual domi- cludes the paradigms known as augmented reality, mixed
nance of) “Bass.” In addition, the results show that listeners reality, and extended reality often collectively called “XR
prefer high proximity and less reverberant room acoustics technologies.” How applicable will existing guidelines for
for two-channel reproduction. sound reproduction be for these new types of listening ex-
Moreover, Toole has exhaustively researched room inter- periences? Slater [16] asserts that a visual immersive ex-
actions between loudspeakers and loudspeaker-reproduced perience is directly related to the peripheral system capac-
sound fields [10]. Through a series of controlled listening ity, i.e., the fidelity that preserves “equivalent real-world
experiments and associated psychoacoustic models, Toole sensory modalities.” Therefore it follows that reproduction
discovered how listeners’ affective responses and percep- channels may serve as the peripheral system capacity re-
tual judgments are influenced by peripherals in the listening quired for auditory immersion, which has been a motivation
chain. Toole specifically showed how the combination of for the development of multi-layer loudspeaker reproduc-
loudspeakers’ radiation patterns and wall reflections within tion to properly incorporate sonic information from above
the listening room affect both timbral and spatial attributes (and sometimes below) the listeners [17, chap. 7].
of reproduced music. Oode et al. [18] found that for sound reproduction sys-
Toole coined the term “Circle of Confusion” to explain tems that include height channels, the sensation of listener
the “never-ending cycle of subjectivity” [10, p. 19] within envelopment (LEV) increases as the number of loudspeak-
audio production, listening, and evaluation. Essentially, a ers increases. 22.2 Multichannel Sound (22.2), an example
lack of technical standards within audio control room de- of one of several standardized immersive audio formats,
sign has lead to a situation in which “recordings vary, utilizes nine loudspeakers in the upper layer, 10 loudspeak-
sometimes quite widely, in their sound quality, spectral ers in the middle layer, and three loudspeakers in the bot-
balances, and imaging.” A recording is made while mon- tom (floor) layer. Originally proposed by the Science and
itoring with a certain set of loudspeakers, while one or Technology Research Laboratories of the Japan Broadcast-
many entirely different set(s) of loudspeakers are used in ing Corporation (NHK) in 2003, the system is designed to
the reproduction and subjective evaluation of said record- provide audiences with a highly precise, natural, and fully
ing. This is further complicated by interactions between “immersive” experience [19]. A key idea behind many such
monitor speakers and room acoustics that may modify the large-scale 3D audio reproduction systems is that as the
reproduced sound field of the original recording, thus ren- number of sound reproduction locations increases, we ap-
dering the subsequent evaluation process erroneous. For proach a point where one could capture and reproduce all
five-channel stereophonic audio recording and reproduc- of the details that make up a space’s unique sonic “finger-
tion [11], Toole’s “Circle of Confusion” is equally applica- print” as independently as possible. This may imply that
ble: there is a strong “interaction of multiple loudspeakers
and the listening room when those loudspeakers are repro-
ducing a multichannel recording” [10, p. 100][12]. 1
A discussion on the definition of auditory immersion and im-
A reproduced sound field, therefore, should be under- mersive experience is beyond the scope of this topic. Interested
stood as a complex aural product of a three-way interaction readers are advised to read [15], which is an excellent review and
across the acoustics of the recording venue, acoustics of proposal of the concept of “immersion.”
J. Audio Eng. Soc., Vol. 69, No. 11, 2021 November 835
KIM AND HOWIE PAPERS
Power (dB)
-30
are calculated from the impulse response between the front
center loudspeaker and listening position. -45
-60
Room Width Length Height Volume T30 C80 -75
Power (dB)
-30
-45
room acoustics become less influential for a listening envi- -60
ronment that possesses a greater number of loudspeakers, -75
eventually becoming a nuisance variable.
The current study is based on the following research 63 125 250 500 1k 2k 4k 8k 16k
question: Frequency (Hz)
ROOM 3
Power (dB)
-30
By manipulating immersive audio content and the repro-
-45
duction system, can we reduce undesired room-loudspeaker
interactions and corresponding variations in listener per- -60
ception of stimuli across various sonic attributes? More -75
simply, when compared with stereo reproduction, does im- 63 125 250 500 1k 2k 4k 8k 16k
mersive reproduction deliver a listening experience that Frequency (Hz)
reduces the influence of the listening environment on the
listening experience? Fig. 1. Power spectra at the listening position of the three listening
environments. These spectra are results from an interaction of both
the room acoustics and 22 loudspeakers.
To investigate this question, three listening environments,
along with a reference anechoic chamber, were compared.
22.2 was the sound reproduction format used in each space.
The proceeding section introduces the listening rooms’ reverberation time than ITU-R BS.1116 recommends: the
acoustic profiles as well as reproduction systems. This is space was originally designed as a recording studio. Back-
followed by a description of the main experiment and a ground noise and floor area, however, are compliant with
subjective test of those rooms and describes the creation of ITU-R BS.1116. Room 3 is equipped with 22 KS Digital
stimuli, the participants, and test results. Finally a summary C5 Powered Studio Monitors.
is provided of the study’s implications for understanding Distinct differences were found within the rooms’ rever-
the relationship between physical and perceptual factors beration times (T30), as shown in the second right-most
within a listener’s recognition of 3D sound fields, based column of Table 1. Ashok et al. [20] previously simu-
on the coupling of listening room acoustics and 3D sound lated these three rooms using CATT-Acoustics software
reproduction systems. and found that clarity (C80), among many other measured
acoustic characteristics, was the parameter that differs the
1 PHYSICAL CHARACTERISTICS OF LISTENING most across the three listening rooms. C80 compares the
ENVIRONMENTS ratio between energy within the first 80 ms of the impulse
response with energy from the remaining part of the impulse
We chose three listening rooms, labeled as 1, 2, and 3, response and has been found to relate well to perceived clar-
that show distinct acoustic profiles and represent various ity within a space [21]. Results of the current measurements
listening scenarios ranging from a professional mixing sit- (the right-most column of Table 1) correspond well with the
uation to causal domestic listening. previous simulation results, showing that the three rooms
The acoustic characteristics of the three rooms are sum- differ within the clarity metric. Specifically Room 2 does
marized in Table 1. Room 1’s dimension ratios, reverb time, not properly diffuse nor absorb acoustic energies, resulting
and background noise level fulfill ITU-R BS.1116 [13] re- in an overabundance of reflected sound after the first 80
quirements for multichannel audio critical listening envi- ms of the impulse response. While the room is equipped
ronments. This room is equipped with a total of 28 Musik- with several bass traps to reduce unwanted low-frequency
electronic Geithain ME-25 loudspeakers. Room 2 does not modes, it seems to fail to effectively suppress strong early
comply with ITU-R BS.1116 recommendations: it was cho- reflections.
sen as an example of an ordinary, everyday listening envi- Fig. 1 displays the power spectra of the three listening
ronment. The loudspeakers are GENELEC 8020B. Room 2 environments as an interaction of both the room acoustics
is a converted conference room with added acoustical treat- and all loudspeakers, excluding the two subwoofers, at the
ment to remove strong wall reflections. This room is the listening position, measured with an omni-directional mi-
smallest of the three yet has the longest reverberation time. crophone (DPA 4006). Each listening environment shows
Room 3, the largest room of the three, has a slightly longer unique spectral characteristics. For example, rooms 1 and
836 J. Audio Eng. Soc., Vol. 69, No. 11, 2021 November
PAPERS INFLUENCE OF THE LISTENING ENVIRONMENT
J. Audio Eng. Soc., Vol. 69, No. 11, 2021 November 837
KIM AND HOWIE PAPERS
communications. Their ages ranged from 19 to 47 years. ticipants found it difficult to quantify those two attributes,
Each subject performed the complete test twice. Stim- they were allowed to compare the eight stimuli with a “ref-
uli were reproduced via Sennheiser HD-650 headphones erence” condition, activated using a separate button within
through a Grace Design m902 headphone preamplifier. A the GUI. The “reference” condition (a hidden reference)
custom graphic user interface (GUI) and data collection was the anechoic space with 22.2 reproduction. Subjects
system was created using Cycling ’74’s Max software. were not told that the reference corresponded to a specific
It is worth acknowledging that the authors could not use stimulus but rather that it represented a rating of “50” on
each listener’s individualized head-related transfer func- the perceptual scale and could therefore use the reference to
tion (HRTF) for this study because it would have been estimate the relative strengths and weaknesses of the eight
highly impractical for each listener to visit each of the four stimuli within the GUI. This reference signal, therefore,
listening rooms under investigation in different countries. provided all participants with the same baseline.
Therefore, as previously discussed, subjects evaluated au- Similar to the method described above, a GUI using
ditory images projected through the ears of the Brüel & ranged sliders and a reference signal could have been used
Kjær 4100D binaural microphone and Sennheiser HD-650 to rate the stimuli in terms of the various spatial attributes
headphones. While there are on-going efforts to adapt indi- under investigation, as is often seen in perceptual audio
vidualized HRTFs in 3D audio home listening experiences, evaluation [31–33]. Quantifying specific spatial attributes,
this method of using non-individualized HRTFs for binaural however, is challenging when one is asked to generate a
rendering also reflects the reality typical of most “current” number to describe them. For example, if a listener rated
commercial virtual reality/augmented reality. a perceived width of 50, what is the meaning of this spe-
cific number? Researchers, therefore, have endeavored to
develop a new method that indirectly elicits and quantifies
2.3 Perceptual Attributes the spatial attributes of a sound field [34–37]. The general
Participants evaluated eight binaural stimuli based on assumption of these methods is that within the cognitive
five salient perceptual attributes: perceived width, perceived mechanism there is a dual-coding system [38] that allows
depth, spatial clarity, impression of being enveloped, and perceived characteristics to be recognized with language
spectral fidelity of the sound field across. These attributes and another to be recognized with mental imagery.
were chosen based on previous literature on spatial audio Spatial attributes in auditory evaluation can be repre-
perceptual evaluation. Below are the definitions of the at- sented and quantified effectively and with less variance
tributes: when using non-verbal methods, drawing and pointing, re-
sulting in fewer cognitive loads and chances for misinter-
r Spectral Fidelity (or Balance): “The balance in tone pretation. Two recent immersive audio studies by Martin et
color (spectrum) variation of the sound image” [24]. al. [39][40] made use of such a methodology, asking sub-
r Apparent Source Width (ASW): “The apparent hor- jects to draw the perceived extent of a sound image rather
izontal spatial extent of the sound image” [25]. than describe it. The first author used this method in a pre-
r (Spatial) Clarity: “Integration of various sound com- vious subjective evaluation [41] and found that this drawing
ponents of the sound image. The clearer the sound, method is faster, results in more consistent quantification,
the more details you can perceive in it” [26]. and is relatively language-independent, since this method
r Envelopment: “The sense of being enveloped by the is less reliant on exact translations of the attributes under
sound field (both horizontally and vertically)” [27]. investigation to the subject’s native language. In the cur-
r Depth: “The overall impression of the depth of the rent study, therefore, four spatial attributes (Width, Depth,
sound image. Takes into consideration both overall Envelopment, and Clarity) were quantified via an indirect
depth of scene, and the relative depth of the individ- drawing method that asked listeners to determine the shape
ual sound sources” [28]. of a perceived sound image using the GUI.
Fig. 3 shows the custom GUI for the indirect drawing
The authors did not include “localization” or “located- method. By controlling number boxes on the right side
ness” of sound source(s) in this specific study for two rea- [within a range between 0 (minimum) and 100 (maximum)],
sons: (1) previous studies have already thoroughly covered a participant can control the corresponding shape (a dark-
the topic of room-related influence on auditory localiza- purple oval) of the orchestra image. For instance, the num-
tion [29][30], and (2) the current study does include the ber boxes of “Auditory Source Width” and “Depth” re-
attribute “(Spatial) Clarity,” a multi-dimensional attribute spectively adjust left-right extent (width) and front-back
that broadly includes positional precision of sound sources. extent (depth) of the oval and display the new shape in the
For the attributes preference and spectral fidelity, listen- GUI. Subjects were asked to match their perceived auditory
ers directly compared eight stimuli for each musical selec- image size with the oval shape for those two attributes. “En-
tion, using continuous sliders ranging in value from 0 to velopment” adjusts the strength of an outer brown-colored
100. As such, participants were asked to convert relative circle, indicating how much the sound field envelops the
perceptual magnitudes to a number range of 0 to 100, with subject. As for “Clarity,” the number box increases the
50 being the baseline. Below 50, therefore, would indicate thickness of two black lines inside the purple oval. Darker
not-preferred or ill-balanced spectrum, while above would color of the black lines indicates that details of sound com-
indicate preferred or well-balanced spectrum. When par- ponents are more clearly reproduced and perceived, while
838 J. Audio Eng. Soc., Vol. 69, No. 11, 2021 November
PAPERS INFLUENCE OF THE LISTENING ENVIRONMENT
J. Audio Eng. Soc., Vol. 69, No. 11, 2021 November 839
KIM AND HOWIE PAPERS
Table 2. The two-way mixed ANOVA results of preference ratings. The results show that interaction between the ROOM and
GROUP factor is significant (p < 0.001) regardless of the number of reproduction format. However the effect size of ROOM factor
[the ges (η2 ) values] is different for the 22.2-channel (0.034: small effect size) from the two-channel (0.255: big effect size).
60
50
40
30
20
EXP Preference
NOE Preference
10
EXP Timbral Fidelity
NOE Timbral Fidelity
0
1 2 3 R
840 J. Audio Eng. Soc., Vol. 69, No. 11, 2021 November
PAPERS INFLUENCE OF THE LISTENING ENVIRONMENT
produced by acoustic instruments. Perhaps, in this instance, rooms. However, the participants perceived extended depth
the more robust low-frequency power spectra in Room 1 for the 22.2-channel reproduced binaural stimuli without
paradoxically could have resulted in a low-frequency re- head tracking and corresponding dynamic binaural process-
production within the recordings that was aesthetically too ing, even within the non-experienced listener group. It is
strong or exaggerated, which relatively impacted the per- possible that the ambient sound information in the record-
ceived spectral fidelity ratings. ings, when reproduced from the side, rear, and height chan-
The two-way mixed ANOVA results of spectral fidelity nels, helped participants to better perceive an externalized
ratings (Table 3) also reveal a similar pattern with the pref- sound image.
erence ratings, showing that an interaction between the Also of interest is a comparison between ASW and LEV
ROOM and GROUP factor is significant (p < 0.001) regard- ratings. The two top panels (left: ASW and right: LEV)
less of the reproduction format. Specifically, the ROOM of Fig. 6 show no significant difference between the ASW
effect size [the ges (η2 ) values] of the 22.2-channel system and LEV ratings from the non-experienced group (square
is 0.054, which is regarded as a “small” effect, while the symbols). For this group, each listening environment (room
two-channel is 0.3, a “large” effect. These values indicate plus loudspeaker) delivered the same magnitude of ASW
that the ROOM variable had a stronger effect on listeners’ and LEV. In contrast the experienced group (circle sym-
spectral fidelity ratings for two-channel reproduced music. bols) clearly differentiated the ASW from LEV. In other
More loudspeakers seem to reduce differences in spectral attribute ratings, the non-experienced group tended to rate
fidelity. more “generously” by giving higher scores; Fig. 4 shows
The spatial attribute ratings collected from participants that this group’s preference ratings are much higher than
suggest some important implications as to how listeners the experienced group. This is similar to results from sev-
recognize a given listening environment. Statistical analy- eral previous studies by Olive and his co-authors investi-
ses of the ratings show that the listening environment for gating preferences among loudspeakers [45][46] or head-
two-channel reproduced music significantly modulated lis- phones [47], wherein it was observed that experienced
teners’ judgment of all spatial attributes except “Depth,” the and naı̈ve listeners tend to use different parts of the rat-
perpendicular extension of a sound image. Moreover, the ing scale, with less-experienced listeners generally giving
mean values of all two-channel stimuli ratings were smaller higher ratings.
than those associated with 22.2 reproduction. Since the spa- However, the same group’s ASW ratings are significantly
tial attribute ratings for all two-channel stimuli showed a lower than the other. ASW and LEV are both associated
similar tendency, we chose to exclude the two-channel data with distinct spatial characteristics of a reproduced sound
in the subsequent analyses and focused on the 22.2-channel field. ASW relates to the perception of an auditory image’s
ratings. Means and confidence intervals of four attribute horizontal dimension, while LEV relates to a listener’s im-
ratings associated with 22.2-channel reproduced music are pression of being enveloped or surrounded by sound. Hence
plotted in Fig. 6. “extent” is a key word within the definition of ASW, while
Results indicate that the 22.2-channel system allowed “sense” is more pertinent for LEV. It is worthwhile to note
listeners to consistently judge spatial attributes in four that LEV is also closely related with auditory immersion.
different playback rooms (including the reference ane- These findings suggest a cognitive difference between the
choic room). Two-way mixed ANOVA tests (similar to two groups: the experienced group could cognitively sep-
the ANOVA test for the preference ratings) were con- arate the concept of the physical extension of a sound im-
ducted for all spatial attributes: no interaction between the age from being enveloped (or immersed), while the non-
ROOM and GROUP factor was found for all attributes. experienced group mixed those two concepts.
The GROUP factor significantly modulates the ASW [F(3, These possible cognitive differences between the two lis-
282) = 22.963, p < 0.001], LEV [F(3, 282) = 4.644, p tener groups can be better visualized using factor analysis.
= 0.034], and (Spatial) Clarity [F(3, 282) = 15.134, p < Two factors were extracted with eigenvalues equal to or
0.001] ratings. The ROOM factor only significantly mod- greater than 1 after the Promax rotation. The correspond-
ulates the LEV rating [F(3, 282) = 6.162, p < 0.001], but ing loadings of five attributes are plotted in Fig. 7. The left
the effect size [the ges (η2 ) values] is quite small (0.031). panel is for the experienced group and right panel for the
The ANOVA results of all five attribute ratings are listed in non-experienced group. For the experienced group, the first
Table 4 located in APPENDIXA. factor (42.7%) is associated with LEV as the sole leading
Considered together, this data suggests that the perceived attribute, and the second factor (16.8%) with a combination
magnitudes of spatial attributes are not significantly influ- of four other attributes. In contrast, the non-experienced
enced by the acoustics of the room (except a small effect on group factorizes all the space-related attributes (ASW, LEV,
LEV). The “Depth” ratings are of particular interest consid- Clarity, and Depth) as the first factor (46.9%) and spectral
ering the static nature of binaural sound capture, with actual fidelity as the second factor (18.8%).
listening typically involving head movement and rotation. Both groups seem to rely on LEV and spectral fidelity as
Non-dynamic representations of binaural sound (without a base axes of their semantic spaces, yet the dominant factor
head tracker) have been known to fail in generating an “ex- and corresponding attribute is quite different between the
ternalized sound image” [44]. The two-channel reproduc- two groups. The experienced group primarily recognizes
tion and its binaural capture thus failed to deliver success- the differences of listening environments based on LEV,
ful perpendicular extension of the sound field for all four while the other group does so with spectral fidelity. In the
J. Audio Eng. Soc., Vol. 69, No. 11, 2021 November 841
KIM AND HOWIE PAPERS
Table 3. The two-way mixed ANOVA results of spectral fidelity ratings. The results show that interaction between the ROOM and
GROUP factor is significant (p < 0.001) regardless of the reproduction format. However, similar to the preference ratings, the effect
size of ROOM factor [the ges (η2 ) values] is different for the 22.2-channel (0.054: small effect size) from the two-channel (0.3: big
effect size).
60 60
50 50
40 40
30 30
20 20
10 10
Experienced Experienced
Non-experienced Non-experienced
0 0
1 2 3 R 1 2 3 R
60 60
50 50
40 40
30 30
20 20
10 10
Experienced Experienced
Non-experienced Non-experienced
0 0
1 2 3 R 1 2 3 R
Fig. 6. Means and confidence intervals for four spatial attribute ratings associated with the 22.2-channel system. Twenty experienced
listeners’ ratings are plotted with circle symbols and 27 non-experienced listeners with square symbols. The top-left panel is for the
attribute ASW, top-right for LEV, bottom-left for Clarity, and bottom-right for the Depth.
842 J. Audio Eng. Soc., Vol. 69, No. 11, 2021 November
PAPERS INFLUENCE OF THE LISTENING ENVIRONMENT
Timbre
Depth
Clarity
re b
Tim
0.5 0.5
FACTOR 2: 16.78%
FACTOR 2: 18.83%
W
AS
Clarity LEV
0 LEV 0
ASW
Depth
-0.5 -0.5
-1 -1
-1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1
FACTOR 1: 42.74% FACTOR 1: 46.89%
Fig. 7. A semantic space derived from a factor analysis visualizing the interrelation of five attributes. Factor 1 has a strong relation with
LEV and Factor 2 has a relation with spectral fidelity for both the experienced and non-experienced groups.
two-factor semantic space, one axis is for a spatial charac- a previous study [52] examined the influence of musical
teristic and another for a timbral characteristic. This finding selection (as a context) on listeners’ preference of surround
aligns with results from a previous study on multichan- microphone techniques, showing that the preference is sig-
nel audio [48] that showed a difference between naı̈ve and nificantly modulated by musical selection (an aggregate
experienced listeners in the weightings of “timbral and spa- of the composition in question, performance practice used
tial fidelities.” by the musician, and performance itself). This implies that
In summary this study found that when an optimally people listen with respect to the semantic content of musical
recorded orchestral musical piece was reproduced through performance paradigms within the reproduced sound field.
a 22.2-channel system, listeners perceived a similar spa- Hence the current study incorporates orchestral music se-
tial dimensionality for the reproduced sound field in any of lections that are recognizably different in terms of musical
the listening environments under investigation, even though style/content and instrumentation.
these environments differed in terms of physical dimensions However in this study variations from two distinct musi-
and reflecting surfaces. However the unique acoustical fea- cal selections did not induce significant differences in pref-
tures of each reproduction room changed listeners’ impres- erence or perceptual attribute ratings. This does not imply
sions of the spectral fidelity of the reproduced sound fields, that musical selection is meaningless within the evaluation
regardless of the reproduction system (stereo or 22.2). of reproduced sound fields for 22.2. Rather it indicates that
differences between the two orchestral music selections
4 DISCUSSION used in the current study were not large enough to modu-
late the listeners’ responses. The two primary experimental
This study investigates several contextual effects in lis- factors in this study, the listening environment and pre-
tener recognition processes of reproduced orchestral music, vious audio experience among subjects, their interactions,
which include the listening environment and listeners’ pre- and effects on listener perception, may be less significant or
vious experience with audio production and 3D immersive insignificant under different experimental conditions. It is
audio listening. These environmental factors can have an therefore not possible to generalize the results of the current
influence on a subject’s perception of a given stimuli, re- study to all potential sound reproduction scenarios, which
sulting in a particular form of systemic “bias” that affects was never the objective of this paper. Ideally further studies
a psychophysical test [49]. While Zielinski asserts [8] that will be devised to collect more local evidence of influences
it was not possible to find strong evidence or “direct data” associated with various contextual factors and listeners’
within relevant literature supporting the existence of in- preference and to form a compilation of case studies for the
fluences associated with contextual effects, other studies research community. In particular a relationship between
[50][51] support their significance. This contrast occurs preference ratings and other perceptual ratings is of interest
because a given contextual effect interacts with other con- yet is beyond the scope of the current study. This is partially
textual effect(s), as the current study shows. In other words, due to a lack of physical parameters that can account for
a certain contextual effect may be latent until another condi- perceptual attribute magnitudes influenced by contextual
tion starts inducing meaningful difference(s). For example, factors.
J. Audio Eng. Soc., Vol. 69, No. 11, 2021 November 843
KIM AND HOWIE PAPERS
The results discussed in SEC. 4.2 suggest that in this study actively listen for or control when comparing loudspeakers
listener perception of many spatial attributes under inves- or adjusting a filter on a stereo system. In the current study,
tigation were not significantly influenced by the listening it is not possible to say which of the two previous experience
environment (the combination of room and loudspeaker). types, audio production or immersive audio listening, had
This is especially important for multichannel-audio control a greater impact on the perception of the experience group,
rooms, where one wishes to avoid the room-in-room effect. since they are fundamentally correlated within the data. In
Interestingly, however, the listening environments in this Howie et al.’s study [42], previous experience “hearing 3D
study did appear to have a small effect on perception of audio” was not found to have an effect on listener perfor-
LEV. This is quite surprising, given that LEV is typically mance. However, the focus of that study was consistency
associated with diffuse lateral sound reflections arriving af- of making evaluations, not how the experience modulated a
ter the first 50–80 ms [21][53], the kind of acoustic energy listener’s perception of the sound scene or listening space.
that is not typically well represented in small rooms. Toole In the current study, it follows somewhat more logically
[10] and Griesinger [54] have both discussed envelopment that previous experience hearing immersive sound scenes
in small rooms. Both assert that envelopment associated through a 22.2 system could be a factor in “training” sub-
with reproduced sound within a small room must primarily jects to focus more or equally on spatial attributes, such as
be a result of diffuse reflected energy within the record- LEV, rather than purely timbral differences between sound
ing itself, reproduced from surrounding loudspeakers, par- scenes. This begs the question: is LEV an auditory attribute
ticularly those at or near the side walls. Toole, however, that is understood differently in comprehension of immer-
suggests that the sense of envelopment created by a mul- sive reproduced sound field, depending on previous training
tichannel recording could be enhanced if the side walls of or experience? Further experiments would be required to
the listening room are treated in such as way as to provide examine this question adequately.
diffuse reflections of sounds emanating from loudspeakers
on opposite walls [10]. This idea becomes particularly rele-
vant for 3D audio formats with dedicated side channels (i.e.,
loudspeakers positioned at +/−90 degrees), such as 22.2 or
Dolby Atmos. It is possible then that the small differences 5 CONCLUSION
in LEV associated with the rooms used in the current study
are a result of their acoustical design: the rooms that bet- This study investigated the effect of the listening en-
ter reflect and diffuse reverberant sound from within the vironment (listening room acoustics and loudspeaker) on
recordings give listeners a better sense of envelopment or listener perception of loudspeaker-based music reproduc-
spaciousness. tion. Three listening environments were compared in terms
In the current study, listeners can be broadly separated of both their physical and perceptual characteristics. The
into two different “experience” groups: those with previ- acoustics of the reproduction rooms used to create the stim-
ous experience or proficiency with audio production and uli for this study differed in terms of physical dimensions,
hearing 3D audio reproduced through a 22.2-channel sys- reverberation time (T30), and acoustic clarity (C80) values.
tem and those without either of said types of experience. Results of a subjective listening test show that for repro-
Howie et al. have examined the effect of different types duction of two different orchestral music sound scenes, an
of previous experience on listener performance [42] and immersive 3D audio system reduced room-induced effects
skill level [55] in 3D audio evaluation. As with the current on listeners’ recognition of the target music pieces, as com-
study, Howie et al. used immersive recordings of orches- pared with a traditional two-channel, stereo reproduction
tral music as stimuli. In [42], it was found that among system. Immersive sound reproduction (22.2 in this study)
several different types of previous experience, audio pro- was found to preserve enhanced spatial attributes of the
duction experience was the best predictor of listener con- reproduced orchestral music, regardless of the listening en-
sistency in making preference or ranking judgments of 3D vironment.
audio stimuli. For those trained or working as audio en- However, for both the immersive and stereo playback
gineers, even in stereo, it is normal to divide and analyze conditions, listeners’ perceptual ratings related to the spec-
a sound scene based on many complex factors covering a tral fidelity of stimuli were modulated by the differing
wide range of spatial, timbral, dynamic, and musical con- room acoustics of each listening environment. In addi-
siderations. It follows logically, therefore, that listeners with tion listeners’ previous experience with audio production
this type of previous experience would be better able to use or loudspeaker-based 3D audio reproduction may have dif-
complex and multi-modal concepts such as LEV to dif- ferentiated their perceptual responses, resulting in an id-
ferentiate between listening environments or sound scenes, iosyncratic cognitive base difference between the experi-
particularly if they were already familiar with immersive enced and non-experienced groups. The experienced group
audio reproduction. appears to better understand small inter-differences in the
For inexperienced or “naı̈ve” listeners, who are generally four spatial attributes under investigation. For the experi-
unfamiliar with this type of advanced sound scene analysis, enced group, LEV was the primary factor in distinguish-
timbre might be a more effective or natural attribute for ing between the four listening environments, while spectral
differentiating between sound scenes, immersive or oth- fidelity was the primary distinguishing factor for the non-
erwise; timbre is one of the few factors a consumer can experienced group.
844 J. Audio Eng. Soc., Vol. 69, No. 11, 2021 November
PAPERS INFLUENCE OF THE LISTENING ENVIRONMENT
J. Audio Eng. Soc., Vol. 69, No. 11, 2021 November 845
KIM AND HOWIE PAPERS
Soc. Am., vol. 121, no. 1, pp. 388–400 (2007 Jan.). vention of the Audio Engineering Society (2004 May), paper
https://doi.org/10.1121/1.2385043. 6054.
[27] J. S. Bradley and G. A. Soulodre, “Objec- [38] A. Paivio, “The Relationship Between Verbal and
tive Measures of Listener Envelopment,” J. Acoust. Perceptual Codes,” in E. C. Carterette and M. P. Fried-
Soc. Am., vol. 98, no. 5, pp. 2590–2597 (1995 Nov.). man (Eds.), Handbook of Perception, Volume VIII: Percep-
https://doi.org/10.1121/1.413225. tual Coding, pp. 375–397 (Academic Press, New York,
[28] N. Zacharov, C. Pike, F. Melchoir, and T. Worch, NY, 1978). https://doi.org/10.1016/B978-0-12-161908-4.
“Next Generation Audio System Assessment Using the 50017-6.
Multiple Stimulus Ideal Profile Method,” in Proceedings [39] B. Martin, R. King, and W. Woszczyk, “Sub-
of the 8th International Conference on Quality of Multime- jective Graphical Representation of Microphone Arrays
dia Experience (QoMEX), pp. 1–6 (Lisbon, Portugal) (2016 for Vertical Imaging and Three-Dimensional Capture of
Jun.). Acoustic Instruments, Part I,” presented at the 141st
[29] W. M. Hartmann, “Localization of Sound in Convention of the Audio Engineering Society Conven-
Rooms,” J. Acoust. Soc. Am., vol. 74, no. 5, pp. 1380–1391 tion of the Audio Engineering Society (2016 Sep.), paper
(1983 Nov.). https://doi.org/10.1121/1.390163. 9613.
[30] N. Kopčo and B. Shinn-Cunningham, “Auditory [40] B. Martin, D. Martin, R. King, and W. Woszczyk,
Localization in Rooms: Acoustic Analysis and Behav- “Subjective Graphical Representation of Microphone Ar-
ior,” in Proceedings of the 32nd International Acousti- rays for Vertical Imaging and Three-Dimensional Cap-
cal Conference—EAA Symposium, pp. 109–112 (Banská ture of Acoustic Instruments, Part II,” presented at the
Štiavnica, Slovakia) (2002 Sep.). 147th Convention of the Audio Engineering SocietyCon-
[31] K. Hamasaki, K. Hiyamah, T. Nishiguchi, and R. vention of the Audio Engineering Society (2019 Oct.), paper
Okumura, “Effectiveness of Height Information for Repro- 10265.
ducing the Presence and Reality in Multichannel Audio [41] S. Kim, N. Shime, and M. Ikeda, “Enhanced Pres-
System,” presented at the 120th Convention of the Audio ence of Electronic Orchestral Music (Part I): A Comparison
Engineering Society Convention of the Audio Engineering of Loudspeakers and Their Perceptual Effect,” presented at
Society (2006 May), paper 6679. the 14th Regional Convention of the Audio Engineering
[32] S. Kim, Y. W. Lee, and V. Pulkki, “New 10.2- Society, Japan Section Regional Convention of the Audio
Channel Vertical Surround System (10.2-VSS); Compar- Engineering Society, Japan Section (Tokyo, Japan) (2009
ison Study of Perceived Audio Quality in Various Multi- Jul.).
channel Sound Systems With Height Loudspeakers,” pre- [42] W. Howie, D. Martin, S. Kim, T. Kamekawa, and
sented at the 129th Convention of the Audio Engineering R. King, “Effect of Audio Production Experience, Mu-
Society Convention of the Audio Engineering Society (2010 sical Training, and Age on Listener Performance in 3D
Nov.), paper 8296. Audio Evaluation,” J. Audio Eng. Soc., vol. 67, no. 10,
[33] H. Shim, E. Oh, S. Ko, and S. H. Park, “Percep- pp. 782–794 (2019 Oct.). https://doi.org/10.17743/jaes.
tual Evaluation of Spatial Audio Quality,” presented at 2019.0031.
the 129th Convention of the Audio Engineering Society [43] S. Kim, R. King, T. Kamekawa, and S. Sakamoto,
Convention of the Audio Engineering Society (2010 Nov.), “Recognition of an Auditory Environment: Investigating
paper 8300. Room-Induced Influences on Immersive Experience,” in
[34] S. Choisel and K. Zimmer, “A Pointing Technique Proceedings of the AES International Conference on Spa-
With Visual Feedback for Sound Source Localization Ex- tial Reproduction – Aesthetics and Science (Tokyo, Japan)
periments,” presented at the 115th Convention of the Audio (2018 Jul.), paper P6-1.
Engineering Society Convention of the Audio Engineering [44] K. Inanaga, Y. Yamada, and H. Koizumi, “Head-
Society (2003 Oct.), paper 5904. phone System With Out-of-Head Localization Applying
[35] N. Ford, F. Rumsey, and B. de Bruyn, “Graphical Dynamic HRTF (Head Related Transfer Function),” pre-
Elicitation Techniques for Subjective Assessment of the sented at the 98th Convention of the Audio Engineering
Spatial Attributes of Loudspeaker Reproduction – A Pilot Society (1995 Feb.), paper 4011.
Investigation,” presented at the 110th Convention of the [45] S. E. Olive, “Differences in Performance and Prefer-
Audio Engineering Society Convention of the Audio Engi- ence of Trained Versus Untrained Listeners in Loudspeaker
neering Society (2001 May), paper 5388. Tests: A Case Study,” J. Audio Eng. Soc., vol. 51, no. 9, pp.
[36] J. Usher and W. Woszczyk, “Design and Testing 806–825 (2003 Sep.).
of a Graphical Mapping Tool for Analyzing Spatial Au- [46] S. E. Olive, “Some New Evidence That Teenagers
dio Scenes,” in Proceedings of the AES 24th International and College Students May Prefer Accurate Sound Repro-
Conference on Multichannel Audio: The New Reality In- duction,” presented at the 132nd Convention of the Audio
ternational Conference on Multichannel Audio: The New Engineering Society (2012 Apr.), paper 8683.
Reality, pp. 157–170 (Banff, Canada) (2003 Jun.), paper [47] S. E. Olive, T. Welti, and E. McMullin, “The Influ-
11. ence of Listeners’ Experience, Age, and Culture on Head-
[37] J. Usher and W. Woszczyk, “Visualizing Auditory phone Sound Quality Preferences,” presented at the 137th
Spatial Imagery of Multi-Channel Audio,” presented at the Convention of the Audio Engineering Society (2014 Oct.),
116th Convention of the Audio Engineering Society Con- paper 9177.
846 J. Audio Eng. Soc., Vol. 69, No. 11, 2021 November
PAPERS INFLUENCE OF THE LISTENING ENVIRONMENT
[48] F. Rumsey, S. Zieliński, R. Kassier, and S. Bech, International Conference on The Future of Audio Technol-
“On the Relative Importance of Spatial and Timbral Fideli- ogy – Surround Sound and Beyond International Confer-
ties in Judgments of Degraded Multichannel Audio Qual- ence on The Future of Audio Technology – Surround Sound
ity,” J. Acoust. Soc. Am., vol. 118, no. 2, pp. 968–976 (2005 and Beyond (Piteå, Sweden) (2006 Jun.), paper 9-2.
Aug.). https://doi.org/10.1121/1.1945368. [53] T. Hanyu and S. Kimura, “A New Objective Mea-
[49] E. C. Poulton, Bias in Quantifying Judgments sure for Evaluation of Listener Envelopment Focusing on
(Lawrence Erlbaum Associates, Hove, UK, 1989). the Spatial Balance of Reflections,” Appl. Acoust., vol. 62,
[50] D. Begault, “Preferences Versus References: Lis- no. 2, pp. 155–184 (2001 Feb.).
teners as Participants in Sound Reproduction,” in Proceed- [54] D. Griesinger, “Spatial Impression and Envelop-
ings of the Spatial Audio & Sensory Evaluation Techniques ment in Small Rooms,” presented at the 103rd Con-
Workshop (Guilford, UK) (2006 Apr.). vention of the Audio Engineering Society Convention
[51] H. L. Meiselman (Ed.), Context: The Effect of En- of the Audio Engineering Society (1997 Sep.), paper
vironment on Product Design and Evaluation (Elsevier, 4638.
Duxford, UK, 2019). [55] W. Howie, D. Martin, S. Kim, T. Kamekawa,
[52] S. Kim, M. de Francisco, K. Walker, A. Marui, and and R. King, “Effect of Skill Level on Listener Per-
W. L. Martens, “An Examination of the Influence of Mu- formance in 3D Audio Evaluation,” J. Audio Eng.
sical Selection on Listener Preferences for Multichannel Soc., vol. 68, no. 9, pp. 628–637 (2020 Sep.).
Microphone Technique,” in Proceedings of the AES 28th https://doi.org/10.17743/jaes.2020.0050.
THE AUTHORS
J. Audio Eng. Soc., Vol. 69, No. 11, 2021 November 847
KIM AND HOWIE PAPERS
848 J. Audio Eng. Soc., Vol. 69, No. 11, 2021 November