Ambisonics Comes of Age
Ambisonics is a means by which spatial audio sound fields can be represented in the form of a hierarchy of directional components. It has become increasingly important in recent years, resulting in a number of papers and e-Briefs at AES conventions, and a number of these are summarized in this article. First developed primarily in the 1970s, ambisonic principles were manifested in a relatively small number of systems and products for many years, principally the Soundfield microphone. Like a number of other inventions that have emerged over the years, it was perhaps “before its time,” and remained mainly of interest in academic and research circles. Fascinatingly, though, its time as a mainstream technology seems now to have come, brought about by the need for an intermediate method of representation for 3D audio that can be rendered to a variety of different speaker or headphone configurations. This is perhaps not so surprising, as 3D audio formats have proliferated in the recent past, and the growing interest in interactive or immersive media has generated a need for things like rotatable sound fields and a compact structure for capturing, storing, and transferring 3D audio alongside 360 video and virtual reality content. The MPEG-H audio coding standard, for example, incorporates a mode known as scene-based audio (SBA) for transmitting spatial audio as ambisonic components.

The first manifestations of ambisonics mainly used what is known as first-order representation (B-format), whereby the sound field is represented as a pressure (omnidirectional) component, plus three orthogonal velocity components (like figure-eight microphone patterns pointing front-back, left-right, and up-down). This gives relatively low spatial resolution, and later manifestations have increased the order of representation by including further “spatial harmonics” that have tighter directional patterns. These formats are typically known under the generic heading “higher-order ambisonics” or HOA. Increasing the order leads to a larger number of channels needed for storage or transmission, and recent ambisonic streaming applications typically use 2nd or 3rd order representation, needing 9 or 16 channels respectively (a full-sphere representation of order N uses (N + 1)² channels). There can be diminishing returns, perceptually, as one increases the order above this, with the suggestion that there is limited value in using orders above 5th for typical applications.

Encoded content can be rendered for reproduction using decoding algorithms designed for almost any loudspeaker layout, as long as the locations of the loudspeakers are known, although irregular layouts are more challenging than regular ones. It can also be decoded in the form of two-channel binaural signals suitable for headphone listening. Ambisonic representation lends itself particularly well to interactive contexts where the listener’s head may be rotated or tilted, as the decoding (rendering) algorithms can easily be adapted.

OPTIMIZED DECODERS FOR MIXED-ORDER AMBISONICS
In Paper 10507 (“Optimized Decoders for Mixed-Order Ambisonics”) from the 150th Convention, Aaron Heller, Eric Benjamin, and Fernando Lopez-Lezcano discuss decoding in systems where the vertical order is lower than the horizontal order. They use a form of shorthand from
Chris Travis to describe mixed-order systems, giving the example that 3H2V means third-order horizontal, second-order vertical. Such mixed-order systems will have different spatial resolution in some directions than in others. The authors point out that mixed-order microphones are available, and also that in domestic settings it’s easier to deploy, say, eight loudspeakers in a horizontal array than it is to have a lot of elevated loudspeakers. Most real-world applications involve arrays that are either irregular or incomplete, it’s suggested. Approximately hemispherical arrays are one example, the missing bottom half of a full sphere leading to increasing localization errors at or below the horizon. In arrays that have more horizontal than elevated speakers the spatial accuracy will inevitably be better in the horizontal plane than it is for height cues.

A useful explanation of the task of an ambisonic decoder, given here, is “to create the best perceptual impression possible that the sound field is being reproduced accurately, given the available resources.” Historically, various criteria have needed to be optimized in ambisonic decoding, involving amplitude and energy gain in different directions, and velocity- and energy-based localization vectors at low and high frequencies respectively. Some of these are said to be more important than others, and conflicting perceptual cues need to be avoided. It’s pointed out that a number of rendering systems deal with mixed-order signals by using full-order decoders and just leaving out the missing channels, but decoders designed specifically for mixed-order signals seem always to outperform this rather crude approach in terms of spatial accuracy in the rendered directions.

In this paper the emphasis is on a new implementation of the Ambisonic Decoder Toolbox (ADT), previously introduced by the first author, and available online at https://bitbucket.org/ambidecodertoolbox/adt.git. This new version is said to include a fast and robust nonlinear optimizer and a new design procedure for dual-band decoders. The high-frequency performance is optimized first, then the low-frequency performance is optimized to match it. The solution described here involves designing a mixed-order decoder for a spherical 240-loudspeaker array. Apparently the choice of a spherical array design creates a well-behaved optimization problem and an optimal decoder matrix. The high-frequency energy vector (rE, which “points” in the direction in which a high-frequency source would be perceived) is then computed over 5200 virtual speaker points on a sphere, and values at corresponding locations to those points are used as target values for real loudspeaker positions. The low-frequency velocity vector (rV, which “points” in the direction in which a low-frequency source would be perceived) is then optimized such that it points the same way as rE at every spherical point, and has a magnitude of 1.

One of the interesting outcomes of the work is a visualization tool that helps designers decide which of an available array of loudspeakers should be used for horizontal and height layers if parameters are to be optimized. The authors applied the tool to both a 56-loudspeaker professional installation (Fig. 1) and a 13-loudspeaker domestic “dome” with good results, at least from a modelling point of view, and an initial one-listener listening test.

Fig. 1. “The Stage” at CCRMA (Center for Computer Research in Music and Acoustics), a 56-loudspeaker array for spatial audio rendering. (Courtesy of Heller, Benjamin, and Lopez-Lezcano)

BINAURAL AMBISONIC RENDERING
To give a bit of background, back in 2017 Joseph Tylka and Edgar Choueiri presented “A Toolkit for Customizing the ambiX Ambisonics-to-Binaural Renderer” in e-Brief 403. The toolkit described by this paper’s authors was based around a set of ambisonic decoding plug-ins developed by Matthias Kronlachner known as ambiX plug-ins (http://www.matthiaskronlachner.com/?p=2015). ambiX is a channel ordering convention used for ambisonic recording exchange, and the ambiX plug-ins are a set of ambisonic decoding plug-ins, which includes a binaural renderer. Tylka and Choueiri’s contribution was essentially an open source collection of MATLAB functions known as the SABRE toolkit (https://github.com/PrincetonUniversity/3D3A-SABRE-Toolkit), which facilitated the creation of customized or personalized ambisonic-to-binaural decoding presets, using SOFA-formatted head-related transfer functions (HRTFs).

HRTFs represent the unique effects of one’s head, torso, and outer ear on the acoustic cues processed by the auditory system and are the principal means by which that system decodes the directions of sound sources. Every direction has a different effect on the frequency response and time delays of binaural signals. HRTFs are often represented in their time-domain equivalent form, the head-related impulse response, or HRIR. SOFA is an increasingly popular standard format for distributing individual HRTFs, standing for “spatially-oriented format for acoustics.” Tylka and Choueiri explained that there are more comprehensive tools than theirs for creating ambisonic decoders, such as the ADT mentioned in the previous section. For this reason they said that one application of their toolkit was simply to add custom HRTFs to existing ambiX decoding presets. This needed the user to specify loudspeaker locations because the information about those is apparently not explicitly included in ambiX decoding presets.
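
The role of the loudspeaker locations can be seen in a minimal sketch of the virtual-loudspeaker approach to binaural decoding, written here in Python with NumPy. It is not the ambiX or SABRE implementation: the function names, the square layout of four horizontal loudspeakers, the simple first-order decoder, and the synthetic “HRIRs” (one delayed, scaled impulse per ear) are all assumptions made purely for illustration. In practice the decoder matrix would come from a decoding preset and the HRIRs from a SOFA file, selected according to each loudspeaker’s direction.

```python
import numpy as np

def encode_first_order(signal, azimuth, elevation=0.0):
    """Pan a mono signal to first-order ambisonics (ACN order W, Y, Z, X; SN3D gains)."""
    gains = np.array([
        1.0,
        np.sin(azimuth) * np.cos(elevation),
        np.sin(elevation),
        np.cos(azimuth) * np.cos(elevation),
    ])
    return gains[:, None] * signal[None, :]

def sampling_decoder(speaker_azimuths):
    """Illustrative decoder: a cardioid-shaped pickup aimed at each horizontal loudspeaker."""
    az = np.asarray(speaker_azimuths, dtype=float)
    rows = np.stack([np.ones_like(az), np.sin(az), np.zeros_like(az), np.cos(az)], axis=1)
    return rows / len(az)

def toy_hrirs(speaker_azimuths, max_delay=30):
    """Placeholder HRIRs (NOT measured data): one delayed, scaled impulse per ear."""
    az = np.asarray(speaker_azimuths, dtype=float)
    hrirs = np.zeros((len(az), 2, max_delay + 1))
    for i, a in enumerate(az):
        lateral = np.sin(a)  # +1 = fully left, -1 = fully right
        for ear, sign in enumerate((+1.0, -1.0)):  # ear 0 = left, ear 1 = right
            delay = int(round(max_delay * (1.0 - sign * lateral) / 2.0))
            hrirs[i, ear, delay] = 1.0 + 0.5 * sign * lateral
    return hrirs

def render_binaural(ambi, decoder, hrirs):
    """Decode to virtual loudspeaker feeds, filter each feed with that
    loudspeaker's left and right HRIR, and sum the results per ear."""
    feeds = decoder @ ambi  # shape: (n_speakers, n_samples)
    n_out = ambi.shape[1] + hrirs.shape[2] - 1
    ears = np.zeros((2, n_out))
    for spk_feed, spk_hrir in zip(feeds, hrirs):
        for ear in range(2):
            ears[ear] += np.convolve(spk_feed, spk_hrir[ear])
    return ears

# Usage: a noise burst panned 90 degrees to the left, rendered over four
# virtual loudspeakers placed at the corners of a square.
rng = np.random.default_rng(0)
burst = rng.standard_normal(4800)
speakers = np.radians([45.0, 135.0, -135.0, -45.0])
ears = render_binaural(
    encode_first_order(burst, np.radians(90.0)),
    sampling_decoder(speakers),
    toy_hrirs(speakers),
)
left_energy, right_energy = np.sum(ears ** 2, axis=1)
print(left_energy > right_energy)
```

Because every loudspeaker feed is filtered with the HRIR pair for that loudspeaker’s direction, the renderer cannot choose the right filters unless the directions are known. Swapping `toy_hrirs` for measured, individual HRIRs is, in spirit, the kind of customization the toolkit was intended to support.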

HRTFs included in SOFA files can be represented as HRIRs gathered on a grid of measuring locations. These may need to be interpolated if measured values are not available at desired grid positions, so the toolkit had various means of doing this, including nearest neighbor, time domain, and frequency domain methods. There was an option not to interpolate when a grid point in the SOFA file was close enough to a desired location, according to a specified threshold value. Means were also provided in the toolkit to apply headphone equalization if needed, in order to compensate for specific headphone responses that might otherwise interfere with optimal binaural reproduction. For example, some headphones are equalized to the diffuse-field standard developed some years back, whereas others may employ free-field equalization or indeed none at all.

INDIVIDUALIZED BINAURAL RENDERER FOR HOA
More recently at the 150th Convention, Mengfan Zhang and colleagues from China published a paper on an “Individualized HRTF-Based Binaural Renderer for Higher-Order Ambisonics” (Paper 10454). They described a system whereby HOA signals could be rendered for headphones with better results using an individual binaural renderer than using a generic one. This was based on their previous work which emphasised the modelling of HRTFs, rather than the measurement of them, partly because measuring people’s individual HRTFs is difficult. A novel feature is the extension of individual HRTFs generated in just a few directions to rendering in any direction.

As shown in Fig. 2, a typical binaural ambisonic renderer takes ambisonic components (spherical harmonics), decodes them to loudspeaker feeds for a particular layout, and then uses HRTF processing to filter the loudspeaker signals from different directions, generating left and right ear signals. As the authors explain, using non-individual HRTFs can give rise to perception errors such as in-head localization, front-back confusion, or poor discrimination of elevation cues. Individual HRTF modelling is therefore done according to the scheme shown in Fig. 3. Here HRTFs from a database based on the KEMAR head, known as CIPIC, are decomposed using a process known as SPCA (spatial principal component analysis). Eight different anthropometric parameters, including head dimensions, shoulder width, and various outer ear measurements, are chosen to represent the main physical factors that affect HRTFs in practice. These are captured from photographs of individuals, and the parameters are used to estimate new SPC weights for these individuals (new people, whose HRTFs were not contained in the original database). Deep neural networks (DNNs) are trained to predict new values of these weights in arbitrary spatial directions, which are then converted back into HRTF magnitude spectra. Related interaural time differences (ITDs) are computed using a similar approach, again in arbitrary spatial directions. Binaural HRIRs are then reconstructed from minimum-phase monophonic magnitude spectra, based on modelled ITDs.

Fig. 2. A typical binaural ambisonic renderer takes ambisonic components, decodes them to loudspeaker feeds for a particular layout, and then uses HRTF processing to filter the loudspeaker signals from different directions, generating left and right ear signals. (Figs 2–4 courtesy of Zhang et al.)

Fig. 3. Framework for modelling of individual HRTFs using spatial principal components.

In order to evaluate the approach, the authors conducted a listening experiment. Third-order HOA signals of source material were decoded to 20 virtual loudspeakers at locations corresponding to the vertices of a regular dodecahedron, as shown in Fig. 4. The virtual loudspeaker signals were then each convolved with the HRIR of the nearest spatial direction in the CIPIC database, for the “generic” renderer, or the individual HRIR created by the modelling process described above, for the individual renderers. Source material for the tests was simply a train of 250 ms bursts of windowed noise panned ambisonically so as to appear to come from different directions. As far as can be determined from the paper, head tracking was not employed in the rendering process, so the rendered signals would not have been affected by listener head movements. Listeners were expected to use a graphical interface to indicate the spatial locations of the noise bursts. Results suggested that front-back confusion was significantly lower for the individualized rendering than for generic rendering.

Fig. 4. 20 vertices of a regular dodecahedron used as virtual loudspeaker locations for ambisonic decoding.

AMBISONIC DECODER TEST METHODOLOGY
Virtual loudspeakers and binaural rendering were also used in work described by Enda Bates, William David, and Daniel Dempsey, in Paper 10457 (“Ambisonic Decoder Test Methodologies Based on Binaural Reproduction”). The authors point out that there is an increasingly large number of ways to decode ambisonic source material, including passive approaches, parametric methods based on time-frequency analysis, and hybrid approaches. It’s difficult, they say, to
compare ambisonic decoders, as it is to compare spatial recording techniques, without a clear “ground truth” reference against which to compare the results. In most previous work, either expert subjects have been assumed to have a good internal (memorized) reference against which to compare what they hear, or ratings have been made on specific spatial attributes such as source width. For the tests described in this paper, it seems, the authors chose to use eight different attributes for evaluation of three ambisonic decoding strategies. A web-based multiple comparison paradigm was employed, enabling stimuli to be compared against a reference and high and low anchor signals. Two different loudspeaker layouts were used, one a cube and the other a typical multichannel “cinematic” 11.1 (7.1.4) layout. A cube layout is potentially easier to decode for, owing to its symmetry, but has relatively few loudspeaker locations in the horizontal plane. The 7.1.4 layout has more loudspeakers in the horizontal plane but is irregular in its layout. Source material consisted of both recorded scenes (using first-order ambisonic microphones) and object-based signals with sources panned using an ambisonic panner. The ambisonic source material was decoded to loudspeaker feeds using the three decoding algorithms to be compared, and the feeds were rendered binaurally by convolving them with HRIRs relating to the loudspeaker locations, taken from a database known as SADIE II. Reference scenes were created using direct binaural recordings, or direct binaural rendering of object sources, rather than using ambisonic pickup/panning and decoding as an intermediate step. A form of equalization was used to compensate for timbral differences between

Fig. 5. Process for creating test and reference stimuli to compare ambisonic decoders. (a) Object-based tests; (b) recording-based tests. (Courtesy of Bates, David and Dempsey)

DATA COMPRESSION OF HOA SIGNALS
Xu et al. explain that HOA signals need a lot of channels to deliver high spatial resolution, so it is worth considering efficient data compression for storage and transmission (“Higher Order Ambisonics Compression Method Based on Independent Component Analysis,” Paper 10456). Of course independent data compression could be employed on each mono channel, but this would not take advantage of redundancy between channels. An alternative approach may use a form of principal component analysis to extract predominant components from a group of channels, encoding these with conventional mono encoders, while the residuals (background components) are encoded as ambisonic signals at a lower bit rate. This is the basis of the so-called singular value decomposition (SVD) method given as an example in the MPEG-H standard. It’s pointed out that the predominant components extracted using such a method may not correspond directly to the sound sources, and the components are calculated on a frame-by-frame basis, which can result in abrupt transitions. Consequently, the authors propose an alternative based on independent component analysis (ICA), a method of signal decomposition widely used in blind source separation applications. In