Ambisonics Comes of Age


FEATURE ARTICLE

Ambisonics comes of age


Francis Rumsey
Consultant Technical Writer

Ambisonic representation provides a relatively versatile and compact way of storing
or transmitting spatial audio scenes. Signals in this format may exist in mixed
orders, requiring special decoders to be constructed that optimize spatial rendering.
When it comes to decoding ambisonics for different speaker layouts, there may be
advantages to active or hybrid decoders. Irregular layouts are harder to decode for
than regular ones. Binaural rendering is a popular means of reproducing ambisonic
content, and there is some evidence that individualized HRTF processing is useful
in this context. Ambisonic information can also be data compressed, and there may
be advantages to doing this after some form of signal decomposition has been done,
which takes advantage of interchannel redundancy.

Ambisonics is a means by which spatial audio sound fields can be represented in the form of a hierarchy of directional components. It has become increasingly important in recent years, resulting in a number of papers and e-Briefs at AES conventions, and a number of these are summarized in this article. First developed primarily in the 1970s, ambisonic principles were manifested in a relatively small number of systems and products for many years, principally the Soundfield microphone. Like a number of other inventions that have emerged over the years, it was perhaps “before its time,” and remained mainly of interest in academic and research circles. Fascinatingly, though, its time as a mainstream technology seems now to have come, brought about by the need for an intermediate method of representation for 3D audio that can be rendered to a variety of different speaker or headphone configurations. This is perhaps not so surprising, as 3D audio formats have proliferated in the recent past, and the growing interest in interactive or immersive media has generated a need for things like rotatable sound fields and a compact structure for capturing, storing, and transferring 3D audio alongside 360 video and virtual reality content. The MPEG-H audio coding standard, for example, incorporates a mode known as scene-based audio (SBA) for transmitting spatial audio as ambisonic components.

The first manifestations of ambisonics mainly used what is known as first-order representation (B-format), whereby the sound field is represented as a pressure (omnidirectional) component, plus three orthogonal velocity components (like figure-eight microphone patterns pointing front-back, left-right, and up-down). This gives relatively low spatial resolution, and later manifestations have increased the order of representation by including further “spatial harmonics” that have tighter directional patterns. These formats are typically known under the generic heading “higher-order ambisonics” or HOA. Increasing the order leads to a larger number of channels needed for storage or transmission, and recent ambisonic streaming applications typically use 2nd or 3rd order representation, needing 9 or 16 channels respectively. There can be diminishing returns, perceptually, as one increases the order above this, with the suggestion that there is limited value in using orders above 5th for typical applications.

Encoded content can be rendered for reproduction using decoding algorithms designed for almost any loudspeaker layout, as long as the locations of the loudspeakers are known, although irregular layouts are more challenging than regular ones. It can also be decoded in the form of two-channel binaural signals suitable for headphone listening. Ambisonic representation lends itself particularly well to interactive contexts where the listener’s head may be rotated or tilted, as the decoding (rendering) algorithms can easily be adapted.

OPTIMIZED DECODERS FOR MIXED-ORDER AMBISONICS

In Paper 10507 (“Optimized Decoders for Mixed-Order Ambisonics”) from the 150th Convention, Aaron Heller, Eric Benjamin, and Fernando Lopez-Lezcano discuss decoding in systems where the vertical order is lower than the horizontal order. They use a form of shorthand from

696 J. Audio Eng. Soc., Vol. 69, No. 9, 2021 September



Fig. 1. “The Stage” at CCRMA (Center for Computer Research in Music and Acoustics)—a 56-loudspeaker array for spatial audio rendering.
(Courtesy of Heller, Benjamin, and Lopez-Lezcano)

Chris Travis to describe mixed-order systems, giving the example that 3H2V means third-order horizontal, second-order vertical. Such mixed-order systems will have different spatial resolution in some directions to others. The authors point out that mixed-order microphones are available, and also that in domestic settings it’s easier to deploy, say, eight loudspeakers in a horizontal array than it is to have a lot of elevated loudspeakers. Most real-world applications involve arrays that are either irregular or incomplete, it’s suggested. Approximately hemispherical arrays are one example, the missing bottom half of a full sphere leading to increasing localization errors at or below the horizon. In arrays that have more horizontal than elevated speakers the spatial accuracy will inevitably be better in this plane than it is for height cues.

A useful explanation of the task of an ambisonic decoder, given here, is “to create the best perceptual impression possible that the sound field is being reproduced accurately, given the available resources.” Historically, various criteria have needed to be optimized in ambisonic decoding, involving amplitude and energy gain in different directions, and velocity- and energy-based localization vectors at low and high frequencies respectively. Some of these are said to be more important than others, and it is important to avoid conflicting perceptual cues. It’s pointed out that a number of rendering systems deal with mixed-order signals by using full-order decoders and just leaving out the missing channels, but decoders designed specifically for mixed-order signals seem always to outperform this rather crude approach in terms of spatial accuracy in the rendered directions.

In this paper the emphasis is on a new implementation of the Ambisonic Decoder Toolbox (ADT), previously introduced by the first author, and available online at https://bitbucket.org/ambidecodertoolbox/adt.git. This new version is said to include a fast and robust nonlinear optimizer and a new design procedure for dual-band decoders. The high-frequency performance is optimized first, then the low-frequency performance is optimized to match it. The solution described here involves designing a mixed-order decoder for a spherical 240-loudspeaker array. Apparently the choice of a spherical array design creates a well-behaved optimization problem and an optimal decoder matrix. The high-frequency energy vector (rE, which “points” in the direction that a high-frequency source would be perceived) is then computed over 5200 virtual speaker points on a sphere, and values at corresponding locations to those points are used as target values for real loudspeaker positions. The low-frequency velocity vector (rV, which “points” in the direction that a low-frequency source would be perceived) is then optimized such that it points the same way as rE at every spherical point, and has a magnitude of 1.

One of the interesting outcomes of the work is a visualization tool that helps designers decide which of an available array of loudspeakers should be used for horizontal and height layers if parameters are to be optimized. The authors applied the tool to both a 56-loudspeaker professional installation (Fig. 1) and a 13-loudspeaker domestic “dome” with good results, at least from a modelling point of view, and an initial one-listener listening test.

BINAURAL AMBISONIC RENDERING

To give a bit of background, back in 2017 Joseph Tylka and Edgar Choueiri presented “A Toolkit for Customizing the ambiX Ambisonics-to-Binaural Renderer” in e-Brief 403. The toolkit described by this paper’s authors was based around a set of ambisonic decoding plug-ins developed by Matthias Kronlachner known as ambiX plug-ins (http://www.matthiaskronlachner.com/?p=2015). ambiX is a channel ordering convention used for ambisonic recording exchange, and the ambiX plug-ins are a set of ambisonic decoding plug-ins, which include a binaural renderer. Tylka and Choueiri’s contribution was essentially an open source collection of MATLAB functions known as the SABRE toolkit (https://github.com/PrincetonUniversity/3D3A-SABRE-Toolkit), which facilitated the creation of customized or personalized ambisonic-to-binaural decoding presets, using SOFA-formatted head-related transfer functions (HRTFs).

HRTFs represent the unique effects of one’s head, torso, and outer ear on the acoustic cues processed by the auditory system and are the principal means by which that system decodes the directions of sound sources. Every direction has a different effect on the frequency response and time delays of binaural signals. HRTFs are often represented in their time-domain equivalent form, the head-related impulse response, or HRIR. SOFA is an increasingly popular standard format for distributing individual HRTFs, standing for “spatially-oriented format for acoustics.” Tylka and Choueiri explained that there are more comprehensive tools than theirs for creating ambisonic decoders, such as the ADT mentioned in the previous section. For this reason they said that one application of their toolkit was simply to add custom HRTFs to existing ambiX decoding presets. This needed the user to specify loudspeaker locations because the information about those is apparently not explicitly included in ambiX decoding presets.
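
To make the idea of an ambisonic-to-binaural decoding preset a little more concrete, the following is a minimal sketch of the general virtual-loudspeaker approach: ambisonic channels are decoded to loudspeaker feeds, and each feed is convolved with the left-ear and right-ear HRIR for that loudspeaker's direction. It is not the SABRE or ambiX implementation. The function names, the two-loudspeaker decoder matrix, and the HRIR values are all hypothetical and chosen only so the arithmetic is easy to follow; it also assumes every HRIR has the same length.

```python
from typing import List, Sequence, Tuple

Signal = List[float]


def convolve(x: Sequence[float], h: Sequence[float]) -> Signal:
    """Direct-form linear convolution of two finite sequences."""
    if not x or not h:
        return []
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def decode_to_speakers(decoder: Sequence[Sequence[float]],
                       ambi: Sequence[Sequence[float]]) -> List[Signal]:
    """Apply a decoder matrix (rows = loudspeakers, columns = ambisonic
    channels) to ambisonic signals given as one sample list per channel."""
    n_samples = len(ambi[0])
    return [
        [sum(gain * channel[t] for gain, channel in zip(row, ambi))
         for t in range(n_samples)]
        for row in decoder
    ]


def render_binaural(decoder: Sequence[Sequence[float]],
                    ambi: Sequence[Sequence[float]],
                    hrirs: Sequence[Tuple[Sequence[float], Sequence[float]]]
                    ) -> Tuple[Signal, Signal]:
    """Sum the HRIR-filtered virtual loudspeaker feeds into two ear signals.
    hrirs holds one (left_ear, right_ear) pair per loudspeaker; all HRIRs
    are assumed to have the same length."""
    feeds = decode_to_speakers(decoder, ambi)
    n_out = len(ambi[0]) + len(hrirs[0][0]) - 1
    left = [0.0] * n_out
    right = [0.0] * n_out
    for feed, (h_left, h_right) in zip(feeds, hrirs):
        for t, v in enumerate(convolve(feed, h_left)):
            left[t] += v
        for t, v in enumerate(convolve(feed, h_right)):
            right[t] += v
    return left, right


# Hypothetical toy data: two channels (W, Y) and two virtual loudspeakers
# at hard left and hard right. Not a properly designed decoder.
TOY_DECODER = [[0.5, 0.5],    # left loudspeaker
               [0.5, -0.5]]   # right loudspeaker
# Toy HRIRs: near ear gets the sound directly, far ear later and quieter.
TOY_HRIRS = [([1.0, 0.0], [0.0, 0.5]),   # left loudspeaker
             ([0.0, 0.5], [1.0, 0.0])]   # right loudspeaker

if __name__ == "__main__":
    impulse_from_left = [[1.0, 0.0, 0.0],   # W
                         [1.0, 0.0, 0.0]]   # Y
    left_ear, right_ear = render_binaural(TOY_DECODER, impulse_from_left, TOY_HRIRS)
    print("left ear: ", left_ear)
    print("right ear:", right_ear)
```

In a real preset the decoder matrix would come from a decoder design tool and the HRIRs from measured or modelled data, for example a SOFA file, which is why the loudspeaker directions have to be known in order to pick the matching HRIRs.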


HRTFs included in SOFA files can be represented as HRIRs gathered on a grid of measuring locations. It is possible that these may need to be interpolated if measured values are not available at desired grid positions, so the toolkit had various means of doing this, including nearest neighbor, time domain, and frequency domain methods. There was an option not to interpolate when a grid point in the SOFA file was close enough to a desired location, according to a specified threshold value. Means were also provided in the toolkit to apply headphone equalization if needed, in order to compensate for specific headphone responses that might otherwise interfere with optimal binaural reproduction. For example, some headphones are equalized to the diffuse-field standard developed some years back, whereas others may employ free-field equalization or indeed none at all.

INDIVIDUALIZED BINAURAL RENDERER FOR HOA

More recently at the 150th Convention, Mengfan Zhang and colleagues from China published a paper on an “Individualized HRTF-Based Binaural Renderer for Higher-Order Ambisonics” (Paper 10454). They described a system whereby HOA signals could be rendered for headphones with better results using an individual binaural renderer than using a generic one. This was based on their previous work which emphasised the modelling of HRTFs, rather than the measurement of them, partly because measuring people’s individual HRTFs is difficult. A novel feature is the extension of individual HRTFs generated in just a few directions to rendering in any direction.

As shown in Fig. 2, a typical binaural ambisonic renderer takes ambisonic components (spherical harmonics), decodes them to loudspeaker feeds for a particular layout, and then uses HRTF processing to filter the loudspeaker signals from different directions, generating left and right ear signals. As the authors explain, using non-individual HRTFs can give rise to perception errors such as in-head localization, front-back confusion or poor discrimination of elevation cues. Individual HRTF modelling is therefore done according to the scheme shown in Fig. 3. Here HRTFs from a database based on the KEMAR head, known as CIPIC, are decomposed using a process known as SPCA (spatial principal component analysis). Eight different anthropometric parameters, including head dimensions, shoulder width, and various outer ear measurements, are chosen to represent the main physical factors that affect HRTFs in practice. These are captured from photographs of individuals, and the parameters are used to estimate new SPC weights for these individuals (new people, whose HRTFs were not contained in the original database). Deep neural networks (DNNs) are trained to predict new values of these in arbitrary spatial directions, which are then converted back into HRTF magnitude spectra. Related interaural time differences (ITDs) are also computed using a similar approach, also possible in arbitrary spatial directions. Binaural HRIRs are then reconstructed from minimum phase monophonic magnitude spectra, based on modelled ITDs.

Fig. 2. A typical binaural ambisonic renderer takes ambisonic components, decodes them to loudspeaker feeds for a particular layout, and then uses HRTF processing to filter the loudspeaker signals from different directions, generating left and right ear signals. (Figs 2–4 courtesy of Zhang et al.)

Fig. 3. Framework for modelling of individual HRTFs using spatial principal components.

In order to evaluate the approach, the authors conducted a listening experiment. 3rd-order HOA signals of source material were decoded to 20 virtual loudspeakers at locations corresponding to the vertices of a regular dodecahedron, as shown in Fig. 4. The virtual loudspeaker signals were then each convolved with the HRIR of the nearest spatial direction in the CIPIC database, for the “generic” renderer, or the individual HRIR created by the modelling process described above for the individual renderers. Source material for the tests was simply a train of 250 ms bursts of windowed noise panned ambisonically so as to appear to come from different directions. As far as can be determined from the paper, head tracking was not employed in the rendering process, so the rendered signals would not have been affected by listener head movements. Listeners were expected to use a graphical interface to indicate the spatial locations of the noise bursts. Results suggested that front-back confusion was significantly lower for the individualized rendering than for generic rendering.

Fig. 4. 20 vertices of a regular dodecahedron used as virtual loudspeaker locations for ambisonic decoding.

AMBISONIC DECODER TEST METHODOLOGY

Virtual loudspeakers and binaural rendering were also used in work described by Enda Bates, William David, and Daniel Dempsey, in Paper 10457 (“Ambisonic Decoder Test Methodologies Based on Binaural Reproduction”). The authors point out that there is an increasingly large number of ways to decode ambisonic source material, including passive approaches, parametric methods based on time-frequency analysis, and hybrid approaches. It’s difficult, they say, to


compare ambisonic decoders, as it is to compare spatial recording techniques, without a clear “ground truth” reference against which to compare the results. Mostly people have assumed either that expert subjects have a good internal reference (memorized), against which to compare what they hear, or ratings have been made on specific spatial attributes such as source width. For the tests described in this paper, it seems, the authors chose to use eight different attributes for evaluation of three ambisonic decoding strategies. A web-based multiple comparison paradigm was employed, enabling stimuli to be compared against a reference and high-low anchor signals. Two different loudspeaker layouts were used, one a cube and the other a typical multichannel “cinematic” 11.1 (7.1.4) layout. A cube layout is potentially easier to decode for, owing to its symmetry, but has relatively few loudspeaker locations in the horizontal plane. The 7.1.4 layout has more loudspeakers in the horizontal plane but is irregular in its layout. Source material consisted of both recorded scenes (using first-order ambisonic microphones) and object-based signals with sources panned using an ambisonic panner. The ambisonic source material was rendered binaurally by convolving the loudspeaker feeds, decoded using the three different decoding algorithms to be compared, with HRIRs relating to the loudspeaker locations taken from a database known as SADIE II. Reference scenes were created using direct binaural recordings, or direct binaural rendering of object sources, rather than using ambisonic pickup/panning and decoding as an intermediate step. A form of equalization was used to compensate for timbral differences between binaural responses used in the processes involved. Fig. 5 shows the process involved for both types of scene capture.

Fig. 5. Process for creating test and reference stimuli to compare ambisonic decoders. (a) Object-based tests; (b) recording-based tests. (Courtesy of Bates, David and Dempsey)

The results suggested, among other things, that hybrid and active decoding methods could outperform traditional passive ambisonic decoding for some source location-related attributes, but only when the loudspeaker layout was well matched to the decoding method. No significant differences could be detected between the results for the three decoders for more general attributes such as timbre, spatial impression or naturalness/presence. The cube loudspeaker layout, though, was rated higher in most cases for these than the 7.1.4 layout, and overall it seemed there was a trend towards better results with the cube layout. It was noted that improvements to the testing approach could include headphone equalization, head tracking and personalized HRIRs.

DATA COMPRESSION OF HOA SIGNALS

Xu et al. explain that HOA signals need a lot of channels to deliver high spatial resolution, so it is worth considering data compression for efficient storage and transmission (“Higher Order Ambisonics Compression Method Based on Independent Component Analysis,” Paper 10456). Of course independent data compression could be employed on each mono channel, but this would not take advantage of redundancy between channels. An alternative approach may use a form of principal component analysis to extract predominant components from a group of channels, encoding these with conventional mono encoders, then the residuals (background components) are encoded as ambisonic signals at a lower bit rate. This is the basis of the so-called singular value decomposition (SVD) method given as an example in the MPEG-H standard. It’s pointed out that the predominant components extracted using such a method may not correspond directly to the sound sources, and the components are calculated on a frame-by-frame basis, which can result in abrupt transitions. Consequently, the authors propose an alternative based on independent component analysis (ICA), a method of signal decomposition widely used in blind source separation applications. In


such a case, they say, the primary components extracted are directly related to sound sources, while the side information contains background spatial information. The ICA algorithm thus employed works to “unmix” independent sources from an HOA signal matrix. These foreground components form one element of the output, while the spatial characteristics, in the form of a mixing matrix, form the side information. Residuals left after this separation are regarded as Gaussian (noise-like), and treated as a background stream that can be transmitted at a lower bit rate, or truncated to a lower order. Foreground and background streams are then encoded conventionally using an entropy encoder. The overall structure of encoding and decoding using such an approach is shown in Fig. 6.

Fig. 6. (a) Encoder and (b) decoder block diagram for low bit-rate ambisonic encoder based on independent component analysis. (Courtesy of Xu et al.)

In order to evaluate the success of this approach Xu and his colleagues compared three different methods of data compression on 4th order ambisonic source material (25 channels), aiming for broadly similar total bit rates around 200 and 400 kbit/s for all three methods. The first compression method encoded each of the 25 ambisonic channels separately, at either 8 or 16 kbit/s per channel, using an EVS mono encoder. The second method used SVD (the MPEG-H example) to generate eight predominant components, each encoded by an EVS mono encoder at 24 or 48 kbit/s, plus 3 kbit/s for side information. The third method was the new method proposed in this paper, again resulting in eight components, encoded the same way as in the second method. In both of the cases involving signal decomposition the background information was omitted. Compressed signals were decoded back to HOA signals and rendered for two-channel headphone listening using a binaural renderer.

The source material used for these tests was either simulated from mono sources panned in different directions, or recorded directly, and consisted of a variety of program types such as singing, speaking, and instrumental music. There was also a helicopter recording, some moving instruments, and a noisy room. Based on a multiple stimulus paradigm, measuring overall sound quality and spatial characteristics (presumably combined into a single value), employing a hidden reference and low anchors (low-pass filtered and mono), 12 listeners compared the results of the different coding approaches and bit rates. The new approach was rated significantly higher than the other two, at both bit rates, providing some initial support for its effectiveness. The authors acknowledged, nonetheless, that they should still evaluate the performance on more critical test material such as very tonal or transient signals, or those that do not contain predominant sounds.

SUMMARY

Ambisonic representation provides a relatively versatile and compact way of storing or transmitting spatial audio scenes. It has become more popular in recent years, owing to its potential to be rendered for a variety of different reproduction formats, and for easy spatial manipulation of sound scenes.

Signals in this format may exist in mixed orders, requiring special decoders to be constructed that optimize spatial rendering. When it comes to decoding ambisonics for different speaker layouts, it seems that there may be advantages to certain active or hybrid decoders, but only if the decoder type is well matched to the speaker layout in question. Irregular layouts are harder to decode for than regular ones. On the other hand, binaural rendering is a popular means of reproducing ambisonic content, often by virtualizing decoded loudspeaker feeds using HRTF processing. There is some evidence that individualized HRTF processing is useful in this context.

Finally, ambisonic information can be data compressed, and there may be advantages to doing this after some form of signal decomposition has been done, which takes advantage of interchannel redundancy. The jury is still out, to some extent, on the best method of decomposing ambisonic components into foreground and background elements, but preliminary evidence suggests some benefits from a method based on independent component analysis.

Editor’s note: All the papers referred to in this article can be found in the AES E-Library at http://www.aes.org/e-lib/. AES members get free access to the E-Library.
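
As a quick check on the figures quoted in this article, the channel counts and approximate bit rates can be reproduced with a few lines of arithmetic. The sketch below uses the standard result that a full-sphere ambisonic signal of order N has (N + 1)² channels. The function names are illustrative only, and it assumes that the 3 kbit/s of side information mentioned in the compression tests is a single total per stream, which the article does not state explicitly.

```python
def hoa_channels(order: int) -> int:
    """Channel count for a full-sphere ambisonic signal of the given order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2


def per_channel_total(order: int, kbps_per_channel: float) -> float:
    """Total bit rate when every ambisonic channel is coded separately."""
    return hoa_channels(order) * kbps_per_channel


def decomposed_total(n_components: int, kbps_per_component: float,
                     side_info_kbps: float = 3.0) -> float:
    """Total bit rate for a decomposition-based coder.
    Assumption: side_info_kbps is one total for the stream, not per component."""
    return n_components * kbps_per_component + side_info_kbps


if __name__ == "__main__":
    for order in (1, 2, 3, 4):
        print(f"order {order}: {hoa_channels(order)} channels")
    print("separate coding, 4th order:",
          per_channel_total(4, 8), "and", per_channel_total(4, 16), "kbit/s")
    print("eight components plus side info:",
          decomposed_total(8, 24), "and", decomposed_total(8, 48), "kbit/s")
```

Under these assumptions the separately coded case comes to 200 and 400 kbit/s and the eight-component cases to 195 and 387 kbit/s, which is consistent with the “broadly similar” totals described for the listening tests.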
