
Audio Engineering Society

Convention Paper 9148


Presented at the 137th Convention
2014 October 9–12 Los Angeles, USA

This Convention paper was selected based on a submitted abstract and 750-word precis that have been peer reviewed by at least
two qualified anonymous reviewers. The complete manuscript was not peer reviewed. This convention paper has been reproduced
from the author’s advance manuscript without editing, corrections, or consideration by the Review Board. The AES takes no
responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society,
60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper,
or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

An object-based audio system for interactive broadcasting
Robert Oldfield¹, Ben Shirley¹, Jens Spille²
¹ University of Salford, Salford, M5 4WT, UK
² Technicolor, Research and Innovation, Hannover, Germany

Correspondence should be addressed to Robert Oldfield (r.g.oldfield@salford.ac.uk)

ABSTRACT
This paper describes audio recording, delivery and rendering for an end-to-end broadcast system allowing users free navigation of panoramic video content with matching interactive audio. The system is based on one developed as part of the EU FP7 funded project, FascinatE. The premise of the system was to allow users free navigation of an ultra-high definition 180 degree video panorama for a customisable viewing experience. From an audio perspective the complete audio scene is recorded and broadcast so the rendered sound scene at the user end may be customised to match the viewpoint. The approach described here uses an object-based audio paradigm. This paper presents an overview of the system and describes how such an approach is useful for facilitating an interactive broadcast.

1. INTRODUCTION

Over the past decade, there has been a rapid change in the way that audiovisual (AV) content is consumed. Not only has the number and type of devices changed (varying from mobile devices to much larger systems) but there is now a far greater appetite for an individual customisable experience augmented by internet-based content and second screen applications. Despite this shift, AV media broadcast and production systems have not changed significantly to keep up. It is apparent therefore that there is a need for a change in paradigm from a traditional, channel-based approach to one that can facilitate multiple hardware setups without the need for many different delivery networks. One such approach is the format-agnostic delivery method [1] as utilised by the FascinatE project [2]. One of the key challenges of this is capturing the entire scene rather than simply recording a channel-based representation of the scene at a point in space. The premise is to shift the decision as to how the content is consumed as close as possible to the user end rather than the production end, as has been the traditional method.

In essence, this provides the user with all the necessary information about the recorded scene such that they can customise their experience in their specific environment.

From an audio perspective, this presents some new challenges with respect to the recording, broadcast and reproduction of the scene. Standard channel-based recording techniques can give way to an object-based approach where much more information about the original scene is retained right through to the user end. This provides the user with all the necessary audio components to recompile the sound scene based on their viewing perspective or preferences.

This paper describes the approach developed as part of the FascinatE Project to facilitate such an audio system that allows both an interactive and immersive experience. The paper begins with an overview of object-based audio and then proceeds with a description of the FascinatE Project and of the audio system developed as part of the project. The paper then continues to discuss some specifics of the scene capture, delivery and reproduction techniques used and finishes with some concluding remarks and further work.

2. OBJECT-BASED AUDIO

There are many different ways of representing a sound scene, with varying degrees of accuracy depending on the required spatial resolution and whether the aim is a perceptual approximation or a mathematically exact representation. Channel-based systems such as two-channel stereophony, 5.1, 7.1 etc. essentially sample the sound scene at a discrete location or attempt to synthesise an impression of that sound scene by delivering specific audio content to the available loudspeaker channels/target format. Other techniques, which can be considered transformation-based systems [3], aim at utilising a mathematical representation of the sound scene such as a spatially orthogonal basis function, e.g. [4]. In this case transformation coefficients are transmitted rather than loudspeaker signals and these are then decoded at the user end to the specific rendering system.

A more transparent method of representing a sound scene is to utilise an object-based approach where each sound source in the scene is recorded separately along with its position in space and some associated metadata describing other source characteristics. Such an object-based approach records and transmits all of the audio ingredients needed to reassemble the audio scene at the user end. As all of the positions of the sources are known, it is possible to rotate the scene and even change the position and level of the sound sources such that a complete customisation of the audio can be facilitated at the user end. It is also possible to include different processing algorithms that may be applied to individual audio objects such as compression, filtering and effects.

Object-based audio (OBA) has therefore been considered by many to be the future of spatial audio and there have been some suggestions on how to represent object-based audio scenes, such as MPEG-4 AudioBIFS [5], Spatial Audio Object Coding (SAOC) [6] and the Audio Scene Description Format (ASDF) [7]. OBA enables reproduction on any loudspeaker system providing the relevant decoders are available on the user-end computer to render the sound scene on the user's audio setup. OBA is beginning to gather momentum on a commercial level too, with the implementation of Dolby Atmos [8] and the DTS Multi-Dimensional Audio (MDA) open source format being two current examples.

Some non-spatial applications of OBA have also been proposed; BBC research has implemented object-based audio in a test radio broadcast which used audio objects to tailor the radio programme depending on the listener's geographical location [9]. OBA allowed specific audio events that made up the programme, such as sound effects, actors' voices and music, to be customised based on geographical location and the date and time of access to the radio programme. The programme was delivered over IP and used the HTML5 standard to carry out all audio processing and mixing at the user end. Another use of audio objects that has been proposed by the BBC was for users to be able to change the duration of a programme by adjusting the spaces between audio events, without any need for time stretching or other processing that may be detrimental to audio quality or intelligibility. Additionally, OBA also enables remixing of the relative levels between objects in the scene, such as between commentary and different areas of the crowd in a football match [10]. Other possibilities exist to utilise this approach to facilitate improvements to speech intelligibility for people with hearing impairments [11].
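To make the remixing idea above concrete, the following minimal Python sketch shows how a receiver could sum a set of transmitted audio objects with per-object gains chosen by the listener, for example raising commentary relative to the crowd to aid intelligibility. The object names and gain values are hypothetical; this is only an illustration of the principle, not code from any of the systems cited.

import numpy as np

def mix_objects(objects, gains, num_samples):
    """Sum mono audio objects into a single feed after applying per-object gains.

    objects: dict mapping object name -> 1-D numpy array of samples
    gains:   dict mapping object name -> linear gain chosen by the listener
    """
    mix = np.zeros(num_samples)
    for name, signal in objects.items():
        n = min(len(signal), num_samples)
        mix[:n] += gains.get(name, 1.0) * signal[:n]
    return mix

# Hypothetical example: boost the commentary and pull the crowd back.
fs = 48000
objects = {
    "commentary": np.random.randn(fs),    # stand-ins for real object feeds
    "crowd_north": np.random.randn(fs),
    "crowd_south": np.random.randn(fs),
}
user_gains = {"commentary": 1.5, "crowd_north": 0.5, "crowd_south": 0.5}
mix = mix_objects(objects, user_gains, fs)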


2.1. FascinatE Project

The recently completed FascinatE Project [2] was a European research project aiming at the development of a complete end-to-end future broadcast system designed to be format agnostic and interactive, based on user navigation of an ultra-high definition panorama with accompanying 3D audio. FascinatE stands for Format-Agnostic SCript-based INterAcTive Experience and brought together 11 institutions from 8 different European countries. Fundamental to the system developed was a format-agnostic approach such that only one set of capture and broadcast infrastructure was needed for all different potential use-cases of the system, i.e. ranging from an individual watching on a mobile device and listening on headphones right through to someone watching the broadcast in a public setting with a large scale wrap-around screen and a large multi-channel immersive audio system.

In the FascinatE Project, object-based audio was utilised as a means to provide dynamically matching audio for interactive navigation through an AV scene. The project captured a very high definition video panorama of 7K resolution (approx. 7K x 2K pixels) utilising Fraunhofer HHI's OMNICAM [12] and allowed pan, tilt and zoom navigation of the panorama by the user. In order to provide matching audio for the user-defined scene it was necessary to move away from a conventional channel-based audio paradigm. Instead a hybrid object-based and transformation-based approach was adopted to capture the audio scene without reference to any specific target loudspeaker configuration. Instead of defining captured audio events as emanating from a given loudspeaker or from between two loudspeakers of a target reproduction system, sound events were captured complete with 3D coordinate information specifying where in the audio scene the event had taken place. Additionally the spatial sound field was also captured using Higher Order Ambisonics to provide the ambience/background. This is analogous to a gaming audio scenario and, in the same way as a first person game allows navigation around and between audio objects, the object-based audio capture enabled users to pan around and zoom into the AV scene with audio events remaining in their correct locations; similarly, the sound field can also be rotated and manipulated corresponding to user navigation. It was possible in the FascinatE system to zoom across a scene and past audio objects, which would then move behind the user's viewpoint, thus realising a realistic audio scene to match the chosen visual viewpoint.

3. SYSTEM OVERVIEW

The basis of the FascinatE audio system is an object-based paradigm with a Higher Order Ambisonics [13, 14] base. The advantage of adopting such an approach is that the audio objects can be moved according to the customised visuals and the sound field component can also be rotated to match the correct viewing perspective so the final output is spatially coherent.

As shown in Figure 1, the FascinatE system comprised three components: audio scene extractor/capture, audio composer and audio presenter modules. The scene extractor was responsible for the capture of the sound field component and the extraction of the audio objects from the scene. Information was gathered at this stage and metadata generated as described later. The Audio Composer (AC) takes this information and composes an audio scene based on global preferences, scripting information and individual scene navigation decisions etc. such that the audio perspective matches as closely as possible with the visual perspective. Once the scene has been composed, the Audio Presenter (AP) module receives information on the user's audio system and decodes the sound scene for reproduction accordingly; thus the system is format agnostic, allowing replay on any output system providing the necessary decoders are installed on the user's system.

4. SCENE CAPTURE

To capture the entire audio scene, it is important that each discrete sound source (audio object) is separated and positioned accurately in space and also that the sound field component (providing a spatially accurate background/ambience) is recorded correctly and to an appropriate degree of accuracy. In some cases it is also desirable to capture so-called diffuse audio objects, which can be useful for representing features such as the late reflection energy in a room's impulse response; this is subject to current research activity and not covered in any more detail in this contribution.

4.1. Sound Field Component

Microphone arrays like the SoundField® or Eigenmike® are used to capture the entire three-dimensional sound field at the microphone array position. The Ambisonics or Higher Order Ambisonics (HOA) representation is used to preserve the 3D acoustic/audio scene.


Figure 1: Block diagram of the FascinatE audio system

In principle, Ambisonics uses spherical harmonic functions on the surface of a sphere to model the superposition of single audio sources distributing acoustic waves from different directions as a linear combination of orthogonal basis directions. The components in each basis direction are the spherical harmonics of one single direction. Normally, more than one source is encoded, therefore coefficients from different directions/sources are combined in the form of matrices. For each sample time, an encoder matrix maps all the sound source signals from different directions into a single vector whose components represent the Ambisonics coefficients of one audio source/direction at a specific time. So at each sample time a vector is derived containing all recorded source information encoded in its vector components. This is a format-agnostic approach because the Ambisonics representation is space invariant, thus it can be decoded such that the sound field can be reproduced on arbitrary loudspeaker configurations. An important parameter of the HOA description is the order of the spherical harmonic functions because it controls the number of coefficients, i.e., the accuracy/resolution of the sound field description: the more coefficients (the higher the order), the better the accuracy.

For example, an order of one can be achieved by the SoundField microphone and the Tetra Mic, which both utilise four microphone capsules. The first order representation provides the basic spatial information that compares to a traditional mid-side recording. The Eigenmike contains 32 microphone capsules on a rigid sphere and delivers HOA coefficients up to order four. As a consequence, the reproduction yields a much better spatialization of the content. The HOA representation is obtained in two steps. In the first step the capsule signals are matrixed to an intermediate representation describing the pressure distribution exactly on the surface of the sphere. The second step removes the impact of the capsules as well as the impact of the array arrangement to obtain the HOA coefficients of the free field. These impacts are described by the term microphone array response, which is basically a filter term [15].
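The encoding principle described above can be illustrated, at first order only, by the following Python sketch. It evaluates the real spherical harmonics for a source direction (here in the ACN channel order W, Y, Z, X with SN3D weighting, an assumed convention) and weights the source signal with them; summing several such contributions gives the combined coefficient vectors referred to in the text. This is a simplified illustration of the general idea, not the FascinatE encoder.

import numpy as np

def encode_first_order(signal, azimuth, elevation):
    """Encode a mono signal arriving from (azimuth, elevation), in radians,
    into first-order Ambisonics channels (ACN order W, Y, Z, X; SN3D weights)."""
    weights = np.array([
        1.0,                                  # W
        np.sin(azimuth) * np.cos(elevation),  # Y
        np.sin(elevation),                    # Z
        np.cos(azimuth) * np.cos(elevation),  # X
    ])
    return weights[:, None] * signal[None, :]  # shape: (4, num_samples)

def encode_scene(sources):
    """Sum the coefficients of several (signal, azimuth, elevation) sources,
    i.e. the combination of directions/sources described in the text."""
    return sum(encode_first_order(sig, az, el) for sig, az, el in sources)

# Hypothetical two-source scene: one source at the front, one 90 degrees to the left.
fs = 48000
t = np.arange(fs) / fs
scene = encode_scene([
    (np.sin(2 * np.pi * 440 * t), 0.0, 0.0),
    (np.sin(2 * np.pi * 220 * t), np.pi / 2, 0.0),
])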


4.2. Audio Object Extractor

In order to have a more complete description of the sound scene it is desirable to also capture the audio objects in the scene. Once captured, these can be individually manipulated to provide a fully customisable sound scene at the user end (as described later). The principal task of the Audio Object Extractor (AOE) is to use the available microphones to pick out and separate the discrete sound sources in the scene. The AOE aims to find the audio content that is deemed salient for the given recording scenario and to locate the audio objects in space.

4.2.1. Audio Objects

These key, discrete audio events in the scene are described by audio objects. An audio object contains the audio content of a source with a specific location in time and space. An audio object therefore has audio data, a position, an onset/offset time and potentially some additional metadata such as source directivity, reverberation time etc. Generally speaking, in terms of recording, audio objects can be split into two categories depending on the nature of the audio capture techniques involved: so-called explicit audio objects and implicit audio objects.

Explicit Audio Objects

If a sound source can be closely miked with a tracked or stationary location, it can be classed as an explicit audio object. An example of an explicit audio object would be the violinist in an orchestra. In this case the sound source can be recorded with very little bleed from other audio sources and the position is generally stationary or at least can be tracked easily by local GPS-type tracking systems. Another example of an explicit audio object would be the feed from an interview or commentary; in this case the microphone feed is generally clean and the position of the object is not important.

Implicit Audio Objects

For more complex cases such as a football match, where it is not possible to use close microphones or tracking devices on audio sources and the available microphone feeds may contain many sources, the captured audio objects need to be derived/extracted from the available microphone signals. Audio objects of this type are said to be implicit audio objects. The techniques for the extraction of implicit audio objects are scenario dependent, although many of the techniques will work across different genres. An example of how implicit audio objects can be extracted in terms of their content and position for the on-pitch sounds at a football match can be found in [16]. In this case the available microphone feeds are analysed for content matching a given template (e.g. the characteristics of a ball-kick or whistle-blow); if a match is found, the content is cut out of the microphone feed and then positioned in space using time-difference-of-arrival techniques with information from the other microphones. The content and positional information is then encoded as an audio object.

The actions of the AOE (as depicted in Figures 2 and 3) are therefore slightly different for implicit audio objects as opposed to explicit audio objects. For explicit objects the position of the audio object is either static or is tracked, therefore the position localising stage is not present in the flow diagram as shown in Figure 2. The AOE ingests the raw audio feeds from the microphones; the search algorithm then looks for audio content that matches the target audio signature and extracts the content from the microphone feed accordingly. In the implicit case, where the position of the objects is not tracked, the extracted content for each active microphone channel is fed to the position extraction algorithm, which is based on a time-difference-of-arrival method to position the audio object as depicted in Figure 3; more information on this process can be found in [16].
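The position-extraction step is covered in detail in [16]; purely to illustrate the time-difference-of-arrival principle it relies on, the toy Python sketch below estimates the delay between two microphone feeds from the peak of their cross-correlation and converts it to a path-length difference. Several such differences across the microphone array constrain the object position. The function names and the simple peak-picking are illustrative assumptions, not the authors' algorithm.

import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate

def estimate_tdoa(mic_a, mic_b, fs):
    """Estimate the time difference of arrival (arrival time at mic_a minus
    arrival time at mic_b, in seconds) from the peak of the cross-correlation."""
    correlation = np.correlate(mic_a, mic_b, mode="full")
    lag = np.argmax(correlation) - (len(mic_b) - 1)  # lag in samples
    return lag / fs

def path_difference(mic_a, mic_b, fs):
    """Convert the TDOA into a difference in source-to-microphone distance;
    combining several such differences (e.g. by least squares) positions the object."""
    return estimate_tdoa(mic_a, mic_b, fs) * SPEED_OF_SOUND

# Hypothetical check: a click reaching microphone B 2 ms later than microphone A.
fs = 48000
click = np.zeros(fs); click[100] = 1.0
delayed = np.zeros(fs); delayed[100 + int(0.002 * fs)] = 1.0
print(path_difference(click, delayed, fs))  # ~ -0.69 m: the event reached mic A first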


Figure 2: Extracting explicit audio objects

Figure 3: Extracting implicit audio objects

4.3. Extracting metadata

A major element of the FascinatE system is the scripting engine, which enables the system to both run from and generate scripts in either a feedback or feed-forward sense. A script, for example, may provide shot framing information, pan, tilt, zoom decisions, regions of interest etc. and therefore allow the system to alter both the visual and audio output automatically based on the scripts. At each stage in the FascinatE system scripting information is generated, allowing easy scene segmentation and searching for specific attributes of the content. From an audio perspective, metadata is generated using the AOE to provide the wider system with information about the audio sources in the scene. This information can then be used to make automatic production choices such as zooming in on the referee once he's blown his whistle.

Principally the audio object extractor will provide metadata of audio object location, onset time and offset time, but output from the audio analysis can also provide event type (i.e. ball-kick, whistle-blow, players/managers talking, interview etc.) and other information about the audio content such as signal statistics; this information can be used to define a region of interest centred around a key event in the recording scene. For example the object extractor can provide information on when and where on the pitch the referee blew his whistle, which can be used to identify a region of interest. Results from the audio analysis engine can also state which area of the crowd is cheering the loudest at any point in time. This information could be used for defining a visual region of interest or allowing the user to choose to hear the crowd from that given location, if there are sufficient microphones in that area of the stadium. Figure 4 shows the data chain for the generation of audio metadata in the system. Information gained during the extraction process, such as audio object type, saliency measure, confidence level, position and the onset and offset times, is written to an MPEG-7 description file and written to a knowledge base for future content retrieval.


Figure 4: Data chain for audio metadata generation
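As an indication of what one entry in that data chain might carry, the sketch below gathers the fields named above (object type, saliency, confidence, position, onset and offset times) into a simple record and serialises it. The field names, values and the JSON serialisation are illustrative only; the project wrote this information to MPEG-7 description files, whose schema is not reproduced here.

from dataclasses import dataclass, asdict
import json

@dataclass
class AudioObjectMetadata:
    """Metadata produced by the Audio Object Extractor for one detected event."""
    object_type: str    # e.g. "whistle-blow", "ball-kick", "interview"
    saliency: float     # how important the event is judged to be
    confidence: float   # confidence of the detection, 0..1
    position: tuple     # (x, y, z) in metres relative to the scene origin
    onset_time: float   # seconds from the start of the programme
    offset_time: float  # seconds from the start of the programme

# Hypothetical example: a detected whistle-blow on the pitch.
whistle = AudioObjectMetadata(
    object_type="whistle-blow",
    saliency=0.9,
    confidence=0.85,
    position=(12.0, -3.5, 1.7),
    onset_time=1325.40,
    offset_time=1326.10,
)

# Such a record could then be serialised for the knowledge base / scripting engine.
print(json.dumps(asdict(whistle), indent=2))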

5. SCENE TRANSMISSION

Once the audio objects have been extracted and the sound field recorded and encoded appropriately, all the data can be broadcast together in a file format specifically designed to facilitate an object-based paradigm and deliver a fully customisable audio scene to the user.

5.1. Requirements for an audio transmission format

A recorded sound scene will most likely consist of several audio objects and some sound field descriptions (the number of both of which may vary with time). A suitable audio broadcast format will therefore be able to represent both aspects of the complete scene and will enable some metadata to also be included in the data stream. An example schematic of a possible file/streaming format is shown in Figure 5.

Figure 5: Relation between audio tracks and frames for an object-based streaming format

The stream is split into frames enabling the data to change over time, i.e. both audio objects and sound fields can become active and inactive at different points, thus only the necessary content of the scene is transmitted at any point in time. Each audio frame contains the frame payload (the audio content) and also a frame header which provides some metadata about the frame itself. Additionally a frame can be used as an access point, allowing random access to arbitrary points in time. Rather than transmitting the entire frame track by track (which would require a long buffer), each frame is split into smaller segments as shown in Figure 6.

Figure 6: Sequence of data packages per audio frame in an object-based streaming format

Each frame has a unique number and time stamp, which are all part of the frame header; this frame header also contains the sample rate, number of tracks, number of samples per segment and potentially some additional information such as room environment etc. Each track header will also contain a unique number and time stamp offset and will include information about the track type (i.e. sound field description or audio object). Also in the track header is positional information, orientation, directivity etc., allowing a dynamically varying scene with moving objects. For a sound field track, parameters like 2D/3D, the Ambisonics order, the orientation of the sound field, the rotation, the bit depths, the coefficient order and the normalization method used (e.g. “Furse-Malham weights”, “Schmidt Semi-Normalized” or 4π normalized etc.) should also be part of this header information.
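To make the header layout above more concrete, the following sketch groups the listed fields into simple structures. The field names, types and example values are illustrative assumptions rather than the actual FascinatE byte-level format.

from dataclasses import dataclass
from typing import Optional

@dataclass
class FrameHeader:
    """Per-frame metadata as described in the text."""
    frame_number: int
    time_stamp: float                        # seconds
    sample_rate: int                         # Hz
    num_tracks: int
    samples_per_segment: int
    room_environment: Optional[str] = None   # optional extra information

@dataclass
class TrackHeader:
    """Per-track metadata; a track is either a sound field description or an audio object."""
    track_number: int
    time_stamp_offset: float
    track_type: str                            # "sound_field" or "audio_object"
    position: tuple = (0.0, 0.0, 0.0)
    orientation: tuple = (0.0, 0.0, 0.0)
    directivity: Optional[str] = None
    # Sound-field-only parameters:
    ambisonics_order: Optional[int] = None
    is_3d: Optional[bool] = None
    coefficient_order: Optional[str] = None    # e.g. "ACN"
    normalisation: Optional[str] = None        # e.g. "SN3D", "Furse-Malham"
    bit_depth: Optional[int] = None

# Hypothetical headers for one frame carrying a fourth-order HOA bed and one object.
frame = FrameHeader(frame_number=0, time_stamp=0.0, sample_rate=48000,
                    num_tracks=2, samples_per_segment=1024)
hoa_bed = TrackHeader(track_number=0, time_stamp_offset=0.0, track_type="sound_field",
                      ambisonics_order=4, is_3d=True,
                      coefficient_order="ACN", normalisation="SN3D", bit_depth=24)
commentary = TrackHeader(track_number=1, time_stamp_offset=0.0, track_type="audio_object",
                         position=(0.0, 2.0, 1.5))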


5.2. Scene compression

Whilst the FascinatE Project explicitly did not focus on compression techniques, there has been some recent standardisation activity in this area by MPEG [17], which is currently looking at 3D audio scene compression. To reduce the amount of audio data to be transmitted, MPEG is working on MPEG-H 3D Audio, “High efficiency coding and media delivery in heterogeneous environments”. The first phase of this activity focused on bitrates of 256 kbit/s and above, while the second phase, for bitrates below 256 kbit/s, is just beginning.

MPEG-H 3D Audio can compress channel-based, object-based as well as scene-based (HOA) input formats, so is ideally suited to compression within the format mentioned here, as any mixture of audio objects, sound field descriptions or channel-based formats can be accommodated.

6. SCENE REPRODUCTION

The reproduction end of the FascinatE audio system (see Figure 1) consists of an Audio Composer (AC) and an Audio Presenter (AP) module. The Audio Composer receives Audio Objects (AO) and Sound Field Descriptions (SFD) and processes them, like a mixing desk would do in a studio, based on information from the User Control Node and the Video Rendering Node. The updated sound scene is then communicated to the Audio Presenter for rendering.

Figure 7: FascinatE Audio Renderer

6.1. Audio Composer

The Audio Composer positions the audio objects in the sound scene along with the recorded sound field descriptions, which are positioned in the scene at the location in which they were recorded. The Audio Composer is connected via a TCP/IP connection to the Video Rendering Node (VRN). The VRN communicates an updated pan, tilt and zoom with each video frame. The AC uses this data to update the positions of the audio objects in the sound scene and applies a rotation to the recorded SFDs.

The input and output formats for the Audio Composer are designed to be identical such that several ACs can be concatenated from the production to the user side. This allows the separate distribution of operations on AOs and SFDs throughout the whole chain. Furthermore, if non-linear effects in the audio processing are required, e.g. translational movement by a zooming operation, such operations can be differentiated for AOs and SFDs. A first AC could react to script inputs and could preselect AOs, as defined by a service provider; a second AC could be placed at the end-user terminal, controlled by the user inputs, to do the final positioning of the objects.

6.1.1. Adjusting the sound scene

The positions of the audio objects need to be updated by the Audio Composer to match the current viewing perspective. The origin of the sound scene is taken as the OMNICAM position and the relative positions of the audio objects are changed based on the pan, tilt and zoom information from the VRN. When a pan or tilt command is parsed by the AC, the received angle is added to the current azimuth or elevation angle of the audio objects respectively so they move appropriately in the reproduced sound field.

For a given pan angle it is also required that the ambient sound field be rotated in order to simulate the change in the viewing direction. This is done in the Audio Composer by multiplying the sound field by a corresponding rotation matrix. This is important for spatial coherence, as the sound field recording will contain information from some of the separately recorded audio objects and these need to be rendered to the same position as the corresponding audio objects in the reproduced sound field at the user end.

For zoom, the AC parses the zoom angle from the update XML message and changes the relative levels and positions of the audio objects accordingly. A level difference is applied between the sound field signals and the audio objects to match what would be expected in a real listening scenario (i.e. a level factor of 1/distance is applied based on the calculated listener distance from the given audio object or sound field). As a user zooms into the scene the angles of the audio objects will also change (they will increase up to the point that the object appears behind the listener). The degree to which this angle increases is a function of the screen size and the listener distance from the screen, thus some approximations have to be made to accommodate multiple listeners in the space. The AC therefore has a series of variables that allow the user to control the extent to which they would like to experience audio zoom, or disable the feature if they wish to do so.

It is also possible to apply zoom to a sound field, although this is a non-trivial task. However, various translations can be applied to the sound field, as described in [19] and [20], which approximate zoom. Practically, the application of zoom to the sound field component may depend both on user preferences and the recording scenario as described below.

Regardless of pan, tilt and zoom information, some audio objects should remain stationary, such as the commentary/interview feeds, and are therefore left static in the scene based on flags in the metadata. The individual levels of some of the audio objects can be manually altered as well, providing the listener with a basic mixing desk GUI in their rendering software that allows the control of some of the audio objects and the balance between background and foreground levels, which can be useful for increasing speech intelligibility for hearing impaired users [11].

After the specific sound fields have been appropriately rotated and the audio objects manipulated, we combine the component sound fields and audio objects into a resultant sound field and deliver this to the Audio Presenter for rendering. This is possible due to the linear character of the SFD. The resulting HOA description can be transmitted to the end user, which can render the sound field onto a specific loudspeaker setup.
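The pan and zoom adjustments described in this section can be illustrated with a short Python sketch: the pan angle is added to each (non-static) object azimuth, a rotation matrix is applied to the sound field, and a 1/distance level factor is applied for zoom. Only the simple first-order horizontal rotation is shown and the channel/normalisation conventions are assumed for illustration; the Audio Composer operates on full HOA descriptions and is not limited to this case.

import numpy as np

def pan_objects(object_azimuths, pan_angle):
    """Add the pan angle (radians) reported by the Video Rendering Node to the
    azimuth of every non-static audio object."""
    return [az + pan_angle for az in object_azimuths]

def rotate_first_order(hoa, pan_angle):
    """Rotate a first-order sound field (assumed ACN order W, Y, Z, X) about the
    vertical axis by pan_angle radians; W and Z are unaffected."""
    w, y, z, x = hoa
    cos_t, sin_t = np.cos(pan_angle), np.sin(pan_angle)
    return np.vstack([w,
                      x * sin_t + y * cos_t,   # new Y
                      z,
                      x * cos_t - y * sin_t])  # new X

def zoom_gain(listener_to_object_distance):
    """Level factor of 1/distance applied between objects and the sound field
    as the user zooms towards or away from an object."""
    return 1.0 / max(listener_to_object_distance, 0.1)  # avoid unbounded gain very close up

# Hypothetical use: a 30 degree pan to the left and an object now 5 m away after zooming.
hoa_bed = np.random.randn(4, 48000)           # stand-in for a recorded sound field
rotated = rotate_first_order(hoa_bed, np.deg2rad(30))
gain = zoom_gain(5.0)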


6.2. User Preferences

The way the audio is processed based on the navigation of the visual content depends very much on user preferences. With modern first person shooter games, people are used to the audio changing based on their every move. The relative directions of audio sources alter as they rotate their head, and sources get louder as they move towards them. However this is a completely different paradigm to that of current television broadcasts, where the audio content is traditionally very static, with discrete audio sources panned front central and only ambience coming from the rear channels. For the object-based system described in this paper a more dynamic scene is enabled, but there are open questions as to whether people enjoy a moving sound scene when they are watching television. Particular problems could arise with cuts between different camera angles, as this would introduce a drastic change in the audio scene which some viewers may find disturbing. Initial subjective assessments of the system presented here indicate that people show a preference for the rendering scenario where the audio objects and the sound field are rotated with pan angle as opposed to the static situation, but there was a split opinion as to whether audio zoom should be disabled in the system, as some listeners found the experience disorientating even when the zoom was implemented by simply changing the relative levels of the audio objects and the sound field components.

6.3. Audio Presenter

In the context of format-agnostic production, the audio presenter tool enables the rendering of both channel-oriented formats and sound field descriptions on a large variety of loudspeaker setups, from simple stereo systems up to larger, multi-channel, spatial loudspeaker setups. At the user end the final, manipulated HOA coefficients from the Audio Composer have to be mapped to the individual loudspeaker setup of the user. Therefore, the HOA decoder matrix performs an inverse operation to map these Ambisonics coefficients from all sources onto given loudspeaker positions by matrix multiplication, thus facilitating format agnostic playback. Furthermore, the Audio Presenter could map the sound to virtual speakers, which could then be mapped to headphones using Head Related Transfer Functions (HRTFs) for 3D listening.
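A minimal sketch of that decoding step for the first-order, horizontal-only case is given below: a re-encoding matrix is built from the loudspeaker directions and its pseudo-inverse maps the HOA signals to loudspeaker feeds (a mode-matching decoder). This is one common decoder design, assumed here purely for illustration; the Audio Presenter is not limited to this method, order or layout.

import numpy as np

def sh_first_order(azimuth, elevation=0.0):
    """Real first-order spherical harmonics (assumed ACN order W, Y, Z, X; SN3D)."""
    return np.array([1.0,
                     np.sin(azimuth) * np.cos(elevation),
                     np.sin(elevation),
                     np.cos(azimuth) * np.cos(elevation)])

def decode_matrix(speaker_azimuths):
    """Mode-matching decoder: pseudo-inverse of the matrix that re-encodes
    each loudspeaker direction into first-order Ambisonics."""
    Y = np.column_stack([sh_first_order(az) for az in speaker_azimuths])  # 4 x L
    return np.linalg.pinv(Y)                                              # L x 4

def render(hoa_signals, speaker_azimuths):
    """Map HOA signals (4 x samples) to loudspeaker feeds (L x samples)."""
    return decode_matrix(speaker_azimuths) @ hoa_signals

# Hypothetical quadraphonic layout at +/-45 and +/-135 degrees.
speakers = np.deg2rad([45, -45, 135, -135])
hoa = np.random.randn(4, 48000)   # stand-in for the composed sound field
feeds = render(hoa, speakers)     # one feed per loudspeaker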


7. FURTHER WORK

More work is required to fully determine the perceptual ramifications of dynamically changing the rendered sound scene at the user end and how people like the sound scene to change as the visuals change. Additional next steps are to look at how this format agnostic system could be applied to mobile devices in a real-world scenario where the bandwidth is limited, such that it may not be possible to deliver the complete sound field to the device. Also, a simple GUI needs to be developed to enable people to have some manual control over the sound field and to customise the output of the system to match their specific requirements/preferences.

8. CONCLUSIONS

This paper has presented a format agnostic audio system for interactive broadcast applications. The system comprises an object-based paradigm with a Higher Order Ambisonics base and is based on the system developed as part of the FascinatE Project. The paper has described the sound scene capture, transmission and rendering of the audio in the system and has described how the content can be manipulated to facilitate an interactive broadcast. Further work is required to fully assess the perceptual preferences of interactive audio and to develop a suitable GUI to allow manual changes to be made to the reproduced sound scene. Also, further developments will look into a reduced bandwidth system for interaction on mobile devices.

9. ACKNOWLEDGEMENT

The research leading to these results has received funding from the European Union's Seventh Framework Programme ([FP7/2007-2013]) under grant agreement no. [248138].

10. REFERENCES

[1] Oliver Schreer, Jean-François Macq, Omar Aziz Niamut, Javier Ruiz-Hidalgo, Ben Shirley, Georg Thallinger, and Graham Thomas. Media Production, Delivery and Interaction for Platform Independent Systems: Format-Agnostic Media. John Wiley & Sons, 2013.

[2] Joanneum Research, Technicolor, TNO, University of Salford, BBC, Softeco Sismat, Interactive Institute, Fraunhofer HHI, ARRI, and Universitat de Catalunya. FascinatE Project: EU FP7/2007-2013 under grant agreement no. 248138; http://www.fascinate-project.eu, 2010.

[3] Ben Shirley, Rob Oldfield, Frank Melchior, and Johann-Markus Batke. Platform Independent Audio. In Media Production, Delivery and Interaction for Platform Independent Systems: Format-Agnostic Media, pages 130–165, 2013.

[4] Mark A. Poletti. Three-dimensional surround sound systems based on spherical harmonics. Journal of the Audio Engineering Society, 53(11):1004–1025, 2005.

[5] Jürgen Schmidt and Ernst F. Schröder. New and Advanced Features for Audio Presentation in the MPEG-4 Standard. In 116th Conv. Audio Eng. Soc., Berlin, Germany, 2004.

[6] Jürgen Herre, Heiko Purnhagen, Jeroen Koppens, Oliver Hellmuth, Jonas Engdegård, Johannes Hilpert, Lars Villemoes, Leon Terentiv, Cornelia Falch, Andreas Hölzer, María Luis Valero, Barbara Resch, Harald Mundt, and Hyen-O Oh. MPEG Spatial Audio Object Coding – The ISO/MPEG Standard for Efficient Coding of Interactive Audio Scenes. J. Audio Eng. Soc., 60(9):655–673, 2012.

[7] Matthias Geier, Jens Ahrens, and Sascha Spors. Object-based Audio Reproduction and the Audio Scene Description Format. Organised Sound, 15(3):219–227, 2010.

[8] C. Q. Robinson, S. Mehta, and N. Tsingos. Scalable Format and Tools to Extend the Possibilities of Cinema Audio. SMPTE Motion Imaging Journal, 121(8):63–69, 2012.

[9] I. Forrester and A. Churnside. The creation of a perceptive audio drama. NEM Summit, 2012.

[10] Mark Mann, Anthony W. P. Churnside, Andrew Bonney, and Frank Melchior. Object-based audio applied to football broadcasts. In Proceedings of the 2013 ACM International Workshop on Immersive Media Experiences, pages 13–16. ACM, 2013.

[11] Benjamin Guy Shirley. Improving Television Sound for People with Hearing Impairments. PhD thesis, University of Salford, 2013.

[12] Oliver Schreer, Ingo Feldmann, Christian Weissig, Peter Kauff, and Ralf Schäfer. Ultrahigh-Resolution Panoramic Imaging for Format-Agnostic Video Production. Proceedings of the IEEE, 101(1):99–114, 2013.

[13] Jerome Daniel. Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia. PhD thesis, Université Paris, 2001.

[14] D. Malham. Space in Music – Music in Space. PhD thesis, University of York, UK, 2003.

[15] Sven Kordon, Alexander Krüger, Johann-Markus Batke, and Holger Kropp. Optimization of Spherical Microphone Array Recordings. In International Conference on Spatial Audio, Detmold, Germany, November 2011.

[16] Robert Oldfield, Ben Shirley, and Jens Spille. Object-based audio for interactive football broadcast. Multimedia Tools and Applications, pages 1–25, 2013. doi: 10.1007/s11042-013-1472-2.

[17] MPEG-H. State of the Art in compression and transmission of 3D Video: Part 3 – 3D Audio, ISO/IEC 23008, draft/open, http://mpeg.chiariglione.org/standards/mpeg-h/3d-audio.

[18] MPEG. ISO 14496-3 (MPEG-4 Audio) Final Committee Draft. MPEG Document W2203, 1998.

[19] Jens Ahrens. Analytic Methods of Sound Field Synthesis. Springer, Heidelberg, Germany, 2012.

[20] Nail A. Gumerov and Ramani Duraiswami. Fast Multipole Methods for the Helmholtz Equation in Three Dimensions. Elsevier, first edition, 2004.

